NYUTron is a large language model-based system developed to integrate in real time with clinical workflows centered around structured and unstructured notes and the placement of electronic orders. The development team queried electronic health records from all NYU Langone facilities to generate two types of datasets: pre-training datasets ("NYU Notes", "NYU Notes–Manhattan", "NYU Notes–Brooklyn"), which together contain 10 years of unlabelled inpatient clinical notes (387,144 patients, 4.1 billion words), and fine-tuning datasets covering five tasks ("NYU Readmission", "NYU Readmission–Manhattan", "NYU Readmission–Brooklyn", "NYU Mortality", "NYU Binned LOS", "NYU Insurance Denial", "NYU Binned Comorbidity"), each containing 1 to 10 years of inpatient clinical notes (55,791 to 413,845 patients, 51 to 87 million words) with task-specific labels (2 to 4 classes). In addition, the team used two publicly available datasets, i2b2-2012 and MIMIC-III, for testing and fine-tuning.

To assess the model's predictive capabilities, NYUTron was applied to a battery of five tasks, three clinical and two operational: 30-day all-cause readmission prediction, in-hospital mortality prediction, comorbidity index prediction, length of stay (LOS) prediction and insurance denial prediction. In addition, a detailed analysis of the 30-day readmission task was performed to investigate data efficiency, generalizability, deployability, and potential clinical impact. NYUTron demonstrated an area under the curve (AUC) of 78.7–94.9%, an improvement of 5.36–14.7% compared with traditional models.
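For readers unfamiliar with the metric, the following is a minimal sketch of how an AUC can be computed for a binary task such as 30-day readmission. It is not the investigators' evaluation code: the function name `roc_auc` and the toy labels and scores are illustrative assumptions, and the calculation uses the standard rank-based (Mann-Whitney) formulation with only the Python standard library.

```python
def roc_auc(labels, scores):
    """Return the AUC: the probability that a randomly chosen positive
    case receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)

    # Assign 1-based ranks, giving tied scores their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    # Mann-Whitney U statistic for the positive class, normalised to [0, 1].
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


# Toy example (hypothetical values): 1 = readmitted within 30 days, 0 = not.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two classes, so the values reported above are expressed as percentages of that 0 to 1 scale.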

The investigators have shared code to replicate the pretraining, fine-tuning and testing of the predictive models obtained with NYU Langone electronic health records, as well as preprocessing code for the i2b2-2012 dataset and implementation steps for MIMIC-III.

Free to All
Synthetic data, experimental data, and modeling code can be accessed through the project's GitHub repository. The clinical data used for the pretraining, fine-tuning, validation and test sets were collected from the NYU Langone Health System EHR and cannot be made publicly available.
Software Used
deepspeed v0.8.0
Hugging Face Datasets v2.2.2
Hugging Face Evaluate v0.1.1
Hugging Face Transformers v4.19.2
matplotlib v3.5.2
nltk v3.6.3
pandas v1.4.2
ray v2.0.0
seaborn v0.12.2
scikit-learn v1.1.1
wandb v0.12.17
XGBoost v1.6.1
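When attempting to replicate the shared code, it can help to compare a local environment against the versions listed above. The sketch below is one way to do that with the standard library. The mapping from the listed names to PyPI distribution names (for example, `scikit-learn` for sklearn and `datasets` for Hugging Face Datasets) is an assumption made here, and the helper `check_versions` is hypothetical rather than part of the project's repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions as listed in this record; the distribution names are assumed.
PINNED = {
    "deepspeed": "0.8.0",
    "datasets": "2.2.2",
    "evaluate": "0.1.1",
    "transformers": "4.19.2",
    "matplotlib": "3.5.2",
    "nltk": "3.6.3",
    "pandas": "1.4.2",
    "ray": "2.0.0",
    "seaborn": "0.12.2",
    "scikit-learn": "1.1.1",
    "wandb": "0.12.17",
    "xgboost": "1.6.1",
}


def check_versions(pinned=PINNED):
    """Map each package to (wanted, installed, matches).

    `installed` is None when the package is not present in the
    current environment.
    """
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, matches) in check_versions().items():
        status = "ok" if matches else "differs"
        print(f"{name}: wanted {wanted}, installed {installed} ({status})")
```

A mismatch does not necessarily mean the code will fail, but matching the listed versions removes one source of variation when comparing results.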
Grant Support
W. M. Keck Research Program Grant/W. M. Keck Foundation
Other Resources

i2b2-2012: Unstructured notes from the Research Patient Data Registry at Partners Healthcare
MIMIC-III: Deidentified patient data from Beth Israel Deaconess Medical Center in Boston, MA