NYUTron: Health System Scale Language Models Are All-purpose Prediction Engines
- Description
NYUTron is a large language model-based system developed to integrate in real time with clinical workflows centered on structured and unstructured notes and on placing electronic orders. The development team queried electronic health records from all NYU Langone facilities to generate two types of datasets. The pre-training datasets ("NYU Notes", "NYU Notes–Manhattan", "NYU Notes–Brooklyn") contain a total of 10 years of unlabelled inpatient clinical notes (387,144 patients, 4.1 billion words). The fine-tuning datasets, which cover five tasks ("NYU Readmission", "NYU Readmission–Manhattan", "NYU Readmission–Brooklyn", "NYU Mortality", "NYU Binned LOS", "NYU Insurance Denial", "NYU Binned Comorbidity"), each contain 1 to 10 years of inpatient clinical notes (55,791 to 413,845 patients, 51 to 87 million words) with task-specific labels (2 to 4 classes). In addition, the team used two publicly available datasets, i2b2-2012 and MIMIC-III, for testing and fine-tuning.
To assess the model's predictive capabilities, NYUTron was applied to a battery of five tasks, three clinical and two operational: 30-day all-cause readmission prediction, in-hospital mortality prediction, comorbidity index prediction, length of stay (LOS) prediction and insurance denial prediction. In addition, a detailed analysis of the 30-day readmission task was performed to investigate data efficiency, generalizability, deployability, and potential clinical impact. NYUTron demonstrated an area under the curve (AUC) of 78.7–94.9%, an improvement of 5.36–14.7% over traditional models.
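For readers unfamiliar with the AUC metric reported above, the following is a minimal illustrative sketch, not the investigators' evaluation code. It computes AUC for a binary task such as 30-day readmission as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function name `auc` and the toy labels and scores are assumptions made up for illustration; they are not drawn from the NYUTron datasets.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs in which the
    positive case has the higher score. Tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = readmitted within 30 days, 0 = not readmitted.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
toy_auc = auc(labels, scores)
print(f"toy AUC = {toy_auc:.3f}")
```

The pairwise form is quadratic in the number of cases and is shown only because it makes the definition explicit; at the dataset sizes described here, a rank-based implementation would be the practical choice.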
The investigators have shared code to replicate the pretraining, fine-tuning and testing of the predictive models obtained with NYU Langone electronic health records, as well as preprocessing code for the i2b2-2012 dataset and implementation steps for MIMIC-III.
Access
- Restrictions
- Free to All
- Instructions
- Synthetic data, experimental data, and modeling code can be accessed through the project's GitHub repository. The clinical data used for the pretraining, fine-tuning, validation and test sets were collected from the NYU Langone Health System EHR and cannot be made publicly available.