NYUTron: Health System Scale Language Models Are All-purpose Prediction Engines

Description

NYUTron is a large language model-based system developed to integrate in real time with clinical workflows centered on writing structured and unstructured notes and placing electronic orders. The development team queried electronic health records from all NYU Langone facilities to generate two types of datasets. The pretraining datasets ("NYU Notes", "NYU Notes–Manhattan", "NYU Notes–Brooklyn") contain a total of 10 years of unlabelled inpatient clinical notes (387,144 patients, 4.1 billion words). The fine-tuning datasets, which cover five prediction tasks ("NYU Readmission", "NYU Readmission–Manhattan", "NYU Readmission–Brooklyn", "NYU Mortality", "NYU Binned LOS", "NYU Insurance Denial", "NYU Binned Comorbidity"), each contain 1 to 10 years of inpatient clinical notes (55,791 to 413,845 patients, 51 to 87 million words) with task-specific labels (2 to 4 classes). In addition, the team used two publicly available datasets, i2b2-2012 and MIMIC-III, for testing and fine-tuning.

To assess the model's predictive capabilities, NYUTron was applied to a battery of five tasks, three clinical and two operational: 30-day all-cause readmission prediction, in-hospital mortality prediction, comorbidity index prediction, length of stay (LOS) prediction and insurance denial prediction. In addition, a detailed analysis of the 30-day readmission task was performed to investigate data efficiency, generalizability, deployability, and potential clinical impact. NYUTron achieved an area under the curve (AUC) of 78.7–94.9%, an improvement of 5.36–14.7% over traditional models.
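As an illustration of the metric reported above, the following minimal sketch computes AUC for two sets of prediction scores and compares them. It is not taken from the project's repository: the function name `roc_auc`, the toy labels and the scores are all hypothetical, and the sketch uses only the Python standard library so that it runs without dependencies (scikit-learn, listed under Software Used, provides an equivalent `roc_auc_score`). Whether the reported improvement is an absolute or relative difference is not stated here, so the absolute difference printed below is an assumption made for illustration only.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise-ranking formulation.

    Equals the probability that a randomly chosen positive example is
    scored higher than a randomly chosen negative one (ties count 0.5).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data (not from the NYU datasets): 1 = readmitted within 30 days.
labels = [0, 0, 1, 1, 0, 1]
model_a_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # stand-in for a language model
model_b_scores = [0.3, 0.6, 0.35, 0.5, 0.2, 0.4]  # stand-in for a baseline model

auc_a = roc_auc(labels, model_a_scores)
auc_b = roc_auc(labels, model_b_scores)
print(f"model A AUC: {auc_a:.3f}")
print(f"model B AUC: {auc_b:.3f}")
print(f"absolute difference: {100 * (auc_a - auc_b):.1f} percentage points")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes, which is why the metric is commonly quoted as a percentage.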

The investigators have shared code to replicate the pretraining, fine-tuning and testing of the predictive models obtained with NYU Langone electronic health records, as well as preprocessing code for the i2b2-2012 dataset and implementation steps for MIMIC-III.

Subject of Study

Subject Domain

Electronic Health Records

Keywords

Big Data

Electronic Health Records

Access

Restrictions: Free to All
Instructions: Synthetic data, experimental data, and modeling code can be accessed through the project's GitHub repository. The clinical data used for the pretraining, fine-tuning, validation and test sets were collected from the NYU Langone Health System EHR and cannot be made publicly available. Access via GitHub.

Associated Publications

Jiang LY, Liu XC, Nejatian NP, Nasir-Moin M, Wang D, Abidin A, Eaton K, Riina HA, Laufer I, Punjabi P, Miceli M, Kim NC, Orillac C, Schnurman Z, Livia C, Weiss H, Kurland D, Neifert S, Dastagirzada Y, Kondziolka D, Cheung ATM, Yang G, Cao M, Flores M, Costa AB, Aphinyanaphongs Y, Cho K, Oermann EK. Health system-scale language models are all-purpose prediction engines. Nature. 2023 Jul;619(7969):357-362.

Data Type

Electronic Health Record

Software Used

deepspeed v0.8.0

Hugging Face Datasets v2.2.2

Hugging Face Evaluate v0.1.1

Hugging Face Transformers v4.19.2

matplotlib v3.5.2

nltk v3.6.3

NVIDIA Apex

pandas v1.4.2

ray v2.0.0

seaborn v0.12.2

sklearn v1.1.1

wandb v0.12.17

XGBoost v1.6.1

Grant Support

P30 CA016087-41S1/NCI NIH

W. M. Keck Research Program Grant/W. M. Keck Foundation

Other Resources

i2b2-2012
Unstructured notes from the Research Patient Data Registry at Partners Healthcare

MIMIC-III
Deidentified patient data from Beth Israel Deaconess Medical Center in Boston, MA