ChatGPT Predicts In-Hospital All-Cause Mortality for Sepsis: In-Context Learning with the Korean Sepsis Alliance Database
Abstract
Objectives
Sepsis is a leading global cause of mortality, and predicting its outcomes is vital for improving patient care. This study explored the capabilities of ChatGPT, a state-of-the-art natural language processing model, in predicting in-hospital mortality for sepsis patients.
Methods
This study utilized data from the Korean Sepsis Alliance (KSA) database, collected between 2019 and 2021, focusing on adult intensive care unit (ICU) patients and aiming to determine whether ChatGPT could predict all-cause mortality after ICU admission at 7 and 30 days. Structured prompts enabled ChatGPT to engage in in-context learning, with the number of patient examples varying from zero to six. The predictive capabilities of ChatGPT-3.5-turbo and ChatGPT-4 were then compared against a gradient boosting model (GBM) using various performance metrics.
Results
From the KSA database, 4,786 patients formed the 7-day mortality prediction dataset, of whom 718 died, and 4,025 patients formed the 30-day dataset, with 1,368 deaths. Age and clinical markers (e.g., Sequential Organ Failure Assessment score and lactic acid level) differed significantly between survivors and non-survivors in both datasets. For 7-day mortality predictions, the area under the receiver operating characteristic curve (AUROC) was 0.70–0.83 for GPT-4, 0.51–0.70 for GPT-3.5, and 0.79 for the GBM. For 30-day mortality, the AUROC was 0.51–0.59 for GPT-4, 0.47–0.57 for GPT-3.5, and 0.76 for the GBM. Zero-shot predictions of mortality at each time point from ICU admission to day 30 yielded AUROCs ranging from the mid-0.60s to 0.75 for GPT-4 and mainly from 0.47 to 0.63 for GPT-3.5.
Conclusions
GPT-4 demonstrated potential in predicting short-term in-hospital mortality, although its performance varied across different evaluation metrics.
I. Introduction
Sepsis is a life-threatening organ dysfunction caused by a dysregulated host response to infection [1]. Globally, sepsis is a leading cause of mortality and poses a significant challenge for health systems [2]. Predicting the outcomes of sepsis patients is crucial for guiding treatment decisions, allocating resources, and improving patient care [3]. Although various regression and machine learning models have been developed to estimate mortality risk in sepsis patients, these models often require extensive datasets, lack sufficient interpretability and explainability, and depend on more features than necessary [4–6]. Accordingly, they are seldom used in routine clinical practice.
ChatGPT is a state-of-the-art large language model (LLM) with over 100 billion parameters that can perform various tasks, such as text generation, summarization, and question-answering [7]. Since its launch in late 2022, ChatGPT has shown impressive capabilities in numerous fields, particularly medicine, where it has achieved results comparable to, or even surpassing, those of medical experts in answering medical questions and examinations [8–10]. However, applying LLMs in real-world clinical settings involves more than retrieving information; it also requires clinical reasoning and decision support. NYUTron, a recently developed LLM enhanced with medical knowledge, has shown significant promise in predicting in-hospital mortality and 30-day all-cause readmission; nonetheless, the advances it offers come at the cost of the substantial additional investment needed for such refinement.
In this study, we investigated the potential of ChatGPT to overcome these challenges by utilizing its pre-trained parameters and extensive dataset to generate natural language responses to structured prompts. We did not require additional training or fine-tuning of ChatGPT; instead, we employed in-context learning to tailor it to the specific task. We evaluated ChatGPT’s effectiveness in predicting in-hospital mortality among sepsis patients using clinical data and scenarios.
II. Methods
1. Dataset
This study was a secondary analysis of data prospectively collected from the Korean Sepsis Alliance (KSA) database between September 2019 and December 2021. The KSA database serves as a nationwide registry of sepsis cases from 16 tertiary or university-affiliated hospitals across South Korea [11]. The Institutional Review Boards of all participating hospitals, including Samsung Medical Center (Approval No. 2018-05-108), waived the requirement for informed consent because of the study's observational nature. The research was conducted in accordance with the principles outlined in the Declaration of Helsinki.
Adult patients (over the age of 18) who were admitted to the intensive care unit (ICU) were included in the study. Clinical characteristics, including age (stratified by the Charlson comorbidity index), sex, Sequential Organ Failure Assessment (SOFA) score, serum lactic acid levels, and in-hospital mortality, were collected. All-cause mortality was assessed from the first day of the ICU stay up to the 30th day. Patients who were transferred to other hospitals were excluded from the analysis because their final outcomes, whether death or survival, were unknown.
The research employed two sub-datasets from the KSA database, each aimed at predicting all-cause mortality before discharge at two different time points: day 7 and day 30. In the dataset used for predicting 7-day mortality, patients who were transferred out before day 7 were excluded. Similarly, in the dataset for 30-day mortality prediction, patients transferred before day 30 were also excluded.
2. Task
The objective of this study was to determine experimentally whether ChatGPT can predict patients' future outcomes from the clinical data provided. We assessed its ability to predict survival or death at 7 days and 30 days after the first day of ICU admission, using the SOFA score and lactic acid level from that day. This analysis was conducted using the two sub-datasets extracted from the KSA database. For the 30-day dataset, we additionally collected survival or death status at multiple time points after ICU admission (0, 1, 2, 3, 4, 5, 6, 7, 14, 21, 28, and 30 days) and assessed how the accuracy of predictions based on the first day's data varied over time.
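To make the per-day outcome definition concrete, the following minimal sketch (in Python, assuming a hypothetical pandas DataFrame with a 'death_day' column recording the day of in-hospital death, NaN for survivors) shows one way binary labels at each assessment day could be derived; it is illustrative only and does not reproduce the KSA database schema.

```python
# Sketch only: derive binary outcome labels at each assessment day.
# 'death_day' is a hypothetical column (day of in-hospital death, NaN if survived).
import pandas as pd

ASSESS_DAYS = [0, 1, 2, 3, 4, 5, 6, 7, 14, 21, 28, 30]

def label_outcomes(df: pd.DataFrame) -> pd.DataFrame:
    labels = pd.DataFrame(index=df.index)
    for d in ASSESS_DAYS:
        # 1 = died on or before day d; 0 = still alive at day d.
        # Comparisons with NaN evaluate to False, so survivors are labeled 0.
        labels[f"dead_by_day_{d}"] = (df["death_day"] <= d).astype(int)
    return labels
```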
3. Prompt
The prompts were structured into two parts, providing examples for in-context learning and then asking for a prediction for a new case.
Examples were provided as follows:
“sex (male or female), age score in Charlson Comorbidity Index (1–5), SOFA score (X), lactic acid level (Y) at the first day of ICU. The patient was observed (survived or died) after 7 days from ICU admission.”
And for the prediction, the questions for new cases were as follows:
“(male or female), (Charlson age), (SOFA score), (lactic acid level) at the first day of ICU. What is your prediction on the outcome of survival or death at the 7 days after ICU admission?”
Figure 1 offers a schematic illustration of the interaction between the user and ChatGPT. It shows how examples for in-context learning and questions were presented, followed by the corresponding responses. This figure helps clarify the process by which the model uses the provided data to make predictions.
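As a minimal sketch of how such prompts might be assembled and submitted through the OpenAI API, the following Python code illustrates a one-shot query; the helper functions, field values, and decoding of the response are illustrative assumptions rather than the exact scripts used in this study.

```python
# Sketch of prompt construction and a single chat-completion call.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def format_example(sex, charlson_age, sofa, lactate, outcome, horizon=7):
    """One in-context example, following the structure quoted above."""
    return (f"{sex}, age score {charlson_age} in Charlson Comorbidity Index, "
            f"SOFA score {sofa}, lactic acid level {lactate} at the first day "
            f"of ICU. The patient was observed {outcome} after {horizon} days "
            f"from ICU admission.")

def format_question(sex, charlson_age, sofa, lactate, horizon=7):
    """The new case whose outcome the model is asked to predict."""
    return (f"{sex}, age score {charlson_age} in Charlson Comorbidity Index, "
            f"SOFA score {sofa}, lactic acid level {lactate} at the first day "
            f"of ICU. What is your prediction on the outcome of survival or "
            f"death at the {horizon} days after ICU admission?")

# One-shot prompt: a single representative example followed by the question.
examples = [format_example("male", 3, 9, 2.4, "survived")]
question = format_question("female", 4, 13, 7.1)
prompt = "\n".join(examples + [question])

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)  # e.g., a prediction of survival or death
```

Varying the number of entries in `examples` from zero to six corresponds to the zero-shot through few-shot conditions described in the next subsection.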
4. In-Context Learning
To evaluate ChatGPT’s predictive capabilities under various conditions, we manipulated the number of examples provided as follows: in the zero-shot scenario, ChatGPT was tasked with predicting a patient’s survival or death without any prior examples. In the one-shot scenario, a single representative patient example was provided, based on which ChatGPT made its prediction. In the two-shot scenario, two patient examples were presented, and in the few-shot scenario, six patient examples were used.
The examples of patients selected for in-context learning consisted of representative cases from those included in the study. Since the SOFA score was normally distributed, the mean and standard deviation (SD) were used (mean – SD, mean, mean + SD). Conversely, because the distribution of lactic acid levels was skewed, representative values were selected using the first quartile, median, and third quartile (Supplementary Table S1, Figure S2).
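A brief sketch of how such representative values could be computed, assuming a hypothetical pandas DataFrame `train` containing 'sofa' and 'lactate' columns, is shown below.

```python
# Sketch only: choose representative values for in-context examples.
import pandas as pd

def representative_values(train: pd.DataFrame) -> dict:
    # SOFA score: approximately normal, so use mean - SD, mean, and mean + SD.
    sofa_mean, sofa_sd = train["sofa"].mean(), train["sofa"].std()
    sofa_reps = [sofa_mean - sofa_sd, sofa_mean, sofa_mean + sofa_sd]

    # Lactic acid: skewed, so use the first quartile, median, and third quartile.
    lactate_reps = train["lactate"].quantile([0.25, 0.50, 0.75]).tolist()

    return {"sofa": [round(v, 1) for v in sofa_reps],
            "lactate": [round(v, 1) for v in lactate_reps]}
```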
5. Experiment and Comparison
The predictions were conducted using ChatGPT-3.5-turbo and ChatGPT-4 through the OpenAI API. Additionally, a gradient boosting model (GBM) was employed to demonstrate the performance of a conventional machine learning model. The GBM was trained using the scikit-learn package, with an 8:2 dataset split for training and validation purposes. To ensure a fair comparison, the performance of ChatGPT was assessed using a randomly selected sample of 100 patients from the validation set, while the examples of patients used for in-context learning were drawn from the training set. For each validation, we calculated accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUROC) score to compare the performance of ChatGPT-3.5-turbo, ChatGPT-4, and gradient boosting (Figure 2).
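For illustration, the GBM baseline and the evaluation metrics might be implemented roughly as follows; the feature columns, hyperparameters, and random seed shown here are assumptions, not the study's exact configuration.

```python
# Sketch only: GBM baseline with an 8:2 split and the reported metrics.
# Assumes a hypothetical DataFrame `df` with numerically encoded features
# and a binary label 'died_by_day_7'.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

X = df[["sex_male", "charlson_age", "sofa", "lactate"]]
y = df["died_by_day_7"]

# 8:2 split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

gbm = GradientBoostingClassifier(random_state=42)
gbm.fit(X_train, y_train)

y_pred = gbm.predict(X_val)
y_prob = gbm.predict_proba(X_val)[:, 1]

metrics = {
    "accuracy": accuracy_score(y_val, y_pred),
    "precision": precision_score(y_val, y_pred),
    "recall": recall_score(y_val, y_pred),
    "f1": f1_score(y_val, y_pred),
    "auroc": roc_auc_score(y_val, y_prob),
}
print(metrics)
# ChatGPT was evaluated on a random sample of 100 patients from the same
# validation set; the same metric functions can be applied to its parsed responses.
```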
III. Results
1. Patients and Baseline Characteristics
The KSA database included a total of 11,981 adult sepsis patients, of whom 4,890 were admitted to the ICU during their hospital stay (Supplementary Figure S1). By day 7, 104 patients had been transferred to another hospital, and by day 30, the number of transfers had increased to 865. Consequently, the dataset used for predicting all-cause mortality before discharge by day 7 included 4,786 patients (7D dataset), while the dataset for the 30-day mortality prediction comprised 4,025 patients (30D dataset).
Out of the 4,786 patients in the 7D dataset, 718 died within 7 days (Table 1). The sex distribution did not differ significantly between survivors and non-survivors (p = 0.825). Among survivors, patients in their 70s were the most common, comprising 30.0% (1,221/4,068), whereas in the mortality group, patients in their 80s were the most prevalent, at 36.8% (264/718), with a significant difference in age distribution between the two groups (p < 0.001). The mortality group also had a higher SOFA score (12.6 ± 3.7) than the survivors (9.3 ± 3.6; p < 0.001). Additionally, lactic acid levels were higher in the mortality group (median [interquartile range], 7.1 [3.9–11.8]) than in the survivors (2.4 [1.5–4.4]; p < 0.001).
Of the 4,025 patients in the 30D dataset, 1,368 died within 30 days (Table 2). The proportion of male patients was 58.0% (1,540/2,657) among survivors and 61.8% (845/1,368) in the mortality group, a significant difference (p = 0.022). In the survivor group, patients in their 70s constituted the largest proportion (29.6%; 787/2,657), whereas in the mortality group, patients in their 80s were the most prevalent (33.0%; 452/1,368), with a significant difference in age distribution between the groups (p < 0.001). The SOFA score was higher in the mortality group (12.0 ± 3.8) than in the survivors (9.0 ± 3.5; p < 0.001). Similarly, the lactic acid level was higher in the mortality group (median [interquartile range], 5.0 [2.5–9.9]) than in the survivor group (2.3 [1.5–4.1]; p < 0.001).
2. Predicting All-Cause Mortality before Discharge by Day 7
The predictive performance of GPT-4 for 7-day mortality showed accuracy values ranging from 0.48 to 0.77, precision from 0.98 to 1.00, recall from 0.40 to 0.76, F1-scores from 0.57 to 0.86, and AUROC values from 0.70 to 0.83. The model with the highest recall (0.76), F1-score (0.86), and AUROC (0.83) was the one-shot model (7D_S_3). In contrast, GPT-3.5 demonstrated a wide range of prediction performance, with accuracy, precision, recall, F1-score, and AUROC values ranging from 0.13 to 0.67, 0.92 to 1.00, 0.02 to 0.69, 0.04 to 0.79, and 0.51 to 0.70, respectively (Table 3, Figure 3A). GBM achieved an AUROC of 0.79 (Supplementary Table S2).
3. Predicting All-Cause Mortality before Discharge by Day 30
For 30-day mortality, GPT-4 showed accuracy values ranging from 0.44 to 0.57, precision from 0.65 to 0.73, recall from 0.21 to 0.67, F1-scores from 0.32 to 0.68, and AUROC values from 0.51 to 0.59. The model with the highest recall (0.67), F1-score (0.68), and AUROC (0.59) was the one-shot model (7D_S_3). Meanwhile, GPT-3.5 exhibited accuracy values ranging from 0.35 to 0.57, precision from 0.00 to 1.00, recall from 0.00 to 0.61, F1-scores from 0.00 to 0.64, and AUROC values from 0.47 to 0.57 (Table 4, Figure 3B). GBM achieved an AUROC of 0.76 (Supplementary Table S3).
4. Predicting All-Cause Mortality from ICU Admission to Day 30 with a Zero-Shot Approach
In a zero-shot scenario without any patient examples, ChatGPT was tasked with predicting mortality using the SOFA score and lactic acid level from the first day of ICU admission. For mortality on the day of admission, GPT-4 achieved an AUROC of 0.75, compared to 0.49 for GPT-3.5. For mortality at days 1 through 7 after admission, GPT-4's AUROCs were 0.73, 0.73, 0.71, 0.71, 0.72, 0.70, and 0.69, while GPT-3.5's were 0.54, 0.53, 0.63, 0.49, 0.57, 0.53, and 0.54. For mortality at 14, 21, 28, and 30 days, GPT-4 maintained an AUROC of 0.66, whereas GPT-3.5 showed AUROCs of 0.47, 0.59, 0.58, and 0.54 (Table 5, Figure 4).
IV. Discussion
This study aimed to investigate the potential of ChatGPT for predicting in-hospital mortality among sepsis patients using clinical data from the KSA database. The findings indicate that ChatGPT, particularly GPT-4, can forecast short-term clinical outcomes from data collected on a patient's first day in the ICU. Among the models tested, GPT-4 exhibited superior performance in predicting 7-day mortality with a one-shot example, achieving an AUROC of 0.83, an F1-score of 0.876, a precision of 0.98, and a recall of 0.76. Remarkably, GPT-4 also demonstrated strong predictive ability with a zero-shot approach, achieving an AUROC of 0.81, an F1-score of 0.76, a precision of 1.00, and a recall of 0.61. This level of performance, attained without any tailored training and relying solely on pre-trained knowledge, is comparable to that of specialized machine learning models such as GBM, which required training on 80% of the dataset to achieve an AUROC of 0.79.
Predicting 30-day mortality based solely on data from the first day of ICU admission proved to be challenging. The AUROC for GPT-4 ranged from 0.51 to 0.59, significantly lower than its performance in 7-day predictions. In contrast, the GBM machine learning model demonstrated a more robust performance, with an AUROC of 0.76. Further analysis indicated a temporal dependency in the predictions: GPT-4’s predictive accuracy declined as the time between the data collection and the targeted prediction date increased (Figure 4). A similar trend was observed in the GBM model. These findings suggest that the relevance of specific features in model predictions may vary over time, a characteristic inherent to time-series data [12,13]. This highlights potential limitations in using initial ICU data for long-term predictions.
A central methodological feature of this study was the use of in-context learning. Generally, when language models are prompted with examples, their performance tends to improve due to their capacity to identify underlying patterns, a phenomenon known as “language models are few-shot learners” [14–16]. This was particularly noticeable in GPT-3.5, especially in its predictions of 7-day mortality. However, GPT-4 demonstrated superior performance in a zero-shot scenario. Contrary to expectations, as the number of examples increased, the model’s performance did not improve but instead declined, as illustrated in Figure 2. While the precise cause of this trend is unclear, one hypothesis is that GPT-4, with its extensive training data and increased parameters, might already have a significant inherent ability for inference. The examples provided during in-context learning could inadvertently introduce a negative bias, which might explain the observed decrease in predictive performance.
Throughout the study, we utilized GBM as a representative benchmark for classical machine learning predictions. This decision was based on comparative evaluations in which GBM consistently outperformed logistic regression, random forest, and decision tree (Supplementary Tables S2, S3). When predicting 7-day and 30-day mortality across various datasets, GBM generally surpassed GPT-4. Given GBM’s specific design for predictive tasks, its proficiency in extracting and leveraging information from the data is to be expected [17,18]. This robustness of GBM as a predictive tool was clearly demonstrated in our study. In contrast, ChatGPT was not originally trained for predictive tasks but was developed to generate coherent text sequences [19]. However, it was noteworthy that in certain tasks, GPT-4 showed a predictive capability that rivaled that of GBM.
The prediction of mortality in sepsis patients is a crucial component of personalized medicine, facilitating tailored treatment strategies and optimal resource allocation [20–22]. Recent advances in machine learning and artificial intelligence have enhanced the use of these technologies for prognostic assessments. Notably, the fine-tuning of the large language model NYUTron has been reported to improve in-hospital mortality predictions [4,6,23–26]. However, such AI models generally require large datasets and substantial computational expertise for their development. In contrast, ChatGPT, a pretrained large language model, simplifies user interaction by providing direct responses to prompts, thereby eliminating the need for specialized computer science knowledge. Available through a web-based platform, ChatGPT also offers interpretable explanations for its mortality predictions, making it more accessible and useful for users without technical backgrounds [8,19,27].
This study represents one of the initial efforts to validate ChatGPT's predictive capabilities in the clinical domain. Despite these contributions, the study has several limitations. The performance of GPT-4 and GPT-3.5 varied considerably depending on the evaluation metric used, with notable discrepancies between precision and recall. This variation can be attributed to the uneven distribution of mortality cases among the patients included in this study (less than 30%). The study also attempted to predict in-hospital mortality using only the SOFA score and lactic acid level; relying on just two variables from the vast array of clinical factors is a significant limitation. In clinical practice, numerous factors influence patient outcomes, and incorporating a broader range of variables would likely improve predictive accuracy and reliability. While predicting mortality is crucial, it is equally important to evaluate ChatGPT's predictive capacity across a diverse range of clinical scenarios. Because of the rate limits and cost of the OpenAI API, not all cases were verified, and the use of random sampling may have compromised the study's reliability. Moreover, focusing exclusively on ChatGPT, given the availability of various other LLMs, raises questions about the generalizability of the findings to other models. Nevertheless, this study explored the predictive capacity of ChatGPT using clinical data from the KSA database and demonstrated its potential for interpreting clinical data and predicting future clinical outcomes.
In conclusion, this experimental study evaluated ChatGPT’s ability to predict all-cause in-hospital mortality among sepsis patients. GPT-4 showed promise in forecasting short-term in-hospital mortality, although its performance differed across various evaluation metrics. Therefore, additional research is necessary to fully ascertain its capabilities, limitations, and optimal uses in the medical field.
Data Availability
Raw data were generated by the Korean Sepsis Alliance (KSA) and are available from the KSA on request. The data are not publicly available because of privacy and ethical restrictions.
Acknowledgments
The following people and institutions participated in the Korean Sepsis Alliance (KSA): Steering Committee, Chae-Man Lim (Chair), Kyeongman Jeon, Dong Kyu Oh, Sunghoon Park, Yeon Joo Lee, Sang-Bum Hong, Gee Young Suh, Young-Jae Cho, Ryoung-Eun Ko, and Sung Yoon Lim; Participating Persons and Centers, Kangwon National University Hospital, Jeongwon Heo; Korea University Anam Hospital, Jae-myeong Lee; Daegu Catholic University Hospital, Kyung Chan Kim; Seoul National University Bundang Hospital, Yeon Joo Lee; Inje University Sanggye Paik Hospital, Youjin Chang; Samsung Medical Center, Kyeongman Jeon; Seoul National University Hospital, Sang-Min Lee; Asan Medical Center, Chae-Man Lim and Suk-Kyung Hong; Pusan National University Yangsan Hospital, Woo Hyun Cho; Chonnam National University Hospital, Sang Hyun Kwak; Jeonbuk National University Hospital, Heung Bum Lee; Ulsan University Hospital, Jong-Joon Ahn; Jeju National University Hospital, Gil Myeong Seong; Chungnam National University Hospital, Song-I Lee; Hallym University Sacred Heart Hospital, Sunghoon Park; Hanyang University Guri Hospital, Tai Sun Park; Severance Hospital, Su Hwan Lee; Yeungnam University Medical Center, Eun Young Choi; Chungnam National University Sejong Hospital, Jae Young Moon.
Notes
Conflict of Interest
No potential conflict of interest relevant to this article was reported.
This work was supported by the "Future Medicine 2030 Project" of the Samsung Medical Center (No. SMX1230771), the "Bio&Medical Technology Development Program" of the Korean government (MSIT) (No. RS-2023-00222838), and a research program funded by the Korea Disease Control and Prevention Agency (Fund Codes No. 2019E280500, 2020E280700, and 2021-10-026).
Supplementary Materials
Supplementary materials can be found via https://doi.org/10.4258/hir.2024.30.3.266.