Machine Learning for Benchmarking Critical Care Outcomes

Article information

Healthc Inform Res. 2023;29(4):301-314
Publication date (electronic) : 2023 October 31
doi :
1Clinical Integration and Insights, Philips, Cambridge, MA, USA
2Clinical Integration and Insights, Philips, Eindhoven, The Netherlands
3EMR & Care Management, Philips, Cambridge, MA, USA
Corresponding Author: Louis Atallah, Clinical Integration and Insights, Philips, 222 Jacobs Street, 7th Floor, Cambridge, MA 02141, USA. Tel: +1-617-798-8244, E-mail: (
Received 2022 December 9; Revised 2023 August 23; Accepted 2023 September 25.



Enhancing critical care efficacy involves evaluating and improving system functioning. Benchmarking, a retrospective comparison of results against standards, aids risk-adjusted assessment and helps healthcare providers identify areas for improvement based on observed and predicted outcomes. The last two decades have seen the development of several models using machine learning (ML) for clinical outcome prediction. ML is a field of artificial intelligence focused on creating algorithms that enable computers to learn from and make predictions or decisions based on data. This narrative review centers on key discoveries and outcomes to aid clinicians and researchers in selecting the optimal methodology for critical care benchmarking using ML.


We used PubMed to search the literature from 2003 to 2023 regarding predictive models utilizing ML for mortality (592 articles), length of stay (143 articles), or mechanical ventilation (195 articles). We supplemented the PubMed search with Google Scholar, making sure relevant articles were included. Given the narrative style, papers in the cohort were manually curated for a comprehensive reader perspective.


Our report presents comparative results for benchmarked outcomes and emphasizes advancements in feature types, preprocessing, model selection, and validation. It showcases instances where ML effectively tackled critical care outcome-prediction challenges, including nonlinear relationships, class imbalances, missing data, and documentation variability, leading to enhanced results.


Although ML has provided novel tools to improve the benchmarking of critical care outcomes, areas that require further research include class imbalance, fairness, improved calibration, generalizability, and long-term validation of published models.

I. Introduction

Performance comparison is an important aspect of benchmarking in critical care, whether to observe a critical care unit over time or to compare units, hospitals, or even health systems across geographic regions [1,2]. Benchmarking outcomes in critical care, such as mortality or length of stay, allows a risk-adjusted comparison with healthcare leaders as a proxy for quality and efficacy of care. Risk adjustment models have been the cornerstone for benchmarking outcomes in critical care. These models allow the prediction of outcomes to enable the benchmarking or comparison of actual versus predicted outcomes among peers. Outcomes are difficult to interpret unless they are risk-stratified for diagnosis groups, severity of illness, and other patient characteristics [3].

Several taskforces worldwide have recommended the use of quality indicators that are measurable, comparable, and relevant across critical care units [1,4–6]. Regarding outcomes, several measures have been proposed [3,7]. An example is mortality, which is utilized as a quality indicator in intensive care units (ICUs) due to its direct reflection of patient outcomes; it serves to measure the effectiveness of medical interventions and the overall quality of care provided. Mortality is usually assessed using the standardized mortality ratio, which compares actual hospital mortality to predicted mortality through risk-adjusted scoring systems. Morbidity and complications, such as acute renal failure, hemodialysis, and prolonged mechanical ventilation, are more prevalent than mortality events and are also used as outcome measures [8]. Length of stay, encompassing both hospital and ICU durations, is commonly employed as an indicator of cost and efficiency; however, it is influenced by variables like structural factors and patient transfers [9]. Variation in ICU readmissions can also highlight opportunities for enhancement and is potentially influenced by ICU discharge practices [10]. Ventilation outcomes, including mechanical ventilation duration [11] and probability [12], facilitate the comparison of ventilator practices across ICUs. Ventilation outcomes are also valuable for controlling patient disparities in clinical trials or weaning techniques and for advancing quality improvement endeavors. Patient-reported outcomes, used to a lesser degree, cover a range of aspects, such as cognition, fatigue, pain, psychological well-being, activities of daily living, sleep, appetite, and alcohol consumption [13].
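As a minimal illustration of the standardized mortality ratio mentioned above, the sketch below divides observed deaths by the deaths expected from a risk model (the sum of per-patient predicted probabilities). The function name and the ten-patient cohort are hypothetical and are not drawn from any study cited here.

```python
def standardized_mortality_ratio(observed_deaths, predicted_risks):
    """Observed deaths divided by expected deaths.

    observed_deaths: list of 0/1 outcomes (1 = died in hospital).
    predicted_risks: list of model-predicted death probabilities.
    """
    if len(observed_deaths) != len(predicted_risks):
        raise ValueError("outcomes and risks must have the same length")
    expected = sum(predicted_risks)  # expected number of deaths
    if expected == 0:
        raise ValueError("expected deaths must be positive")
    return sum(observed_deaths) / expected


# Hypothetical unit with ten admissions (illustrative values only).
outcomes = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
risks = [0.60, 0.10, 0.05, 0.20, 0.40, 0.05, 0.10, 0.30, 0.10, 0.10]
smr = standardized_mortality_ratio(outcomes, risks)
print(f"SMR = {smr:.2f}")
```

A ratio above 1 indicates more deaths than the model predicted for that case mix, and a ratio below 1 indicates fewer.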

Machine learning (ML) constitutes a field within computer science where statistical techniques are employed to analyze data, which facilitates classification, prediction, and optimization by leveraging past data observations. It can help address issues such as imbalanced classes (such as deaths versus surviving patients), missing data, and variation in documentation. This narrative review is meant for clinicians and scientists who would like to understand some of the most important directions in developing these models for benchmarking clinical outcomes. We also highlight the most important sources of bias and variations in performance, aiming to give researchers a concrete list of factors to consider when planning benchmarking studies.

II. Methods

This article reviews ML approaches for benchmarking clinical outcomes in the ICU with a focus on mortality, length of stay, and mechanical ventilation. The literature search was conducted on PubMed, including all articles and reviews between January 1, 2003 and August 1, 2023. Search terms for mortality were “mortality,” “ICU” AND (“machine learning” OR “artificial Intelligence”). For length of stay, they were “length of stay,” “ICU” AND (“machine learning” OR “artificial Intelligence”). For ventilation, the search terms were “ventilation,” “ICU” AND (“machine learning” OR “artificial Intelligence”). Only articles related to adult critical care in English were included. The searches above were also conducted in Google Scholar to ensure that relevant works were not excluded, and to add any missing articles. The initial search yielded 592 articles on mortality, 143 on length of stay, and 195 on ventilation. After a meticulous review, 26, 12, and nine pertinent papers were chosen for each respective domain. For mortality and length of stay, we eliminated articles focusing on specific patient groups and focused on approaches applicable to all critical care patients. An added condition for mortality was the use of a dataset of more than 10,000 patients to enable a fair comparison of results between different studies. In this narrative review, we focus on the important directions for ML in each outcome area rather than providing an exhaustive listing of prior work.

III. Outcome Benchmarking with ML

1. Mortality Benchmarking with ML

Mortality prediction models are applied to critical care patients for benchmarking and stratification into different risk categories. The most widely used models are the Acute Physiology and Chronic Health Evaluation (APACHE) models, the Simplified Acute Physiology Score (SAPS) I–III, and the Mortality Prediction Model (MPM) [14]. However, other models have been developed for improved calibration in particular regions, such as the Intensive Care National Audit & Research Centre (ICNARC) in the UK [15].

Several reviews have covered mortality models: Keuning et al. [16] surveyed predictive mortality models and focused mostly on statistical linear models. An earlier review by Strand et al. [17] reviewed articles focusing on prognostic, single-organ failure, trauma scores and organ dysfunction scores. Siontis et al. [18] evaluated predictive mortality models with a focus on specific patient groups. Promising approaches for particular groups such as brain injury [19] and coronavirus disease 2019 (COVID-19) patients [20] have also been explored. The 2012 PhysioNet/Computing in Cardiology Challenge focused on the prediction of in-hospital mortality of ICU patients leading to several new prediction models [21]. In a more recent review by Barboi et al. [22], the authors highlighted that ML-based models can accurately predict ICU mortality as an alternative to traditional scoring models. However, they concluded that the results cannot be generalized due to the high degree of heterogeneity and that clinicians should only select models with sufficient validation for use in a practice environment.

Table 1 summarizes several relevant articles on ML for predicting mortality [15,23–46]. Note that there are several aspects of mortality prediction: at the ICU level or hospital level, within 48 or 72 hours after discharge, and 28-day and 90-day mortality, among others. The periods used are variable. For example, mortality may be predicted on admission to the ICU, during the first 6 hours [32], 24 hours after arrival (similar to APACHE), during the last ICU day [28], or even in a continuous manner [36].

Table 1. Summary of studies that use machine learning to predict mortality

Although it is challenging to compare the approaches in Table 1, since many were developed on different datasets and predict different types of mortality (ICU or in-hospital versus post-release), we may summarize the main observations:

Improved interpretability: Numerous ML algorithms have faced criticism for their “black box” nature, which limits interpretability. This concern is particularly evident in deep learning models, where a balance between predictive accuracy and interpretability must be struck. Deep learning, a subset of ML, organizes algorithms into layers, forming an artificial neural network capable of learning from data. Methods such as Shapley values [47], used in Thorsen-Meyer et al. [23] and Caicedo-Torres et al. [24], can convey the importance (or weighting) that the deep model assigns to each input feature, which offers improved interpretability for these networks.
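To make the idea of Shapley-based feature attribution concrete, the following sketch computes exact Shapley values for a toy two-feature risk function. It assumes that "removing" a feature means replacing it with a baseline value, which is one common convention; the toy model, feature values, and baseline are hypothetical, and the cited studies used their own tooling on deep networks rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature subsets.

    Features outside a subset are set to their baseline value
    (an assumption of this sketch). Only feasible for few features.
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # subset sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                s = set(s)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical risk score with an interaction term: [age, lactate].
def toy_model(z):
    return 0.03 * z[0] + 0.4 * z[1] + 0.01 * z[0] * z[1]


x = [70.0, 4.0]          # patient of interest
baseline = [50.0, 1.0]   # reference patient
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for the patient and the prediction for the baseline, which is the property that makes them readable as per-feature contributions.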

Features used: The approaches summarized vary between models that use features similar to existing models (APACHE) as well as some novel features. The benefits of using simple features such as demographics, labs, and vitals are their availability, reliability, and ease of use. Even a reduced set of features, such as the 15 selected by Kim et al. [31], showed a good area under the curve (AUC) when used with ML models. However, when combined with static features, physiological time series such as vitals and interventions offer an improved means of continuous mortality prediction [23]. Another promising direction is to use semi-structured data, such as those present in diagnosis and inspection reports [37]. Methods such as topic modeling from clinical notes can be added to traditional variables to improve prediction [37,45]. Grnarova et al. [26] proposed a convolutional document-embedding approach applied to clinical notes showing high AUC values. However, variations in clinical annotation practices across health systems may affect how benchmarking may be applied to this type of model. Purushotham et al. [46] compared hand-picked features (such as those used for SAPS-II), raw values of features, and inputs without pre-processing. They showed that when using models that can learn data representations (such as deep learning models), unprocessed inputs provided the best results. Although premorbid functional status and diagnosis are known predictors of ICU-relevant study outcomes, they are not regularly implemented in established scoring systems. Moser et al. [34] included this information, showing increased predictive model performance compared to predictions from established risk scoring systems.

Model choice: A one-size-fits-all model is unlikely, since model selection depends on the type of features used (raw data or clinical notes versus hand-picked clinical features) and outcomes required (continuous mortality prediction versus in-hospital and post-discharge). However, several promising approaches have addressed mortality prediction in different ways. Purushotham et al. [46] benchmarked the performance of deep learning models with respect to ensemble ML models and prognostic scoring systems, showing improved performance of deep learning models. Deep learning also offered promising results in Caicedo-Torres et al. [24], who used multi-scale deep convolutional neural networks. It was also used in Aczon et al. [25] regarding pediatric mortality risk. A convolutional document-embedding approach based on the textual content of clinical notes was proposed by Grnarova et al. [26]. Another popular approach is to use ensemble classifiers to leverage the power of different groups of classifiers. Guo et al. [30] proposed a dynamic ensemble-learning algorithm based on k-means (DELAK) for mortality prediction. They used k-means sampling to generate several data subsets on which base classifiers could learn the classification boundary. El-Rashidy et al. [35] used a stacking ensemble classifier, leading to a high AUC for in-hospital mortality, whereas Awad et al. [32] used an ensemble-learning random forest model.

Class imbalance: A common problem with mortality prediction is class imbalance, since deaths are far less frequent than survivals. Several strategies are commonly deployed, either to pre-process imbalanced data (re-sampling, optimizing feature space) or to provide new algorithms that can address this problem. Bhattacharya et al. [29], for example, proposed a binary classifier consisting of skewness-based transformation of input features and statistical hypothesis tests to obtain the final classification. The balanced random forest (BRF) algorithm was used by Li et al. [48] to address class imbalance with promising results.
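The re-sampling strategy can be sketched as follows. The example draws a balanced bootstrap sample, which is the sampling step usually described for balanced random forests: the minority class is bootstrapped and an equal number of majority-class cases is drawn with replacement. The function name and the 10%-mortality cohort are hypothetical, and this is not the implementation used in the cited study.

```python
import random


def balanced_bootstrap(labels, rng):
    """Return row indices for one balanced bootstrap sample.

    Draws n minority cases and n majority cases with replacement,
    where n is the size of the minority class.
    """
    classes = set(labels)
    if len(classes) != 2:
        raise ValueError("expected exactly two classes")
    minority_label = min(classes, key=labels.count)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    n = len(minority)
    sample = rng.choices(minority, k=n) + rng.choices(majority, k=n)
    rng.shuffle(sample)
    return sample


# Hypothetical cohort: 10 deaths (1) among 100 ICU stays.
labels = [1] * 10 + [0] * 90
rng = random.Random(0)  # fixed seed for reproducibility
idx = balanced_bootstrap(labels, rng)
print(len(idx), sum(labels[i] for i in idx))
```

In a balanced random forest, each tree would be fit on a fresh sample of this kind, so that every tree sees both outcomes equally often while the ensemble as a whole still covers most of the majority class.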

2. Length of Stay Benchmarking Using ML

Length of stay (LOS) predictions can be used for planning, identifying individuals with unexpectedly long (or short) LOS, and benchmarking. The models for LOS prediction enable case-mix correction when comparing LOS between ICUs, hospitals, or even health systems across geographic regions. Verburg et al. conducted a systematic review of models that can be used for predicting LOS [9]. They identified 11 studies describing the development of 31 prediction models and three studies describing the external validation of one of these models. For benchmarking, they concluded that none of the models satisfied their criteria for performance with the exception of the original [49] and the second-order recalibration of APACHE [50]. However, none of the models considered fulfilled their requirements for moderate calibration. It is worth noting that the models reviewed were multivariable linear models, which assume linearity between LOS and its covariates or predictors. This assumption might not capture the complexity of the relationship; although patients with more severe illness tend to have a longer LOS, they also have a higher mortality risk, which could lead to shorter stays. Another important observation highlighted by Verburg et al. [9] is that LOS distributions are asymmetrical (right-skewed) and present multimodality, since patient discharge tends to occur at particular times of day.

Given the above, nonlinear models have shown promising results in LOS prediction versus linear/statistical models. The recent review by Peres et al. [51] covers several approaches that have been proposed to address some disadvantages of using linear models. Another recent review [52] focused on the use of ML for predicting medical inpatient LOS with a focus on non-ICU patients. In this report, we focus on the use of ML for predicting LOS for ICU patients.

Methods for LOS prediction fall into two types: regression, which predicts LOS as a continuous value, and classification, which assigns patients to distinct groups, such as extended versus brief stays. The results for LOS regression models are usually assessed using the coefficient of determination (R-squared), root-mean-squared error (RMSE), and mean absolute error (MAE). The concordance correlation coefficient is also presented in some studies. The classification results (a long stay, for example) are usually presented using the AUC metric, as well as sensitivity, specificity, and prediction accuracy. Recent studies have proposed the use of classification models that convert length of stay into a binary or multi-class problem and classify LOS into smaller buckets [53].
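The evaluation metrics named above can be written out in a few lines. The sketch below uses only the Python standard library; the LOS values, labels, and scores are made up for illustration, and the AUC is computed as the fraction of positive-negative pairs that are ranked correctly, with ties counted as one half.

```python
from math import sqrt


def mae(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Regression view: LOS in days (illustrative values).
los_true = [2.0, 3.0, 5.0, 10.0]
los_pred = [2.5, 2.5, 6.0, 8.0]
print(mae(los_true, los_pred), rmse(los_true, los_pred),
      r_squared(los_true, los_pred))

# Classification view: 1 = long stay, with predicted probabilities.
long_stay = [0, 0, 1, 1, 0]
prob_long = [0.1, 0.4, 0.35, 0.8, 0.2]
print(auc(long_stay, prob_long))
```

The pairwise AUC is quadratic in the number of patients, which is fine for a sketch but would normally be replaced by a sort-based implementation on large cohorts.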

Table 2 shows a set of different ML approaches for the prediction of both ICU and hospital LOS for critical care patients [38,53–63]. Below we summarize some of our observations:

Table 2. Summary of studies using machine learning to predict length of stay (LOS)

Preprocessing: To deal with the asymmetric nature of the LOS distribution, preprocessing in some studies included log transformation [54] and Z-score normalization. In a regression application, a log transformation can be seen as modeling LOS via a Poisson or negative binomial regression model, among others. Studies in other areas have used methods that can deal with a skewed distribution; an example is the use of gamma mixture models that were applied to maternal hospital LOS [64].
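The effect of these preprocessing steps on a right-skewed LOS distribution can be checked numerically. The sketch below applies a natural-log transformation followed by Z-score normalization and compares sample skewness before and after; the LOS values are invented, and the use of population moments for skewness is a choice made for this illustration.

```python
import math
import statistics


def skewness(values):
    """Population skewness: third central moment over sd cubed."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([(v - mean) ** 3 for v in values]) / sd ** 3


def z_scores(values):
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical ICU LOS in days, with a long right tail.
los_days = [0.8, 1.5, 2.0, 2.2, 3.1, 4.0, 6.5, 12.0, 30.0]

log_los = [math.log(d) for d in los_days]  # requires LOS > 0
log_los_z = z_scores(log_los)

print(f"skewness raw: {skewness(los_days):.2f}")
print(f"skewness log: {skewness(log_los):.2f}")

# Predictions made on the log scale are mapped back with exp().
back = [math.exp(v) for v in log_los]
```

Errors measured on the log scale are relative rather than absolute, so metrics such as MAE should be reported after back-transforming predictions to days.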

Features: In most of the studies above, the features or predictors used focused on data readily available in the ICU, such as labs and patient demographics. However, Houthouft et al. [54] combined the raw data available in the first 5 ICU days with sequential organ failure assessment scores, as well as sub-scores created to assess the performance of different physiological systems, such as renal, cardiovascular, and respiratory systems. Recently, Peres et al. [65] surveyed risk factors that have been used in ICU LOS prediction and suggested that a list of risk factors should be considered in prediction models for ICU LOS. These factors included severity scores, mechanical ventilation, hypomagnesemia, delirium, malnutrition, infection, trauma, red blood cell count, and PaO2:FiO2 ratios.

Models: As in the mortality prediction case, it is difficult to compare model performance because the datasets used were different, with different sizes, patient groups, and geographic regions. However, turning the problem into a classification issue showed better results. Harutyunyan et al. [53] showed an AUC of 0.84 for predicting ICU LOS >7 days using channel-wise long short-term memory units (LSTMs) and multitask training, whereas Ma et al. [58] had an AUC of 0.85 for predicting LOS >10 days using just-in-time learning (JITL) and one-class extreme learning machine (note that their study included only 4000 patients). In the studies of Iwase et al. [38] and Peres et al. [62], the authors used random forest models and achieved good classification accuracy for short and long ICU stays with an AUC larger than 0.87. Houthouft et al. [54] approached the issue by initially transforming it into a classification task to identify patients with extended stays (beyond 10 days). Subsequently, they tackled the stays shorter than 10 days as a regression problem, achieving a MAE of 1.79 days for this subgroup.

3. Mechanical Ventilation Management Benchmarking

Although mechanical ventilation has been less investigated than either mortality or LOS, the last few years have seen several new approaches for predicting both the probability and the duration of mechanical ventilation (Table 3). Note that unlike mortality and LOS, where our focus was on models for generic ICU patients, few papers predicted ventilation for all groups, so we present work that focused on certain cohorts (such as acute respiratory distress syndrome and COVID-19 patients). Other important areas where ML is used are ventilation weaning and extubation outcomes. However, they are outside the scope of this review.

Table 3. Summary of studies using machine learning to predict ventilation (probability and duration)

Much like the case of predicting mortality, drawing direct comparisons between the methods outlined in Table 3 proves challenging when considering the results alone [11,12,66–72]. The variability in cohorts and target variables, such as distinguishing between ventilation duration and daily/entire-stay probabilities, contributes to this complexity. Some observations are as follows:

Features: The use of imaging (X-rays), especially for COVID-19 patients [68], provides a new source of data that can be leveraged for increased precision, especially when combined with deep learning approaches. Other studies depended on more standard features, such as patient characteristics, baseline comorbidities, vital signs, laboratory values, medication administration records, and processes of care.

Models: For ventilation duration, gradient boosting showed promising results [66,72], as did simpler methods like multivariable log regression models [67]. For ventilation probability, the choice of methods depended on the targets and features. Deep learning was used for X-ray imaging datasets as well as for clinical features, with promising results in both cases.

IV. Discussion

ML has provided a novel means of benchmarking critical care through utilizing the power of large datasets and improved algorithms for outcome prediction. However, despite the plethora of articles appearing in the last two decades, the comparison of results and performance remains challenging. Despite some attempts to offer unified datasets for comparison [21], many of the models are developed on different databases, which may be country-specific, be disease/cohort-specific, or even target different outcomes (such as mortality in the ICU, hospital, or after release). Several ICU databases have recently been shared publicly, which can facilitate the comparison of modeling approaches [73].

The studies reviewed showed a variety of inputs or predictors used. Traditionally, features were hand-crafted and included demographics, characteristics, input diagnoses, labs, and vitals. However, we have recently seen more studies that devote less effort to fine-tuning features, yet achieve good results based on learning from raw data [32,33]. New predictors have also been added, such as imaging [68], clinical notes, and premorbid functional status [34], which show improvements in outcome prediction.

In terms of the models selected, the studies show a large variety, including support vector machines, gradient boosting, hidden Markov models, and deep learning. Model selection is affected by performance, data size, the handling of missing/erroneous data, and interpretability. Multi-task learning is an interesting direction because it improves generalization by leveraging the domain-specific information contained in the training signals of related tasks. Harutyunyan et al. [53] applied a deep learning multi-task learning framework to cover a range of clinical problems, including modeling risk of mortality, forecasting LOS, detecting physiologic decline, and classifying phenotype. In terms of interpretability, methods such as Shapley values can be used to convey the importance that an ML model assigns to input features [23,24].

Second, a significant concern during the training and evaluation of benchmarking models is class imbalance, a phenomenon evident across all the clinical outcomes examined for the current study. This imbalance is particularly pronounced in cases of mortality, as a relatively small subset of critical care patients experience death. Furthermore, this issue extends to recent studies that assess the efficacy of established LOS models. Interestingly, these models do not distinguish between patients who have survived and those who have not, leading to the overrepresentation of surviving and lower-risk patients [74]. We presented some approaches that addressed class imbalance, such as skewness-based transformations [29] and balanced random forest algorithms [48]. We point the reader to several reviews on this active research area [75,76].

Third, one factor contributing to differences among studies relates to how stays are defined and consolidated. During a single hospitalization, a patient may be discharged from and readmitted to the ICU several times. To keep model development consistent, standardized criteria are needed for classifying these occurrences as either one continuous stay or several separate stays, which would reduce variation in the resulting models. A related issue is ICU type, since units such as cardiac and neurological ICUs differ in patient groups, treatments, and outcomes. Although this topic warrants deeper exploration, it has received limited attention in the published literature.

A further issue that could affect model performance is that some sub-populations, such as ethnic minorities, may be underrepresented even in large datasets. Other sources of bias that could influence performance are related to variations in documentation across sites and geographic regions, due mostly to subjective evaluation. Both the reason for admission to the ICU and the Glasgow Coma Score, for example, may incorporate subjective evaluation from clinicians [77].

Most existing benchmarking models were developed on country-specific databases. The APACHE scores, for example, were United States-trained and tested. However, clinical practice, documentation, and patient diversity differ across geographic regions, requiring model recalibration and training.

As in other fields, ML in benchmarking ICU outcomes has focused on developing models with improved performance on retrospective data. However, little work has occurred on long-term validation post-deployment, which would observe data drift, model drift, and performance over time. Existing models merit recalibration every few years due to data drift. Some reasons for data drift in critical care include changes in data due to seasonality, changes in documentation practices, the addition of new devices, missing data, and changes in clinical practices over time. The same applies to bias, model generalizability, and fairness [78]. Federated learning, a distributed technique for training ML models without exchanging data, presents an intriguing paradigm for locations where data sharing is not feasible or for refining models using local datasets for updates [79].
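The aggregation step at the heart of federated learning can be illustrated with a short sketch. It assumes the widely used federated averaging scheme, in which each site trains locally and shares only its model parameters, which a coordinator then averages with weights proportional to local sample size. The site parameters and cohort sizes below are hypothetical, and real deployments add repeated training rounds, secure communication, and privacy safeguards that are omitted here.

```python
def federated_average(site_params, site_sizes):
    """Sample-size-weighted average of per-site parameter vectors.

    site_params: list of parameter lists, one per hospital.
    site_sizes:  number of local training patients per hospital.
    Only parameters are exchanged; patient records stay on site.
    """
    if len(site_params) != len(site_sizes):
        raise ValueError("one size is required per site")
    total = sum(site_sizes)
    n_params = len(site_params[0])
    return [
        sum(params[j] * n for params, n in zip(site_params, site_sizes)) / total
        for j in range(n_params)
    ]


# Two hypothetical hospitals with locally fitted coefficients.
hospital_a = [0.2, 1.0]   # trained on 100 patients
hospital_b = [0.4, 2.0]   # trained on 300 patients
global_params = federated_average([hospital_a, hospital_b], [100, 300])
print(global_params)
```

The averaged parameters would then be sent back to each site as the starting point for the next round of local training or for local recalibration.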

Although the models highlighted in the current review attempt to adjust for measured risk factors, unobserved patient attributes mean that risk adjustment is never perfect [80]. Areas such as medication adherence, social support, or mobility before admission can be considered as unmeasured factors. Even when models are accurately calibrated to the collected data, the influence of these factors continues to impact the results. Finding ways to incorporate some of these factors, possibly through clinical notes and patient interactions, remains crucial. This could emerge as a thriving research domain for large language models or generative artificial intelligence (AI) methods to offer a potential solution that would bridge this gap.

In conclusion, ML has provided novel tools for benchmarking critical care outcomes, leading to improved results as well as addressing important drawbacks of previous methods, such as reducing biases due to documentation, missing data, and class imbalance, as well as modeling non-linear relationships between variables and outcomes. Prospects exist for using ML to encompass a broader array of data types, including imaging, medical notes, and diagnoses. The utilization of multi-national datasets via techniques like federated learning could also prove advantageous in developing models that find broader relevance across diverse patient groups and geographic regions where data sharing is not possible. Generative AI and large language models present a fresh approach for scrutinizing extensive datasets, including medical notes, thereby enhancing the efficacy of future ML models within this domain. In clinical contexts, we suggest that healthcare practitioners opt for well-validated models tailored to their specific geographic and patient-demographic considerations.


The authors would like to thank Dr. Omar Badawi and Robin French for their ideas on bottlenecks in benchmarking critical care outcomes based on their extensive experience.


Conflict of Interest

This work was funded by Philips Healthcare and all authors are fully employed by Philips. The authors have no competing interests.


1. Rhodes A, Moreno RP, Azoulay E, Capuzzo M, Chiche JD, Eddleston J, et al. Prospectively defined indicators to improve the safety and quality of care for critically ill patients: a report from the Task Force on Safety and Quality of the European Society of Intensive Care Medicine (ESICM). Intensive Care Med 2012;38(4):598–605.
2. Vincent JL, Marshall JC, Namendys-Silva SA, Francois B, Martin-Loeches I, Lipman J, et al. Assessment of the worldwide burden of critical illness: the intensive care over nations (ICON) audit. Lancet Respir Med 2014;2(5):380–6.
3. Higgins TL. Quantifying risk and benchmarking performance in the adult intensive care unit. J Intensive Care Med 2007;22(3):141–56.
4. Braun JP, Kumpf O, Deja M, Brinkmann A, Marx G, Bloos F, et al. The German quality indicators in intensive care medicine 2013: second edition. Ger Med Sci 2013;11:Doc09.
5. Kumpf O, Braun JP, Brinkmann A, Bause H, Bellgardt M, Bloos F, et al. Quality indicators in intensive care medicine for Germany: third edition 2017. Ger Med Sci 2017;15:Doc10.
6. Brown SE, Ratcliffe SJ, Halpern SD. An empirical comparison of key statistical attributes among potential ICU quality indicators. Crit Care Med 2014;42(8):1821–31.
7. Salluh JIF, Soares M, Keegan MT. Understanding intensive care unit benchmarking. Intensive Care Med 2017;43(11):1703–7.
8. Higgins TL, Stark MM, Henson KN, Freeseman-Freeman L. Coronavirus disease 2019 ICU patients have higher-than-expected Acute Physiology and Chronic Health Evaluation-adjusted mortality and length of stay than viral pneumonia ICU patients. Crit Care Med 2021;49(7):e701–6.
9. Verburg IW, Atashi A, Eslami S, Holman R, Abu-Hanna A, de Jonge E, et al. Which models can I use to predict adult ICU length of stay? A systematic review. Crit Care Med 2017;45(2):e222–31.
10. van Sluisveld N, Bakhshi-Raiez F, de Keizer N, Holman R, Wester G, Wollersheim H, et al. Variation in rates of ICU readmissions and post-ICU in-hospital mortality and their association with ICU discharge practices. BMC Health Serv Res 2017;17(1):281.
11. Seneff MG, Zimmerman JE, Knaus WA, Wagner DP, Draper EA. Predicting the duration of mechanical ventilation. The importance of disease and patient characteristics. Chest 1996;110(2):469–79.
12. Shashikumar SP, Wardi G, Paul P, Carlile M, Brenner LN, Hibbert KA, et al. Development and prospective validation of a deep learning algorithm for predicting need for mechanical ventilation. Chest 2021;159(6):2264–73.
13. Malmgren J, Waldenstrom AC, Rylander C, Johannesson E, Lundin S. Long-term health-related quality of life and burden of disease after intensive care: development of a patient-reported outcome measure. Crit Care 2021;25(1):82.
14. Higgins TL, Teres D, Nathanson B. Outcome prediction in critical care: the Mortality Probability Models. Curr Opin Crit Care 2008;14(5):498–505.
15. Ferrando-Vivas P, Jones A, Rowan KM, Harrison DA. Development and validation of the new ICNARC model for prediction of acute hospital mortality in adult critical care. J Crit Care 2017;38:335–9.
16. Keuning BE, Kaufmann T, Wiersema R, Granholm A, Pettila V, Moller MH, et al. Mortality prediction models in the adult critically ill: a scoping review. Acta Anaesthesiol Scand 2020;64(4):424–42.
17. Strand K, Flaatten H. Severity scoring in the ICU: a review. Acta Anaesthesiol Scand 2008;52(4):467–78.
18. Siontis GC, Tzoulaki I, Ioannidis JP. Predicting death: an empirical evaluation of predictive tools for mortality. Arch Intern Med 2011;171(19):1721–6.
19. Raj R, Luostarinen T, Pursiainen E, Posti JP, Takala RS, Bendel S, et al. Machine learning-based dynamic mortality prediction after traumatic brain injury. Sci Rep 2019;9(1):17672.
20. Subudhi S, Verma A, Patel AB, Hardin CC, Khandekar MJ, Lee H, et al. Comparing machine learning algorithms for predicting ICU admission and mortality in COVID-19. NPJ Digit Med 2021;4(1):87.
21. Silva I, Moody G, Scott DJ, Celi LA, Mark RG. Predicting in-hospital mortality of ICU patients: the PhysioNet/computing in cardiology challenge 2012. Comput Cardiol (2010) 2012;39:245–8.
22. Barboi C, Tzavelis A, Muhammad LN. Comparison of severity of illness scores and artificial intelligence models that are predictive of intensive care unit mortality: meta-analysis and review of the literature. JMIR Med Inform 2022;10(5):e35293.
23. Thorsen-Meyer HC, Nielsen AB, Nielsen AP, Kaas-Hansen BS, Toft P, Schierbeck J, et al. Dynamic and explainable machine learning prediction of mortality in patients in the intensive care unit: a retrospective study of high-frequency data in electronic patient records. Lancet Digit Health 2020;2(4):e179–91.
24. Caicedo-Torres W, Gutierrez J. ISeeU: visually interpretable deep learning for mortality prediction inside the ICU. J Biomed Inform 2019;98:103269.
25. Aczon MD, Ledbetter DR, Laksana E, Ho LV, Wetzel RC. Continuous prediction of mortality in the PICU: a recurrent neural network model in a single-center dataset. Pediatr Crit Care Med 2021;22(6):519–29.
26. Grnarova P, Schmidt F, Hyland SL, Eickhoff C. Neural document embeddings for intensive care patient mortality prediction [Internet] Ithaca (NY):; 2016. [cited at 2023 Sep 30]. Available from:
27. Ghassemi M, Pimentel MA, Naumann T, Brennan T, Clifton DA, Szolovits P, et al. A multivariate timeseries modeling approach to severity of illness assessment and forecasting in ICU with sparse, heterogeneous clinical data. Proc AAAI Conf Artif Intell 2015;2015:446–53.
28. Badawi O, Breslow MJ. Readmissions and death after ICU discharge: development and validation of two predictive models. PLoS One 2012;7(11):e48758.
29. Bhattacharya S, Rajan V, Shrivastava H. ICU mortality prediction: a classification algorithm for imbalanced datasets. Proc AAAI Conf Artif Intell 2017;31(1):1288–94.
30. Guo C, Liu M, Lu M. A dynamic ensemble learning algorithm based on K-means for ICU mortality prediction. Appl Soft Comput 2021;103:107166.
31. Kim S, Kim W, Park RW. A comparison of intensive care unit mortality prediction models through the use of data mining techniques. Healthc Inform Res 2011;17(4):232–43.
32. Awad A, Bader-El-Den M, McNicholas J, Briggs J. Early hospital mortality prediction of intensive care unit patients using an ensemble learning approach. Int J Med Inform 2017;108:185–95.
33. Pirracchio R, Petersen ML, Carone M, Rigon MR, Chevret S, van der Laan MJ. Mortality prediction in intensive care units with the Super ICU Learner Algorithm (SICULA): a population-based study. Lancet Respir Med 2015;3(1):42–52.
34. Moser A, Reinikainen M, Jakob SM, Selander T, Pettila V, Kiiski O, et al. Mortality prediction in intensive care units including premorbid functional status improved performance and internal validity. J Clin Epidemiol 2022;142:230–41.
35. El-Rashidy N, El-Sappagh S, Abuhmed T, Abdelrazek S, El-Bakry HM. Intensive care unit mortality prediction: an improved patient-specific stacking ensemble model. IEEE Access 2020;8:133541–64.
36. Badawi O, Liu X, Hassan E, Amelung PJ, Swami S. Evaluation of ICU risk models adapted for use as continuous markers of severity of illness throughout the ICU stay. Crit Care Med 2018;46(3):361–7.
37. Chiu CC, Wu CM, Chien TN, Kao LJ, Qiu JT. Predicting the mortality of ICU patients by topic model with machine-learning techniques. Healthcare (Basel) 2022;10(6):1087.
38. Iwase S, Nakada TA, Shimada T, Oami T, Shimazui T, Takahashi N, et al. Prediction algorithm for ICU mortality and length of stay using machine learning. Sci Rep 2022;12(1):12912.
39. Pang K, Li L, Ouyang W, Liu X, Tang Y. Establishment of ICU mortality risk prediction models with machine learning algorithm using MIMIC-IV database. Diagnostics (Basel) 2022;12(5):1068.
40. Safaei N, Safaei B, Seyedekrami S, Talafidaryani M, Masoud A, Wang S, et al. E-CatBoost: an efficient machine learning framework for predicting ICU mortality using the eICU Collaborative Research Database. PLoS One 2022;17(5):e0262895.
41. Stenwig E, Salvi G, Rossi PS, Skjaervold NK. Comparative analysis of explainable machine learning prediction models for hospital mortality. BMC Med Res Methodol 2022;22(1):53.
42. Zhao S, Tang G, Liu P, Wang Q, Li G, Ding Z. Improving mortality risk prediction with routine clinical data: a practical machine learning model based on eICU patients. Int J Gen Med 2023;16:3151–61.
43. Meiring C, Dixit A, Harris S, MacCallum NS, Brealey DA, Watkinson PJ, et al. Optimal intensive care outcome prediction over time using machine learning. PLoS One 2018;13(11):e0206862.
44. Davoodi R, Moradi MH. Mortality prediction in intensive care units (ICUs) using a deep rule-based fuzzy classifier. J Biomed Inform 2018;79:48–59.
45. Marafino BJ, Park M, Davies JM, Thombley R, Luft HS, Sing DC, et al. Validation of prediction models for critical care outcomes using natural language processing of electronic health record data. JAMA Netw Open 2018;1(8):e185097.
46. Purushotham S, Meng C, Che Z, Liu Y. Benchmarking deep learning models on large healthcare datasets. J Biomed Inform 2018;83:112–34.
47. Shapley LS. A value for n-person games. In: Kuhn HW, Tucker AW, eds. Contributions to the theory of games II. Princeton (NJ): Princeton University Press; 1953. p. 307–18.
48. Li L, Liu G. In-hospital mortality prediction for ICU patients on large healthcare MIMIC datasets using class imbalance learning. In: Proceedings of 2020, 5th IEEE International Conference on Big Data Analytics (ICBDA); 2020 May 8–11; Xiamen, China. p. 90–3.
49. Zimmerman JE, Kramer AA, McNair DS, Malila FM, Shaffer VL. Intensive care unit length of stay: benchmarking based on Acute Physiology and Chronic Health Evaluation (APACHE) IV. Crit Care Med 2006;34(10):2517–29.
50. Vasilevskis EE, Kuzniewicz MW, Cason BA, Lane RK, Dean ML, Clay T, et al. Mortality probability model III and simplified acute physiology score II: assessing their value in predicting length of stay and comparison to APACHE IV. Chest 2009;136(1):89–101.
51. Peres IT, Hamacher S, Oliveira FLC, Bozza FA, Salluh JIF. Prediction of intensive care units length of stay: a concise review. Rev Bras Ter Intensiva 2021;33(2):183–7.
52. Bacchi S, Tan Y, Oakden-Rayner L, Jannes J, Kleinig T, Koblar S. Machine learning in the prediction of medical inpatient length of stay. Intern Med J 2022;52(2):176–85.
53. Harutyunyan H, Khachatrian H, Kale DC, Ver Steeg G, Galstyan A. Multitask learning and benchmarking with clinical time series data. Sci Data 2019;6(1):96.
54. Houthooft R, Ruyssinck J, van der Herten J, Stijven S, Couckuyt I, Gadeyne B, et al. Predictive modelling of survival and length of stay in critically ill patients using sequential organ failure scores. Artif Intell Med 2015;63(3):191–207.
55. Li C, Chen L, Feng J, Wu D, Zimeng W, Liu J, et al. Prediction of length of stay on the intensive care unit based on least absolute shrinkage and selection operator. IEEE Access 2019;7:110710–21.
56. Sotoodeh M, Ho JC. Improving length of stay prediction using a hidden Markov model. AMIA Jt Summits Transl Sci Proc 2019;2019:425–34.
57. Gentimis T, Ala’J A, Durante A, Cook K, Steele R. Predicting hospital length of stay using neural networks on MIMIC III data. In: Proceedings of 2017 IEEE 15th International Conference on Dependable, Autonomic and Secure Computing, 15th International Conference on Pervasive Intelligence and Computing, 3rd International Conference on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech); 2017 Nov 6–10; Orlando, FL. p. 1194–201.
58. Ma X, Si Y, Wang Z, Wang Y. Length of stay prediction for ICU patients using individualized single classification algorithm. Comput Methods Programs Biomed 2020;186:105224.
59. Muhlestein WE, Akagi DS, Davies JM, Chambless LB. Predicting inpatient length of stay after brain tumor surgery: developing machine learning ensembles to improve predictive performance. Neurosurgery 2019;85(3):384–93.
60. Wu J, Lin Y, Li P, Hu Y, Zhang L, Kong G. Predicting prolonged length of ICU stay through machine learning. Diagnostics (Basel) 2021;11(12):2242.
61. Alghatani K, Ammar N, Rezgui A, Shaban-Nejad A. Predicting intensive care unit length of stay and mortality using patient vital signs: machine learning model development and validation. JMIR Med Inform 2021;9(5):e21347.
62. Peres IT, Hamacher S, Cyrino Oliveira FL, Bozza FA, Salluh JI. Data-driven methodology to predict the ICU length of stay: a multicentre study of 99,492 admissions in 109 Brazilian units. Anaesth Crit Care Pain Med 2022;41(6):101142.
63. Weissman GE, Hubbard RA, Ungar LH, Harhay MO, Greene CS, Himes BE, et al. Inclusion of unstructured clinical text improves early prediction of death or prolonged ICU stay. Crit Care Med 2018;46(7):1125–32.
64. Williford E, Haley V, McNutt LA, Lazariu V. Dealing with highly skewed hospital length of stay distributions: the use of Gamma mixture models to study delivery hospitalizations. PLoS One 2020;15(4):e0231825.
65. Peres IT, Hamacher S, Oliveira FLC, Thome AMT, Bozza FA. What factors predict length of stay in the intensive care unit? Systematic review and meta-analysis. J Crit Care 2020;60:183–94.
66. Sayed M, Riano D, Villar J. Predicting duration of mechanical ventilation in acute respiratory distress syndrome using supervised machine learning. J Clin Med 2021;10(17):3824.
67. Kramer AA, Gershengorn HB, Wunsch H, Zimmerman JE. Variations in case-mix-adjusted duration of mechanical ventilation among ICUs. Crit Care Med 2016;44(6):1042–8.
68. Kulkarni AR, Athavale AM, Sahni A, Sukhal S, Saini A, Itteera M, et al. Deep learning model to predict the need for mechanical ventilation using chest X-ray images in hospitalised patients with COVID-19. BMJ Innov 2021;7(2):261–70.
69. Yu L, Halalau A, Dalal B, Abbas AE, Ivascu F, Amin M, et al. Machine learning methods to predict mechanical ventilation and mortality in patients with COVID-19. PLoS One 2021;16(4):e0249285.
70. Douville NJ, Douville CB, Mentz G, Mathis MR, Pancaro C, Tremper KK, et al. Clinically applicable approach for predicting mechanical ventilation in patients with COVID-19. Br J Anaesth 2021;126(3):578–89.
71. Karri R, Chen YP, Burrell AJC, Penny-Dimri JC, Broadley T, Trapani T, et al. Machine learning predicts the short-term requirement for invasive ventilation among Australian critically ill COVID-19 patients. PLoS One 2022;17(10):e0276509.
72. Parreco J, Hidalgo A, Parks JJ, Kozol R, Rattan R. Using artificial intelligence to predict prolonged mechanical ventilation and tracheostomy placement. J Surg Res 2018;228:179–87.
73. Sauer CM, Dam TA, Celi LA, Faltys M, de la Hoz MAA, Adhikari L, et al. Systematic review and comparison of publicly available ICU data sets: a decision guide for clinicians and data scientists. Crit Care Med 2022;50(6):e581–8.
74. Liu X, Badawi O. 369: ICU length-of-stay models should account for the interaction between survival and patient severity. Crit Care Med 2020;48(1):166.
75. Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing high-class imbalance in big data. J Big Data 2018;5(1):42.
76. Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. J Big Data 2019;6(1):27.
77. Zimmerman JE, Kramer AA, McNair DS, Malila FM. Acute Physiology and Chronic Health Evaluation (APACHE) IV: hospital mortality assessment for today’s critically ill patients. Crit Care Med 2006;34(5):1297–310.
78. Roosli E, Bozkurt S, Hernandez-Boussard T. Peeking into a black box, the fairness and generalizability of a MIMIC-III benchmarking model. Sci Data 2022;9(1):24.
79. Xu J, Glicksberg BS, Su C, Walker P, Bian J, Wang F. Federated learning for healthcare informatics. J Healthc Inform Res 2021;5(1):1–19.
80. Lane-Fall MB, Neuman MD. Outcomes measures and risk adjustment. Int Anesthesiol Clin 2013;51(4):10–21.

Table 1

Summary of studies that use machine learning to predict mortality

Study, year | Outcome | Number of patients/stays | Method | Main results
Ferrando-Vivas et al. [15], 2017 | Acute hospital mortality, including deaths that occurred after transfer of the patient from the original hospital to another acute hospital | Training: 155,239 admissions; Validation: 90,017 admissions | Multivariate logistic regression | AUC = 0.8853
Thorsen-Meyer et al. [23], 2020 | 90-day mortality | 14,190 admissions of 11,492 patients | Recurrent neural network trained on a temporal resolution of 1 hour | AUC = 0.73 at admission; AUC = 0.82 after 24 hours; AUC = 0.85 after 72 hours; AUC = 0.88 at the time of discharge
Caicedo-Torres et al. [24], 2019 | ICU mortality | 22,413 patients | Multi-scale deep convolutional neural network | AUC = 0.8735
Aczon et al. [25], 2021 | Pediatric mortality risk 12 hours after admission and prior to discharge | 9,070 children | Recurrent neural network | AUC = 0.94
Grnarova et al. [26], 2016 | In-hospital, 30-day, and 1-year mortality | 46,520 patients | Convolutional document embedding approach based on textual content of clinical notes | In-hospital (AUC = 0.963); 30-day (AUC = 0.858); 1-year mortality (AUC = 0.853)
Purushotham et al. [46], 2018 | In-hospital, short term (2–3 days), 30-day, and 1-year post-discharge mortality | 58,576 admissions | Benchmarked the performance of deep learning models with respect to machine learning models and prognostic scoring systems | For deep learning models: AUC = 0.92 (in-hospital mortality); AUC = 0.8872 (1-year post-discharge)
Badawi et al. [28], 2012 | Mortality within 48 hours of release from the ICU | 469,976 patients (development); 234,987 patients (validation) | Multivariable logistic regression | AUC = 0.92
Bhattacharya et al. [29], 2017 | In-hospital mortality | 4,000 patients from the PhysioNet 2012 challenge [21] | A binary classifier consisting of skewness-based transformation of input features and statistical hypothesis tests to obtain the final classification (aiming to address class imbalance) | AUC = 0.867 (0.031)
Ghassemi et al. [27], 2015 | In-hospital mortality on discharge and 1-year post-discharge | 10,202 patients | Multi-task Gaussian process (MTGP): transforming ICU patient clinical notes into time series and using MTGP hyperparameters from these time series as features to predict mortality probability | AUC = 0.812 (hospital mortality); AUC = 0.686 (1-year post-discharge)
Guo et al. [30], 2021 | 72-hour mortality, in-hospital mortality, 30-day mortality, and 1-year mortality | 42,145 patients | Dynamic ensemble learning algorithm based on K-means | AUC = 0.87 (72-hour mortality); AUC = 0.842 (in-hospital mortality); AUC = 0.861 (30-day mortality); AUC = 0.829 (1-year mortality)
Kim et al. [31], 2011 | Mortality at ICU discharge | 38,474 admissions | Decision tree (DT) algorithm, artificial neural network (ANN), and support vector machine (SVM) | DT (AUC = 0.892); ANN (AUC = 0.874); SVM (AUC = 0.876)
Awad et al. [32], 2017 | In-hospital mortality using only the first 6 hours in the ICU | 11,722 patients | Ensemble learning random forest model | AUC = 0.82
Pirracchio et al. [33], 2015 | In-hospital mortality | 24,508 patients | Bayesian additive regression tree (BART) | AUC = 0.88
Moser et al. [34], 2021 | In-hospital mortality | 61,224 admissions | Hierarchical logistic regression model | AUC = 0.886
El-Rashidy et al. [35], 2020 | In-hospital mortality at 24 hours | 10,664 patients | Stacking ensemble classifier | AUC = 0.933
Badawi et al. [36], 2018 | Mortality within 24 hours in the ICU | 563,470 patients | Multivariable logistic regression | AUC = 0.84
Chiu et al. [37], 2022 | In-hospital, within 48 or 72 hours, 30 days, 1 year | 46,520 patients | Latent Dirichlet allocation (LDA) model to classify text in the semi-structured data of some particular topics, followed by classification (gradient boosting) | AUC = 0.93 for 48-hour mortality; AUC = 0.87 for 30-day mortality
Iwase et al. [38], 2022 | ICU mortality | 12,747 patients | Random forest classifier | AUC = 0.945
Pang et al. [39], 2022 | ICU mortality | 67,748 patients | Boosting (XGBoost) | AUC = 0.918
Safaei et al. [40], 2022 | Mortality on discharge (analyzed per disease group) | 200,000 patients | Boosting | AUC = 0.86–0.92
Stenwig et al. [41], 2022 | Hospital mortality | 129,794 patients | Random forest (among other comparison methods) | AUC = 0.87
Zhao et al. [42], 2023 | Mortality within 1 week | 12,393 patients | Boosting (XGBoost) | AUC = 0.97
Meiring et al. [43], 2018 | ICU mortality | 22,514 patients | Deep learning | AUC = 0.883 (after 1st day); AUC = 0.895 (after 2nd day)
Davoodi et al. [44], 2018 | ICU mortality | 10,972 patients | Deep rule-based fuzzy classifier | AUC = 0.739 (using first 48 hours)
Marafino et al. [45], 2018 | In-hospital mortality | 101,196 patients | Augmenting labs and vitals with clinical trajectory and NLP-derived terms | AUC = 0.922

The area under the curve (AUC) metric is widely used to assess the performance of mortality prediction models.

ICU: intensive care unit, NLP: natural language processing.
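
As a minimal sketch of how the AUC values in Table 1 can be read, the example below assumes that AUC refers to the area under the receiver operating characteristic curve, which equals the probability that a randomly chosen patient who died receives a higher predicted risk than a randomly chosen survivor (ties counted as one half). The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from any of the cited studies.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the event (e.g., death), 0 otherwise.
    scores: predicted risk for each patient, higher meaning higher risk.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = died, 0 = survived
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.20, 0.80, 0.65, 0.50, 0.30]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

The pairwise loop is quadratic in the number of patients, so it is only meant to illustrate the definition; for cohorts of the size listed in Table 1, a rank-based library routine would normally be used instead.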

Table 2

Summary of studies using machine learning to predict length of stay (LOS)

Study, year | Outcome | Number of patients/stays | Method | Main results
Houthooft et al. [54], 2015 | Long LOS* (over 10 days) plus ICU LOS prediction** | 14,480 patients | SVM: This work uses data from the first 5 days of ICU stay | For predicting patient mortality and a prolonged stay (>10 days), the best performing model is an SVM with an AUC = 0.82. In terms of LOS regression, the best performing model is support vector regression, with MAE of 1.79 days for patients surviving a non-prolonged stay.
Li et al. [55], 2019 | ICU LOS** | 1,214 unplanned ICU admissions | Least absolute shrinkage and selection operator (LASSO) algorithm | 0.88 day for RMSE, 0.87 day for MAE, and 0.35 ± 0.09 for R-squared
Sotoodeh et al. [56], 2019 | ICU LOS** | 4,000 ICU patients | Hidden Markov models | RMSE = 9.48 days
Harutyunyan et al. [53], 2019 | ICU LOS** | 42,276 ICU stays of 33,798 unique patients | Recurrent neural network framework (channel-wise LSTMs and multitask training) | AUC = 0.84 for predicting extended LOS (>7 days) at 24 hours after admission
Gentimis et al. [57], 2017 | ICU LOS*: long (>5 days) or short (≤5 days) | 31,018 patients | Neural networks | 80% prediction accuracy
Ma et al. [58], 2020 | Hospital LOS* (more or less than 10 days) | 4,000 patients | Just-in-time learning (JITL) and one-class extreme learning machine | AUC = 0.85 (accuracy, specificity, and sensitivity were 0.82, 1, and 0.62, respectively)
Muhlestein et al. [59], 2019 | Hospital LOS** following brain surgery | 41,222 patients | Ensemble model: Top-performing algorithms were the gradient-boosted tree (GBT) and SVR; these models were combined with an elastic net to create an ensemble model | The ensemble model predicted LOS with RMSE of 0.56 days on internal validation and 0.63 days on external validation
Wu et al. [60], 2021 | ICU LOS* | 139,367 patients (eICU dataset); external validation (MIMIC): 38,597 adult patients | Comparison of methods; best results obtained by a gradient boosting decision tree | AUC = 0.742
Iwase et al. [38], 2022 | ICU LOS* | 12,747 patients | Random forest | Predictive value for long ICU stays (AUC = 0.881), short ICU stays (AUC = 0.889)
Alghatani et al. [61], 2021 | ICU LOS* | 53,423 patients | Random forest | AUC = 0.65 (binary classification as less than 2.64 days or more)
Peres et al. [62], 2022 | ICU LOS* | 99,492 admissions | Stacking model combining random forests and linear regression | AUC = 0.87 for short and long stays
Weissman et al. [63], 2018 | ICU LOS* | 25,947 admissions | Gradient boosting including unstructured clinical text data | AUC = 0.89 (for stays >7 days)

ICU: intensive care unit, SVM: support vector machine, MAE: mean absolute error, RMSE: root mean square error, LSTM: long short-term memory, MIMIC: Medical Information Mart for Intensive Care, AUC: area under the curve.

“*” indicates classification and “**” regression.
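
The regression rows in Table 2 report MAE and RMSE in days. The short sketch below, using hypothetical observed and predicted LOS values rather than data from any cited study, illustrates how the two metrics are computed and why RMSE is never smaller than MAE: squaring the errors gives more weight to a single badly predicted long stay. The function names `mae` and `rmse` are illustrative.

```python
import math


def mae(observed, predicted):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error: square root of the mean squared error."""
    mean_sq = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return math.sqrt(mean_sq)


# Hypothetical ICU LOS values in days; the last stay is badly underpredicted
observed_los = [2.0, 3.5, 1.0, 10.0]
predicted_los = [2.5, 3.0, 2.0, 6.0]

los_mae = mae(observed_los, predicted_los)
los_rmse = rmse(observed_los, predicted_los)
print(f"MAE = {los_mae:.2f} days, RMSE = {los_rmse:.2f} days")
```

Because LOS distributions are typically right-skewed, a gap between RMSE and MAE on the same cohort often reflects a few long stays rather than uniformly poor predictions.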

Table 3

Summary of studies using machine learning to predict ventilation (probability and duration)

Study, year | Outcome | Number of patients/stays | Method | Main results
Sayed et al. [66], 2021 | MV duration after ARDS onset | Two cohorts from different databases: Set 1: 2,466 (MIMIC-III), Set 2: 5,153 (eICU database) | Light-gradient boosting machine | RMSE: Set 1: 6.10 days, Set 2: 5.87 days
Seneff et al. [11], 1996 | MV duration | 42 ICUs, 40 hospitals, 17,400 ICU admissions, 6,000 patients with MV | Multivariate regression analysis | RMSE: 8.01 days
Kramer et al. [67], 2016 | MV duration | 56,336 patients | Multivariable logistic regression model | For individual patients, the difference between observed and predicted mean duration of MV: 3.3 hours (95% CI, 2.8–3.9) with R-squared equal to 21.6%
Kulkarni et al. [68], 2021 | Probability of MV for COVID-19 patients | 528 patients (X-ray images) | Deep learning | 90% accuracy
Yu et al. [69], 2021 | Probability of MV for COVID-19 patients based on ER data | 1,980 patients | Boosting (XGBoost) | 85% accuracy
Shashikumar et al. [12], 2021 | Probability of MV (including COVID-19 patients) | 30,000 ICU patients | Deep learning | AUC = 0.895 vs. 0.882, development and validation sites
Douville et al. [70], 2021 | Probability of MV for COVID-19 patients | 398 patients | Random forest model | AUC = 0.858
Karri et al. [71], 2022 | Probability of MV for COVID-19 patients | 300 admissions | Random forest model/Gradient boosting | AUC = 0.69 (Random forest); AUC = 0.68 (Gradient boosting)
Parreco et al. [72], 2018 | Predicting prolonged mechanical ventilation (over 7 days) for ICU patients | 20,262 ICU stays | Gradient boosting algorithms | AUC = 0.852

MV: mechanical ventilation, ARDS: acute respiratory distress syndrome, ICU: intensive care unit, MIMIC: Medical Information Mart for Intensive Care, RMSE: root mean square error, COVID-19: coronavirus disease 2019, ER: emergency room, AUC: area under the curve, CI: confidence interval.