Machine Learning for Benchmarking Critical Care Outcomes
Abstract
Objectives
Enhancing critical care efficacy involves evaluating and improving system functioning. Benchmarking, a retrospective comparison of results against standards, aids risk-adjusted assessment and helps healthcare providers identify areas for improvement based on observed and predicted outcomes. The last two decades have seen the development of several models using machine learning (ML) for clinical outcome prediction. ML is a field of artificial intelligence focused on creating algorithms that enable computers to learn from and make predictions or decisions based on data. This narrative review centers on key discoveries and outcomes to aid clinicians and researchers in selecting the optimal methodology for critical care benchmarking using ML.
Methods
We used PubMed to search the literature from 2003 to 2023 regarding predictive models utilizing ML for mortality (592 articles), length of stay (143 articles), or mechanical ventilation (195 articles). We supplemented the PubMed search with Google Scholar to ensure that relevant articles were included. Given the narrative style of this review, the final set of papers was manually curated to give readers a comprehensive perspective.
Results
Our report presents comparative results for benchmarked outcomes and emphasizes advancements in feature types, preprocessing, model selection, and validation. It showcases instances where ML effectively tackled critical care outcome-prediction challenges, including nonlinear relationships, class imbalances, missing data, and documentation variability, leading to enhanced results.
Conclusions
Although ML has provided novel tools to improve the benchmarking of critical care outcomes, areas that require further research include class imbalance, fairness, improved calibration, generalizability, and long-term validation of published models.
I. Introduction
Performance comparison is an important aspect of benchmarking in critical care, whether to observe a critical care unit over time or to compare units, hospitals, or even health systems across geographic regions [1,2]. Benchmarking outcomes in critical care, such as mortality or length of stay, allows a risk-adjusted comparison with healthcare leaders as a proxy for quality and efficacy of care. Risk adjustment models have been the cornerstone for benchmarking outcomes in critical care. These models allow the prediction of outcomes to enable the benchmarking or comparison of actual versus predicted outcomes among peers. Outcomes are difficult to interpret unless they are risk-stratified for diagnosis groups, severity of illness, and other patient characteristics [3].
Several taskforces worldwide have recommended the use of quality indicators that are measurable, comparable, and relevant across critical care units [1,4–6]. Regarding outcomes, several measures have been proposed [3,7]. An example is mortality, which is utilized as a quality indicator in intensive care units (ICUs) due to its direct reflection of patient outcomes; it serves to measure the effectiveness of medical interventions and the overall quality of care provided. Mortality is usually assessed using the standardized mortality ratio, which compares actual hospital mortality to predicted mortality through risk-adjusted scoring systems. Morbidity and complications, such as acute renal failure, hemodialysis, and prolonged mechanical ventilation, are more prevalent than mortality events and are also used as outcome measures [8]. Length of stay, encompassing both hospital and ICU durations, is commonly employed as an indicator of cost and efficiency; however, it is influenced by variables like structural factors and patient transfers [9]. Variation in ICU readmissions can also highlight opportunities for enhancement and is potentially influenced by ICU discharge practices [10]. Ventilation outcomes, including mechanical ventilation duration [11] and probability [12], facilitate the comparison of ventilator practices across ICUs. Ventilation outcomes are also valuable for controlling patient disparities in clinical trials or weaning techniques and for advancing quality improvement endeavors. Patient-reported outcomes, used to a lesser degree, cover a range of aspects, such as cognition, fatigue, pain, psychological well-being, activities of daily living, sleep, appetite, and alcohol consumption [13].
Machine learning (ML) constitutes a field within computer science where statistical techniques are employed to analyze data, facilitating classification, prediction, and optimization by leveraging past data observations. It can help address issues such as imbalanced classes (deceased versus surviving patients, for example), missing data, and variation in documentation. This narrative review is meant for clinicians and scientists who would like to understand the most important directions in developing these models for benchmarking clinical outcomes. We also highlight the most important sources of bias and variation in performance, aiming to give researchers a concrete list of factors to consider when planning benchmarking studies.
II. Methods
This article reviews ML approaches for benchmarking clinical outcomes in the ICU with a focus on mortality, length of stay, and mechanical ventilation. The literature search was conducted on PubMed, including all articles and reviews published between January 1, 2003 and August 1, 2023. The search terms for mortality were “mortality” AND “ICU” AND (“machine learning” OR “artificial intelligence”); for length of stay, “length of stay” AND “ICU” AND (“machine learning” OR “artificial intelligence”); and for ventilation, “ventilation” AND “ICU” AND (“machine learning” OR “artificial intelligence”). Only articles related to adult critical care in English were included. The same searches were also conducted in Google Scholar to ensure that relevant works were not excluded and to add any missing articles. The initial search yielded 592 articles on mortality, 143 on length of stay, and 195 on ventilation. After a meticulous review, 26, 12, and 9 pertinent papers were chosen for the respective domains. For mortality and length of stay, we eliminated articles focusing on specific patient groups and concentrated on approaches applicable to all critical care patients. An added condition for mortality was the use of a dataset of more than 10,000 patients to enable a fair comparison of results between different studies. In this narrative review, we focus on the important directions for ML in each outcome area rather than providing an exhaustive listing of prior work.
III. Outcome Benchmarking with ML
1. Mortality Benchmarking with ML
Mortality prediction models are applied to critical care patients for benchmarking and stratification into different risk categories. The most widely used models are the Acute Physiology and Chronic Health Evaluation (APACHE) models, the Simplified Acute Physiology Score (SAPS) I–III, and the Mortality Prediction Model (MPM) [14]. However, other models have been developed for improved calibration in particular regions, such as the Intensive Care National Audit & Research Centre (ICNARC) in the UK [15].
Several reviews have covered mortality models: Keuning et al. [16] surveyed predictive mortality models and focused mostly on statistical linear models. An earlier review by Strand et al. [17] covered articles focusing on prognostic scores, single-organ failure scores, trauma scores, and organ dysfunction scores. Siontis et al. [18] evaluated predictive mortality models with a focus on specific patient groups. Promising approaches for particular groups such as brain injury [19] and coronavirus disease 2019 (COVID-19) patients [20] have also been explored. The 2012 PhysioNet/Computing in Cardiology Challenge focused on the prediction of in-hospital mortality of ICU patients, leading to several new prediction models [21]. In a more recent review by Barboi et al. [22], the authors highlighted that ML-based models can accurately predict ICU mortality as an alternative to traditional scoring models. However, they concluded that the results cannot be generalized due to the high degree of heterogeneity and that clinicians should only select models with sufficient validation for use in a practice environment.
Table 1 summarizes several relevant articles on ML for predicting mortality [15,23–46]. Note that mortality prediction takes several forms: mortality at the ICU or hospital level, mortality within 48 or 72 hours after discharge, and 28-day or 90-day mortality, among others. The prediction windows used also vary. For example, mortality may be predicted on admission to the ICU, during the first 6 hours [32], 24 hours after arrival (similar to APACHE), during the last ICU day [28], or even continuously [36].
Although it is challenging to compare the approaches in Table 1, since many were developed on different datasets and predict different types of mortality (ICU or in-hospital versus post-release), we may summarize the main observations:
Improved interpretability: Numerous ML algorithms have faced criticism for their “black box” nature, which limits interpretability. This concern is particularly evident in deep learning models, where a balance between predictive accuracy and interpretability must be struck. Deep learning, a subset of ML, organizes algorithms into layers, forming an artificial neural network capable of learning from data. Methods such as Shapley values [47], used in Thorsen-Meyer et al. [23] and Caicedo-Torres et al. [24], can convey the importance (or weighting) that the deep model assigns to each input feature, which offers improved interpretability for these networks.
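To make this concrete, the following minimal sketch computes Shapley values for a tabular mortality classifier. It assumes the open-source `shap` package and a scikit-learn gradient boosting model; the feature names and synthetic data are hypothetical placeholders, not variables from the cited studies.

```python
# Minimal sketch: explaining a mortality classifier with Shapley values.
# Feature names and data are hypothetical placeholders.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(18, 95, 500),
    "heart_rate": rng.normal(90, 15, 500),
    "lactate": rng.gamma(2.0, 1.0, 500),
})
y = (X["lactate"] + rng.normal(0, 1, 500) > 3.0).astype(int)  # synthetic label

model = GradientBoostingClassifier().fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Mean absolute SHAP value per feature gives a global importance ranking.
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
print(importance.sort_values(ascending=False))
```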
Features used: The approaches summarized vary between models that use features similar to existing models (APACHE) and some that use novel features. The benefits of using simple features such as demographics, labs, and vitals are their availability, reliability, and ease of use. Even a reduced set of features, such as the 15 selected by Kim et al. [31], showed a good area under the curve (AUC) when used with ML models. However, when combined with static features, physiological time series such as vitals and interventions offer an improved means of continuous mortality prediction [23]. Another promising direction is to use semi-structured data, such as those present in diagnosis and inspection reports [37]. Topics modeled from clinical notes can be added to traditional variables to improve prediction [37,45], as sketched below. Grnarova et al. [26] proposed a convolutional document-embedding approach applied to clinical notes, showing high AUC values. However, variations in clinical annotation practices across health systems may affect how benchmarking may be applied to this type of model. Purushotham et al. [46] compared hand-picked features (such as those used for SAPS-II), raw feature values, and inputs without preprocessing. They showed that models that can learn data representations (such as deep learning models) achieved the best results with unprocessed inputs. Although premorbid functional status and diagnosis are known predictors of ICU-relevant outcomes, they are not regularly implemented in established scoring systems. Moser et al. [34] included this information, showing increased predictive performance compared to established risk-scoring systems.
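As an illustration of the note-derived features mentioned above, the sketch below adds latent Dirichlet allocation (LDA) topic weights extracted from free-text notes to structured variables. This is a generic construction under our own assumptions (toy notes, hypothetical columns), not the pipeline of the cited papers.

```python
# Sketch: augmenting structured features with note-derived topic weights,
# in the spirit of the topic-modeling approaches cited above.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

notes = ["pt intubated overnight, sedation weaned",
         "family meeting re goals of care"]
structured = np.array([[72, 88.0], [61, 102.0]])  # e.g., age, heart rate

# Bag-of-words counts -> LDA topic mixture per note.
counts = CountVectorizer().fit_transform(notes)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# Concatenate topic weights with traditional variables for a downstream model.
features = np.hstack([structured, topics])
print(features.shape)  # (2, 4)
```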
Model choice: A one-size-fits-all model is unlikely, since model selection depends on the type of features used (raw data or clinical notes versus hand-picked clinical features) and the outcomes required (continuous mortality prediction versus in-hospital and post-discharge). However, several promising approaches have addressed mortality prediction in different ways. Purushotham et al. [46] benchmarked the performance of deep learning models against ensemble ML models and prognostic scoring systems, showing improved performance for deep learning. Deep learning also offered promising results in Caicedo-Torres et al. [24], who used multi-scale deep convolutional neural networks, and in Aczon et al. [25], who addressed pediatric mortality risk. A convolutional document-embedding approach based on the textual content of clinical notes was proposed by Grnarova et al. [26]. Another popular approach is to use ensemble classifiers to leverage the power of different groups of classifiers (a sketch follows). Guo et al. [30] proposed a dynamic ensemble-learning algorithm based on k-means (DELAK) for mortality prediction; they used k-means sampling to generate several data subsets on which base classifiers could learn the classification boundary. El-Rashidy et al. [35] used a stacking ensemble classifier, leading to a high AUC for in-hospital mortality, whereas Awad et al. [32] used an ensemble-learning random forest model.
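The sketch below shows the general shape of a stacking ensemble using scikit-learn, with illustrative base learners and synthetic data; it is not the exact configuration of El-Rashidy et al. or the other cited authors.

```python
# Minimal sketch of a stacking ensemble for in-hospital mortality.
# Base learners, meta-learner, and data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic imbalanced cohort (~10% positive class).
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner over base predictions
    cv=5,
)
print(cross_val_score(stack, X, y, scoring="roc_auc", cv=3).mean())
```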
Class imbalance: A common problem in mortality prediction is class imbalance, given the low rate of mortality relative to survival. Several strategies are commonly deployed, either to pre-process imbalanced data (re-sampling, optimizing the feature space) or to provide new algorithms that address the problem directly. Bhattacharya et al. [29], for example, proposed a binary classifier consisting of a skewness-based transformation of input features and statistical hypothesis tests to obtain the final classification. The balanced random forest (BRF) algorithm was used by Li et al. [48] to address class imbalance with promising results; a sketch follows.
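A minimal sketch of the balanced random forest idea, using the open-source `imbalanced-learn` package and synthetic data; the hyperparameters are illustrative, not those of Li et al.

```python
# Sketch of class-imbalance handling with a balanced random forest (BRF).
# Data and hyperparameters are illustrative assumptions.
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# ~5% positive class, mimicking low ICU mortality rates.
X, y = make_classification(n_samples=10000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Each tree is grown on a bootstrap sample in which the majority class
# is under-sampled to match the minority class.
brf = BalancedRandomForestClassifier(n_estimators=300, random_state=0)
brf.fit(X_tr, y_tr)
print(roc_auc_score(y_te, brf.predict_proba(X_te)[:, 1]))
```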
2. Length of Stay Benchmarking Using ML
Length of stay (LOS) predictions can be used for planning, identifying individuals with unexpectedly long (or short) LOS, and benchmarking. Models for LOS prediction enable case-mix correction when comparing LOS between ICUs, hospitals, or even health systems across geographic regions. Verburg et al. [9] conducted a systematic review of models that can be used for predicting LOS. They identified 11 studies describing the development of 31 prediction models and three studies describing the external validation of one of these models. For benchmarking, they concluded that none of the models satisfied their performance criteria, with the exception of the original APACHE model [49] and its second-order recalibration [50]; even these did not fulfill their requirements for moderate calibration. It is worth noting that the models reviewed were multivariable linear models, which assume linearity between LOS and its covariates or predictors. This assumption might not capture the complexity of the relationship: although patients with more severe illness tend to have a longer LOS, they also have a higher mortality risk, which could lead to shorter stays. Another important observation highlighted by Verburg et al. [9] is that LOS distributions are asymmetrical (right-skewed) and multimodal, since patient discharge tends to occur at particular times of day.
Given the above, nonlinear models have shown promising results for LOS prediction compared with linear statistical models. The recent review by Peres et al. [51] covers several approaches that have been proposed to address some disadvantages of linear models. Another recent review [52] focused on the use of ML for predicting medical inpatient LOS with a focus on non-ICU patients. In this report, we focus on the use of ML for predicting LOS for ICU patients.
It is important to highlight that methods can be categorized into two types: regression, which predicts LOS as a continuous outcome, and classification, which assigns patients to distinct groups, such as extended versus brief stays. The results of LOS regression models are usually assessed using R-squared, the root-mean-squared error (RMSE), and the mean absolute error (MAE). The concordance correlation coefficient is also presented in some studies. Classification results (a long stay, for example) are usually reported using the AUC metric, as well as sensitivity, specificity, and prediction accuracy. Recent studies have proposed classification models that convert LOS into a binary or multi-class problem and classify LOS into smaller buckets [53]. A small sketch of both framings follows.
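The sketch below illustrates the two framings on synthetic LOS data, using the regression metrics (R-squared, RMSE, MAE) and the classification metric (AUC) named above; the 7-day cut-off is an illustrative choice, not a standard from the cited studies.

```python
# Sketch: regression versus classification framings of LOS prediction,
# evaluated with the metrics named in the text. LOS values are synthetic.
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, roc_auc_score)

rng = np.random.default_rng(0)
los_true = rng.gamma(shape=2.0, scale=2.0, size=1000)   # right-skewed, in days
los_pred = los_true * rng.normal(1.0, 0.25, size=1000)  # noisy predictions

# Regression view: R-squared, RMSE, MAE.
print("R2  :", r2_score(los_true, los_pred))
print("RMSE:", mean_squared_error(los_true, los_pred) ** 0.5)
print("MAE :", mean_absolute_error(los_true, los_pred))

# Classification view: long stay (>7 days) as a binary target, scored by AUC.
long_stay = (los_true > 7).astype(int)
print("AUC :", roc_auc_score(long_stay, los_pred))
```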
Table 2 shows a set of different ML approaches for the prediction of both ICU and hospital LOS for critical care patients [38,53–63]. Below we summarize some of our observations:
Preprocessing: To deal with the asymmetric nature of the LOS distribution, preprocessing in some studies included log transformation [54] and Z-score normalization. In a regression application, a log transformation can be seen as modeling LOS via a Poisson or negative binomial regression model, among others. Studies in other areas have used methods that can deal with a skewed distribution; an example is the use of gamma mixture models that were applied to maternal hospital LOS [64].
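As a minimal illustration of these skew-handling options, the sketch below fits one model on log-transformed LOS and a Poisson regression on the raw values; the synthetic data and features are our own assumptions, not those of the cited studies.

```python
# Sketch of two skew-handling options for LOS regression: predicting
# log-transformed LOS, and a Poisson GLM on raw LOS. Data are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression, PoissonRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                     # e.g., severity features
los = rng.gamma(2.0, np.exp(0.3 * X[:, 0]) + 1.0)  # right-skewed target (days)

# Option 1: fit on log(LOS + 1), exponentiate predictions back to days.
log_model = LinearRegression().fit(X, np.log1p(los))
pred_days = np.expm1(log_model.predict(X))

# Option 2: a Poisson GLM models a non-negative, skewed outcome directly.
pois = PoissonRegressor(alpha=1e-3).fit(X, los)
print(pred_days[:3], pois.predict(X)[:3])
```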
Features: In most of the studies above, the features or predictors used focused on data readily available in the ICU, such as labs and patient demographics. However, Houthooft et al. [54] combined the raw data available in the first 5 ICU days with Sequential Organ Failure Assessment (SOFA) scores, as well as sub-scores created to assess the performance of different physiological systems, such as the renal, cardiovascular, and respiratory systems. Recently, Peres et al. [65] surveyed risk factors that have been used in ICU LOS prediction and suggested a list of risk factors that should be considered in prediction models for ICU LOS. These factors included severity scores, mechanical ventilation, hypomagnesemia, delirium, malnutrition, infection, trauma, red blood cell count, and PaO2:FiO2 ratios.
Models: As in the mortality prediction case, it is difficult to compare model performance because the datasets used differed in size, patient groups, and geographic regions. However, reframing the problem as classification showed better results. Harutyunyan et al. [53] reported an AUC of 0.84 for predicting ICU LOS >7 days using channel-wise long short-term memory units (LSTMs) and multitask training, whereas Ma et al. [58] reported an AUC of 0.85 for predicting LOS >10 days using just-in-time learning (JITL) and a one-class extreme learning machine (note that their study included only 4,000 patients). In the studies of Iwase et al. [38] and Peres et al. [62], the authors used random forest models and achieved good classification accuracy for short and long ICU stays, with AUCs above 0.87. Houthooft et al. [54] initially framed the problem as a classification task to identify patients with extended stays (beyond 10 days) and then treated the stays shorter than 10 days as a regression problem, achieving an MAE of 1.79 days for this subgroup.
3. Mechanical Ventilation Management Benchmarking
Although mechanical ventilation has been less investigated than either mortality or LOS, the last few years have seen several new approaches for predicting both the probability and the duration of mechanical ventilation (Table 3). Note that, unlike mortality and LOS, where our focus was on models for generic ICU patients, few papers predicted ventilation for all groups, so we present work that focused on particular cohorts (such as acute respiratory distress syndrome and COVID-19 patients). Other important areas where ML is used are ventilation weaning and extubation outcomes; however, they are outside the scope of this review.
Much like the case of predicting mortality, drawing direct comparisons between the methods outlined in Table 3 proves challenging when considering the results alone [11,12,66–72]. The variability in cohorts and target variables, such as distinguishing between ventilation duration and daily/entire-stay probabilities, contributes to this complexity. Some observations are as follows:
Features: The use of imaging (X-rays), especially for COVID-19 patients [68], provides a new source of data that can be leveraged for increased precision, especially when combined with deep learning approaches. Other studies depended on more standard features, such as patient characteristics, baseline comorbidities, vital signs, laboratory values, medication administration records, and processes of care.
Models: For ventilation duration, gradient boosting showed promising results [66,72] (a sketch follows), as did simpler multivariable regression models [67]. For ventilation probability, the choice of method depended on the targets and features. Deep learning was used for X-ray imaging datasets as well as for clinical features, with promising results in both cases.
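A minimal sketch of gradient boosting for ventilation-duration regression follows; the features, synthetic targets, and hyperparameters are assumptions for illustration, not the cited models.

```python
# Sketch: gradient boosting for ventilation-duration regression, echoing
# the approaches above. Features and targets are synthetic placeholders.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))  # e.g., vitals, labs, comorbidities
vent_hours = np.exp(1.5 + 0.4 * X[:, 0] + rng.normal(0, 0.5, 5000))

# Histogram-based boosting handles missing values natively, which is
# convenient for ICU data.
model = HistGradientBoostingRegressor(max_iter=300, random_state=0)
mae = -cross_val_score(model, X, vent_hours,
                       scoring="neg_mean_absolute_error", cv=3).mean()
print(f"cross-validated MAE: {mae:.2f} hours")
```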
IV. Discussion
ML has provided a novel means of benchmarking critical care through utilizing the power of large datasets and improved algorithms for outcome prediction. However, despite the plethora of articles appearing in the last two decades, the comparison of results and performance remains challenging. Despite some attempts to offer unified datasets for comparison [21], many of the models are developed on different databases, which may be country-specific, be disease/cohort-specific, or even target different outcomes (such as mortality in the ICU, hospital, or after release). Several ICU databases have recently been shared publicly, which can facilitate the comparison of modeling approaches [73].
The studies reviewed showed a variety of inputs or predictors. Traditionally, features were hand-crafted and included demographics, patient characteristics, admission diagnoses, labs, and vitals. However, we have recently seen more studies that devote less effort to fine-tuning features yet achieve good results by learning from raw data [32,33]. New predictors have also been added, such as imaging [68], clinical notes, and premorbid functional status [34], which show improvements in outcome prediction.
In terms of the models selected, the studies show a large variety, including support vector machines, gradient boosting, hidden Markov models, and deep learning. Model selection is affected by performance, data size, the handling of missing/erroneous data, and interpretability. Multi-task learning is an interesting direction because it improves generalization by leveraging the domain-specific information contained in the training signals of related tasks. Harutyunyan et al. [53] applied a deep learning multi-task learning framework to cover a range of clinical problems, including modeling risk of mortality, forecasting LOS, detecting physiologic decline, and classifying phenotype. In terms of interpretability, methods such as Shapley values can be used to convey the importance that an ML model assigns to input features [23,24].
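To illustrate the multi-task idea, the following sketch trains a small shared-trunk network with separate mortality and long-LOS heads in PyTorch. The architecture, targets, and dimensions are our own simplified assumptions, not the framework of Harutyunyan et al.

```python
# Minimal multi-task sketch: a shared trunk with two heads, one for
# mortality and one for long LOS, trained on a joint loss.
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.mortality_head = nn.Linear(hidden, 1)
        self.long_los_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.trunk(x)  # shared representation feeds both tasks
        return self.mortality_head(h), self.long_los_head(h)

# Synthetic batch: 32 patients, 20 features, two binary targets.
x = torch.randn(32, 20)
y_mort = torch.randint(0, 2, (32, 1)).float()
y_los = torch.randint(0, 2, (32, 1)).float()

net = MultiTaskNet(20)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

logit_mort, logit_los = net(x)
loss = bce(logit_mort, y_mort) + bce(logit_los, y_los)  # joint objective
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
```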
Another significant concern during the training and evaluation of benchmarking models is class imbalance, a phenomenon evident across all the clinical outcomes examined in the current review. This imbalance is particularly pronounced for mortality, as only a relatively small subset of critical care patients die. The issue extends to recent studies assessing the efficacy of established LOS models: these models do not distinguish between patients who survived and those who did not, leading to the overrepresentation of surviving and lower-risk patients [74]. We presented some approaches that address class imbalance, such as skewness-based transformations [29] and balanced random forest algorithms [48]. We point the reader to several reviews on this active research area [75,76].
One factor contributing to differences among studies relates to how stays are defined and consolidated. During a single hospitalization, patients may be discharged from and readmitted to the ICU multiple times. To maintain uniformity in model development, standardized criteria are needed for deciding whether these occurrences constitute one continuous stay or several separate stays, which would reduce variation in the resulting models (see the sketch below). A related subject is ICU type, which encompasses different patient groups, treatments, and outcomes, such as cardiac care versus neurological cases. Although this topic warrants deeper exploration, the published literature has given it limited emphasis.
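As one concrete way to standardize stay consolidation, the sketch below merges ICU stays separated by less than 24 hours into a single episode using pandas; the 24-hour threshold and the column names are illustrative assumptions, not a published standard.

```python
# Sketch: consolidating ICU stays separated by short gaps into one episode.
# The 24-hour gap threshold and column names are illustrative assumptions.
import pandas as pd

stays = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "icu_in":  pd.to_datetime(["2023-01-01 08:00", "2023-01-03 06:00",
                               "2023-01-02 10:00"]),
    "icu_out": pd.to_datetime(["2023-01-02 20:00", "2023-01-05 12:00",
                               "2023-01-04 09:00"]),
}).sort_values(["patient_id", "icu_in"])

# Gap between this admission and the patient's previous ICU discharge.
gap = stays["icu_in"] - stays.groupby("patient_id")["icu_out"].shift()

# A new episode starts when the gap exceeds 24 hours.
stays["episode"] = (gap > pd.Timedelta(hours=24)).groupby(
    stays["patient_id"]).cumsum()

merged = stays.groupby(["patient_id", "episode"]).agg(
    icu_in=("icu_in", "min"), icu_out=("icu_out", "max"))
print(merged)
```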
A further issue that could affect model performance is that some sub-populations, such as ethnic minorities, may be underrepresented even in large datasets. Other sources of bias that could influence performance are related to variations in documentation across sites and geographic regions, due mostly to subjective evaluation. Both the reason for admission to the ICU and the Glasgow Coma Score, for example, may incorporate subjective evaluation from clinicians [77].
Most existing benchmarking models were developed on country-specific databases. The APACHE scores, for example, were trained and tested on United States data. However, clinical practice, documentation, and patient diversity differ across geographic regions, requiring model recalibration and retraining.
As in other fields, ML in benchmarking ICU outcomes has focused on developing models with improved performance on retrospective data. However, little work has addressed long-term validation after deployment, which would track data drift, model drift, and performance over time (a simple drift check is sketched below). Existing models merit recalibration every few years due to data drift. Reasons for data drift in critical care include seasonality, changes in documentation practices, the introduction of new devices, missing data, and changes in clinical practice over time. The same applies to bias, model generalizability, and fairness [78]. Federated learning, a distributed technique for training ML models without exchanging data, presents an intriguing paradigm for locations where data sharing is not feasible or for refining models using local datasets for updates [79].
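As a simple example of post-deployment monitoring, the sketch below computes the population stability index (PSI) for a single feature; the 0.2 alert threshold is a common rule of thumb, not a value from the cited literature.

```python
# Sketch: monitoring data drift with the population stability index (PSI),
# one common post-deployment check. Data and threshold are illustrative.
import numpy as np

def psi(expected: np.ndarray, observed: np.ndarray, bins: int = 10) -> float:
    """PSI between a training-time and a current feature distribution."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    o_pct = np.histogram(observed, bins=edges)[0] / len(observed) + 1e-6
    return float(np.sum((o_pct - e_pct) * np.log(o_pct / e_pct)))

rng = np.random.default_rng(0)
train_lactate = rng.gamma(2.0, 1.0, 5000)  # distribution at model development
live_lactate = rng.gamma(2.0, 1.3, 5000)   # shifted distribution in production

score = psi(train_lactate, live_lactate)
print(score, "drift detected" if score > 0.2 else "stable")
```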
Although the models highlighted in the current review attempt to adjust for measured risk factors, unobserved patient attributes mean that risk adjustment is never perfect [80]. Factors such as medication adherence, social support, and mobility before admission often go unmeasured; even when models are accurately calibrated to the collected data, these factors continue to influence the results. Finding ways to incorporate some of them, possibly through clinical notes and patient interactions, remains crucial, and large language models and generative artificial intelligence (AI) methods could emerge as a means of bridging this gap.
In conclusion, ML has provided novel tools for benchmarking critical care outcomes, leading to improved results and addressing important drawbacks of previous methods, such as biases due to documentation variability, missing data, and class imbalance, as well as enabling the modeling of non-linear relationships between variables and outcomes. Prospects exist for using ML to encompass a broader array of data types, including imaging, medical notes, and diagnoses. The utilization of multi-national datasets via techniques like federated learning could also prove advantageous in developing models that find broader relevance across diverse patient groups and geographic regions where data sharing is not possible. Generative AI and large language models present a fresh approach for scrutinizing extensive datasets, including medical notes, thereby enhancing the efficacy of future ML models within this domain. In clinical contexts, we suggest that healthcare practitioners opt for well-validated models tailored to their specific geographic and patient-demographic considerations.
Acknowledgments
The authors would like to thank Dr Omar Badawi and Robin French for their ideas on bottlenecks in benchmarking critical care outcomes based on their extensive experience.
Notes
Conflict of Interest
This work was funded by Philips Healthcare, and all authors are fully employed by Philips. The authors have no competing interests.