Data Mining to Identify the Right Interventions for the Right Patient for Heart Failure: A Real-World Study
Abstract
Objectives
To identify the right interventions for the right heart failure (HF) patients in the real-world setting using machine learning (ML) trained on individual-level clinical data linked with social determinants of health (SDOH) data.
Methods
In this retrospective cohort study, point-of-care claims data from Komodo Health and SDOH data from the National Health and Wellness Survey (NHWS), covering January 2014 to December 2020, were linked. Data mining was conducted using K-means clustering, an ML tool. Komodo Health provided longitudinal data for the selected patient cohorts, and NHWS provided cross-sectional data with additional patient information. The primary outcome was HF-related hospitalization; secondary outcomes were all-cause hospitalization and all-cause mortality. Use of digital healthcare (DHC)/non-DHC interventions and related outcomes was also assessed.
Results
The study population included 353 HF patients (mean age, 63.5 years; 57.2% women). From baseline to follow-up, the use of non-DHC interventions increased from 75.9% to 81.9% and the use of DHC interventions from 4.0% to 9.1%. Overall, 17.0% of patients had HF-related hospitalizations (DHC, 6.9%; non-DHC, 16.5%) and 45.0% had all-cause hospitalization (DHC, 75.0%; non-DHC, 50.9%). Two archetypes with distinct patient profiles were identified. Compared with archetype 2, archetype 1 was characterised by older age, greater disease severity, more comorbidities, more medication use, more frequent steps to prevent heart attack/problems, a better lifestyle, more HF-related hospitalizations (18.3% vs. 16.3%), and fewer all-cause hospitalizations (42.9% vs. 46.3%). The trends remained the same regardless of the intervention type.
Conclusions
Identification of patient archetypes with distinct profiles can be useful to understand underlying disease subtypes, identify specific interventions, predict clinical outcomes, and define the right intervention for the right patient.
I. Introduction
Digital healthcare (DHC) interventions have the potential to improve disease control and management, population health outcomes, and healthcare quality [1–3]. Current DHC solutions include telehealth, digital and virtual disease management platforms, modifiable risk factor technologies, dietary counselling, psychological assistance, and personalized short messaging. DHC tools are inexpensive, convenient, and easy to navigate, and they provide accessible/concise information and secure data management, leading to higher acceptability [4].
A combination of medical (point-of-care) data from clinical sources (e.g., electronic medical records, registries, insurance claims) and social determinants of health (SDOH, between-care) data from devices (e.g., smartphones and apps) can provide insights into patients’ behaviors, medication responses, lifestyle choices, and a holistic view of their healthcare journey [5]. These combined data can be used to train a machine learning (ML) model to predict responses to interventions.
DHC solutions can be safe alternatives to conventional healthcare services to manage patients with cardiovascular conditions [6]. Results from a randomized trial [7] on 765 heart failure (HF) patients suggested that remote patient management may reduce unplanned hospitalizations, morbidity, and mortality [8]. Data mining techniques can be used to discover patterns and associations in medical data and to uncover solutions to existing gaps, and they have been used in HF studies [9]. Thus, in the context of HF, this study aimed to identify the right interventions for the right patient in the real-world setting using data mining by (1) describing DHC/non-DHC interventions used by HF patients; (2) identifying HF patient archetypes according to socio-demographic, clinical characteristics, procedures, laboratory tests, patient-reported outcomes (PROs), and comorbidities; and (3) describing hospitalizations/re-hospitalizations, mortality, costs, and use of DHC/non-DHC interventions in all HF patients and archetypes.
II. Methods
1. Study Design and Data Sources
This retrospective cohort study was conducted in the United States (US) using data from January 2014 to December 2020. Claims data from the Komodo Health database [10] (Supplement A) were linked with data from the National Health and Wellness Survey (NHWS) [11] by probabilistic matching on first and last name, date of birth, sex, address, and Zone Improvement Plan (ZIP) code. The linked dataset from the Komodo database was de-identified in compliance with the Health Insurance Portability and Accountability Act; therefore, no Institutional Review Board approval was required [12,13]. The 2015–2017 NHWS database has been granted Institutional Review Board exemption status by the Pearl Institutional Review Board.
Patients with an HF diagnosis in the Komodo database were enrolled between January 2015 and December 2019. The index date was defined as the date of the first HF diagnosis during the enrolment period. A baseline period of ≥1 year before the index date was used to assess baseline characteristics and to exclude patients with a previous HF diagnosis. A follow-up period of ≥1 year after the index date was used to assess use of interventions and outcomes (Figure 1).
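The index-date logic described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the function names, the hypothetical claim dates, and the approximation of 1 year as 365 days are all assumptions introduced here.

```python
from datetime import date, timedelta

# Enrolment period stated in the text (January 2015 to December 2019).
ENROL_START = date(2015, 1, 1)
ENROL_END = date(2019, 12, 31)
ONE_YEAR = timedelta(days=365)  # assumption: "1 year" approximated as 365 days


def index_date(hf_diagnosis_dates):
    """Return the first HF diagnosis date within the enrolment period, or None."""
    in_period = [d for d in hf_diagnosis_dates if ENROL_START <= d <= ENROL_END]
    return min(in_period) if in_period else None


def has_prior_hf(hf_diagnosis_dates, index):
    """True if any HF diagnosis precedes the index date (patient would be excluded)."""
    return any(d < index for d in hf_diagnosis_dates)


def study_windows(index):
    """Return (baseline_start, follow_up_end) for the minimum 1-year windows."""
    return index - ONE_YEAR, index + ONE_YEAR


# Hypothetical patient with two HF claims during the enrolment period.
claims = [date(2017, 1, 5), date(2016, 3, 10)]
idx = index_date(claims)
baseline_start, follow_up_end = study_windows(idx)
print(idx, baseline_start, follow_up_end, has_prior_hf(claims, idx))
```

In this sketch, a patient with any HF claim dated before the index date would be flagged by `has_prior_hf` and dropped, mirroring the exclusion of patients with a previous HF diagnosis.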
2. Patient Population
Adult HF patients aged ≥18 years at index date, having ≥1 year baseline and follow-up linked data from Komodo Health and NHWS, and ≥1 diagnosis code—International Classification of Diseases 9th edition Clinical Modification (ICD-9-CM) and 10th edition (ICD-10-CM)—for HF in Komodo data during enrolment were included. Patients having ≥1 diagnosis code for HF in Komodo data during baseline or having completed NHWS ≥2 years before or after index date were excluded.
3. Study Variables
1) Primary outcome
Proportion of patients with HF-related hospitalization (≥1 record of any hospitalization for HF) during the 1-year follow-up period.
2) Secondary outcomes
All-cause hospitalization and all-cause death (≥1 record of any hospitalization/death) during the 1-year follow-up period; and overall healthcare costs (sum of costs for all claims) during the 1-year baseline and follow-up periods.
Komodo data were used to assess primary and secondary outcomes.
3) Other variables
Demographics, lifestyle, and clinical characteristics: ≤2 years before index date; NHWS data.
PROs: ≤2 years before index date; NHWS data using the Medical Outcomes Study Short Form 36 (SF-36v2 Health Survey; 36 items, physical and mental component summaries; score 0–100; higher score indicating better health outcome) [14] and the EuroQol 5-Dimension 5-Level Health Questionnaire (EQ-5D-5L; hereafter EQ-5D) [15].
Comorbidities (baseline period), interventions (DHC/non-DHC; HF medications), procedures and laboratory tests (stratified by baseline and follow-up period) using Komodo data (Supplements B–E).
4. Statistical Analysis
No formal calculation of sample size/power was performed, as hypothesis testing was not involved. Descriptive statistics were used to summarize all variables and outcomes; continuous variables as mean ± standard deviation and median (interquartile range); and categorical variables as number (%).
K-means clustering, an ML tool, was used to identify HF patient archetypes according to socio-demographic and clinical characteristics, procedures, laboratory tests, PROs, and comorbidities (Supplements F–G). Eighty-seven variables were available for analysis. Factor analysis of mixed data, a dimensionality reduction technique, was used to derive a new set of uncorrelated variables (dimensions) that reduced the number of features while preserving most of the information in the dataset (Supplement H). Eigenvalues and the contribution of each dimension to the total variance were described, and the 87 variables were used to characterize the main archetypes identified during clustering analysis. For each archetype × variable pair, the prevalence within the archetype, the prevalence within the cohort, and the lift score = (Prevalence within archetype)/(Prevalence within cohort) were calculated. Variables with a lift score >1 (i.e., more prevalent within the archetype than in the cohort) were used to describe patient archetypes (Supplement I). Standard descriptive statistics were then used to analyse outcomes within each archetype, including hospitalizations/re-hospitalizations, mortality, and overall healthcare costs. All outcomes were analysed at 1-year follow-up.
All analyses were performed on non-missing and valid values. Missing data were not imputed and were reported as number (%) in a separate category. Statistical analyses were conducted using R software (version 3.6.3; https://www.r-project.org/).
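The lift score defined above can be illustrated with a short sketch. It uses the former-smoker proportions reported in the Results (46.8% of 126 patients in archetype 1; 28.2% of 227 in archetype 2); the patient counts are reconstructed here from those rounded percentages and are therefore an assumption, as is the function name `lift_score`. The study itself was analysed in R; Python is used only for illustration.

```python
def lift_score(n_with_in_archetype, n_archetype, n_with_in_cohort, n_cohort):
    """Lift = (prevalence within archetype) / (prevalence within cohort)."""
    prevalence_archetype = n_with_in_archetype / n_archetype
    prevalence_cohort = n_with_in_cohort / n_cohort
    return prevalence_archetype / prevalence_cohort


# Counts reconstructed from rounded percentages (assumption, not source data):
# former smokers, 46.8% of 126 (archetype 1) and 28.2% of 227 (archetype 2).
smokers_a1, n_a1 = 59, 126
smokers_a2, n_a2 = 64, 227
smokers_all, n_all = smokers_a1 + smokers_a2, n_a1 + n_a2

lift_a1 = lift_score(smokers_a1, n_a1, smokers_all, n_all)
lift_a2 = lift_score(smokers_a2, n_a2, smokers_all, n_all)
print(f"archetype 1 lift: {lift_a1:.2f}")
print(f"archetype 2 lift: {lift_a2:.2f}")
```

Under these reconstructed counts, the lift is above 1 for archetype 1 and below 1 for archetype 2, so "former smoker" would count as a descriptor of archetype 1 only.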
III. Results
1. Baseline Characteristics
Of 6,915 HF patients with Komodo and NHWS linked data, 1,472 patients (aged ≥18 years) with two closed claims (2014–2020), including ≥1 HF claim (2015–2019), and with ≥1 year of continuous enrolment before the index date were identified. Of these, 353 patients with no HF visit during the baseline period, with ≥1 year of follow-up, and who completed the NHWS ≤2 years before the index date were included for analysis (Figure 2).
Overall, mean age was 63.5 ± 12.18 years; 57.2% were women. Most patients were White (Caucasian; 76.8%) and did not have a university education (72.0%); 50.1% were retired and 28.7% had an income of <$25,000. Patients with HF diagnosis at index date in inpatient and non-inpatient settings had similar characteristics. Most patients were overweight/obese (78.3%), had not exercised in the previous month (56.4%), were former smokers/non-smokers (78.4%), and consumed alcohol once a month/abstained (63.7%); 42.5% reported taking steps to prevent heart attack/problems (Table 1).
2. Patient-Reported Outcomes
Mean physical and mental component scores on the SF-36v2 were 41.99 and 48.80, respectively (Table 2). On the EQ-5D instrument, the mean utility index value in the total population was 0.752 (median = 0.79). Overall, 61.8% of patients reported difficulties in walking and 57.1% faced challenges in usual activities.
3. Comorbidities
Both cardiac and non-cardiac comorbidities were observed. The most common (>20%) were obesity (51%), systemic hypertension (47.9%), diabetes mellitus (DM; 39.4%), and coronary heart disease (CHD; 21.8%). Others (10.5%–16.4%) included chronic kidney failure (CKF), chronic obstructive pulmonary disease (COPD), sleep apnea, and anemia (Supplement J).
4. DHC and Non-DHC Interventions
The proportion of patients using DHC interventions (all telehealth) increased from 4.0% at baseline to 9.1% during follow-up. The proportion using non-DHC interventions increased from 75.9% at baseline to 81.9% during the post-index period. At baseline, the most frequently used medications were low-density lipoprotein (LDL)-cholesterol-lowering therapy (45.3%), diuretics (38.2%), beta-blockers (37.7%), and angiotensin-converting enzyme inhibitors (28.3%) (Table 3).
5. Association between Hospitalization, Costs, and Use of Different Interventions
Overall, 45.0% of patients had all-cause hospitalization/re-hospitalizations and 17.0% had HF-related hospitalizations. A lower rate of HF-related hospitalizations (6.9%) was reported following any DHC intervention post-index date than in the overall population and other treatment groups (non-DHC interventions, 16.5%; medications, 15.2%). However, the subgroup with DHC interventions had a considerably higher rate of all-cause hospitalizations (75.0%) than other treatment groups (non-DHC interventions, 50.9%; medications, 44.9%) (Figure 3).

Figure 3. Hospitalization and re-hospitalizations in patients with HF and by subgroup. HF: heart failure, DHC: digital healthcare.
Seven patients (2%) died within 1 year of HF diagnosis (Figure 4). However, due to the small sample size, these estimates are likely unreliable and were not subjected to further analyses.
At baseline, mean total cost was $22,240; outpatient costs ($10,886) accounted for about half of total costs, followed by pharmacy ($4,824) and inpatient costs ($3,569). The mean total cost increased significantly to $45,702 at 1-year follow-up (change from baseline, 105%), mainly comprising inpatient costs ($16,170), followed by outpatient ($16,002) and pharmacy costs ($5,483) (Table 4).
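The cost arithmetic above can be checked with a brief sketch. The dollar amounts are the mean costs reported in the text; the helper names `percent_change` and `share` are introduced here purely for illustration.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


def share(part, total):
    """Share of `part` in `total`, in percent."""
    return part / total * 100


# Mean costs (US dollars) as reported in the text.
baseline = {"total": 22240, "outpatient": 10886, "pharmacy": 4824, "inpatient": 3569}
follow_up = {"total": 45702, "outpatient": 16002, "pharmacy": 5483, "inpatient": 16170}

change_total = percent_change(baseline["total"], follow_up["total"])
outpatient_share_baseline = share(baseline["outpatient"], baseline["total"])
inpatient_share_follow_up = share(follow_up["inpatient"], follow_up["total"])

print(f"change in total cost: {change_total:.1f}%")
print(f"outpatient share at baseline: {outpatient_share_baseline:.1f}%")
print(f"inpatient share at follow-up: {inpatient_share_follow_up:.1f}%")
```

The listed components do not sum to the reported totals, so the remainder presumably reflects cost categories not itemised in the text.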
1) Descriptive analyses of archetypes
Two archetypes of HF patients with distinct baseline characteristics were identified (Supplement H). Compared with archetype 2 (n = 227), archetype 1 (n = 126) had higher proportions of older patients (aged >70 years; 46.8% vs. 30.8%), patients diagnosed in a non-inpatient setting (74.6% vs. 64.8%), patients with an income of $25,000–$75,000 ($25,000–$50,000/$50,000–$75,000, 28.6%/19.8% vs. 23.3%/16.7%), former smokers (46.8% vs. 28.2%), patients who exercised regularly (46.8% vs. 41.9%), and patients who took steps to prevent heart attacks/problems, including consuming a low-fat diet (55.6% vs. 0%) and stress management (39.7% vs. 0.4%). Archetype 1 also had a generally higher prevalence of comorbidities (both cardiovascular and non-cardiovascular) and greater use of laboratory tests (electrocardiogram, creatinine, etc.). These differences were also seen when archetype 1 was compared with the total population (Table 5).
Medication use was more frequent in archetype 1 than archetype 2 at baseline (73.0% vs. 68.7%; p = 0.398); a significant difference in use of LDL-cholesterol lowering therapy (54.8% vs. 40.1%; p = 0.008) and angiotensin II receptor blockers (ARB; 25.4% vs. 15.4%; p = 0.022) was observed. During follow-up, medication use increased, especially in archetype 2, and became similar in both archetypes, except for a significant difference in ARB use (32.5% vs. 18.9%; p = 0.004) (Table 3).
2) Association between study outcomes and use of different interventions — by archetype
Patients in archetype 1 tended to have more HF-related hospitalizations than those in archetype 2 (18.3% vs. 16.3%); the trend remained the same regardless of intervention. HF-related hospitalizations were particularly high among those with an inpatient index diagnosis (28.1% in archetype 1 vs. 23.8% in archetype 2) (Figure 5A).

Figure 5. (A) Heart failure-related hospitalizations and (B) all-cause hospitalizations by archetype. DHC: digital healthcare.
Overall, a lower proportion of patients in archetype 1 than in archetype 2 had all-cause hospitalizations (42.9% vs. 46.3%). Archetype 1 had fewer all-cause hospitalizations than archetype 2 in the DHC (73.3% vs. 76.5%), non-DHC (46.7% vs. 53.3%), and medications (40.8% vs. 47.4%) subgroups, and in those with an inpatient HF diagnosis at index date (56.3% vs. 65.0%) (Figure 5B).
3) Overall healthcare costs by archetype
Archetype 1 had higher total costs than archetype 2 and the total population during both the baseline and follow-up periods (Table 4). The difference between archetypes did not reach significance at baseline ($28,049 vs. $18,932; p = 0.066) but was significant during follow-up ($62,023 vs. $36,409; p = 0.022). Inpatient costs were higher in archetype 1 than in archetype 2 ($24,802 vs. $11,256; p = 0.050); this large difference was only borderline significant because of high variability in the overall data (standard deviation = 60,109). Findings were similar for pharmacy costs.
IV. Discussion
The key challenge in using retrospective databases to explore the potential of DHC interventions in the real world is the poor availability of between-care and point-of-care data that co-exist at individual-level, longitudinal detail. Unlike most studies, which included only data on clinical, laboratory, and healthcare use (except one that included quality of life [QoL] variables) [16], our study also considered social determinants of health, as they can impact chronic disease management and enable a more targeted intervention. This combined dataset provides a more holistic view of the patient and clues to their behaviors between points of care, which is necessary to identify the patient profile most likely to benefit from chronic care interventions.
Our study population included 353 relatively young HF patients (mean age, 63.5 years), with a higher proportion of women compared to the general HF population. Horiuchi et al. [17] estimated an average age of 73 years and a proportion of 65% men in their cohort of HF patients. A systematic review of methods to identify HF patients in general practice reported a weighted mean age of 75 years [18]. Another systematic review of cost-of-illness studies in adults with HF in the United States reported mean age 59–84 years, with most studies estimating an average age of ≥70 years [19]. The inclusion of relatively young patients was likely due to lower participation of older patients in the NHWS.
The SF-36v2 scores at baseline suggested moderate deterioration in physical health status and mild impairment in mental health-related quality of life (HRQoL). The mean EQ-5D utility index value in the total population corresponded to mild HF severity at baseline [20]. Further analysis showed that the mild HRQoL deterioration was driven by challenges in mobility and usual activities. Co-existing conditions included obesity, systemic hypertension, DM, CHD, COPD, sleep apnea, anemia, iron deficiency, atrial fibrillation and flutter, CKF, valvular heart disease, thyroid disorders, anxiety, and depression, which is consistent with the most frequent HF-associated comorbidities [21–23].
According to a recent report of the United States Department of Health and Human Services, the COVID-19 pandemic led to increased DHC utilisation [24]; an increase was also observed in our study (pre- vs. post-index date, 4.0% vs. 9.1%). This increase was expected after the index date, as patients were more likely to utilise DHC following disease progression or increased severity or on development of comorbidities, complications, or mobility problems. DHC interventions were associated with fewer HF-related hospitalizations than in the total population and other subgroups, but with more all-cause hospitalizations than non-DHC interventions, possibly owing to the prevalence of non-cardiovascular comorbidities. The higher rate should be interpreted with caution and is likely overestimated, as patients can receive multiple types of interventions.
The total healthcare costs increased significantly from baseline ($22,240) to follow-up ($45,702), mainly driven by inpatient costs. Overall, 45.0% of patients had all-cause hospitalizations and 17.0% had HF-related hospitalizations. In another analysis, the high costs (mean total costs, $62,615; HF-related costs, $35,329) in the year following HF worsening in patients with HF with reduced ejection fraction (HFrEF) were attributed to inpatient encounters [25]. Similarly, another study reported significantly (p < 0.001) higher mean costs per person in the year after HF diagnosis ($34,372) than in the year before diagnosis ($8,219) [26]. However, HF-associated costs in similar published studies have varied widely [19]. In a systematic review of HF-associated costs in the US, HF-specific hospitalizations (median cost/patient, $15,879) accounted for the increase in annual median total costs for HF care ($24,383). Costs were largely driven by the length of stay and varied based on patient characteristics (e.g., comorbidities) [19]. A review synthesizing international cost estimates of cardiovascular events also reported lower costs [27]. The average cost of HF hospitalization across studies was $11,686 (median, $10,291). Costs from US claims analyses were high ($27,006), and follow-up costs through 1 year were $12,931 (median, $15,238) [27].
ML tools enable the use and analysis of large datasets to examine multiple clinical features and identify trends in disease progression and prognosis within a patient population [7]. Several clustering analyses have been conducted to identify HF phenotypes/subclasses and comorbidity patterns in HF patients [16,28]. The K-means clustering method is one of the most widely adopted clustering methods in real-world evidence studies owing to its simplicity and performance [29].
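To illustrate the simplicity referred to above, the following is a minimal sketch of the standard K-means procedure (Lloyd's algorithm), which alternates an assignment step and a centroid update step until assignments stop changing. It is not the study's implementation, which was run in R on dimensions derived by factor analysis of mixed data; the toy two-dimensional points, the function name `kmeans`, and the deterministic initialisation from the first k points are assumptions made for this example.

```python
import math


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm. Centroids start at the first k points (an assumption
    made for reproducibility; practical implementations use random restarts)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Toy data: two well-separated groups standing in for reduced dimensions.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

With k = 2, the two resulting clusters play the role of the two archetypes; in practice, the number of clusters and the initialisation both affect the solution and would need to be assessed.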
Although two archetypes were identified, separation between them was low owing to a homogeneous study population and relatively small sample size. Archetype 1 may comprise HFrEF patients, as indicated by greater disease severity, more comorbidities, and significantly higher ARB prescription at baseline and follow-up versus archetype 2. Furthermore, archetype 1 had slightly more HF-related hospitalizations and fewer all-cause hospitalizations. Better lifestyle and a higher rate of heart disease prevention practices in archetype 1 may have contributed to patients’ general wellbeing and thereby to fewer all-cause hospitalizations. Costs associated with archetype 1 were significantly higher than for archetype 2, possibly due to older age and higher comorbidities, use of laboratory tests, and medications. High variability between archetypes in terms of costs could be due to different follow-up durations, disease severity, or comorbidities. A more in-depth analysis of comorbidities, medication, New York Heart Association classes, and study outcomes stratified by HF subtypes is needed to investigate these hypotheses.
Similar studies have examined relationships between patient profiles in different HF archetypes and outcomes. A Swedish registry study identified four distinct HF patient clusters that differed significantly in outcomes and therapeutic response. The two clusters with the lowest 1-year survival rates were characterized by older age, low body mass index, high blood pressure, prior strokes/transient ischemic attacks, more comorbidities, low use of β-blockers, angiotensin-converting enzyme inhibitors, and implanted devices, and high use of diuretics, nitrates, and digoxin; patients in these clusters were least likely to have a university degree and had the lowest income [30]. As methods to diagnose/identify HF patients may differ across countries, our results may not be generalizable. Additionally, HF management choices are influenced by local guidelines; therefore, treatment patterns, interventions, and costs may also differ.
Our work may set a new framework for generating data-driven ML approaches that link point-of-care data with SDOH data. Such linking of datasets can help to obtain insights from a more complete patient history. Furthermore, the availability of such linked datasets in the future can be improved with big data technologies, which can help to obtain better insights from patient populations.
Our study had several limitations. First, the linkage of claims data with online survey data required active patient participation, which could have been challenging for severely ill/elderly patients. Therefore, the study population (younger, healthier, fewer comorbidities) may not be representative of the general US HF population. Second, a US claims database enables inclusion of many patients, especially for a medical condition like HF, but the sample size was reduced by linkage with NHWS data. However, the linkage was necessary to obtain variables not available in the claims database for better patient characterization. Third, patients were selected based on any HF diagnosis, which may have led to inclusion of those with other conditions/comorbidities, impacting patients’ outcomes, including disease severity, HRQoL, healthcare resource utilization, and costs. To reduce the likelihood of including patients without a primary HF diagnosis, those with HF during the baseline period were excluded.
More clinically meaningful archetypes could have been identified using a larger sample. However, the nature of the first HF diagnosis (inpatient/non-inpatient) was considered as a proxy for disease severity in the analyses. Additionally, mortality data were available only for approximately 85% of patients in the Komodo database, potentially leading to underestimation of the number of deaths. As with all claims data analyses, medication use was assessed based on prescriptions, and information on patients’ medication adherence was unavailable. Finally, missing values for clinical measures in the NHWS data could be a limitation.
In conclusion, HF is associated with substantial clinical and financial burden and impacts patients’ QoL. Efforts to integrate DHC interventions as complementary to traditional face-to-face health services may improve patient outcomes, efficiency of healthcare delivery, and cost savings. Despite certain limitations, identification of two archetypes with distinct patient profiles and outcomes using the K-means clustering algorithm can help to better understand underlying disease subtypes, predict clinical outcomes, and define the right intervention for the right patient. Future studies with a larger, more enriched database are warranted to generate further clinical insights using advanced analytics.
Notes
Conflict of Interest
Keni Lee is an employee of Sanofi, UK. Ramzi Argoubi is an employee of Cerner Enviza, an Oracle company, which received funding from Sanofi to conduct this study. Halley Costantino was an employee of Cerner Enviza at the time of conducting this study but is now an employee of BluePath Solutions.
Acknowledgments
This study was funded by Sanofi. Sanofi was involved in the design of the study.
The authors wish to thank Ingrid Diana Monteiro and Deepali Garg, employees of Sanofi, for assistance in writing and publication support.
Data Statement
Qualified researchers may request access to patient level data and related study documents including the clinical study report, study protocol with any amendments, statistical analysis plan, and dataset specifications. Patient level data will be anonymized, and study documents will be redacted to protect the privacy of our participants. Further details on Sanofi’s data sharing criteria, eligible studies, and process for requesting access can be found at: https://vivli.org/.
Supplementary Materials
Supplementary materials can be found via https://doi.org/10.4258/hir.2025.31.1.66.