Feature Selection for Hypertension Risk Prediction Using XGBoost on Single Nucleotide Polymorphism Data

Article information

Healthc Inform Res. 2025;31(1):16-22
Publication date (electronic) : 2025 January 31
doi: https://doi.org/10.4258/hir.2025.31.1.16
1Department of Informatics Engineering, Faculty of Computer Science, Brawijaya University, Malang, Indonesia
2Department of Biology, Faculty of Mathematics and Natural Sciences, Brawijaya University, Malang, Indonesia
3Department of Statistics, Faculty of Mathematics and Natural Sciences, Brawijaya University, Malang, Indonesia
4Department of Histology, Faculty of Medicine, Maranatha Christian University, Bandung, Indonesia
Corresponding Author: Lailil Muflikhah, Department of Informatics Engineering, Faculty of Computer Science, Brawijaya University, Malang 65145, Indonesia. Tel: +62-341-577-911, E-mail: lailil@ub.ac.id (https://orcid.org/0000-0001-7903-0576)
Received 2024 May 31; Revised 2024 September 6; Accepted 2024 October 28.

Abstract

Objectives

Hypertension, commonly known as high blood pressure, is a prevalent and serious condition affecting a significant portion of the adult population globally. It is a chronic medical issue that, if left unaddressed, can lead to severe health complications, including kidney problems, heart disease, and stroke. This study aims to develop a feature selection model using the XGBoost algorithm to identify specific single nucleotide polymorphisms (SNPs) as biomarkers for detecting hypertension risk.

Methods

We propose using the high dimensionality of genetic variations (i.e., SNPs) to build a classifier model for prediction. In this study, SNPs were used as markers for hypertension in patients. We utilized the OpenSNP dataset, which includes 19,697 SNPs from 2,052 samples. Extreme gradient boosting (XGBoost) is an ensemble machine learning method employed here for feature selection, which incrementally adjusts weights in a series of steps.

Results

The experimental results identified 292 SNPs that exhibited high performance, with an F1-score of 98.55%, precision of 98.73%, recall of 98.38%, and overall accuracy of 98%. This study provides compelling evidence that the XGBoost feature selection method outperforms other representative feature selection methods, such as genetic algorithms, analysis of variance, chi-square, and principal component analysis, in predicting hypertension risk, demonstrating its effectiveness.

Conclusions

We developed a model for predicting hypertension using the SNPs dataset. The high dimensionality of SNP data was effectively managed to identify significant features as biomarkers using the XGBoost feature selection method. The results indicate high performance in predicting the risk of hypertension.

I. Introduction

Hypertension, commonly known as high blood pressure, occurs when blood pressure readings consistently reach 140/90 mmHg or higher [1]. Although hypertension can be detected in primary healthcare settings, the primary cause of most cases remains unidentified, with genetic factors believed to play a significant role [2]. Given the substantial global health impact of hypertension, it is crucial to prevent the associated risks through early identification and assessment. A thorough review and critical assessment of machine learning methodologies will facilitate their successful integration into healthcare practice, benefiting both society and public health [3].

Previous studies on hypertension risk identification primarily utilized clinical data, blood pressure measurements, imaging techniques, and physiological signal data [1,4,5]. However, incorporating single nucleotide polymorphism (SNP) data, as we propose in this study, presents a potentially useful approach for implementing prevention strategies. Utilizing genetic information could enhance the precision and individualization of prevention, leading to better public health program outcomes. Exploring SNP data is of great importance to both academics and health practitioners and requires a careful examination of its potential benefits and drawbacks, particularly concerning ethical considerations and privacy issues [6-8]. Identifying the primary attributes that strongly correlate with blood pressure could lead to improved predictive models. Therefore, feature selection strategies are employed to streamline the number of predictors. Research incorporating hybrid feature selection into machine learning classifier models for hypertension diagnosis has shown improved performance [9,10]. The extreme gradient boosting (XGBoost) algorithm, known for its robustness and predictive capability, has shown promise in this area.
However, efficient feature selection remains challenging, particularly when dealing with large genetic datasets like SNPs, which contain a vast number of potential predictors. This results in a high-dimensional dataset where the number of features significantly exceeds the number of samples. Several studies have indicated that many genes in DNA microarray datasets may not be directly relevant or informative for accurate disease diagnosis [11-14]. This study aimed to enhance the feature selection model using the XGBoost algorithm to identify significant predictors for hypertension risk based on SNPs. The methodology includes integrating the XGBoost feature selection method with analysis of variance (ANOVA), principal component analysis (PCA), chi-square (Chi2), and a genetic algorithm (GA). Additionally, specific SNPs are proposed to improve diagnostic accuracy and efficiency, thereby reducing barriers to early intervention and enhancing patient care outcomes.

II. Methods

The method proposed in this research includes several steps that are essential for achieving the study's objectives, as illustrated in Figure 1 for predictive modeling. The initial step involves data scraping, during which raw data is collected from various sources. This is followed by data preprocessing, a crucial stage that ensures the data is clean and well-organized before proceeding to analysis and modeling. Subsequently, feature selection is performed, where only the most important and relevant attributes are retained in the dataset. In this study, we employed various methods including ANOVA, Chi2, PCA, GA, and XGBoost. These techniques were then applied in turn to develop a classifier model for predicting hypertension.

Figure 1

Block diagram of the general proposed method. XGBoost: extreme gradient boosting, ANOVA: analysis of variance, PCA: principal component analysis, GA: genetic algorithm, SNP: single nucleotide polymorphism, GWAS: genome-wide association studies.

1. Data Scraping

SNP rs699 is a variant associated with hypertension or high blood pressure. We conducted web scraping on the OpenSNP website to obtain the dataset for SNP rs699. This dataset includes information about individuals' genes. The steps for web scraping the dataset are detailed in the Algorithm, as presented in Table 1. Through the web scraping process, we successfully collected a total of 2,100 raw data samples.

Algorithm of web scraping

For data scraping, we utilized the Python libraries Beautiful Soup [15] and Requests [16]. This process involved collecting raw SNP data as features, which were subsequently preprocessed and analyzed to predict hypertension using machine learning models.
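As a hedged sketch, the scraping step might look like the following; the "user-list" id is taken from the Algorithm in Table 1, while the function names and page structure are assumptions and may not match the live OpenSNP site.

```python
# Hypothetical sketch of the Table 1 scraping procedure using
# Requests and Beautiful Soup. The "user-list" id comes from the
# Algorithm; the live site's markup may differ.
import requests
from bs4 import BeautifulSoup

def extract_user_links(html):
    """Return the hrefs listed under the element with id="user-list"."""
    soup = BeautifulSoup(html, "html.parser")
    user_list = soup.find(id="user-list")
    return [a["href"] for a in user_list.find_all("a", href=True)]

def scrape_user_links(url):
    """Download the listing page and extract each user's link."""
    html = requests.get(url, timeout=30).text
    return extract_user_links(html)
```

Each user link would then be visited in turn to download the genotype raw data, as in steps 8-10 of the Algorithm.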

2. Data Processing

On the OpenSNP website, the data sources include 23andMe and Ancestry, each with distinct structures for presenting genotype information. In the 23andMe dataset, genotype data is consolidated into a single column named “genotype.” Conversely, the Ancestry dataset splits the genotype into two separate columns, labeled “allele1” and “allele2.” To standardize our analysis, we undertook data preprocessing to align the genotype columns across both datasets. Specifically, for the Ancestry data, we merged the information from “allele1” and “allele2” into a newly formed “genotype” column.
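A minimal pandas sketch of this harmonization step follows; the column names come from the text, but the sample rows are made up for illustration.

```python
# Align Ancestry-style data (separate "allele1"/"allele2" columns)
# with the 23andMe-style single "genotype" column. Sample values
# are illustrative only.
import pandas as pd

ancestry = pd.DataFrame({
    "rsid": ["rs699", "rs1799945"],
    "allele1": ["T", "C"],
    "allele2": ["C", "G"],
})

# Merge the two allele columns into a newly formed "genotype" column.
ancestry["genotype"] = ancestry["allele1"] + ancestry["allele2"]
```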

1) Data cleaning

In the first stage, we selected SNPs that were completely recorded for all participants. Non-alphabetic values such as "0.0," "–," and "0" identified in the dataset were replaced with "NaN." This substitution with "NaN," a standard notation for missing data, ensures uniform handling of missing values throughout the analysis. Because we later compare our automated SNP selection process with manual methods, we specifically retained the columns referenced in several previous studies [17,18]. Applying the preprocessing steps to the raw data reduced the total number of samples in the dataset from 2,100 to 2,052.
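The substitution step can be sketched in pandas as follows (the toy DataFrame and its values are illustrative; the placeholder strings are the ones named above):

```python
# Replace the non-alphabetic placeholder values with NaN so that
# missing genotypes are handled uniformly downstream.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rs699": ["TC", "0", "CC"],
    "rs1799945": ["–", "CG", "0.0"],
})
df = df.replace(["0.0", "–", "0"], np.nan)
```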

2) One-hot encoding

We employed one-hot encoding as the final preprocessing step to prepare the data for machine learning. One-hot encoding is a technique that transforms categorical data into a format that can be effectively utilized by machine learning algorithms, which often require numerical inputs. After applying one-hot encoding to the dataset, the number of columns increased significantly, from 19,697 to 101,855. This increase in columns is a direct consequence of converting categorical data into a binary format.
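The encoding step can be sketched with pandas' `get_dummies`; the toy column below is illustrative, whereas the real dataset expands 19,697 genotype columns into 101,855 binary columns.

```python
# One-hot encode a categorical genotype column into binary
# indicator columns, one per observed genotype value.
import pandas as pd

df = pd.DataFrame({"rs699": ["TT", "TC", "CC"]})
encoded = pd.get_dummies(df, columns=["rs699"])
# Produces the columns rs699_CC, rs699_TC, rs699_TT.
```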

3. Feature Selection

Feature selection is aimed at identifying characteristics within a large dataset to enhance efficiency and reduce complexity. By selecting key features, we can improve model performance, prevent overfitting, and facilitate pattern recognition. Selecting an appropriate feature selection technique leads to a prediction model that is both more accurate and easier to understand [5,9,10,19]. This research integrated various feature selection approaches with machine learning classifiers, including ANOVA, Chi2, PCA, GA, and XGBoost.

ANOVA feature selection is a technique employed to identify which features significantly impact the target variable within a dataset. This method involves comparing the means of groups generated by different features to determine if their differences significantly affect the target variable [20].

Chi2 feature selection is a statistical technique that evaluates the independence between categorical variables and the target variable. This method calculates the Chi2 statistic for each feature to assess the strength of association with the target class. It is widely used in machine learning pipelines to enhance model performance by retaining only the most informative features [21].

PCA feature selection transforms high-dimensional data into a smaller set of uncorrelated variables known as principal components. These components capture the most significant variance in the data, thereby retaining essential information while reducing the feature space. By selecting the principal components that account for the majority of the variance, PCA effectively streamlines the feature set, making it a valuable tool in machine learning pipelines [22].
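The three statistical methods above have off-the-shelf implementations in scikit-learn; the sketch below applies each to a toy binary SNP matrix. The choice of k = 10 retained features is an assumption for illustration, not the paper's setting.

```python
# Illustrative ANOVA, Chi2, and PCA feature reduction with
# scikit-learn on a random binary matrix standing in for SNP data.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, chi2
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 50)).astype(float)  # toy SNP matrix
y = rng.integers(0, 2, size=100)                      # toy labels

X_anova = SelectKBest(f_classif, k=10).fit_transform(X, y)
X_chi2 = SelectKBest(chi2, k=10).fit_transform(X, y)  # requires non-negative X
X_pca = PCA(n_components=10).fit_transform(X)         # unsupervised projection
```

Note that ANOVA and Chi2 select a subset of the original columns, while PCA produces new composite variables, which is why PCA offers dimensionality reduction rather than interpretable SNP selection.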

GA feature selection is an optimization technique based on evolutionary principles, inspired by the process of natural selection. It operates by iteratively generating populations of feature subsets, evaluating their fitness based on model performance, and applying genetic operators such as selection, crossover, and mutation to evolve improved subsets over generations. By optimizing the feature set, GA can enhance model accuracy and reduce computational costs, making it a powerful tool in machine learning [23].
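A compact GA feature-selection loop might look like the following; the population size, number of generations, mutation rate, and fitness function (3-fold cross-validated logistic-regression accuracy) are illustrative choices, not the paper's settings.

```python
# Minimal genetic-algorithm feature selection: binary masks are
# evolved by selection, one-point crossover, and bit-flip mutation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(80, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only two features are informative

def fitness(mask):
    """Cross-validated accuracy of a model on the masked feature subset."""
    if mask.sum() == 0:
        return 0.0
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask.astype(bool)], y, cv=3).mean()

pop = rng.integers(0, 2, size=(10, X.shape[1]))   # initial random masks
for _ in range(5):                                # generations
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-5:]]        # selection: keep best half
    children = []
    for _ in range(5):
        a, b = parents[rng.integers(5)], parents[rng.integers(5)]
        cut = rng.integers(1, X.shape[1])         # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(X.shape[1]) < 0.05      # bit-flip mutation
        child = np.where(flip, 1 - child, child)
        children.append(child)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]  # best mask found
```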

4. XGBoost

XGBoost has been used for feature selection in classification tasks, improving predictive performance and identifying important variables in complex datasets [24]. The XGBoost feature selection process starts with a comprehensive clinical dataset that includes a variety of features related to hypertension. Initially, a decision tree is built, with nodes being split based on an objective function aimed at minimizing prediction errors. The importance of each feature is then reweighted based on its effectiveness in reducing these errors. This is an iterative process, where each subsequent decision tree adjusts for the residuals left by its predecessors, continuously refining the importance of each feature. Ultimately, the final model aggregates the contributions from all decision trees, identifying the set of features that most effectively predict hypertension. These individual classifiers or predictors are then combined to form a more robust and accurate model [25]. The objective function is shown in (1).

(1) Obj(θ) = Σ_{i=1}^{n} l(y_i, ŷ_i) + Σ_{k=1}^{K} Ω(f_k)

where l is the training loss, Ω is the regularization term that penalizes the complexity of each tree, K is the number of trees, and each f_k is a function in the functional space F, the set of all possible classification and regression trees. The prediction for a sample x_i is the sum of the outputs of all K trees:

(2) ŷ_i = Σ_{k=1}^{K} f_k(x_i), f_k ∈ F.
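The importance-reweighting and selection step described above can be sketched as follows. The paper uses XGBoost; scikit-learn's GradientBoostingClassifier is used here as a readily available stand-in, and the 0.01 importance threshold is an assumption (the paper sets its threshold manually).

```python
# Train a gradient-boosted tree ensemble, then keep only the
# features whose importance exceeds a (manually chosen) threshold.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = (X[:, 0] - X[:, 3] > 0).astype(int)  # features 0 and 3 carry the signal

model = GradientBoostingClassifier(n_estimators=100).fit(X, y)
importances = model.feature_importances_        # per-feature importance scores
selected = np.flatnonzero(importances > 0.01)   # manual threshold, as in the paper
X_selected = X[:, selected]                     # reduced feature matrix
```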

5. Performance Evaluation

To evaluate the performance of the results, we utilized several metrics derived from the confusion matrix, as illustrated in Table 2. These include precision, recall, accuracy, F1-score, and area under the curve (AUC) (Table 3). These metrics offer a thorough assessment of the model’s effectiveness, encompassing various aspects of its predictive capabilities, from accurately identifying positive cases to overall accuracy and the model’s capacity to differentiate between classes.

Confusion matrix

Performance metrics
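The Table 3 metrics can be computed directly from the four confusion-matrix counts; the counts below are made up purely for illustration.

```python
# Compute the Table 3 metrics from illustrative confusion-matrix counts.
tp, tn, fp, fn = 90, 85, 10, 15

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)
# The averaged form used in Table 3 (mean of sensitivity and specificity).
auc = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
```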

III. Results

We collected data from OpenSNP.org and processed 2,052 samples. In our study, we classified the target column based on the value of rs699. If the value of rs699 is TT or AA, we assigned the target as 0, indicating a normal risk of hypertension. Conversely, if the value of rs699 is anything other than TT or AA, we assigned the target as 1, indicating an increased risk of hypertension. The class distribution in the data is imbalanced, with approximately 69% of the samples falling into the "risked" target class and about 31% into the "normal" target class. To address this data imbalance, we employed stratified k-fold cross-validation with k set to 10. This method ensures that each fold maintains the same proportion of class labels as the original dataset, allowing for a more reliable evaluation of model performance on imbalanced data.
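The labelling rule and the stratified split can be sketched as follows (a toy DataFrame and 2 folds are used for brevity; the study uses 10 folds).

```python
# Derive the binary target from rs699 (TT or AA -> 0, otherwise -> 1)
# and build stratified folds that preserve the class proportions.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.DataFrame({"rs699": ["TT", "TC", "CC", "AA", "CT", "TT"]})
df["target"] = (~df["rs699"].isin(["TT", "AA"])).astype(int)

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
folds = list(skf.split(df[["rs699"]], df["target"]))
```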

In addition to the five automatic SNP selection methods—ANOVA, PCA, GA, Chi2, and XGBoost—we also evaluated three manual SNP selection approaches. These approaches are (1) SNPs identified as significant in genome-wide association studies (GWAS) [17]; (2) the top 10 SNPs based on their ranking [26]; and (3) four specific SNPs [18]. Table 4 displays the complete list of SNPs that were selected based on these three papers.

SNPs manually selected from other papers

The experimental results demonstrate that XGBoost feature selection effectively predicted hypertension risk using the SNP dataset, as evidenced by its performance in F1-score, precision, recall, accuracy, and AUC, as shown in Table 5.

Performance evaluation of feature selection methods

IV. Discussion

Research on hypertension risk prediction from genetic variant datasets has been conducted with machine learning and feature selection methods. The identified SNPs can be implemented in an early prediction model for hypertension screening programs. This study demonstrates the effectiveness of the XGBoost feature selection method in predicting hypertension risk. XGBoost achieves high performance and accurately identifies the relevant SNP features for precise predictions. This approach improves the efficiency of risk assessment and offers insights into the factors that contribute to hypertension.

The experimental results indicate that the AUC of hypertension prediction using XGBoost remained high and stable once the number of iterations exceeded 40, as illustrated in Figure 2. In hypertension risk prediction, maintaining a high and consistent AUC with the XGBoost algorithm is regarded as a favorable outcome.

Figure 2

Area under the curve (AUC) of the XGBoost feature selection and classifier.

The loss function of the proposed method is both low and stable starting from the 40th iteration, as illustrated in Figure 3. A low loss function indicates a close alignment between the model’s predictions and the actual hypertension risk values. Stability in the loss function is crucial as it demonstrates the model’s ability to consistently perform well across different data subsets or throughout multiple iterations. This consistency is essential for ensuring the reliability of the model’s predictions, protecting them from being overly influenced by external variables or random data fluctuations.

Figure 3

Loss function of the XGBoost feature selection method and classifier.

However, a limitation of this study concerning the implementation of feature selection using XGBoost centers on the manual setting of the threshold value related to feature importance. This manual intervention introduces the potential for bias or subjective errors in determining the appropriate threshold for selecting relevant features. Furthermore, the manual process prolongs the analysis time and reduces efficiency, particularly when handling large and complex datasets. Developing an automated method for determining the threshold would significantly improve the accuracy and efficiency of the feature selection process.
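One possible automation, offered as a suggestion rather than the paper's method, is scikit-learn's SelectFromModel with a data-driven threshold such as the median importance; the stand-in estimator and toy data below are assumptions.

```python
# Data-driven importance thresholding: SelectFromModel keeps the
# features whose importance is at least the median, removing the
# manual threshold-setting step discussed above.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 20))
y = (X[:, 0] > 0).astype(int)  # feature 0 carries the signal

selector = SelectFromModel(
    GradientBoostingClassifier(n_estimators=50),
    threshold="median",        # data-driven instead of hand-tuned
).fit(X, y)
X_reduced = selector.transform(X)
```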

Notes

Conflict of Interest

No potential conflict of interest relevant to this article was reported.

Acknowledgments

This research is supported by Brawijaya University under the Program of Superior Research Grant Program “HPU 2023” (Contract No. 612.41/UN10.C20/2023).

References

1. NCD Risk Factor Collaboration (NCD-RisC). Worldwide trends in hypertension prevalence and progress in treatment and control from 1990 to 2019: a pooled analysis of 1201 population-representative studies with 104 million participants. Lancet 2021;398(10304):957–80. https://doi.org/10.1016/S0140-6736(21)01330-1.
2. Adlung L, Cohen Y, Mor U, Elinav E. Machine learning in clinical decision making. Med 2021;2(6):642–65. https://doi.org/10.1016/j.medj.2021.04.006.
3. Silva GF, Fagundes TP, Teixeira BC, Chiavegatto Filho AD. Machine learning for hypertension prediction: a systematic review. Curr Hypertens Rep 2022;24(11):523–33. https://doi.org/10.1007/s11906-022-01212-6.
4. AlKaabi LA, Ahmed LS, Al Attiyah MF, Abdel-Rahman ME. Predicting hypertension using machine learning: findings from Qatar Biobank Study. PLoS One 2020;15(10):e0240370. https://doi.org/10.1371/journal.pone.0240370.
5. Martinez-Rios E, Montesinos L, Alfaro-Ponce M, Pecchia L. A review of machine learning in hypertension detection and blood pressure estimation based on clinical and physiological data. Biomed Signal Process Control 2021;68:102813. https://doi.org/10.1016/j.bspc.2021.102813.
6. Alzubi R, Ramzan N, Alzoubi H, Katsigiannis S. SNPs-based hypertension disease detection via machine learning techniques. In : Proceedings of 2018, 24th International Conference on Automation and Computing (ICAC); 2018 Sep 6–8; Newcastle Upon Tyne, UK. p. 1–6. https://doi.org/10.23919/IConAC.2018.8748972.
7. Antony Raj CB, Nagarajan H, Aslam MH, Panchalingam S. SNP identification and discovery. In : Gupta MK, Behera L, eds. Bioinformatics in rice research: theories and techniques Singapore: Springer; 2021. p. 361–86. https://doi.org/10.1007/978-981-16-3993-7_17.
8. Kurland L, Liljedahl U, Lind L. Hypertension and SNP genotyping in antihypertensive treatment. Cardiovasc Toxicol 2005;5(2):133–42. https://doi.org/10.1385/ct:5:2:133.
9. Park HW, Li D, Piao Y, Ryu KH. A hybrid feature selection method to classification and its application in hypertension diagnosis. In : Bursa M, Holzinger A, Renda M, Khuri S, eds. Information technology in bio-and medical informatics Cham, Switzerland: Springer International Publishing; 2017. p. 11–9. https://doi.org/10.1007/978-3-319-64265-9_2.
10. Peng Y, Xu J, Ma L, Wang J. Prediction of hypertension risks with feature selection and XGBoost. J Mech Med Biol 2021;21(05):2140028. https://doi.org/10.1142/S0219519421400285.
11. Asmare Z, Erkihun M. Recent application of DNA microarray techniques to diagnose infectious disease. Pathol Lab Med Int 2023;15:77–82. https://doi.org/10.2147/PLMI.S424275.
12. Liu L, So AY, Fan JB. Analysis of cancer genomes through microarrays and next-generation sequencing. Transl Cancer Res 2015;4(3):212–8. https://doi.org/10.3978/j.issn.2218-676X.2015.05.04.
13. Beck DB, Petracovici A, He C, Moore HW, Louie RJ, Ansar M, et al. Delineation of a human Mendelian disorder of the DNA demethylation machinery: TET3 deficiency. Am J Hum Genet 2020;106(2):234–45. https://doi.org/10.1016/j.ajhg.2019.12.007.
14. Gupta S, Gupta MK, Shabaz M, Sharma A. Deep learning techniques for cancer classification using microarray gene expression data. Front Physiol 2022;13:952709. https://doi.org/10.3389/fphys.2022.952709.
15. Hassan F. Beautiful soup: a python library for web scraping [Internet] San Francisco (CA): Medium; 2023. [cited at 2024 Mar 6]. Available from: https://blog.devgenius.io/introduction-to-beautiful-soup-a-python-library-for-web-scraping-21cacb9cf088.
16. Requests. Requests: HTTP for Humans (Release v2.31.0) [Internet] [place unknown]: Requests; 2023. [cited at 2023 Nov 21]. Available from: https://requests.kenneth-reitz.org/en/latest/.
17. Li C, Sun D, Liu J, Li M, Zhang B, Liu Y, et al. A prediction model of essential hypertension based on genetic and environmental risk factors in Northern Han Chinese. Int J Med Sci 2019;16(6):793–9. https://doi.org/10.7150/ijms.33967.
18. Lim NK, Lee JY, Lee JY, Park HY, Cho MC. The role of genetic risk score in predicting the risk of hypertension in the Korean population: Korean genome and epidemiology study. PLoS One 2015;10(6):e0131603. https://doi.org/10.1371/journal.pone.0131603.
19. Hasan N, Bao Y. Comparing different feature selection algorithms for cardiovascular disease prediction. Health Technol 2021;11(1):49–62. https://doi.org/10.1007/s12553-020-00499-2.
20. Kumar M, Rath NK, Swain A, Rath SK. Feature selection and classification of microarray data using MapReduce based ANOVA and K-nearest neighbor. Procedia Comput Sci 2015;54:301–10. https://doi.org/10.1016/j.procs.2015.06.035.
21. Cai L, Lv S, Shi K. Application of an improved CHI feature selection algorithm. Discrete Dyn Nat Soc 2021;2021(1):9963382. https://doi.org/10.1155/2021/9963382.
22. Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci 2016;374(2065):20150202. https://doi.org/10.1098/rsta.2015.0202.
23. Holland JH. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence Cambridge (MA): MIT Press; 1992.
24. Mandal M, Singh PK, Ijaz MF, Shafi J, Sarkar R. A Tristage wrapper-filter feature selection framework for disease classification. Sensors (Basel) 2021;21(16):5571. https://doi.org/10.3390/s21165571.
25. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In : Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016 Aug 13–17; San Francisco, CA, USA. p. 785–94. https://doi.org/10.1145/2939672.2939785.
26. Lajevardi SA, Kargari M, Daneshpour MS, Akbarzadeh M. Hypertension risk prediction based on SNPS by machine learning models. Curr Bioinform 2023;18(1):55–62. https://doi.org/10.2174/157489361766622101109332.


Table 1

Algorithm of web scraping

Line Instruction
1 Input:
2  url: URL of the website

3 Output:
4  raw_data: Genotype raw data

5 Procedure:
6  Download HTML file from url
7  Access href under the id=“user-list”
8  Open each user's link
9  Access the href on each user's page
10  Download genotype raw data into raw_data
11 return raw_data

Table 2

Confusion matrix

Predicted: NO Predicted: YES
Actual: NO TN FP
Actual: YES FN TP

TP (true positive) represents cases predicted as hypertension risk, where the patients actually have hypertension; TN (true negative), cases predicted as not having hypertension risk (normal), and the patients indeed do not have hypertension; FP (false positive), cases predicted as hypertension risk, but the patients do not actually have hypertension; FN (false negative), cases predicted as not having hypertension risk, but the patients actually have hypertension.

Table 3

Performance metrics

Measure Formula Remark
Precision Precision = TP / (TP + FP) Proportion of predicted positive cases that are correct
Recall Recall = TP / (TP + FN) A classifier's ability to correctly identify positive instances
Accuracy Accuracy = (TP + TN) / (TP + TN + FP + FN) Overall proportion of correct predictions
F1-score F1-score = 2 × (precision × recall) / (precision + recall) Comprehensive assessment of prediction performance
AUC (area under the curve) AUC = ½ × (TP / (TP + FN) + TN / (TN + FP)) The classifier's capability to minimize false predictions

TP: true positive, TN: true negative, FP: false positive, FN: false negative.

Table 4

SNPs manually selected from other papers

Feature selection SNPs
FourSNPs [18] rs995322, rs17249754, rs1378942, rs12945290
GWAS [17] rs17030613, rs16849225, rs6825911, rs1173766, rs11066280, rs35444, rs880315, rs11191548, rs17249754, rs9810888, rs11067763, rs820430, rs1902859, rs4409766, rs4757391, rs1887320, rs13143871, rs1991391
TenSNPs [26] rs6506537, rs10021303, rs380914, rs3768939, rs4150161, rs31864, rs1925458, rs991316, rs1799945, rs12509878

SNP: single nucleotide polymorphism.

Table 5

Performance evaluation of feature selection methods

Feature selection Number of SNPs Mean precision Mean recall Mean F1-score Mean accuracy Mean AUC
FourSNPs 4 0.6900 0.9901 0.8132 0.6862 0.4998
GWAS 18 0.7066 0.8369 0.7660 0.6477 0.5317
TenSNPs 10 0.6882 0.8962 0.7785 0.6482 0.4960
AllSNPs 19,697 0.9781 0.9675 0.9727 0.9625 0.9594
PCA 19,697 0.7130 0.9541 0.8161 0.7032 0.5494
Chi2 61 0.7096 0.8694 0.7813 0.6642 0.5385
ANOVA 7 0.9879 0.9626 0.9750 0.9659 0.9679
XGBoost 292 0.9873 0.9838 0.9855 0.9800 0.9777
GA 19,697 0.9773 0.9661 0.9716 0.9610 0.9579

SNP: single nucleotide polymorphism, GWAS: genome-wide association studies, PCA: principal component analysis, Chi2: chi-square, GA: genetic algorithm.

Bold font indicates the best performance in each measurement.