Feature Selection for Hypertension Risk Prediction Using XGBoost on Single Nucleotide Polymorphism Data
Abstract
Objectives
Hypertension, commonly known as high blood pressure, is a prevalent and serious condition affecting a significant portion of the adult population globally. It is a chronic medical issue that, if left unaddressed, can lead to severe health complications, including kidney problems, heart disease, and stroke. This study aims to develop a feature selection model using the XGBoost algorithm to identify specific single nucleotide polymorphisms (SNPs) as biomarkers for detecting hypertension risk.
Methods
We propose using the high dimensionality of genetic variations (i.e., SNPs) to build a classifier model for prediction. In this study, SNPs were used as markers for hypertension in patients. We utilized the OpenSNP dataset, which includes 19,697 SNPs from 2,052 samples. Extreme gradient boosting (XGBoost), an ensemble machine learning method employed here for feature selection, builds decision trees sequentially, with each tree correcting the residual errors of its predecessors and reweighting feature importance at each step.
Results
The experimental results identified 292 SNPs with which the model achieved high performance: an F1-score of 98.55%, precision of 98.73%, recall of 98.38%, and overall accuracy of 98%. This study provides evidence that the XGBoost feature selection method outperforms other representative feature selection methods, namely genetic algorithms, analysis of variance, chi-square, and principal component analysis, in predicting hypertension risk.
Conclusions
We developed a model for predicting hypertension using the SNPs dataset. The high dimensionality of SNP data was effectively managed to identify significant features as biomarkers using the XGBoost feature selection method. The results indicate high performance in predicting the risk of hypertension.
I. Introduction
Hypertension, commonly known as high blood pressure, occurs when blood pressure readings consistently reach 140/90 mmHg or higher [1]. Although hypertension can be detected in primary healthcare settings, the primary cause of most cases remains unidentified, with genetic factors believed to play a significant role [2]. Given the substantial global health impact of hypertension, it is crucial to prevent the associated risks through early identification and assessment. A thorough review and critical assessment of machine learning methodologies will facilitate their successful integration into healthcare practice, benefiting both society and public health [3].

Previous studies on hypertension risk identification primarily utilized clinical data, blood pressure measurements, imaging techniques, and physiological signal data [1,4,5]. However, incorporating single nucleotide polymorphism (SNP) data, as we propose in this study, presents a potentially useful approach for implementing prevention strategies. Utilizing genetic information could enhance the precision and individualization of prevention, leading to better public health program outcomes. Exploring SNP data is of great importance to both academics and health practitioners and requires a careful examination of its potential benefits and drawbacks, particularly concerning ethical considerations and privacy issues [6–8].

Identifying the primary attributes that strongly correlate with blood pressure could lead to improved predictive models. Therefore, feature selection strategies are employed to streamline the number of predictors. Research incorporating hybrid feature selection into machine learning classifier models for hypertension diagnosis has shown improved performance [9,10]. The extreme gradient boosting (XGBoost) algorithm, known for its robustness and predictive capability, has shown promise in this area. However, efficient feature selection remains challenging, particularly for large genetic datasets such as SNPs, which contain a vast number of potential predictors. The result is a high-dimensional dataset in which the number of features far exceeds the number of samples. Several studies have indicated that many genes in DNA microarray datasets may not be directly relevant or informative for accurate disease diagnosis [11–14].

This study aimed to enhance the feature selection model using the XGBoost algorithm to identify significant predictors of hypertension risk based on SNPs. The methodology compares the XGBoost feature selection method with analysis of variance (ANOVA), principal component analysis (PCA), chi-square (Chi2), and a genetic algorithm (GA). Additionally, specific SNPs are proposed to improve diagnostic accuracy and efficiency, thereby reducing barriers to early intervention and enhancing patient care outcomes.
II. Methods
The method proposed in this research includes several steps that are essential for achieving the study's objectives, as illustrated in Figure 1 for predictive modeling. The initial step involves data scraping, during which raw data is collected from various sources. This is followed by data preprocessing, a crucial stage that ensures the data is clean and well-organized before analysis and modeling. Subsequently, feature selection is performed, so that only the most important and relevant attributes are retained in the dataset. In this study, we employed several methods: ANOVA, Chi2, PCA, GA, and XGBoost. Each of these techniques was then used to develop a classifier model for predicting hypertension.

Figure 1. Block diagram of the general proposed method. XGBoost: extreme gradient boosting, ANOVA: analysis of variance, PCA: principal component analysis, GA: genetic algorithm, SNP: single nucleotide polymorphism, GWAS: genome-wide association studies.
1. Data Scraping
SNP rs699 is a variant associated with hypertension, or high blood pressure. We conducted web scraping on the OpenSNP website to obtain the dataset for SNP rs699. This dataset includes individuals' genotype information. The steps for web scraping the dataset are detailed in the Algorithm presented in Table 1. Through the web scraping process, we collected a total of 2,100 raw data samples.
For data scraping, we utilized the Python libraries Beautiful Soup [15] and Requests [16]. This process involved collecting raw SNP data as features, which were subsequently preprocessed and analyzed to predict hypertension using machine learning models.
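As a rough illustration of this step, the sketch below fetches and parses an OpenSNP page with Requests and Beautiful Soup. The /snps/<rsid> URL pattern and the link-harvesting logic are assumptions for illustration, not the exact scraper detailed in Table 1.

# Minimal scraping sketch; the page layout and link targets are assumptions.
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://opensnp.org"  # assumed endpoint layout

def fetch_snp_page(rsid: str) -> BeautifulSoup:
    """Download and parse the HTML page for a given SNP."""
    response = requests.get(f"{BASE_URL}/snps/{rsid}", timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

def extract_links(page: BeautifulSoup) -> list[str]:
    """Collect hyperlinks (e.g., to user genotype data) from the parsed page."""
    return [a["href"] for a in page.find_all("a", href=True)]

if __name__ == "__main__":
    soup = fetch_snp_page("rs699")
    print(extract_links(soup)[:10])  # inspect the first few candidate links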
2. Data Preprocessing
On the OpenSNP website, the data sources include 23andMe and Ancestry, each with distinct structures for presenting genotype information. In the 23andMe dataset, genotype data is consolidated into a single column named “genotype.” Conversely, the Ancestry dataset splits the genotype into two separate columns, labeled “allele1” and “allele2.” To standardize our analysis, we undertook data preprocessing to align the genotype columns across both datasets. Specifically, for the Ancestry data, we merged the information from “allele1” and “allele2” into a newly formed “genotype” column.
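A minimal sketch of this harmonization step with pandas; the column names follow the text, and the sample values are illustrative.

# Merge Ancestry-style allele columns into a 23andMe-style genotype column.
import pandas as pd

ancestry = pd.DataFrame({"rsid": ["rs699"], "allele1": ["A"], "allele2": ["G"]})

# Concatenate the two allele columns, then drop the originals so both
# data sources share a single "genotype" schema.
ancestry["genotype"] = ancestry["allele1"] + ancestry["allele2"]
ancestry = ancestry.drop(columns=["allele1", "allele2"])
print(ancestry)  # rsid, genotype -> rs699, AG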
1) Data cleaning
In the first stage, we selected SNPs that were completely recorded for all participants. Non-alphabetic values such as “0.0,” “–,” and “0” identified in the dataset were replaced with “NaN.” This substitution with “NaN,” a standard notation for missing data, ensures uniform handling of missing values throughout the analysis. We will compare our automated SNP selection process with manual methods; thus, we specifically selected columns referenced in several studies [17,18]. Applying the preprocessing steps to the raw data reduced the total number of samples in the dataset from 2,100 to 2,052.
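A sketch of the cleaning step in pandas, under the assumption that the merged genotypes sit in a samples-by-SNPs DataFrame; the placeholder strings mirror those named above, and the dash-style no-call marker varies by source.

# Replace non-alphabetic placeholders with NaN, then keep complete records.
import numpy as np
import pandas as pd

df = pd.DataFrame({"rs699": ["AG", "0.0", "GG"],
                   "rs4343": ["AA", "AA", "0"]})

# NaN is the standard missing-data marker, so downstream handling is uniform.
df = df.replace(["0.0", "--", "0"], np.nan)

# Drop samples with any remaining missing genotype (2,100 -> 2,052 in the study).
df = df.dropna()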
2) One-hot encoding
We employed one-hot encoding as the final preprocessing step to prepare the data for machine learning. One-hot encoding is a technique that transforms categorical data into a format that can be effectively utilized by machine learning algorithms, which often require numerical inputs. After applying one-hot encoding to the dataset, the number of columns increased significantly, from 19,697 to 101,855. This increase in columns is a direct consequence of converting categorical data into a binary format.
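With pandas, this step can be sketched as follows; the toy columns are illustrative, while in the study the expansion was from 19,697 to 101,855 columns.

# One-hot encode genotype strings into binary indicator columns.
import pandas as pd

genotypes = pd.DataFrame({"rs699": ["AG", "GG", "AA"],
                          "rs4343": ["AA", "AG", "AA"]})

# Each SNP column expands into one binary column per observed genotype,
# e.g., rs699 -> rs699_AA, rs699_AG, rs699_GG.
encoded = pd.get_dummies(genotypes)
print(encoded.columns.tolist())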
3. Feature Selection
Feature selection is aimed at identifying characteristics within a large dataset to enhance efficiency and reduce complexity. By selecting key features, we can improve model performance, prevent overfitting, and facilitate pattern recognition. Selecting an appropriate feature selection technique leads to a prediction model that is both more accurate and easier to understand [5,9,10,19]. This research integrated various feature selection approaches with machine learning classifiers, including ANOVA, Chi2, PCA, GA, and XGBoost.
ANOVA feature selection is a technique employed to identify which features significantly impact the target variable within a dataset. This method involves comparing the means of groups generated by different features to determine if their differences significantly affect the target variable [20].
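A minimal sketch of ANOVA-based selection with scikit-learn, using synthetic data and an assumed k.

# Score each feature with the ANOVA F-statistic and keep the top k.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (200, 10)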
Chi2 feature selection is a statistical technique that evaluates the independence between categorical variables and the target variable. This method calculates the Chi2 statistic for each feature to assess the strength of association with the target class. It is widely used in machine learning pipelines to enhance model performance by retaining only the most informative features [21].
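The same SelectKBest pattern applies with the chi-square score; note that Chi2 requires non-negative inputs, which one-hot encoded SNP data satisfies.

# Chi2 scores each feature's association with the class label.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))  # binary features, as after one-hot encoding
y = rng.integers(0, 2, size=200)

X_selected = SelectKBest(score_func=chi2, k=10).fit_transform(X, y)
print(X_selected.shape)  # (200, 10)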
PCA feature selection transforms high-dimensional data into a smaller set of uncorrelated variables known as principal components. These components capture the most significant variance in the data, thereby retaining essential information while reducing the feature space. By selecting the principal components that account for the majority of the variance, PCA effectively streamlines the feature set, making it a valuable tool in machine learning pipelines [22].
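A sketch with scikit-learn; the 95% retained-variance target is an assumption for illustration, not the study's setting.

# Keep enough principal components to explain 95% of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50)).astype(float)

pca = PCA(n_components=0.95)  # a fraction keeps components covering that variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())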
GA feature selection is an optimization technique based on evolutionary principles, inspired by the process of natural selection. It operates by iteratively generating populations of feature subsets, evaluating their fitness based on model performance, and applying genetic operators such as selection, crossover, and mutation to evolve improved subsets over generations. By optimizing the feature set, GA can enhance model accuracy and reduce computational costs, making it a powerful tool in machine learning [23].
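A compact, self-contained GA sketch for feature-subset search; the population size, mutation rate, and logistic-regression fitness model are illustrative assumptions.

# Evolve binary feature masks by selection, crossover, and mutation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=40, random_state=0)
rng = np.random.default_rng(0)
pop = rng.integers(0, 2, size=(20, X.shape[1]))  # 20 candidate feature masks

def fitness(mask):
    """Cross-validated accuracy of a model trained on the masked features."""
    if mask.sum() == 0:
        return 0.0
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask.astype(bool)], y, cv=3).mean()

for generation in range(10):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]            # selection: keep best half
    cut = rng.integers(1, X.shape[1], size=10)
    children = np.array([np.concatenate([parents[i % 10][:c],
                                         parents[(i + 1) % 10][c:]])
                         for i, c in enumerate(cut)])  # single-point crossover
    flip = rng.random(children.shape) < 0.02           # mutation: flip ~2% of bits
    children[flip] = 1 - children[flip]
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best))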
4. XGBoost
XGBoost has been used for feature selection for performing classification tasks, improving predictive performance, and identifying important variables in complex datasets [24]. The XGBoost feature selection process starts with a comprehensive clinical dataset that includes a variety of features related to hypertension. Initially, a decision tree is built, with nodes being split based on an objective function aimed at minimizing prediction errors. The importance of each feature is then reweighted based on its effectiveness in reducing these errors. This is an iterative process, where each subsequent decision tree adjusts for the residuals left by its predecessors, continuously refining the importance of each feature. Ultimately, the final model aggregates the contributions from all decision trees, identifying the set of features that most effectively predict hypertension. These individual classifiers or predictors are then combined to form a more robust and accurate model [25]. The ensemble prediction is shown in (1):

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}, \tag{1}$$

where $K$ is the number of trees, $f_k$ is a function in the functional space $\mathcal{F}$, and $\mathcal{F}$ is the set of all possible classification and regression trees.
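In code, this importance-based selection can be sketched with the xgboost package and scikit-learn's SelectFromModel; the 0.01 importance threshold is an assumed value (the Discussion notes that the study set this threshold manually).

# Fit boosted trees, then keep features whose importance exceeds the threshold.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=100, random_state=0)

model = XGBClassifier(n_estimators=100, eval_metric="logloss", random_state=0)
model.fit(X, y)

selector = SelectFromModel(model, threshold=0.01, prefit=True)
X_selected = selector.transform(X)
print(X_selected.shape)  # samples x retained features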
5. Performance Evaluation
To evaluate the performance of the results, we utilized several metrics derived from the confusion matrix, as illustrated in Table 2. These include precision, recall, accuracy, F1-score, and area under the curve (AUC) (Table 3). These metrics offer a thorough assessment of the model’s effectiveness, encompassing various aspects of its predictive capabilities, from accurately identifying positive cases to overall accuracy and the model’s capacity to differentiate between classes.
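All of these metrics are available in scikit-learn; a minimal sketch with placeholder predictions follows.

# Confusion-matrix-derived metrics; y_true/y_pred/y_score are toy arrays.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7]  # predicted probabilities for AUC

print(confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]]
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))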
III. Results
We collected data from OpenSNP.org and processed 2,052 samples. In our study, we classified the target column based on the value of rs699. If the value of rs699 was TT or AA, we assigned the target as 0, indicating a normal risk of hypertension. Otherwise, we assigned the target as 1, indicating an increased risk of hypertension. The class distribution in the data is imbalanced, with approximately 69% of the samples falling into the at-risk target class and about 31% into the normal target class. To address this imbalance, we employed stratified k-fold cross-validation with k set to 10. This method ensures that each fold maintains the same proportion of class labels as the original dataset, allowing for a more reliable evaluation of model performance on imbalanced data.
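A sketch of this evaluation setup; the data here are synthetic and the classifier settings are assumptions.

# Stratified 10-fold cross-validation preserves the ~69/31 class proportions.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 50))
y = rng.integers(0, 2, size=300)  # 0 = normal, 1 = at risk (from rs699)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(XGBClassifier(eval_metric="logloss"), X, y,
                         cv=cv, scoring="f1")
print(scores.mean())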
In addition to the five automatic SNP selection methods—ANOVA, PCA, GA, Chi2, and XGBoost—we also evaluated three manual SNP selection approaches. These approaches are (1) SNPs identified as significant in genome-wide association studies (GWAS) [17]; (2) the top 10 SNPs based on their ranking [26]; and (3) four specific SNPs [18]. Table 4 displays the complete list of SNPs that were selected based on these three papers.
The experimental results demonstrate that XGBoost feature selection effectively predicted hypertension risk using the SNP dataset, as evidenced by its performance in F1-score, precision, recall, accuracy, and AUC, as shown in Table 5.
IV. Discussion
Research on hypertension risk prediction from genetic variant datasets has been conducted using machine learning and feature selection methods. The potential SNPs identified in such work can be incorporated into early prediction models for hypertension screening programs. This study demonstrates the effectiveness of the XGBoost feature selection method in predicting hypertension risk: it achieves high performance and accurately identifies the relevant SNP features for precise predictions. This approach improves the efficiency of risk assessment and offers insights into the factors that contribute to hypertension.
The experimental results indicate that the AUC of hypertension prediction using XGBoost remained high and stable once the number of iterations exceeded 40, as illustrated in Figure 2. In hypertension risk prediction, maintaining a high and consistent AUC with the XGBoost algorithm is regarded as a favorable outcome.
The loss function of the proposed method is both low and stable starting from the 40th iteration, as illustrated in Figure 3. A low loss function indicates a close alignment between the model’s predictions and the actual hypertension risk values. Stability in the loss function is crucial as it demonstrates the model’s ability to consistently perform well across different data subsets or throughout multiple iterations. This consistency is essential for ensuring the reliability of the model’s predictions, protecting them from being overly influenced by external variables or random data fluctuations.
However, a limitation of this study concerning the implementation of feature selection using XGBoost centers on the manual setting of the threshold value related to feature importance. This manual intervention introduces the potential for bias or subjective errors in determining the appropriate threshold for selecting relevant features. Furthermore, the manual process prolongs the analysis time and reduces efficiency, particularly when handling large and complex datasets. Developing an automated method for determining the threshold would significantly improve the accuracy and efficiency of the feature selection process.
Notes
Conflict of Interest
No potential conflict of interest relevant to this article was reported.
Acknowledgments
This research was supported by Brawijaya University under the Superior Research Grant Program “HPU 2023” (Contract No. 612.41/UN10.C20/2023).