Genetic Algorithm-based Convolutional Neural Network Feature Engineering for Optimizing Coronary Heart Disease Prediction Performance
Abstract
Objectives
This study aimed to optimize early coronary heart disease (CHD) prediction using a genetic algorithm (GA)-based convolutional neural network (CNN) feature engineering approach. We sought to overcome the limitations of traditional hyperparameter optimization techniques by leveraging a GA for superior predictive performance in CHD detection.
Methods
Utilizing a GA for hyperparameter optimization, we navigated a complex combinatorial space to identify optimal configurations for a CNN model. We also employed information gain for feature selection optimization, transforming the CHD datasets into an image-like input for the CNN architecture. The efficacy of this method was benchmarked against traditional optimization strategies.
Results
The GA-based CNN model outperformed traditional methods. Hyperparameter optimization reached a peak validation accuracy of 85%, and the CNN-derived features yielded 100% accuracy when combined with machine learning algorithms, namely naïve Bayes, support vector machine, decision tree, logistic regression, and random forest, for both binary and multiclass CHD prediction tasks.
Conclusions
The integration of a GA into CNN feature engineering is a powerful technique for improving the accuracy of CHD predictions. This approach results in a high degree of predictive reliability and can significantly contribute to the field of AI-driven healthcare, with the possibility of clinical deployment for early CHD detection. Future work will focus on expanding the approach to encompass a wider set of CHD data and potential integration with wearable technology for continuous health monitoring.
I. Introduction
Coronary heart disease (CHD), identified by the World Health Organization as a leading cause of death, develops from arterial blockages that impair heart function. Cardiovascular diseases, including CHD, claim approximately 17.9 million lives each year, underscoring the importance of early detection and personalized management to reduce patient risk [1]. In healthcare, the use of artificial intelligence (AI), machine learning, and deep learning is on the rise, with deep learning showing particular potential for improving the prognosis of CHD and customizing patient care [2]. While these technologies play a crucial role in enhancing prediction accuracy [3], predicting CHD remains a challenging task [4].
Machine learning has shown promising results in predicting CHD by developing models that utilize established risk indicators such as age, sex, blood pressure, cholesterol levels, smoking habits, and family medical history. Key algorithms, including naïve Bayes, XGBoost, k-nearest neighbor, multilayer perceptron, and support vector machine (SVM) [5], have been used in these predictions. Feature selection, particularly through information gain (IG), enhances the accuracy, simplicity, and efficiency of these models, making them more interpretable and better suited for practical application in real-world settings [6]. IG is instrumental in identifying the most critical features for predicting CHD [7], thus optimizing the prediction process.
Hyperparameter optimization plays a crucial role in machine learning, aiming to improve model accuracy and prevent overfitting by balancing bias and variance. Hyperparameters are model configurations that are not learned from data but are set in advance to boost network performance. Because the success of a model is significantly influenced by the values of its hyperparameters [11], techniques such as grid search [8], randomized search [9], and Bayesian optimization [10] are frequently used to tune them. The extensive range of hyperparameter combinations presents a considerable challenge [12] and is approached as a multi-objective problem to optimize performance indices [13]. Meta-heuristic methods, including the ant colony system [14] and genetic algorithms (GAs) [15], address this issue by exploring the space of candidate configurations. Our study utilized a GA, as GAs have proven to be effective in optimizing hyperparameters.
Employing feature engineering with a convolutional neural network (CNN) involves the automatic extraction and representation of relevant features from raw data by the network itself. The CNN is specifically designed to learn and extract features hierarchically through its convolutional layers, which automatically identify discriminative features from raw data, thus reducing the reliance on manual feature engineering [16]. This approach eliminates the need for human-designed features and enables the network to autonomously identify essential patterns in the data. CNNs offer high accuracy and flexibility, depending on sufficient data for their design and training, and facilitate enhanced classification across various domains [17]. Determining the hyperparameters is crucial for effective performance. GA-based CNNs have demonstrated satisfactory results in applications such as crop pest image classification [18], croup cough classification [19], and pattern recognition [20]. They also have the potential to improve the outcomes of CHD prediction.
Utilizing a GA for CNN hyperparameter optimization offers several advantages over traditional methods. GAs are particularly effective at balancing exploration and exploitation, which helps them avoid local optima and identify well-performing hyperparameters. Because a GA evaluates only a sample of the search space, it is computationally more efficient than exhaustive methods such as grid search. Additionally, GAs are highly adaptable, capable of managing the complex, high-dimensional search spaces typical of deep learning models. Their scalability facilitates the efficient optimization of large models, and the mechanisms of crossover and mutation contribute to preventing overfitting. Moreover, the parallelizability of GAs enhances their efficiency by allowing the simultaneous evaluation of multiple solutions. Overall, this approach provides a robust and flexible solution for optimizing CHD prediction models.
Public datasets for CHD prediction, such as the Cleveland dataset from the UCI Machine Learning Repository, are structured in tabular form. However, CNNs, which are primarily designed for image data, encounter challenges when processing tabular data. Converting tabular data into images [21] preserves spatial information, enabling CNNs to effectively capture spatial dependencies and patterns [22]. This method improves the model’s ability to detect intricate patterns and relationships that might not be fully discernible when using tabular data directly with a CNN.
This study introduced a GA-based CNN feature engineering method aimed at optimizing the accuracy of early CHD detection. The main contributions of this research include: (1) utilizing IG for the selection of critical features; (2) improving the prediction performance of early CHD detection through the application of GA for hyperparameter tuning in CNN; and (3) converting tabular data into image format to more effectively capture spatial correlations and improve CHD prediction accuracy.
II. Methods
1. Dataset
In this study, the Cleveland dataset from the UCI Machine Learning Repository was employed. This database includes 76 features, with a particular focus on the target feature that indicates the presence of heart disease. Typically, research has concentrated on a subset of 14 attributes from records of 303 patients. The dataset is primarily used for binary classification to distinguish between cases with heart disease (499 instances) and those without (526 instances). Additionally, a multiclass risk stratification approach was employed to categorize instances into five levels: class 0 (no heart disease), class 1 (low risk, 55 instances), class 2 (moderate risk, 36 instances), class 3 (high risk, 35 instances), and class 4 (very high risk, 13 instances).
2. Proposed Method
This paper introduces a CNN that is enhanced by a GA and specifically designed for the early detection of CHD, as illustrated in Figure 1. The approach was structured as follows: initially, a CHD dataset was acquired and preprocessed to address missing values. Subsequently, an IG technique was employed to identify the most predictive features, which were then converted into an image-like format suitable for CNN analysis. The CNN was further refined using a GA that optimized its hyperparameters, improving its feature extraction capabilities. The optimized features were then utilized to train various machine learning classifiers, including naïve Bayes, SVM, decision tree, and logistic regression. A rigorous evaluation employing metrics such as accuracy, precision, recall, and F1-score showed that this comprehensive approach, which integrates data preparation, feature engineering, and machine learning, markedly improved the accuracy of CHD prediction.
For feature selection, as illustrated in Figure 2A, IG was utilized to identify the most predictive features. This involved several steps: calculating entropy, determining the conditional entropy for each feature, computing IG, and ranking the features. This phase focuses on selecting the most informative features, ensuring that the model highlights the key predictors of CHD. The subsequent challenge involves adapting the tabular CHD dataset for CNN application by transforming it into a three-dimensional, image-like structure. Figure 2B outlines this transformation process, which includes defining the dataset dimensions, specifying the number of channels based on grouped categories from the selected features, and configuring them into a 3 × 4 × 1 input shape for CNN analysis.
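The four IG steps listed above (entropy, conditional entropy, IG, ranking) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes discrete feature values (continuous features such as cholesterol would first need to be binned), and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for a discrete feature X."""
    total = len(labels)
    # Group the labels by the value taken by the feature.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Conditional entropy: entropy of each group weighted by its frequency.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """Return (name, IG) pairs sorted from most to least informative."""
    scores = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical values, not taken from the Cleveland dataset).
toy_labels = [1, 1, 0, 0, 1, 0, 1, 0]
toy_columns = {
    "exang": [1, 1, 0, 0, 1, 0, 1, 0],  # identical to the label in this toy set
    "fbs": [0, 1, 0, 1, 0, 1, 1, 0],    # carries no label information here
}
ranking = rank_features(toy_columns, toy_labels)
```

In this toy set the first feature recovers the full 1 bit of label entropy and the second recovers none, which mirrors how a feature with an IG of zero would be dropped from the ranking.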
As illustrated in Figure 2C, our GA-based CNN starts with an input layer designed to accommodate a 3 × 4 × 1 representation of the dataset’s features. Following this, the convolutional layer carries out several convolution operations, each succeeded by batch normalization to improve training speed and stability.
A GA optimizes the hyperparameters of the CNN. It generates candidate solutions (sets of hyperparameters), evaluates their fitness (e.g., validation accuracy), and applies crossover and mutation to produce new candidates. The best solutions are selected for subsequent generations, ultimately converging on an optimal set. The output of this optimized CNN serves as a refined feature set, which enhances the accuracy of traditional machine learning models in predicting CHD. The performance is rigorously evaluated using metrics such as accuracy, precision, recall, F1-score, specificity, G-mean, and p-value.
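The GA loop described above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the configuration used in the study: the search space values, the truncation selection, the uniform crossover, the mutation rate, and the elitism count are all hypothetical, and `toy_fitness` is a stand-in for training the CNN and returning its validation accuracy. Only the population size of 10 and the 10 generations follow the values reported in the Results.

```python
import random

# Hypothetical search space; the candidate values are assumptions.
SEARCH_SPACE = {
    "filters": [16, 32, 64, 128],
    "dense_units": [32, 64, 128],
    "activation": ["relu", "selu", "tanh"],
    "optimizer": ["adam", "adagrad", "sgd"],
    "epochs": [50, 100, 150],
}
GENES = list(SEARCH_SPACE)


def random_individual(rng):
    """Draw one hyperparameter set uniformly from the search space."""
    return {g: rng.choice(SEARCH_SPACE[g]) for g in GENES}


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: each gene is copied from one parent at random."""
    return {g: rng.choice([parent_a[g], parent_b[g]]) for g in GENES}


def mutate(individual, rng, rate=0.1):
    """Resample each gene from the search space with probability `rate`."""
    return {
        g: rng.choice(SEARCH_SPACE[g]) if rng.random() < rate else v
        for g, v in individual.items()
    }


def run_ga(fitness, pop_size=10, generations=10, elite=2, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(*rng.sample(parents, 2), rng), rng)
            for _ in range(pop_size - elite)
        ]
        population = ranked[:elite] + children  # elitism keeps the best unchanged
    best = max(population, key=fitness)
    return best, history


def toy_fitness(ind):
    """Stand-in for validation accuracy; a real run would train the CNN here."""
    score = 0.5
    score += 0.1 * (ind["activation"] == "selu")
    score += 0.1 * (ind["optimizer"] == "adagrad")
    score += 0.05 * (ind["filters"] == 64)
    return score


best, history = run_ga(toy_fitness)
```

Because the elite individuals are carried over unchanged, the best fitness recorded in `history` can never decrease from one generation to the next. In a real run the fitness call dominates the cost, so the candidates in each generation would typically be evaluated in parallel.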
3. Evaluation
An evaluation was conducted to assess the performance of CHD prediction using several metrics beyond accuracy, providing a comprehensive overview of the model’s capabilities. This included an analysis of recall, precision, F1-score, specificity, and G-mean. Additionally, we report the p-value for each metric to demonstrate its statistical significance. This expanded evaluation provides a more nuanced understanding of the model’s strengths and limitations across different performance dimensions.
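For the binary task, the metrics named above can all be derived from the four confusion-matrix counts. The sketch below uses the common definitions, with G-mean taken as the geometric mean of recall and specificity; the text does not state the exact formulas, so this choice is an assumption. The function name and toy predictions are hypothetical, and the p-value is omitted because the statistical test is not specified.

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Compute standard binary classification metrics from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def safe_div(num, den):
        # Return 0.0 instead of raising when a class is absent.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)        # also called sensitivity
    specificity = safe_div(tn, tn + fp)
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "g_mean": math.sqrt(recall * specificity),
    }


# Toy predictions (hypothetical): 3 TP, 1 FN, 1 FP, 5 TN.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

For the multiclass task, the same counts would be computed per class in a one-vs-rest fashion and then averaged; the averaging scheme is a further choice the text does not specify.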
III. Results
1. Results of Preprocessing: A Foundation for Analysis
1) Imputation of missing values
The CHD dataset was examined for missing values to ensure data integrity and completeness. Neither the binary nor the multiclass dataset contained any missing values, so no imputation was required and we proceeded directly to feature selection and modeling.
2) Feature selection
Using IG for feature selection, 12 features were identified as significant for both the binary and multiclass CHD datasets. In the binary dataset, the significant features along with their IG values are as follows: cholesterol (“chol,” 0.212439), maximum heart rate (“thalach,” 0.164900), chest pain (“cp,” 0.150377), ST depression (“oldpeak,” 0.148694), thalassemia (“thal,” 0.138485), visible vessels (“ca,” 0.111282), exercise angina (“exang,” 0.094309), age (0.085682), heart rate slope (“slope,” 0.076467), resting blood pressure (“trestbps,” 0.060951), sex (“sex,” 0.046134), and ECG results (“restecg,” 0.009632). Fasting blood sugar (“fbs”) with a value of 0.000000 was not considered informative. The correlations among these features are depicted in Figure 3A.
For the multiclass dataset, the notable features were chest pain (“cp,” 0.141675), thalassemia (“thal,” 0.127007), exercise angina (“exang,” 0.111727), maximum heart rate (“thalach,” 0.090939), ST depression (“oldpeak,” 0.090754), visible vessels (“ca,” 0.076833), heart rate slope (“slope,” 0.076124), and age (0.027281). Interestingly, sex (“sex”), resting blood pressure (“trestbps”), cholesterol (“chol”), and ECG results (“restecg”) each had an IG of zero and were excluded from the model; unlike the binary dataset, the multiclass dataset therefore did not include sex as a factor. The heatmap displayed in Figure 3B illustrates the correlation among the features in the multiclass dataset.
3) Image representation from the CHD dataset
The transformation of the CHD dataset into an image representation produced a CNN input of 303 samples. Each sample comprises three channels of four features each, giving an overall shape of 303 × 3 × 4 × 1. Consider, for example, four features of a single CHD sample: age, cholesterol level, blood pressure, and heart rate. These features are normalized and mapped onto three distinct color channels (e.g., red, green, and blue), and each feature is positioned within a 4 × 1 grid for each channel. Figure 4 illustrates this concept, where a specific combination of feature values is visually represented. The intensity of each color within a square corresponds to the normalized value of a feature, providing a comprehensive visual summary of the dataset.
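The reshaping step can be illustrated with a short sketch. It assumes min-max normalization and a simple row-by-row layout of 12 selected features into a 3 × 4 × 1 grid; how the features are actually grouped into channels is not detailed here, so the ordering, the function names, and the toy values are all hypothetical.

```python
def min_max_normalize(rows):
    """Scale every column of a list of equal-length rows to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant column has no range, so it is mapped to 0.0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def to_image(row, height=3, width=4):
    """Reshape one flat feature row into a height x width x 1 nested list."""
    if len(row) != height * width:
        raise ValueError(
            "expected %d features, got %d" % (height * width, len(row))
        )
    return [
        [[row[r * width + c]] for c in range(width)]
        for r in range(height)
    ]


# Two hypothetical samples with 12 already-selected features each.
samples = [
    [float(i) for i in range(12)],
    [float(2 * i) for i in range(12)],
]
images = [to_image(row) for row in min_max_normalize(samples)]
```

Applied to all 303 records, the same two steps would yield the 303 × 3 × 4 × 1 input described above.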
2. Results of Feature Engineering using GA-based CNN Hyperparameter Optimization
A GA with a population of 10 across 10 generations was used to perform optimization, thereby improving the CNN’s feature-processing layer. Initially, the GA was applied to the binary CHD dataset, and subsequently, the determined optimal hyperparameters were also used for the multiclass CHD data.
Figure 5A depicts the validation accuracy achieved at each stage of the GA evolution, while Figure 5B displays the minimum, maximum, and average validation accuracies across 10 GA generations. An analysis of these figures reveals that the highest hyperparameter optimization accuracy reached 85.1021%. This peak performance was achieved using a combination of various filter sizes, fully connected layer sizes, activation functions (“selu” and “relu”), and the “adagrad” optimizer over a span of 140 epochs. In contrast, the lowest accuracy recorded was 57.2591%, obtained with a slightly altered set of parameters for the filters and layers and the same “adagrad” optimizer, this time over 139 epochs. The gap between these two configurations shows how strongly hyperparameter choices influence the model’s validation accuracy.
3. Results of GA-based CNN Feature Engineering and CHD Prediction
Figure 6 illustrates the performance of the hyperparameter optimization using the GA-based CNN throughout the training and testing phases. Several machine learning algorithms, including naïve Bayes, SVM, decision tree, logistic regression, and random forest, were used for CHD prediction due to their proven effectiveness in previous studies.
Tables 1 and 2 present the results of applying different machine learning algorithms to the binary and multiclass CHD datasets, respectively. These results specifically demonstrate the performance achieved using the highest hyperparameter optimization accuracy of 85.1021%, as determined by our GA-based CNN feature engineering approach. It is important to note that our exploration of hyperparameter optimization produced a spectrum of accuracies, with the lowest recorded accuracy being 57.2591%.
IV. Discussion
1. Implications
This study shows that integrating a GA with a CNN substantially enhances the prediction of CHD. Our approach surpassed traditional feature engineering, achieving 100% accuracy and highlighting the effectiveness of genetic algorithms in automating hyperparameter selection for CNN-based feature engineering within complex datasets.
Comparisons with previously reported methods revealed that our GA-based CNN model exhibited superior performance, as illustrated in Table 3 [9,23–26]. In particular, our proposed method achieved an accuracy of 100%, compared with previous high accuracies such as 97.52% reported by Valarmathi and Sheela [9] using random forest and XGBoost with grid and randomized search for hyperparameter optimization, and 99.51% reported by Najafi et al. [27] using particle swarm optimization.
Although we focused on IG for feature selection, other methods, such as Lasso-CNN combined with different optimization techniques (e.g., particle swarm optimization), have also achieved accuracies exceeding 97% [9], which highlights the impact of methodological choices and dataset characteristics on performance. This aligns with our emphasis on multi-criteria evaluation. Additionally, our work introduces a novel approach to CHD prediction compared to that used by Najafi et al. [27], utilizing a GA-based CNN for feature engineering and representing tabular data in an image-like format, which improves the analysis capabilities of the CNN.
The significance of our GA-based CNN model lies in its substantial potential to enhance public health through improved CHD detection accuracy. This improvement could facilitate better prevention strategies and more personalized medical treatments. By employing genetic algorithms for feature engineering, this approach not only boosts prediction precision but also supports the development of AI tools in healthcare. The success of this model underscores the critical role of interdisciplinary collaboration in incorporating such innovations into clinical practice, ultimately enhancing patient care and healthcare efficiency.
2. Limitations
While promising, the findings of our study must be viewed within the context of certain limitations that could affect their broader applicability. One significant limitation is the sample size and diversity of the dataset. Because the study relied on a single public dataset, the model’s ability to generalize to different demographics, including variations in ethnicity, age groups, or regions characterized by distinct CHD patterns and risk profiles, may be limited.
The exclusive reliance on historical data presents another limitation, as it may introduce biases that could skew the model’s predictive capacity. To enhance future studies, incorporating a more inclusive dataset that includes prospective data and diverse ethnic backgrounds would help ensure the robustness of the results across various populations. Additionally, the predictive performance may differ between research and clinical settings, as real-world applications necessitate thorough model validation and adjustments to accommodate the complexities of clinical data variability and operational integration.
Finally, despite the strong performance in feature selection and hyperparameter refinement, there is a risk of overfitting when dealing with complex datasets. It is imperative to confirm the model’s ability to generalize to new and unseen data, which is critical for its practical applicability in healthcare.
3. Future Works and Conclusion
Future research should prioritize multi-institutional validation of the GA-based CNN to confirm its applicability and effectiveness across diverse patient groups. Prospective clinical trials are crucial for evaluating its integration and performance in real-world settings. Further work could also explore synergies with other machine learning methods to increase prediction accuracy. Efficient integration of the model within clinical workflows and electronic health record systems is essential for its practical adoption. Concurrently, the ethical use of AI and cost-effectiveness analyses will play a significant role in determining the model’s sustainability and its alignment with healthcare priorities, thereby advancing the role of AI in personalized patient care.
In conclusion, this study underscores the effectiveness of GA-based feature engineering in optimizing CNNs for CHD prediction. Our approach marks a substantial advancement in the use of machine learning techniques within the field of cardiovascular medicine.
Acknowledgments
We thank the Office of Research and Community Service (LPPM) Universitas Dian Nuswantoro (UDINUS) for all the support that made this study possible, under contract number 109/A.38-04/UDN-09/XI/2023.
Notes
Conflict of Interest
No potential conflict of interest relevant to this article was reported.