The importance of the prediction of coronary heart disease (CHD) has been recognized in Korea; however, few studies have been conducted in this area. Therefore, it is necessary to develop a method for the prediction and classification of CHD in Koreans.

A model for CHD prediction must be designed according to rule-based guidelines. In this study, a CHD prediction model for Koreans was developed using fuzzy logic and a decision tree (classification and regression tree [CART]). Datasets derived from the Korean National Health and Nutrition Examination Survey VI (KNHANES-VI) were utilized to generate the proposed model.

The rules were generated using a decision tree technique, and fuzzy logic was applied to overcome problems associated with uncertainty in CHD prediction.

The accuracy and the area under the receiver operating characteristic (ROC) curve of the proposed system were 69.51% and 0.594, respectively, indicating that the proposed method performed better than the other models compared.

Coronary heart disease (CHD) has the highest mortality rate of all the non-communicable diseases throughout the world. Therefore, the prediction of CHD is necessary for reducing the management costs of CHD and for promoting health [

Thus far, many previous studies have proposed methods to predict CHD using data mining, artificial intelligence, and machine learning techniques [

Therefore, it is necessary to develop a CHD prediction model for Koreans using data mining. In Korea, few studies have aimed to produce guidelines for CHD prediction. Rules based on guidelines are therefore required, and these should be produced using a data mining technique [

In this study, a data-mining-driven CHD prediction model was developed using fuzzy logic and a decision tree. Datasets derived from the Korean National Health and Nutrition Examination Survey VI (KNHANES-VI) were utilized to produce the proposed model [

The Framingham Risk Score (FRS), PROCAM, and Adult Treatment Panel III (ATP III) have been used as standard guidelines for predicting CHD and CHD risk factors for the last 10 years. Therefore, the factors stated in these guidelines were used as a reference for data extraction.

Clinical data were acquired from KNHANES-VI, a survey conducted by the Korea Centers for Disease Control and Prevention. KNHANES contains national statistics on the health and nutritional status of Koreans, and it provides a basis for policy establishment and the evaluation of the comprehensive national health promotion plan [

Of the 8,108 survey subjects in KNHANES-VI, 7,329 uncertain respondents and 31 people aged less than 20 years were excluded. The final dataset comprised 748 subjects.

A classification model and a process for dealing with uncertain data are required to predict CHD. The process of the CHD prediction model is shown in

The prediction model is a fuzzy-logic-based inference method that requires a rule base and fuzzy membership functions. Rule induction was performed on the KNHANES dataset using the decision tree method. The generated rules were then transformed for use in the fuzzy inference engine [

Formal rules were extracted from the continuous dataset of observations by rule induction. In this study, a decision tree technique was used to generate the rules. CART is known to be a useful approach for pruning leaf nodes, which enhances the generalization capability of learned trees when the generated trees have an excessive number of steps and leaf nodes. CART can also be analyzed and interpreted to yield propositional knowledge in the form of a set of 'If-Then' rules. Therefore, a CHD prediction model for Koreans was produced by applying the CART rule induction algorithm to KNHANES-VI.
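To make the rule induction step concrete, the following is a minimal sketch of the split criterion commonly associated with CART (Gini impurity on a binary threshold). It is not the SPSS Modeler implementation used in the study; the function names, the toy ages, and the toy labels are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of binary labels (0 = no CHD, 1 = CHD)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (threshold, weighted Gini) of the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_thr, best_imp = float("nan"), float("inf")
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        thr = (lo + hi) / 2.0  # candidate threshold: midpoint of neighbours
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / n
        if imp < best_imp:
            best_thr, best_imp = thr, imp
    return best_thr, best_imp


# Hypothetical toy data, not taken from KNHANES-VI.
ages = [34, 41, 48, 52, 57, 63, 66, 71]
chd = [0, 0, 0, 0, 1, 1, 0, 1]

threshold, impurity = best_split(ages, chd)
rule = f"IF age > {threshold} THEN CHD risk is high"
print(rule, f"(weighted Gini = {impurity:.4f})")
```

A full tree repeats this search recursively on each child node and then prunes; each root-to-leaf path reads as one 'If-Then' rule of the kind passed to the fuzzy inference engine.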

Fuzzy logic is a multi-valued logic that is useful for solving uncertainty problems, and it can address the degree of membership and degrees of truth. CHD-related data contains considerable uncertainty; hence, the data is inferred using fuzzy logic.

The fuzzy inference model determines the CHD risk level by inference using the heart-disease-related input data. The continuous dataset and categorical dataset were used as the input data. The input continuous dataset comprised the age, total cholesterol, LDL cholesterol, HDL cholesterol, systolic blood pressure, and diastolic blood pressure. The uncertainty of the continuous data was handled by the fuzzifier, which mapped each value to degrees of membership via the fuzzy membership functions.
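As an illustration of fuzzification, the sketch below maps a total cholesterol value to membership degrees using trapezoidal functions. The set names and breakpoints are assumptions made for this example only; they are not the membership functions used in the study.

```python
from math import inf
from typing import Dict, Tuple


def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear between.

    Using -inf for a and b (or +inf for c and d) gives an open shoulder.
    """
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Hypothetical fuzzy sets for total cholesterol in mg/dL (illustrative only).
CHOLESTEROL_SETS: Dict[str, Tuple[float, float, float, float]] = {
    "desirable": (-inf, -inf, 190.0, 210.0),
    "borderline": (190.0, 210.0, 230.0, 250.0),
    "high": (230.0, 250.0, inf, inf),
}


def fuzzify(x: float, sets: Dict[str, Tuple[float, float, float, float]]) -> Dict[str, float]:
    """Return the membership degree of x in every fuzzy set."""
    return {name: trapezoid(x, *params) for name, params in sets.items()}


memberships = fuzzify(205.0, CHOLESTEROL_SETS)
print(memberships)
```

With these assumed breakpoints, a value of 205 mg/dL belongs partly to "desirable" and partly to "borderline", which is how a fuzzifier represents a measurement that lies near a clinical cutoff.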

The categorical dataset contained Boolean data, such as sex, smoking, and diabetes; hence, the fuzzy membership function was not required. After the fuzzified values and categorical data had been input, the fuzzy inference engine performed inference using the rules. The Mamdani max-min approach was used as the inference mechanism, while defuzzification used the center of gravity (COG) method to produce the final output.
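The inference step can be sketched as follows: each rule fires with the minimum of its antecedent degrees, the consequent set is clipped at that strength, the clipped sets are combined with the maximum, and the COG is taken over a discretized output range. The two rules, the 0 to 100 risk scale, the output sets, and the input degrees are hypothetical; the study's rule base came from CART and was implemented in MATLAB.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Antecedent = Tuple[str, str]  # (variable, linguistic label)
Rule = Tuple[List[Antecedent], Callable[[float], float]]


def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


# Hypothetical output sets on an assumed 0-100 risk scale.
def risk_low(r: float) -> float:
    return tri(r, -50.0, 0.0, 50.0)


def risk_high(r: float) -> float:
    return tri(r, 50.0, 100.0, 150.0)


def mamdani_cog(inputs: Dict[str, Dict[str, float]],
                rules: List[Rule],
                universe: Iterable[float]) -> Optional[float]:
    """Mamdani max-min inference followed by centre-of-gravity defuzzification."""
    # Firing strength of each rule: minimum over its antecedent degrees.
    fired = [(min(inputs[var][label] for var, label in ants), cons)
             for ants, cons in rules]
    num = den = 0.0
    for r in universe:
        # Clip each consequent at its firing strength (min), aggregate with max.
        mu = max(min(w, cons(r)) for w, cons in fired)
        num += r * mu
        den += mu
    return num / den if den > 0.0 else None  # None when no rule fires


# Fuzzified continuous inputs plus a crisp categorical input (degrees 0 or 1).
inputs = {
    "cholesterol": {"desirable": 0.25, "high": 0.75},
    "sbp": {"normal": 0.4, "high": 0.6},
    "smoking": {"no": 0.0, "yes": 1.0},
}
rules: List[Rule] = [
    ([("cholesterol", "high"), ("sbp", "high"), ("smoking", "yes")], risk_high),
    ([("cholesterol", "desirable"), ("sbp", "normal")], risk_low),
]

risk = mamdani_cog(inputs, rules, range(0, 101))
print(f"Defuzzified CHD risk: {risk:.2f}")
```

Categorical inputs enter as crisp degrees of 0 or 1, so they pass through the same min operation without needing a membership function.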

The proposed CHD risk prediction model was implemented and evaluated.

IBM SPSS Modeler 14.2 was used for rule induction. CART was applied with a pruning severity of 75%, a minimum of two records per child branch, a boosting number restricted to 10 for individual options, and the highest probability rule model. MATLAB R2009b with the Fuzzy Logic Toolbox was used to produce the fuzzy inference model. A confusion matrix was used to evaluate the predictive model [

The true positive (TP) value was the number of cases in which CHD patients were correctly predicted to have CHD, and the true negative (TN) value was the number of cases in which healthy subjects were correctly predicted to be non-heart-disease patients. The false positive (FP) value was the number of cases in which a healthy subject was predicted to have CHD, and the false negative (FN) value was the number of cases in which a patient with CHD was predicted to be healthy.
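For reference, the standard confusion-matrix measures (accuracy, sensitivity, specificity, PPV, and NPV) can be computed as in the sketch below. The labels and predictions are hypothetical toy values, not results from the study, and the function name is an assumption.

```python
from typing import Dict, List


def evaluate(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Confusion-matrix measures for binary labels (1 = CHD, 0 = healthy)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # CHD predicted as CHD
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy predicted as healthy
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy predicted as CHD
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # CHD predicted as healthy

    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")  # undefined when den is 0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, fp + tn),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical toy labels: TP = 3, FN = 1, FP = 2, TN = 4.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
metrics = evaluate(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A model can reach high sensitivity while its specificity stays low if it predicts CHD liberally, which is why both measures are reported alongside accuracy.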

Our model was compared with previous results using an artificial neural network (ANN) [

The experimental results showed that the ANN, LR, and SVM had relatively high accuracy rates of 62.78%, 63.23%, and 67.71%, respectively, although they were lower than that of the proposed model because ANN and SVM only made observations at the learning level. C5.0, a decision-tree-based method, yielded an accuracy of 53.36%. The proposed model had accuracy and sensitivity scores of 69.51% and 93.10%, respectively, which were higher than those of the other models. The higher accuracy and sensitivity of the proposed model can be attributed to the reduction of uncertainty achieved by using fuzzy logic; CART, which was used for rule induction, cannot process uncertainty adequately on its own. ANN and SVM learn and reason about the complex relationships among the training data, but they do not resolve the problem of uncertainty, whereas the proposed model overcomes the uncertainty of the data by using fuzzy logic. However, the specificity of the proposed model was lower than that of the other models. Thus, future studies are required to develop a prediction model with higher specificity. The area under the ROC curve of the proposed model (0.594) was higher than that of the other models, and this can support decision-making in the prediction of CHD.
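The area under the ROC curve can be read as the probability that a randomly chosen CHD case receives a higher risk score than a randomly chosen healthy subject, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the scores and labels are hypothetical, and this is not necessarily how the reported value of 0.594 was obtained.

```python
from typing import List


def auc(y_true: List[int], scores: List[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # Each (positive, negative) pair scores 1 if ranked correctly, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy scores: three of the four pairs are ranked correctly.
labels = [1, 1, 0, 0]
risk_scores = [0.9, 0.4, 0.5, 0.1]
area = auc(labels, risk_scores)
print(f"AUC = {area:.2f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for interpreting the values reported for each model.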

This paper proposed a novel predictive model for CHD based on data derived from KNHANES-VI, which were collected by the Korea Centers for Disease Control and Prevention. The proposed model provides decision support for the prediction of CHD by utilizing fuzzy logic and CART-based rule induction. Rule induction was performed on the KNHANES-VI datasets using the CART decision tree method, and the prediction model used an inference model based on fuzzy logic. Fuzzy membership functions were created based on those used in previous case studies and FRS. A final dataset containing 748 subjects was selected from KNHANES-VI and used for the performance evaluation. The experimental results showed that the proposed model improved the prediction accuracy and sensitivity. The proposed model is expected to offer decision support for CHD prediction.

Future research should focus on developing data-mining-based prediction methods that increase both the accuracy and the specificity of CHD prediction.

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2013 M3C8A2A02078403).

Values are presented as mean (standard deviation) or number.

HDL: high-density lipoprotein, LDL: low-density lipoprotein, BP: blood pressure.

ANN: artificial neural network, SVM: support vector machine, LR: logistic regression, PPV: positive predictive value, NPV: negative predictive value, TP: true positive, TN: true negative, FP: false positive, FN: false negative.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Sensitivity = TP / (TP + FN)

Specificity = TN / (FP + TN)

PPV = TP / (TP + FP)

NPV = TN / (TN + FN)

ROC: receiver operating characteristic, AUC: area under ROC curve, CI: confidence interval, ANN: artificial neural network, SVM: support vector machine, LR: logistic regression.