I. Introduction
Atrial fibrillation (AF), the most common type of arrhythmia, decreases cardiac function by disrupting atrial contraction and causing irregular ventricular activity, and it accounts for 85% of systemic thromboembolic events and 33% of strokes resulting from blood clots in the atria [1]. Among AF-induced complications, cerebral infarction produces a large number of lesions with a high recurrence rate and is therefore associated with high mortality and morbidity. Preventing strokes is thus of utmost importance, and doing so requires precise identification of the risk factors involved [2]. In patients with AF, the annual incidence of thromboembolism is approximately 4%-6%, and its risk factors include congestive heart failure, coronary artery disease, hypertension, age (>65 years), diabetes mellitus, mitral valve disease, and a history of thromboembolism. Active research is under way to identify the risk factors for stroke as a complication of AF [3].
With the recent introduction of the Electronic Medical Record (EMR) and the Electronic Health Record (EHR), collecting various types of clinical data has become easier. However, the clinical data within a hospital include unclear and incomplete records that are difficult to interpret. As a result, it is difficult to identify the important factors related to the diagnosis or prognosis of a given disease, or to obtain meaningful knowledge from such data [4,5]. In clinical medicine, independent risk factors for a disease are generally extracted by multivariate statistical analysis based on logistic regression, and a diagnosis or prediction model is then designed from them. This method may be the most effective for identifying independent risk factors; however, it has a limitation in that it is difficult to extract the relationships among the risk factors. To compensate for this limitation, many previous studies have proposed methods using various data mining techniques [5-10]. In particular, some studies [5,10] have proposed hybrid decision models (e.g., "multivariate statistical analysis and decision tree" and "rough set and decision tree") that combine the advantages of statistical analysis and machine learning; to verify their effectiveness, these models were applied to uncover useful information on acute appendicitis and heart failure. Following the approach of these studies [5,10], we combined multivariate statistical analysis with association rule mining to analyze the risk factors for cerebral infarction, a complication of AF, and we discuss how to obtain useful information for decision making.
II. Methods
1. Subjects
The study included 1,134 patients with AF who visited the outpatient clinic of the Dongsan Medical Center in Daegu, Korea, between September 1983 and September 2010. Medical records were collected on demographic characteristics, medical history, initial electrocardiographic (ECG) findings, and initial echocardiographic (Echo) findings; samples with missing values were excluded. To obtain information on the risk factors associated with cerebral infarction in patients with AF, the 227 patients with cerebral infarction complications were assigned to the study group, and the 907 patients without cerebral infarction complications formed the control group.
2. Statistical Analysis
To compare the study group with the control group, the chi-square test or Fisher's exact test was performed for categorical variables. For continuous variables, normality was assessed with the Kolmogorov-Smirnov test; Student's t-test was used if normality was satisfied, and the Mann-Whitney test was used otherwise. Categorical variables are given as percentages (%), and continuous variables as mean ± standard deviation. The level of statistical significance was defined as p < 0.05. To identify the independent risk factors for cerebral infarction, multivariate statistical analysis was conducted on the significant factors, with p-value thresholds of 0.05 for entry and 0.10 for removal. All statistical analyses were carried out using SPSS ver. 12.0 (SPSS Inc., Chicago, IL, USA).
3. Association Rule
The present study used the Apriori algorithm [11] to obtain information related to cerebral infarction. The Apriori algorithm is a typical association rule mining method; each generated rule is expressed as R: IF A THEN B (or A => B), and rules are extracted based on two free parameters, i.e., support and confidence. In addition, the generated rules are judged based on the lift (or improvement) measure [12].
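For reference, the three measures can be written out explicitly. The block below is a sketch using the standard textbook definitions, which are assumed here to match the ones intended in this study; |A| denotes the number of samples satisfying condition A, and |A ∩ B| the number satisfying both A and B.

```latex
\begin{align}
\mathrm{support}(A \Rightarrow B) &= \frac{|A \cap B|}{N} \\
\mathrm{confidence}(A \Rightarrow B) &= \frac{|A \cap B|}{|A|} \\
\mathrm{lift}(A \Rightarrow B) &= \frac{\mathrm{confidence}(A \Rightarrow B)}{|B| / N}
\end{align}
```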
In the definitions of these measures, |·| represents the cardinality of a set, i.e., the number of its elements, and N represents the number of samples in the entire data set.
Support represents the proportion of all samples that satisfy both condition A in the antecedent and condition B in the consequent, while confidence represents the proportion of samples satisfying both A and B among those that satisfy condition A. A confidence value of 1 means that outcome B always occurs when condition A holds (a certain rule); otherwise, the confidence takes a value between 0 and 1 (a possible rule). A lift (or improvement) value of >1 represents a positive association between A and B, a value of <1 represents a negative association, and a value of 1 indicates that A and B are independent. Therefore, more general or useful information has a higher confidence level together with a lift value of >1.
However, it is difficult to determine appropriate values for the two free parameters in association rule mining, because rules are obtained only if they meet the minimum thresholds for support and confidence. In this study, the minimum support was therefore fixed at 10%, and the minimum confidence was varied from 10% to 50%; cardiology specialists were consulted on the generated association rules, and the final confidence level was determined according to a clinical specialist's opinion. The Apriori component of a commercial data mining program, Clementine ver. 12.0 (SPSS Inc., Chicago, IL, USA), was used for these experiments, and default values were used for the other experimental parameters.
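To make the three measures concrete, the following minimal Python sketch computes them for a single rule over a small set of made-up records. The record contents, the item labels, and the function name `rule_measures` are hypothetical and chosen only for illustration; they are not the study data, and this is not the Clementine implementation.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical toy records: each patient is the set of conditions that hold.
# These values are illustrative only and are not taken from the study.
records: List[FrozenSet[str]] = [
    frozenset({"age>63", "hypertension", "infarction"}),
    frozenset({"age>63", "hypertension"}),
    frozenset({"age>63", "infarction"}),
    frozenset({"hypertension"}),
    frozenset({"age>63", "hypertension", "infarction"}),
    frozenset(),
    frozenset({"infarction"}),
    frozenset({"age>63"}),
]


def rule_measures(
    data: List[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    n = len(data)
    n_a = sum(1 for r in data if antecedent <= r)                   # |A|
    n_b = sum(1 for r in data if consequent <= r)                   # |B|
    n_ab = sum(1 for r in data if (antecedent | consequent) <= r)   # |A and B|
    support = n_ab / n
    confidence = n_ab / n_a if n_a else 0.0
    lift = confidence / (n_b / n) if n_b else 0.0
    return support, confidence, lift


support, confidence, lift = rule_measures(
    records,
    antecedent=frozenset({"age>63", "hypertension"}),
    consequent=frozenset({"infarction"}),
)
print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

In this toy data, 2 of the 8 records satisfy both sides of the rule and 3 satisfy the antecedent, so the rule would pass a 10% minimum support and a 50% minimum confidence threshold.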
IV. Discussion
Atrial fibrillation, the most common supraventricular arrhythmia, in which irregular atrial muscle contractions produce an irregular pulse, is a known major risk factor for thromboembolism. Furthermore, it leads to strokes and causes hemodynamic instability, deterioration of renal function, and systemic embolic events [13,14].
In the present study, multivariate statistical analysis using logistic regression was conducted to extract the risk factors for cerebral infarction complications in patients with AF, and the relationships among these factors were analyzed by applying association rule mining. The independent risk factors associated with cerebral infarction complications were found to be age, hypertension, initial ECG rhythm, and initial Echo LAD, and the following rules associated with cerebral infarction were obtained: 1) age >63 years, hypertension is present, initial ECG rhythm is AF, initial Echo LAD >4.06 cm => cerebral infarction (support, 6.88%; confidence, 33.48%); 2) age >63 years, hypertension is present, initial Echo LAD >4.06 cm => cerebral infarction (support, 7.50%; confidence, 31.84%); 3) hypertension is present, initial ECG rhythm is AF, initial Echo LAD >4.06 cm => cerebral infarction (support, 8.29%; confidence, 30.52%); and 4) age >63 years, hypertension is present, initial ECG rhythm is AF => cerebral infarction (support, 7.50%; confidence, 30.47%). In addition, the analysis using the web node revealed AF as the initial ECG rhythm to be the factor most closely associated with cerebral infarction in patients with AF, followed by initial Echo LAD >4.06 cm, age >63 years, and hypertension, in decreasing order of association.
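As a rough check on how the four rules compare with the baseline rate of cerebral infarction, the lift of each rule can be estimated from its reported confidence. The sketch below assumes that the consequent frequency is measured over all 1,134 patients (227 of whom had cerebral infarction, as stated in the Subjects section) and that lift is confidence divided by that frequency; the function name `lift_from_confidence` is hypothetical, and the resulting values are derived here, not reported by the study.

```python
# Group sizes taken from the Subjects section.
N_TOTAL = 1134
N_INFARCTION = 227

# (support %, confidence %) for rules 1-4, copied from the text above.
reported_rules = {
    1: (6.88, 33.48),
    2: (7.50, 31.84),
    3: (8.29, 30.52),
    4: (7.50, 30.47),
}

# Assumed baseline: proportion of all patients with cerebral infarction.
baseline = N_INFARCTION / N_TOTAL


def lift_from_confidence(confidence_pct: float, baseline_rate: float) -> float:
    """Estimate lift as confidence divided by the consequent's baseline rate."""
    return (confidence_pct / 100.0) / baseline_rate


lifts = {
    rule: lift_from_confidence(conf_pct, baseline)
    for rule, (_support_pct, conf_pct) in reported_rules.items()
}

for rule, value in lifts.items():
    print(f"rule {rule}: estimated lift = {value:.2f}")
```

Under these assumptions, all four estimated lift values are above 1, i.e., each antecedent is associated with a higher rate of cerebral infarction than the overall rate in the cohort.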
An existing numeric tool for estimating the risk of stroke in patients with atrial fibrillation is the CHADS2 (congestive heart failure, hypertension, age, diabetes mellitus, prior stroke or TIA or thromboembolism [doubled]) score [15]. In this tool, 1 point each is given for congestive heart failure, hypertension, age 75 years or older, and a history of diabetes mellitus, and 2 points are given for a history of stroke or transient ischemic attack; patients are then categorized into a low-risk group (0 points), a moderate-risk group (1 point), and a high-risk group (2 points or more). In addition, the CHA2DS2-VASc (congestive heart failure/left ventricular dysfunction, hypertension, age ≥75 [doubled], diabetes, stroke [doubled], vascular disease, age 65-74, and sex category [female]) score (Birmingham 2009 scheme), adopted in the 2010 guidelines of the European Society of Cardiology, is a more detailed stroke risk assessment tool than the earlier CHADS2 score, because it includes risk factors (female sex, age 65-74 years, left ventricular dysfunction, vascular disease) that affect thromboembolism in patients whose CHADS2 score is 0 or 1 [16]. In these two stroke assessment tools, the factors used to assess the risk of stroke in patients with atrial fibrillation include age, hypertension, diabetes mellitus, history of stroke or transient ischemic attack, sex, left ventricular dysfunction, and vascular disease. These factors are broadly consistent with the four risk factors suggested in this study (age, hypertension, AF as the initial ECG rhythm, and initial Echo LAD).
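The CHADS2 scoring rule described above can be sketched in a few lines of Python. The function and parameter names are hypothetical, and the sketch treats a score of 2 or more as high risk, which is an assumption about the intended cut-off; it is an illustration of the point scheme, not a clinical tool.

```python
def chads2_score(
    congestive_heart_failure: bool,
    hypertension: bool,
    age_years: int,
    diabetes_mellitus: bool,
    prior_stroke_tia_or_thromboembolism: bool,
) -> int:
    """Sum the CHADS2 points: 1 each for C, H, A, D and 2 for S."""
    score = 0
    score += 1 if congestive_heart_failure else 0
    score += 1 if hypertension else 0
    score += 1 if age_years >= 75 else 0
    score += 1 if diabetes_mellitus else 0
    score += 2 if prior_stroke_tia_or_thromboembolism else 0
    return score


def chads2_risk_group(score: int) -> str:
    """Map a score to a risk group (assumed cut-offs: 0, 1, and 2 or more)."""
    if score == 0:
        return "low"
    if score == 1:
        return "moderate"
    return "high"


# Hypothetical patient: 78 years old with hypertension and no other factors.
example_score = chads2_score(False, True, 78, False, False)
print(example_score, chads2_risk_group(example_score))
```

Note that the two factors specific to this study (AF as the initial ECG rhythm and initial Echo LAD) have no counterpart in this score, so it cannot be applied directly to the association rules above.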
The results of the present study suggest the risk factors for complications of cerebral infarction in atrial fibrillation patients and the association rules between these factors, based on medical record data collected retrospectively. However, the effectiveness and reliability of these risk factors and the association rules suggested in this study have yet to be verified for clinical application; further research is required for such verification in addition to comparative studies with existing stroke assessment tools.