I. Introduction
In 1817, in his document “An Essay on the Shaking Palsy”, James Parkinson was the first to describe Parkinson’s disease (PD) as a neurological syndrome, characterized by a shaking palsy [
1,
2]. Later, in 1872, the French neurologist Jean-Martin Charcot described this disease more precisely; by examining a large number of patients, he distinguished bradykinesia from other tremor disorders and was the first to suggest the use of the term “Parkinson’s disease” [
3,
4]. Further studies were made until Brissaud and Meige [
5] identified damage of the substantia nigra as the main cause of PD. This damage leads to a range of symptoms including rigidity, balance impairment, rest tremor, and slowness of movement [
6].
In addition to these motor symptoms, the voice is also affected. Voice and speech impairment is a typical symptom of PD that occurs in most patients [
7]. Gradual deterioration of communication skills in patients with PD is considered to be a significant cause of disability [
7]. Ho et al. [
8] found that 147 of 200 PD patients had speech impairment, and participants showed a gradual deterioration of speech characteristics. Traditional diagnosis of PD is costly and can take from hours to a few days to perform. Consequently, evaluating voice quality and recognizing the markers of its deterioration in PD from phonological and acoustic signals is essential to improving PD diagnosis. Furthermore, developing a smart system based on machine-learning (ML) techniques able to detect this disease at an early stage would reduce the number of clinical visits for examinations and the workload of clinicians [
9].
For example, Little et al. [
10] presented a system that detects dysphonia by discriminating between healthy controls (HC) and PD participants using a dataset of 195 records collected from 31 subjects, of whom 23 were diagnosed with PD. They extracted both time-domain and frequency-domain features from the records and achieved a classification accuracy of 91.4% using 10 highly uncorrelated measures and the support vector machine (SVM) technique. Benba et al. [
11] used a dataset consisting of voice samples of 17 PD patients and 17 HCs recorded using a computer’s microphone. They extracted 20 Mel-frequency cepstral coefficients (MFCC) and achieved a classification accuracy of 91.17% using linear SVM with 12 coefficients. Hemmerling et al. [
12] used an original dataset consisting of 198 records from 33 PD patients and 33 HCs. They extracted several acoustic features, applied principal component analysis (PCA) for dimensionality reduction, and used a linear SVM classifier, which achieved an accuracy of 93.43%.
In this paper, we describe our methodology for analyzing raw audio recordings collected using smartphones to create accurate predictive models. As previously mentioned, several studies have been conducted on this subject, but our methodology differs in many ways. First, we used a large dataset consisting of 18,210 recordings, of which 9,105 were obtained from 453 patients with PD and 9,105 were obtained from 1,037 HCs. To the best of our knowledge, this is the largest cohort of data used in a clinical application of this kind. Second, instead of extracting only time-, frequency-, or cepstral-domain features from the recordings, we used a combination of the three domains to create highly accurate predictive models. Our final dataset consisted of 80,594 instances and 138 features as well as a class variable. We applied two feature selection methods, analysis of variance (ANOVA) and the least absolute shrinkage and selection operator (LASSO), to select the best subset of features. We then compared various state-of-the-art and newer ML techniques, namely, linear SVM, K-nearest neighbor (KNN), random forest (RF), and extreme gradient boosting (XGBoost). A maximum accuracy of 95.78% was achieved using XGBoost on unseen data.
The remainder of the paper is organized as follows. We describe our method in detail in Section II, Section III presents the results, and Section IV discusses the findings.
II. Methods
1. Data Acquisition
1) The mPower study
The raw audio recordings used in this study were collected from the mPower Public Researcher Portal [
13], the data repository of the mPower mobile Parkinson disease study [
14] hosted on Synapse, an open-source data analysis platform led by Sage Bionetworks. The mPower project is a clinical study of PD conducted entirely through an iPhone application built with ResearchKit, an open-source software framework developed by Apple that facilitates the creation of medical research applications. Participation was open to individuals from the United States diagnosed with PD, as well as HCs with knowledge of the disease who were interested in the study. The mPower study has seven principal tasks: three survey questionnaires that must be filled out by the participants (the Demographic Survey, the Parkinson’s Disease Questionnaire-8 [PDQ-8], and the Unified Parkinson’s Disease Rating Scale [UPDRS]) and four activity tasks (a memory task, a tapping task, a voice task, and a walking task). In this paper, we are only interested in the demographic survey and the voice task.
2) Cohort selection
The Demographic Survey is an important questionnaire, by which we distinguished PD patients from HCs. Of the 6,805 participants who answered this questionnaire, 1,087 identified themselves as having a professional diagnosis of PD, while 5,581 did not (137 chose not to answer the question). Each participant had his or her own ID (healthCode) that was used in this phase; more details are given in [
13,
14].
Of the whole group of subjects, 5,826 participated in the voice task, resulting in a total of 65,022 recordings. Each person was asked to record his or her voice three times a day using the smartphone’s microphone, saying “aaaah” at a steady pace for 10 seconds. PD patients were instructed to record their voices immediately before taking PD medication, just after taking PD medication (when feeling at their best), and at another time of the day. HCs could record their voices at any time of the day.
To avoid optimistic bias in predicting PD, we conducted a rigorous cohort filtering process, which is illustrated in
Figure 1 (steps 1 and 2).
Step 1: Using the demographic survey
- PD group selection: If the participant is professionally diagnosed by a doctor AND he or she has a valid date of diagnosis, AND is actually a Parkinsonian, not a caretaker, AND has never had surgery to treat PD nor deep brain stimulation, AND his or her age is valid.
- HC group selection: If the participant is not professionally diagnosed by a doctor, AND he or she has no valid date of diagnosis, AND has no movement symptoms, AND his or her age is valid.
- Unknown group selection: A participant is said to be unknown if their professional diagnosis is unknown.
Step 2: Using the medical time point of the recordings
The recordings were downloaded from [
13] using the Synapse Python client and SQL-like query commands, for a total size of 80 GB (a minimal download sketch is given after the group-selection list below). In this step, two important variables were used to filter the participants (healthCode from step 1 and the medication time point from this step); see
Table 1 for details.
- PD group selection: Selected PD participants from step 1 AND (recordings of participants immediately before taking PD medication OR recordings of participants who didn’t take PD medication).
- HC group selection: The same HC participants selected in step 1.
- Unknown group selection: Unknown group from step 1 OR (records with undefined medication time point OR recordings of participants after taking PD medication OR recordings of participants at another time of the day).
Note that AND represents logical conjunction and OR represents logical disjunction.
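For illustration, this download step can be sketched with the synapseclient Python library as follows; the table ID, audio column name, and login method are placeholders or assumptions, and the actual identifiers are listed on the mPower portal [13].

# Minimal sketch of the download step; the table ID and audio column name
# are placeholders (the actual identifiers are listed on the mPower portal).
import synapseclient

syn = synapseclient.Synapse()
syn.login(authToken="<personal-access-token>")   # credentials are assumed

VOICE_TABLE = "syn00000000"    # placeholder ID of the voice-activity table

# SQL-like query against the Synapse table; healthCode and the medication
# time point are the two filtering variables used in steps 1 and 2.
results = syn.tableQuery(f"SELECT * FROM {VOICE_TABLE}")
records = results.asDataFrame()

# Download the raw audio files referenced by the table (column name assumed).
audio_files = syn.downloadTableColumns(results, ["audio_audio.m4a"])
print(f"Downloaded {len(audio_files)} recordings")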
The final cohort dataset statistics are shown in
Table 2.
2. Audio Signal Feature Extraction
Feature extraction is a primary step in ML and pattern recognition systems, particularly at the audio analysis stage. Audio signals are constantly changing, i.e., non-stationary, which is why, in most applications, audio signals are divided into short-term frames [
15], and the analysis is done on a frame basis. Using the pyAudioAnalysis [
16] library, we extracted audio features representing the properties of the selected recordings using two common techniques: short-term and mid-term processing.
Short-term processing was done by following a windowing procedure. The window is generally between 20 ms and 40 ms [
16]. Each recording was sampled at 44.1 kHz and divided into short windows of 30 ms with a step of 15 ms; the audio signal was multiplied by a shifted version of this window. This phase resulted in a sequence of feature vectors, with 34 features extracted per frame from the time, frequency, and cepstral domains [
15,
16] (
Supplementary Table S1). The cepstral domain, or cepstrum, is defined as the inverse discrete Fourier transform (DFT) of the log magnitude of the DFT of a signal.
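As an illustration, the short-term extraction step can be reproduced with pyAudioAnalysis roughly as follows; this is a sketch assuming the 0.3.x API naming and a recording already converted to WAV (the file path is hypothetical).

# Sketch of short-term feature extraction (pyAudioAnalysis 0.3.x API assumed).
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

sampling_rate, signal = audioBasicIO.read_audio_file("recording_0001.wav")
signal = audioBasicIO.stereo_to_mono(signal)

window = int(0.030 * sampling_rate)   # 30 ms window
step = int(0.015 * sampling_rate)     # 15 ms step

# One feature vector per frame: 34 base features plus their deltas.
features, feature_names = ShortTermFeatures.feature_extraction(
    signal, sampling_rate, window, step)
print(features.shape)   # (num_features, num_frames)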
The mid-term processing was done by dividing each recording into mid-term windows of 5 seconds (generally between 1 and 10 seconds [
16]), with a step of 2.5 seconds. Then for each window, short-term processing was applied to calculate the feature statistics (feature_mean, delta_feature_mean, feature_std, delta_feature_std) for each of the 34 features. This resulted in 136 extracted audio features, plus the age and gender of each participant as well as the class variable (1 for PD participants, 0 for HC participants) (see workflow 1 in
Figure 2). The final dataset had 80,594 instances, 138 features, and a class variable in total (
Supplementary Table S2).
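The mid-term statistics can be obtained in the same way; the sketch below (again assuming the pyAudioAnalysis 0.3.x API and hypothetical file and column names) shows how each recording yields 136 mid-term features that are then combined with age, gender, and the class label.

# Sketch of mid-term feature extraction (pyAudioAnalysis 0.3.x API assumed).
# Each 5 s mid-term window aggregates the 30 ms / 15 ms short-term features
# into means and standard deviations (including deltas), i.e., 136 features.
from pyAudioAnalysis import audioBasicIO, MidTermFeatures

sampling_rate, signal = audioBasicIO.read_audio_file("recording_0001.wav")
signal = audioBasicIO.stereo_to_mono(signal)

mid_features, short_features, mid_names = MidTermFeatures.mid_feature_extraction(
    signal, sampling_rate,
    int(5.0 * sampling_rate),      # mid-term window: 5 s
    int(2.5 * sampling_rate),      # mid-term step: 2.5 s
    int(0.030 * sampling_rate),    # short-term window: 30 ms
    int(0.015 * sampling_rate))    # short-term step: 15 ms

# mid_features has shape (136, num_mid_windows); each column becomes one
# instance, to which age, gender, and the class label (1 = PD, 0 = HC)
# are appended (the metadata source and column names are assumptions).
print(mid_features.shape, len(mid_names))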
3. Baseline Models
Different techniques are better suited for different problems, and for different types of data (in our case predicting a category, with labeled data, under 100k samples). For this, we used the scikit-learn algorithm cheat-sheet [
17] to select multiple classifiers suited to our problem, namely, linear SVM, KNN, RF, and XGBoost.
The first step was to create a baseline ML model for each technique, using the default hyperparameters, and to compare their performance. To avoid overfitting, we used stratified 5-fold cross-validation; because of our large dataset (more than 80,000 samples), this was computationally intensive. Four measures were used to assess the performance of the classifiers, namely, accuracy, sensitivity, specificity, and the F1-score.
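A baseline comparison of this kind can be sketched with scikit-learn and xgboost as follows; the file name and the "class" column name are assumptions, and the classifiers are left at their default hyperparameters as described above.

# Sketch of the baseline comparison: default-hyperparameter classifiers
# evaluated with stratified 5-fold cross-validation on accuracy,
# sensitivity, specificity, and F1-score.
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import make_scorer, recall_score
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

df = pd.read_csv("mpower_features.csv")   # hypothetical path to the final dataset
X = df.drop(columns=["class"]).values     # "class" column name is an assumption
y = df["class"].values                    # 1 = PD, 0 = HC

scoring = {
    "accuracy": "accuracy",
    "sensitivity": make_scorer(recall_score, pos_label=1),
    "specificity": make_scorer(recall_score, pos_label=0),
    "f1": "f1",
}
models = {
    "Linear SVM": LinearSVC(),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(),
    "XGBoost": XGBClassifier(),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    print(name, {k: v.mean() for k, v in scores.items() if k.startswith("test_")})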
Before training our classifiers, we applied and compared various data preprocessing techniques. Data preprocessing, including cleaning and standardization, is an important step in every ML pipeline.
1) Dataset cleaning: handling missing values
It is quite common to have missing values (NaNs) in a dataset. This was true in our case, where missing values resulted from the audio feature extraction phase for some recordings. Handling missing values can improve an ML model’s accuracy. For this, we tested the following methods (a sketch is given after this list):
- Removing instances with missing values;
- Replacing missing values with zero;
- Imputing missing values with the mean, median, and most frequent value in each column.
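A compact sketch of these three strategies with pandas and scikit-learn (the file path and the "class" column name are assumptions) is:

# Sketch of the tested missing-value strategies.
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("mpower_features.csv")                  # hypothetical path
feature_cols = [c for c in df.columns if c != "class"]   # "class" name assumed

# 1) Remove instances (rows) with missing values.
df_dropped = df.dropna()

# 2) Replace missing values with zero.
df_zero = df.fillna(0)

# 3) Impute missing values column-wise with the mean, median, or mode.
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    imputed = imputer.fit_transform(df[feature_cols])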
2) Dataset normalization
Our dataset included features with very different ranges: for example, age between 18 and 85, gender either 0 or 1, the zero-crossing rate feature (zcr_mean) between 0 and 0.7, and the energy feature (energy_mean) between 6.205814e-09 and 5.019101e-01 (see
Supplementary Table S1 for feature descriptions).
Dataset normalization was therefore required to bring the column values into a common range; hence, we implemented and compared several normalization techniques, including rescaling the features between 0 and 1.
We then compared the performance of these data preprocessing techniques to choose the best combination for creating the baseline models. We divided our dataset into a training set (80%) and a held-out test set (20%). The test set was used to assess the performance of the final models on unseen data, as shown in
Figure 2 (workflow 2).
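As an example, the 80/20 split with 0–1 rescaling can be sketched as follows, reusing X and y from the earlier sketch; the stratified split and fitting the scaler on the training split only are assumptions about the exact procedure used.

# Sketch of the 80/20 train/test split with 0-1 rescaling; the scaler is
# fitted on the training split only to avoid test-set leakage.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

scaler = MinMaxScaler()                 # rescales each feature to [0, 1]
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)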
4. Feature Selection
Feature selection is one of the main concepts in ML. A large number of features increases the complexity of the models and may decrease their performance, in addition to being computationally intensive. Various feature selection methods are widely used in the literature [
18]. In this work, we adopted a filter method and an embedded method. Wrapper methods were excluded because their exhaustive search for the optimal set of features is computationally intensive on large datasets.
1) Filter method: ANOVA
ANOVA provides a statistical test to determine whether the means of several groups are equal. It computes the ANOVA F-value between each feature and the class variable. This F-value is used to select the subset of K features that have the strongest relationship with the class.
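In scikit-learn, this corresponds to SelectKBest with the f_classif (ANOVA F-value) score function; a minimal sketch using one of the tested K values, and the training arrays from the earlier sketches, is:

# Sketch of the ANOVA filter method: keep the K features with the highest
# F-values with respect to the class variable (K = 10, 20, or 30 tested).
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=30)
X_train_k = selector.fit_transform(X_train, y_train)
X_test_k = selector.transform(X_test)
print(selector.get_support(indices=True))   # indices of the selected features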
2) Embedded method: LASSO
LASSO is a regression analysis method that performs L1 regularization, which also acts as an indirect form of feature selection. It has a parameter C that controls sparsity: the smaller C is, the fewer features are selected.
We decided to limit the number of selected features to roughly 30 to reduce the complexity of our models. In our case, adding less important features (beyond 30) made the classifiers more complex without any significant or noticeable improvement in performance. Thus, we tested various values of K (10, 20, 30) and C (0.01, 0.02, 0.03) to assess the performance of our classifiers and chose the best combination (
Table 3,
Supplementary Table S3).
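One common way to realize this L1-based (LASSO-style) selection for a classification task in scikit-learn is SelectFromModel wrapped around an L1-penalized linear model, whose C parameter plays the sparsity-controlling role described above; the choice of estimator below is an assumption of this sketch, not necessarily the implementation used in this study.

# Sketch of the embedded L1 (LASSO-style) selection; using an L1-penalized
# linear SVM as the underlying estimator is an assumption of this sketch.
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

for C in (0.01, 0.02, 0.03):                     # tested sparsity levels
    l1_model = LinearSVC(C=C, penalty="l1", dual=False, max_iter=5000)
    selector = SelectFromModel(l1_model).fit(X_train, y_train)
    print(f"C = {C}: {selector.get_support().sum()} features selected")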
III. Results
1. Baseline Models Results
After preprocessing our dataset with combinations of techniques for handling missing values and normalizing the columns into a common range, we found that deleting rows with missing values and rescaling the dataset between 0 and 1 gave the best-performing baseline results.
Table 4 presents the classification results obtained using all of the features. XGBoost was the most accurate, sensitive, and specific technique with 90.97%, 90.80%, and 91.14%, respectively, with an F1-score of 90.92%. Linear SVM was the least accurate, sensitive, and specific technique with 76.47%, 78.60%, and 74.36%, respectively, with an F1-score of 76.88%.
2. Feature Selection Results
Table 3 presents the classification results obtained after feature selection. We selected various subsets of ranked features using ANOVA’s K best parameter (K = 10, 20, and 30) and LASSO’s C parameter (C = 0.01, 0.02, and 0.03) and tested their performance for each ML technique.
RF was the most accurate, sensitive, and specific technique using ANOVA’s best 10, 20, and 30 features and using LASSO’s C = 0.01 and C = 0.02. KNN was the most accurate, sensitive, and specific technique using LASSO’s C = 0.03. Linear SVM was the least accurate, sensitive, and specific technique in all cases.
Supplementary Table S3 presents the subset of features for each feature selection method using the various parameters K and C.
3. Hyperparameter Tuning Results
From
Table 3, we concluded that the subsets of features selected by LASSO outperformed those selected by ANOVA with almost the same number of features (K = 10 vs. C = 0.01, K = 20 vs. C = 0.02, K = 30 vs. C = 0.03). Thus, for hyperparameter tuning, we used the best subset of features that maximized the performance of each ML technique, bearing in mind that the results shown in
Table 4 were measured using the default hyperparameters with 138 features.
Linear SVM, KNN, and XGBoost were most accurate using LASSO with C = 0.03, achieving 76.02%, 92.69%, and 90.83%, respectively, using only 33 features. However, RF was most accurate using LASSO with C = 0.01, achieving 92.30% using only 11 features.
Table 5 presents the hyperparameter tuning results obtained using random search. XGBoost was the most accurate, sensitive, and specific technique, with 95.31%, 95.19%, and 95.43%, respectively, and an F1-score of 95.28%; when predicting new cases on unseen data, it achieved an accuracy, sensitivity, and specificity of 95.78%, 95.32%, and 96.23%, respectively (
Table 6) and with an F1-score of 95.74%. KNN, RF, and linear SVM were the least accurate, sensitive, and specific techniques.
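For reference, the random-search tuning step for the XGBoost model could look like the following sketch; the parameter grid and number of iterations are illustrative and not the search space actually used in this study.

# Sketch of hyperparameter tuning with random search; the grid below is
# illustrative only. X_train_sel holds the LASSO-selected training features.
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from xgboost import XGBClassifier

param_distributions = {
    "n_estimators": [100, 200, 400, 800],
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
}
search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions=param_distributions,
    n_iter=50,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X_train_sel, y_train)
print(search.best_params_, search.best_score_)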
IV. Discussion
This paper presented the method we used to distinguish PD patients from HCs using 18,210 smartphone recordings, from which we created a dataset of 80,594 samples with 138 features and a class variable. The LASSO feature selection method outperformed ANOVA with almost the same number of features. From
Supplementary Table S3, we conclude that the age and gender features are highly ranked by both methods; it is known that PD is mostly seen in people aged over 50 and that it affects males more than females. Furthermore, energy entropy, spectral spread, and the MFCC coefficients are highly ranked by both methods, which indicates the importance of extracting time-, frequency-, and cepstral-domain features, in addition to the age and gender of participants, for classifying this disease.
From the same table, we can see a difference in the ranking of features between ANOVA and LASSO because each technique implements a different approach. ANOVA analyzes the relationship between each feature and the class variable separately and assigns a test score to each feature; all the test scores are then compared, and the features with the top scores are selected (K = 10, 20, 30). On the other hand, LASSO regularization adds a penalty to the parameters of the model to avoid overfitting. This penalty is applied to the coefficients that multiply each of the features, so the L1 technique analyzes all the features at once. In addition, LASSO has the important property of shrinking the coefficients of unimportant features down to zero, depending on the chosen C parameter. For this reason, in
Supplementary Table S3, there are different selected features for each C value with LASSO, whereas ANOVA yields the same ranked features regardless of the chosen K value. With this combination of features, XGBoost outperformed the remaining classifiers with an accuracy of 95.31% using 80% of the data (
Table 5) while predicting new cases on unseen data with an accuracy of 95.78% (
Table 6).
Several studies [
10–
12] have reported high classification accuracy using SVM in the range of 91% to 93%, as seen in
Table 7. However, in
Tables 3–
6, we note that SVM was the least accurate technique in our experiments, with a maximum accuracy of 76.47% using 138 features. We attribute this difference to the fact that those studies used a small number of recordings and thus small datasets. Furthermore, we found that a regular (kernel) SVM took a long time to fit our data (approximately 40 minutes). For this reason, we used linear SVM, which is optimized for large datasets, although this limited our ability to test other kernels (RBF, polynomial, and sigmoid).
Singh and Xu [
19] used the same dataset that we used and achieved an accuracy of 99% using MFCC coefficients, L1-based feature selection, and an SVM classifier with an RBF kernel on 1,000 samples. The problem is that those 1,000 recordings were chosen randomly from an unbalanced database of 65,022 recordings in which, as claimed in their paper, 14% of participants were diagnosed with PD and 86% were healthy controls; random sampling from such a set may itself have produced an unbalanced subset of 1,000 recordings. Moreover, those recordings were chosen without taking the medication time point into account; therefore, their dataset may have included recordings made after patients had taken PD medication. We avoided this in our cohort selection phase, as seen in
Figure 1, and our dataset is balanced, with a 50/50 split of the recordings: 9,105 obtained from PD patients and 9,105 obtained from HCs. Furthermore, relying only on accuracy to assess the performance of the classifiers is not sufficient in medical diagnostics. Adding other metrics, such as sensitivity (the proportion of PD patients correctly classified as having PD) and specificity (the proportion of HCs correctly classified as not having PD), gives a better estimate of the classifiers’ performance. In our case, our best classifier achieved an accuracy of 95.78%, a sensitivity of 95.32%, and a specificity of 96.23%, with an F1-score of 95.74%. On an unbalanced dataset (and the balance of theirs is unclear), relying only on accuracy may yield a high score simply because one class outnumbers the other; thus, reporting additional metrics is important. Hence, we believe that our approach is more accurate and precise for classifying PD, even though Singh and Xu [
19] achieved a higher accuracy, which could be affected by their methodological choices.
In conclusion, we proposed a method to classify PD using a large sample of smartphone recordings of a sustained phonation of /a/ lasting 10 seconds. These recordings were processed to extract features from multiple domains, in addition to demographic parameters, creating an original dataset that was subjected to various ML techniques after data cleaning, normalization, and feature selection. We demonstrated the importance of using these features to classify PD precisely, with an accuracy of 95.78% using XGBoost. The main objective of this work was to build a smart framework based on ML techniques capable of distinguishing between PD patients and HCs using voice as a disease biomarker. As future work, we aim to develop an mHealth system implementing these ML techniques to speed up diagnosis and to integrate it with conventional clinical methods.