Users share valuable information through online smoking cessation communities (OSCCs), which help people maintain and improve smoking cessation behavior. Although OSCC utilization is common among smokers, limitations exist in identifying the smoking status of OSCC users (“quit” vs. “not quit”). Thus, the current study analyzed implicit signals in user-generated content (UGC) to identify individual users’ smoking status through advanced computational methods and real data from an OSCC.
Secondary data analysis was conducted using data from 3,833 users of
Introducing novel features boosted smoking status recognition (quit vs. not quit) by 9.3% relative to the use of text-only post features. Furthermore, advanced computational methods outperformed baseline algorithms across all models and increased the smoking status prediction performance by up to 12%.
The results of this study suggest that the current research method provides a valuable platform for researchers involved in online cessation interventions and furnishes a framework for ongoing machine learning applications. The results may help practitioners design a sustainable real-time intervention via personalized post recommendations in OSCCs. A major limitation is that only users’ smoking status was detected. Future research might involve programming machine learning classification methods to identify abstinence duration using larger datasets.
Many current and former smokers use online smoking cessation communities (OSCCs) for smoking cessation every year. These users post about their smoking cessation journey, efforts to remain abstinent, and celebrations [
More than 12 million smokers search for online information about quitting smoking every year globally, of whom a majority participate in social networking sites for cessation [
Existing methods of data mining can be categorized as baseline and deep learning (DL) methods. In the former type of method, a classifier is used to assign a sentence to either a positive or negative class. Baseline classifiers such as support vector machines (SVMs) and logistic regression (LR) have successfully been applied in previous research [
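As an illustrative sketch (not the exact pipeline of the studies cited above), a baseline classifier such as LR can be applied to post text through a bag-of-words representation; the posts, labels, and pipeline settings below are hypothetical:

```python
# Minimal sketch of a baseline "quit vs. not quit" post classifier.
# The posts and labels are hypothetical illustrations, not study data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "I finally hit my week mark without smoking",
    "I have not quit yet but have my date set",
    "Day 30 smoke free and feeling great",
    "Bought another pack this morning, still smoking",
]
labels = [1, 0, 1, 0]  # 1 = "quit" (positive class), 0 = "not quit" (negative)

# TF-IDF unigrams/bigrams feed a logistic regression classifier
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(posts, labels)
pred = clf.predict(["Still smoke free after one month"])
```

An SVM baseline is obtained by swapping `LogisticRegression` for `sklearn.svm.LinearSVC` in the same pipeline.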
In everyday conversation, opinions are expressed implicitly—that is, in a way that depends on domain and context. Identifying context-dependent features can also be useful in applications such as identifying users’ smoking status and semantic searching. Although several challenges exist in monitoring and retrieving UGC, the clickstream data and metadata surrounding a user’s participation in an online community are easy to retrieve. Because UGC usually reflects user experiences, additional details about the original post can be gathered from how other users respond to it. In conjunction with the text of user blog posts, such new features can be used as classifier inputs to boost performance, rather than relying solely on text. Moreover, methods using implicit and latent features have led to the emergence of DL models, which have demonstrated excellent performance in comparison to existing state-of-the-art methods [
This study collected data from
To evaluate the performance of the machine learning classifiers, labeled data are needed for training and evaluation so that the classifiers can learn the variation between instances from different classes. To label the unstructured data, we engaged three experts in machine learning (ML) and data mining who had domain expertise and in-depth familiarity with user conversations in OSCCs. Sample posts along with their corresponding labels are shown in
In addition, we performed several pre-processing steps on the raw data before performing ML-based basic classification and LSTM-based DL classification (
It is a major challenge to identify a learning algorithm for text analytics that takes individual word vectors as an input and transforms them into a feature vector. Several methods exist to generate word vectors, in which sentences are transformed into a matrix using word embedding [
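This transformation can be sketched with a toy, randomly initialized embedding table in place of trained vectors (the vocabulary and dimensionality here are assumptions for illustration):

```python
# Sketch of turning a post into a word-vector matrix via a lookup table.
# The 4-dimensional embeddings are random toy values, not trained ones.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"i": 0, "quit": 1, "smoking": 2, "today": 3}
dim = 4
embedding = rng.normal(size=(len(vocab), dim))  # one row per vocabulary word

def post_to_matrix(post):
    ids = [vocab[w] for w in post.lower().split() if w in vocab]
    return embedding[ids]  # (num_words, dim) matrix fed to the classifier

mat = post_to_matrix("I quit smoking today")
```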
In this study, DL methods were employed to process the sequential data. A recurrent neural network (RNN) was used for sequence encoding, taking every word as input to the model and capturing the overall meaning of each post [
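The idea of sequence encoding can be sketched as a vanilla RNN forward pass in plain NumPy (with random placeholder weights, not trained parameters): the network reads one word vector per step, and its final hidden state serves as the post-level representation.

```python
# Minimal numpy sketch of an RNN encoder: it consumes one word vector
# per step, and the final hidden state summarizes the whole post.
import numpy as np

rng = np.random.default_rng(1)
dim, hidden = 4, 8
W_xh = rng.normal(scale=0.1, size=(dim, hidden))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden, hidden))  # hidden-to-hidden weights
b_h = np.zeros(hidden)

def encode(word_vectors):
    h = np.zeros(hidden)
    for x in word_vectors:                       # one word vector per step
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)   # update hidden state
    return h                                     # post-level representation

post = rng.normal(size=(6, dim))  # a 6-word post as random word vectors
h = encode(post)
```

An LSTM replaces the single `tanh` update with gated cell-state updates, which is what lets it retain information over longer posts.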
The algorithms in Stanford CoreNLP were trained using all blog posts and comments as a learning set. During the experiments, the dataset was split into a training dataset (70%) and a testing dataset (30%). To avoid sampling bias and an uneven class distribution, the dataset was shuffled several times before splitting. All ML algorithms, including the LSTM models, used the same data split ratio between the training and test sets. Next, this study employed 10-fold cross-validation to reduce the bias associated with random sampling of the training data. The rationale behind this decision was that previous research successfully evaluated the performance of an algorithm using the above two criteria [
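The evaluation protocol described above—a shuffled 70/30 train/test split combined with 10-fold cross-validation—can be sketched with scikit-learn on synthetic data:

```python
# Sketch of the evaluation protocol: shuffled 70/30 train/test split
# plus 10-fold cross-validation, demonstrated on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=True, random_state=42)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X_train, y_train, cv=10)  # 10-fold CV estimates
test_acc = clf.fit(X_train, y_train).score(X_test, y_test)  # held-out accuracy
```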
The sequential frameworks required for the optimization techniques and regularization parameters are listed in
We classified the users’ posts using ML algorithms—SVM, LR, adaptive boosting (AdaBoost), gradient boost decision tree (GBDT), and eXtreme gradient boosting (XGBoost)—and LSTM as a DL algorithm. These algorithms were implemented in Python using the scikit-learn module. For the ML algorithms, we first created a document-word matrix. Each user post in the corpus was represented as a row of the matrix, whereas each column denoted a word occurring in the user posts. Words other than nouns, verbs, adjectives, or adverbs were screened out. We chose highly discriminative words using the chi-square statistic [
For the LSTM model, we performed word embedding with the Skip-Gram model [
The Skip-Gram objective maximizes the average log probability $\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\, j \neq 0} \log p(w_{t+j} \mid w_t)$, where $T$ is the number of words in the training corpus, $c$ is the size of the context window, and $p(w_{t+j} \mid w_t)$ is the softmax probability of the context word $w_{t+j}$ given the center word $w_t$.
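In practice, Skip-Gram training data consist of (center, context) word pairs drawn from each post; a minimal pair-generation sketch with a window of c = 2 (no embedding library involved):

```python
# Sketch of (center, context) pair generation for Skip-Gram training
# with a context window of c = 2; pure Python, illustrative only.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("i quit smoking today".split(), window=2)
```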
Regarding the input feature sets of each of the seven models (models 1 to 7), feature sets 5, 6, and 7 were closely correlated with smoking status (
The concordance of categorical smoking status with feature sets 1, 2, and 3 was poor (kappa = 0.04–0.23) (
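Concordance of this kind is typically quantified with Cohen's kappa; a sketch on hypothetical labels (not the study's data):

```python
# Sketch of a concordance check: Cohen's kappa between categorical
# smoking status and a feature-set-derived label. Labels are hypothetical.
from sklearn.metrics import cohen_kappa_score

status  = ["quit", "quit", "not quit", "quit", "not quit", "not quit"]
feature = ["quit", "not quit", "not quit", "quit", "quit", "not quit"]
kappa = cohen_kappa_score(status, feature)  # chance-corrected agreement
```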
Of the 3,833 users in the study sample, 3,623 (94.52%) had written at least one post between the date of registration on the community website and the ending date of data collection, from which their smoking status was identified (average number of posts = 2). Posts suggesting that the user still smoked were usually written within a few days of users’ registration (the average time until the first post on “smoking” was 2 days after registration). For 191 users, there were multiple posts indicating that they had quit, followed by a subsequent post indicating that they had resumed smoking.
Furthermore, 3,429 users wrote at least one post in which they reported having quit, with information about their quitting status (average number of posts = 2). On average, participants posted their first quit post 2 weeks after becoming members of the community (median = 14 days). The median interval between the inferred date of quitting and the quit posts was 5 days.
A total of 2,417 blog posts with 28,031 comments published by 1,733 users indicated that the author was not smoking at the time of the post. Thus, 57% of users (2,184/3,833) who authored a blog or blog comment wrote at least one post suggesting that they had quit smoking for at least a certain period.
We conducted several experiments to show the classification performance for smoking status identification for each baseline and LSTM classifier using seven models. The feature_selection module of scikit-learn, a well-known feature selection toolkit, was used for feature selection from the text. In addition, the performance of the ML algorithms was calculated through various performance scores (i.e., accuracy, precision, recall, F-measure, and the area under the receiver operating characteristic curve [AUC]) in
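The reported scores can be computed with scikit-learn's metrics module; the labels and predicted probabilities below are hypothetical:

```python
# Sketch of computing accuracy, precision, recall, F1, and AUC with
# scikit-learn; labels and predicted probabilities are hypothetical.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # gold smoking status
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # classifier decisions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]   # predicted P(quit)

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)  # AUC uses scores, not hard labels
```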
Overall, the study results are divided into two parts. For the ML algorithms, feature sets 3, 4, 5, 6, and 7 yielded better classifier performance than the standard feature sets 1 or 2 alone (
Furthermore, model 7 with the XGBoost algorithm had the best AUC (0.931), which was 8.5% higher than in model 1 (0.845). The next best algorithm across all models was GBDT, as shown by its accuracy (90.85%), precision (0.904), recall (0.905), F1-score (0.904), and AUC (0.916). The worst algorithm in terms of predictive value for model 7 was the LR algorithm, as shown by its accuracy (76.52%), precision (0.761), recall (0.766), F1-score (0.763), and AUC (0.788).
For the DL algorithm, the LSTM classifier outperformed all baseline classifiers across all seven models (
The results also showed that adding the feature sets improved the prediction performance of the proposed algorithms to identify users’ smoking status. The balance between positive and negative cases was 49.3:50.7 (with positive cases defined as posts for which the author was clearly not smoking). This implies that the proportion of threads containing positive cases was 49.3%. Moreover, the LSTM algorithm achieved the highest AUC of 0.977.
The goal of our study was to predict the smoking status (“quit or not”) of individual users who posted comments on an OSCC. Previous research has already described OSCC users’ behaviors and engagement [
We added novel features along with user posts by considering social influence features, domain-specific aspects, author-based characteristics, thread-based features, and adjacent posts in our models. The addition of novel features to enhance the performance of our algorithms highlights the importance of these features in identifying users’ smoking status in OSCCs. A high concordance between domain-dependent feature sets and smoking status identification supports the validity of those inferences [
This study provides several implications for practitioners. For designers of other online platforms such as Text2Quit, BecomeAnEX,
Users’ language may differ from one online community to another depending on the specific addiction being discussed, meaning that certain domain-specific characteristics can vary and may need to be modified with the aid of frequent group users or by reading UGC. However, in most online communities, social influence features are visible. For instance, users may communicate with other community members by posting comments. Regardless of the form such interactions take, the content of these “social influence” posts can be leveraged when mining a focal post. Therefore, practitioners should pay careful attention to social influence on their platforms.
This phenomenon might be associated with information overload, which exhausts users when they are trying to make quick decisions [
There are a few limitations to our study. First, the UGC dataset contained a large volume of noisy text; thus, future research can perform different experiments and develop ML techniques that resolve the noisy text problem to improve the performance of classifiers. Second, in a few posts, users mentioned their duration of abstinence; however, our current classification algorithm only detected the current smoking status of users (i.e., quit or not). In further studies, researchers can program ML methods to identify the duration of abstinence using larger datasets.
No potential conflict of interest relevant to this article was reported.
The overall analytical framework of the study. NLP: natural language processing, POS: part-of-speech.
Sample posts along with their corresponding labels
Post content | Label | Class label |
---|---|---|
Yay, I finally hit my week mark today. I am not going to lie. This weekend was tough but I made it through without smoking. Hope everyone else is doing well. | Obviously not smoking | Positive |
Well, it’s still only like 5.60 for a pack of reds here in Tulsa, Oklahoma. I have not quit yet but have my date set. | Obviously smoking | Negative |
I work in a hospital with cancer patients and it still has not fazed me yet. It takes strength. You also have to want it. Good Luck. | Unidentified | Negative |
Examples of feature sets alongside community posts
Feature sets | Post content |
---|---|
- | |
We used the bag-of-words (BOW) approach after performing data pre-processing. Standard unigram and bigram text features are popular for text classification tasks and have previously been used to identify users’ quit status. The second set contains the Doc2Vec feature set, a document embedding method in which each document is represented as a dense vector. | |
Happy Milestones to Angelina. Love is good! Thank you doctor for your support in a less painless way. Thanks for my family and friends for helping me grow my sobriety. | |
Almost Day 3 for me, and I’ m worried about the weekend coming up, too. Today is my Day 7 still hasn’t smoked or drinks. | |
For each post author, we extracted the total duration of community membership and the number of posts published by that author. Both features were calculated from the time a user joined the community until he or she published the post. | |
For each particular post, we mined the length of each post (i.e., number of words), number of posted comments in the thread, number of individual users who posted to the thread, and duration of the thread activity. | |
Hey Joe, thanks for the encouragement. I have been an ex for 7 days. Today has not been too bad, and I keep exciting. |
OSCC: online smoking cessation community, BOW: bag-of-words.
Hyperparameters for machine learning and LSTM algorithms
Algorithm | Hyperparameter | Value |
---|---|---|
AdaBoost | Number of estimators | 250 |
Base estimator | Decision stump | |
Learning rate | 0.1 | |
| ||
GBDT | Number of estimators | 250 |
Learning rate | 0.1 | |
Maximum depth | 5 | |
Minimum samples at leaf node | 2 | |
| ||
XGBoost | Number of estimators | 500 |
Learning rate | 0.001 | |
Maximum depth | 3 | |
Regularization coefficient | 0.0001 | |
Gamma | 0.1 | |
| ||
LSTM | Mini batch size | 256 |
Number of layers | 2 | |
Optimization method | Adam | |
Loss | Binary cross-entropy | |
L2 regularization coefficient | 1e-4 | |
Dropout | 0.25 | |
Epochs | 200 | |
Output activation | Sigmoid | |
Learning rate | 0.001 |
AdaBoost: adaptive boosting, XGBoost: eXtreme gradient boosting, GBDT: gradient boost decision tree, LSTM: long short-term memory.
Correlations among input and output variables
Smoking status | Feature set 1 | Feature set 2 | Feature set 3 | Feature set 4 | Feature set 5 | Feature set 6 | Feature set 7 | |
---|---|---|---|---|---|---|---|---|
Smoking status | 1.00 | −0.23 | 0.31 | 0.33 | −0.15 | - | - | - |
Feature set 1 | - | 1.00 | −0.23 | −0.31 | −0.06 | - | - | - |
Feature set 2 | - | - | 1.00 | 0.34 | 0.17 | - | - | - |
Feature set 3 | - | - | - | 1.00 | 0.30 | - | - | - |
Feature set 4 | - | - | - | - | 1.00 | - | - | - |
Feature set 5 | 0.72 | - | - | - | - | 1.00 | - | - |
Feature set 6 | 0.76 | - | - | - | - | 0.67 | 1.00 | - |
Feature set 7 | 0.82 | - | - | - | - | 0.71 | 0.75 | 1.00 |
Concordance matrix for selected feature sets
Quit | Not quit | Kappa | |||
---|---|---|---|---|---|
Feature set 1 (n = 41) | Smoking status | Quit | 14 | 9 | 0.23 |
Not quit | 10 | 8 | |||
| |||||
Feature set 2 (n = 44) | Smoking status | Quit | 15 | 13 | 0.04 |
Not quit | 9 | 7 | |||
| |||||
Feature set 3 (n = 47) | Smoking status | Quit | 17 | 15 | 0.15 |
Not quit | 8 | 7 | |||
| |||||
Feature set 4 (n = 52) | Smoking status | Quit | 19 | 17 | 0.10 |
Not quit | 9 | 7 |
Description of various measures used to evaluate algorithm performance
Model | Algorithm | Accuracy (%) | Precision | Recall | F1-score | AUC |
---|---|---|---|---|---|---|
Model 1 | SVM | 66.09 | 0.625 | 0.661 | 0.642 | 0.642 |
LR | 64.41 | 0.641 | 0.644 | 0.642 | 0.661 | |
AdaBoost | 72.26 | 0.724 | 0.723 | 0.723 | 0.751 | |
GBDT | 82.12 | 0.823 | 0.825 | 0.824 | 0.827 | |
XGBoost | 83.45 | 0.833 | 0.835 | 0.834 | 0.845 | |
LSTM | 85.51 | 0.859 | 0.855 | 0.857 | 0.823 | |
| ||||||
Model 2 | SVM | 66.84 | 0.652 | 0.668 | 0.660 | 0.653 |
LR | 68.56 | 0.684 | 0.686 | 0.685 | 0.667 | |
AdaBoost | 75.94 | 0.745 | 0.759 | 0.752 | 0.751 | |
GBDT | 84.52 | 0.847 | 0.878 | 0.862 | 0.848 | |
XGBoost | 84.65 | 0.842 | 0.845 | 0.843 | 0.847 | |
LSTM | 87.68 | 0.876 | 0.874 | 0.875 | 0.871 | |
| ||||||
Model 3 | SVM | 70.31 | 0.704 | 0.703 | 0.703 | 0.685 |
LR | 70.33 | 0.703 | 0.703 | 0.703 | 0.751 | |
AdaBoost | 82.52 | 0.829 | 0.825 | 0.827 | 0.805 | |
GBDT | 85.31 | 0.850 | 0.853 | 0.851 | 0.855 | |
XGBoost | 85.78 | 0.856 | 0.858 | 0.857 | 0.857 | |
LSTM | 89.68 | 0.897 | 0.896 | 0.896 | 0.892 | |
| ||||||
Model 4 | SVM | 80.40 | 0.803 | 0.804 | 0.803 | 0.798 |
LR | 80.50 | 0.806 | 0.805 | 0.805 | 0.796 | |
AdaBoost | 84.75 | 0.853 | 0.847 | 0.850 | 0.828 | |
GBDT | 86.14 | 0.863 | 0.865 | 0.864 | 0.867 | |
XGBoost | 86.25 | 0.864 | 0.866 | 0.865 | 0.888 | |
LSTM | 90.42 | 0.901 | 0.903 | 0.902 | 0.907 | |
| ||||||
Model 5 | SVM | 82.41 | 0.824 | 0.825 | 0.824 | 0.819 |
LR | 72.43 | 0.724 | 0.724 | 0.724 | 0.762 | |
AdaBoost | 84.62 | 0.840 | 0.846 | 0.843 | 0.826 | |
GBDT | 87.41 | 0.871 | 0.874 | 0.872 | 0.866 | |
XGBoost | 87.88 | 0.877 | 0.868 | 0.872 | 0.857 | |
LSTM | 92.13 | 0.922 | 0.921 | 0.921 | 0.923 | |
| ||||||
Model 6 | SVM | 84.56 | 0.826 | 0.847 | 0.836 | 0.840 |
LR | 74.57 | 0.746 | 0.746 | 0.746 | 0.780 | |
AdaBoost | 86.77 | 0.862 | 0.868 | 0.865 | 0.848 | |
GBDT | 89.55 | 0.892 | 0.894 | 0.893 | 0.898 | |
XGBoost | 89.95 | 0.898 | 0.899 | 0.898 | 0.875 | |
LSTM | 94.15 | 0.943 | 0.942 | 0.942 | 0.944 | |
| ||||||
Model 7 | SVM | 85.58 | 0.856 | 0.858 | 0.857 | 0.840 |
LR | 76.52 | 0.761 | 0.766 | 0.763 | 0.788 | |
AdaBoost | 87.79 | 0.874 | 0.878 | 0.876 | 0.879 | |
GBDT | 90.85 | 0.904 | 0.905 | 0.904 | 0.916 | |
XGBoost | 92.78 | 0.928 | 0.927 | 0.927 | 0.931 | |
LSTM | 97.56 | 0.974 | 0.971 | 0.972 | 0.977 |
AUC: area under the receiver operating characteristic curve, SVM: support vector machine, LR: logistic regression, AdaBoost: adaptive boosting, XGBoost: eXtreme gradient boosting, GBDT: gradient boost decision tree, LSTM: long short-term memory.
Performance comparison between existing machine learning models and proposed models
Study, year | Accuracy (%) | Precision | Recall | F1-score |
---|---|---|---|---|
Cohn et al. [ |
86.00 | - | - | 0.860
Pearson et al. [ |
91.00 | - | - | 0.910
Nguyen et al. [ |
75.40 | - | - | - |
Zhang and Yang [ |
- | 0.85 | 0.72 | 0.74 |
Myslin et al. [ |
0.85 | 0.82 | 0.88 | 0.85 |
Rose et al. [ |
73.60 | - | - | - |
Wang et al. [ |
- | - | - | 0.759 |
Proposed method | 92.78 | 0.928 | 0.927 | 0.927 |
Overall comparison between the proposed model and other deep learning methods
Method | Accuracy (%) | Precision | Recall | F1-score |
---|---|---|---|---|
Joint AB-LSTM [ |
- | 74.47 | 64.96 | 69.39 |
Tree-LSTM |
- | 79.30 | 67.20 | 72.70 |
Dep-LSTM |
- | 72.53 | 71.49 | 72.00 |
Proposed method | 97.56 | 97.40 | 97.10 | 97.20
AB-LSTM: attention-based bidirectional long short-term memory, Dep-LSTM: dependency-based long short-term memory.
From Lim S, Lee K, Kang J. Drug drug interaction extraction from the literature using a recursive neural network. PLoS One 2018;13(1):e0190926.
From Wang W, Yang X, Yang C, Guo X, Zhang X, Wu C. Dependency-based long short term memory network for drug-drug interaction extraction. BMC Bioinformatics 2017;18(Suppl 16):578.