Symptom and Sentiment Analysis of Older People with Cancer and Caregivers: A Text Mining Approach Using Korean Social Media Data
Abstract
Objectives
This study examined the symptoms and emotions expressed by older adults with cancer and their caregivers in South Korean online cancer communities. It aimed to identify narrative patterns and provide insights to inform personalized care strategies.
Methods
We analyzed 6,908 user-generated posts collected from major online cancer communities in South Korea. Keyword frequency analysis, term frequency-inverse document frequency, 2-gram analysis, and latent Dirichlet allocation-based topic modeling were applied to explore language patterns. Sentiment analysis identified 12 emotional categories, and Pearson correlation coefficients were calculated to examine associations between symptoms and emotional expressions. All data were cleaned and standardized prior to analysis.
Results
Many users expressed anxiety (20.63%) and depression (19.59%), frequently associated with chemotherapy and sleep disturbances. Among reported symptoms, sleep problems carried the highest negative sentiment (79.81%), underscoring their profound impact on well-being. Topic modeling consistently revealed seven recurring themes, including treatment decision-making, symptom management, and concerns about family, demonstrating the layered and personalized experiences of older cancer patients and their caregivers.
Conclusions
This study explored treatment-related and symptom-related difficulties faced by older adults with cancer. Many reported significant emotional strain, especially anxiety, depression, and sleep disturbances. These findings highlight the necessity for supportive strategies addressing both psychological and physical aspects of care. Future research could investigate the utility of large language models in analyzing these narratives, provided the data are ethically managed and appropriate for such use.
I. Introduction
Age is a well-established risk factor for cancer, and cancer prevalence is increasing with population aging [1,2]. In 2020, the cancer incidence rate among South Koreans aged 65 and older was 1,552 per 100,000 people, significantly higher than in younger age groups [3]. This trend is expected to impose considerable social and economic burdens on healthcare systems.
Advances in medical technology have significantly expanded therapeutic options and improved life expectancy for older adults with cancer. However, many still experience substantial distress throughout their cancer journey, contributing to a high prevalence of frailty in this population [4]. Additionally, older adults often report negative emotions following a cancer diagnosis [5], and these emotional responses, together with frailty, are strongly associated with diminished quality of life [6].
Caregivers support older adults with cancer throughout diagnosis, treatment, and survivorship stages, frequently experiencing fear of death and significant caregiving burdens [7,8]. Thus, it is essential to consider both patient and caregiver well-being when planning supportive care. Nevertheless, large-scale studies focusing specifically on this population, especially those employing unstructured, real-world data such as online narratives, remain limited. Existing Korean research has primarily concentrated on digital literacy, treatment decision-making, and end-of-life planning [9,10], with relatively little attention to analyzing physical and psychological symptoms using natural language processing (NLP) techniques.
As unstructured digital text data become increasingly available, NLP and text mining techniques have emerged as valuable tools for health research [11,12]. By employing sentiment analysis and topic modeling, these approaches can uncover symptoms, emotional patterns, and thematic concerns. Advances in NLP could enhance the efficiency and scope of oncology research, potentially transforming clinical practice [13].
Therefore, this study aimed to explore the symptoms and emotions expressed by older adults with cancer and their caregivers by applying NLP and text mining techniques to posts from online cancer communities in South Korea. The insights gained are intended to guide the development of person-centered nursing interventions for this population.
II. Methods
1. Study Design
The current study employed an NLP and text mining approach to analyze symptom expressions and emotional states in Korean-language online posts authored by older cancer patients and their caregivers. The data were collected from major social media platforms in South Korea.
2. Data Collection
Data were collected from online cancer communities on prominent South Korean platforms such as Naver and Daum, covering the period from January 2010 to October 2024. These communities included forums dedicated to specific cancer types (e.g., breast, lung, colorectal) and general platforms where patients and caregivers discussed their treatment experiences and feelings.
To develop a corpus suitable for NLP, we implemented a text mining strategy focusing on posts by or concerning older adults with cancer, utilizing search terms reflective of this group’s experiences. Examples included “seniors diagnosed with colorectal cancer,” “cancer treatment in the elderly,” and “experiences following cancer treatment.” Python libraries Selenium and BeautifulSoup4 were used for data scraping, adhering to ethical guidelines and minimizing server load. Personal identifiers, such as usernames, were removed from the dataset.
From an initial set of 8,789 posts, we curated a refined corpus of 6,908 posts by removing duplicates, promotional content, and overly brief entries. Posts were categorized into 11 topic-based Excel sheets (e.g., “elderly cancer patients,” “treatment experiences”), informed by previous research highlighting functional and comorbid considerations among older cancer patients [14].
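The filtering step described above could look roughly like the following minimal sketch. The minimum length, the promotional keywords, and the function name clean_corpus are illustrative assumptions, not the criteria actually used in the study.

```python
import re

def clean_corpus(posts, min_chars=30, promo_terms=("discount", "coupon")):
    """Drop duplicates, promotional posts, and very short posts.

    min_chars and promo_terms are hypothetical thresholds for illustration.
    """
    seen = set()
    kept = []
    for text in posts:
        # Collapse runs of whitespace so near-identical copies compare equal
        normalized = re.sub(r"\s+", " ", text).strip()
        if len(normalized) < min_chars:
            continue  # overly brief entry
        lowered = normalized.lower()
        if any(term in lowered for term in promo_terms):
            continue  # promotional content
        if lowered in seen:
            continue  # duplicate of an earlier post
        seen.add(lowered)
        kept.append(normalized)
    return kept

# Hypothetical example posts
posts = [
    "My father is 78 and starts chemotherapy next week; we are worried.",
    "My father is 78 and  starts chemotherapy next week; we are worried.",
    "Thanks!",
    "Special discount on supplements for cancer patients, click here now.",
]
cleaned = clean_corpus(posts)
```

In this toy input, only the first post survives: the second is a whitespace variant of it, the third is too short, and the fourth matches a promotional keyword.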
To prevent overrepresentation of particular cancer types, we assessed the distribution of cancer discussions across forums. Although breast, lung, and prostate cancers were frequently discussed, stratification reviews confirmed balanced representation. NLP techniques, including topic modeling, were applied to the entire dataset to extract prevalent symptoms and emotional patterns.
Data processing was independently reviewed by a nursing professor and a data scientist, with discrepancies resolved through consensus. Demographic and clinical details varied due to the public nature of the posts. Although some authors explicitly identified themselves as patients or caregivers, many did not, precluding systematic classification. This limitation, along with potential engagement bias, was recognized in the analysis.
3. Data Analysis
The data analysis comprised several systematic steps designed to comprehensively examine the text data.
1) Text preprocessing
For processing Korean text, we employed the Okt tokenizer from the KoNLPy library [15], optimized for Korean language analysis. Stopwords—such as particles, verb endings, and general terms—were removed to improve the analytical accuracy. Additionally, spacing inconsistencies in synonymous terms (e.g., variations of “chemotherapy”) were standardized using regular expressions.
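As a rough illustration of the spacing standardization, the sketch below maps spaced and unspaced variants of a term onto one canonical form with Python's re module. The two terms shown (항암치료, "anticancer treatment", and 방사선치료, "radiotherapy") and the helper name standardize_spacing are assumptions chosen for the example; the study's actual pattern list is not given in the text.

```python
import re

# Hypothetical variant table: canonical form -> pattern matching
# the term written with or without internal spaces
VARIANTS = {
    "항암치료": re.compile(r"항암\s*치료"),
    "방사선치료": re.compile(r"방사선\s*치료"),
}

def standardize_spacing(text):
    """Rewrite every spacing variant of a known term to its canonical form."""
    for canonical, pattern in VARIANTS.items():
        text = pattern.sub(canonical, text)
    return text

example = standardize_spacing("어머니가 항암 치료와 방사선  치료를 받으셨어요")
```

Applying this before tokenization means that later frequency counts treat "항암 치료" and "항암치료" as the same term.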
2) Extraction and categorization of treatment and symptoms
We developed a custom symptom-treatment dictionary encompassing 22 categories, including “surgery,” “chemotherapy,” “radiotherapy,” and “sleep disorders.” To extract relevant mentions, substring matching was conducted using the str.contains() function from the pandas library (version 2.2.3) [15]. All applicable categories were recorded when multiple symptoms or treatments appeared in a single post, enabling comprehensive classification and frequency analysis.
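The multi-label matching logic can be sketched in plain Python as a stand-in for the pandas substring matching. The three-category dictionary, its keywords, and the function name tag_categories are hypothetical; the study's dictionary has 22 categories whose keywords are not listed here.

```python
from collections import Counter

# Hypothetical mini-dictionary: category -> keywords
SYMPTOM_DICT = {
    "chemotherapy": ["chemotherapy", "chemo"],
    "surgery": ["surgery", "operation"],
    "sleep disorders": ["insomnia", "can't sleep"],
}

def tag_categories(text, dictionary=SYMPTOM_DICT):
    """Return every category whose keywords occur as substrings of the post."""
    lowered = text.lower()
    return [
        category
        for category, keywords in dictionary.items()
        if any(keyword in lowered for keyword in keywords)
    ]

# Hypothetical example posts
posts = [
    "After chemo my mother has insomnia every night.",
    "The surgery went well.",
    "Which hospital would you recommend?",
]
tags = [tag_categories(post) for post in posts]

# Category frequencies and the mean number of categories per post
category_freq = Counter(category for post_tags in tags for category in post_tags)
mean_per_post = sum(len(post_tags) for post_tags in tags) / len(posts)
```

Because a post can match several categories at once, the per-category counts can sum to more than the number of posts, which is why a mean number of categories per post is a separate statistic.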
Additionally, keyword frequency and bigram (2-gram) analyses identified commonly used terms and co-occurring word pairs within the corpus. These analyses provided insights extending beyond symptoms and treatments, highlighting recurring language patterns and frequent word associations. In some cases, high-frequency function words such as “because,” “now,” and “other” were retained due to their contextual significance in conveying causal explanations, temporal states, or referencing additional symptoms or treatments. Despite their generality, these words carried significant emotional and narrative implications in user expressions.
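A bigram count of this kind can be sketched with the standard library alone. The token lists below are hypothetical and assumed to be already tokenized and stopword-filtered; in the study that step was done with the Okt tokenizer.

```python
from collections import Counter

def bigram_counts(token_lists):
    """Count adjacent token pairs within each post (never across posts)."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Hypothetical pre-tokenized posts
tokenized = [
    ["surgery", "cost", "treatment", "method"],
    ["treatment", "method", "question"],
    ["surgery", "cost"],
]
counts = bigram_counts(tokenized)
top_pairs = counts.most_common(2)
```

Pairing each token list with itself shifted by one position yields the adjacent pairs, and counting per post keeps pairs from spanning post boundaries.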
3) Sentiment analysis
Sentiment analysis was performed using a predefined system of 12 emotional categories: fear, anger, anxiety, loss, depression, frustration, gratitude, determination, acceptance, relief, calmness, and hope. Two researchers independently annotated emotional content, resolving disagreements by consensus. Sentiment scores were normalized to a 0–1 scale, where 0 indicated “not present at all” and 1 represented “strongly felt,” ensuring consistent and interpretable emotion ratings.
4) Topic modeling
Topic modeling employed the latent Dirichlet allocation (LDA) algorithm from the gensim library. Optimal topic numbers were identified by calculating coherence and perplexity scores across a range of topics (2–10). Preprocessing and modeling were conducted using Python (version 3.11) with libraries including KoNLPy, gensim, scikit-learn, and re.
Korean-language texts were preprocessed using the Okt analyzer from KoNLPy. Stopwords, including topic/subject markers, verb endings (e.g., “do,” “be”), and general functional terms (e.g., “about,” “subject,” “content”), were removed. Regular expressions standardized spacing inconsistencies in semantically equivalent terms (e.g., “anticancer treatment” as one or two words).
Key topic terms were extracted using term frequency-inverse document frequency (TF-IDF) values computed via the TfidfVectorizer function from the scikit-learn library (version 1.5.2) [16]. Topics lacking a clearly dominant term distribution were labeled qualitatively. Two researchers reviewed the top 10 keywords per topic and assigned thematic labels by consensus.
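To make the weighting concrete, the following plain-Python sketch reproduces the smoothed TF-IDF formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. It is a simplified stand-in rather than the study's pipeline; the toy documents and the function name tfidf are assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute smoothed, L2-normalized TF-IDF weights for tokenized documents."""
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)
        raw = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weights.append({term: v / norm for term, v in raw.items()})
    return weights

# Hypothetical pre-tokenized posts
docs = [
    ["treatment", "surgery", "cost"],
    ["treatment", "chemotherapy"],
    ["treatment", "immunity", "exercise"],
]
weights = tfidf(docs)
```

Here "treatment" appears in every document, so its inverse document frequency is the minimum value of 1 and it is weighted below the rarer "surgery" within the first document. The same mechanism lets less frequent but distinctive terms rank highly.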
5) Sentiment correlation analysis of symptoms and treatments
Pearson correlation analysis explored associations between symptoms and sentiment. Correlation coefficients and corresponding p-values were calculated for each symptom or treatment category to determine statistical significance (p < 0.05).
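A minimal sketch of the correlation computation follows, assuming that each post carries a binary indicator for a symptom category and a numeric sentiment score. The coding of sentiment as -1, 0, and 1 and the example values are hypothetical; the text does not specify the encoding used.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical data: 1 = post mentions sleep disorders, 0 = it does not
mentions_sleep = [1, 1, 0, 0, 1, 0, 0, 0]
# Hypothetical sentiment coding: -1 negative, 0 neutral, 1 positive
sentiment = [-1, -1, 0, 1, -1, 0, 1, 0]

r = pearson_r(mentions_sleep, sentiment)
t = t_statistic(r, len(sentiment))
```

The p-value is then obtained from the t distribution with n - 2 degrees of freedom. In practice a library routine such as scipy.stats.pearsonr returns both the coefficient and the p-value, although the text does not state which routine was used.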
4. Ethical Considerations
We adhered strictly to ethical guidelines while utilizing publicly shared data. Posts originated from open cancer forums on South Korean platforms (e.g., Naver, Daum), where users voluntarily shared experiences. All data were anonymized, excluding personally identifiable information. The study complied with legal requirements and each platform’s policies, including those outlined in the Personal Information Protection Act.
To mitigate the risk of misinterpreting context-dependent social media content, two independent researchers reviewed sentiment and symptom categorizations, resolving discrepancies through consensus. Careful attention was given to preserving the intended meaning of each post.
Although the data were publicly accessible, ethical oversight was maintained throughout the study. Informed consent was not required, as the research involved no direct human subjects and met exemption criteria. We acknowledge ongoing ethical discussions concerning social media research and support continued efforts to refine best practices for digital health studies.
III. Results
1. Overview of the Dataset
A descriptive analysis was conducted to examine the characteristics of the dataset, which comprised a total of 6,907 unique texts (Table 1). Text lengths ranged from 13 to 383 words, with a mean of 194.8 words and a median of 205 words, indicating relatively consistent text lengths. The dataset contained 306,914 total words and 87,409 unique terms, reflecting a combination of medical terminology and everyday language typically found in healthcare-related discourse.
2. Keyword Analysis Using Frequency and TF-IDF
Keyword analysis was performed by assessing term frequency and contextual relevance using TF-IDF (Table 2). Frequently occurring terms included “older adults,” “surgery,” and “chemotherapy,” while TF-IDF analysis emphasized contextually significant yet less frequent terms, such as “premature ejaculation” and “hepatocellular carcinoma.”

Table 2. Frequency and TF-IDF (term frequency-inverse document frequency) analysis of keywords in the current dataset.
General terms like “research” and “treatment” demonstrated lower TF-IDF scores due to their broader usage, whereas terms such as “symptoms” and “diagnosis” exhibited higher TF-IDF scores despite their lower overall frequency. To confirm the relevance of specific sexual health-related terms, sample posts were manually reviewed. Most were related to prostate cancer treatments, and unrelated content was eliminated during preprocessing.
3. 2-Gram Analysis of Frequent Word Pairs
A 2-gram analysis was conducted to identify commonly co-occurring word pairs within the dataset, grouped into five thematic categories (Table 3, Figure 1). Particularly frequent were pairs such as “erectile dysfunction treatment” (352 occurrences), “premature ejaculation treatment” (343 occurrences), and “treatment method” (224 occurrences). The phrase “surgery cost” appeared in both the “treatment/surgery” and “cost/insurance” thematic categories.
4. Categorization of Symptoms and Treatment in the Dataset
Of the 6,907 texts analyzed, 3,836 (55.54%) explicitly mentioned symptoms and treatments, indicating that over half of the posts referenced these topics. An average of 0.85 symptom or treatment categories was identified per post, with up to five distinct categories appearing in a single entry.
The most frequent category was “chemotherapy” (1,828 instances, 26.46%), followed by “radiotherapy” (8.64%), “sleep disorders” (6.09%), and “surgery” (6.08%) (Figure 2A, Table 4). Psychosocial issues (5.59%) were also commonly noted, with “depression” (155 occurrences), “distress” (143 occurrences), and “anxiety” (99 occurrences) standing out. Other symptoms appearing frequently included pain, cardiopulmonary symptoms, immune issues, and sexual dysfunction, in descending order of frequency.

Figure 2. Distribution of symptom, treatment, and sentiment categories in online cancer community posts: (A) frequency of symptoms and treatments, (B) frequency of emotional sentiments.
5. Seven Key Topics Identified from Topic Modeling
To identify the optimal number of topics for LDA, coherence and perplexity scores were evaluated for 2 to 10 potential topics [17]. Coherence peaked at seven topics (0.62), and perplexity was lowest at four topics (−6.86). A seven-topic model was ultimately selected, balancing semantic clarity and thematic coverage. Coherence scores declined beyond seven topics, suggesting semantic redundancy (Figure 3).
Topic modeling (Table 5) identified seven themes: (1) common questions about cancer among older adults; (2) decision-making related to cancer diagnosis and treatment; (3) cancer diagnosis and treatment considerations; (4) new cancer therapies and associated risks; (5) immune-function strategies for older adults with cancer; (6) symptom management among older male cancer patients; and (7) symptom management among older female cancer patients.
6. Sentiment Analysis of Older Adults with Cancer
Sentiment analysis of 6,908 texts revealed that anxiety (1,425 occurrences, 20.63%) and depression (1,353 occurrences, 19.59%) were the most frequently expressed emotions (Figure 2B, Table 6). Anxiety was exemplified by statements such as, “My mother says she’s terrified of chemotherapy and fears she won’t endure it. I’m also scared—what if my desire only makes her suffer more?” Similarly, depression was evident in statements like, “This overwhelming situation makes me feel unprepared. I suppose I’ll have to say goodbye soon.” Loss (761 occurrences, 11.02%) and anger (599 occurrences, 8.67%) were also prevalent. Loss was reflected in statements such as, “She’ll turn 80 next year, and with late-stage pancreatic cancer, what meaning does chemotherapy even have?” Anger was conveyed through statements including, “My mom quietly goes to the hospital for excessive treatments to avoid bothering her children. It’s frustrating; though I love her, I feel irritated and upset.” In contrast, positive emotions such as calmness (751 occurrences, 10.87%), hope (480 occurrences, 6.09%), and gratitude (369 occurrences, 5.34%) were relatively infrequent.
To ensure the reliability of sentiment annotations, inter-rater agreement was calculated using Cohen’s kappa coefficient, yielding a value of 0.75, indicative of substantial agreement [18].
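For readers unfamiliar with the statistic, Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance from each rater's label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two hypothetical annotators; the labels are invented for illustration and do not reproduce the study's annotations.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label proportions
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations of ten posts
rater_a = ["anxiety"] * 4 + ["depression"] * 3 + ["hope"] * 3
rater_b = ["anxiety"] * 3 + ["depression"] * 4 + ["hope"] * 2 + ["anxiety"]

kappa = cohens_kappa(rater_a, rater_b)
```

In this toy case the raters agree on 8 of 10 posts (p_o = 0.80) and chance agreement is 0.34, so kappa falls below the raw agreement rate. This is why kappa is reported in place of simple percent agreement.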
7. Sentiment Analysis of Treatment and Symptoms among Older Adults with Cancer
Pearson correlation analysis revealed weak but statistically significant relationships between certain symptoms and sentiment (Table 7). Sleep disorders exhibited a negative correlation with sentiment, while radiotherapy, respiratory, circulatory, and gastrointestinal issues were associated primarily with neutral sentiments (p < 0.05). Most symptoms lacked strong correlations, reflecting a predominantly neutral tone, possibly indicative of informational or inquiry-based content.
Chemotherapy (1,828 instances) had a high negative sentiment ratio (65.81%) (Table 4), as did surgery (69.52%) and radiotherapy (54.61%). Sleep disorders showed the highest negative sentiment overall (79.81%).
Psychosocial issues (386 instances) demonstrated emotional variability, with 49.74% negative and 20.73% positive sentiments, indicating simultaneous distress and support. Pain (273 instances) and respiratory symptoms (248 instances) also showed predominantly negative sentiment. In contrast, immunity issues and sexual dysfunction were predominantly associated with neutral sentiments (31.82% and 54.77%, respectively), with sexual dysfunction showing the highest overall neutrality.
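The per-category sentiment ratios reported above can be derived from a simple cross-tabulation. The sketch below assumes that each post has already been assigned a list of symptom or treatment categories and a single polarity label; how the 12 emotion categories were mapped to negative, neutral, and positive is not described in the text, so the labels and records here are hypothetical.

```python
from collections import Counter, defaultdict

def sentiment_ratios(records):
    """Percentage of each polarity label among posts tagged with each category."""
    counts = defaultdict(Counter)
    for categories, polarity in records:
        for category in categories:
            counts[category][polarity] += 1
    return {
        category: {
            polarity: 100 * count / sum(polarity_counts.values())
            for polarity, count in polarity_counts.items()
        }
        for category, polarity_counts in counts.items()
    }

# Hypothetical records: (categories found in the post, polarity of the post)
records = [
    (["sleep disorders"], "negative"),
    (["sleep disorders", "chemotherapy"], "negative"),
    (["chemotherapy"], "neutral"),
    (["sleep disorders"], "neutral"),
    (["chemotherapy"], "positive"),
    (["sleep disorders"], "negative"),
]
ratios = sentiment_ratios(records)
```

A post tagged with several categories contributes to each of them, so the ratios are computed within a category and are not expected to sum to 100% across categories.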
IV. Discussion
Despite advances in medical treatments, the emotional strain on older adults with cancer and their caregivers remains substantial. This study explored symptom expressions and emotional experiences among older adults with cancer and their caregivers through text mining and sentiment analysis of numerous online community posts.
The study used NLP techniques, including keyword frequency, TF-IDF, and 2-gram analyses, to identify recurring concerns and emotions within user narratives. Frequently mentioned terms such as “older adults,” “surgery,” and disease-specific terminology reflected clinical priorities, while references to “immunity” and “exercise” indicated interests related to health maintenance. These results emphasize the importance of patient-centered approaches addressing emotional, informational, and physical needs.
Our findings highlighted frequent mentions of male sexual dysfunction issues, notably treatments for premature ejaculation and erectile dysfunction. These concerns likely relate to other commonly cited terms such as surgery and colon cancer, as postoperative sexual dysfunction—including erectile dysfunction and ejaculatory disorders—is prevalent among male colorectal cancer patients [19]. In South Korea, older patients and their families often find discussing sexual health difficult, which might prompt them to seek related information online, possibly explaining the high frequency of these terms in our dataset.
The TF-IDF and 2-gram analyses identified terms that commonly co-occurred and carried significant contextual meaning. Expressions such as “erectile dysfunction treatment,” “surgery cost,” and “treatment reviews” indicated patient concerns about health outcomes and suggested reliance on shared patient experiences for treatment decisions [20]. These findings underline the necessity for comprehensive, patient-centered care that addresses both medical and psychosocial dimensions.
Seven major topics related to cancer diagnosis, treatment decisions, symptom experiences, and exploration of new therapies emerged from topic modeling. The results suggest that older adults with cancer and their caregivers frequently turn to online communities for practical information and emotional support, which may not be fully provided during brief clinical interactions. Given the rising digital health engagement among older populations [21], expanding online support interventions may help fulfill their care needs and bridge gaps within current healthcare services.
Sentiment analysis indicated significant psychological distress among older adults with cancer, underscoring the critical need for emotional support. Anxiety and depression emerged as the most prevalent emotions, aligning with previous studies that identified the psychological impacts of cancer treatment, including concerns about side effects and uncertainty regarding outcomes [22,23]. Furthermore, our findings highlighted the substantial negative impact of sleep disturbances, which are known to exacerbate discomfort and reduce overall quality of life among cancer patients [24,25]. Thus, tailored interventions targeting improvements in sleep quality and emotional support are urgently needed.
Previous NLP research has examined emotional narratives related to cancer. For example, an analysis of lung cancer records found strong negative emotions linked to physical decline and concerns about treatment outcomes, consistent with our findings on anxiety and distress [26]. Another complementary study analyzing Reddit posts identified diverse emotional responses, including fear, sadness, and even unexpected moments of hope and joy [27]. In contrast, our study employed clearly defined NLP tools such as TF-IDF, 2-gram, and LDA analyses, enhancing our ability to systematically capture and understand emotional patterns in online communities.
Two complementary analyses were conducted to clarify emotional patterns associated with symptoms. While most symptoms exhibited minimal correlation with sentiment, sleep disorders showed a weak but statistically significant association with negative sentiment. Psychosocial issues were predominantly associated with negative emotions, yet the considerable presence of neutral and positive responses suggested a more complex and dynamic emotional landscape [28]. Healthcare providers should recognize this variability, fostering emotional validation and supporting positive coping strategies.
Although several correlations between symptoms and sentiment were statistically significant (p < 0.05), their overall associations were generally weak. This might reflect the inherent limitations of user-generated content, which tends to be brief, contextually variable, and structurally inconsistent. These results should thus be interpreted cautiously, as they indicate general patterns rather than clear causal relationships. As noted in previous studies, individuals often describe the connection between physical symptoms and emotions as ambiguous or unique to each person [29]. Even individuals experiencing identical symptoms can differ markedly in how they are emotionally affected [30]. Future research could incorporate structured data collection methods or qualitative techniques to capture more detailed personal experiences.
This study has several limitations. Given the nature of social media, much of the content analyzed originated from family members caring for older adults. Consequently, we might not fully capture older adults’ direct perspectives, potentially omitting essential aspects of their personal experiences.
While online sources offer valuable insights into people’s experiences, they also have inherent limitations. For example, perspectives of individuals without internet access are not represented, restricting the diversity of viewpoints captured. This limitation highlights the need for comprehensive research incorporating varied perspectives from older adults concerning their health and care experiences.
Despite these limitations, the study provides meaningful insights into the experiences of older adults with cancer and their caregivers regarding symptom burden and emotional challenges. Future research could more clearly distinguish between patient and caregiver experiences by integrating structured data from electronic health records or through questionnaire-based methodologies. Additionally, advancements in NLP may facilitate automated speaker identification, enabling detailed subgroup analyses. Such methodological improvements could substantially enhance our understanding of this population’s unique needs. Finally, our findings establish a foundational basis for developing psychological support interventions, and future research could further guide customized symptom management strategies suitable for clinical application.
Notes
Conflict of Interest
No potential conflict of interest relevant to this article was reported.
Acknowledgments
This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. RS-2024-00346310).