I. Introduction
Age is a well-established risk factor for cancer, and cancer prevalence is increasing with population aging [1, 2]. In 2020, the cancer incidence rate among South Koreans aged 65 and older was 1,552 per 100,000 people, significantly higher than in younger age groups [3]. This trend is expected to impose considerable social and economic burdens on healthcare systems.
Advances in medical technology have significantly expanded therapeutic options and improved life expectancy for older adults with cancer. However, many still experience substantial distress throughout their cancer journey, contributing to a high prevalence of frailty in this population [4]. Additionally, older adults often report negative emotions following a cancer diagnosis [5], and these emotional responses, together with frailty, are strongly associated with diminished quality of life [6].
Caregivers support older adults with cancer throughout diagnosis, treatment, and survivorship stages, frequently experiencing fear of death and significant caregiving burdens [7, 8]. Thus, it is essential to consider both patient and caregiver well-being when planning supportive care. Nevertheless, large-scale studies focusing specifically on this population, especially those employing unstructured, real-world data such as online narratives, remain limited. Existing Korean research has primarily concentrated on digital literacy, treatment decision-making, and end-of-life planning [9, 10], with relatively little attention to analyzing physical and psychological symptoms using natural language processing (NLP) techniques.
As unstructured digital text data become increasingly available, NLP and text mining techniques have emerged as valuable tools for health research [11, 12]. By employing sentiment analysis and topic modeling, these approaches can uncover symptoms, emotional patterns, and thematic concerns. Advances in NLP could enhance the efficiency and scope of oncology research, potentially transforming clinical practice [13].
Therefore, this study aimed to explore the symptoms and emotions expressed by older adults with cancer and their caregivers by applying NLP and text mining techniques to posts from online cancer communities in South Korea. The insights gained are intended to guide the development of person-centered nursing interventions for this population.
II. Methods
1. Study Design
The current study employed an NLP and text mining approach to analyze symptom expressions and emotional states in Korean-language online posts authored by older cancer patients and their caregivers. The data were collected from major social media platforms in South Korea.
2. Data Collection
Data were collected from online cancer communities on prominent South Korean platforms such as Naver and Daum, covering the period from January 2010 to October 2024. These communities included forums dedicated to specific cancer types (e.g., breast, lung, colorectal) and general platforms where patients and caregivers discussed their treatment experiences and feelings.
To develop a corpus suitable for NLP, we implemented a text mining strategy focusing on posts by or concerning older adults with cancer, utilizing search terms reflective of this group’s experiences. Examples included “seniors diagnosed with colorectal cancer,” “cancer treatment in the elderly,” and “experiences following cancer treatment.” Python libraries Selenium and BeautifulSoup4 were used for data scraping, adhering to ethical guidelines and minimizing server load. Personal identifiers, such as usernames, were removed from the dataset.
From an initial set of 8,789 posts, we curated a refined corpus of 6,908 posts by removing duplicates, promotional content, and overly brief entries. Posts were categorized into 11 topic-based Excel sheets (e.g., “elderly cancer patients,” “treatment experiences”), informed by previous research highlighting functional and comorbid considerations among older cancer patients [14].
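The curation step described above can be sketched as follows. This is a minimal illustration only: the function name `clean_corpus`, the minimum-length threshold, and the promotional keywords are hypothetical, as the study does not report the exact filtering rules.

```python
def clean_corpus(posts, min_chars=20, promo_terms=("discount", "click here")):
    """Drop overly brief, promotional, and duplicate posts.

    min_chars and promo_terms are illustrative assumptions, not the
    thresholds or keywords used in the study.
    """
    seen = set()
    kept = []
    for text in posts:
        # Collapse runs of whitespace so near-identical copies compare equal.
        normalized = " ".join(text.split())
        if len(normalized) < min_chars:
            continue  # overly brief entry
        lowered = normalized.lower()
        if any(term in lowered for term in promo_terms):
            continue  # promotional content
        if normalized in seen:
            continue  # duplicate of an earlier post
        seen.add(normalized)
        kept.append(normalized)
    return kept
```

Filtering before deduplication keeps the set of seen texts small, although the order of the three checks does not change which posts survive.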
To prevent overrepresentation of particular cancer types, we assessed the distribution of cancer discussions across forums. Although breast, lung, and prostate cancers were frequently discussed, stratification reviews confirmed balanced representation. NLP techniques, including topic modeling, were applied to the entire dataset to extract prevalent symptoms and emotional patterns.
Data processing was independently reviewed by a nursing professor and a data scientist, with discrepancies resolved through consensus. Demographic and clinical details varied due to the public nature of the posts. Although some authors explicitly identified themselves as patients or caregivers, many did not, precluding systematic classification. This limitation, along with potential engagement bias, was recognized in the analysis.
3. Data Analysis
The data analysis comprised several systematic steps designed to comprehensively examine the text data.
1) Text preprocessing
For processing Korean text, we employed the Okt tokenizer from the KoNLPy library [15], optimized for Korean language analysis. Stopwords, such as particles, verb endings, and general terms, were removed to improve analytical accuracy. Additionally, spacing inconsistencies in synonymous terms (e.g., variations of “chemotherapy”) were standardized using regular expressions.
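A minimal sketch of the spacing standardization and stopword removal is given below. It assumes the tokens have already been produced by a tokenizer such as Okt (not called here), and the rule table and stopword set are hypothetical samples rather than the lists used in the study.

```python
import re

# Hypothetical rules: regex for a spaced variant -> canonical unspaced form.
SPACING_RULES = {
    r"항암\s+치료": "항암치료",      # "anticancer treatment" written as two words
    r"방사선\s+치료": "방사선치료",  # "radiation therapy" written as two words
}

# Hypothetical sample of Korean particles treated as stopwords.
STOPWORDS = {"은", "는", "이", "가", "을", "를"}


def standardize_spacing(text):
    """Rewrite spaced variants of a term to a single canonical spelling."""
    for pattern, canonical in SPACING_RULES.items():
        text = re.sub(pattern, canonical, text)
    return text


def remove_stopwords(tokens):
    """Drop stopword tokens; tokens are assumed to come from a tokenizer."""
    return [token for token in tokens if token not in STOPWORDS]
```

Applying `standardize_spacing` before tokenization means both spellings of a term are counted as one item in later frequency analyses.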
2) Extraction and categorization of treatment and symptoms
We developed a custom symptom-treatment dictionary encompassing 22 categories, including “surgery,” “chemotherapy,” “radiotherapy,” and “sleep disorders.” To extract relevant mentions, substring matching was conducted using the str.contains() function from the pandas library (version 2.2.3) [15]. All applicable categories were recorded when multiple symptoms or treatments appeared in a single post, enabling comprehensive classification and frequency analysis.
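The dictionary-based matching can be illustrated with a short sketch. The three categories and their English keywords below are hypothetical stand-ins for the 22-category Korean dictionary, and `tag_categories` is an illustrative name, not the study's code.

```python
import re

import pandas as pd

# Hypothetical excerpt of a symptom-treatment dictionary: category -> keywords.
SYMPTOM_DICT = {
    "chemotherapy": ["chemotherapy", "chemo"],
    "sleep disorders": ["insomnia", "cannot sleep"],
    "surgery": ["surgery", "operation"],
}


def tag_categories(posts: pd.Series) -> pd.DataFrame:
    """Return one boolean column per category; a post may match several."""
    flags = {}
    for category, keywords in SYMPTOM_DICT.items():
        # Escape keywords so they are matched literally, then OR them together.
        pattern = "|".join(re.escape(keyword) for keyword in keywords)
        flags[category] = posts.str.contains(
            pattern, case=False, regex=True, na=False
        )
    return pd.DataFrame(flags)
```

Summing each column of the returned frame gives the per-category frequency, while each row preserves every category that applies to a single post.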
Additionally, keyword frequency and bigram (2-gram) analyses identified commonly used terms and co-occurring word pairs within the corpus. These analyses provided insights extending beyond symptoms and treatments, highlighting recurring language patterns and frequent word associations. In some cases, high-frequency function words such as “because,” “now,” and “other” were retained due to their contextual significance in conveying causal explanations, temporal states, or referencing additional symptoms or treatments. Despite their generality, these words carried significant emotional and narrative implications in user expressions.
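A bigram count of the kind described here can be sketched with the standard library alone. The function name `bigram_counts` is illustrative, and the input is assumed to be posts that have already been tokenized and filtered.

```python
from collections import Counter


def bigram_counts(tokenized_posts):
    """Count adjacent token pairs (2-grams) within each post."""
    counts = Counter()
    for tokens in tokenized_posts:
        # zip pairs each token with its successor; pairs never span two posts.
        counts.update(zip(tokens, tokens[1:]))
    return counts
```

Calling `most_common(k)` on the result lists the `k` most frequent co-occurring word pairs.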
3) Sentiment analysis
Sentiment analysis was performed using a predefined system of 12 emotional categories: fear, anger, anxiety, loss, depression, frustration, gratitude, determination, acceptance, relief, calmness, and hope. Two researchers independently annotated emotional content, resolving disagreements by consensus. Sentiment scores were normalized to a 0–1 scale, where 0 indicated “not present at all” and 1 represented “strongly felt,” ensuring consistent and interpretable emotion ratings.
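The study does not state how scores were rescaled. As one possible approach, the sketch below applies min-max scaling to raw annotation scores; the raw rating range passed as `lo` and `hi` is a hypothetical assumption.

```python
EMOTIONS = [
    "fear", "anger", "anxiety", "loss", "depression", "frustration",
    "gratitude", "determination", "acceptance", "relief", "calmness", "hope",
]


def normalize_scores(raw_scores, lo, hi):
    """Min-max scale raw emotion ratings to the 0-1 range.

    lo and hi are the bounds of the raw annotation scale (assumed here).
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return {
        emotion: (value - lo) / (hi - lo)
        for emotion, value in raw_scores.items()
    }
```

Under this scaling, the lowest raw rating maps to 0 (“not present at all”) and the highest to 1 (“strongly felt”).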
4) Topic modeling
Topic modeling employed the latent Dirichlet allocation (LDA) algorithm from the gensim library. Optimal topic numbers were identified by calculating coherence and perplexity scores across a range of topics (2–10). Preprocessing and modeling were conducted using Python (version 3.11) with libraries including KoNLPy, gensim, scikit-learn, and re.
Korean-language texts were preprocessed using the Okt analyzer from KoNLPy. Stopwords, including topic/subject markers, verb endings (e.g., “do,” “be”), and general functional terms (e.g., “about,” “subject,” “content”), were removed. Regular expressions standardized spacing inconsistencies in semantically equivalent terms (e.g., “anticancer treatment” as one or two words).
Key topic terms were extracted using term frequency-inverse document frequency (TF-IDF) values computed via the TfidfVectorizer function from the scikit-learn library (version 1.5.2) [16]. Topics lacking a clearly dominant term distribution were labeled qualitatively. Two researchers reviewed the top 10 keywords per topic and assigned thematic labels by consensus.
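To make the weighting concrete, the sketch below computes TF-IDF in plain Python using the smoothed formula that scikit-learn's TfidfVectorizer applies by default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization. It is an illustration of the technique rather than the study's code, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Return one {term: weight} dict per document (smoothed idf, L2 norm)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    rows = []
    for doc in tokenized_docs:
        weights = {term: count * idf[term] for term, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


def top_terms(row, k=10):
    """Return the k highest-weighted terms of one document."""
    return sorted(row, key=row.get, reverse=True)[:k]
```

Terms that appear in every document receive the minimum idf of 1, so rarer terms rise to the top of `top_terms` when raw counts are equal.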
5) Sentiment correlation analysis of symptoms and treatments
Pearson correlation analysis explored associations between symptoms and sentiment. Correlation coefficients and corresponding p-values were calculated for each symptom or treatment category to determine statistical significance (p < 0.05).
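The correlation step can be sketched in plain Python as follows. The example assumes each symptom or treatment category is coded as a 0/1 indicator per post and paired with a sentiment score, which is an assumption about the data layout. The p-value itself is omitted because it requires the t distribution; a statistics package such as SciPy's `pearsonr` can supply it.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: r = 0, with n - 2 degrees of freedom."""
    if abs(r) >= 1:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1 - r * r))
```

With several thousand posts, even a small coefficient can yield a large t statistic, so statistical significance and strength of association need to be read separately.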
4. Ethical Considerations
We adhered strictly to ethical guidelines while utilizing publicly shared data. Posts originated from open cancer forums on South Korean platforms (e.g., Naver, Daum), where users voluntarily shared experiences. All data were anonymized, excluding personally identifiable information. The study complied with legal requirements and each platform’s policies, including those outlined in the Personal Information Protection Act.
To mitigate the risk of misinterpreting context-dependent social media content, two independent researchers reviewed sentiment and symptom categorizations, resolving discrepancies through consensus. Careful attention was given to preserving the intended meaning of each post.
Although the data were publicly accessible, ethical oversight was maintained throughout the study. Informed consent was not required, as the research involved no direct human subjects and met exemption criteria. We acknowledge ongoing ethical discussions concerning social media research and support continued efforts to refine best practices for digital health studies.
IV. Discussion
Despite advances in medical treatments, the emotional strain on older adults with cancer and their caregivers remains substantial. This study explored symptom expressions and emotional experiences among older adults with cancer and their caregivers through text mining and sentiment analysis of numerous online community posts.
The study used NLP techniques, including keyword frequency, TF-IDF, and 2-gram analyses, to identify recurring concerns and emotions within user narratives. Frequently mentioned terms such as “older adults,” “surgery,” and disease-specific terminology reflected clinical priorities, while references to “immunity” and “exercise” indicated interests related to health maintenance. These results emphasize the importance of patient-centered approaches addressing emotional, informational, and physical needs.
Our findings highlighted frequent mentions of male sexual dysfunction issues, notably treatments for premature ejaculation and erectile dysfunction. These concerns likely relate to other commonly cited terms such as surgery and colon cancer, as postoperative sexual dysfunction, including erectile dysfunction and ejaculatory disorders, is prevalent among male colorectal cancer patients [19]. In South Korea, older patients and their families often find discussing sexual health difficult, which might prompt them to seek related information online, possibly explaining the high frequency of these terms in our dataset.
The TF-IDF and 2-gram analyses identified terms that commonly co-occurred and carried significant contextual meaning. Expressions such as “erectile dysfunction treatment,” “surgery cost,” and “treatment reviews” indicated patient concerns about health outcomes and suggested reliance on shared patient experiences for treatment decisions [20]. These findings underline the necessity for comprehensive, patient-centered care that addresses both medical and psychosocial dimensions.
Seven major topics related to cancer diagnosis, treatment decisions, symptom experiences, and exploration of new therapies emerged from topic modeling. The results suggest that older adults with cancer and their caregivers frequently turn to online communities for practical information and emotional support, which may not be fully provided during brief clinical interactions. Given the rising digital health engagement among older populations [21], expanding online support interventions may help fulfill their care needs and bridge gaps within current healthcare services.
Sentiment analysis indicated significant psychological distress among older adults with cancer, underscoring the critical need for emotional support. Anxiety and depression emerged as the most prevalent emotions, aligning with previous studies that identified the psychological impacts of cancer treatment, including concerns about side effects and uncertainty regarding outcomes [22, 23]. Furthermore, our findings highlighted the substantial negative impact of sleep disturbances, which are known to exacerbate discomfort and reduce overall quality of life among cancer patients [24, 25]. Thus, tailored interventions targeting improvements in sleep quality and emotional support are urgently needed.
Previous NLP research has examined emotional narratives related to cancer. For example, an analysis of lung cancer records found strong negative emotions linked to physical decline and concerns about treatment outcomes, consistent with our findings on anxiety and distress [26]. Another complementary study analyzing Reddit posts identified diverse emotional responses, including fear, sadness, and even unexpected moments of hope and joy [27]. In contrast, our study employed clearly defined NLP tools such as TF-IDF, 2-gram, and LDA analyses, enhancing our ability to systematically capture and understand emotional patterns in online communities.
Two complementary analyses were conducted to clarify emotional patterns associated with symptoms. While most symptoms exhibited minimal correlation with sentiment, sleep disorders showed a weak but statistically significant association with neutral sentiment. Psychosocial issues were predominantly associated with negative emotions, yet the considerable presence of neutral and positive responses suggested a more complex and dynamic emotional landscape [28]. Healthcare providers should recognize this variability, fostering emotional validation and supporting positive coping strategies.
Although several correlations between symptoms and sentiment were statistically significant (p < 0.05), their overall associations were generally weak. This might reflect the inherent limitations of user-generated content, which tends to be brief, contextually variable, and structurally inconsistent. These results should thus be interpreted cautiously, as they indicate general patterns rather than clear causal relationships. As noted in previous studies, individuals often describe the connection between physical symptoms and emotions as ambiguous or unique to each person [29]. Even individuals experiencing identical symptoms can differ markedly in how they are emotionally affected [30]. Future research could incorporate structured data collection methods or qualitative techniques to capture more detailed personal experiences.
This study has several limitations. Given the nature of social media, much of the content analyzed originated from family members caring for older adults. Consequently, we might not fully capture older adults’ direct perspectives, potentially omitting essential aspects of their personal experiences.
While online sources offer valuable insights into people’s experiences, they also have inherent limitations. For example, perspectives of individuals without internet access are not represented, restricting the diversity of viewpoints captured. This limitation highlights the need for comprehensive research incorporating varied perspectives from older adults concerning their health and care experiences.
Despite these limitations, the study provides meaningful insights into the experiences of older adults with cancer and their caregivers regarding symptom burden and emotional challenges. Future research could more clearly distinguish between patient and caregiver experiences by integrating structured data from electronic health records or through questionnaire-based methodologies. Additionally, advancements in NLP may facilitate automated speaker identification, enabling detailed subgroup analyses. Such methodological improvements could substantially enhance our understanding of this population’s unique needs. Finally, our findings establish a foundational basis for developing psychological support interventions, and future research could further guide customized symptom management strategies suitable for clinical application.