I. Introduction
Febrile diseases such as malaria, typhoid fever, human immunodeficiency virus (HIV) and acquired immune deficiency syndrome (AIDS), tuberculosis, respiratory tract infection, and urinary tract infection remain significant public health challenges, particularly in resource-scarce settings. These resource-constrained areas frequently lack adequate access to medical facilities and qualified healthcare professionals, resulting in delays in diagnosis and treatment. The diagnostic process is further complicated by overlapping symptoms shared among these diseases [1], underscoring the need for timely and accurate diagnoses to ensure appropriate treatments, avoid complications, and minimize disease spread or transmission risks.
Mobile health (mHealth) applications have emerged as a promising avenue for enhancing healthcare delivery and facilitating timely diagnostic services [2,3]. Artificial intelligence (AI) and machine learning (ML) have significantly improved diagnostic efficiency and accuracy in medical systems [4], and these techniques have been specifically applied in diagnosing febrile diseases [5–10].
However, the “black box” nature of ML models reduces the interpretability and comprehensibility of their underlying decision logic. In healthcare contexts, this opacity raises concerns about the reliability and transparency of diagnostic systems [11]. To address these issues and build trust, explainable AI (XAI) approaches have emerged to improve the interpretability of ML models [12,13]. The local interpretable model-agnostic explanations (LIME) method [14] is an XAI approach that locally approximates complex models with simpler, interpretable ones. Large language models (LLMs), such as generative pre-trained transformers (GPT), have demonstrated an enhanced ability to understand and produce human-like text [15]. Such models hold considerable potential for improving comprehension among lay healthcare users by translating complex diagnostic results into easily understandable language. This capability is particularly important in mHealth applications, where clear comprehension of diagnostic data is crucial for recommending appropriate treatment plans.
Although ML models are widely used for disease diagnosis, medical professionals often find it difficult to trust their predictions because the generated results offer limited explainability. Moreover, despite the success of LLMs in generating human-like text, medical diagnostics have not fully embraced their potential. Current diagnostic tools typically employ a single methodological approach rather than combining multiple methods.
This paper integrates the interpretability of LIME, the explainability of GPT-3.5, and the predictive capabilities of random forest (RF) into a mobile-based diagnostic application aimed at improving febrile disease diagnosis. By incorporating LIME, this study clarifies diagnostic predictions by visualizing the relative importance of each symptom, thereby enabling healthcare professionals to identify symptoms that most significantly influence predicted health conditions. The results are further enhanced by GPT-3.5, which provides a user-friendly interface by translating technical explanations into easily comprehensible language for non-specialists. In resource-scarce regions, the improved application will aid healthcare workers in diagnosing patient conditions rapidly and accurately, potentially leading to better health outcomes.
The remainder of this paper is organized as follows: Section II describes the methodology, presenting the proposed system framework, dataset description, preprocessing steps, diagnostic methods, and interpretability models designed to improve diagnostic explainability. Section III presents the study’s findings, evaluating the effectiveness of ML models, demonstrating how XAI provides insights into model decisions, and detailing the development of the mobile-based XAI system. Section IV concludes the study, identifies its limitations, and provides recommendations for future research directions and enhancements in diagnostic methodologies.
II. Methods
1. Proposed System Framework
A mobile-based diagnostic system for identifying febrile diseases was developed using agile software development methods, emphasizing iterative user interactions. The system is designed to be user-centered and flexible, facilitating ongoing enhancements based on user feedback and real-time data generation and transmission. The architecture of the proposed system is presented in Figure 1. It comprises key components including medical experts, healthcare workers, a diagnostic system, and a mobile device used for collecting patient information, vital signs, and symptoms. These data are analyzed and stored both locally and on the cloud to aid decision-making. Medical professionals provided patient data, which were subsequently preprocessed to ensure suitability for ML and LLM modeling.
The RF algorithm, which forms the basis of this tool, was selected due to its robustness in medical diagnosis and its resilience when working with complex datasets. Trained on an extensive dataset of clinical symptoms associated with febrile illnesses, the algorithm predicts the likelihood of febrile diseases based on patient inputs. To address interpretability concerns inherent in ML models, the system integrates LIME, which generates understandable explanations for the predictions made by the RF model, highlighting the symptoms most influential in each disease diagnosis.
Furthermore, GPT-3.5 was employed by the system to provide natural-language explanations of diagnostic results obtained from LIME. This feature ensures that users, regardless of their medical expertise, clearly understand the diagnostic outcomes. The iterative integration of GPT-3.5 and other system features was facilitated by agile development practices, allowing rapid adaptations to evolving user needs and requirements. Because the diagnostic tool is implemented as a mobile application, users can access it conveniently in diverse settings, especially in resource-scarce areas.
2. Dataset Description and Data Preprocessing
This study utilized 4,870 patient records from the New Frontiers in Research Fund project [16]. The dataset contained demographic information, patient symptoms, risk factors, suspected diagnoses, further investigations, and confirmed diagnoses. Symptom severity was rated using a 5-point scale (5 = very severe, 4 = severe, 3 = moderate, 2 = mild, and 1 = absent). Additionally, the dataset included susceptibility to non-clinical risk factors and medical experts’ suspected diagnoses. Further investigations, such as blood film examination, serology, and full blood counts, were conducted before confirmed diagnoses were established. The severity of both suspected and confirmed diagnoses was rated on a linguistic scale (6 = very high; 5 = high; 4 = moderate; 3 = low; 2 = very low; and 1 = absent). The dataset included the following diseases: malaria, HIV/AIDS, typhoid fever, tuberculosis, dengue fever, urinary tract infection, yellow fever, respiratory tract infection, and Lassa fever. During data preprocessing, records containing missing symptoms, as well as diseases outside the scope of this study (yellow fever, dengue fever, and Lassa fever), were removed to maintain dataset integrity. Patients under 5 years old were also excluded from the study because the data collection instrument did not account for symptoms in children younger than five, who are unable to clearly express certain symptoms. The dataset was then narrowed to include only symptoms and confirmed diagnoses relevant to the study, removing columns related to physicians’ suspected diagnoses. Disease labels (output labels) were converted into binary encoding—absent “0” (originally 1 = absent) and present “1” (originally 6 = very high; 5 = high; 4 = moderate; 3 = low; and 2 = very low). This simplification facilitated classification tasks, reduced model complexity, and improved performance, as the model no longer needed to distinguish among multiple overlapping classes. After preprocessing, the dataset consisted of 3,914 records.
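The label-binarization step described above can be sketched as follows; the two disease columns and their severity codes are a hypothetical miniature of the dataset, used only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the confirmed-diagnosis columns, coded on the
# paper's linguistic scale (6 = very high ... 2 = very low, 1 = absent).
df = pd.DataFrame({
    "MAL":   [6, 1, 3, 1],
    "ENFVR": [1, 2, 1, 5],
})

# Collapse the 6-point severity scale to binary labels:
# 1 (absent) -> 0; 2-6 (any degree of presence) -> 1.
labels = (df > 1).astype(int)
print(labels["MAL"].tolist())   # [1, 0, 1, 0]
```

The same comparison applied column-wise binarizes all disease labels at once, which is what allows the downstream classifier to treat each disease as a present/absent target.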
Table 1 lists the 32 symptoms and 6 confirmed diagnoses used in the study.
Relevant medical experts and healthcare workers were involved throughout the iterative development process to ensure comprehensive understanding of the proposed system’s features. Several meetings with stakeholders were held to solicit feedback on system functionality. The elicited requirements were then organized and documented in standard software engineering formats.
Figure 2 illustrates the use case diagram, providing a graphical representation of interactions among the system’s users: healthcare workers, patients, and administrators.
3. Diagnostic, Interpretability, and Explainability Models
The study utilized Visual Studio Code and Google Colab, along with core Python libraries and packages including matplotlib, sklearn, numpy, pandas, seaborn, flask, flask-sqlalchemy, flet, and joblib (for model loading), together with custom interpreter, automator, and prompt-generator modules. The prompt generator organizes the list of diagnosed diseases and symptoms generated by the LIME model into structured prompts and saves these prompts in JSON format, which are then sent as requests to the ChatGPT API, with responses subsequently saved. The system consists of two main components: a frontend created with Flet that visualizes data using Matplotlib and Seaborn, and a backend that uses Flask for MySQL database management and API integration via Flask-SQLAlchemy. The diagnostic model, including performance metrics, was developed using the RF algorithm with hyperparameter tuning through GridSearchCV to enhance diagnostic precision. RF is an ensemble ML technique that improves predictive accuracy by combining multiple decision trees, effectively recognizing patterns in high-dimensional and complex problems [17].
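The grid-searched RF training can be sketched as below. The data shape mirrors the study's preprocessed dataset (32 symptom features, 6 binary disease labels), but the synthetic data and the hyperparameter grid are illustrative stand-ins, as the paper does not report the exact grid used.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: 32 symptom features, 6 binary disease labels.
X, y = make_multilabel_classification(
    n_samples=300, n_features=32, n_classes=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Hyperparameter tuning via GridSearchCV; this grid is illustrative only.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    cv=3,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```

Because scikit-learn's `RandomForestClassifier` accepts a 2D binary label matrix natively, one grid-searched model can cover all six diseases at once; an alternative design would fit one binary RF per disease.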
LIME generated visual explanations showing how symptom features influenced diagnoses. It provided local interpretability by approximating the complex model with a simpler, interpretable model around a specific prediction, thereby making the model’s reasoning transparent to healthcare workers. LIME identified the symptoms with the most significant influence on model decision-making, offering concise, locally interpretable explanations for individual diagnoses. GPT converted these explanations into plain, natural language, facilitating healthcare workers’ understanding and trust in the model’s predictions. The combination of these methods not only enhanced diagnostic performance but also improved user confidence by clearly and understandably communicating the rationale behind each diagnosis. The sample prompt, illustrated in Table 2, leveraged the list of diagnosed diseases and patient symptoms generated by the LIME model, alongside explanations regarding the influence of each symptom on the diagnosis. Before submission to the ChatGPT API, the generated prompt was stored in a variable and formatted into JSON.
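The prompt generator's role can be illustrated with a minimal sketch: it packages LIME's (symptom, weight) pairs for a predicted disease into a JSON request body for the ChatGPT API. The message wording and example weights here are hypothetical; the study's actual prompt template is shown in Table 2.

```python
import json

def build_prompt(disease, lime_weights):
    """Format LIME (symptom, weight) pairs as a ChatGPT API request body."""
    lines = [f"{sym}: {w:+.2f}" for sym, w in lime_weights]
    user_msg = (
        f"The model predicts {disease}. Symptom contributions:\n"
        + "\n".join(lines)
        + "\nExplain these results in plain language for a healthcare worker."
    )
    return json.dumps({
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": user_msg}],
    })

# Example LIME output for one patient (illustrative weights; negative
# weights indicate symptoms that argued against the diagnosis).
payload = build_prompt("malaria", [("FVR", 0.31), ("HDCH", 0.12),
                                   ("BITAIM", -0.05)])
```

Serializing the prompt to JSON before submission, as the text describes, also makes it straightforward to log each request alongside the API's response.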
III. Results
This study evaluated the performance of three diagnostic models—multilayer perceptron (MLP), extreme gradient boosting (XGBoost), and RF—across six disease categories: MAL (malaria), ENFVR (typhoid/enteric fever), HVAD (HIV and AIDS), UTI (urinary tract infection), RTI (respiratory tract infection), and TB (tuberculosis). Unlike deep learning models, which typically require large training datasets to achieve strong performance and avoid overfitting, these traditional machine learning models generally demonstrate effective generalization on smaller datasets and are less computationally demanding. The selected models were evaluated using standard classification metrics, specifically precision, recall, and F1-score. These metrics were chosen because they effectively assess diagnostic models, especially in clinical contexts where false negatives and false positives have significant implications.
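These per-disease metrics can be computed with scikit-learn's classification utilities; the ground-truth and predicted labels below are toy values for a single disease, used only to illustrate the evaluation step.

```python
from sklearn.metrics import classification_report

# Toy present/absent labels for one disease (illustrative only).
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

# Per-class precision, recall, and F1-score, as used in Table 3.
print(classification_report(y_true, y_pred,
                            target_names=["absent", "present"]))
```

Applying this report per disease column yields exactly the precision/recall/F1 triples compared across MLP, XGBoost, and RF in the results.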
Table 3 summarizes the comparative results, highlighting the strengths and limitations of each model in accurately diagnosing the respective conditions. Among the evaluated models, RF outperformed MLP and XGBoost across most disease categories. In healthcare settings, RF is particularly effective with multi-label and imbalanced data [18], making it especially suitable for diagnostic tasks where certain conditions, such as HVAD and TB, may be underrepresented. Its ensemble learning strategy, which aggregates predictions from multiple decision trees, enhances generalization and mitigates overfitting, contributing to its robust and reliable performance.
The performance of the RF model in diagnosing each disease is further visualized in Figure 3. The model demonstrated strong performance in detecting malaria, achieving precision, recall, and F1-scores of 85%, 91%, and 88%, respectively. However, its performance was notably weaker for typhoid fever and HIV/AIDS, with F1-scores of 60% and 51%, respectively, reflecting low recall and a tendency to miss true-positive cases of these diseases. For urinary tract infections and respiratory tract infections, the model showed balanced performance, achieving F1-scores of 72% for both conditions. The detection of tuberculosis revealed moderate precision (77%) but lower recall (49%), resulting in an F1-score of 60%. These results indicate that although the RF model effectively diagnoses certain diseases, such as malaria, further data enrichment and model optimization are necessary to improve diagnostic performance across all disease categories.
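For reference, each reported F1-score is the harmonic mean of precision and recall; the tuberculosis figures above, for example, follow directly:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.77 \times 0.49}{0.77 + 0.49}
    \approx 0.60
```

Because the harmonic mean is dominated by the smaller of the two values, the low recall for tuberculosis and HIV/AIDS pulls their F1-scores down despite moderate precision.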
Figure 4 presents a feature importance plot generated using the LIME framework applied to a random forest model. It highlights the most influential symptoms contributing to each disease diagnosis, such as difficulty breathing (DIFBRT), painful urination (PNFLURTN), and wheezing (WHZ). The least influential symptoms, including bitter taste in the mouth (BITAIM), fever (FVR), and generalized body pain (GENBDYPN), are also indicated. This visualization demonstrates that different diseases have distinct feature-importance profiles, underscoring LIME’s effectiveness in identifying symptom patterns unique to specific conditions. Such insights enable medical professionals to make informed clinical decisions by understanding not only the model’s predictions but also the rationale behind these predictions.
The febrile disease diagnostic system is compatible with Android OS version 4.0 and above and requires at least 2 GB of RAM, at least 8 GB of internal storage, a portrait-oriented display layout, and an internet connection. The mobile application includes a user authentication screen, shown in Figure 5A, and the primary dashboard for healthcare workers, shown in Figure 5B. This dashboard enables healthcare professionals to view patient lists, register new patients, schedule appointments, and respond to patient requests. After registering a patient, healthcare workers access the patient dashboard displayed in Figure 5C. Through the patient dashboard, healthcare workers can record vital signs, perform clinical examinations, take medical history (Figure 5D), and carry out provisional diagnoses (Figure 5E). The provisional diagnosis screen presents the patient’s likely diseases, visualizes how symptoms contribute to each diagnosis, and includes GPT-3.5-generated explanations of diagnostic outcomes. Figure 5E also displays the output from the ChatGPT platform, highlighting both significant and less significant symptom contributions with clear, interpretable explanations.
IV. Discussion
The findings of this study illustrate the advantages and limitations of employing a mobile XAI system that combines RF, LIME, and GPT-3.5 to diagnose febrile diseases. While the system exhibits impressive diagnostic accuracy for certain conditions, it struggles with others, highlighting specific areas for improvement. First, the model’s robustness in identifying common febrile illnesses was demonstrated through its high performance in detecting malaria, achieving precision, recall, and F1-scores of 85%, 91%, and 88%, respectively. This performance indicates that the system can effectively detect malaria cases, thus assisting healthcare workers in managing a disease that continues to contribute significantly to morbidity, particularly in resource-limited regions. The high sensitivity in malaria detection shows that the model minimizes false negatives and accurately identifies most true cases, which is crucial for timely medical intervention.
The system demonstrated balanced performance, achieving F1-scores of 72% for both RTI and UTI, thus showing moderate effectiveness in these diagnostic categories. These results suggest that while the model functions adequately, further enhancements could be made, particularly in recall rates (68% for RTI and 65% for UTI). The findings indicate that some true-positive cases may be missed, potentially due to symptom overlaps between these diseases and others within the dataset. The utility of the model in identifying common diseases in resource-constrained settings could be further improved by fine-tuning feature importance and symptom weighting, thus increasing sensitivity for these conditions. Conversely, the model exhibited limited performance in detecting typhoid fever and HIV/AIDS, with precision, recall, and F1-scores of 69%, 53%, and 60% for typhoid fever, and 75%, 39%, and 51% for HIV/AIDS, respectively. The low recall rates suggest a significant number of actual cases, particularly for HIV/AIDS, were not detected. Misclassification may arise due to overlapping symptoms shared with other febrile diseases, potentially explaining the model’s weaker performance. Additionally, the reduced detection rates may reflect suboptimal model parameters or insufficient representative data for these diseases in the training set. These results could be improved by fine-tuning the model parameters or retraining the model using larger datasets specifically enriched with symptom profiles of HIV/AIDS and typhoid fever cases.
The interpretability provided by LIME proved essential in understanding how symptoms influenced each diagnosis, thereby increasing model transparency. Identifying influential symptoms enabled insights into potential misclassifications, providing valuable context to healthcare workers. This interpretability supports clinical decision-making by allowing users to clearly understand the reasoning behind each diagnosis, thus enhancing system confidence.
The system significantly benefited from GPT-3.5’s capability to generate natural language explanations. GPT-3.5 facilitated informed decision-making and improved user comprehension by translating complex model outputs into accessible explanations. Clear, comprehensible explanations could increase acceptance of AI-assisted diagnoses, especially in low-resource settings lacking specialized diagnostic tools and medical expertise. Integrating LIME and GPT-3.5, which effectively reduces the ML model’s “black box” opacity, not only enhances disease diagnosis but also empowers healthcare workers by clarifying medical conditions and the rationale behind system-generated recommendations. To address the limitations associated with potential overgeneralizations or inaccuracies inherent in GPT-3.5 outputs, incorporating domain-specific fine-tuning and regular validation with expert feedback is recommended.
In conclusion, this study developed a mobile-based XAI system for diagnosing febrile diseases, integrating RF for disease prediction, LIME for interpretation, and GPT-3.5 for providing natural language explanations. This integration directly addressed the inherent “black box” nature of ML models, significantly improving transparency and trustworthiness in practical AI-driven healthcare diagnostics. The system achieved robust performance in diagnosing certain diseases, particularly malaria, with notably high precision, recall, and F1-scores. Additionally, the study demonstrated the potential of combining RF, LIME, and GPT-3.5 to create a user-friendly diagnostic tool accessible to both healthcare professionals and non-specialists in resource-constrained environments. Nevertheless, the study encountered some limitations. First, the system’s diagnostic accuracy varied significantly across diseases, demonstrating particular challenges in detecting typhoid fever and HIV/AIDS due to lower precision and recall, thus necessitating further refinement. Second, the dataset utilized was limited to a specific population, potentially restricting the model’s generalizability across diverse demographic and geographic contexts. Another notable limitation was the exclusion of pediatric patients (under 5 years of age), creating a gap in the system’s applicability for diagnosing young children.
Future research directions can address these limitations. Expanding the dataset to encompass more diverse populations and additional febrile diseases could increase the generalizability and robustness of the diagnostic model. Moreover, including data from pediatric patients and developing specialized diagnostic models tailored to this age group would significantly broaden the system’s applicability. Real-time data updates and feedback mechanisms could further improve accuracy and adaptability. Additionally, future studies should incorporate structured human evaluations of the interpretability components, specifically LIME visualizations and GPT-generated explanations, utilizing Likert-scale assessments and obtaining expert feedback from medical professionals to validate their clinical relevance and usability.