Large Language Models for Pre-mediation Counseling in Medical Disputes: A Comparative Evaluation against Human Experts
Abstract
Objectives
Assessing medical disputes requires both medical and legal expertise, presenting challenges for patients seeking clarity regarding potential malpractice claims. This study aimed to develop and evaluate a chatbot based on a chain-of-thought pipeline using a large language model (LLM) for providing medical dispute counseling and compare its performance with responses from human experts.
Methods
Retrospective counseling cases (n = 279) were collected from the Korea Medical Dispute Mediation and Arbitration Agency’s website, from which 50 cases were randomly selected as a validation dataset. The Claude 3.5 Sonnet model processed each counseling request through a five-step chain-of-thought pipeline. Thirty-eight experts evaluated the chatbot’s responses against the original human expert responses, rating them across four dimensions on a 5-point Likert scale. Statistical analyses were conducted using Wilcoxon signed-rank tests.
Results
The chatbot significantly outperformed human experts in quality of information (p < 0.001), understanding and reasoning (p < 0.001), and overall satisfaction (p < 0.001). It also demonstrated a stronger tendency to produce opinion-driven content (p < 0.001). Despite generally high scores, evaluators noted specific instances where the chatbot encountered difficulties.
Conclusions
A chain-of-thought–based LLM chatbot shows promise for enhancing the quality of medical dispute counseling, outperforming human experts across key evaluation metrics. Future research should address inaccuracies resulting from legal and contextual variability, investigate patient acceptance, and further refine the chatbot’s performance in domain-specific applications.
I. Introduction
Assessing medical disputes is a challenging task requiring both medical and legal expertise to accurately interpret incidents and evaluate disputed claims. Because disputes are often entangled with deep emotional conflict arising from the physical and psychological harm patients experience, improving patients' awareness of their dispute situation is essential for effective resolution [1]. As a preliminary step toward mediation, medical dispute counseling plays a vital role in guiding patient decision-making. Empowering patients to fully understand and address their grievances is fundamental to patient-centered care, enabling them to navigate this interdisciplinary area more effectively.
Current medical dispute counseling practices encounter significant challenges, some of which large language models (LLMs) could address. Providing adequate counseling necessitates expertise in both medicine and law; however, consultants typically specialize in only one field, potentially leading to important oversights. High costs and the shortage of qualified professionals further strain judicial agencies, often relegating counseling to a lower priority. Additionally, counseling quality frequently varies, resulting in misunderstandings, unmet expectations, and dissatisfaction with the resolution process.
Recent studies applying LLMs in medical and legal domains provide a unique opportunity to overcome these challenges. Wu et al. [2] demonstrated a chain-of-thought (CoT) method for identifying errors in clinical notes, an approach that could potentially flag disputed claims in medical disputes. Cui et al. [3] introduced a multi-agent LLM approach for complex legal counseling, while Shi et al. [4] structured legal counseling into systematic, step-by-step processes using CoT. These methodologies may facilitate developing an LLM specifically for medical dispute counseling, complementing human experts by bridging knowledge gaps, easing workload burdens, and enhancing counseling quality and consistency.
The rapid integration of artificial intelligence (AI) into healthcare has spurred broader discussions concerning patient well-being. Nevertheless, key concerns persist, including patient safety, privacy, inclusivity, and liability [5]. Efforts to address these concerns have resulted in various regulatory developments focused on responsible innovation [6,7]. However, the direct application of LLMs' advanced natural language processing capabilities to patient empowerment remains relatively unexplored.
Bedi et al. [8] conducted a systematic review of 519 studies examining LLM applications in healthcare, categorizing them according to task. Most studies concentrated on assessing medical knowledge (44.5%) or diagnosis (19.5%), while fewer addressed patient-centric tasks such as informed decision-making (17.7%) or communication (7.5%). These findings indicate that prior research primarily targeted provider perspectives, potentially overlooking the potential of LLMs to empower patients.
Decker et al. [9] notably demonstrated the use of LLMs as a direct approach to enhance patient rights by evaluating ChatGPT’s ability to communicate surgical risks, benefits, and alternatives. Allen et al. [10] highlighted issues with delegating informed consent processes to residents, suggesting LLMs as a potentially more comprehensive alternative. These studies underscore the potential of healthcare LLMs to empower patients while strengthening institutional compliance with legal and ethical obligations.
Although patient bills of rights (PBRs) exist in various forms, legislatively defined PBRs at national or state levels uniquely include the right to file grievances [11]. In South Korea, Article 4(3) of the Medical Service Act mandates healthcare institutions to prominently display rights, including (1) access to treatment, (2) informed consent, (3) confidentiality, and (4) the right to request counseling and mediation in medical disputes. While prior healthcare LLM research has addressed the first three rights, relatively little attention has been given to the fourth, where patient and provider interests may be in conflict.
Thus, this study aimed to explore the potential application of LLMs employing the CoT method specifically tailored for medical dispute counseling, offering insights into both opportunities and limitations within the medical dispute domain.
II. Methods
For this study, retrospective medical dispute counseling cases were collected from the website of the Korea Medical Dispute Mediation and Arbitration Agency (hereinafter referred to as the “Agency”), a public institution under the Ministry of Health and Welfare responsible for providing alternative dispute resolution (ADR) services, including counseling, for medical disputes.
The cases were collected on December 17, 2024, in compliance with South Korea’s Public Data Provision Act and Personal Information Protection Act. The study was exempted from review by the Institutional Review Board of Kangwon National University (Approval No. KWNUIRB-2024-12-013).
Each case comprised a single online exchange conducted in Korean between a patient or family member and a human expert affiliated with the Agency, consisting of a counseling request and a corresponding response.
We manually reviewed the cases to reach a consensus on effective counseling practices, recognizing the discretionary nature inherent in medical dispute counseling. Some experts advocated a patient-oriented approach, while others preferred a more facilitative stance, depending on their professional backgrounds and experiences. The authors’ consensus guided the pipeline’s development, but final evaluations involved a broader expert panel reflecting diverse perspectives.
1. Case Collection, Exclusion, and De-identification
Initially, 444 cases were collected from the Agency’s website. After applying exclusion criteria (Table 1), 279 cases remained eligible. From these, 50 cases were randomly selected as the validation dataset, and the remaining cases formed the development dataset. The development dataset supported pipeline development, including prompt design and the generation of in-context learning (ICL) examples [12]. The validation dataset remained strictly separate, reserved exclusively for the final comparative evaluation (Figure 1).
The Agency published the cases online as publicly accessible examples, thoroughly de-identified to exclude any unnecessary personal information. At the time of data collection, only the counseling request and, if disclosed, the patient’s sex and age were retained, ensuring adequate medical context.
2. Chain-of-Thought Pipeline Development
We developed a question-answering framework using the Claude 3.5 Sonnet model, which at the time of this study was regarded as a leading model for generative tasks, a crucial attribute for the CoT approach. Claude had previously been reported to outperform contemporary models, such as GPT-4, on general medical knowledge and legal reasoning tasks [13,14].
The CoT methodology was implemented to decompose the counseling task into intermediate subtasks [15,16]. Each subtask was executed using individual prompts, with outputs sequentially passed forward through chat templates [17]. ICL examples were incorporated into each subtask [18]. Development was conducted using the Anthropic API with default parameters for the Claude 3.5 Sonnet model and the LangChain API. All coding was performed in Python (version 3.11.10).
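As a rough illustration of this setup, the sketch below chains two hypothetical subtasks with LangChain and the Claude 3.5 Sonnet model, embedding ICL examples in each chat template and passing one subtask's output forward as the next subtask's input. The prompt wording, helper names, placeholder data, and model identifier are illustrative assumptions rather than the study's actual prompts or code.

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Default sampling parameters; the exact model identifier is an assumption.
llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")

def build_subtask(system_instruction: str, icl_examples: list[tuple[str, str]]):
    """Build one CoT subtask: a chat template with in-context learning
    examples, piped into the model and a plain-text output parser."""
    messages = [("system", system_instruction)]
    for example_input, example_output in icl_examples:
        messages.append(("human", example_input))
        messages.append(("ai", example_output))
    messages.append(("human", "{input}"))
    prompt = ChatPromptTemplate.from_messages(messages)
    return prompt | llm | StrOutputParser()

# Two illustrative subtasks; the first subtask's output becomes the
# second subtask's input, mirroring the sequential hand-off of the pipeline.
extract_keywords = build_subtask(
    "Extract clinical keywords from the counseling request.",
    icl_examples=[("Example request ...", "appendectomy, postoperative fever")],
)
elaborate_background = build_subtask(
    "Give neutral background information on these clinical keywords.",
    icl_examples=[("appendectomy, postoperative fever", "Example background ...")],
)

counseling_request = "..."  # placeholder for a de-identified counseling request
keywords = extract_keywords.invoke({"input": counseling_request})
clinical_context = elaborate_background.invoke({"input": keywords})
```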
Counseling requests typically provide limited evidence, lacking opposing statements and objective records. Thus, a primary concern during pipeline development was minimizing hallucinations, which could result in premature assumptions of malpractice. Although existing methodologies for identifying clinical note errors were employed, any “errors” identified in counseling requests were treated strictly as potential legal issues, not definitive instances of malpractice.
The pipeline processes a patient’s counseling request through five sequential steps:
- Step 1 (Clinical context generation): The counseling request was first analyzed to extract clinical keywords using an initial prompt, followed by a second prompt elaborating on these keywords to provide general background information. This step constructed a clinical context without interpreting or prematurely expanding on the patient’s specific claims, which were subsequently addressed in later pipeline steps.
- Step 2 (Legal issue identification): The counseling request was analyzed to identify potential malpractice-related legal issues. Building upon prior research, such as that by Wu et al. [2], who classified clinical errors into diagnosis, intervention, and management categories, the present study expanded these categories based on prior literature [19,20] into 10 malpractice categories: diagnosis and examination, medication and prescription, injection, anesthesia, surgery, hospital transfer, infection control, safety management, informed consent, and patient monitoring. Each category had its own dedicated prompt functioning as a filter to assess its applicability to the counseling request. Outputs included a binary flag (1 for applicable, 0 for not applicable) and a concise rationale (Figure 2). Few-shot examples guided the chatbot’s reasoning across diverse cases. Categories flagged as applicable were subsequently passed to the legal issue elaboration step, with their rationales removed (see the sketch following this list).
- Step 3 (Legal issue elaboration): Categories identified as applicable (flag = 1) proceeded to this step, with initial rationales removed to prevent premature assumptions of malpractice. A distinct prompt further elaborated each flagged category as a legal issue, striving to minimize assumptions. This two-step process enabled the pipeline to expand on legal issues while maintaining a more neutral perspective.
- Step 4 (Preliminary response drafting): Utilizing the clinical context and legal issues identified in previous steps, this step combined information to draft a preliminary response. It organized overlapping information coherently to produce a comprehensive reply addressing the counseling request.
- Step 5 (Response refinement): In this final step, the preliminary response underwent refinement through three sub-processes: filtering, which removed irrelevant or inappropriate content; standardization, which aligned vocabulary usage across responses for consistency; and tone randomization, which adjusted the response’s tone by emulating a randomly retrieved human response from the development dataset, used as a reference.
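To make the Step 2 filtering concrete, the sketch below runs one dedicated prompt per malpractice category and parses a binary flag plus rationale from the output. The prompt text, "FLAG: 0/1" output convention, and parsing logic are illustrative assumptions; the study's actual per-category prompts also incorporated few-shot examples.

```python
import re
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")  # assumed identifier

MALPRACTICE_CATEGORIES = [
    "diagnosis and examination", "medication and prescription", "injection",
    "anesthesia", "surgery", "hospital transfer", "infection control",
    "safety management", "informed consent", "patient monitoring",
]

def check_category(request: str, category: str) -> tuple[int, str]:
    """One dedicated filter prompt per category: returns (flag, raw output)."""
    prompt = ChatPromptTemplate.from_messages([
        ("system",
         "Decide whether the counseling request raises a potential legal issue "
         f"related to {category}. Answer with 'FLAG: 1' or 'FLAG: 0', followed "
         "by a one-sentence rationale. Do not assume that malpractice occurred."),
        ("human", "{request}"),
    ])
    output = (prompt | llm | StrOutputParser()).invoke({"request": request})
    match = re.search(r"FLAG:\s*([01])", output)
    return (int(match.group(1)) if match else 0), output

counseling_request = "..."  # placeholder for a de-identified counseling request
# Only flagged categories proceed to Step 3; their rationales are dropped there.
flagged_categories = [c for c in MALPRACTICE_CATEGORIES
                      if check_category(counseling_request, c)[0] == 1]
```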
3. Evaluation
Evaluators received a counseling request, a human expert response, and a chatbot-generated response, randomly labeled as “Response A” and “Response B” to ensure blinding. Evaluators remained unaware of their peers’ evaluations. Each response was independently rated across four dimensions using a 5-point Likert scale: quality of information, understanding and reasoning, information-opinion spectrum, and overall satisfaction (Supplement A). Higher scores indicated superior performance for quality of information, understanding and reasoning, and overall satisfaction, but not necessarily for the information-opinion spectrum.
The evaluation dimensions were adapted from existing frameworks for assessing LLMs in the medical domain [21] but modified to better suit the needs of this study. Dimensions such as empathetic expression and human safety were excluded. Empathetic expression was omitted because counseling responses are typically formal, and expressions of empathy might inadvertently reveal the response source. Human safety was deemed irrelevant, as counseling addresses retrospective adverse events rather than prospective clinical decision-making.
The information-opinion spectrum was specifically introduced to evaluate the extent to which a response provides objective information versus subjective opinion. For instance, clarifying the implications of a content-certified letter or providing general medical knowledge exemplifies objective information. However, when subjective interpretations are introduced, the content shifts toward opinion [22]. Examples include assessing the presence or degree of medical negligence or evaluating the relative benefits and risks of legal strategies, as these move beyond mere factual information and into subjective, opinion-based reasoning.
The evaluation team comprised 38 experts, including nurses, physicians, attorneys, and individuals holding advanced law degrees, affiliated with the Agency. These experts were selected due to their extensive experience in navigating both layperson and professional language, their roles as intermediaries between patients and healthcare providers, and their representativeness of the human experts typically involved in medical dispute counseling.
Each of the 50 validation samples was reviewed by three experts, resulting in a total of 150 paired evaluations. Every evaluation panel included at least one medical and one legal expert. After rating each response, evaluators had the option to provide free-text comments and completed a survey to guess which response was AI-generated. The survey utilized five confidence levels, ranging from “I’m certain it’s A” to “I’m certain it’s B.”
4. Statistical Analysis
The Wilcoxon signed-rank test was used to compare chatbot responses against human expert responses, testing the hypothesis that the chatbot would outperform humans. Scores from the three evaluators were averaged for each sample across all evaluation dimensions. Mean scores for chatbot and human responses, along with the gaps between these scores, were calculated. Pearson correlation coefficients were computed to analyze how individual evaluation dimensions contributed to overall satisfaction.
Responses from the survey, in which experts guessed the AI-generated response, were first used to assess the accuracy of expert perceptions. Chatbot responses were then grouped into two categories, correct guesses and incorrect guesses, with high-confidence guesses weighted twice as heavily as low-confidence guesses. The Mann-Whitney U test was used to compare evaluation scores between these two groups. All statistical analyses were performed in Python (version 3.11.10) using the Matplotlib, Pandas, and SciPy libraries.
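A minimal sketch of these comparisons is shown below, assuming a tidy table with one row per validation sample and columns holding the three-evaluator mean scores for each dimension and response source. The file name, column names, and the guessed_correctly indicator are hypothetical, and the confidence weighting of the survey guesses is omitted for brevity.

```python
import pandas as pd
from scipy.stats import wilcoxon, mannwhitneyu

scores = pd.read_csv("validation_scores.csv")  # hypothetical file, one row per sample

dimensions = ["quality", "understanding", "info_opinion", "satisfaction"]
for dim in dimensions:
    # Paired, one-sided test of the hypothesis that the chatbot outperforms humans.
    stat, p_value = wilcoxon(scores[f"chatbot_{dim}"],
                             scores[f"human_{dim}"],
                             alternative="greater")
    gap = (scores[f"chatbot_{dim}"] - scores[f"human_{dim}"]).mean()
    print(f"{dim}: W = {stat:.1f}, p = {p_value:.4f}, mean gap = {gap:.2f}")

# Pearson correlations between the four dimensions (shown here for chatbot
# scores only; how scores were pooled for the correlation analysis is an assumption).
chatbot_cols = [f"chatbot_{d}" for d in dimensions]
print(scores[chatbot_cols].corr(method="pearson"))

# Mann-Whitney U comparison of chatbot scores between correctly and
# incorrectly identified responses (confidence weighting not reproduced here).
correct = scores.loc[scores["guessed_correctly"] == 1, "chatbot_quality"]
incorrect = scores.loc[scores["guessed_correctly"] == 0, "chatbot_quality"]
u_stat, p = mannwhitneyu(correct, incorrect, alternative="two-sided")
print(f"quality, correct vs. incorrect guesses: U = {u_stat:.1f}, p = {p:.3f}")
```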
III. Results
The chatbot significantly outperformed human experts in the quality of information dimension (W = 1090.5, p < 0.001), with a mean chatbot score of 4.23 compared to 3.25 for human experts, resulting in an average score gap of 0.98. On the understanding and reasoning dimension, the chatbot again surpassed human experts (W = 772.0, p < 0.001), achieving a mean score of 4.04 compared to 3.21 for humans, reflecting a score gap of 0.83. The chatbot also scored higher on the information-opinion spectrum dimension (W = 933.5, p < 0.001), with an average score of 3.49, while human experts scored an average of 2.95, creating a mean gap of 0.54. Regarding overall satisfaction, the chatbot received significantly higher ratings than human experts (W = 1035.0, p < 0.001), with an average chatbot score of 3.69 compared to the human average of 3.11, resulting in a mean score gap of 0.58 (Figure 3, Table 2).
Pearson correlation analyses between evaluation dimensions revealed that scores for quality of information, understanding and reasoning, and overall satisfaction demonstrated strong linear correlations with each other (r > 0.6). Conversely, scores for the information-opinion spectrum appeared distinct, exhibiting relatively weaker correlations with the other three dimensions (r < 0.4), marking it as an outlier among evaluation dimensions (Figure 4).

Figure 4. Heatmap of Pearson correlation coefficients between the four evaluation dimensions; red indicates high correlation and blue indicates low correlation.
The results from the survey, which asked evaluators to identify the AI-generated response, indicated that evaluators’ perceptions were generally accurate. The Mann-Whitney U test comparing correct versus incorrect identification groups showed a statistically significant, though modest, effect for quality of information (p = 0.049, r = 0.141). However, the differences observed for the remaining three dimensions (understanding and reasoning: p = 0.831, r = 0.016; information-opinion spectrum: p = 0.800, r = 0.019; overall satisfaction: p = 0.777, r = 0.021) were not statistically significant.
IV. Discussion
The CoT approach demonstrated strong performance in medical dispute counseling. The chatbot significantly outperformed human experts in quality of information, understanding and reasoning, and overall satisfaction, despite not utilizing advanced methods such as retrieval-augmented generation (RAG) or model fine-tuning. Evaluators were generally able to identify the AI-generated responses, yet this recognition had little bearing on the evaluation outcomes, apart from a small but significant effect on quality of information. That effect suggests the chatbot’s informational style may have made its responses identifiable, possibly because evaluators were already familiar with tools such as Claude or ChatGPT.
Although the chatbot was explicitly prompted to minimize opinion-driven content, it nevertheless generated more subjective responses compared to human experts. Strong correlations among quality of information, understanding and reasoning, and overall satisfaction indicate these factors significantly influenced evaluator satisfaction. However, the chatbot’s advantage in these dimensions did not produce an equally substantial difference in overall satisfaction scores; the average score gap for overall satisfaction (0.58) was smaller than that for quality of information (0.98) and understanding and reasoning (0.83).
These findings highlight a limitation of the current evaluation methods: factors beyond the quality of information and understanding and reasoning significantly influenced overall satisfaction. For instance, some chatbot responses, despite high scores in individual dimensions, appeared detached from real-world contexts (sample 2 in Supplement B).
Additionally, some chatbot responses contained factually incorrect legal information, likely due to LLMs like Claude 3.5 and ChatGPT being trained on broad web datasets that do not account for jurisdictional differences or evolving legal frameworks. For example, the chatbot produced misleading content regarding abortion rights in South Korea (sample 1 in Supplement B). Such inaccuracies represent a form of hallucinated legal content [23]. Structured methods, such as RAG, could mitigate these inaccuracies by systematically organizing relevant legal information into vectorized libraries, enabling more precise referencing [24,25].
Furthermore, the chatbot occasionally made problematic assumptions, explicitly presuming malpractice or implicitly accepting contentious points as factual while explaining legal issues. Addressing these concerns through prompt engineering proved challenging, as constraints needed to be clear, broadly applicable, and minimal; the appropriate stance frequently varied from case to case.
Another concern was the chatbot’s inability to adequately account for the variability of healthcare environments across countries and institutions. In legal contexts, malpractice assessments depend on whether a healthcare provider met the expected standard of care [26], evaluated through expert testimony that considers real-world factors such as available technology, institutional capacity, and prevailing clinical guidelines [27]. Unlike codified laws, these factors are highly context-dependent, reducing the feasibility of structured approaches. Additionally, legal counseling is inherently more subjective than medical counseling, where demonstrable patient harm provides a clearer criterion [28].
In this context, a limitation of the present study is that data were restricted to Korean-language cases within South Korea’s specific healthcare system, where universal insurance imposes financial accountability on providers for denied claims. Consequently, disputes over medical expenses are rare, except in non-reimbursed sectors like aesthetic medicine. This contrasts sharply with healthcare systems such as that of the United States, where billing disputes are common. Thus, caution is warranted when generalizing these findings to other healthcare settings and languages.
Another substantial limitation is the absence of laypersons among the evaluators, preventing assessment of how patients might perceive AI-generated responses. Patients often have strong emotional stakes and might perceive AI counseling as impersonal or incapable of fully understanding their individual grievances [29]. This contrasts with the outcomes of our study, where the chatbot received higher scores in several key evaluation dimensions. Introducing LLMs in public-facing counseling contexts will require increased transparency and sensitivity toward patient perspectives to enhance acceptance in medical dispute resolution.
This study demonstrated that LLMs have considerable potential for medical dispute counseling, surpassing human experts across all three quality-focused dimensions. Nevertheless, our analysis also revealed a nuanced trade-off: permitting greater subjectivity produced detailed, interest-centered responses but increased the risk of misleading assumptions, while overly constraining model output led to generic or excessively neutral responses [23].
A collaborative approach combining human experts with AI could offer a balanced solution, employing AI as an auxiliary tool while maintaining human oversight for critical decisions. For example, LLMs could aid in reviewing extensive electronic health records (EHRs) [30], summarizing crucial information to enhance review efficiency and quality. They could also assist patients in comprehending complex medical or legal terminology within documents such as review statements and settlement recommendations, thus improving communication and transparency during dispute resolution processes.
Future developments should build upon our findings to align LLM applications more effectively with real-world contexts. By further integrating these technologies into medical dispute resolution, these models can serve as valuable assistants, ultimately supporting patients and their families as they navigate the challenging process of medical disputes.
Notes
Conflict of Interest
No potential conflict of interest relevant to this article was reported.
Supplementary Materials
Supplementary materials can be found via https://doi.org/10.4258/hir.2025.31.2.200.