Kim, Lee, and Bae: Large Language Models for Pre-mediation Counseling in Medical Disputes: A Comparative Evaluation against Human Experts

Abstract

Objectives

Assessing medical disputes requires both medical and legal expertise, presenting challenges for patients seeking clarity regarding potential malpractice claims. This study aimed to develop and evaluate a chatbot based on a chain-of-thought pipeline using a large language model (LLM) for providing medical dispute counseling and compare its performance with responses from human experts.

Methods

Retrospective counseling cases (n = 279) were collected from the Korea Medical Dispute Mediation and Arbitration Agency’s website, from which 50 cases were randomly selected as a validation dataset. The Claude 3.5 Sonnet model processed each counseling request through a five-step chain-of-thought pipeline. Thirty-eight experts evaluated the chatbot’s responses against the original human expert responses, rating them across four dimensions on a 5-point Likert scale. Statistical analyses were conducted using Wilcoxon signed-rank tests.

Results

The chatbot significantly outperformed human experts in quality of information (p < 0.001), understanding and reasoning (p < 0.001), and overall satisfaction (p < 0.001). It also demonstrated a stronger tendency to produce opinion-driven content (p < 0.001). Despite generally high scores, evaluators noted specific instances where the chatbot encountered difficulties.

Conclusions

A chain-of-thought–based LLM chatbot shows promise for enhancing the quality of medical dispute counseling, outperforming human experts across key evaluation metrics. Future research should address inaccuracies resulting from legal and contextual variability, investigate patient acceptance, and further refine the chatbot’s performance in domain-specific applications.

I. Introduction

Assessing medical disputes is a challenging task requiring both medical and legal expertise to accurately interpret incidents and evaluate disputed claims. Because disputes are often entangled with deep emotional conflicts arising from the physical and psychological harm patients experience, improving patients’ awareness of their dispute situation is essential for effective resolution [1]. As a preliminary step toward mediation, medical dispute counseling plays a vital role in guiding patient decision-making. Empowering patients to fully understand and address their grievances is fundamental to patient-centered care, enabling them to navigate this interdisciplinary area more effectively.
Current medical dispute counseling practices encounter significant challenges, some of which large language models (LLMs) could address. Providing adequate counseling necessitates expertise in both medicine and law; however, consultants typically specialize in only one field, potentially leading to important oversights. High costs and the shortage of qualified professionals further strain judicial agencies, often relegating counseling to a lower priority. Additionally, counseling quality frequently varies, resulting in misunderstandings, unmet expectations, and dissatisfaction with the resolution process.
Recent studies applying LLMs in medical and legal domains provide a unique opportunity to overcome these challenges. Wu et al. [2] demonstrated a chain-of-thought (CoT) method for identifying errors in clinical notes, an approach that could potentially flag disputed claims in medical disputes. Cui et al. [3] introduced a multi-agent LLM approach for complex legal counseling, while Shi et al. [4] structured legal counseling into systematic, step-by-step processes using CoT. These methodologies may facilitate developing an LLM specifically for medical dispute counseling, complementing human experts by bridging knowledge gaps, easing workload burdens, and enhancing counseling quality and consistency.
The rapid integration of healthcare artificial intelligence (AI) has spurred broader discussions concerning patient well-being. Nevertheless, key concerns persist, including patient safety, privacy, inclusivity, and liability [5]. Efforts to address these concerns have resulted in various regulatory developments focused on responsible innovation [6,7]. However, the direct application of advanced natural language processing capabilities of LLMs for patient empowerment remains relatively unexplored.
Bedi et al. [8] conducted a systematic review of 519 studies examining LLM applications in healthcare, categorizing them according to task. Most studies concentrated on assessing medical knowledge (44.5%) or diagnosis (19.5%), while fewer addressed patient-centric tasks such as informed decision-making (17.7%) or communication (7.5%). These findings indicate that prior research primarily targeted provider perspectives, potentially overlooking the potential of LLMs to empower patients.
Decker et al. [9] notably demonstrated the use of LLMs as a direct approach to enhance patient rights by evaluating ChatGPT’s ability to communicate surgical risks, benefits, and alternatives. Allen et al. [10] highlighted issues with delegating informed consent processes to residents, suggesting LLMs as a potentially more comprehensive alternative. These studies underscore the potential of healthcare LLMs to empower patients while strengthening institutional compliance with legal and ethical obligations.
Although patient bills of rights (PBRs) exist in various forms, legislatively defined PBRs at national or state levels uniquely include the right to file grievances [11]. In South Korea, Article 4(3) of the Medical Service Act mandates healthcare institutions to prominently display rights, including (1) access to treatment, (2) informed consent, (3) confidentiality, and (4) the right to request counseling and mediation in medical disputes. While prior healthcare LLM research has addressed the first three rights, relatively little attention has been given to the fourth, where patient and provider interests may be in conflict.
Thus, this study aimed to explore the potential application of LLMs employing the CoT method specifically tailored for medical dispute counseling, offering insights into both opportunities and limitations within the medical dispute domain.

II. Methods

For this study, retrospective medical dispute counseling cases were collected from the website of the Korea Medical Dispute Mediation and Arbitration Agency (hereinafter referred to as the “Agency”), a public institution under the Ministry of Health and Welfare responsible for providing alternative dispute resolution (ADR) services, including counseling, for medical disputes.
The cases were collected on December 17, 2024, in compliance with South Korea’s Public Data Provision Act and Personal Information Protection Act. The study was exempted from review by the Institutional Review Board of Kangwon National University (Approval No. KWNUIRB-2024-12-013).
Each case comprised a single online exchange conducted in Korean between a patient or family member and a human expert affiliated with the Agency, consisting of a counseling request and a corresponding response.
We manually reviewed the cases to reach a consensus on effective counseling practices, recognizing the discretionary nature inherent in medical dispute counseling. Some experts advocated a patient-oriented approach, while others preferred a more facilitative stance, depending on their professional backgrounds and experiences. The authors’ consensus guided the pipeline’s development, but final evaluations involved a broader expert panel reflecting diverse perspectives.

1. Case Collection, Exclusion, and De-identification

Initially, 444 cases were collected from the Agency’s website. After applying exclusion criteria (Table 1), 279 cases remained eligible. From these, 50 cases were randomly selected as the validation dataset, and the remaining cases formed the development dataset. The development dataset supported pipeline construction, including prompt design and the generation of in-context learning (ICL) examples [12]. The validation dataset remained strictly separate, reserved exclusively for the final comparative evaluation (Figure 1).
The Agency published the cases online as publicly accessible examples, thoroughly de-identified to exclude any unnecessary personal information. At the time of data collection, only the counseling request and, if disclosed, the patient’s sex and age were retained, ensuring adequate medical context.

2. Chain-of-Thought Pipeline Development

We developed a question-answering framework using the Claude 3.5 Sonnet model, which, at the time of this study, was regarded as a leading model for generative tasks, a capability crucial for the CoT approach. Claude was previously reported to outperform contemporary models, such as GPT-4, in general medical knowledge and legal reasoning tasks [13,14].
The CoT methodology was implemented to decompose the counseling task into intermediate subtasks [15,16]. Each subtask was executed using individual prompts, with outputs sequentially passed forward through chat templates [17]. ICL examples were incorporated into each subtask [18]. Development was conducted using the Anthropic API with default parameters for the Claude 3.5 Sonnet model and the LangChain API. All coding was performed in Python (version 3.11.10).
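The authors’ prompts and pipeline code are not published; the following is a minimal sketch, assuming the LangChain and Anthropic APIs mentioned above, of how the two prompts of Step 1 could be chained so that the keyword-extraction output feeds the context-elaboration prompt. The model identifier, prompt wording, and example request are illustrative assumptions rather than the authors’ implementation.

```python
# Minimal sketch: chaining two Step 1 prompts with LangChain and Claude 3.5 Sonnet.
# Requires the ANTHROPIC_API_KEY environment variable; all prompt text is illustrative.
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")  # default sampling parameters

# Prompt 1: extract clinical keywords from the counseling request.
keyword_prompt = ChatPromptTemplate.from_messages([
    ("system", "List the key clinical terms in the counseling request, comma-separated."),
    ("human", "{request}"),
])

# Prompt 2: give neutral background for each keyword without judging the patient's claims.
context_prompt = ChatPromptTemplate.from_messages([
    ("system", "Provide general medical background for each keyword. Do not assess the claims."),
    ("human", "Keywords: {keywords}"),
])

# The first prompt's output is passed forward into the second prompt's template.
step1_chain = (
    keyword_prompt | llm | StrOutputParser()
    | (lambda keywords: {"keywords": keywords})
    | context_prompt | llm | StrOutputParser()
)

clinical_context = step1_chain.invoke(
    {"request": "After my mother's hip surgery she developed a fever, but transfer was delayed."}
)
print(clinical_context)
```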
Counseling requests typically provide limited evidence, lacking opposing statements and objective records. Thus, a primary concern during pipeline development was minimizing hallucinations, which could result in premature assumptions of malpractice. Although existing methodologies for identifying clinical note errors were employed, any “errors” identified in counseling requests were treated strictly as potential legal issues, not definitive instances of malpractice.
The pipeline processes a patient’s counseling request through five sequential steps (a simplified orchestration sketch follows the step list below):
  • - Step 1 (Clinical context generation): The counseling request was first analyzed to extract clinical keywords using an initial prompt, followed by a second prompt elaborating on these keywords to provide general background information. This step constructed a clinical context without interpreting or prematurely expanding on the patient’s specific claims, which were subsequently addressed in later pipeline steps.

  • - Step 2 (Legal issue identification): The counseling request was analyzed to identify potential malpractice-related legal issues. Building upon prior research, such as that by Wu et al. [2], who classified clinical errors into diagnosis, intervention, and management categories, the present study expanded these categories based on prior literature [19,20] into 10 malpractice categories—diagnosis and examination, medication and prescription, injection, anesthesia, surgery, hospital transfer, infection control, safety management, informed consent, and patient monitoring. Each malpractice category had its own dedicated prompt functioning as a filter to assess applicability to the counseling request. Outputs included a binary flag (1 for applicable, 0 for not applicable) and a concise rationale (Figure 2). Few-shot examples guided the chatbot’s reasoning across diverse cases. Categories flagged as applicable were subsequently passed to the Legal Issue Elaboration step, with their rationales removed.

  • - Step 3 (Legal issue elaboration): Categories identified as applicable (flag = 1) proceeded to this step, with initial rationales removed to prevent premature assumptions of malpractice. A distinct prompt further elaborated each flagged category as a legal issue, striving to minimize assumptions. This two-step process enabled the pipeline to expand on legal issues while maintaining a more neutral perspective.

  • - Step 4 (Preliminary response drafting): Utilizing the clinical context and legal issues identified in previous steps, this step combined information to draft a preliminary response. It organized overlapping information coherently to produce a comprehensive reply addressing the counseling request.

  • - Step 5 (Response refinement): In this final step, the preliminary response underwent refinement through three sub-processes: filtering, which removed irrelevant or inappropriate content; standardization, which aligned vocabulary usage across responses for consistency; and tone randomization, which adjusted the response’s tone by emulating a randomly retrieved human response from the development dataset, used as a reference.
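To make the flag-and-elaborate flow of Steps 2 and 3 concrete, the sketch below follows the same assumptions as the previous example; the category prompts, in-context examples, JSON reply format, and function names are hypothetical stand-ins, since the authors’ actual prompts are not published.

```python
# Hypothetical sketch of Steps 2-3: per-category filter prompts emit a binary flag,
# and only flagged categories (rationales discarded) are elaborated as legal issues.
import json

from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

MALPRACTICE_CATEGORIES = [
    "diagnosis and examination", "medication and prescription", "injection",
    "anesthesia", "surgery", "hospital transfer", "infection control",
    "safety management", "informed consent", "patient monitoring",
]

llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")

def identify_legal_issues(request: str) -> list[str]:
    """Step 2: run one filter prompt per category and keep categories flagged as 1."""
    flagged = []
    for category in MALPRACTICE_CATEGORIES:
        prompt = ChatPromptTemplate.from_messages([
            ("system",
             "Decide whether the counseling request raises a potential legal issue in the "
             f"category '{category}'. Reply as JSON: "
             '{{"flag": 0 or 1, "rationale": "<one sentence>"}}'),
            ("human", "{request}"),
        ])
        reply = (prompt | llm).invoke({"request": request}).content
        # A production pipeline would validate or repair the model output before parsing.
        if json.loads(reply).get("flag") == 1:
            flagged.append(category)  # the rationale is deliberately discarded here
    return flagged

def elaborate_legal_issues(request: str, categories: list[str]) -> str:
    """Step 3: elaborate each flagged category without assuming malpractice occurred."""
    prompt = ChatPromptTemplate.from_messages([
        ("system",
         "For each listed category, explain the legal issue it may raise in this case. "
         "Do not assume that malpractice occurred."),
        ("human", "Request: {request}\nCategories: {categories}"),
    ])
    return (prompt | llm).invoke(
        {"request": request, "categories": ", ".join(categories)}
    ).content
```

Steps 4 and 5 would then take the clinical context from Step 1 and the elaborated issues from Step 3 as inputs to the drafting and refinement prompts.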

3. Evaluation

Evaluators received a counseling request, a human expert response, and a chatbot-generated response, randomly labeled as “Response A” and “Response B” to ensure blinding. Evaluators remained unaware of their peers’ evaluations. Each response was independently rated across four dimensions using a 5-point Likert scale: quality of information, understanding and reasoning, information-opinion spectrum, and overall satisfaction (Supplement A). Higher scores indicated superior performance for quality of information, understanding and reasoning, and overall satisfaction, but not necessarily for the information-opinion spectrum.
The evaluation dimensions were adapted from existing frameworks for assessing LLMs in the medical domain [21] but modified to better suit the needs of this study. Dimensions such as empathetic expression and human safety were excluded. Empathetic expression was omitted because counseling responses are typically formal, and expressions of empathy might inadvertently reveal the response source. Human safety was deemed irrelevant, as counseling addresses retrospective adverse events rather than prospective clinical decision-making.
The information-opinion spectrum was specifically introduced to evaluate the extent to which a response provides objective information versus subjective opinion. For instance, clarifying the implications of a content-certified letter or providing general medical knowledge exemplifies objective information. However, when subjective interpretations are introduced, the content shifts toward opinion [22]. Examples include assessing the presence or degree of medical negligence or evaluating the relative benefits and risks of legal strategies, as these move beyond mere factual information and into subjective, opinion-based reasoning.
The evaluation team comprised 38 experts, including nurses, physicians, attorneys, and individuals holding advanced law degrees, affiliated with the Agency. These experts were selected due to their extensive experience in navigating both layperson and professional language, their roles as intermediaries between patients and healthcare providers, and their representativeness of the human experts typically involved in medical dispute counseling.
Each of the 50 validation samples was reviewed by three experts, resulting in a total of 150 paired evaluations. Every evaluation panel included at least one medical and one legal expert. After rating each response, evaluators had the option to provide free-text comments and completed a survey to guess which response was AI-generated. The survey utilized five confidence levels, ranging from “I’m certain it’s A” to “I’m certain it’s B.”

4. Statistical Analysis

The Wilcoxon signed-rank test was used to compare chatbot responses against human expert responses, testing the hypothesis that the chatbot would outperform humans. Scores from the three evaluators were averaged for each sample across all evaluation dimensions. Mean scores for chatbot and human responses, along with the gaps between these scores, were calculated. Pearson correlation coefficients were computed to analyze how individual evaluation dimensions contributed to overall satisfaction.
Responses from the survey, in which experts guessed the AI-generated response, were first used to assess the accuracy of expert perceptions. Chatbot responses were then grouped into two categories: correct guesses and incorrect guesses. High-confidence guesses were weighted twice as heavily as low-confidence guesses. The Mann-Whitney U test was used to compare these two groups. All statistical analyses were performed using Python (version 3.11.10), employing the Matplotlib, Pandas, and SciPy libraries.
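As a concrete illustration of the analysis pattern described above, the sketch below applies the named SciPy tests to placeholder score arrays; the study’s per-case scores are not published, and the one-sided alternative and the grouping by correct versus incorrect guesses are assumptions drawn from the text.

```python
# Sketch of the statistical comparisons using placeholder data (n = 50 cases).
import numpy as np
from scipy.stats import mannwhitneyu, pearsonr, wilcoxon

rng = np.random.default_rng(0)
chatbot = rng.uniform(3.0, 5.0, size=50)       # placeholder: per-case mean of 3 evaluators
human = rng.uniform(2.5, 4.5, size=50)         # placeholder: paired human-expert scores

# Paired, one-sided Wilcoxon signed-rank test: does the chatbot outperform human experts?
w_stat, p_val = wilcoxon(chatbot, human, alternative="greater")
print(f"W = {w_stat:.1f}, p = {p_val:.4f}, mean gap = {np.mean(chatbot - human):.2f}")

# Pearson correlation between two evaluation dimensions (placeholder arrays).
quality = rng.uniform(3.0, 5.0, size=50)       # quality of information
satisfaction = rng.uniform(2.5, 5.0, size=50)  # overall satisfaction
r, _ = pearsonr(quality, satisfaction)

# Mann-Whitney U test on chatbot scores grouped by whether evaluators correctly
# identified the AI-generated response (placeholder split).
correct_mask = rng.random(50) < 0.6
u_stat, p_group = mannwhitneyu(chatbot[correct_mask], chatbot[~correct_mask],
                               alternative="two-sided")
```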

III. Results

The chatbot significantly outperformed human experts in the quality of information dimension (W = 1090.5, p < 0.001), with a mean chatbot score of 4.23 compared to 3.25 for human experts, resulting in an average score gap of 0.98. On the understanding and reasoning dimension, the chatbot again surpassed human experts (W = 772.0, p < 0.001), achieving a mean score of 4.04 compared to 3.21 for humans, reflecting a score gap of 0.83. The chatbot also scored higher on the information-opinion spectrum dimension (W = 933.5, p < 0.001), with an average score of 3.49, while human experts scored an average of 2.95, creating a mean gap of 0.54. Regarding overall satisfaction, the chatbot received significantly higher ratings than human experts (W = 1035.0, p < 0.001), with an average chatbot score of 3.69 compared to the human average of 3.11, resulting in a mean score gap of 0.58 (Figure 3, Table 2).
Pearson correlation analyses between evaluation dimensions revealed that scores for quality of information, understanding and reasoning, and overall satisfaction demonstrated strong linear correlations with each other (r > 0.6). Conversely, scores for the information-opinion spectrum appeared distinct, exhibiting relatively weaker correlations with the other three dimensions (r < 0.4), marking it as an outlier among evaluation dimensions (Figure 4).
The results from the survey, which asked evaluators to identify the AI-generated response, indicated that evaluators’ perceptions were generally accurate. The Mann-Whitney U test comparing correct versus incorrect identification groups showed a statistically significant, though modest, effect for quality of information (p = 0.049, r = 0.141). However, the differences observed for the remaining three dimensions—understanding and reasoning (p = 0.831, r = 0.016), information-opinion spectrum (p = 0.800, r = 0.019), and overall satisfaction (p = 0.777, r = 0.021)—were not statistically significant.

IV. Discussion

The CoT approach demonstrated strong performance in medical dispute counseling. The chatbot significantly outperformed human experts in quality of information, understanding and reasoning, and overall satisfaction, despite not utilizing advanced methods such as retrieval-augmented generation (RAG) or model fine-tuning. Evaluators were generally able to identify AI-generated responses accurately, although this recognition did not significantly influence the evaluation outcomes. The small but significant effect size observed for quality of information suggests the chatbot’s informational style may have made its responses identifiable, possibly because evaluators were already familiar with tools such as Claude or ChatGPT.
Although the chatbot was explicitly prompted to minimize opinion-driven content, it nevertheless generated more subjective responses compared to human experts. Strong correlations among quality of information, understanding and reasoning, and overall satisfaction indicate these factors significantly influenced evaluator satisfaction. However, the chatbot’s advantage in these dimensions did not produce an equally substantial difference in overall satisfaction scores; the average score gap for overall satisfaction (0.58) was smaller than that for quality of information (0.98) and understanding and reasoning (0.83).
These findings highlight a limitation of the current evaluation methods: factors beyond the quality of information and understanding and reasoning significantly influenced overall satisfaction. For instance, some chatbot responses, despite high scores in individual dimensions, appeared detached from real-world contexts (sample 2 in Supplement B).
Additionally, some chatbot responses contained factually incorrect legal information, likely due to LLMs like Claude 3.5 and ChatGPT being trained on broad web datasets that do not account for jurisdictional differences or evolving legal frameworks. For example, the chatbot produced misleading content regarding abortion rights in South Korea (sample 1 in Supplement B). Such inaccuracies represent a form of hallucinated legal content [23]. Structured methods, such as RAG, could mitigate these inaccuracies by systematically organizing relevant legal information into vectorized libraries, enabling more precise referencing [24,25].
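As a rough illustration of the retrieval idea suggested here, and not the authors’ implementation, the sketch below substitutes TF-IDF similarity for learned embeddings to select candidate legal passages that could be inserted into the legal issue elaboration prompt; the mini-corpus, passage wording, and function names are hypothetical.

```python
# Hypothetical retrieval sketch: TF-IDF stands in for an embedding-based vector store.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpus; a real system would index current statutes, precedents,
# and Agency guidance, kept up to date for the relevant jurisdiction.
legal_passages = [
    "Passage on the right to request counseling and mediation in medical disputes.",
    "Passage on informed-consent requirements before surgical procedures.",
    "Passage summarizing the current legal status of abortion in South Korea.",
]

vectorizer = TfidfVectorizer()
passage_matrix = vectorizer.fit_transform(legal_passages)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the counseling request."""
    scores = cosine_similarity(vectorizer.transform([query]), passage_matrix).ravel()
    return [legal_passages[i] for i in scores.argsort()[::-1][:k]]

# Retrieved passages would be appended to the legal-issue elaboration prompt so that
# the model reasons over cited text rather than its parametric memory alone.
print(retrieve("Was my consent to the operation legally sufficient?"))
```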
Furthermore, the chatbot occasionally made problematic assumptions, explicitly presuming malpractice or implicitly accepting contentious points as factual while explaining legal issues. Addressing these concerns through prompt engineering proved challenging, as constraints needed to be clear, broadly applicable, and minimal; the appropriate stance frequently varied from case to case.
Another concern was the chatbot’s inability to adequately account for the variability of healthcare environments across countries and institutions. In legal contexts, malpractice assessments depend on whether a healthcare provider met the expected standard of care [26], evaluated through expert testimony that considers real-world factors such as available technology, institutional capacity, and prevailing clinical guidelines [27]. Unlike codified laws, these factors are highly context-dependent, reducing the feasibility of structured approaches. Additionally, legal counseling is inherently more subjective than medical counseling, where demonstrable patient harm provides a clearer criterion [28].
In this context, a limitation of the present study is that data were restricted to Korean-language cases within South Korea’s specific healthcare system, where universal insurance imposes financial accountability on providers for denied claims. Consequently, disputes over medical expenses are rare, except in non-reimbursed sectors like aesthetic medicine. This contrasts sharply with healthcare systems such as that of the United States, where billing disputes are common. Thus, caution is warranted when generalizing these findings to other healthcare settings and languages.
Another substantial limitation is the absence of laypersons among the evaluators, preventing assessment of how patients might perceive AI-generated responses. Patients often have strong emotional stakes and might perceive AI counseling as impersonal or incapable of fully understanding their individual grievances [29]. This contrasts with the outcomes of our study, where the chatbot received higher scores in several key evaluation dimensions. Introducing LLMs in public-facing counseling contexts will require increased transparency and sensitivity toward patient perspectives to enhance acceptance in medical dispute resolution.
This study demonstrated that LLMs have considerable potential for medical dispute counseling, surpassing human experts across all three quality-focused dimensions. Nevertheless, our analysis also revealed a nuanced trade-off: permitting greater subjectivity produced detailed, interest-centered responses but increased the risk of misleading assumptions, while overly constraining model output led to generic or excessively neutral responses [23].
A collaborative approach combining human experts with AI could offer a balanced solution, employing AI as an auxiliary tool while maintaining human oversight for critical decisions. For example, LLMs could aid in reviewing extensive electronic health records (EHRs) [30], summarizing crucial information to enhance review efficiency and quality. They could also assist patients in comprehending complex medical or legal terminology within documents such as review statements and settlement recommendations, thus improving communication and transparency during dispute resolution processes.
Future developments should build upon our findings to align LLM applications more effectively with real-world contexts. By further integrating these technologies into medical dispute resolution, these models can serve as valuable assistants, ultimately supporting patients and their families as they navigate the challenging process of medical disputes.

Notes

Conflict of Interest

No potential conflict of interest relevant to this article was reported.

Figure 1
Overview of data collection, division, and usage for pipeline development and evaluation.
Figure 2
Chain-of-thought (CoT) prompt used for the diagnosis and examination malpractice category in the Legal Issue Identification step, with examples of in-context learning (ICL) to guide model reasoning.
Figure 3
Score distribution for the four evaluation dimensions, comparing chatbot and human responses.
Figure 4
Heatmap illustrating Pearson correlation coefficients between the four evaluation dimensions. Red indicates a high correlation, while blue indicates a low correlation.
Table 1
Exclusion criteria for the collected samples, with corresponding examples
Requests concerning administrative process:
  • - Inquiring about the conditions under which a mediation application may be dismissed.
  • - Inquiring about the refund policy for mediation fees when mediation is unsuccessful.
General legal inquiries:
  • - Inquiring about the designation of the liable party for adjudication after the treating physician’s departure from the institution.
  • - Inquiring about the admissibility of the Agency’s appraisal statements as evidence in litigation following mediation failure.
Cases related to alternative medicine:
  • - Traditional Korean medicine
  • - Acupuncture
Table 2
Wilcoxon signed-rank test outcomes for the four evaluation dimensions
Dimension W statistic p-value
Quality of information 1090.5 <0.001
Understanding and reasoning 772.0 <0.001
Information-opinion spectrum 933.5 <0.001
Overall satisfaction 1035.0 <0.001

References

1. Amirthalingam K. Medical dispute resolution, patient safety and the doctor-patient relationship. Singapore Med J 2017;58(12):681-4. https://doi.org/10.11622/smedj.2017073
2. Wu Z, Hasan A, Wu J, Kim Y, Cheung JP, Zhang T, et al. KnowLab_AIMed at MEDIQA-CORR 2024: Chain-of-Thought (CoT) prompting strategies for medical error detection and correction. Proceedings of the 6th Clinical Natural Language Processing Workshop (ClinicalNLP@NAACL); 2024 Jun 21. Mexico City, Mexico; p. 353-9. https://doi.org/10.18653/v1/2024.clinicalnlp-1.33
3. Cui J, Ning M, Li Z, Chen B, Yan Y, Li H, et al. Chatlaw: a multi-agent collaborative legal assistant with knowledge graph enhanced mixture-of-experts large language model [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 20]. Available from: https://arxiv.org/abs/2306.16092

4. Shi J, Guo Q, Liao Y, Liang S. LegalGPT: legal chain of thought for the legal large language model multi-agent framework. In: Huang DS, Si Z, Chen W, editors. Advanced intelligent computing technology and applications. Singapore: Springer; 2024. p. 25-37. https://doi.org/10.1007/978-981-97-5678-0_3
5. Naik N, Hameed BM, Shetty DK, Swain D, Shah M, Paul R, et al. Legal and ethical consideration in artificial intelligence in healthcare: who takes responsibility? Front Surg 2022;9:862322. https://doi.org/10.3389/fsurg.2022.862322
6. Food and Drug Administration; Health Canada; Medicines and Healthcare products Regulatory Agency. Good machine learning practice for medical device development: guiding principles [Internet]. London, UK: Medicines and Healthcare products Regulatory Agency; 2021 [cited at 2025 Apr 15]. Available from: https://assets.publishing.service.gov.uk/media/617921168fa8f52978e14c0b/GMLP_Guiding_Principles_FINAL.pdf

7. Shumway DO, Hartman HJ. Medical malpractice liability in large language model artificial intelligence: legal review and policy recommendations. J Osteopath Med 2024;124(7):287-90. https://doi.org/10.1515/jom-2023-0229
8. Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA 2025;333(4):319-28. https://doi.org/10.1001/jama.2024.21700
9. Decker H, Trang K, Ramirez J, Colley A, Pierce L, Coleman M, et al. Large language model-based chatbot vs surgeon-generated informed consent documentation for common procedures. JAMA Netw Open 2023;6(10):e2336997. https://doi.org/10.1001/jamanetworkopen.2023.36997
10. Allen JW, Earp BD, Koplin J, Wilkinson D. Consent-GPT: is it ethical to delegate procedural consent to conversational AI? J Med Ethics 2024;50(2):77-83. https://doi.org/10.1136/jme-2023-109347
11. Paasche-Orlow MK, Jacob DM, Hochhauser M, Parker RM. National survey of patients’ bill of rights statutes. J Gen Intern Med 2009;24(4):489-94. https://doi.org/10.1007/s11606-009-0914-z
12. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst 2020;33:1877-901.

13. Lin SY, Hsu YY, Ju SW, Yeh PC, Hsu WH, Kao CH. Assessing AI efficacy in medical knowledge tests: A study using Taiwan’s internal medicine exam questions from 2020 to 2023. Digit Health 2024;10:20552076241291404. https://doi.org/10.1177/20552076241291404
14. Schweitzer S, Conrads M, Naeve J. Claude rules: An evaluation of large language models’ applicability to solve cases in German business law. Procedia Comput Sci 2024;246:2675-83. https://doi.org/10.1016/j.procs.2024.09.406
15. Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, et al. Chain-of-thought prompting elicits reasoning in large language models. Adv Neural Inf Process Syst 2022;35:24824-37.

16. Savage T, Nayak A, Gallo R, Rangan E, Chen JH. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digit Med 2024;7:20. https://doi.org/10.1038/s41746-024-01010-1
17. Alzghoul R, Ayaabdelhaq A, Tabaza A, Altamimi A. CLD-MEC at MEDIQA-CORR 2024 Task: GPT-4 multistage clinical chain of thought prompting for medical errors detection and correction. Proceedings of the 6th Clinical Natural Language Processing Workshop (ClinicalNLP@NAACL); 2024 Jun 21. Mexico City, Mexico; p. 537-56. https://doi.org/10.18653/v1/2024.clinicalnlp-1.52
18. Zhao H, Yilahun H, Hamdulla A. Pipeline chain-of-thought: A prompt method for large language model relation extraction. Proceedings of 2023 International Conference on Asian Language Processing (IALP); 2023 Nov 18–20. Singapore; p. 31-6. https://doi.org/10.1109/IALP61005.2023.10337264
19. Jung S, Hwang H, Hong M. A study on the compensation for medical malpractice liability [Internet]. Seoul, Korea: Korea Insurance Research Institute; 2020 [cited at 2025 Apr 15]. Available from: https://www.kiri.or.kr/report/downloadFile.do?docId=5873

20. Madea B, Vennedey C, Dettmeyer R, Preuss J. Outcome of preliminary proceedings against medical practitioners suspected of malpractice. Dtsch Med Wochenschr 2006;131(38):2073-8. https://doi.org/10.1055/s-2006-951332

21. Tam TY, Sivarajkumar S, Kapoor S, Stolyar AV, Polanska K, McCarthy KR, et al. A framework for human evaluation of large language models in healthcare derived from literature review. NPJ Digit Med 2024;7(1):258. https://doi.org/10.1038/s41746-024-01258-7
22. Cheong I, Xia K, Feng KK, Chen QZ, Zhang AX. (A) I am not a lawyer, but...: engaging legal experts towards responsible LLM policies for legal advice. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency; 2024 Jun 3–6. Rio de Janeiro, Brazil; p. 2454-69. https://doi.org/10.1145/3630106.3659048
23. Dahl M, Magesh V, Suzgun M, Ho DE. Large legal fictions: profiling legal hallucinations in large language models. J Leg Anal 2024;16(1):64-93. https://doi.org/10.1093/jla/laae003
24. Louis A, van Dijck G, Spanakis G. Interpretable long-form legal question answering with retrieval-augmented large language models. Proc AAAI Conf Artif Intell 2024;38(20):22266-75. https://doi.org/10.1609/aaai.v38i20.30232
25. Wiratunga N, Abeyratne R, Jayawardena L, Martin K, Massie S, Nkisi-Orji I, et al. CBR-RAG: case-based reasoning for retrieval augmented generation in LLMs for legal question answering. In: Recio-Garcia JA, Orozco-del-Castillo MG, Bridge D, editors. Case-based reasoning research and development. Cham, Switzerland: Springer; 2024. p. 445-60. https://doi.org/10.1007/978-3-031-63646-2_29
26. Taragin MI, Willett LR, Wilczek AP, Trout R, Carson JL. The influence of standard of care and severity of injury on the resolution of medical malpractice claims. Ann Intern Med 1992;117(9):780-4. https://doi.org/10.7326/0003-4819-117-9-780
27. Peeples R, Harris CT, Metzloff TB. The process of managing medical malpractice cases: the role of standard of care. Wake Forest Law Rev 2002;37:877-902. https://doi.org/10.2139/ssrn.347760
28. Bernstein IA, Zhang YV, Govil D, Majid I, Chang RT, Sun Y, et al. Comparison of ophthalmologist and large language model chatbot responses to online patient eye care questions. JAMA Netw Open 2023;6(8):e2330320. https://doi.org/10.1001/jamanetworkopen.2023.30320
29. Reis M, Reis F, Kunde W. Influence of believed AI involvement on the perception of digital medical advice. Nat Med 2024;30(11):3098-100. https://doi.org/10.1038/s41591-024-03180-7
30. Yang X, Chen A, PourNejatian N, Shin HC, Smith KE, Parisien C, et al. A large language model for electronic health records. NPJ Digit Med 2022;5(1):194. https://doi.org/10.1038/s41746-022-00742-2