Seo, Park, Byun, Choi, Choi, and Shin: Advancing Korean Medical Large Language Models: Automated Pipeline for Korean Medical Preference Dataset Construction

Abstract

Objectives

Developing large language models (LLMs) in biomedicine requires access to high-quality training and alignment tuning datasets. However, publicly available Korean medical preference datasets are scarce, hindering the advancement of Korean medical LLMs. This study constructs the Korean Medical Preference Dataset (KoMeP), an alignment tuning dataset built with an automated pipeline that minimizes the high cost of human annotation, and evaluates its efficacy.

Methods

KoMeP was generated using the DAHL score, an automated hallucination evaluation metric. Five LLMs (Dolly-v2-3B, MPT-7B, GPT-4o, Qwen-2-7B, Llama-3-8B) produced responses to 8,573 biomedical examination questions, from which 5,551 preference pairs were extracted. Each pair consisted of a “chosen” response and a “rejected” response, as determined by their DAHL scores. The dataset was evaluated by training five different models with two alignment tuning methods, direct preference optimization (DPO) and odds ratio preference optimization (ORPO). The KorMedMCQA benchmark was employed to assess the effectiveness of alignment tuning.

Results

Models trained with DPO consistently improved KorMedMCQA performance; notably, Llama-3.1-8B showed a 43.96% increase. In contrast, ORPO training produced inconsistent results. Additionally, English-to-Korean transfer learning proved effective, particularly for English-centric models like Gemma-2, whereas Korean-to-English transfer learning achieved limited success. Instruction tuning with KoMeP yielded mixed outcomes, which suggests challenges in dataset formatting.

Conclusions

KoMeP is the first publicly available Korean medical preference dataset and significantly improves alignment tuning performance in LLMs. The DPO method outperforms ORPO in alignment tuning. Future work should focus on expanding KoMeP, developing a Korean-native dataset, and refining alignment tuning methods to produce safer and more reliable Korean medical LLMs.

I. Introduction

Large language models (LLMs) have become prominent in various domains, including biomedicine [1]. However, because the risks associated with inaccurate or untruthful outputs are high, LLMs in the biomedical domain must adhere to rigorous reliability standards [2]. Errors generated by these models can have severe consequences, especially in clinical applications such as treatment planning and clinical report generation [3]. This concern emphasizes the need for high-quality, carefully curated medical training data.
High-quality pre-training and alignment-tuning data are indispensable for LLMs in medicine to generate truthful and trustworthy responses. Nevertheless, obtaining such data is challenging, particularly for languages with limited resources like Korean. Alignment tuning entails fine-tuning LLMs so that their outputs align with defined goals, values, and contextual requirements, with an emphasis on factual accuracy, ethical compliance, and domain relevance. This is accomplished using preference datasets that capture human judgments regarding responses to given questions. In the medical domain, such tuning is critical because inaccuracies or biases can affect patient safety, influence clinical decisions, and undermine public trust. Publicly available Korean medical data is notably scarce, and no open-source Korean medical preference datasets have been specifically designed for alignment tuning. Constructing such datasets is prohibitively expensive, as it typically requires numerous domain-expert annotators. This resource gap significantly hampers the development of accurate and reliable Korean medical models.
To address this gap, we propose a novel automated pipeline for constructing high-quality Korean medical preference data that supports the development of aligned Korean medical LLMs. This pipeline leverages the DAHL (Domain-specific Automated Hallucination Evaluation of Long-Form Text) score [4] to label each response, thereby minimizing the need for costly human annotation. We have also publicly released our pipeline along with a dataset of 5,551 preference data pairs, which are referred to as Korean Medical Preference (KoMeP) data. To evaluate the dataset’s effectiveness, we trained five models across three different model families—Gemma-2 (2B & 9B), Qwen-2 (1.5B & 7B), and Llama-3.1 (8B)—using direct preference optimization (DPO) [5] and odds ratio preference optimization (ORPO) [6]. The models were assessed before and after alignment tuning using the Korean Medical Multiple-Choice Question Answering (KorMedMCQA) dataset [7], which is currently the only benchmark for evaluating Korean medical question-answering abilities in LLMs. Our results indicate that alignment tuning with KoMeP generally improves model performance, with DPO training producing more substantial improvements than ORPO training. Additional ablation experiments focused on transfer learning (from Korean to English and vice versa) and instruction tuning further informed our analysis.
The key contributions of this research are (1) developing a pipeline for automatically constructing a Korean medical preference dataset and (2) publicly releasing both the pipeline and the dataset generated through it.

II. Methods

1. KoMeP

1) DAHL score

Constructing preference data for alignment tuning has traditionally required human annotation to label and rank responses; however, this approach is both time-consuming and costly. To address this, we utilize the DAHL score [4], an automated hallucination evaluation metric developed specifically for long-form biomedical text generation, enabling more affordable and efficient annotation. DAHL provides both benchmark questions and a comprehensive pipeline to assess the factual accuracy of model responses. The evaluation relies on two key components: the splitter model and the checker model. The benchmark consists of 8,573 biomedical questions sourced from research papers in PubMed Central (https://pmc.ncbi.nlm.nih.gov/), spanning 29 categories. For each question, the hallucination evaluation pipeline follows three main steps:
  • Splitting: The response is segmented into atomic units—each containing a single piece of information—using the splitter model, with GPT-4.5 as the default.

  • Factuality checking: The checker model verifies each atomic unit for factual accuracy using the Perplexity API with Llama-3-8B-Instruct [8] as the base model. The Perplexity API is favored because it employs a retrieval-augmented generation system that enhances verification reliability compared to a standalone language model.

  • Scoring: The DAHL score is computed as the proportion of factual atomic units relative to the total units in the response.

The accuracy score for an individual response is its DAHL score, while the average score across all 8,573 responses represents the target model’s DAHL score. In this study, we substitute human-annotated labels with the DAHL score to provide a cost-effective and scalable alternative for constructing preference data.
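Conceptually, the scoring step reduces to a simple proportion. The following is a minimal sketch of that computation, assuming hypothetical split_into_atomic_units and check_factuality helpers that stand in for the splitter model and the Perplexity-backed checker model; it is illustrative only, not the released DAHL implementation.

```python
# Minimal sketch of DAHL scoring. The two helpers below are placeholders for the
# actual splitter model (an LLM segmenting the response) and checker model
# (Perplexity API with Llama-3-8B-Instruct); they are illustrative only.

def split_into_atomic_units(response: str) -> list[str]:
    # Placeholder: naive sentence split; DAHL uses an LLM to produce atomic
    # units that each carry exactly one piece of information.
    return [s.strip() for s in response.split(".") if s.strip()]

def check_factuality(unit: str) -> bool:
    # Placeholder: DAHL queries a retrieval-augmented checker and parses a
    # true/false verdict for each atomic unit.
    return True

def dahl_score(response: str) -> float:
    """DAHL score of one response: the share of factual atomic units."""
    units = split_into_atomic_units(response)           # 1. splitting
    if not units:
        return 0.0
    factual = sum(check_factuality(u) for u in units)   # 2. factuality checking
    return factual / len(units)                         # 3. scoring

def model_dahl_score(responses: list[str]) -> float:
    """Model-level DAHL score: the average over all benchmark responses."""
    return sum(dahl_score(r) for r in responses) / len(responses)
```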

2) Data construction

Using 8,573 biomedical examination questions from DAHL, we generated responses for each question with five different models: Dolly-v2-3B [9], MPT-7B [10], GPT-4o, Qwen-2-7B [11], and Llama-3-8B [12]. We incorporated a variety of models, not solely state-of-the-art ones, to obtain greater variation in responses. This process produced five distinct responses per question, and the DAHL score was computed for each response. We retained only the responses with the highest and lowest DAHL scores to construct the preference dataset. The response with the highest score was designated as the “chosen” response, while the one with the lowest score was labeled as the “rejected” response. The original question then served as the “prompt.” This construction pipeline is illustrated in Figure 1. To refine the dataset further, we applied two filtering criteria. First, entries were excluded if all five responses exhibited identical DAHL scores, as this would indicate no discernible difference in factual quality between the “chosen” and “rejected” responses. Second, entries were removed if the “chosen” response repeated the question. Following these filters, the resulting preference dataset comprised 5,551 entries with three columns: “prompt,” “chosen,” and “rejected.” Because the DAHL dataset and evaluation pipeline are tailored for English, the initial preference dataset was in English. We therefore translated all three columns into Korean using GPT-4o, with the translations subsequently verified by human annotators. This process yielded a Korean medical preference dataset (KoMeP) containing 5,551 entries, ready for use in alignment tuning.
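The selection and filtering rules above can be summarized in a short sketch. The function below is illustrative: scored_responses is assumed to be a list of (response, DAHL score) pairs produced for one question by the five generator models, and the repetition check is a simplified stand-in for the filter described above.

```python
def build_preference_entry(
    question: str, scored_responses: list[tuple[str, float]]
) -> dict | None:
    """Turn one question's five scored responses into a KoMeP-style entry, or drop it."""
    scores = [score for _, score in scored_responses]
    # Filter 1: discard the question if all five DAHL scores are identical,
    # since "chosen" and "rejected" would not differ in factual quality.
    if len(set(scores)) == 1:
        return None
    chosen, _ = max(scored_responses, key=lambda pair: pair[1])
    rejected, _ = min(scored_responses, key=lambda pair: pair[1])
    # Filter 2: discard the question if the "chosen" response merely repeats it
    # (simplified check; the actual filter may be more permissive).
    if chosen.strip() == question.strip():
        return None
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```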

3) Data analysis

The constructed KoMeP dataset contains 5,551 rows, each comprising “prompt,” “chosen,” and “rejected” columns. As shown in Table 1, “chosen” responses are slightly longer than “rejected” responses—their length is approximately 1.2 times greater when measured by character or sentence count. This pattern is consistent with other preference datasets, as reward models in RLHF (reinforcement learning from human feedback) tend to favor longer responses over shorter ones [13].
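Table 1’s statistics can be reproduced with a few lines of pandas; the sketch below assumes a local copy of KoMeP stored as JSON Lines under the hypothetical name komep.jsonl and uses a naive period-based sentence count.

```python
import pandas as pd

# Hypothetical local copy of KoMeP with "prompt", "chosen", "rejected" columns.
df = pd.read_json("komep.jsonl", lines=True)

def sentence_count(text: str) -> int:
    # Naive approximation: count non-empty period-delimited segments.
    return len([s for s in text.split(".") if s.strip()])

for col in ["chosen", "rejected"]:
    avg_chars = df[col].str.len().mean()
    avg_sents = df[col].map(sentence_count).mean()
    print(f"{col}: {avg_chars:.0f} characters, {avg_sents:.2f} sentences on average")
```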
Additionally, the dataset incorporates 29 medical categories adopted from [14], and Figure 2 visualizes their distribution. Public health, oncology, pharmacology, genetics, microbiology, cardiology, immunology, neurology, and radiology dominate the distribution, while each of the remaining categories accounts for less than 3%.
Figure 3 presents an example from KoMeP, showing the Korean text followed by its English translation. In this example, the “chosen” response provides a truthful answer to the “prompt” question, earning a DAHL score of 1, whereas the “rejected” response offers a clinical example related to the question instead of a direct answer. Because the “rejected” response addresses a specific case, its components cannot be judged as entirely accurate, resulting in a DAHL score of 0. Thus, the response that accurately and truthfully addressed the question—and consequently earned a higher DAHL score—was selected as “chosen.”
In most cases, however, the DAHL scores of both “chosen” and “rejected” responses fall between 0 and 1 rather than at the extremes, with smaller differences between them. In some instances, “rejected” responses received a score of 0 simply because they repeated the question instead of answering it; it was rare for a response to score 0 merely for providing a well-formed, albeit hallucinated, answer.

2. Alignment Tuning

Alignment tuning typically follows instruction tuning via supervised learning; while instruction tuning trains models to follow user-provided instructions, alignment tuning ensures compliance with human values, ethics, and expectations. A crucial goal of alignment tuning is to improve model truthfulness and mitigate hallucinations—particularly important in high-stakes domains such as biomedicine. Thus, developing high-quality preference datasets for alignment tuning in the medical domain is paramount. Our dataset specifically focuses on enhancing the factual accuracy of responses, as measured by the DAHL Score proposed by [4]. To validate the efficacy of KoMeP, we conducted alignment tuning on various models and evaluated their performance using the KorMedMCQA benchmark.
For this experiment, we selected five models from three distinct model families: Gemma-2 [15], Qwen-2 [11], and Llama-3.1 [12]. Due to resource constraints, the experiments were performed using relatively small LLMs, each having no more than 9 billion parameters.
We applied two alignment tuning methodologies: DPO [5] and ORPO [6]. DPO directly optimizes the model on preference pairs, without training a separate reward model, so that preferred responses are ranked above rejected ones, whereas ORPO adds an odds-ratio penalty on less-favored outputs during fine-tuning and does not require a reference model. We did not use PPO-based reinforcement learning because our dataset contains only 5,551 preference pairs with two labels (chosen and rejected), whereas effective reward model training typically requires at least 10K to 700K examples [16]. Since alignment tuning is performed after instruction tuning to adapt models for specific tasks, we used instruction-tuned models as our baseline. For attention implementations, we used eager attention for the Gemma models and Flash Attention 2 [17] for the others. Each model was trained for up to three epochs, and we report the results of the best-performing configuration. The KorMedMCQA dataset [7], a Korean medical question-answering (QA) benchmark covering questions from medical licensing exams for doctors, nurses, and pharmacists, was used to evaluate model performance via lm-evaluation-harness [18].
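As an illustration of this training setup, the sketch below runs DPO with Hugging Face TRL directly on the three KoMeP columns. The model name, dataset path, output directory, and hyperparameters are assumptions for illustration rather than our exact configuration, and the keyword for passing the tokenizer (processing_class vs. tokenizer) varies across TRL releases; ORPO training follows the same pattern with ORPOConfig and ORPOTrainer.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Instruction-tuned baseline; Gemma models would use attn_implementation="eager" instead.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="bfloat16",
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# KoMeP stored locally as JSON Lines with "prompt", "chosen", "rejected" columns.
train_dataset = load_dataset("json", data_files="komep.jsonl", split="train")

config = DPOConfig(
    output_dir="llama-3.1-8b-komep-dpo",
    num_train_epochs=3,   # trained for up to three epochs
    beta=0.1,             # illustrative DPO temperature
)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # named "tokenizer" in older TRL releases
)
trainer.train()
```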

3. Ablation Study

1) Transfer learning

To investigate the feasibility of transfer learning from Korean to English using DPO training with KoMeP, we evaluated the trained models on two widely used English medical benchmark datasets: MedMCQA [19] and PubMedQA [20]. Model performance was compared before and after DPO training.
Since KoMeP was originally translated from English, we also possess an English version of the medical preference data. After observing that transfer learning from Korean to English was not very effective, we additionally explored transfer learning from English to Korean. Using the English version of our dataset, we applied DPO training to the five models and compared their performance on KorMedMCQA pre- and post-training.
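Before/after benchmark comparisons were run with lm-evaluation-harness [18]; a hedged sketch of such a comparison via its Python API is shown below. The task identifiers ("medmcqa", "pubmedqa") and model paths are assumptions that depend on the installed harness version and on where the tuned checkpoints are saved.

```python
import lm_eval

def evaluate(model_path: str, tasks: list[str]) -> dict:
    # Runs the given lm-evaluation-harness tasks on a Hugging Face model.
    output = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_path}",
        tasks=tasks,
    )
    return output["results"]

# Hypothetical before/after comparison for the Korean-to-English setting.
baseline = evaluate("Qwen/Qwen2-7B-Instruct", ["medmcqa", "pubmedqa"])
dpo_tuned = evaluate("./qwen2-7b-komep-dpo", ["medmcqa", "pubmedqa"])
```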

2) Instruction tuning

To further assess KoMeP’s utility, we reformatted it into an instruction-tuning dataset. Specifically, the prompt and chosen columns were restructured into a QA task format, adapting the chatting template for each model, as illustrated in Figure 4. For this experiment, we used relatively smaller models (Gemma-2-2B, Qwen-2-1.5B, and Qwen-2-7B). Each model underwent instruction tuning with the restructured dataset for up to three epochs, maintaining the experimental settings described earlier. Their performance on KorMedMCQA was evaluated before and after fine-tuning.
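The reformatting in Figure 4 amounts to mapping each “prompt”/“chosen” pair onto the model’s chat template. A minimal sketch, assuming the Hugging Face tokenizer for a Llama-3 Instruct model, is shown below; the column names match KoMeP, and everything else is illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def to_instruction_example(entry: dict) -> str:
    # "prompt" becomes the user turn, "chosen" the expected assistant answer;
    # the "rejected" column is unused for instruction tuning.
    messages = [
        {"role": "user", "content": entry["prompt"]},
        {"role": "assistant", "content": entry["chosen"]},
    ]
    # Renders the conversation with the model-specific template (Llama-3 Instruct here).
    return tokenizer.apply_chat_template(messages, tokenize=False)
```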

III. Results

1. Alignment Tuning Performance

The effectiveness of KoMeP in alignment tuning was evaluated using the KorMedMCQA benchmark, which measures Korean medical multiple-choice question-answering performance. Five different models—Gemma-2 (2B & 9B), Qwen-2 (1.5B & 7B), and Llama-3.1 (8B)—were trained using the DPO and ORPO algorithms.
The results, as shown in Figure 5, demonstrate improvements in most cases. Specifically, models trained on KoMeP—including Gemma-2-2B (ORPO), Gemma-2-9B (DPO), Qwen-2-1.5B (DPO and ORPO), Qwen-2-7B (DPO), and Llama-3.1-8B (DPO and ORPO)—generally improved their performance on KorMedMCQA. However, Gemma-2-2B (DPO), Gemma-2-9B (ORPO), and Qwen-2-7B (ORPO) showed decreased performance.
Gemma-2 and Qwen-2 underwent post-training using SFT (supervised fine-tuning) plus RLHF, whereas Llama-3.1 was trained with SFT+DPO. Notably, Llama-3.1, when trained on KoMeP using DPO, exhibited the most significant improvement at 43.96%. Although the ORPO algorithm is a novel and efficient approach, it appears to be empirically less robust than DPO. Consequently, when models already fine-tuned with RLHF or DPO were further trained with ORPO, the enhancements in KorMedMCQA performance were less pronounced. In fact, two out of five models (Gemma-2-9B and Qwen-2-7B) experienced a decline in performance. Among the Gemma-2 models, performance was inconsistent after training; specifically, Gemma-2-2B decreased from a baseline of 0.0565 to 0.0458 after DPO tuning, while Gemma-2-9B improved from 0.0525 to 0.0848 with DPO but dropped to 0.0215 with ORPO. In contrast, the Qwen-2 and Llama-3.1 models consistently showed performance improvements following alignment tuning.

2. Transfer Learning Results

1) Korean to English

As shown in Figure 6, the results provide no consistent evidence of effective transfer learning from Korean to English in the biomedical domain. For MedMCQA, two models (Gemma-2-9B and Qwen-2-7B) improved, while Gemma-2-2B maintained its performance and Llama-3.1-8B showed reduced performance. On PubMedQA, only Llama-3.1-8B demonstrated improved performance, whereas all other models experienced a decline.

2) English to Korean

As shown in Figure 7, transfer learning from English to Korean was more effective for most models compared to the Korean-to-English scenario. Performance improvements for Qwen-2 were similar regardless of whether the model was trained on the English or Korean medical preference dataset. However, Gemma-2 demonstrated a significantly greater performance increase when trained on the English dataset.

3. Instruction Tuning Performance

The results presented in Figure 8 show modest improvements for Qwen-2-1.5B and Qwen-2-7B, but a performance decline in Gemma-2-2B. These findings suggest that, in its current form, our dataset is less effective for instruction tuning.

IV. Discussion

1. Key Findings

Our findings confirm that KoMeP is an effective dataset for alignment tuning in Korean medical language models.

1) DPO vs. ORPO

DPO training led to consistent performance improvements across most models, while ORPO training produced inconsistent results, with some models even exhibiting performance degradation. This outcome suggests that ORPO may not be well-suited for models already fine-tuned with RLHF, rather than pointing to an issue with our dataset; this interpretation is further supported by the consistent performance gains observed with DPO.

2) Multilingual models vs. monolingual models

Models pretrained on multilingual data (Qwen-2 and Llama-3.1) performed better with alignment tuning compared to the primarily English-based Gemma-2 models. This divergence can be attributed to the lower Korean proficiency of Gemma-2, likely resulting from a smaller proportion of Korean data in its pre-training corpus. Gemma-2 models, which are not officially multilingual [15], showed lower baseline scores (0.0565 for Gemma-2-2B and 0.0525 for Gemma-2-9B) compared to the multilingual Qwen-2 and Llama-3.1 models [11,12]. This disparity highlights the importance of pre-training data: models exposed to more Korean data during pre-training exhibit better alignment tuning results [21], underscoring the need for comprehensive target-language data to maximize alignment tuning effectiveness [22].

3) Transfer learning

When tested on MedMCQA and PubMedQA, models trained with KoMeP exhibited only limited performance improvements. While some models (e.g., Gemma-2-9B and Qwen-2-7B) showed modest gains on MedMCQA, performance on PubMedQA generally declined. This discrepancy may be due to differences in the pre-training data composition. As noted, Gemma-2 is not an officially multilingual model, whereas Qwen-2 is. Therefore, alignment tuning for Gemma-2—primarily an English-oriented model—had a more pronounced impact on English data.
When the English version of KoMeP was used for training, performance on KorMedMCQA improved significantly across most models, especially for Gemma-2, which benefited more from English data compared to Korean. In contrast, Qwen-2, being inherently multilingual, showed little performance difference between training on English versus Korean data. Unexpectedly, Llama-3.1 experienced a substantial performance decline when trained with English preference data. One possible explanation is that, as the most recent model among those tested, Llama-3.1 may have been pretrained on cleaner and more comprehensive English data than was provided by our English preference dataset.

4) Instruction tuning

KoMeP was reformatted into an instruction-tuning dataset by structuring the prompts and responses in a QA format. Three models (Gemma-2-2B, Qwen-2-1.5B, Qwen-2-7B) were trained using this approach. While the Qwen-2 models showed slight improvements in KorMedMCQA performance, Gemma-2-2B experienced a minor decline, possibly due to its limited Korean language proficiency. This outcome may be attributed to both the dataset’s format and its size. Although the data was converted into a QA format suitable for instruction tuning, it remains tailored for long-form QA generation tasks, whereas KorMedMCQA is a multi-class classification benchmark that requires selecting from predefined options.
Furthermore, with only 5,551 entries focused on a single task, the dataset’s impact may be limited when used in isolation. Incorporating KoMeP into a larger, more diverse dataset covering multiple tasks could further enhance its utility—a prospect that merits additional investigation.

2. Limitations and Future Work

Although KoMeP currently includes only two labels—chosen and rejected—the construction pipeline can potentially be extended to incorporate more granular scoring by directly using the DAHL score as the preference label. However, further experimental validation is required to determine the effectiveness of such an approach. Additionally, because the DAHL system has not been verified for Korean, we relied on English-to-Korean translated data for this study. Developing an original Korean dataset will be crucial in future work. Lastly, while this study utilized 8,573 questions from the DAHL dataset, the construction pipeline is adaptable and can be applied to other question sets. Expanding KoMeP by incorporating additional questions will be an important step toward improving the dataset’s coverage and overall utility.

Notes

Conflict of Interest

Jinwook Choi is an editor of Healthcare Informatics Research; however, he was not involved in this article’s peer reviewer selection, evaluation, and decision process. Otherwise, no potential conflict of interest relevant to this article was reported.

Acknowledgments

This work was supported by the Research Grant from Seoul National University (Grant No. 100-20220084).

Figure 1
Data construction pipeline of the Korean Medical Preference (KoMeP). Responses for each question in the DAHL dataset were generated using five different large language models. Each response was evaluated using the DAHL score as the preference label. The response with the highest score was labeled “chosen,” while the one with the lowest score was labeled “rejected.” This process was repeated for all 8,573 questions in the DAHL dataset. After a filtering process, 5,551 entries remained. DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text.
hir-2025-31-2-166f1.jpg
Figure 2
Categorical distribution of the Korean Medical Preference (KoMeP).
hir-2025-31-2-166f2.jpg
Figure 3
Example and translation of “prompt,” “chosen,” and “rejected.”
hir-2025-31-2-166f3.jpg
Figure 4
Transforming KoMeP into an instruction-tuning dataset. The “prompt,” “chosen,” and “rejected” columns were repurposed. The “prompt” contained the user’s question, while the “chosen” column provided the expected answer. The data were then formatted according to the specific template of each tested model. The example format shown here follows the Llama-3 Instruct template. KoMeP: Korean Medical Preference.
hir-2025-31-2-166f4.jpg
Figure 5
Performance on KorMedMCQA of models before and after alignment tuning with KoMeP. Scores highlighted in blue indicate improved performance, red indicates decreased performance, and green indicates unchanged performance. Most models demonstrated enhanced performance on KorMedMCQA when trained with DPO. KorMedMCQA: Korean Medical Multiple-Choice Question Answering, KoMeP: Korean Medical Preference, DPO: direct preference optimization, ORPO: odds ratio preference optimization.
hir-2025-31-2-166f5.jpg
Figure 6
Transfer learning from Korean to English. DPO training with KoMeP occasionally improves performance on MedMCQA, an English medical benchmark dataset, but generally results in performance drops on PubMedQA, another English medical benchmark. This suggests that transfer learning from Korean to English during alignment tuning is not consistently effective. KorMedMCQA: Korean Medical Multiple-Choice Question Answering, KoMeP: Korean Medical Preference, DPO: direct preference optimization.
hir-2025-31-2-166f6.jpg
Figure 7
Transfer learning from English to Korean. DPO training with our English medical preference data improves performance on KorMedMCQA, a Korean medical benchmark dataset, for all models except Llama-3.1-8B. This highlights the effectiveness of transfer learning from English to Korean, in contrast to the limited success observed with transfer learning from Korean to English. Notably, for Gemma-2, DPO training with the English dataset resulted in even greater improvements on KorMedMCQA than training with the Korean dataset (KoMeP). KorMedMCQA: Korean Medical Multiple-Choice Question Answering, KoMeP: Korean Medical Preference, DPO: direct preference optimization, ORPO: odds ratio preference optimization.
hir-2025-31-2-166f7.jpg
Figure 8
Model performance on KorMedMCQA when instruction-tuned with KoMeP reformatted into the instruction format. Gemma-2-2B shows a slight drop in performance, while Qwen-2-1.5B and Qwen-2-7B show slight improvements. KorMedMCQA: Korean Medical Multiple-Choice Question Answering, KoMeP: Korean Medical Preference.
hir-2025-31-2-166f8.jpg
Table 1
Length of responses
                               Chosen   Rejected
Average number of characters   709      560
Average number of sentences    9.51     7.79

Chosen responses tended to be longer than rejected responses.

References

1. Wang C, Li M, He J, Wang Z, Darzi E, Chen Z, et al. A survey for large language models in biomedicine [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 13]. Available from: https://doi.org/10.48550/arXiv.2409.00133

2. Lu Z, Peng Y, Cohen T, Ghassemi M, Weng C, Tian S. Large language models in biomedicine and health: current research landscape and future directions. J Am Med Inform Assoc 2024;31(9):1801-11. https://doi.org/10.1093/jamia/ocae202
3. Bi Z, Dip SA, Hajialigol D, Kommu S, Liu H, Lu M, et al. AI for biomedicine in the era of large language models [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 13]. Available from: https://doi.org/10.48550/arXiv.2403.15673

4. Seo J, Lim J, Jang D, Shin H. DAHL: domain-specific automated hallucination evaluation of long-form text through a benchmark dataset in biomedicine [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 13]. Available from: https://doi.org/10.48550/arXiv.2411.09255

5. Rafailov R, Sharma A, Mitchell E, Manning CD, Ermon S, Finn C. Direct preference optimization: your language model is secretly a reward model. Adv Neural Inf Process Syst 2023;36:53728-41.

6. Hong J, Lee N, Thorne J. ORPO: monolithic preference optimization without reference model. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; 2024 Nov 12–16. Miami, FL, USA; p. 11170-89. https://doi.org/10.18653/v1/2024.emnlpmain.626
7. Kweon S, Choi B, Chu G, Song J, Hyeon D, Gan S, et al. KorMedMCQA: multi-choice question answering benchmark for Korean healthcare professional licensing examinations [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 13]. Available from: https://doi.org/10.48550/arXiv.2403.01469

8. Meta AI. Introducing Llama 3.1: our most capable models to date [Internet]. Menlo Park (CA): Meta; 2024 [cited at 2024 Apr 27]. Available from: https://ai.meta.com/blog/meta-llama-3-1/

9. Conover M, Hayes M, Mathur A, Xie J, Wan J, Shah S, et al. Free Dolly: introducing the world’s first truly open instruction-tuned LLM [Internet]. San Francisco (CA): Databricks; 2023 [cited at 2024 Apr 27]. Available from: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

10. The Mosaic Research Team. Introducing MPT-7B: a new standard for open-source, commercially usable LLMs [Internet]. San Francisco (CA): Databricks; 2023 [cited at 2024 Apr 27]. Available from: https://www.databricks.com/blog/mpt-7b

11. Yang A, Yang B, Hui B, Zheng B, Yu B, Zhou C, et al. Qwen2 technical report [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 13]. Available from: https://doi.org/10.48550/arXiv.2407.10671

12. Grattafiori A, Dubey A, Jauhri A, Pandey A, Kadian A, Al-Dahle A, et al. The Llama 3 herd of models [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 13]. Available from: https://doi.org/10.48550/arXiv.2407.21783

13. Shen W, Zheng R, Zhan W, Zhao J, Dou S, Gui T, et al. Loose lips sink ships: mitigating length bias in reinforcement learning from human feedback. In: Bouamor H, Pino J, Bali K, editors. Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg (PA): Association for Computational Linguistics; 2023. p. 2589-73. https://doi.org/10.18653/v1/2023.findings-emnlp.188
14. Pal A, Umapathi LK, Sankarasubbu M. Med-HALT: medical domain hallucination test for large language models. Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL); 2023 Dec 6–7. Singapore; p. 314-34. https://doi.org/10.18653/v1/2023.conll-1.21
15. Riviere M, Pathak S, Sessa PG, Hardin C, Bhupatiraju S, Hussenot L, et al. Gemma 2: improving open language models at a practical size [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 13]. Available from: https://doi.org/10.48550/arXiv.2408.00118

16. Shen JH, Sharma A, Qin J. Towards data-centric rlhf: Simple metrics for preference dataset comparison [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 13]. Available from: https://doi.org/10.48550/arXiv.2409.09603

17. Dao T. Flashattention-2: faster attention with better parallelism and work partitioning [Internet]. Ithaca (NY): arXiv.org; 2023 [cited at 2025 Apr 13]. Available from: https://doi.org/10.48550/arXiv.2307.08691

18. Sutawika L, Schoelkopf H, Gao L, Abbasi B, Biderman S, Tow J, et al. EleutherAI/lm-evaluation-harness: v0.4.3 (v0.4.3) [Internet]. Geneva, Switzerland: Zenodo; 2024 [cited at 2025 Apr 13]. Available from: https://doi.org/10.5281/zenodo.12608602

19. Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. Proceedings of the Conference on Health, Inference, and Learning (CHIL); 2022 Apr 7–8. Virtual Event; p. 248-60.

20. Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X. PubMedQA: a dataset for biomedical research question answering. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019 Nov 3–7. Hong Kong, China; p. 2567-77. https://doi.org/10.18653/v1/D19-1259
21. Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzman F, et al. Unsupervised cross-lingual representation learning at scale [Internet]. Ithaca (NY): arXiv.org; 2020 [cited at 2025 Apr 13]. Available from: https://doi.org/10.48550/arXiv.1911.02116

22. Zhu W, Lv Y, Dong Q, Yuan F, Xu J, Huang S, et al. Extrapolating large language models to non-English by aligning languages [Internet]. Ithaca (NY): arXiv.org; 2023 [cited at 2025 Apr 13]. Available from: https://doi.org/10.48550/arXiv.2308.04948
