1. Wang C, Li M, He J, Wang Z, Darzi E, Chen Z, et al. A survey for large language models in biomedicine [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 13]. Available from:
https://doi.org/10.48550/arXiv.2409.00133
3. Bi Z, Dip SA, Hajialigol D, Kommu S, Liu H, Lu M, et al. AI for biomedicine in the era of large language models [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 13]. Available from:
https://doi.org/10.48550/arXiv.2403.15673
4. Seo J, Lim J, Jang D, Shin H. DAHL: domain-specific automated hallucination evaluation of long-form text through a benchmark dataset in biomedicine [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 13]. Available from:
https://doi.org/10.48550/arXiv.2411.09255
5. Rafailov R, Sharma A, Mitchell E, Manning CD, Ermon S, Finn C. Direct preference optimization: your language model is secretly a reward model. Adv Neural Inf Process Syst 2023;36:53728-41.
6. Hong J, Lee N, Thorne J. ORPO: monolithic preference optimization without reference model. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; 2024 Nov 12–16. Miami, FL, USA; p. 11170-89.
https://doi.org/10.18653/v1/2024.emnlp-main.626
7. Kweon S, Choi B, Chu G, Song J, Hyeon D, Gan S, et al. KorMedMCQA: multi-choice question answering benchmark for Korean healthcare professional licensing examinations [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 13]. Available from:
https://doi.org/10.48550/arXiv.2403.01469
10. The Mosaic Research Team. Introducing MPT-7B: a new standard for open-source, commercially usable LLMs [Internet]. San Francisco (CA): Databricks; 2023 [cited at 2024 Apr 27]. Available from:
https://www.databricks.com/blog/mpt-7b
12. Grattafiori A, Dubey A, Jauhri A, Pandey A, Kadian A, Al-Dahle A, et al. The Llama 3 herd of models [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 13]. Available from:
https://doi.org/10.48550/arXiv.2407.21783
13. Shen W, Zheng R, Zhan W, Zhao J, Dou S, Gui T, et al. Loose lips sink ships: mitigating length bias in reinforcement learning from human feedback. In: Bouamor H, Pino J, Bali K, editors. Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg (PA): Association for Computational Linguistics; 2023. p. 2859-73.
https://doi.org/10.18653/v1/2023.findings-emnlp.188
14. Pal A, Umapathi LK, Sankarasubbu M. Med-HALT: medical domain hallucination test for large language models. Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL); 2023 Dec 6–7. Singapore; p. 314-34.
https://doi.org/10.18653/v1/2023.conll-1.21
15. Riviere M, Pathak S, Sessa PG, Hardin C, Bhupatiraju S, Hussenot L, et al. Gemma 2: improving open language models at a practical size [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 13]. Available from:
https://doi.org/10.48550/arXiv.2408.00118
16. Shen JH, Sharma A, Qin J. Towards data-centric RLHF: simple metrics for preference dataset comparison [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 13]. Available from:
https://doi.org/10.48550/arXiv.2409.09603
17. Dao T. FlashAttention-2: faster attention with better parallelism and work partitioning [Internet]. Ithaca (NY): arXiv.org; 2023 [cited at 2025 Apr 13]. Available from:
https://doi.org/10.48550/arXiv.2307.08691
18. Sutawika L, Schoelkopf H, Gao L, Abbasi B, Biderman S, Tow J, et al. EleutherAI/lm-evaluation-harness: v0.4.3 [Internet]. Geneva (Switzerland): Zenodo; 2024 [cited at 2025 Apr 13]. Available from:
https://doi.org/10.5281/zenodo.12608602
19. Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. Proceedings of the Conference on Health, Inference, and Learning (CHIL); 2022 Apr 7–8. Virtual Event; p. 248-60.
20. Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X. PubMedQA: a dataset for biomedical research question answering. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019 Nov 3–7. Hong Kong, China; p. 2567-77.
https://doi.org/10.18653/v1/D19-1259
21. Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzman F, et al. Unsupervised cross-lingual representation learning at scale [Internet]. Ithaca (NY): arXiv.org; 2020 [cited at 2025 Apr 13]. Available from:
https://doi.org/10.48550/arXiv.1911.02116
22. Zhu W, Lv Y, Dong Q, Yuan F, Xu J, Huang S, et al. Extrapolating large language models to non-English by aligning languages [Internet]. Ithaca (NY): arXiv.org; 2023 [cited at 2025 Apr 13]. Available from:
https://doi.org/10.48550/arXiv.2308.04948