1. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need [Internet]. Ithaca (NY): arXiv.org; 2017 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/1706.03762
2. Minaee S, Mikolov T, Nikzad N, Chenaghlu M, Socher R, Amatriain X, et al. Large language models: a survey [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2402.06196v1
6. Huang K, Altosaar J, Ranganath R. ClinicalBERT: modeling clinical notes and predicting hospital readmission [Internet]. Ithaca (NY): arXiv.org; 2019 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/1904.05342v1
20. Brodeur PG, Buckley TA, Kanjee Z, Goh E, Ling EB, Jain P, et al. Superhuman performance of a large language model on the reasoning tasks of a physician [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2412.10849
21. McDuff D, Schaekermann M, Tu T, Palepu A, Wang A, Garrison J, et al. Towards accurate differential diagnosis with large language models. Nature. 2025 Apr 9 [Epub].
https://doi.org/10.1038/s41586-025-08869-4
23. Tu T, Azizi S, Driess D, Schaekermann M, Amin M, Chang PC, et al. Towards generalist biomedical AI [Internet]. Ithaca (NY): arXiv.org; 2023 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2307.14334
24. Ali R, Tang OY, Connolly ID, Fridley JS, Shin JH, Zadnik Sullivan PL, et al. Performance of ChatGPT, GPT-4, and Google Bard on a neurosurgery oral boards preparation question bank. Neurosurgery 2023;93(5):1090-8.
https://doi.org/10.1227/neu.0000000000002551
25. Saab K, Tu T, Weng WH, Tanno R, Stutz D, Wulczyn E, et al. Capabilities of Gemini models in medicine [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2404.18416
28. Chen H, Fang Z, Singla Y, Dredze M. Benchmarking large language models on answering and explaining challenging medical questions [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2402.18060v1
38. Rawte V, Sheth A, Das A. A survey of hallucination in large foundation models [Internet]. Ithaca (NY): arXiv.org; 2023 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2309.05922
39. Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of hallucination in natural language generation [Internet]. Ithaca (NY): arXiv.org; 2022 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2202.03629v1
40. Huang L, Yu W, Ma W, Zhong W, Feng Z, Wang H, et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions [Internet]. Ithaca (NY): arXiv.org; 2023 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2311.05232v1
41. Tonmoy SM, Zaman SM, Jain V, Rani A, Rawte V, Chadha A, et al. A comprehensive survey of hallucination mitigation techniques in large language models [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2401.01313
43. Gallifant J, Afshar M, Ameen S, Aphinyanaphongs Y, Chen S, Cacciamani G, et al. The TRIPOD-LLM reporting guideline for studies using large language models. Nat Med 2025;31(1):60-9.
https://doi.org/10.1038/s41591-024-03425-5
45. Behnia R, Ebrahimi M, Pacheco J, Padmanabhan B. Privately fine-tuning large language models with differential privacy [Internet]. Ithaca (NY): arXiv.org; 2022 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2210.15042v1
46. Charles Z, Ganesh A, McKenna R, McMahan HB, Mitchell N, Pillutla K, et al. Fine-tuning large language models with user-level differential privacy [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2407.07737
47. Zhao H, Chen H, Yang F, Liu N, Deng H, Cai H, et al. Explainability for large language models: a survey [Internet]. Ithaca (NY): arXiv.org; 2023 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2309.01029
48. Goldshmidt R, Horovicz M. TokenSHAP: interpreting large language models with Monte Carlo Shapley value estimation [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2407.10114
49. Kau A, He X, Nambissan A, Astudillo A, Yin H, Aryani A. Combining knowledge graphs and large language models [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2407.06564
50. Li Z, Shi Y, Liu Z, Yang F, Payani A, Liu N, et al. Language ranker: a metric for quantifying LLM performance across high and low-resource languages [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2404.11553
56. Kim Y, Park C, Jeong H, Grau-Vilchez C, Chan YS, Xu X, et al. A demonstration of adaptive collaboration of large language models for medical decision-making [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2411.00248
57. Kim Y, Park C, Jeong H, Chan YS, Xu X, McDuff D, et al. MDAgents: an adaptive collaboration of LLMs for medical decision-making [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2404.15155
59. Hong EK, Roh B, Park B, Jo JB, Bae W, Soung Park J, et al. Value of using a generative AI model in chest radiography reporting: a reader study. Radiology 2025;314(3):e241646.
https://doi.org/10.1148/radiol.241646
60. Kim MJ, Pertsch K, Karamcheti S, Xiao T, Balakrishna A, Nair S, et al. OpenVLA: an open-source vision-language-action model [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Apr 15]. Available from:
https://arxiv.org/abs/2406.09246