1. Fink MA, Bischoff A, Fink CA, Moll M, Kroschke J, Dulz L, et al. Potential of ChatGPT and GPT-4 for data mining of free-text CT reports on lung cancer. Radiology 2023;308(3):e231362.
https://doi.org/10.1148/radiol.231362
2. Gu K, Lee JH, Shin J, Hwang JA, Min JH, Jeong WK, et al. Using GPT-4 for LI-RADS feature extraction and categorization with multilingual free-text reports. Liver Int 2024;44(7):1578-87.
https://doi.org/10.1111/liv.15891
8. Lau W, Payne TH, Uzuner O, Yetisgen M. Extraction and analysis of clinically important follow-up recommendations in a large radiology dataset. AMIA Jt Summits Transl Sci Proc 2020;2020:335-44.
9. Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. Proc AAAI Conf Artif Intell 2019;33(1):590-7.
https://doi.org/10.1609/aaai.v33i01.3301590
10. Adams LC, Truhn D, Busch F, Kader A, Niehues SM, Makowski MR, et al. Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting: a multilingual feasibility study. Radiology 2023;307(4):e230725.
https://doi.org/10.1148/radiol.230725
13. Kim S, Kim D, Shin HJ, Lee SH, Kang Y, Jeong S, et al. Large-scale validation of the feasibility of GPT-4 as a proofreading tool for head CT reports. Radiology 2025;314(1):e240701.
https://doi.org/10.1148/radiol.240701
14. Nguyen D, Swanson D, Newbury A, Kim YH. Evaluation of ChatGPT and Google Bard using prompt engineering in cancer screening algorithms. Acad Radiol 2024;31(5):1799-804.
https://doi.org/10.1016/j.acra.2023.11.002
16. Savage CH, Park H, Kwak K, Smith AD, Rothenberg SA, Parekh VS, et al. General-purpose large language models versus a domain-specific natural language processing tool for label extraction from chest radiograph reports. AJR Am J Roentgenol 2024;222(4):e2330573.
https://doi.org/10.2214/AJR.23.30573
17. Dong Q, Li L, Dai D, Zheng C, Ma J, Li R, et al. A survey on in-context learning [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Jul 1]. Available from:
https://arxiv.org/abs/2301.00234
18. Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, et al. Chain-of-thought prompting elicits reasoning in large language models. Adv Neural Inf Process Syst 2022;35:24824-37.
19. Liu J, Shen D, Zhang Y, Dolan B, Carin L, Chen W. What makes good in-context examples for GPT-3? [Internet]. Ithaca (NY): arXiv.org; 2021 [cited at 2025 Jul 1]. Available from:
https://arxiv.org/abs/2101.06804
22. Larson PA, Berland LL, Griffith B, Kahn CE Jr, Liebscher LA. Actionable findings and the role of IT support: report of the ACR Actionable Reporting Work Group. J Am Coll Radiol 2014;11(6):552-8.
https://doi.org/10.1016/j.jacr.2013.12.016
23. Stureborg R, Alikaniotis D, Suhara Y. Large language models are inconsistent and biased evaluators [Internet]. Ithaca (NY): arXiv.org; 2024 [cited at 2025 Jul 1]. Available from:
https://arxiv.org/abs/2405.01724
24. Krishna S, Bhambra N, Bleakney R, Bhayana R. Evaluation of reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 on a radiology board-style examination. Radiology 2024;311(2):e232715.
https://doi.org/10.1148/radiol.232715
29. Lopez-Ubeda P, Martin-Noguerol T, Luna A. Automatic classification and prioritisation of actionable BI-RADS categories using natural language processing models. Clin Radiol 2024;79(1):e1-e7.
https://doi.org/10.1016/j.crad.2023.09.009
30. Wei J, Wei J, Tay Y, Tran D, Webson A, Lu Y, et al. Larger language models do in-context learning differently [Internet]. Ithaca (NY): arXiv.org; 2023 [cited at 2025 Jul 1]. Available from:
https://arxiv.org/abs/2303.03846