I. Introduction
Recruiting a sufficient number of participants who meet specified eligibility criteria is essential for every clinical trial. The manual review of clinical data and identification of eligible cases often constitute the most labor-intensive aspect of the trial process [1].
In recent years, the development of electronic health records (EHR) has partially streamlined this process. However, the reliance on unstructured and ambiguous text in EHRs continues to pose significant challenges. To address this issue, researchers have increasingly adopted natural language processing (NLP) techniques [2].
The field of NLP is currently undergoing a revolution driven by rapid advances in large language models (LLMs). The application of LLMs to the medical domain has attracted considerable interest [3–5], with use cases ranging from clinical documentation [6, 7] and decision support [8, 9] to knowledge-based information retrieval and generation [10, 11], medical research [12–15], and data processing and analysis [16, 17]. However, the use of LLMs for clinical trial matching remains in its early stages [18].
The primary objective of this study was to investigate the effectiveness of prompt-based learning models for cohort selection in clinical trials using unstructured data from EHRs. We aimed to determine whether prompt-based methods can achieve results comparable to, or better than, those obtained with traditional NLP techniques. Beyond benchmarking model performance, we also assessed whether structured summarization could enhance eligibility classification.
We focus on a specific task, cohort identification in the 2018 National NLP Clinical Challenges (n2c2), which provided a dataset of free-text medical records from 288 patients.
IV. Discussion
The application of a prompt-based learning model in this study yielded promising results. To further illustrate the process, we summarized the results for one representative criterion, “ALCOHOL-ABUSE,” in the Supplementary Materials. Supplement B provides the list of SNOMED CT terms identified for “ALCOHOL-ABUSE.” Supplement C presents, for each test record, the selected sentences containing one of these terms, the ground truth label, and the model’s prediction for comparison.
Our model achieved overall micro and macro F-scores of 0.9061 and 0.8060, respectively, demonstrating acceptable performance across all criteria. To evaluate performance by criterion, we categorized the criteria based on the NLP methods required, as described by Stubbs et al. [27]:
(1) Concept extraction: Four criteria—ABDOMINAL, MAJOR-DIABETES, CREATININE, and HBA1C—primarily required the extraction of clinical terms. We addressed this by mapping each criterion to relevant SNOMED CT concepts. While standardized ontologies improve consistency, reproducibility, and completeness, challenges remain. Many records in the challenge dataset used rare medical terms, general pathology terms rather than precise category names, or uncommon abbreviations.
(2) Temporal reasoning: Another four criteria (DIETSUPP-2MOS, MI-6MOS, ADVANCED-CAD, and KETO-1YR) necessitated temporal processing. Since the data consisted of medical records from different time points, these criteria required time-aware analysis. Relevant sentences and their associated timestamps were extracted for these criteria.
For DIETSUPP-2MOS, a key advantage was leveraging the SNOMED CT ontology to identify supplement names, instead of relying on manually curated dictionaries. The main limitation, however, was incomplete coverage of commercial names and abbreviations.
For MI-6MOS, although our system achieved the second-highest F-score among ML methods, the main limitation was false negatives due to gaps in the SNOMED CT terms used for sentence selection. To address this, we used more general concepts such as “ischemic heart disease” or “disorders of coronary artery” rather than only “myocardial infarction,” but some records still lacked direct references to infarction.
For ADVANCED-CAD, results showed accurate selection of relevant sentences and identification of required time points. However, the exact definitions for required signs and symptoms were not always specified. For example, some antihypertensive drugs may be prescribed to patients with ischemic heart disease absent definitive hypertension, and it is unclear whether such cases meet the criterion.
(3) Inference: Our model outperformed other ML methods for ASP-FOR-MI and ENGLISH, but MAKES-DECISIONS was more challenging. Error analysis revealed that most misclassifications were due to difficulties in concept extraction rather than logical inference. MAKES-DECISIONS encompasses a broad clinical pathology with diverse signs and symptoms, making extraction of all relevant SNOMED CT concepts challenging.
For DRUG-ABUSE and ALCOHOL-ABUSE, our model performed well, with only a few false negatives due to missing rare drug names in the SNOMED CT concept list.
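The gap between the micro and macro F-scores reported above follows from how the two averages weight individual criteria. The sketch below uses the generic definitions: micro-averaging pools the counts of all criteria before computing one F-score, whereas macro-averaging computes one F-score per criterion and takes their unweighted mean. The criterion names and counts are invented for illustration, and the official challenge scorer may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 written directly in terms of counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(per_criterion: dict[str, Counts]) -> float:
    """Pool the counts of all criteria, then compute a single F1."""
    tp = sum(c.tp for c in per_criterion.values())
    fp = sum(c.fp for c in per_criterion.values())
    fn = sum(c.fn for c in per_criterion.values())
    return f1(tp, fp, fn)


def macro_f1(per_criterion: dict[str, Counts]) -> float:
    """Compute F1 per criterion, then take the unweighted mean."""
    scores = [f1(c.tp, c.fp, c.fn) for c in per_criterion.values()]
    return sum(scores) / len(scores)


# Hypothetical counts: one frequent, easy criterion and one rare, hard one.
toy_counts = {
    "FREQUENT": Counts(tp=90, fp=5, fn=5),
    "RARE": Counts(tp=1, fp=2, fn=2),
}

print(f"micro F1 = {micro_f1(toy_counts):.4f}")
print(f"macro F1 = {macro_f1(toy_counts):.4f}")
```

In this toy setting the rare criterion barely moves the micro average but pulls the macro average down sharply, which is the usual reason a macro F-score falls below the micro F-score when a few low-prevalence criteria are hard to classify.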
Compared to rule-based systems, ML models offer greater efficiency and adaptability by learning directly from data. Their scalability is evident in their ability to utilize larger, more diverse datasets without the need for rule modifications. Thus, ML techniques are well-suited for handling increasingly complex clinical data [28].
Moreover, the strengths of GPT models for cohort selection become even more apparent with larger datasets. First, in clinical practice, generating labeled training data is often cumbersome; GPT models are advantageous because they do not require task-specific training. Second, GPT models are ideal when a language model with a broad knowledge base is needed, as they are extensively pre-trained on a variety of text sources. Third, pre-processing is typically one of the most time-consuming and expertise-dependent steps in NLP pipelines. Achieving strong results with GPT models without extensive pre-processing can expedite the clinical matching process and reduce reliance on domain experts [29, 30].
Additionally, we introduced an extractive summarization method using the SNOMED CT ontology and an annotation tool to capture essential information from source texts. This approach not only addresses the input length limitations of LLMs but also enhances prompt-based model performance.
Supplement D provides an example: it displays an unsummarized free-text record alongside its summarized version, the prompt for one eligibility criterion (major abdominal surgery), the true label, the GPT-3.5 prediction using summarized data, and the GPT-4 prediction without summarization. Notably, GPT-4 misclassified the record, while GPT-3.5 with summarization produced the correct label—highlighting the value of summarization for removing irrelevant content and improving classification accuracy.
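The sentence-selection idea behind this summarization step can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the term list is a set of plain strings standing in for SNOMED CT concept labels, the sentence splitter is deliberately naive, and the names `Note`, `select_sentences`, `reference`, and `window_days` are hypothetical. The optional time window shows how the timestamps attached to each record could support temporal criteria.

```python
import re
from datetime import date, timedelta
from typing import NamedTuple, Optional


class Note(NamedTuple):
    recorded: date  # date of the clinical note
    text: str       # free-text content


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def select_sentences(
    notes: list[Note],
    terms: list[str],
    reference: Optional[date] = None,
    window_days: Optional[int] = None,
) -> list[tuple[date, str]]:
    """Keep only sentences that mention one of the terms.

    If both `reference` and `window_days` are given, notes recorded after
    the reference date or more than `window_days` before it are skipped.
    """
    if not terms:
        return []
    # Whole-word, case-insensitive match on any term (longest terms first).
    alternatives = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

    selected = []
    for note in notes:
        if reference is not None and window_days is not None:
            age = reference - note.recorded
            if age < timedelta(0) or age > timedelta(days=window_days):
                continue
        for sentence in split_sentences(note.text):
            if pattern.search(sentence):
                selected.append((note.recorded, sentence))
    return selected


# Invented records and term list, for illustration only.
notes = [
    Note(date(2017, 1, 10), "Patient denies chest pain. History of myocardial infarction in 2015."),
    Note(date(2018, 3, 1), "Admitted with acute MI. Started on aspirin. Family is supportive."),
]
terms = ["myocardial infarction", "MI"]

all_hits = select_sentences(notes, terms)
recent_hits = select_sentences(notes, terms, reference=date(2018, 4, 1), window_days=183)
```

Only the selected sentences, together with their dates, would then be passed to the model, which shortens the input and removes content unrelated to the criterion. Case-insensitive matching of short abbreviations is a simplification that a real system would need to handle more carefully.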
While newer models such as GPT-4, with larger context windows and improved reasoning, offer promising alternatives, they come with higher computational costs and infrastructure requirements. Our study suggests that an ontology-based summarization pipeline can improve LLM performance, even for models with limited input capacity.
In conclusion, our automated clinical trial matching solution streamlines what is typically a manual, time-consuming recruitment process by combining ontology-based summarization with prompt-based LLM classification. A key direction for future work is the integration of structured summarization with chain-of-thought (CoT) prompting to further improve clinical eligibility classification. CoT techniques facilitate the decomposition of complex instructions into intermediate reasoning steps, which may enhance accuracy, particularly for tasks involving inference, temporality, or implicit logic.
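As a rough illustration of this future direction, the sketch below assembles a CoT-style prompt from a summarized record and parses the final label from a model response. Everything here is an assumption made for illustration: the prompt wording, the reasoning steps, the "met" / "not met" answer format, and the function names `build_cot_prompt` and `parse_answer` are hypothetical and do not reproduce the prompts used in the study. No model is called; the response string is supplied by hand.

```python
import re
from typing import Optional


def build_cot_prompt(criterion: str, definition: str, evidence: list[str]) -> str:
    """Combine a criterion, its definition, and summarized evidence into one prompt."""
    bullet_list = "\n".join(f"- {s}" for s in evidence) or "- (no relevant sentences found)"
    return (
        f"Eligibility criterion: {criterion}\n"
        f"Definition: {definition}\n\n"
        "Evidence extracted from the patient record:\n"
        f"{bullet_list}\n\n"
        "Think step by step:\n"
        "1. List the evidence relevant to the criterion.\n"
        "2. Check any time constraint in the definition against the evidence.\n"
        "3. State whether the criterion is satisfied.\n\n"
        'Finish with a single line: "Answer: met" or "Answer: not met".'
    )


def parse_answer(response: str) -> Optional[str]:
    """Return 'met' or 'not met' from the last 'Answer:' line, or None if absent."""
    matches = re.findall(r"answer:\s*(not met|met)\b", response, flags=re.IGNORECASE)
    return matches[-1].lower() if matches else None


# Illustrative inputs; the definition wording is an assumption.
prompt = build_cot_prompt(
    criterion="MI-6MOS",
    definition="Myocardial infarction in the past 6 months.",
    evidence=["2018-03-01: Admitted with acute MI."],
)

# Hand-written stand-in for a model response.
example_response = (
    "1. The record mentions an acute MI on 2018-03-01.\n"
    "2. That date falls within 6 months of the reference date.\n"
    "3. The criterion is satisfied.\n"
    "Answer: met"
)
label = parse_answer(example_response)
```

Requiring a fixed final line keeps the intermediate reasoning free-form while still allowing the label to be extracted automatically for evaluation.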