I. Introduction
Recent advancements in artificial intelligence (AI) have significantly transformed healthcare by addressing both clinical and administrative challenges. Among these innovations, chatbots have emerged as promising tools for improving healthcare service delivery and patient engagement [1–6]. By leveraging natural language processing and machine learning, these systems can assist with tasks such as providing medical guidance, scheduling appointments, and supporting remote monitoring [7].
However, conventional chatbot architectures often struggle with contextual reasoning, limiting their effectiveness in handling complex medical inquiries. They may generate responses that lack relevance or consistency, undermining user trust and reliability [8–10]. Retrieval-augmented generation (RAG) technology addresses these limitations by combining retrieval-based search with generative capabilities, enabling chatbots to access relevant information and generate contextually appropriate responses [11]. This approach enhances information retrieval efficiency, reduces response times, and improves the overall usability of chatbots in healthcare scenarios, including clinical decision support and patient education [12–20].
In this study, we developed and evaluated a RAG-based chatbot system aimed at improving hospital operations by integrating electronic medical record (EMR) system manuals as external knowledge sources. By optimizing both information retrieval and response generation, the system seeks to streamline administrative processes such as billing and insurance procedures, ultimately reducing the workload for healthcare staff and improving service efficiency.
The study is structured as follows: Section II describes the methodology, including data preparation, embedding model fine-tuning, vector database implementation, query processing, and response generation. Section III presents the experimental results and model evaluations. Section IV discusses the implications, potential applications, and limitations, and offers directions for future research.
II. Methods
1. System Architecture for RAG-based EMR Chatbot
The proposed system architecture integrates a RAG-based chatbot with the EMR system to enhance both operational efficiency and information accessibility (Figure 1).
At its core, the system employs a vector database to manage embedding generation, vectorized storage, and similarity search. This design enables seamless interaction between user queries and the EMR database while maintaining high retrieval accuracy. Indexed EMR documents are stored together with their corresponding vector embeddings, which are generated by a fine-tuned, domain-specific embedding model (see Section II-2). Incoming user queries are processed using the same embedding model to ensure alignment with the pre-indexed data, thus enabling accurate similarity search and retrieval (Section II-3). Retrieved documents are subsequently processed by the response generation module, which utilizes a pre-trained large language model (LLM) to generate coherent, contextually relevant responses. This approach ensures both precision and usability for healthcare professionals and patients (Section II-4). The LLM is further fine-tuned to prioritize medical accuracy and clarity, while adhering to domain-specific requirements. By integrating vector-based retrieval and language generation, the system delivers accurate, efficient, and secure access to EMR data, streamlining both administrative and clinical workflows.
2. Fine-tuning Embedding Model
To maximize retrieval accuracy and efficiency, the embedding model is fine-tuned through a process tailored to the unique manual data of each participating hospital. This hospital-specific approach maintains dedicated, fine-tuned models for each facility, ensuring optimal performance. The fine-tuning process converts textual information from hospital manuals into high-dimensional embedding vectors, which are systematically stored in the vector database to facilitate efficient query processing. The following steps detail the fine-tuning workflow, reflecting best practices and recent advancements in the field.
1) Embedding model selection
This study evaluated several multilingual embedding models for EMR manual processing, ultimately selecting Multilingual-E5-Large [15] and BGE-M3 [16] based on retrieval performance, computational efficiency, and multilingual capabilities. Commercial models such as text-embedding-ada-002 and text-embedding-3-large were excluded due to higher costs, limited options for fine-tuning, and lower performance in non-English languages.
Both Multilingual-E5-Large and BGE-M3 were chosen for their support of long sequences and computational efficiency, ensuring smooth integration into the RAG-based chatbot system.
2) Data source for embedding model training
The primary dataset consisted of the EMR system usage manual, which covered system functionality, user operations, maintenance procedures, and troubleshooting. Its structured content provided a robust foundation for training embedding models specialized in healthcare administration. The manual included detailed descriptions of system architecture, operational workflows, and step-by-step guidelines, enabling the models to effectively process EMR-related queries.
3) Data preprocessing
The dataset, summarized in Table 1, underwent thorough preprocessing. Text was extracted from 267 PDFs (totaling 1,631 pages) covering system functionality and maintenance topics. After extraction, a cleaning process removed irrelevant characters, formatting artifacts, and errors to ensure high data quality. Query augmentation generated three unique queries per page, expanding the dataset to 5,931 entries.
The final dataset was structured into three main columns: query, file name, and document text. This rigorous preprocessing workflow ensured both consistency and diversity, providing a strong foundation for embedding model fine-tuning.
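The three-column structure described above can be illustrated with a minimal Python sketch. The cleaning rules (dropping control characters and collapsing whitespace) and the names `clean_text` and `build_rows` are assumptions made for illustration; the study does not specify its exact cleaning steps.

```python
import re


def clean_text(raw: str) -> str:
    """Assumed cleaning rules: drop control characters, collapse whitespace."""
    no_control = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", raw)
    return re.sub(r"\s+", " ", no_control).strip()


def build_rows(file_name: str, page_text: str, queries: list) -> list:
    """Return one row per generated query, all pointing at the same page text."""
    text = clean_text(page_text)
    return [
        {"query": query, "file_name": file_name, "document_text": text}
        for query in queries
    ]


# Hypothetical page with three generated queries.
rows = build_rows(
    "manual_a.pdf",
    "Step 1:\n\n  open   the menu\x0c",
    ["How do I open the menu?", "Where is the menu?", "What is step 1?"],
)
```

Each page contributes as many rows as it has generated queries, so the query column varies while the file name and document text repeat.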
4) Training dataset
The training dataset was developed using a structured query generation and validation process. The GPT-3.5-turbo API [19] was utilized to generate question-document pairs from the EMR manual, segmenting the text and generating contextually relevant questions via structured prompts. For each manual page, three distinct questions were generated to simulate real-world queries from healthcare professionals, resulting in 5,931 question-document pairs covering all key EMR functionalities and use cases. To ensure accuracy and clinical relevance, three experienced nurses validated the dataset, refining terminology and clarifying context as needed. Their expert review ensured that the dataset was well-aligned with practical healthcare information needs.
The GPT-3.5-turbo prompt template was constructed in both Korean and English to maintain consistency in question generation and enhance the effectiveness of model training (Figure 2).
5) Test set creation
A test set of 134 question-answer pairs was constructed by medical experts to evaluate retrieval performance. These questions encompassed factual, procedural, and scenario-based queries related to EMR system functionality, with each answer linked to a ground truth reference in the manual for accuracy. Three experienced nurses validated this dataset as well, reviewing each pair for clinical relevance and accuracy. Their feedback refined terminology and improved contextual clarity, ensuring alignment with real-world healthcare workflows. The validated test set was used as a benchmark to assess retrieval accuracy and relevance, ensuring that the chatbot could effectively support healthcare professionals’ information needs.
6) Model training and evaluation
A structured fine-tuning approach was applied to the two candidate embedding models, Multilingual-E5-Large and BGE-M3. The training dataset consisted of 5,931 question-document pairs, carefully curated from the EMR manual. Both the questions and document texts were tokenized using the respective native tokenizers of the embedding models to ensure optimal data formatting, which enhances the models’ ability to generate meaningful embeddings. The fine-tuning process was based on contrastive learning, with cosine similarity used as the optimization objective. This approach encouraged the models to maximize similarity between matched question-document pairs while minimizing similarity for unmatched pairs.
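As an illustration of this objective, the following sketch computes an InfoNCE-style loss with in-batch negatives over cosine similarities. This specific loss form is an assumption: the text states only that contrastive learning with cosine similarity was used. The default temperature mirrors the value of 0.02 reported for BGE-M3 in the Results, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=0.02):
    """Mean cross-entropy where doc i is the positive for query i and
    every other document in the batch serves as a negative (assumed setup)."""
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [cosine(query, doc) / temperature for doc in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch.
        max_logit = max(logits)
        log_sum = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
matched_docs = [[1.0, 0.0], [0.0, 1.0]]
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = in_batch_contrastive_loss(queries, matched_docs)
loss_swapped = in_batch_contrastive_loss(queries, swapped_docs)
```

The loss is small when each question is most similar to its own document and large when it is more similar to another document in the batch, which is the behavior the fine-tuning objective rewards.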
We employed top-K accuracy metrics for model evaluation on a validated test set of 46 queries. The top-K accuracy was calculated as follows:
$$\text{Top-}K\ \text{accuracy} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{K} \mathbb{1}(doc_{ij} = y_i)$$
where $N$ is the total number of test queries, $K$ represents the number of documents retrieved for each query, $doc_{ij}$ represents the $j$-th retrieved document for query $i$, and $y_i$ represents the correct document for query $i$. The indicator function $\mathbb{1}(doc_{ij} = y_i)$ equals 1 if $doc_{ij}$ matches $y_i$, and 0 otherwise. This metric was specifically chosen to reflect the study’s goal of developing an efficient information retrieval system, where presenting the correct result within the top five responses is essential for practical usability in healthcare workflows.
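The metric defined above can be expressed as a short Python function. This is a minimal sketch that assumes each query has exactly one correct document and that retrieved documents are represented by identifiers; the function name and the sample data are hypothetical.

```python
def top_k_accuracy(retrieved, correct, k=5):
    """Fraction of queries whose correct document is among the first k results.

    retrieved: list of ranked document-id lists, one per query.
    correct:   list of the single correct document id per query (assumed).
    """
    if not correct or len(retrieved) != len(correct):
        raise ValueError("retrieved and correct must be non-empty and aligned")
    hits = 0
    for ranked_docs, target in zip(retrieved, correct):
        if target in ranked_docs[:k]:
            hits += 1
    return hits / len(correct)


# Hypothetical ranked results for three queries.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
targets = ["b", "x", "i"]
acc_at_2 = top_k_accuracy(ranked, targets, k=2)
acc_at_3 = top_k_accuracy(ranked, targets, k=3)
```

Increasing K can only keep the score the same or raise it, since a hit within the first K results is also a hit within any larger cutoff.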
The comparative evaluation of the two fine-tuned models highlighted their respective strengths and weaknesses. These results informed the selection of the optimal embedding model for integration into the RAG-based chatbot system, as detailed in the Results section.
3. Vector Database Indexing and Query Processing
The vector database infrastructure was constructed through a systematic series of processes designed to maximize the performance of the RAG system. Figure 1 illustrates the detailed pipeline for vector database indexing and query processing.
1) Text preprocessing and chunking
The EMR system manual was preprocessed to prepare the text for embedding generation. A sliding window approach, using a 512-token window with a 128-token overlap, was employed to maintain contextual integrity while minimizing redundancy. Standard text normalization steps—including lowercase conversion, special character handling, and whitespace normalization—were applied to ensure clean, structured input.
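The chunking scheme can be sketched as follows. A 512-token window with a 128-token overlap implies a stride of 384 tokens between consecutive chunk starts. The sketch assumes the input has already been tokenized (integers stand in for tokens here), and the function name is hypothetical.

```python
def sliding_window_chunks(tokens, window=512, overlap=128):
    """Split a token sequence into overlapping chunks.

    Consecutive chunks share `overlap` tokens, so each new chunk starts
    `window - overlap` tokens after the previous one.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    stride = window - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the current chunk already reaches the end of the text
        start += stride
    return chunks


# Integers stand in for the tokens of a 1,000-token document.
sample_tokens = list(range(1000))
sample_chunks = sliding_window_chunks(sample_tokens)
```

For a 1,000-token input this yields three chunks starting at positions 0, 384, and 768, with the final chunk shorter than the window.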
2) Embedding generation
Two multilingual models, Multilingual-E5-Large and BGE-M3, were used to generate 1024-dimensional embeddings. Unlike E5-Large, which only supports English, Multilingual-E5-Large accommodates multiple languages. Both models underwent identical preprocessing to allow for precise performance comparisons. The embedding generation pipeline was optimized for parallel execution, which enhanced computational efficiency and increased processing speed for large-scale document embedding.
3) Vector indexing and database implementation
The vector database implemented a structured indexing and similarity search approach. Various strategies—including the Hierarchical Navigable Small World (HNSW) index—were evaluated. Distance metrics such as inner product and L2 (Euclidean) were tested to identify the most effective indexing method. Ultimately, the database adopted an IndexFlatL2 index, using a unified 1024-dimensional structure across models to ensure consistent vector representation and facilitate comparative analysis.
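A flat L2 index performs an exhaustive comparison of the query against every stored vector. The following pure-Python sketch illustrates that behavior on toy two-dimensional vectors; it is a stand-in for illustration only, not the FAISS implementation used in the study, and the function names are hypothetical.

```python
import heapq


def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def flat_l2_search(index_vectors, query, k=5):
    """Brute-force nearest-neighbor search.

    Returns up to k (squared_distance, position) pairs, closest first.
    """
    scored = (
        (squared_l2(vector, query), position)
        for position, vector in enumerate(index_vectors)
    )
    return heapq.nsmallest(k, scored)


# Toy 2-dimensional vectors; the study's embeddings are 1024-dimensional.
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
neighbors = flat_l2_search(stored, [0.9, 0.0], k=2)
```

Exhaustive search returns exact nearest neighbors at a cost that grows linearly with the number of stored vectors, whereas graph-based indices such as HNSW trade some exactness for speed.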
4) Metadata management system
A comprehensive metadata management system was developed to provide rich contextual information for each document embedding. The metadata schema included detailed document-specific attributes to improve retrieval precision and contextual understanding. Key metadata elements comprised unique document identifiers, sequential chunk positioning, verbatim content preservation, source section classification, source manual filename, and specific page numbers.
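The metadata elements listed above could be represented by a record such as the one below. The field names and types are hypothetical and chosen only to mirror the listed elements; the actual schema is not given in the text.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ChunkMetadata:
    """Hypothetical per-chunk metadata record mirroring the listed elements."""
    document_id: str   # unique document identifier
    chunk_index: int   # sequential position of the chunk within the document
    content: str       # verbatim chunk text
    section: str       # source section classification
    source_file: str   # source manual filename
    page_number: int   # page on which the chunk appears


# Hypothetical example record.
example_meta = ChunkMetadata(
    document_id="doc-0001",
    chunk_index=0,
    content="Select the patient and choose the transfer option.",
    section="Inpatient management",
    source_file="inpatient_manual.pdf",
    page_number=12,
)
example_payload = asdict(example_meta)
```

Storing the filename and page number with each chunk allows a retrieved passage to be traced back to its location in the source manual.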
5) Query processing pipeline
The pipeline processed user queries in parallel for both Multilingual-E5-Large and BGE-M3 models. Each query was converted into a vector representation and matched with stored document embeddings. FAISS (Facebook AI Similarity Search) indices were used initially for performance evaluation, enabling direct comparisons between models under identical conditions [21]. For the final system, Weaviate replaced FAISS, offering improved scalability and advanced metadata management. Relevant passages were retrieved based on similarity scores and then used as context for response generation. This hybrid approach (FAISS for benchmarking, Weaviate for deployment) ensured rigorous evaluation and a scalable production system.
6) Integration with the LLM
Retrieved content was passed to the LLM for response generation. The LLM synthesized accurate, contextually appropriate natural language responses. The system demonstrated robust performance, consistently maintaining an average query latency below 10 ms while delivering high retrieval accuracy.
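One way to pass retrieved content to the LLM is to assemble it into a single prompt, as in the sketch below. The prompt wording, the passage dictionary keys, and the `build_prompt` function are all hypothetical; the study does not report its prompt, and the call to the generation API is intentionally omitted.

```python
def build_prompt(question, passages, max_passages=3):
    """Combine the user question with the top retrieved passages.

    Each passage is a dict with hypothetical keys: content, source_file,
    page_number. Passages are assumed to be sorted by similarity already.
    """
    context_lines = []
    for rank, passage in enumerate(passages[:max_passages], start=1):
        context_lines.append(
            f"[{rank}] ({passage['source_file']}, p. {passage['page_number']}) "
            f"{passage['content']}"
        )
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the manual excerpts below.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved passages.
retrieved_passages = [
    {"content": "Select the patient, then choose Transfer.",
     "source_file": "transfer.pdf", "page_number": 4},
    {"content": "Enter the destination ward and confirm.",
     "source_file": "transfer.pdf", "page_number": 5},
    {"content": "Billing codes are listed in the appendix.",
     "source_file": "billing.pdf", "page_number": 2},
]
prompt = build_prompt("How do I request a patient transfer?",
                      retrieved_passages, max_passages=2)
```

Including the source filename and page number in the prompt makes it possible to show users which manual pages support a given answer.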
7) Parallel architecture design
The parallel query processing architecture enabled comparative analysis of retrieval performance across embedding models. This systematic design established a solid foundation for evaluating the real-world performance of the RAG-based chatbot system in EMR operational settings.
4. Response Generation
The Upstage Solar Mini Chat API [20] was selected over OpenAI’s GPT-3.5-turbo for its superior Korean text generation capabilities, making it particularly well-suited for RAG-based healthcare chatbots. Solar Mini Chat excels at generating contextually accurate responses by effectively incorporating supplementary input data beyond the initial query. It offers faster text generation than GPT-4-turbo, ensuring real-time responsiveness crucial for interactive chatbot applications. Additionally, its customization features support domain-specific models in law, finance, and healthcare, enhancing adaptability to diverse use cases. Comparative analysis demonstrated that Solar Mini Chat outperformed GPT-3.5-turbo in both text quality and cost efficiency, making it a practical choice for large-scale healthcare deployments. Its rapid response times, advanced handling of the Korean language, and significant cost advantages were key factors in its adoption as the text generation module for our RAG-based chatbot system. Ultimately, Solar Mini Chat ensured accurate, user-friendly, and cost-effective responses for end users.
III. Results
1. Fine-tuned Embedding Model Performance
A series of systematic hyperparameter optimization experiments was conducted to maximize the accuracy of the embedding models. The optimization results for Multilingual-E5-Large and BGE-M3 are summarized below.
For Multilingual-E5-Large, optimal performance was achieved by fine-tuning 12 training layers with a gradient accumulation of 2. The model was trained with a learning rate of 1e-4 and a batch size of 8 for 7 epochs.
For BGE-M3, a distinct set of optimal parameters emerged from our optimization process. The model performed best with a smaller train batch size of 1 and a more conservative learning rate of 1e-5. Training was performed with the fp32 data type over 5 epochs. The temperature was set to 0.02, and maximum lengths for queries and passages were configured at 64 and 1,024 tokens, respectively. Notably, self-distillation was enabled during training, which provided additional benefits to model performance.
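For reference, the reported settings can be collected into two configuration dictionaries, as sketched below. The key names are hypothetical and do not correspond to the arguments of any particular training script. The helper computes the effective per-device batch size as batch size multiplied by gradient accumulation steps, assuming no accumulation where none is reported.

```python
# Reported settings for Multilingual-E5-Large (key names are hypothetical).
E5_CONFIG = {
    "trainable_layers": 12,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1e-4,
    "batch_size": 8,
    "epochs": 7,
}

# Reported settings for BGE-M3 (key names are hypothetical).
BGE_M3_CONFIG = {
    "batch_size": 1,
    "learning_rate": 1e-5,
    "dtype": "fp32",
    "epochs": 5,
    "temperature": 0.02,
    "query_max_len": 64,
    "passage_max_len": 1024,
    "self_distillation": True,
}


def effective_batch_size(config):
    """Per-device examples per optimizer step (accumulation defaults to 1)."""
    return config["batch_size"] * config.get("gradient_accumulation_steps", 1)


e5_effective = effective_batch_size(E5_CONFIG)
bge_effective = effective_batch_size(BGE_M3_CONFIG)
```

Under these assumptions, Multilingual-E5-Large accumulates gradients over 16 examples per optimizer step on each device; the number of devices is not reported.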
Performance results for both models, using these optimized configurations, are presented in Table 2.
The fine-tuning experiments demonstrated substantial improvements in retrieval performance for both models when evaluated on the EMR system manual corpus. Both models began from an identical baseline accuracy of 84.126% and showed marked improvements after fine-tuning, though the magnitude of enhancement differed significantly.
For Multilingual-E5-Large, our hyperparameter optimization indicated that an aggressive learning rate of 1e-4 combined with a moderate batch size of 8 yielded optimal performance. After fine-tuning, the model achieved an accuracy of 89.682%, representing a 5.556 percentage point increase over baseline. The best configuration, involving 12 training layers and gradient accumulation of 2, suggests that substantial parameter updating was necessary for the model to adapt to the domain-specific characteristics of the EMR manual content.
For BGE-M3, systematic optimization showed that a conservative learning rate of 1e-5, combined with self-distillation, produced superior results. After fine-tuning, the model reached an accuracy of 97.619%, a significant 13.493 percentage point improvement from baseline. The optimal configuration utilized a longer passage context (1,024 tokens) and a shorter query length (64 tokens), which was particularly effective in capturing the hierarchical structure of EMR manual content. The use of the fp32 data type and a low temperature of 0.02 during training contributed to stable convergence and robust performance.
The dramatic performance gap between the two models (a final accuracy difference of 7.937 percentage points) suggests that BGE-M3’s architecture and training strategy are especially well-suited for Korean EMR manual embedding tasks. Its superior performance can be attributed to effective self-distillation and optimal handling of long passage contexts, all while maintaining computational efficiency.
2. RAG-based EMR Chatbot System
As detailed above, comparative analysis of the two embedding models showed that BGE-M3 outperformed Multilingual-E5-Large in both retrieval accuracy and overall system performance. Consequently, BGE-M3 was selected as the optimal embedding model for integration into the RAG-based chatbot system (Figure 3).
Figure 3A displays the chat interface where users interact with the chatbot. For example, a user inquires, “How do I request a patient transfer?” and the chatbot responds with a detailed explanation of the steps involved, including selection options and the entry of relevant details in the EMR system. Beneath the chat, several files are listed with their respective names and page numbers, likely providing links to related manuals or supplementary instructions.
Figure 3B presents a document viewer displaying a page titled “입원 환자리스트 - 전과전동/현위치 등록 (1/2),” which translates as “Inpatient List – Transfer/Current Location Registration (1/2).” The document includes structured guidelines, step-by-step instructions, and checkboxes for various options required for patient transfers, thereby offering clear procedural guidance to hospital staff.
IV. Discussion
This study fine-tuned an embedding model to enhance the accuracy and contextual relevance of responses for EMR system users, extending beyond basic search capabilities to better support healthcare administration.
One major application of this approach is the development of an administrative assistant chatbot. Healthcare professionals often spend substantial amounts of time on documentation, record management, and patient discharge procedures. By leveraging the proposed embedding model, an EMR-aware chatbot can efficiently retrieve procedural guidelines and relevant information, thereby reducing staff workload and enabling medical professionals to devote more time to patient care.
Another significant application is in billing and insurance support. Accurate retrieval of medical codes and billing procedures is critical for maintaining operational efficiency. The model’s high precision in retrieving domain-specific information can help minimize billing errors and accelerate insurance claim processing. Furthermore, rapid access to coding guidelines, such as Current Procedural Terminology (CPT) codes, can further streamline administrative workflows and increase efficiency.
Collectively, these applications illustrate how embedding models can optimize healthcare operations by facilitating accurate and timely information retrieval, ultimately improving both administrative processes and the quality of patient care delivery.
1. Limitations
A key limitation of this study is the absence of direct human evaluation. Although automated assessment demonstrated strong performance (Precision@5 of 82.6%), the system has yet to be validated with real-world healthcare users. User feedback is critical for assessing practical usability and identifying areas for improvement. Further studies are required to evaluate the long-term impact on healthcare efficiency, cost reduction, and administrative workload. Real-world testing should focus on metrics such as time savings, error rates, and user satisfaction. Moreover, this study did not assess the model’s potential role in supporting clinical decision-making. Future research should examine integration into clinical workflows to expand the model’s utility beyond administrative tasks.
2. Future Research Directions
Future studies should incorporate structured user evaluations, including usability testing with healthcare professionals, to assess the system’s practical effectiveness. Qualitative assessments in clinical environments can provide valuable insights into real-world performance and areas for further refinement. By addressing these limitations and exploring expanded applications, the model can be further improved to better support healthcare organizations and enhance overall administrative efficiency.