I. Introduction
A standardized and controlled vocabulary in a national healthcare system facilitates semantic interoperability and collaborative research [1]. For medical diagnosis, the Korean Standard Classification of Diseases and Causes of Death (KCD-7), an extension of the International Statistical Classification of Diseases and Related Health Problems, 10th revision (ICD-10), is widely acknowledged as the de facto standard vocabulary because it is a mandatory terminology for claims operations. However, there has been no widely accepted standardized vocabulary system that incorporates drugs, medical services, and devices in Korea. The Korean Standard Terminology of Medicine (KOSTOM) was developed in 2004 to provide a standardized and comprehensive vocabulary of medical terminology [2]. However, because of a lack of commitment and inadequate publicity, the KOSTOM vocabulary has seldom been adopted in routine clinical practice or in big data analytics in medicine and healthcare [3].
The Health Insurance Review and Assessment Service (HIRA) has developed and maintains the Electronic Data Interchange (EDI) code system, or EDI vocabulary, to classify and identify drugs, medical services, and devices. HIRA mandates use of this vocabulary to obtain reimbursement in the fee-for-service system. For this reason, every Korean Electronic Health Record (EHR) system uses the EDI vocabulary for most drugs, medical procedures, and devices. However, most hospitals have developed their own medical vocabulary systems because of the limited granularity of the EDI vocabulary [4]. Furthermore, the EDI vocabulary has not been acknowledged as a standard vocabulary in the way that the Current Procedural Terminology, fourth edition, has been in the United States, because the quality of the EDI has never been audited. To standardize this de facto Korean medical vocabulary, there was an effort to map the EDI vocabulary to the Systematized Nomenclature of Medicine–Clinical Terms (SNOMED-CT) [5]. Nonetheless, this did not lead to substantive quality improvement of the EDI vocabulary itself.
1. Challenges in EDI Vocabulary as a Controlled Vocabulary
We identified the following five main problems disrupting the EDI’s maintenance as a controlled medical vocabulary: lack of concept identifier (ID) version control, lack of ID permanence, use of semantic concept identifiers, non-unique identifiers, and lack of formal definitions.
First, the EDI has no controlled life cycle for its terms. Newly added and expired codes are listed in the official monthly announcements, but the validity dates of EDI codes are not recorded there. Second, the identifiers and concepts of the EDI are not permanent. There are EDI vocabularies that are no longer used because they have expired or been replaced by other vocabularies. We have confirmed that some of their expired codes have been reused in other vocabularies; that is, outdated EDI IDs can be assigned to new concepts. Third, the EDI vocabulary uses semantic concept identifiers. For example, the EDI ID of a drug includes information on the country, company, unit, and packaging type. Such a scheme makes it difficult to apply a single rule once the number of values to be tracked exceeds what the digits allotted to that attribute can represent. Fourth, the EDI vocabulary has some duplicated identifiers because there is no unified EDI encoding system across domains. For example, 13 codes are duplicated between medical services and devices. Among these, “Chest [Direct], radiologist reading” in medical services and “TRI-MO” in devices share the EDI ID G2101006. Fifth, although the EDI includes a modifier for reimbursing the additional price of service (e.g., emergency services or nighttime services) according to the national reimbursement policy, the concept definitions do not include information related to the modifiers. For example, the EDI ID N0333 means “Craniotomy or Craniectomy for Decompression.” If the identical medical service is performed at night, it is recorded as EDI ID N0333010, but the conceptual definition remains “Craniotomy or Craniectomy for Decompression.” Furthermore, Korean definitions of items in the EDI vocabulary vary over time, usually because of non-semantic changes in punctuation.
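The fourth problem, identifiers shared across domains, is straightforward to detect mechanically. The following is a minimal sketch, not the authors' implementation: the record layout `(domain, edi_id, name)` and the function name are assumptions made for illustration, and the sample rows reuse only the codes quoted above.

```python
from collections import defaultdict


def find_cross_domain_duplicates(records):
    """Return {edi_id: sorted list of domains} for IDs used in more than one domain.

    `records` is an iterable of (domain, edi_id, name) tuples; this layout is
    a hypothetical one chosen for the example.
    """
    domains_by_code = defaultdict(set)
    for domain, edi_id, _name in records:
        domains_by_code[edi_id].add(domain)
    return {
        code: sorted(domains)
        for code, domains in domains_by_code.items()
        if len(domains) > 1
    }


records = [
    ("medical service", "G2101006", "Chest [Direct], radiologist reading"),
    ("device", "G2101006", "TRI-MO"),
    ("medical service", "N0333", "Craniotomy or Craniectomy for Decompression"),
]

duplicates = find_cross_domain_duplicates(records)
print(duplicates)  # {'G2101006': ['device', 'medical service']}
```

Grouping by identifier rather than comparing the domain lists pairwise keeps the check linear in the number of codes.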
2. Observational Medical Outcomes Partnership Vocabulary
Observational Health Data Sciences and Informatics (OHDSI) is an international, multi-stakeholder, interdisciplinary initiative for collaborative medical research, which uses an open-source standardized data structure and provides analytic solutions. As a successor to the Observational Medical Outcomes Partnership (OMOP), OHDSI adopts the OMOP common data model (CDM) as its standard data structure and the OMOP vocabulary as its standard semantics [6]. Multiple medical vocabulary systems are organized in the unified controlled vocabulary system of the OMOP-CDM to provide comprehensive coverage for diverse healthcare databases across countries [7]. The OMOP vocabulary system comprises standard and non-standard vocabularies across various healthcare data domains, including condition (a medical diagnosis), drug, procedure, measurement, and device. For the condition domain, the SNOMED-CT and ICDO (International Classification of Diseases for Oncology) vocabularies are used as standard vocabularies, whereas ICD-10, ICD-10-CM, and KCD-7 are classified as non-standard vocabularies. The OHDSI vocabulary subgroup has developed and maintained both the standard and non-standard OMOP vocabularies based on desiderata for controlled medical vocabularies, such as concept orientation, concept permanence, non-semantic concept identifiers, polyhierarchy, formal definitions, multiple granularities, and graceful evolution [8].
3. Objectives
Our ultimate goal was to improve the EDI vocabulary so that it can serve as a controlled and standardized vocabulary system. For this purpose, we incorporated the EDI vocabulary into the OMOP Standardized Vocabulary through a semi-automated process.
II. Methods
For this study, we used the EDI concept list that was released on the HIRA website in October 2019. The EDI has separate vocabularies for drugs, medical services, and devices. These three domains have no unified system in the EDI vocabulary. A complete list of valid EDI codes in each of these three domains is independently released with a description every month.
Figure 1 presents the overall process. First, we assigned a permanent, non-semantic, and unique concept identifier to each EDI concept. A “permanent” identifier is one that will never be reassigned to a new concept and that is retained, together with its expiry information, after the concept expires. A “non-semantic” and “unique” identifier means that the concept identifier per se is a random unique number without any meaningful information. Second, we assigned every EDI vocabulary item to one of four OMOP domains (drug, procedure, measurement, and device) and built a hierarchy. Third, we translated the Korean definitions of EDI terms into English by leveraging the Google Cloud Translation API to generate formal English definitions of all concepts.
We built a semi-automated process to incorporate the EDI vocabulary into the OMOP Standardized Vocabulary, including code cleaning, classification, building hierarchy, and vocabulary insertion in the OMOP-CDM version 5.3.1 database. We deployed the open-source click-to-run R software, EdiToOmop, found on OHDSI’s official GitHub repository [9].
1. Classification of Domains, Application of Management Systems and Building Hierarchy
Clinical events are classified into the domains of drug, device, condition, and procedure in OMOP. EDI concepts are divided into drugs, devices, and medical services, but the scope of medical services is too broad for the OMOP Standardized Vocabularies. Because of this discrepancy in domain classification between the EDI and OMOP Standardized Vocabularies, we subclassified EDI medical services into procedures and measurements to match the OMOP domains. To ensure that each concept’s meaning would be clear and unique, we added more descriptive matter to the concept definitions to explain the modifier codes of the original EDI ID, such as emergency use.
Once registered in the OMOP Standardized Vocabularies, a permanent, unique, and non-semantic numeric OMOP identifier was assigned to each EDI concept. This identifier, called a concept ID, prevents duplication and allows the history of each EDI concept to be tracked from its first appearance to its deprecation. Three attributes define the validity of concepts in the OMOP Standardized Vocabularies: “valid start date,” “valid end date,” and “invalid reason.” When an EDI concept is newly registered or deprecated, the corresponding start or end date is recorded. If a concept is valid, the “invalid reason” for the concept is recorded as “NULL.” If a concept is replaced by another concept or deleted, the “invalid reason” for the concept is recorded as “U” or “D,” respectively.
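The three validity attributes can be illustrated with a small record type. This is a sketch under stated assumptions rather than the EdiToOmop code: the class and function names are hypothetical, the concept ID and dates are made up, and the open-ended end date of 2099-12-31 follows the usual OMOP convention for concepts that are still valid.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional

# Conventional OMOP placeholder for "no expiry yet" (assumed here).
OPEN_END = date(2099, 12, 31)


@dataclass(frozen=True)
class Concept:
    concept_id: int                        # permanent, non-semantic identifier
    concept_code: str                      # original EDI ID
    valid_start_date: date
    valid_end_date: date = OPEN_END
    invalid_reason: Optional[str] = None   # None (NULL), "U" (replaced), or "D" (deleted)


def deprecate(concept: Concept, end_date: date, replaced: bool = False) -> Concept:
    """Expire a concept without removing it or freeing its identifier."""
    return replace(
        concept,
        valid_end_date=end_date,
        invalid_reason="U" if replaced else "D",
    )


def is_valid_on(concept: Concept, day: date) -> bool:
    """True if `day` falls inside the concept's validity interval."""
    return concept.valid_start_date <= day <= concept.valid_end_date


# Hypothetical concept ID and dates, for illustration only.
active = Concept(concept_id=1, concept_code="N0333", valid_start_date=date(2019, 10, 1))
expired = deprecate(active, date(2020, 8, 31))

print(is_valid_on(active, date(2021, 1, 1)), is_valid_on(expired, date(2021, 1, 1)))
# prints: True False
```

Because `deprecate` returns an updated copy instead of deleting the record, the expired concept and its identifier remain available for historical data.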
The OMOP Standardized Vocabulary provides vertical and horizontal hierarchical relationships between concepts. In this project, we built a formal vertical hierarchy for EDI concepts. As with the ICD-9 and ICD-10 code systems, the first five digits of the EDI IDs in the medical service domain represent the ancestor terms for longer, descendant EDI IDs. The remaining digits are usually added as modifiers to the same service for reimbursement. Thus, the descendant concept contains all of the information for the ancestor concept, creating a vertical hierarchy.
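The prefix rule described above can be sketched in a few lines. The function name and the choice to emit `(ancestor, descendant)` pairs are assumptions for illustration; the codes are those quoted elsewhere in this article, and the sketch does not claim to reproduce how EdiToOmop builds its hierarchy.

```python
def build_vertical_hierarchy(codes, prefix_len=5):
    """Pair each longer EDI ID with the code formed by its first `prefix_len` characters.

    A pair is emitted only when that shorter ancestor code is itself present
    in `codes`, so codes without a listed ancestor are simply skipped.
    """
    code_set = set(codes)
    pairs = []
    for code in sorted(code_set):
        ancestor = code[:prefix_len]
        if len(code) > prefix_len and ancestor in code_set:
            pairs.append((ancestor, code))
    return pairs


codes = ["A3133", "A3133100", "A3133200", "N0333", "N0333010", "O7016", "O7016001"]
hierarchy = build_vertical_hierarchy(codes)
for ancestor, descendant in hierarchy:
    print(ancestor, "->", descendant)
```

Each descendant receives exactly one ancestor under this rule, which is why the resulting structure is a single vertical hierarchy rather than a polyhierarchy.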
2. Translation
For incorporation into the OMOP Standardized Vocabularies, the English definition for each EDI term is essential. We identified 266,140 concept definitions without an English description in the EDI vocabulary domains of medical services and devices. The translation of these terms involved three steps. First, to increase efficiency, we leveraged a Google translation tool: for the initial translation, we used Google.Cloud.Translation.V3, a .NET client library for the Google Cloud Translation API. Second, because Google-translated definitions may have misrepresented the meaning of a Korean term or may not have recognized an abbreviated term, two registered nurses reviewed and modified the English definitions, and we developed a glossary of Korean words that the software often failed to translate correctly into English. The Google Cloud Translation API provides customized translation functions that refer to a glossary. We created a glossary containing 749 device terms and 6,079 service terms, including modifiers for reimbursing the additional price of service. Third, a secondary translation referring to the glossary was conducted for the 266,140 terms that needed to be retranslated. After this secondary translation, a medical worker audited the result to ensure precision.
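A simple local check can help reviewers spot definitions in which a glossary term was not honored. The sketch below is not the Google Cloud Translation glossary feature and is not taken from the authors' pipeline; the function name, the two glossary entries, and the sample sentence are illustrative assumptions.

```python
def missing_glossary_terms(korean_text, english_text, glossary):
    """List (korean, expected_english) glossary pairs that the translation ignored.

    A pair is reported when the Korean term occurs in the source text but the
    expected English rendering does not occur in the translation.
    """
    english_lower = english_text.lower()
    return [
        (ko, en)
        for ko, en in glossary.items()
        if ko in korean_text and en.lower() not in english_lower
    ]


# Illustrative glossary entries for reimbursement modifiers (assumed, not from the paper).
glossary = {"야간": "nighttime", "응급": "emergency"}

flagged = missing_glossary_terms("야간 응급 수술", "Night emergency surgery", glossary)
print(flagged)  # [('야간', 'nighttime')]
```

A check of this kind only narrows down what a human reviewer needs to read; it cannot judge whether the overall translation is clinically accurate.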
3. Auditing of Vocabulary
Qualitative criteria indicate that our EDI vocabulary restructuring process improved data quality for the health terminology system. Cimino [8], Chute et al. [10], and Rosenbloom et al. [11] presented qualitative evaluation criteria for terminology. Additionally, Lee [12] synthesized the criteria and included an index to determine whether the terminology system could support multiple languages. Based on Lee’s study [12], we defined the following 11 criteria for evaluating terminology and evaluating the incorporation of the EDI vocabulary into the OMOP Standardized Vocabularies: concept orientation, concept permanence, coverage, relation, multiple hierarchy, compositionality, non-semantic concept identifiers, version control, formal definitions, synonyms uniquely identified and mapped to relevant concepts, and multi-language.
Another aspect of the EDI in the OMOP Standardized Vocabularies is the hierarchical relationships that we constructed. Furthermore, a mapping relation from non-standard to standard has been built. Thus, EDI concepts acquire relationships with other standard vocabularies. For example, the concept “ICU Patient Care-General” (OMOP Concept ID: 42360788) in the EDI is related to the concept of “Critical Care Medicine Care Management” (OMOP Concept ID: 44804818) in SNOMED-CT as shown in Figure 2.
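The non-standard-to-standard mapping can be pictured as a lookup over relationship rows. In the sketch below, the two concept IDs are the ones given in the example above; representing the link as a "Maps to" row follows the usual convention of the OMOP CONCEPT_RELATIONSHIP table and is an assumption here, since the text only states that the two concepts are related. The function name is hypothetical.

```python
# Rows shaped like (concept_id_1, relationship_id, concept_id_2).
concept_relationship = [
    (42360788, "Maps to", 44804818),  # EDI "ICU Patient Care-General" -> SNOMED-CT concept
]


def standard_concepts_for(source_concept_id, relationships):
    """Return the standard concept IDs that a source concept maps to."""
    return [
        target
        for source, relationship_id, target in relationships
        if source == source_concept_id and relationship_id == "Maps to"
    ]


mapped = standard_concepts_for(42360788, concept_relationship)
print(mapped)  # [44804818]
```

A source concept with no mapping row yields an empty list, which lets downstream code tell unmapped EDI concepts apart from mapped ones.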
The criterion for formal definition is related to multiple hierarchies. In the converted EDI vocabulary, each term acquires a formal definition, allowing concepts to have relationships with other concepts. For example, hierarchy defines parent/child relationships between concepts, such that “Intravenous Catheterization for Hemodialysis” (EDI ID: O7016) is the parent concept for “Intravenous Catheterization for Hemodialysis, second surgery” (EDI ID: O7016001).
Synonyms of each concept were managed under a single unique integer identifier, and related concepts were mapped to each other. Moreover, we gave every EDI term a unique English definition. Through the EdiToOmop package, newly added or deprecated EDI IDs can be updated in the OMOP Standardized Vocabularies semi-automatically.
III. Results
The R package EdiToOmop was developed to automate the incorporation of the EDI vocabulary into the OMOP Standardized Vocabularies. Of 313,453 EDI concepts, 313,431 were incorporated, with 270,387 medical services classified as measurements or procedures. Of the 12,991 measurement codes, 1,301 were classified as ancestor codes, and 11,681 were classified as descendant codes. For procedure codes, of 257,396 concepts, 7,038 were classified as ancestor codes, and 250,358 were classified as descendant codes.
Table 1 presents the numbers of concepts in the original EDI vocabulary and the reclassified domains using a simple hierarchy for incorporation into the OMOP Standardized Vocabularies.
The revised EDI concepts were uploaded to OMOP and published on OHDSI’s official public vocabulary website, ATHENA [13], as shown in Figure 3. We removed 26 EDI concepts from among medical services because their codes (EDI IDs) were duplicated in the EDI device domain; these concepts were already deprecated in the EDI vocabulary. The OHDSI vocabulary team assigned a unique OMOP identifier to all EDI concepts in February 2020.
We translated 273,449 Korean definitions of EDI concepts using the Google Cloud Translation API. After manual review, only 890 terms (0.33%) did not need further modification. The other 272,559 terms were retranslated with reference to the glossary. We present the results of the initial translation (without the glossary) and the second translation (with the glossary) in Figure 4. As seen in Figure 4, the translation procedure that included glossary constraints performed better in rendering abbreviations, medical terms, and descriptions.
The incorporation of the EDI vocabulary into the OMOP Standardized Vocabularies brought about three obvious improvements: (1) uniqueness and exclusivity of concepts, (2) hierarchies and relationships between concepts, and (3) a management system for vocabulary. The 11 audit criteria were grouped under these three improvements to compare the current EDI vocabulary with the converted EDI. The criteria of concept orientation, coverage, non-semantic concept identifiers, and synonyms uniquely identified and mapped to relevant concepts were used to evaluate how unique and exclusive the concepts were. The systematic nature of the hierarchical structure was evaluated in terms of relation, multiple hierarchy, and formal definitions. The incorporated EDI vocabulary featured a more structured management system, evaluated in terms of concept permanence, version control, and multi-language. For all criteria except compositionality, the converted EDI showed a better quality index than the original EDI, as shown in Table 2.
As previously stated, the criteria of concept orientation, non-semantic concept identifiers, coverage, and synonyms uniquely identified and mapped to relevant concepts were used to evaluate how unique and exclusive the concepts were. Concept orientation stipulates that a concept must correspond to a single meaning. Concept orientation is impaired in the current EDI vocabulary because it uses the same concept definitions for several concepts, despite the fact that they have different concept identifiers. In this case, concepts can be distinguished by a modifier for reimbursing the additional price of service. After incorporation into the OMOP vocabulary, converted EDI concepts gain unique concept definitions. The current EDI vocabulary uses semantic identifiers, which have the advantage that each digit of a code carries meaning, enabling the easy identification of a single hierarchy (e.g., A3133 is the parent code of A3133100, A3133200, and so on). However, if the rule for assigning concept codes changes for some reason, this convenience can become a constraint. Also, for vocabularies with multiple hierarchies, semantic identifiers can cause confusion [8]. Through incorporation into the OMOP vocabulary, a non-semantic identifier was assigned to every concept in the EDI vocabulary to meet the non-semantic concept identifier criterion. We classified the medical service domain of the current EDI vocabulary as measurements and procedures. Although the converted EDI vocabulary has more specific domains, both have consistent and obvious coverage. Regarding the “synonyms uniquely identified and mapped to relevant concepts” criterion, the current EDI vocabulary has no structure for concept relationships, and it contains some duplicated identifiers. In the converted EDI vocabulary, by contrast, associated concepts have defined relationships, and the unique Korean definitions are stored as concept synonyms of the English definitions, which meets this criterion.
The systematic nature of the hierarchical structure was evaluated in terms of the relation, multiple hierarchy, formal definition, and compositionality criteria. Relation refers to the existing connections between related concepts. The current EDI does not maintain any relation between concepts, whereas the converted EDI has a structure of concept relations, allowing related concepts in other vocabularies to be identified. The hierarchy of concepts is established by defining the horizontal/vertical relationships of concepts. We constructed a single vertical hierarchy in the converted EDI vocabulary; however, it does not fully meet the multiple hierarchy criterion. The formal definition refers to a structure with concept relations that can be indexed and processed by a computer. The current EDI vocabulary lacks formal definitions because even the Korean definitions of EDI concepts vary across the versions of the EDI that have been released. Furthermore, the EDI vocabulary does not provide a system to search for related concepts based on the definitions of concepts. The converted EDI is available at the official OMOP vocabulary website, ATHENA [13], where users can easily search for related concepts using formal English definitions. Compositionality means that composite concepts can be divided into simple atomic concepts. This provides an intuitive understanding of complex concepts, but neither the current nor the converted EDI vocabulary meets this criterion.
The incorporated EDI vocabulary featured a more structured management system when evaluated according to the concept permanence, version control, and multi-language criteria. Concept permanence means that expired or modified concepts and identifiers remain permanently. The current EDI vocabulary removes expired concepts and reassigns the deprecated identifiers to newly added concepts, whereas the converted EDI vocabulary maintains the expired concepts and identifiers. Version control is the corollary of concept permanence. The converted EDI vocabulary enables versioning by storing the start and expiry dates of each concept as metadata. In addition, the current EDI provides English definitions only for some concepts, whereas the converted EDI provides unique English definitions for all concepts.
IV. Discussion
We audited the Korean EDI as a controlled medical vocabulary in use in the Korean EHR system. By incorporating the EDI vocabulary into the OMOP Standardized Vocabulary, we enhanced many aspects of a controlled vocabulary, such as concept permanence, consistency, versioning, hierarchy, relations between concepts, formal definitions, unique and non-semantic identifiers, as well as expressive Korean and English definitions of concepts, while maintaining the EDI’s coverage. As a controlled vocabulary, the EDI in the OMOP vocabulary can provide a cohort database with unified terms and normalized concepts to researchers with similar research purposes. We also developed and deployed an open-source R package to automate this procedure.
The objective of this study was not to investigate errors in the EDI vocabulary. Rather, the ultimate aim of this study was to improve the EDI vocabulary so that it can serve as a controlled and standardized vocabulary system. The EDI vocabulary itself was created for the purely administrative purpose of facilitating nationwide insurance. It was not designed as a comprehensive medical ontology. Nonetheless, the EDI vocabulary has become the de facto vocabulary for observational medical research in Korea because of the rapid expansion of the secondary use of Korean EHR and the administrative claims database for real-world evidence [14,15]. Unlike the EDI, the converted EDI concepts in the OMOP Standardized Vocabularies were assigned unique identifiers, and they may have exclusive definitions. In the OMOP, each concept corresponds to no more than one meaning and is exclusive, resulting in better concept orientation.
This study provides significant advantages for big data analysis when using a Korean medical database. First, it helps to build a standard process for transforming Korean observational databases into OMOP-CDM. We recommend storing the OMOP concept IDs of EDI concepts in the “_SOURCE_CONCEPT_ID” fields of drug exposure, procedure occurrence, device exposure, and measurement tables. Then, EDI concept-based collaborative research can be performed across Korean databases without the need for further vocabulary mapping. Second, it may enhance the transparency and reproducibility of Korean medical research. Until now, most studies using the Korean Administrative Claims Database have not provided actual EDI identifiers because no English documentation for EDI concepts has existed [16]. Because our study provides formal English definitions and a hierarchy of EDI concepts, it may precipitate the reporting of EDI identifiers in scientific papers, enhancing the reproducibility of research. Third, our study paves the way for international collaborative research using Korean databases. In response to the coronavirus disease 2019 (COVID-19) pandemic, HIRA launched a global research collaboration project with clinical data from Korean patients with COVID-19 on March 27, 2020 [17]. Although there was no official English document describing the dataset’s medical vocabulary, all of the necessary information for KCD-7 and the EDI vocabulary was accessible through the ATHENA web portal [13].
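The recommendation to store OMOP concept IDs of EDI concepts in the "_SOURCE_CONCEPT_ID" fields can be sketched for the procedure occurrence table as follows. This is an illustrative transformation under stated assumptions, not a prescribed ETL: the input row layout, the function name, and the concept ID 9000001 are hypothetical, while `procedure_source_value` and `procedure_source_concept_id` are fields of the OMOP-CDM PROCEDURE_OCCURRENCE table. Using 0 for codes without a registered concept follows the common OMOP convention.

```python
def to_procedure_occurrence(claim_rows, edi_to_concept_id):
    """Turn claim rows carrying EDI codes into PROCEDURE_OCCURRENCE-like records.

    The original EDI code is kept in procedure_source_value, and the OMOP
    concept ID registered for that code goes in procedure_source_concept_id
    (0 when the code is not found in the lookup).
    """
    records = []
    for row in claim_rows:
        records.append({
            "person_id": row["person_id"],
            "procedure_date": row["date"],
            "procedure_source_value": row["edi_code"],
            "procedure_source_concept_id": edi_to_concept_id.get(row["edi_code"], 0),
        })
    return records


# Hypothetical lookup; real concept IDs would come from the published vocabulary.
edi_to_concept_id = {"N0333": 9000001}

claims = [
    {"person_id": 1, "date": "2020-01-15", "edi_code": "N0333"},
    {"person_id": 2, "date": "2020-02-03", "edi_code": "ZZ999"},
]

occurrences = to_procedure_occurrence(claims, edi_to_concept_id)
print([r["procedure_source_concept_id"] for r in occurrences])  # [9000001, 0]
```

Keeping the raw code alongside the concept ID means that two databases transformed this way can be queried by the same EDI concept without any site-specific mapping.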
As originally intended, incorporation of the EDI vocabulary into the OMOP Standardized Vocabularies provides the infrastructure for standard mapping. As of September 2020, we had published corresponding standard concepts for 37,869 EDI procedure concepts and 675 measurement concepts [18]. Most of the OMOP standard concepts for procedures and measurements were derived from SNOMED-CT and the LOINC (Logical Observation Identifiers Names and Codes) vocabulary, respectively.
This study had several limitations. First, only vocabulary current as of October 2019 has been incorporated into OMOP. To provide updates, Korean definitions of new terms should be translated into English by a human translator. By August 2020, a total of 447 EDI codes for Korean synonyms had been changed, 10,320 codes had been newly added, and 7,873 codes had been deprecated. Regular updates of changes in the EDI vocabulary to the OMOP vocabulary should be conducted going forward. Second, the quality of English definitions of EDI concepts has not been fully evaluated by professional medical staff. All information is publicly available [13], and the overall quality can be improved through open discussion [19]. Third, we used the EDI concept list released on the HIRA website, but it does not include concepts that had expired before October 2019.
By incorporating the EDI vocabulary into the OMOP Standardized Vocabularies, Korean medical terms can become standardized. This research developed a promising approach to mapping Korean medical information into a global standard system of terminology, but comprehensive official vocabulary mapping remains to be done in the future.