I. Introduction
The goal of adopting healthcare information technology is to optimize the collection, sharing, and utilization of data generated in the field of healthcare. Utilizing clinical data efficiently requires interoperability. Specifically, semantic interoperability, which preserves the semantics of the exchanged data, can be achieved using standardized terminologies. The standardization of clinical data makes it possible to efficiently conduct health information exchange and multinational network studies.
SNOMED Clinical Terms (SNOMED CT) is the most widely used system of clinical terminology with fine granularity and an extensive hierarchy; it is also increasingly used for clinical data entry and retrieval. SNOMED CT is used in 40 member countries and by more than 5,000 individuals and organizations around the world [1]. In August 2020, Korea became the 39th member of SNOMED International, and domestic healthcare institutions are actively working to achieve semantic interoperability and standardization of health data using SNOMED CT. SNOMED CT has been used in Korean national IT initiatives, such as the program for the certification of Electronic Medical Records (EMR) systems, for health information exchange, for data-driven hospitals, and for national registries (e.g., the Cancer Registration and Statistics Program) [2].
Seoul National University Bundang Hospital (SNUBH) has been using standard terminologies, including SNOMED CT, for semantic interoperability and data utilization since it was founded in 2003. Specifically, all diagnoses, chief complaints, and surgical procedure codes were standardized using SNOMED CT and the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM). In addition, significant numbers of clinical observation records (e.g., vital signs and pain scores), radiology and pathology reports, and laboratory test results were mapped to the corresponding Logical Observation Identifiers Names and Codes (LOINC).
However, there are concerns about information loss whenever a mapping is performed [3]. Mapping concepts from one taxonomy to another can create semantic inconsistency due to hierarchy incongruence [4, 5]. There is a difference in granularity between terminologies; for example, a source concept may be either too specific or too general to be directly mapped to a target concept [6]. Despite these concerns, Reich et al. [4] showed that, although there are vocabulary differences in mapping from ICD-9-CM to SNOMED CT and these differences cause differences in cohorts, studies that used these mappings showed minimal differences compared with those of the original studies. Hripcsak et al. [3] also showed that mapping data from source ICD billing codes to SNOMED CT codes had only a very small effect on the generated patient cohorts. However, another study showed substantial inconsistencies and disagreements between patient cohorts generated by standardized vocabularies and original codes across network sites [7].
In this study, we evaluated the effect of using standardized vocabularies on cohort generation. We generated epilepsy patient cohorts with local medical codes, SNOMED CT, and the International Classification of Diseases tenth revision (ICD-10)/Korean Classification of Diseases-7 (KCD-7), and compared the cohorts in terms of the number and age distribution of the patients by year.
IV. Discussion
To analyze the effects of data standardization (vocabulary mapping), previous studies compared patient cohorts [3, 6, 7] and evaluated the prevalence of specific health outcomes [4] and estimates of drug-health outcome associations [4] across the mapped vocabularies in various databases.
This study evaluated the effect of data standardization in terms of generating a cohort of patients with epilepsy. We considered the patient cohort created using local codes as the reference and compared it with cohorts generated by SNOMED CT and ICD-10/KCD-7 in terms of the number and age distribution of the patients by year.
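The comparison described above can be illustrated with a minimal sketch, assuming each cohort is held as a mapping from patient identifier to a (year, age) pair. The function names and the toy patients below are hypothetical and are not taken from the study data.

```python
from collections import Counter, defaultdict
from statistics import median


def compare_to_reference(reference, candidate):
    """Split patient IDs into those missed by, added by, and shared with the candidate cohort."""
    ref_ids = set(reference)
    cand_ids = set(candidate)
    return {
        "missed": ref_ids - cand_ids,   # in the reference but not in the candidate
        "extra": cand_ids - ref_ids,    # in the candidate but not in the reference
        "shared": ref_ids & cand_ids,
    }


def patients_per_year(cohort):
    """Count patients per year; cohort maps patient_id -> (year, age)."""
    return Counter(year for year, _age in cohort.values())


def median_age_by_year(cohort):
    """Median patient age for each year in the cohort."""
    ages = defaultdict(list)
    for year, age in cohort.values():
        ages[year].append(age)
    return {year: median(values) for year, values in ages.items()}


# Toy data (hypothetical): patient_id -> (year, age)
reference = {"p1": (2018, 12), "p2": (2018, 40), "p3": (2019, 7)}
candidate = {"p1": (2018, 12), "p3": (2019, 7), "p4": (2019, 65)}

diff = compare_to_reference(reference, candidate)
```

Here `diff["missed"]` and `diff["extra"]` correspond to the two kinds of disagreement discussed in the following paragraphs: patients in the reference that a standardized vocabulary fails to retrieve, and patients retrieved by the vocabulary that the reference does not contain.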
SNOMED CT is designed for direct use by healthcare providers during the process of care, whereas ICD-10 is designed for use by medical coders once an episode of care is completed. ICD-10 is a classification system that consists of groups of mutually exclusive categories for data aggregation. SNOMED CT, in contrast, is a health terminology that satisfies the requirements for reference terminologies, including concept orientation, formal definitions, poly-hierarchy, and multiple granularities [11]. Since SNOMED CT allows coding at any level of granularity that is appropriate for the clinical situation using a sub-type relationship, it is suited for documenting clinical information or ideas within EMRs. The SNOMED CT hierarchy allows facile incorporation of new concepts and increased granularity, eliminating the need to rely on ambiguous classifications such as NOS (not otherwise specified) and NEC (not elsewhere classifiable) codes, as are used in ICD-10 codes [12]. This increased granularity also benefits clinical research.
Of the patients included in the cohort generated by SNOMED CT, 88 patients were excluded from the reference. The local diagnoses of these patients had spelling errors, such as “benign myoclonic epilopsy [sic] in infancy,” “benign myoclonic epilopsy [sic] in infancy, not intractable,” and some had local names without “%epilep%” (e.g., benign neonatal familial convulsions, seizure with specific mode of precipitation); hence, we could not include these codes in the reference. In other words, SNOMED CT detected hidden patients with epilepsy that could not be identified using local codes and included them in the cohort.
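A small sketch may help show why a name-based search misses such records. It assumes the "%epilep%" search behaves like a case-insensitive substring match; the helper name is hypothetical, and the correctly spelled first entry is added only for contrast.

```python
def matches_like_epilep(name):
    """Emulate a case-insensitive SQL LIKE '%epilep%' filter (assumed behaviour)."""
    return "epilep" in name.lower()


local_names = [
    "benign myoclonic epilepsy in infancy",         # correctly spelled, added for contrast
    "benign myoclonic epilopsy in infancy",         # misspelling quoted in the text
    "benign neonatal familial convulsions",         # no "epilep" substring
    "seizure with specific mode of precipitation",  # no "epilep" substring
]

found = [name for name in local_names if matches_like_epilep(name)]
missed = [name for name in local_names if not matches_like_epilep(name)]
```

Only the correctly spelled name is found. The other three fall outside the reference even though, once mapped to SNOMED CT concepts, they can be retrieved through the epilepsy hierarchy.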
The epilepsy cohort generated by local codes contained patients diagnosed with situation-related seizures, but these patients were not included in the SNOMED CT cohort. Since the concept 230431001 |Situation-related seizures (disorder)| is a sibling of 84757009 |Epilepsy (disorder)| in the SNOMED CT hierarchy, nine patients with situation-related seizures were not included in the cohort generated by 84757009 |Epilepsy (disorder)| and its descendants.
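The exclusion of a sibling concept follows directly from how a descendant-based concept set is built. The sketch below is a toy illustration, not the actual SNOMED CT content: the two concept identifiers come from the text, while the shared parent and the child concepts are hypothetical placeholders.

```python
from collections import defaultdict, deque

EPILEPSY = "84757009"            # |Epilepsy (disorder)|, from the text
SITUATION_RELATED = "230431001"  # |Situation-related seizures (disorder)|, from the text

# Hypothetical placeholders standing in for real concepts
SHARED_PARENT = "parent-placeholder"
CHILD = "child-placeholder"
GRANDCHILD = "grandchild-placeholder"

# (child, parent) pairs of a toy is-a hierarchy
IS_A = [
    (EPILEPSY, SHARED_PARENT),
    (SITUATION_RELATED, SHARED_PARENT),  # sibling of EPILEPSY
    (CHILD, EPILEPSY),
    (GRANDCHILD, CHILD),
]


def descendants_and_self(root, is_a_pairs):
    """Return the root plus every concept reachable by following is-a links downward."""
    children = defaultdict(set)
    for child, parent in is_a_pairs:
        children[parent].add(child)
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:  # guards against revisiting in a poly-hierarchy
                seen.add(child)
                queue.append(child)
    return seen


concept_set = descendants_and_self(EPILEPSY, IS_A)
```

Because the traversal only moves downward from the chosen root, a sibling such as situation-related seizures never enters the concept set unless it is added explicitly or a higher root is chosen.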
In addition, 42 patients included in the reference were missing from the cohort generated by the G40.XX, G41.X, and F80.3 codes, since the local codes for “hippocampal sclerosis” and “posttraumatic epilepsy” have been mapped to G37.9 (demyelinating disease of central nervous system, unspecified) and T90.5 (sequelae of intracranial injury), respectively. Most of the 713 patients included in the cohort generated by ICD-10/KCD-7 codes were patients with seizures, including epileptic seizures, simple/complex partial seizures, grand/petit mal seizures, and generalized tonic-clonic seizures. As the ICD-10/KCD-7 allows complex concepts to be encoded, patients with diagnoses and symptoms other than epilepsy might be included. Thus, the use of ICD-10/KCD-7 may reduce the homogeneity of study subjects when organizing the cohort.
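The effect of the mapping targets on code-based selection can be sketched as follows. The regular expression is an assumption about how the G40.XX, G41.X, and F80.3 patterns might be written (X read as a single digit); the function name is hypothetical, and the two local-name mappings are the ones reported above.

```python
import re

# Assumed reading of the code patterns: G40 with one or two digits after the
# decimal point, G41 with one digit, and F80.3 exactly.
EPILEPSY_ICD_PATTERN = re.compile(r"G40\.\d{1,2}|G41\.\d|F80\.3")


def in_icd_cohort(code):
    """True when the whole code matches one of the assumed epilepsy patterns."""
    return EPILEPSY_ICD_PATTERN.fullmatch(code.strip().upper()) is not None


# Mappings of local diagnosis names to ICD-10/KCD-7 codes reported in the text
local_to_icd = {
    "hippocampal sclerosis": "G37.9",
    "posttraumatic epilepsy": "T90.5",
}

missed_by_icd = [name for name, code in local_to_icd.items() if not in_icd_cohort(code)]
```

Both local diagnoses end up in `missed_by_icd`, since neither target code falls within the selected code ranges, even though the local names place these patients in the reference cohort.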
Although we did not analyze the effects of data standardization on specific study outcomes (e.g., estimates of drug-disease associations), as was done in the study of Reich et al. [4], we found substantial differences in the number and age distribution of patients from the reference when we used ICD-10/KCD-7 codes, not SNOMED CT concepts, to generate a targeted patient cohort. This finding indicates that SNOMED CT is more suitable for representing clinical concepts or ideas than ICD-10/KCD-7 and is beneficial for clinical studies. Moreover, we evaluated the quality of mapping between vocabularies.
Our study has several limitations. First, we only generated patient cohorts with epilepsy at a single healthcare institution to evaluate the effect of data standardization. Second, we only used diagnosis codes to generate patient cohorts with epilepsy. Hripcsak et al. [3] used public phenotypes from the eMERGE initiative (https://phekb.org/phenotypes) to test the effect of mapping diagnosis codes from ICD-9-CM/ICD-10-CM to SNOMED CT on patient cohorts. The eMERGE initiative was chosen because the phenotype definitions were validated and the phenotypes were explained in each case, thereby allowing intent to be assessed. Therefore, we could add the inclusion criterion of one or more prescriptions of antiepileptic drugs to identify subjects with epilepsy, as defined by the eMERGE initiative [13]. Third, we searched the diagnosis name with “%epilep%” to define epilepsy-related local codes. Thus, patients with the codes for “benign neonatal familial convulsions” and “benign myoclonic epilopsy [sic] in infancy” were missing from the reference. Fourth, the conversion from one vocabulary to another depends on the quality of the mapping tables and the mapping skills of the medical coders [4]. Thus, the mapping results can vary according to the mapping purpose and institution. In-depth understanding and training for standard terminologies are required to improve the quality of mapping between vocabularies. Fifth, the concept set used to define a phenotype depends on the version of a vocabulary. There are a total of 249 SNOMED CT concepts for epilepsy and its descendants in SNOMED CT International released on February 28, 2022. Since we used the 298 concepts for epilepsy in SNOMED CT International released on April 1, 2020, there may be differences in the cohort composition according to the SNOMED CT version.
We plan to expand our empirical research by phenotyping other health outcomes of interest (e.g., heart failure, diabetes mellitus) or identifying the effect of data standardization on estimates of drug-health outcome associations, as in previous studies [3, 4].