Clinical Data Element Ontology for Unified Indexing and Retrieval of Data Elements across Multiple Metadata Registries
Abstract
Objectives
Classifying the data elements (DEs) used in clinical documents is challenging, even across ISO/IEC 11179 compliant clinical metadata registries (MDRs), because no reliable standard exists for identifying DEs. We propose the Clinical Data Element Ontology (CDEO) for unified indexing and retrieval of DEs across MDRs.
Methods
The CDEO was developed through harmonization of existing clinical document models and empirical analysis of MDRs. Data element concepts (DECs) were chosen as the unit of classification, and the Simple Knowledge Organization System was chosen to represent and organize the DECs. Six basic requirements that the CDEO must meet were also set, including that the indexing target be a DEC and that DECs be organized using their semantic relationships. To evaluate the CDEO, three indexers mapped 400 DECs to one or more CDEO terms in order to determine whether the CDEO produces a consistent index for a given DEC. The level of agreement among the indexers was determined by calculating the intraclass correlation coefficient (ICC).
Results
We developed the CDEO with 578 concepts. Its usability was evaluated through two application use-case scenarios, and it met all of the considered requirements. The ICC among the three indexers was estimated to be 0.59 (95% confidence interval, 0.52-0.66).
Conclusions
The CDEO organizes DECs originating from different MDRs into a single unified conceptual structure. It enables highly selective search and retrieval of relevant DEs from multiple MDRs for clinical documentation and clinical research data aggregation.
I. Introduction
The International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) 11179 Metadata Registry (MDR) standard [1] provides a framework that enables the semantic interoperability of data originating from various sources with exact definitions of data elements (DEs). The utility of the MDR standard has been widely recognized, and an increasing number of healthcare stakeholders have been adopting it for the management of metadata for clinical trials; the aggregation, sharing, and reuse of clinical research data; and clinical documentation in Electronic Health Records [2, 3, 4, 5, 6, 7, 8, 9, 10, 11].
An appropriate classification scheme is needed to enable searching for relevant DEs registered within or across multiple MDRs. Although Part 2 of the MDR standard provides a conceptual basis for classifying DEs [12], it does not specify how to create a classification scheme, or how to designate a particular one. Currently there is no classification scheme in healthcare MDRs by which to organize the DEs into well-organized conceptual categories. The CDE Browser (http://cdebrowser.nci.nih.gov/CDEBrowser/) in the Cancer Data Standards Registry and Repository (caDSR) [13] simply organizes DEs into 35 contexts, which represent the sources of the DEs themselves. As Nadkarni and Brandt [14] pointed out, the CDE Browser lacks interconcept relationships and a synonym search function, and it simply returns a long unorganized list of DEs and their contexts. The browse registry in the Australian Metadata Online Registry (METeOR, http://meteor.aihw.gov.au/) [15] provides an improved search interface, whereby the following six object class groups are displayed in the tree structure: Entity, Life event, Person/group of persons, Service episode, Service/Care event, and Service/Care provider. Properties are classified into 29 groups, such as Accommodation/living characteristics, Birth event, Client characteristic, Communication characteristics, and Crisis event. However, it is far from clear whether the classification scheme is semantically sound and comprehensive. The same problem exists in the Clinical and Histopathological Metadata Registry (CHMR) that was created by three of the present authors [16]. The CHMR currently lacks a means by which to effectively browse and identify DEs by semantic groups.
Existing reference terminologies, such as Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) and Medical Subject Headings (MeSH), may be used as alternatives for indexing and retrieving DEs from MDRs. However, their functions and structures are not optimized for DEs. The content structure of SNOMED-CT, which was devised to support effective clinical recording of data across specialties and sites of care, is too complex for browsing and retrieving relevant DEs, and MeSH, which was devised for the indexing and retrieval of biomedical journal articles, lacks the terms and content structures required for consistent indexing and retrieval of DEs across multiple clinical MDRs. Locating all relevant DEs across multiple MDRs, at least in the clinical domain, is very difficult because no reliable standard exists for identifying DEs, and there are several different local classification schemes. This problem demands a solution for organizing clinical DEs according to a global concept system.
The purpose of the present study was to develop a system, named the Clinical Data Element Ontology (CDEO), which can be applied to, but is not limited to, ISO/IEC 11179 compliant MDRs. As a global reference concept system, it was intended that the CDEO would enable the unified indexing of DEs registered in multiple MDRs.
1. Related Work
For consistent document naming in the Health Level 7 (HL7)/Logical Observation Identifiers Names and Codes (LOINC), Frazier et al. [17] proposed the Document Ontology (DO), which initially comprised seven primitive axes that were ultimately reduced to five. Shapiro et al. [18] extended the Clinical Category axis of the DO (renamed as Subject Matter Domain) to organize and facilitate searching for clinical documents. Chen et al. [19] mapped local clinical document names to the DO and to specific LOINC codes. They identified some limitations in the DO with respect to coverage, granularity, and loss of meaning. Other studies [20, 21] produced an ontology-based definition management for the creation and maintenance of clinical document templates. While the focus of these previous studies was on developing ontologies for the classification of the clinical documents themselves, the present study of the CDEO focused on the indexing and retrieval of DEs, which are the building blocks of clinical documents. We believe that the CDEO is also useful for organizing clinical documents.
II. Methods
1. Setting the Basic Requirements
An ontology with a strong structure is required for effective unified indexing and efficient retrieval of DEs [22]. Therefore, the following basic requirements that must be met by the CDEO were set:
1) Index data element concepts (DECs). This is based on a DEC being an abstraction of one or more DEs in MDRs, which facilitates coherent grouping and selective searching for DEs.
2) Organize DECs into a poly-hierarchical structure, taking into consideration their conceptual relationships, such as broader, narrower, and related concepts.
3) Control the terms and language variants of a concept in order to enable consistent indexing and multilingual searches for DECs across different MDRs.
4) Enable each DEC to have a unique uniform resource identifier (URI) so that it can be uniquely identified and referenced by different MDRs.
5) Be able to identify the two main components of a DEC, object class and property.
6) Assist users to postcoordinate concepts to find relevant DECs.
The above requirements were used to evaluate the CDEO.
2. Data Representation of the CDEO using the Simple Knowledge Organization System
The Simple Knowledge Organization System (SKOS) [23] was chosen to represent and organize concepts in the CDEO. This decision was made because SKOS is an official W3C standard and is highly regarded worldwide. Moreover, in the new version of the ISO/IEC 11179-3 standard [24], SKOS supports the concept system that is used for modeling a classification scheme (Figure 1).
SKOS is a Resource Description Framework (RDF) vocabulary for expressing the basic structure and content of concept systems, such as thesauri, classification schemes, subject headings, or terminologies within the framework of the Semantic Web. The key elements of the SKOS include the following:
1) Concept classes (skos:ConceptScheme, skos:Concept, and skos:Collection).
2) Labeling properties (skos:prefLabel, skos:altLabel, and skos:hiddenLabel).
3) Semantic relation properties (skos:narrower, skos:broader, and skos:related).
4) Documentation properties (skos:note, skos:definition, skos:changeNote, skos:scopeNote, skos:editorialNote, skos:historyNote, and skos:example).
Figure 2 illustrates how a concept is organized and represented with the SKOS model. The preferred label, the alternative label, and the definition for the concept 'Patient' are represented in different languages: 'Patient' in English, '환자' in Korean, and '病人' in Chinese. Its broader (Person by Clinical Role), narrower (Outpatient, Inpatient), and related (Doctor) concepts are represented by the semantic relation properties.
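As a rough illustration of the structure described for Figure 2, the following Python sketch holds the 'Patient' concept in a small data class and prints it as Turtle. The local names (cdeo:Patient, cdeo:PersonByClinicalRole, and so on), the language tags, and the Concept class itself are assumptions made for this example and are not taken from the published cdeo.rdf; only the SKOS property names and the labels come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Prefix declarations; the cdeo namespace is the one given in the article.
PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix cdeo: <http://www.snubi.org/software/cdeo/> .\n"
)


@dataclass
class Concept:
    """One SKOS concept; the local names used here are hypothetical."""
    local_name: str
    pref_labels: Dict[str, str]  # language tag -> preferred label
    broader: List[str] = field(default_factory=list)
    narrower: List[str] = field(default_factory=list)
    related: List[str] = field(default_factory=list)

    def to_turtle(self) -> str:
        """Serialize the concept as one Turtle subject block."""
        parts = [f"cdeo:{self.local_name} a skos:Concept"]
        for lang, label in self.pref_labels.items():
            parts.append(f'    skos:prefLabel "{label}"@{lang}')
        relations = (
            ("broader", self.broader),
            ("narrower", self.narrower),
            ("related", self.related),
        )
        for relation, targets in relations:
            for target in targets:
                parts.append(f"    skos:{relation} cdeo:{target}")
        return " ;\n".join(parts) + " ."


patient = Concept(
    local_name="Patient",
    pref_labels={"en": "Patient", "ko": "환자", "zh": "病人"},
    broader=["PersonByClinicalRole"],
    narrower=["Outpatient", "Inpatient"],
    related=["Doctor"],
)

patient_ttl = PREFIXES + "\n" + patient.to_turtle()
print(patient_ttl)
```

Alternative labels and definitions could be added in the same way with skos:altLabel and skos:definition; they are omitted here because the text does not list their values.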
3. Development Strategy
The CDEO was developed using both top-down and bottom-up approaches; that is, harmonization of existing clinical document models was complemented by empirical analysis of the MDRs.
1) Top-down approach
Several clinical document models provided a powerful reference point for the creation of the top concepts for the CDEO. Each model reflects its own perspective on clinical documentation. The Clinical Investigation Record (CIR) ontology reflects the investigator's clinical care delivery process, with the five information types: Observation, Opinion, Instruction, Action, and Administrative event [25]. The DO focuses on clinical document classification and naming with five axes: Service, Kind of Narrative, Setting, Role, and Subject Matter Domain. The HL7 Reference Information Model (RIM) represents a static and behavioral model of healthcare workflow with three foundation classes: Acts, Entities, and Roles [26]. These models are essentially prescriptive, which is somewhat inconsistent with the reality of clinical documentation. Although they provide suitable class levels, these models are barely sufficient to incorporate all concepts included in clinical DEs. Thus, we identified the overlap among the three models, and harmonized their views to define a manageable number of DEC groupings.
2) Bottom-up approach
In searching for empirical evidence for grouping DECs (as of May 23, 2013), we analyzed the occurrence of the concepts in the object classes and the properties in the caDSR (n = 4,414 and n = 5,638, respectively) and the CHMR (n = 2,091 and n = 878, respectively). The most frequently occurring concepts in the object classes in the caDSR included 'procedure,' 'patient,' and 'therapy,' while the most frequently occurring concepts in the properties were 'name,' 'duration,' 'date,' and 'outcome.' Many concepts (960 in total) in the object classes also occurred in the properties, which indicates that they probably share a conceptual domain.
A DEC is the combination of an object class and a property. We thus simply adopted this structure to our ontology and added Qualifier to limit or specialize the meaning of the concept being represented. The distribution of concepts into the Object Class and Property is determined by the meaning dependency. For instance, the object class 'person' is meaningful when it is used alone, whereas the property 'age' is not. Properties should be associated with one or more object classes to convey the meaning of a DEC. The choice to make groups for Object Class, Property, and Qualifier appears to be appropriate for our purpose.
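The structure adopted here can be pictured as a small record type. The Python sketch below is only an illustration: the class name, the field names, and the labelling rule are assumptions, and the qualifier 'Maximum' is used merely as a sample value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataElementConcept:
    """Hypothetical record for a DEC: object class + property (+ qualifier)."""
    object_class: str                # meaningful on its own, e.g. 'Person'
    prop: str                        # depends on an object class, e.g. 'Age'
    qualifier: Optional[str] = None  # optionally limits or specializes meaning

    def __post_init__(self) -> None:
        # A property alone does not convey the meaning of a DEC.
        if not self.object_class or not self.prop:
            raise ValueError("a DEC needs both an object class and a property")

    def label(self) -> str:
        """Build a readable label such as 'Patient age'."""
        words = [self.qualifier, self.object_class, self.prop]
        return " ".join(w for w in words if w).capitalize()


patient_age = DataElementConcept("Patient", "Age")
max_patient_age = DataElementConcept("Patient", "Age", qualifier="Maximum")
print(patient_age.label())
print(max_patient_age.label())
```

Rejecting a record without an object class mirrors the meaning-dependency argument: 'age' is not interpretable until it is attached to something such as 'person'.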
We aimed to ensure that the CDEO covers the comprehensive and detailed themes of DECs. Thus, after defining the top concepts, subconcepts were populated from the caDSR (the most comprehensive metadata repository in the biomedical domain) and from existing vocabularies such as SNOMED-CT and MeSH. When a new concept was introduced, for the sake of semantic soundness, a methodological question was asked: is concept 'A' truly a subtype of concept 'B'?
4. Indexing with the CDEO for Evaluation
The usefulness of the CDEO as a means of categorizing DECs according to some shared high-level characteristics was evaluated. We investigated whether the CDEO produces a 'coherent' index to a given DEC across different indexers who might have different perspectives. One medical record administrator and two PhD students at the Division of Biomedical Informatics indexed DECs using the CDEO without any guidance with respect to choosing terms. They were allowed to assign more than one CDEO term to a given DEC whenever they considered it appropriate. Four hundred DECs were randomly selected from the caDSR. The inter-indexer agreement level was determined using the intraclass correlation coefficient (ICC), which assesses rating reliability by comparing the variability of different ratings of the same subject to the total variation, and was computed using the psych package in R 3.0. For example, for a given DEC, if one indexer shared two terms with the other two indexers, who each shared three terms, the first indexer obtained a score of 0.67 (2/3), and the other two indexers each obtained a score of 1. This validation process made it possible to evaluate different interpretations of the same DEC as well as the consistency of DEC assignments.
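The scoring example and the ICC can be made concrete with a short Python sketch. Two things here are assumptions rather than the authors' procedure: the scoring rule (terms shared with any other indexer, divided by the largest such count) is one reading of the worked example, and the ICC form shown is the one-way random-effects, single-rater coefficient, since the text does not state which of the forms reported by the psych package was used. The term sets are invented for illustration.

```python
from typing import List, Sequence, Set


def agreement_scores(term_sets: Sequence[Set[str]]) -> List[float]:
    """Score each indexer for one DEC as shared terms / largest shared count."""
    shared = []
    for i, terms in enumerate(term_sets):
        others: Set[str] = set()
        for j, other_terms in enumerate(term_sets):
            if j != i:
                others |= other_terms
        shared.append(len(terms & others))
    top = max(shared)
    return [round(s / top, 2) if top else 0.0 for s in shared]


def icc_oneway(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(1,1): one-way random effects, single rater (an assumed form).

    Each row holds the scores of all indexers for one DEC.
    """
    n, k = len(ratings), len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-DEC and within-DEC mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical term sets chosen by three indexers for a single DEC.
example_scores = agreement_scores([
    {"Patient", "Age", "Temporal"},
    {"Patient", "Age", "Quantitative"},
    {"Patient", "Age", "Quantitative"},
])
print(example_scores)
```

In the study, one such row of scores per DEC (400 rows, three columns) would form the input to the ICC calculation.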
III. Results
1. Structure of the CDEO
The structure of the CDEO arises partly as a compromise between existing clinical document models. Figure 3 shows that five of the nine top concepts in the object class correspond to those in the three clinical document models. Agent corresponds to the subtypes of Entity in the RIM and to Practice Setting and Clinical Category in the DO via Social Agent and Social Group. Activity corresponds to Act in the RIM and to Documenting Act (e.g., report) and Temporal Event (e.g., discharge) in the initial version of the DO; Clinical Finding corresponds to Observation, and Sign and Symptom corresponds to Opinion, in the CIR; and Artifact, via Document, corresponds to Kind of Narrative in the DO. Anatomy and Phenomenon were introduced because they include many subconcepts that appear frequently in MDRs.
The last two top concepts are Property and Qualifier (Table 1). Property includes Geospatial (e.g., Location), Identification (e.g., Name), Qualitative (e.g., Color), Quantitative (e.g., Frequency), Temporal (e.g., Date), and Unclassified (e.g., Impact) properties. The Qualifier specializes or limits the meaning of concepts. The CDEO classifies the Qualifier into five groups: Ordinal (e.g., First), Quantitative (e.g., Maximum), Sentential (e.g., Unique), Spatial (e.g., Lower), and Temporal (e.g., Early).
2. SKOS Representation of the CDEO
Implementation of the CDEO with SKOS resulted in 578 concepts (as of October 3, 2014), including 9 top concepts. The prefix 'cdeo' was declared for the dereferenceable URI (http://www.snubi.org/software/cdeo/), where an MDR application can identify cdeo.rdf. Table 2 lists the namespace prefixes and their URIs used in the CDEO.
3. Use-Case Scenarios
In this section we present two application use-case scenarios to demonstrate the utility of the CDEO. The first scenario is shown in Figure 4A. An indexing tool loads RDF data from the CDEO and the metadata objects from an MDR, and then displays the CDEO concepts in an appropriate style (e.g., a tree view). Given a DEC, the user identifies appropriate CDEO concepts to map. For each object class and property, the concept source (CDEO) and identifier (the URI of the CDEO concept) are in turn inserted back into the MDR (see Table 3). When a user queries against the MDR, the matched metadata objects are presented to the user.
The second scenario is illustrated in Figure 4B. The CDEO is intended for use as a global reference concept system for MDRs because existing ontologies are limited for this purpose; for instance, MeSH is used for indexing biomedical research articles and thus does not contain any indexed DECs. The CDEO is imported into the Index Ontology (IDO), an ontology that simply comprises one class (ido:DataElementConcept) and two properties (ido:hasObjectClass and ido:hasProperty). Given a DEC originating from one of the listed MDRs, the mapping is conducted and the mapping results are stored in the IDO in RDF format. In a search session, the user navigates concepts in a CDEO-supported browser and coordinates the terms ('Person' for object class and 'Age' for property) to identify the correct DEC: 'Patient age.' Meanwhile, the browser's backend fetches the matching URIs from the IDO and inserts them into the structured query language (SQL) query string for the MDRs in order to present the user with the matched metadata objects.
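The postcoordinated search in this scenario can be sketched in a few lines of Python. This is a simplified stand-in, not the described implementation: an in-memory list replaces the RDF store and the SQL query against the MDRs, plain concept names replace URIs, and the hierarchy fragment and index entries are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical fragment of the CDEO hierarchy (skos:narrower links).
NARROWER: Dict[str, List[str]] = {
    "Person": ["Person by Clinical Role"],
    "Person by Clinical Role": ["Patient"],
    "Patient": ["Outpatient", "Inpatient"],
}

# Hypothetical IDO entries: one ido:DataElementConcept per row, with its
# ido:hasObjectClass and ido:hasProperty values and the source MDR.
IDO_INDEX = [
    {"dec": "Patient age", "mdr": "caDSR",
     "hasObjectClass": "Patient", "hasProperty": "Age"},
    {"dec": "Patient name", "mdr": "CHMR",
     "hasObjectClass": "Patient", "hasProperty": "Name"},
    {"dec": "Procedure date", "mdr": "caDSR",
     "hasObjectClass": "Procedure", "hasProperty": "Date"},
]


def with_narrower(concept: str) -> Set[str]:
    """Return the concept together with all of its narrower concepts."""
    seen: Set[str] = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(NARROWER.get(current, []))
    return seen


def find_decs(object_class: str, prop: str) -> List[dict]:
    """Match DECs whose object class and property fall under the chosen terms."""
    object_classes = with_narrower(object_class)
    properties = with_narrower(prop)
    return [
        entry for entry in IDO_INDEX
        if entry["hasObjectClass"] in object_classes
        and entry["hasProperty"] in properties
    ]


matches = find_decs("Person", "Age")
print([m["dec"] for m in matches])
```

Expanding each chosen term to its narrower concepts is what lets the broad term 'Person' reach a DEC indexed under 'Patient'.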
4. Evaluation Relative to the Requirements
To demonstrate the utility of the CDEO, a competency question should be answered: Does the CDEO satisfy all six requirements set at the beginning of the study? It does. The CDEO inherits the merits of SKOS. It uniquely identifies DECs through the namespace URI. It also controls the labels and language variations of the object class and property through labeling properties and language encoding. Thus, a search for '환자' returns all DECs including 'patient' with their source URIs. The CDEO also provides a poly-hierarchical organization of object classes, properties, and qualifiers, and indexing for each object class and property in the DEC. This enables users to coordinate terms for more precise mapping on the semantic structure of the CDEO.
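The multilingual lookup mentioned above can be illustrated with a minimal Python sketch. The concept identifiers (cdeo:Patient and so on), the label table, and the list of DECs are hypothetical; the point is only that a label in any language resolves to the same concept, and therefore to the same DECs.

```python
from typing import Dict, List, Tuple

# Hypothetical labels per concept, keyed by language tag.
LABELS: Dict[str, Dict[str, str]] = {
    "cdeo:Patient": {"en": "Patient", "ko": "환자", "zh": "病人"},
    "cdeo:Age": {"en": "Age"},
    "cdeo:Name": {"en": "Name"},
}

# Hypothetical indexed DECs: (label, object class concept, property concept).
DECS: List[Tuple[str, str, str]] = [
    ("Patient age", "cdeo:Patient", "cdeo:Age"),
    ("Patient name", "cdeo:Patient", "cdeo:Name"),
]


def concepts_for(term: str) -> List[str]:
    """Find concepts that carry the term as a label in any language."""
    wanted = term.casefold()
    return [
        concept for concept, labels in LABELS.items()
        if any(label.casefold() == wanted for label in labels.values())
    ]


def search_decs(term: str) -> List[str]:
    """Return the DECs indexed with any concept that matches the term."""
    concepts = set(concepts_for(term))
    return [
        label for label, object_class, prop in DECS
        if object_class in concepts or prop in concepts
    ]


print(search_decs("환자"))
```

Because matching happens on the concept rather than on the label string, the Korean, Chinese, and English labels for the same concept return identical results.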
5. Evaluation Result with Classification Agreement among Indexers
Four hundred DECs were mapped to the CDEO terms, and the ICC among the three indexers was estimated to be 0.59 (95% confidence interval, 0.52-0.66), which is a moderate level of inter-indexer agreement. This indicates that the CDEO needs to provide users with semantic information or a definition for each CDEO term in order to reduce vagueness when indexing or categorizing DECs.
IV. Discussion
This study has developed a novel global reference concept system, called the CDEO, for unified indexing and effective locating of DEs registered across multiple clinical MDRs. This ontology, which was implemented with SKOS, met all of the considered requirements. Its utility was demonstrated through two use-case scenarios, and the indexing evaluation yielded a moderate level of inter-indexer agreement.
Several implications of the CDEO are worth noting. As a global reference concept system, the CDEO organizes DECs, which often arise from different MDRs, into a single conceptual structure. It provides contextual knowledge about the DECs while simultaneously considering the semantic relationships between them. Thus, the CDEO promotes the unified indexing of DECs registered across different MDRs.
In terms of the findability of DEs, the user interface may utilize the structure of the CDEO for highly selective searching and retrieval of relevant DEs. The multilingual representation of DECs also enables users to browse and search DEs in different languages.
We expect that the CDEO will also be useful for clinical documentation. It groups DECs according to semantic structure, and lists all DEs bound to them. Thus, it facilitates the selection of appropriate DEs for the generation of a specific clinical document form.
The CDEO is also a useful tool for clinical trials and research that depend upon the efficient collection of well-defined and appointed DEs. The definition of data exchange protocols, data quality management, and statistical data analysis models (which are necessary but very time-consuming for the development of DE collections) sustain these DEs and their definitions, value lists, and plausibility checks. The semantic structure of the CDEO reduces the burden of identifying and maintaining relevant DEs, and hence it facilitates the aggregation and synthesis of data arising from different sources.
The present study was subject to some conceptual limitations. We do not claim that the CDEO includes an exhaustive list of concepts. However, this ontology is not designed for use in deep indexing; rather, it organizes similarly faceted DECs together. Moreover, postcoordination of broader concepts would cover concepts that are not yet included in the ontology. It could be argued that primitive axes are essentially counterintuitive, and the model may well contain flaws. The model proposed herein is not the answer, but rather one of the possible answers.
Concerning the browsing interface, seemingly counterintuitive deployment can be complemented by a keyword search, which is a quick way to provide users with assistance in locating potentially useful DEs. We would simply claim that the CDEO is a reference concept system for identifying DECs that can be used to increase the organization and findability of DEs. An unavoidable issue is that within the same reference concept system, indexers may interpret the same DE differently, as indicated by the level of indexer agreement found in this study; this is called the indexer effect [11]. This index inconsistency could be minimized by providing a CDEO indexing guide.
The implications and limitations of the CDEO have been identified. Further work includes polishing and refining the CDEO axes so that they provide more granular and enhanced coverage. The addition of language encoding (e.g., @fr) to the labeling properties and lexical variants is one way of enhancing the utility of this model. The implementation of applications for (semiautomatic) indexing and semantic searches is another possible step. The initial version of the CDEO includes only a small number of concepts, leaving room for this ontology to be extended.
We hope that the present work will be a starting point for professional vetting, discussion, and endorsement of the CDEO. We also expect the CDEO to promote the adoption of MDR standards in the clinical domain.
Acknowledgments
The authors thank Insuk Goh and Eun Yang Ahn for their work on mapping data element concepts of our Clinical and Histopathological Metadata Registry to the Clinical Data Element Ontology. This work was supported by the ICT Standardization program of the Ministry of Science, ICT & Future Planning. This research was supported by a grant of the Korean Health Technology R&D Project, Ministry of Health and Welfare (No. HI13C2164).
Notes
No potential conflict of interest relevant to this article was reported.