Construction of an Electrocardiogram Database Including 12 Lead Waveforms
Abstract
Objectives
Electrocardiogram (ECG) data are important for the study of cardiovascular disease and adverse drug reactions. Although the development of analytical techniques such as machine learning has improved our ability to extract useful information from ECGs, there is a lack of easily available ECG data for research purposes. We previously published an article on a database of ECG parameters and related clinical data (ECG-ViEW), which we have now updated with additional 12-lead waveform information.
Methods
All ECGs stored in portable document format (PDF) were collected from a tertiary teaching hospital in Korea over a 23-year study period. We developed software that extracts all ECG parameters and waveform information from the PDF ECG reports and stores them in a database (metadata) and in text files (raw waveforms).
Results
Our database includes all parameters (ventricular rate, PR interval, QRS duration, QT/QTc interval, P-R-T axes, and interpretations) and 12-lead waveforms (for leads I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, and V6) from 1,039,550 ECGs (from 447,445 patients). Demographics, drug exposure data, diagnosis history, and laboratory test results (serum calcium, magnesium, and potassium levels) were also extracted from electronic medical records and linked to the ECG information.
Conclusions
Electrocardiogram information that includes 12-lead waveforms was extracted and transformed into a form that can be analyzed. The description and program code in this case report could serve as a reference for other researchers who wish to build ECG databases from their own local ECG repositories.
I. Introduction
The electrocardiogram (ECG) has been widely used to diagnose various cardiovascular diseases, including arrhythmia and acute coronary syndrome [1,2,3], because it is a non-invasive and convenient tool for measuring the continuous wave sequence that characterizes heart activity [2,4].
Information from ECGs is also used to detect a prolonged QT interval, which is one of the life-threatening adverse drug reactions (ADRs). A prolonged QT interval leads to an irregular heartbeat and can result in various types of cardiac arrest including ventricular fibrillation, ventricular tachyarrhythmia, Torsades de Pointes, and sudden death [5,6,7]. Because of its importance as a drug-induced adverse reaction, a prolonged QT interval is strictly monitored and regulated [5].
To address the needs of ECG data analysis, we previously constructed two ECG databases, Electrocardiogram Vigilance with Electronic data Warehouse I (ECG-ViEW I) and ECG-ViEW II, using ECG data measured in a tertiary teaching hospital located in Korea [8,9]. However, these previous versions of the database were limited in that they did not provide the ECG waveforms. To overcome this limitation, we designed an update of the ECG database.
II. Case Description
This study was a retrospective review of Electronic Health Records and was approved by the Ajou University Hospital Institutional Review Board (No. AJIRB-MED-MDB-18-075), which also waived the requirement for informed consent.
1. Data Resources and Patient Characteristics
ECG-ViEW I and II covered three data sources: scanned images of paper-based ECGs, ECGs in portable document format (PDF) from the MUSE system (GE Healthcare, Waukesha, WI, USA), and image files stored in the hospital's Electronic Medical Records (EMRs). In contrast, in this study we used only the PDF ECGs from the MUSE system and no images, for the following reasons. The images scanned from paper-based ECGs and the image files from the EMRs are saved as pixel images; the quality of information extracted by optical character recognition (OCR) depends on the quality of the image, and there is no appropriate way to extract waveforms from such images with high accuracy. On the other hand, the quality of data extracted from the MUSE PDF files is stable and well controlled. Moreover, because the waveforms in the PDF files are saved in scalable vector graphics (SVG) format, it is possible to export the waveforms while maintaining the quality of the raw data [10].
2. ECG Data Extraction
An ECG report typically contains both alphanumeric values and waveform graphs (Figure 1). The upper part of the ECG report is a list of alphanumeric values including demographic information, patient ID, evaluation date, ECG parameters, and interpretations (e.g., normal sinus rhythm). Demographic information refers to basic patient information including name, age, sex, and ethnicity. ECG parameters include ventricular rate, PR interval, QRS duration, QT/QTc, and P-R-T axes. The waveform graphs, which typically cover the middle and bottom parts of the ECG report, are time-series graphs representing the sensor measurement data.
The alphanumeric data were converted from PDF to eXtensible Markup Language (XML) format to increase the accuracy of the parsing results. There are two main differences between parsing PDF and parsing XML. First, the formats differ in how irrelevant data are handled. In PDF format, data are saved and parsed according to object type, and deletion of unnecessary data requires careful manual revision. In contrast, the XML format enables conditional parsing, so relevant data can be extracted using automated code. Second, the XML format provides x- and y-coordinates for each piece of data. The origin of the coordinate axes is the upper left corner of the ECG report. Thus, alphanumeric information can be extracted based on its position (i.e., the location of each piece of data on the ECG report).
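The position-based extraction described above can be sketched as follows. This is a minimal illustration, not the code from Supplementary 1: the XML fragment imitates the `<text top=... left=...>` elements that pdftohtml writes in XML mode, and the coordinates, values, bounding box, and the function name texts_in_box are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of pdftohtml XML output; the
# coordinates and values are invented for illustration.
SAMPLE_XML = """<pdf2xml>
  <page number="1" top="0" left="0" height="918" width="1188">
    <text top="60" left="40" width="90" height="12" font="0">Vent. rate</text>
    <text top="60" left="160" width="20" height="12" font="0">72</text>
    <text top="60" left="190" width="30" height="12" font="0">BPM</text>
    <text top="76" left="40" width="90" height="12" font="0">QRS duration</text>
    <text top="76" left="160" width="20" height="12" font="0">88</text>
    <text top="400" left="40" width="200" height="12" font="0">25mm/s 10mm/mV</text>
  </page>
</pdf2xml>"""


def texts_in_box(xml_string, left, top, right, bottom):
    """Return text items whose top-left corner lies inside the given box.

    Coordinates are measured from the upper left corner of the page.
    """
    root = ET.fromstring(xml_string)
    hits = []
    for node in root.iter("text"):
        x = int(node.get("left"))
        y = int(node.get("top"))
        if left <= x <= right and top <= y <= bottom:
            # itertext() also collects text nested in <b> or <i> tags.
            hits.append((y, x, "".join(node.itertext()).strip()))
    # Sort top-to-bottom, then left-to-right, to restore reading order.
    return [text for _, _, text in sorted(hits)]


# Hypothetical box covering the column of numeric values in the header.
values = texts_in_box(SAMPLE_XML, left=150, top=50, right=180, bottom=100)
```

Because the layout of a given report template is fixed, one bounding box per field is enough to select a value by position alone; the boxes would have to be measured once for each template.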
The part of the PDF file containing the waveform data is stored as metadata of the image in SVG format. To extract the image data, we used INKSCAPE (open-source software, https://inkscape.org) in Linux and converted the waveforms in PDF to SVG format images. The waveforms in the SVG format were processed using the svgpathtools Python library, which classifies the parsed data into three categories: path, attribute, and svg_attribute. We used the ‘path’ data to convert the waveform into a numeric series. A ‘path’ can be considered to be composed of complex numbers whose real and imaginary parts describe the starting and ending points (i.e., points on a complex plane). After transforming this complex plane into a Cartesian coordinate system (Figure 2A), the starting point of each waveform was shifted to ‘(0,0)’ (Figure 2B). The baseline of each waveform had to be reset because the starting point of a waveform in the raw data corresponds to its position on the ECG report. Next, we rescaled the values of the time series to millivolts (mV); 1 mV corresponds to 10 times the height of the grid unit (square) used in the ECG report (Figure 2C). Finally, we converted the x- and y-coordinates of the vector images to an equidistant time series similar to that obtained from the sensor (Figure 2D). We set the sampling frequency at 500 Hz and determined the data points on the waveform by linear interpolation between points with known coordinates (the frequency of the raw data varied from about 200 to 420 Hz).
There are 13 waveforms in a single ECG report: 3-second strips for each of the 12 leads, plus a 10-second strip, usually for lead II. Each waveform was saved in a separate file in comma-separated value (CSV) format, resulting in 13 waveform CSV files per ECG report. The waveform files were compressed using gzip, while their metadata were stored in a database to link the waveform data with the corresponding alphanumeric values from the ECG report as well as with the clinical data from the EMR.
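The four conversion steps can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions rather than the code in Supplementary 1: the path vertices are written as complex literals (the form in which svgpathtools exposes path endpoints), the scale factors units_per_second and units_per_mv are hypothetical and would depend on the report layout, and the sign flip assumes that SVG y values grow downward.

```python
def to_cartesian(points):
    """Split complex points (x + yj) into (x, y) pairs."""
    return [(p.real, p.imag) for p in points]


def reset_baseline(xy):
    """Shift the series so that its first point becomes (0, 0)."""
    x0, y0 = xy[0]
    return [(x - x0, y - y0) for x, y in xy]


def to_physical_units(xy, units_per_second, units_per_mv):
    """Convert drawing units to seconds and millivolts.

    SVG y values grow downward, so the sign is flipped to make upward
    deflections positive (an assumption about the drawing orientation).
    """
    return [(x / units_per_second, -y / units_per_mv) for x, y in xy]


def resample(tv, rate_hz=500):
    """Linearly interpolate (time_s, value) pairs onto an even time grid.

    Assumes at least two points with strictly increasing times.
    """
    n_samples = int(round(tv[-1][0] * rate_hz)) + 1
    samples = []
    j = 0  # index of the segment [tv[j], tv[j + 1]] that contains t
    for i in range(n_samples):
        t = i / rate_hz
        while j < len(tv) - 2 and tv[j + 1][0] < t:
            j += 1
        (t0, v0), (t1, v1) = tv[j], tv[j + 1]
        # Clamp the weight so rounding never extrapolates past a segment.
        weight = min(max((t - t0) / (t1 - t0), 0.0), 1.0)
        samples.append(v0 + weight * (v1 - v0))
    return samples


# Hypothetical path vertices (x + yj) in drawing units, unevenly spaced in x.
raw_points = [100 + 200j, 102.5 + 190j, 105 + 200j, 110 + 200j]

xy = reset_baseline(to_cartesian(raw_points))
tv = to_physical_units(xy, units_per_second=250.0, units_per_mv=10.0)
signal_mv = resample(tv, rate_hz=500)
```

With these invented scale factors the four vertices span 0.04 s, so the resampled series has 21 evenly spaced samples, and the peak vertex falls exactly on a grid point. When a vertex lies between grid points, the resampled series only approximates it, which is why the interpolation quality needs to be checked separately.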
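Writing and reading one gzip-compressed waveform file could look like the sketch below. The column names, the file-naming scheme, and the helper names write_waveform and read_waveform are assumptions made for illustration; the text does not specify the CSV layout.

```python
import csv
import gzip
import os
import tempfile


def write_waveform(path, samples_mv, rate_hz=500):
    """Write one waveform as a gzip-compressed CSV of (time_s, value_mv) rows."""
    with gzip.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["time_s", "value_mv"])
        for i, value in enumerate(samples_mv):
            writer.writerow([i / rate_hz, value])


def read_waveform(path):
    """Read the voltage column back as a list of floats."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle)
        return [float(row["value_mv"]) for row in reader]


# Hypothetical file name: one file per ECG report and strip.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "ecg000001_lead_II.csv.gz")
write_waveform(path, [0.0, 0.4, 0.8, 1.0])
restored = read_waveform(path)
```

Keeping only an identifier and the file path in the database row, as the text describes for the metadata, lets the large waveform files live outside the database while remaining linkable to the clinical data.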
3. Software Tools
We used a Java program to extract the PDF files from the MUSE system, and a Linux-based program (pdftohtml) to convert PDF to the XML format. INKSCAPE for Linux was used to convert PDF to the SVG format. The XML and SVG files were parsed using Python: the ElementTree library was used to parse the XML files, and the svgpathtools library was used to extract the waveform data from the SVG files. All software tools and code used during the parsing process are available in Supplementary 1.
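The two conversion steps could be driven from Python roughly as follows. This sketch only builds the command lines; the paths are hypothetical, the Inkscape options shown are those of the 1.x command-line interface (an assumption, since older releases use different options), and the exact commands used by the authors are in Supplementary 1.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, out_dir):
    """Return the two conversion commands for one ECG report (not executed)."""
    pdf = Path(pdf_path)
    out = Path(out_dir)
    # pdftohtml is expected to add the ".xml" extension to this base name.
    xml_base = out / pdf.stem
    svg_file = out / (pdf.stem + ".svg")
    to_xml = ["pdftohtml", "-xml", str(pdf), str(xml_base)]
    # Inkscape 1.x syntax (assumption); 0.9x releases use different options.
    to_svg = ["inkscape", str(pdf), "--export-type=svg",
              "--export-filename=" + str(svg_file)]
    return to_xml, to_svg


def convert(pdf_path, out_dir):
    """Run both tools; requires pdftohtml and inkscape on the PATH."""
    for command in build_commands(pdf_path, out_dir):
        subprocess.run(command, check=True)


# Hypothetical input and output locations.
to_xml, to_svg = build_commands("reports/ecg000001.pdf", "converted")
```

Passing the commands as argument lists rather than shell strings avoids quoting problems with unusual file names.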
4. Data Validation and Quality Control
The accuracy of the ECG data extraction was validated by comparing the extracted QTc values with calculated QTc values. QTc can be calculated from the QT and RR intervals using Bazett's formula. Approximately 99.94% of the extracted QTc values matched the calculated QTc values within ± 2 ms. We attribute the remaining differences to rounding: the QT and RR values that we used to calculate QTc had already been rounded to integers, and thus they might differ slightly from the original values used by the ECG machine when calculating QTc. For the extracted results in which the difference from the calculated QTc value was relatively high (> 2 ms), we manually reviewed the results and confirmed that there was no error in the data extraction process.
Because the frequency of the waveforms was adjusted to 500 Hz, the converted waveform could differ slightly from the raw data at locations where a time series data point does not coincide with the original x- and y-coordinates. To validate the quality of the linear interpolation, we compared the original waveform with the converted waveform for randomly chosen waveforms. As shown in Figure 3, the difference was not noticeable.
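The consistency check can be sketched as follows. Bazett's formula is QTc = QT / sqrt(RR), with RR expressed in seconds; the function names, the choice of milliseconds as the input unit, and the example values are assumptions made for illustration.

```python
import math


def bazett_qtc(qt_ms, rr_ms):
    """Bazett's correction: QTc = QT / sqrt(RR), with RR in seconds."""
    return qt_ms / math.sqrt(rr_ms / 1000.0)


def qtc_matches(extracted_qtc_ms, qt_ms, rr_ms, tolerance_ms=2.0):
    """True if the extracted QTc agrees with the recalculated value."""
    return abs(extracted_qtc_ms - bazett_qtc(qt_ms, rr_ms)) <= tolerance_ms


# Hypothetical values: QT = 380 ms and RR = 800 ms give a QTc near 424.9 ms.
ok = qtc_matches(425, qt_ms=380, rr_ms=800)
flagged = not qtc_matches(430, qt_ms=380, rr_ms=800)
```

Records for which the check fails would be the ones set aside for manual review.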
5. Data in the Database
The ECG database contains a total of 1,039,550 ECGs from 447,445 patients (Table 1). The mean follow-up period per person was 717 ± 1,534 days.
6. Software Availability
All program code for extracting both the alphanumeric and waveform data is provided in Supplementary 1.
III. Discussion
The update of the ECG databases includes a complete dataset in which the relevant data (ECG values, ECG waveforms, and demographic, diagnosis, medication, laboratory, and any other information related to the hospital visit) are provided for all patients covered by the database. Therefore, the database could be used as a data source in various studies including comprehensive clinical evaluation to determine the potential associations between ECG values or patterns and specific diagnoses, medications, or hospital visit characteristics.
We are currently working on collecting biosignal data from patient monitoring devices from more than 100 beds in an emergency room, intensive care units, and an operating room [11]. All biosignals including ECG lead II, peripheral capillary oxygen saturation, respiration, arterial blood pressure, central venous pressure, and end-tidal CO2 data are collected onto a local server. By constructing the ECG database from the ECG reports, we could expand coverage of the biosignal collection into general wards.
The ECG database described in this article is one of the largest ECG databases linked to relevant clinical data. This database integrates all 12-lead ECG waveforms, not only the numeric ECG parameters, patient demographics, diagnosis data, and drug prescription data. Although the full dataset cannot be made publicly available due to legal restrictions imposed by the Korean government in relation to the Personal Information Protection Act, we expect that the description of the process for constructing the database and the program code provided in Supplementary 1 will be a good reference for other researchers who wish to build ECG databases from their own local ECG repositories.
Acknowledgments
This research was supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (No. HI16C0982, HI17C0970, and HI6C0992).
Notes
Conflict of Interest: No potential conflict of interest relevant to this article was reported.
References
Supplementary Materials
Supplementary materials can be found via http://doi.org/10.4258/hir.2018.24.3.242.