



EHR data in clinical research (part 3): making the data usable

Welcome to the final part of our series on using electronic health record (EHR) data in clinical research—check out part 1 and part 2 if you missed them. In this installment, we will delve into the challenges associated with EHR data and discuss strategies to enhance its usability for research analytics. As passionate advocates for leveraging EHR data to drive advancements in healthcare, we believe that by addressing these challenges head-on, researchers can unlock valuable insights and shape the future of clinical research. Let’s explore the key considerations for making EHR data usable in research.

The realities of clinical data

To fully harness the potential of EHR data, researchers must confront the realities of clinical data and overcome gaps, variability, and disparities in its availability. The goal is to make the data as easy as possible to access, lower barriers, and ensure it fits seamlessly into existing workflows.

Healthcare interactions and longitudinality
Patients receive care from multiple providers and their data is scattered across different systems. Understanding the longitudinality of patient care becomes crucial to gain a comprehensive view of their medical history. Researchers must develop strategies to aggregate data from various providers and Health Information Exchange (HIE) networks to create a cohesive patient record. Except in cases where identity is asserted (e.g., patient-mediated exchange), record linking is required to do this. Privacy-preserving record linking (PPRL) methods can connect records across different sources.
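
As a rough illustration of the idea behind privacy-preserving record linking, the sketch below turns normalized identifiers into a keyed hash so that two sources can compare tokens instead of names and birth dates. This is a minimal example under stated assumptions, not a description of any particular PPRL product: the function names, the choice of fields, and the shared secret are all hypothetical, and exact-match hashing like this does not tolerate typos (production PPRL approaches typically add fuzzy-matching techniques on top).

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Lowercase, strip accents, and drop everything that is not a letter or digit."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def linkage_token(first: str, last: str, dob: str, secret: bytes) -> str:
    """Return an HMAC-SHA256 token built from normalized identifiers.

    Both data holders must use the same secret and the same normalization
    for their tokens to be comparable.
    """
    message = "|".join([normalize(first), normalize(last), normalize(dob)])
    return hmac.new(secret, message.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical secret, agreed on out of band by the parties doing the linkage.
SECRET = b"hypothetical-shared-secret"

# The same person, written differently in an EHR extract and a claims extract.
ehr_token = linkage_token("María", "O'Brien", "1990-04-12", SECRET)
claims_token = linkage_token("MARIA", "OBrien ", "1990-04-12", SECRET)

# A different spelling of the surname produces a different token.
other_token = linkage_token("Maria", "O'Brian", "1990-04-12", SECRET)

print(ehr_token == claims_token)  # tokens match after normalization
print(ehr_token == other_token)   # tokens differ for a different spelling
```

The design choice worth noting is that normalization happens before hashing: any formatting difference that survives normalization yields a completely different token, which is why the cleaning rules have to be agreed on by every data source.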

Variable patient access and data availability
Patients have varying degrees of access to their own data, which affects their ability to share it for research purposes. Researchers must account for this variability in data availability and develop methods to enable patient-mediated exchange to ensure comprehensive data collection. Furthermore, by incorporating claims data into the dataset, researchers can more effectively identify missing data and providers, thereby enhancing data consistency and completeness. Today, claims are also accessible through patient-mediated exchange, which can provide valuable information, especially in situations where EHR data is limited or sparse.

Enhancing and interpreting data for research

Achieving consistency of form and meaning
Aggregating EHR data without losing fidelity or meaning is a critical challenge. Researchers should convert formats to achieve syntactic consistency, enabling aggregation of records for an individual across systems. Field mapping and field definitions pose additional challenges. Researchers must establish relationships between data fields and address variations in definitions, such as date formats or status indicators, across different sources and over time. Ensuring accurate and consistent data aggregation requires careful attention to these complexities.
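
To make the field-definition problem concrete, here is a small sketch of normalizing dates and status indicators that arrive in different shapes from different sources. The list of date formats and the status vocabulary are assumptions chosen for illustration; a real pipeline would record the expected format per source system, since a value such as 03/04/2021 is ambiguous without knowing whether the source writes month or day first.

```python
from datetime import datetime

# Assumed date layouts, tried in order. Month-first is assumed for slashes.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d", "%d-%b-%Y")

# Assumed mapping from source-specific status strings to one shared vocabulary.
STATUS_MAP = {
    "active": "active",
    "a": "active",
    "current": "active",
    "resolved": "resolved",
    "r": "resolved",
    "inactive": "inactive",
}


def to_iso_date(raw: str):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_status(raw: str) -> str:
    """Map a source status string to the shared vocabulary, or 'unknown'."""
    return STATUS_MAP.get(raw.strip().lower(), "unknown")


raw_dates = ["2021-03-04", "03/04/2021", "20210304", "04-Mar-2021", "sometime in March"]
iso_dates = [to_iso_date(value) for value in raw_dates]
statuses = [normalize_status(value) for value in ["Active", " R ", "ruled out"]]

print(iso_dates)
print(statuses)
```

Returning None or "unknown" rather than guessing keeps unparseable values visible, so they can be reviewed instead of silently distorting the aggregated record.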

Semantic consistency ensures a common understanding of concepts and values across different sources of EHR data. Standardizing terminology is vital for achieving semantic consistency, but challenges arise due to variability in coding. Researchers must handle a mix of well-coded entries, poorly coded entries, proprietary codes, and free text. Term mapping and classification techniques play a crucial role in aligning different codes and capturing the true meaning of the data.

Using multiple sources and feedback by participants to validate data
In traditional research, when a participant asserts in a survey that they have been diagnosed with a particular condition or reports a lab value, this data often cannot be validated. Access to observational data from clinical interactions enables researchers to corroborate participant responses and improve data fidelity. The reverse is also true. EHR and claims data may contain errors or inaccuracies, some of which have been propagated from system to system. Participants can be offered opportunities to validate or correct data, which supports both participant engagement and data enhancement.

Where multiple sources are available, determining which source to use or “trust” is part of the analytical process. For example, whether to use EHR data or patient-reported data depends significantly on the data point and scope of source systems. EHR data may capture “objective” data such as laboratory results and medication prescriptions more accurately and consistently. Diagnoses, on the other hand, are more nuanced, requiring, for example, interpretation to determine whether a chronic disease is newly diagnosed or long-standing or whether an acute problem is still present or resolved. Contextual data such as clinician observations, physical exam findings, and patient activities are often captured only in free-text notes, requiring natural language processing or manual review to incorporate into analysis. In these cases, patient surveys may prove more reliable.

Analytical considerations and data interpretation
Researchers must navigate the data pipeline and interpret EHR data accurately. This involves understanding missing data and its context, distinguishing between the presence/absence of signs, and accounting for secondary use and data pipeline artifacts such as data conversion lossiness and potential bias. Analytical considerations, including interpreting differences and changes in data, differentiating between absolute and relative differences, and understanding temporality, contribute to robust research analysis.
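
The distinction between absolute and relative differences is easy to state in code. The sketch below uses made-up numbers purely for illustration (they are not study data): a systolic blood pressure that rises from 120 to 132 mmHg changes by 12 mmHg in absolute terms and by about 10% in relative terms, while a prevalence that rises from 2% to 3% changes by one percentage point in absolute terms but by about 50% in relative terms.

```python
def absolute_difference(baseline: float, follow_up: float) -> float:
    """Change expressed in the original units of the measurement."""
    return follow_up - baseline


def relative_difference(baseline: float, follow_up: float) -> float:
    """Change expressed as a fraction of the baseline value."""
    if baseline == 0:
        raise ValueError("relative difference is undefined for a zero baseline")
    return (follow_up - baseline) / baseline


# Illustrative values only.
bp_abs = absolute_difference(120.0, 132.0)    # in mmHg
bp_rel = relative_difference(120.0, 132.0)    # fraction of baseline

prev_abs = absolute_difference(0.02, 0.03)    # in proportion units (percentage points / 100)
prev_rel = relative_difference(0.02, 0.03)    # fraction of baseline

print(f"Blood pressure: {bp_abs:.1f} mmHg absolute, {bp_rel:.0%} relative")
print(f"Prevalence: {prev_abs * 100:.1f} points absolute, {prev_rel:.0%} relative")
```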

Use case: hypertension in pregnancy

Let’s take a look at a use case. A research team is seeking to study hypertension and maternal health outcomes. They have access to data from several thousand study participants who have granted access to their patient-mediated Fast Healthcare Interoperability Resources (FHIR) records and to claims data via their payors.

A simple analysis of the raw data finds a relatively small number of patients matching a pregnancy-diagnosis code, very few with a gestational hypertension diagnosis, and only a handful with multiple blood pressure recordings. The researchers are disappointed in the dataset and skeptical that it represents reality. What are they missing and why?

First, data in the claims dataset is not linked to data in the EHR dataset. In claims, numerous patients were identified with pregnancy diagnoses but they had no linked blood pressure readings. Multiple rich datasets can be great for research, but typically only if the same patients can be identified across them and the data aggregated into a single common format.

Second, the researchers were confused as to why the EHR dataset, received in the FHIR format, had few clinical notes or labs to confirm diagnoses and no encounters to assess frequency of healthcare system interactions. Upon further inspection, though, the researchers realized that many of the FHIR “Bundles” had embedded encoded documents which were not human-readable. Many of these documents turned out to be C-CDAs (an older HL7 standard for exporting EHR data), which contained a wealth of additional clinical information which was not present in the structured FHIR fields. These data included blood pressure recordings, prenatal visit dates, and additional diagnoses and comorbidities. By extracting and converting the embedded documents into FHIR, the dataset was significantly expanded.
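
A minimal sketch of the first half of that step, finding and decoding embedded documents, might look like the following. It assumes the documents are carried inline as base64 data in the attachment of a FHIR DocumentReference resource, which is one place the FHIR specification allows them; actual exports may instead use Binary resources or URLs, and the sample Bundle here is hypothetical. Converting the decoded C-CDA into structured FHIR resources is a much larger task and is not shown.

```python
import base64
import xml.etree.ElementTree as ET

# Root element of a CDA document in the HL7 v3 namespace.
CDA_ROOT_TAG = "{urn:hl7-org:v3}ClinicalDocument"


def embedded_attachments(bundle: dict):
    """Yield (content_type, decoded_bytes) for inline DocumentReference attachments."""
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "DocumentReference":
            continue
        for content in resource.get("content", []):
            attachment = content.get("attachment", {})
            data = attachment.get("data")
            if data is None:
                continue  # the attachment may be referenced by URL instead of inlined
            yield attachment.get("contentType", ""), base64.b64decode(data)


def is_clinical_document(payload: bytes) -> bool:
    """Return True if the payload parses as XML with a CDA ClinicalDocument root."""
    try:
        root = ET.fromstring(payload)
    except ET.ParseError:
        return False
    return root.tag == CDA_ROOT_TAG


# Hypothetical, heavily trimmed example data.
sample_cda = (
    b'<ClinicalDocument xmlns="urn:hl7-org:v3">'
    b"<title>Prenatal visit summary</title>"
    b"</ClinicalDocument>"
)
sample_bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "example"}},
        {
            "resource": {
                "resourceType": "DocumentReference",
                "content": [
                    {
                        "attachment": {
                            "contentType": "application/xml",
                            "data": base64.b64encode(sample_cda).decode("ascii"),
                        }
                    }
                ],
            }
        },
    ],
}

ccda_payloads = [
    payload
    for _content_type, payload in embedded_attachments(sample_bundle)
    if is_clinical_document(payload)
]
print(f"Found {len(ccda_payloads)} embedded C-CDA document(s)")
```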

Third, the researchers took a deeper look at the diagnoses, procedures, vitals, and labs in the data and noticed that numerous potential data points were not being included in their analysis because the source data was poorly coded or uncoded/free-text.

Numerous blood pressures, for example, showed up as free-text in observation data:

Auto Cuff Mean Systolic BP-Repeat BP
Resting BP
BP Systolic (First BP Taken)
Systolic BP Standing (Wait 3 Min)
Blood pressure diastolic
Blood pressure, systolic, left arm
Systolic BP-Pt Reported

In order to utilize these observations in analysis, researchers had to map the data to reference terminologies (e.g., LOINC 55284-4 BP Sys/Dias; SNOMED 75367002 Blood Pressure (observable entity)).
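
As a simplified illustration of term mapping, the sketch below applies keyword rules to the free-text labels listed above and assigns a LOINC code to each. The rules are an assumption made for this example, not the mapping method used in the study; real term mapping generally relies on curated mapping tables or terminology services and human review. The codes used are LOINC 8480-6 (systolic blood pressure), 8462-4 (diastolic blood pressure), and 55284-4 (the combined systolic/diastolic code mentioned above), and mapping labels with no stated component to the combined code is a judgment call that a real project would want to review.

```python
import re

SYSTOLIC = ("8480-6", "Systolic blood pressure")
DIASTOLIC = ("8462-4", "Diastolic blood pressure")
COMBINED = ("55284-4", "Blood pressure systolic and diastolic")


def map_bp_label(label: str):
    """Return a (LOINC code, display) pair for a blood pressure label, else None."""
    text = label.lower()
    # Only consider labels that mention blood pressure at all.
    if not re.search(r"\bbp\b|blood pressure", text):
        return None
    if "systolic" in text:
        return SYSTOLIC
    if "diastolic" in text:
        return DIASTOLIC
    # No component stated: fall back to the combined code (assumption).
    return COMBINED


source_labels = [
    "Auto Cuff Mean Systolic BP-Repeat BP",
    "Resting BP",
    "BP Systolic (First BP Taken)",
    "Systolic BP Standing (Wait 3 Min)",
    "Blood pressure diastolic",
    "Blood pressure, systolic, left arm",
    "Systolic BP-Pt Reported",
    "Heart rate",  # not a blood pressure label; should stay unmapped
]

mapped = {label: map_bp_label(label) for label in source_labels}
for label, target in mapped.items():
    print(f"{label!r} -> {target}")
```

Note that a mapping like this deliberately discards detail such as body position, measurement method, and whether the value was patient-reported. Keeping the original label alongside the mapped code preserves that context for analyses that need it.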

Finally, even in those reference terminologies, the researchers found multiple codes related to a data point of interest, typically at greater specificity than needed for analysis. The researchers therefore utilized value sets to group codes. They were able to group LOINC codes by “component” and to identify sets of relevant SNOMED and ICD-10 codes, such as the “Eclampsia” value set published by the Joint Commission in the NLM Value Set Authority Center.
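
The grouping role of a value set can be sketched in a few lines. The value sets below are small, hand-picked subsets of ICD-10-CM codes assembled for this example only; they are not the contents of the Joint Commission value set or of any other published value set, and the patient records are invented.

```python
from collections import defaultdict

# Illustrative, incomplete value sets keyed by (code system, code).
VALUE_SETS = {
    "eclampsia": {
        ("ICD-10-CM", "O15.00"),
        ("ICD-10-CM", "O15.02"),
        ("ICD-10-CM", "O15.03"),
        ("ICD-10-CM", "O15.1"),
        ("ICD-10-CM", "O15.2"),
        ("ICD-10-CM", "O15.9"),
    },
    "gestational_hypertension": {
        ("ICD-10-CM", "O13.1"),
        ("ICD-10-CM", "O13.2"),
        ("ICD-10-CM", "O13.3"),
        ("ICD-10-CM", "O13.9"),
    },
}


def patients_by_value_set(conditions, value_sets):
    """Group patient IDs by the value sets that any of their coded conditions fall into."""
    result = defaultdict(set)
    for row in conditions:
        key = (row["system"], row["code"])
        for name, members in value_sets.items():
            if key in members:
                result[name].add(row["patient_id"])
    return dict(result)


# Invented condition rows, as they might look after term mapping.
conditions = [
    {"patient_id": "p1", "system": "ICD-10-CM", "code": "O15.1"},
    {"patient_id": "p1", "system": "ICD-10-CM", "code": "O15.9"},
    {"patient_id": "p2", "system": "ICD-10-CM", "code": "O13.2"},
    {"patient_id": "p3", "system": "ICD-10-CM", "code": "I10"},
]

groups = patients_by_value_set(conditions, VALUE_SETS)
print(groups)
```

Counting distinct patients per value set, rather than rows per code, avoids double-counting a patient whose record carries several codes of differing specificity for the same condition.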

In summary, before the researchers were able to perform any analysis or gain insight from their large dataset, they needed to achieve consistency in data format (syntax), aggregate data across multiple sources, align on definitions for numerators and denominators, and validate the resulting datasets. This included at least the following steps:

  • Extract C-CDAs & convert to FHIR: CareEvolution has found that some EHR vendors, while complying with rules mandating patient access to records via API, are simply embedding substantial sections of the medical record as “documents” (e.g., C-CDAs) within FHIR resources, rather than as structured FHIR data. Extracting these data and converting them to FHIR allows them to be aggregated. Note that inclusion of some of these data is not required by current rules, and they could be removed by vendors or providers. As requirements and implementations evolve, researchers should be aware that real-world data availability may change.
  • Convert X12 to FHIR: Post-processed claims data (e.g., 837s) can be useful to get a longitudinal view of healthcare system interactions. This data needs to be converted to a common format for analysis.
  • Link records: The claims data needs to be linked with the EHR data in order to analyze across these data sources.
  • Combine data: Once the participant records are matched across sources, the data needs to be aggregated into a single data set.
  • Map terms: If the researchers rely only on coded source data, they will dramatically undercount data such as conditions and immunizations. Standardizing source data by mapping to reference code systems enables even poorly coded or uncoded/free-text data to be included in analysis.
  • Use value sets: In the healthcare informatics world there is often more than one way to represent an analytical data point in reference terminologies. Value sets enable analysts to leverage common groupings of codes. Some value sets are publicly available, for example from the NLM Value Set Authority Center.

Even with the above pre-processing, the researchers still had work to do in preparation for analysis. For example, they had to determine how to define “pregnancy” in the data. Would they include only patients with certain documented pregnancy diagnosis codes, or also those with a positive pregnancy test and multiple prenatal visits? Which dates in the data would be relied upon to identify pregnancy timing? How would they identify pregnancy complications or pregnancy-related healthcare encounters? With an aggregated and linked dataset in a single common format with reference term mapping and value sets, however, they were able to get to substantive analytical questions and findings much more quickly.
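
One way to keep such definitional choices explicit is to write each candidate definition as its own function and compare the cohorts they produce. The sketch below is purely illustrative: the summary fields, the two definitions, and the threshold of two prenatal visits are assumptions made for the example, not the criteria the researchers adopted.

```python
from dataclasses import dataclass


@dataclass
class ParticipantSummary:
    """Hypothetical per-participant flags derived from the aggregated dataset."""
    participant_id: str
    has_pregnancy_dx: bool      # any code from a pregnancy diagnosis value set
    has_positive_test: bool     # any positive pregnancy test result
    prenatal_visits: int        # number of encounters classified as prenatal


def strict_definition(p: ParticipantSummary) -> bool:
    """Pregnancy requires a documented pregnancy diagnosis code."""
    return p.has_pregnancy_dx


def broad_definition(p: ParticipantSummary, min_visits: int = 2) -> bool:
    """Also accept a positive test combined with repeated prenatal visits."""
    return p.has_pregnancy_dx or (
        p.has_positive_test and p.prenatal_visits >= min_visits
    )


# Invented participants for illustration.
participants = [
    ParticipantSummary("p1", True, True, 6),
    ParticipantSummary("p2", False, True, 3),
    ParticipantSummary("p3", False, True, 0),
    ParticipantSummary("p4", False, False, 0),
]

strict_cohort = [p.participant_id for p in participants if strict_definition(p)]
broad_cohort = [p.participant_id for p in participants if broad_definition(p)]

print("Strict definition:", strict_cohort)
print("Broad definition:", broad_cohort)
```

Comparing the two cohorts shows how sensitive the denominator is to the definition, which is useful to know before any outcome rates are computed.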


Making EHR data usable for clinical research requires researchers to address the challenges associated with data availability, syntactic and semantic consistency, and analytical considerations. By employing strategies such as record linking, format conversion, terminology standardization, and classification, researchers can unlock the immense potential of EHR data. As we conclude this series, we encourage researchers to embrace these challenges, enhance the usability of EHR data, and embark on a journey of discovery and innovation. Together, we can shape the future of clinical research through the power of EHR data.

Ready to incorporate EHR data in your next clinical study? Try MyDataHelps™—free for up to 100 participants—or contact us to learn more!