Real-World Data - Introduction

April 8th, 2025

Why do we care about real-world data at Cromodata?

Because we care about real people.

In just 10 minutes, we’ll give you an introduction to the world of Real-World Data, so that in future posts you can dive deeper with a solid foundation and an informed perspective.
The short glossary below gives a bit of context before getting started.
RWD: Real-World Data: Refers to data generated through routine clinical practice and health-related activities conducted outside of controlled research environments. When we talk about Real-World Data, we’re referring to a large volume of information. For analysis, it is anonymized: there are no names attached. The value lies in the broader patterns and insights we can draw from analyzing it.
RWE: Real-World Evidence: Refers to insights derived from the analysis of Real-World Data. This process requires the design and application of robust algorithms and statistical methods aimed at answering specific research questions or addressing defined problems. A quick example is the data collected during the COVID-19 pandemic. Every day, real-time information was gathered about the number of infected people, their age ranges, and even pre-existing conditions or key clinical data. The data was then anonymized and structured. From that, we got very useful summaries out of a chaotic and dynamic situation. For instance, by October 15, 2020, we knew there were over 38 million COVID-19 cases worldwide, and more than a million deaths. It was also clear that a large percentage of those with severe complications had prior conditions like serious heart disease, severe obesity, or diabetes. This is exactly where a researcher might step in to explore whether there’s a real relationship between age, pre-existing conditions, and risk levels. To do that, they’d build algorithms to investigate those questions (in other words, to generate evidence), or even develop risk predictors to help guide prevention and treatment decisions.
Omics: Genomics, transcriptomics, proteomics, metabolomics, and more. Omics refers to a group of scientific fields that study the biological molecules that make up an organism. These include genomics, transcriptomics, proteomics, metabolomics, epigenomics, microbiomics, lipidomics, glycomics, interactomics, phenomics, metagenomics, exposomics, and pharmacogenomics. Today, genomics and transcriptomics are the most widely used. Genomics studies the entire genome of an organism, while transcriptomics looks at the full set of RNA transcripts, including messenger RNA (mRNA), as a readout of gene expression. In all cases, these fields involve analyzing large volumes of biological data, such as genes, proteins, metabolites, and more.
AI: Artificial Intelligence / ML: Machine Learning / DL: Deep Learning. Artificial Intelligence (AI) is a branch of computer science that applies specialized tools to perform complex tasks such as reasoning, learning, decision-making, and natural language understanding. Machine Learning (ML) is a subfield of AI, and Deep Learning (DL) is in turn a subfield of ML. Both focus on developing algorithms and statistical models that enable computers to learn from data and improve their performance without explicit programming for each specific task.
Targeted learning: A widely used methodological approach that combines statistical inference with machine learning. It builds on supervised learning, one of the primary branches of machine learning (ML), in which a model is trained on a labeled dataset where each input is paired with a corresponding output.
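To make the RWE entry above a bit more concrete, here is a minimal sketch of the first step a researcher might take: grouping anonymized records by age group and pre-existing condition, then comparing the share of severe outcomes in each group. The records, field names, and the function severe_rate_by_stratum are all invented for illustration. This is not real patient data, and real studies would also need to handle confounding, sample size, and data quality.

```python
from collections import defaultdict

# Synthetic, made-up records for illustration only (not real patient data).
# Each tuple: (age_group, has_preexisting_condition, severe_outcome)
records = [
    ("under_60", False, False),
    ("under_60", False, False),
    ("under_60", False, False),
    ("under_60", False, True),
    ("under_60", True, False),
    ("under_60", True, True),
    ("60_plus", False, False),
    ("60_plus", False, True),
    ("60_plus", True, True),
    ("60_plus", True, True),
    ("60_plus", True, False),
    ("60_plus", True, True),
]


def severe_rate_by_stratum(rows):
    """Return {(age_group, has_condition): share of severe outcomes}."""
    totals = defaultdict(int)
    severe = defaultdict(int)
    for age_group, has_condition, is_severe in rows:
        key = (age_group, has_condition)
        totals[key] += 1
        if is_severe:
            severe[key] += 1
    return {key: severe[key] / totals[key] for key in totals}


rates = severe_rate_by_stratum(records)
for (age_group, has_condition), rate in sorted(rates.items()):
    print(f"{age_group:9s} condition={has_condition!s:5s} severe rate={rate:.2f}")
```

A difference in rates between groups is only a starting point for a research question, not evidence of a causal relationship on its own.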
The use of Real-World Data (RWD) is transforming the way research is conducted and decisions are made across the fields of medicine and public health.

Over the past few years, a range of interconnected developments has driven this shift—particularly within the medical domain. Some of the most significant include:

🔹 The widespread expansion of the internet and social media.
🔹 Remarkable advances in Artificial Intelligence (AI) and quantum computing.
🔹 Breakthroughs in biotechnology, genomics, and the broader “omics” sciences.
🔹 The growing adoption of telemedicine and wearable technologies.
🔹 The accelerated development of personalized medicine.
🔹 A significant increase in data storage and processing capabilities (one of Cromodata’s areas of expertise), enabling access to and utilization of large-scale datasets.

When combined with the rising costs and well-documented limitations of traditional clinical trials, real-world data emerges as a powerful resource to help bridge the gap between clinical research and real-world practice.

Some of the most valuable uses of RWD in healthcare are already well known. Recruiting patients for clinical trials, comparing the effectiveness of drugs and treatments, and monitoring their safety are among the most common in the pharmaceutical world.

Up next, we’ll highlight some of the newer and more promising use cases.

By some industry estimates, around 30% of the world’s data volume is generated by the healthcare industry.

Real-World Data (RWD) comes from many sources, but the most common are: Electronic Health Records (EHRs), patient registries, administrative databases (including clinical records, insurance data, and billing information), disease registries, and databases on pharmaceutical products and medical devices.

RWD can be more or less accessible, structured, and refined for use — and that determines how quickly and easily statistical analysis and predictive models can be applied. The ultimate goal is to generate Real-World Evidence (RWE) that can be used to draw conclusions, validate hypotheses, design studies, support regulatory decisions, develop public health policies, or even guide clinical practice.

One of the biggest challenges today in Latin America — and in many parts of the world — is the high level of data fragmentation, which makes RWD hard to use. That’s where a promising niche is emerging: ensuring access to medical datasets that are both interoperable and secure.
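As a rough illustration of what fragmentation looks like in practice, the sketch below takes two hypothetical hospital exports that describe the same kind of visit with different field names, date formats, and units, and maps both into one common schema. Everything here (hospital_a, hospital_b, the field names, and the values) is assumed for the example. Real interoperability work typically relies on shared standards and also has to address security and governance, which this sketch does not cover.

```python
from datetime import datetime

# Two hypothetical exports with different conventions (all values are made up).
hospital_a = [{"patient_id": "A-001", "visit_date": "2024-03-05", "weight_kg": 70.0}]
hospital_b = [{"pid": "B-17", "date": "05/03/2024", "weight_lb": 154.0}]

LB_PER_KG = 2.20462  # pounds per kilogram


def from_a(row):
    """Map a hospital_a row (ISO dates, kilograms) to the common schema."""
    visit = datetime.strptime(row["visit_date"], "%Y-%m-%d").date()
    return {
        "source": "hospital_a",
        "patient_id": row["patient_id"],
        "visit_date": visit.isoformat(),
        "weight_kg": round(row["weight_kg"], 1),
    }


def from_b(row):
    """Map a hospital_b row (day/month/year dates, pounds) to the common schema."""
    visit = datetime.strptime(row["date"], "%d/%m/%Y").date()
    return {
        "source": "hospital_b",
        "patient_id": row["pid"],
        "visit_date": visit.isoformat(),
        "weight_kg": round(row["weight_lb"] / LB_PER_KG, 1),
    }


harmonized = [from_a(r) for r in hospital_a] + [from_b(r) for r in hospital_b]
for row in harmonized:
    print(row)
```

Once both sources share the same field names, date format, and units, the records can be analyzed together instead of one silo at a time.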

This will be key from now on for most advances in medicine — from training AI and developing new drugs, to scientific discoveries, medical research, and precision medicine. It will also help address unmet medical needs, study hard-to-reach subpopulations, and evaluate the long-term safety and effectiveness of treatments.

In the figure below, you’ll see the most common and widely used sources of RWD today.

Below, we’ve included some additional information. Some data sources are much broader than they seem, and others are only now starting to be incorporated.
Clinical data includes information from Electronic Health Records (EHRs), such as hospitalizations, procedures, treatments, medical visits, diagnoses, symptoms, lab results, radiology and other imaging, and clinical notes, as well as demographic data, pathology/histology findings, microbiology data, admission/discharge summaries, progress reports, functional status, and more.
Genomic and genetic testing data (SNPs/panels); multi-omics data (proteomics, transcriptomics, metabolomics, lipidomics); and other biomarker status.
Fitness trackers, wearable devices, and other health apps are used to measure physical activity and body function. This category includes mobile devices such as smartphones and tablets, monitoring tools, and digital assistants, as well as wearables like smartwatches and fitness bands (e.g., Fitbit, Apple Watch), which track health metrics such as heart rate, physical activity, blood oxygen levels, sleep quality, and more. Medical-grade devices, such as blood pressure monitors, digital thermometers, pulse oximeters, and blood glucose monitors, also qualify as data generators, as they allow patients to monitor their health in real time.
Medical claims and other data related to the use of medications and treatments. This also includes patient-reported records such as surveys, diets, habits, personal health logs, reports of adverse events, quality-of-life measures, and more. It also covers records from insurance companies and billing systems.
Administrative records, concomitant therapies, point-of-sale data, and medical claims.
Climate factors, pollutants, infections, lifestyle (diet and habits), personal health records, adverse event reports, quality-of-life measures, among others.
Disease burden, clinical characteristics, prevalence/incidence, treatment rates, resource use and costs, disease control, quality of life measures, and more.
Historical data on health conditions and allergies related to the patient and their extended family, smoking status, alcohol consumption, general habits, and demographics.

In recent years, several categories of RWD have gained particular relevance. Among them are laboratory and genomic data (with spatial genomics standing out), pharmaceutical data, oncology data, data on both prevalent and rare diseases, information on social determinants of health (SDOH), and records from specialty pharmacies. Health data provide a unique opportunity to deepen our understanding of rare diseases, an area where biopharmaceutical companies often struggle to recruit sufficiently large study populations.

“Omic” data — including genomic, epigenomic, microbiomic, pharmacogenomic, transcriptomic, proteomic, and metabolomic datasets, among others — offer transformative potential for both healthcare research and clinical application. Genomic data, in particular, are growing in prominence due to the increasing use of biomarker-targeted therapies, which are central to today’s precision and personalized medicine approaches. Spatial genomics — which merges genomic or transcriptomic sequencing with spatial localization techniques — is an emerging field within omics sciences. It shows strong market potential and valuable clinical applications, particularly in cancer, neuroscience, inflammation and autoimmune diseases, and embryonic development.

Pharmacy-related data are also highly valuable in the context of specialty medications, which currently represent approximately 75% of drugs in development.

Medical imaging data — comprising an estimated 90% of all healthcare data — are playing a central role in the development and validation of new artificial intelligence tools.

The increased availability of medical imaging within real-world data (RWD) in recent years has significantly fueled the development of machine learning (ML) algorithms aimed at enhancing diagnostic precision. Medical imaging represents a highly complex data type, yet one with tremendous potential for disease detection, diagnosis, and monitoring.

When properly used and analyzed, RWD holds the potential to generate valid and unbiased RWE, with significant cost and time savings compared to controlled clinical trials, and to improve the efficiency of medical and health-related research and decision-making.

Today, there are three main challenges to advancing AI: algorithms, computing power, and data. While the first two have well-established markets supporting them, getting access to quality data for training AI remains a big challenge — and an even greater one in Latin America.

That’s exactly why we do what we do at Cromodata.