Led by the CDC and implemented by all 50 states and more than 3,000 local jurisdictions and territories, public health surveillance in the United States spans the monitoring of infectious diseases, chronic diseases, injuries, and mental health conditions, as well as social determinants of health. Surveillance can capture data on every aspect relevant to the cause or spread of disease – behavioral risk factors, preventive actions, cases, program or treatment costs, and more.
The COVID-19 pandemic put public health surveillance – and its urgent need for modernized systems and methods – in the spotlight. To identify, contain, and prevent outbreaks, state and local public health agencies undertook the massive task of tracking cases, variants, vaccinations, and hot spots and sharing that data with federal agencies. This was no simple task – for example, Politico reports that “in Washington state, health officials went from tracking 30,000 disease lab reports a month in 2019 to 30,000 a day during certain points in 2020.”
This sharp increase in demand exposed and widened what were already significant gaps in public health surveillance data infrastructure and methods, including:
Lack of interoperability
Analytical approaches that cannot scale
Delayed and cumbersome data collection and reporting
Current methods of public health surveillance
Hospitals, healthcare providers, and laboratories use a variety of systems to collect data – some required by law, others on a voluntary basis. Typically, they report data to state and local public health agencies, which share the information with CDC and other federal agencies.
Common sources of public health data
Electronic case reports (eCRs)
Electronic health records (EHRs)
Electronic laboratory reports (ELRs)
Disease and health registries
Health behavior surveys
Agencies aggregate, deidentify, synthesize, and disseminate the information to inform policymaking, public awareness, and research – a process that can often take months or years after the data were initially collected.
Typical flow of public health data
Many current systems rely on disease-specific monitoring and manual data entry, which places a substantial burden on federal data partners. State and local reports to CDC are often delayed because the systems and data are not interoperable.
CDC encourages standardization, but it lacks the authority to receive data directly without establishing a data use agreement with each state and local jurisdiction. As a result, the agency must manually clean the data before conducting the analyses needed to provide a national, aggregated picture of public health. This can significantly delay the sharing of data with providers and other trusted partners with important roles in public health response.
For example, the U.S. Food and Drug Administration (FDA) monitors the safety of regulated medical devices, and the National Cancer Institute tracks cancer trends and statistics. These data are disseminated through agency-specific reporting channels and, in some cases, made available for research in data hubs.
With more modernized data infrastructure, public health leaders can better identify and contain outbreaks, understand disease burden, guide policy changes, evaluate and improve prevention and control strategies, and target research investment.
Current efforts to modernize public health surveillance
The United States has made major advancements in notifiable disease reporting, syndromic surveillance, mortality reporting, and electronic lab reporting in the past decade. Building on these efforts, CDC recently launched a comprehensive Data Modernization Initiative (DMI) and a dedicated Center for Forecasting and Outbreak Analytics (CFA). These initiatives are many years in the making and, taken together, they are leading the charge to transform our public health surveillance system into one that is connected, resilient, adaptable, sustainable, and response-ready.
CDC’s efforts are bolstered by the work of the Office of the National Coordinator for Health Information Technology (ONC) to define standards and practices for interoperable data sharing and to inform the incentives driving their adoption. Chief among ONC’s accomplishments are advancing the Fast Healthcare Interoperability Resources (FHIR®) standard and publishing the Trusted Exchange Framework and Common Agreement (TEFCA) to establish a universal floor for interoperability across the country.
A holistic strategy for next-generation surveillance
As our nation defines and implements the next round of investments to modernize public health surveillance, agency leaders need a holistic strategy and an unwavering focus on the end goal. Defining and implementing a solution for real-time, actionable data and rapid, accurate insights will require a massive acceleration of efforts across lead agencies and data partners.
As they advance public health systems, agencies will need to simultaneously expand, coordinate, standardize, and streamline data collection and sharing. They can do so by adopting a scalable, federated data mesh infrastructure and further expanding data interoperability. With a stronger technological foundation and a greater volume of usable data, agencies can then deploy powerful analytical tools at scale that can provide a comprehensive, decision-ready picture of a given public health threat or situation.
At the same time, public health agencies must pursue intelligent automation tools to ensure that the benefits of surveillance modernization do not create additional burdens on already-strained public health workers.
A scalable, federated data infrastructure
Our nation’s existing network of siloed, disease-specific systems creates significant redundancies and inefficiencies and – equally important – cannot scale to support the level of data aggregation and access that public health agencies need.
To meet the demands of a modern public health data ecosystem, federal agencies need a scalable, federated data mesh.
By leaving data ownership decentralized, a data mesh allows those who are most knowledgeable to control their data. In a public health context, this means health agencies, insurers, academic partners, and others act as nodes in a network.
Rather than reporting directly to CDC, state and local agencies would make their data products – EHR data, laboratory reports, genomic sequencing information, immunization records, etc. – available via the mesh.
Using a self-service platform powered by robust metadata, search features, and a data catalog, authorized data consumers can find, access, aggregate, and analyze the data. They can also access pre-built algorithms and create new data products and reusable algorithms.
CDC would serve a crucial governance and stewardship role – developing and enforcing implementation guidelines and standards, establishing a data catalog, and executing a privacy layer. Using privacy-preserving record linkage (PPRL) technology, the privacy layer would support HIPAA compliance by allowing records for the same patient to be matched even after the data are deidentified. For example, PPRL can use keyed hashing to convert names, birthdates, and addresses into tokens that cannot practically be reversed to reveal the original values, yet still match whenever the underlying values match.
By operationalizing PPRL with standardized FHIR data components, public health agencies would be able to ingest and collect data from multiple sources and feed those data into scalable analytics and modeling tools.
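The token-matching idea behind PPRL can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular PPRL product: the secret key, the field choices, and the names normalize and make_token are all hypothetical, and the sketch supports exact matching only.

```python
import hashlib
import hmac
import unicodedata

# Hypothetical shared secret. In practice a key like this would be managed
# by the governing body and never stored in source code.
SECRET_KEY = b"example-key-distributed-out-of-band"


def normalize(value: str) -> str:
    """Lowercase, strip accents, and keep only letters and digits so that
    trivial formatting differences do not change the token."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def make_token(first: str, last: str, birthdate: str,
               key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest (HMAC) of the normalized identifiers."""
    message = "|".join(normalize(part) for part in (first, last, birthdate))
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


# Two sources record the same person with different formatting.
lab_token = make_token("María", "O'Neil", "1980-04-02")
clinic_token = make_token("maria", "ONEIL ", "1980-04-02")
# A different person yields a different token.
other_token = make_token("Mario", "O'Neil", "1980-04-02")

print(lab_token == clinic_token)  # the records can be linked
print(lab_token == other_token)   # these cannot
```

Because each source computes the token locally with the same key and normalization rules, records can be linked on the token alone, and names and birthdates never need to leave the source system. Production PPRL tools typically add further safeguards and may tolerate typos or partial matches, which this exact-match sketch does not.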
With appropriate governance, a data mesh would provide access to analysis-ready data products, eliminating the bottlenecks typically associated with centralized reporting and dissemination. As a result, public health agencies could accelerate data aggregation and analysis – and public warnings and outreach – which is particularly critical for fast-moving threats such as infectious diseases.
Data interoperability: From data push to data pull
However, data infrastructure is only as successful as the volume and quality of inputs that feed into it. Achieving America’s public health goals hinges on widespread adoption of application programming interface (API)-based data standards to accommodate the data volumes necessary for rapid digital reporting in a scalable way.
To that end, public health agencies, surveillance programs, and health information exchanges (HIEs) and their network participants must continue progress toward full adoption of FHIR – and specifically, its RESTful API functionalities such as Bulk FHIR.
With FHIR and Bulk FHIR-enabled APIs, public health agencies could shift away from a “push” paradigm that relies on providers to send data and instead adopt a query- or subscription-based model (a “pull” paradigm) to receive automated case updates.
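As a rough sketch of what a “pull” looks like, the Python below builds (but does not send) a system-level Bulk FHIR export kick-off request. The $export operation, the _type and _since parameters, and the Accept and Prefer headers come from the FHIR Bulk Data Access specification. The server URL, the bearer token, and the function name build_bulk_export_request are hypothetical placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_bulk_export_request(base_url: str, resource_types: list[str],
                              since: str, token: str) -> Request:
    """Build a Bulk FHIR kick-off request for resources changed since `since`."""
    query = urlencode({
        "_type": ",".join(resource_types),  # which resource types to export
        "_since": since,                    # only resources updated after this instant
    })
    url = f"{base_url.rstrip('/')}/$export?{query}"
    headers = {
        "Accept": "application/fhir+json",
        "Prefer": "respond-async",          # bulk export runs asynchronously
        "Authorization": f"Bearer {token}",
    }
    return Request(url, headers=headers, method="GET")


# Hypothetical endpoint and credentials, for illustration only.
request = build_bulk_export_request(
    "https://fhir.example.org/r4/",
    ["Patient", "Condition", "Observation"],
    "2024-01-01T00:00:00Z",
    "example-token",
)
print(request.get_method(), request.full_url)
```

In the Bulk Data flow, a server that accepts the request answers with a status location that the agency polls until the export files are ready. The agency therefore decides when to ask for updates and which ones, rather than waiting for each provider to send them.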
Expanding interoperability with intelligent systems
Currently, only EHR data and social determinants of health (SDOH) data are interoperable via the established standard, the United States Core Data for Interoperability (USCDI). These data can and should be augmented by structured health data siloed in other agency systems, as well as data from other relevant sources, including:
Geospatial data such as walkability and access to care
Remote-sensing data such as wastewater testing and satellite imagery
Mobility data from smartphones, GPS, and sensors along highways and roads
By layering additional data from currently siloed health systems and non-health sources, public health agencies can enrich the baseline USCDI data for truly robust insights. Recent efforts have demonstrated the value of multilayered data to track the spread of COVID-19, understand the effects of social distancing, and predict obesity rates, for example.
These results are encouraging but limited in scope. The lack of interoperability across data sources makes it impossible to scale such approaches for real-time, actionable surveillance. While ONC continues to advance and expand USCDI in collaboration with CDC and other stakeholders, this process is incremental by design. In the meantime, CDC must pursue alternate approaches to bring more data into public health models and simulations.
Machine learning (ML) feature stores have strong potential to fill in the gaps. These tools provide the flexibility required to ingest data – via direct connection or high-throughput API – from sources that use varying data standards. Once the data are ingested, an ML feature store can harmonize them with FHIR, making them usable in public health models and simulations.
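To make the harmonization step concrete, the sketch below maps records from two differently formatted, non-health sources onto one simplified shape modeled on the FHIR Observation resource. Both source schemas, the field values, and the function names are invented for illustration. The output uses real Observation element names but is not validated against the FHIR specification, and it says nothing about how any particular feature store works internally.

```python
from datetime import datetime, timezone


def wastewater_to_observation(row: dict) -> dict:
    """Hypothetical wastewater feed: US-style dates, string measurements."""
    effective = datetime.strptime(row["sampled"], "%m/%d/%Y").date().isoformat()
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"text": "SARS-CoV-2 RNA in wastewater"},
        "subject": {"display": row["site"]},
        "effectiveDateTime": effective,
        "valueQuantity": {"value": float(row["copies_per_l"]), "unit": "copies/L"},
    }


def mobility_to_observation(row: dict) -> dict:
    """Hypothetical mobility feed: epoch timestamps, numeric index."""
    effective = datetime.fromtimestamp(row["ts"], tz=timezone.utc).date().isoformat()
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"text": "Relative trip volume"},
        "subject": {"display": f"County FIPS {row['county_fips']}"},
        "effectiveDateTime": effective,
        "valueQuantity": {"value": float(row["trips_index"]), "unit": "ratio to baseline"},
    }


# One mapper per source; adding a source means adding one entry here.
HARMONIZERS = {
    "wastewater": wastewater_to_observation,
    "mobility": mobility_to_observation,
}


def harmonize(source: str, row: dict) -> dict:
    """Convert a raw record from a named source into the shared shape."""
    return HARMONIZERS[source](row)


records = [
    harmonize("wastewater", {"site": "Plant 7", "sampled": "03/15/2024", "copies_per_l": "12500"}),
    harmonize("mobility", {"county_fips": "53033", "ts": 1710460800, "trips_index": 0.82}),
]
for record in records:
    print(record["effectiveDateTime"], record["code"]["text"], record["valueQuantity"]["value"])
```

Once every source lands in the same shape, with the same date format and the same place for the value and unit, downstream models can consume the records without source-specific handling.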
By extending interoperability and connecting the universe of rich, relevant data, public health agencies can boost the accuracy of prevalence estimates, counter-balance biases in traditional data collection, effectively target control and prevention strategies, and better allocate resources.
Data solutions should follow best practices such as the FAIR guiding principles – which help ensure that data are Findable, Accessible, Interoperable, and Reusable – for scientific data management or stewardship.
Unleashing the potential of a modern data infrastructure
With a federated data mesh infrastructure that allows access to high volumes of rich, interoperable data, a modernized public health surveillance system can deploy advanced analytics and novel technologies to optimize efficiency – all at sufficient scale to produce accurate, real-time insights.
1. Using natural language processing to analyze complex, unstructured data
A tremendous volume of valuable health data is buried in imaging files, lab reports, and clinical notes. Relatively recent advances in natural language processing (NLP) make it possible to analyze these types of unstructured data.
NLP enables computer systems to understand and interpret human language through topic modeling, sentiment analysis, and other techniques. By capturing complex linguistic relationships, NLP goes well beyond keyword searches to identify common themes or attitudes towards a particular topic from medical record notes, as well as social media data and other large, unstructured data sets.
In recent years, the performance of NLP has improved significantly through what’s known as transfer learning – that is, taking a well-honed model and using it to train a new model for a related task. Massive pre-trained language models such as Google’s BERT and OpenAI’s GPT-3 are driving the state of the art across the full range of NLP’s capabilities, enabling the development of more powerful models with less training data and fewer computing resources.
To date, public health researchers have successfully employed NLP models to monitor flu-like symptoms mentioned on Twitter, identify public sentiment related to COVID-19, and pursue other exciting studies. These applications only begin to scratch the surface of NLP’s potential – particularly when combined with a federated data infrastructure and extended interoperability – to revolutionize how public health surveillance is conducted on a national scale.
2. Large-scale modeling for robust, scenario-based insights
Agent-based modeling (ABM) is a computational method for simulating actions and interactions between people and their environment. Public health researchers use ABM to model disease transmission, social influences on health, and health behavior outcomes, and to evaluate the efficacy of interventions.
The utility of ABM depends on how well the environment and rules that govern agent behavior are understood. With more and better data, ABM simulations can be used to model increasingly complex scenarios.
For example, public health officials could:
Examine the impact of immunization and introduction of new variants on community spread
Identify at-risk populations
Detect hotspots and conditions that promote the spread of the disease
Proactively evaluate the efficacy and impact of prevention and control strategies
Powered by sufficiently rich data such as demographics, social determinants, vaccination status, and geographic and other environmental data, sophisticated agent-based models can predict risk and outcomes, allowing agencies to effectively allocate resources in the interest of public health.
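A toy example shows the basic mechanics of an agent-based model. The sketch below simulates random daily contacts among susceptible (S), infectious (I), and recovered (R) agents, some of whom are vaccinated. Every parameter value is an illustrative assumption rather than a calibrated estimate, and the random-mixing contact rule is far simpler than the data-rich environments described above.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    state: str = "S"          # "S" susceptible, "I" infectious, "R" recovered
    vaccinated: bool = False
    days_infected: int = 0


def simulate(n_agents=500, n_days=60, contacts_per_day=8, p_transmit=0.05,
             vaccination_rate=0.0, vaccine_efficacy=0.8,
             infectious_days=7, initial_infected=5, seed=0) -> int:
    """Run one simulation and return how many agents were ever infected."""
    rng = random.Random(seed)
    agents = [Agent(vaccinated=rng.random() < vaccination_rate)
              for _ in range(n_agents)]
    for agent in rng.sample(agents, initial_infected):
        agent.state = "I"

    for _ in range(n_days):
        newly_infected = []
        for agent in agents:
            if agent.state != "I":
                continue
            # Each infectious agent meets a few randomly chosen agents per day.
            for other in rng.choices(agents, k=contacts_per_day):
                if other.state != "S":
                    continue
                risk = p_transmit
                if other.vaccinated:
                    risk *= 1 - vaccine_efficacy
                if rng.random() < risk:
                    newly_infected.append(other)
            agent.days_infected += 1
            if agent.days_infected >= infectious_days:
                agent.state = "R"
        # Apply new infections at the end of the day so they start spreading tomorrow.
        for other in newly_infected:
            other.state = "I"

    return sum(agent.state != "S" for agent in agents)


baseline = simulate(vaccination_rate=0.0)
with_vaccination = simulate(vaccination_rate=0.7)
print(f"Ever infected, no vaccination: {baseline}")
print(f"Ever infected, 70% vaccinated: {with_vaccination}")
```

Running the same model under different assumptions, such as a higher vaccination rate, a more transmissible variant, or fewer daily contacts, is how officials could compare scenarios before committing resources. Results from a single seed are noisy, so in practice each scenario would be run many times and summarized.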
Reducing the burden on public health workers with intelligent automation
Greater data collection and more advanced analysis are crucial to furthering our understanding of – and therefore improving – public health. However, surveillance modernization efforts cannot become another burden on the public health workforce. Public health agencies at all levels already face a dire shortage of workers, with roughly 44 percent considering leaving their jobs within the next five years. This makes the adoption of tools such as intelligent automation (IA) an essential step in this journey.
In public health surveillance, IA could significantly improve infectious disease reporting by automating the collection and transfer of relevant health information from EHRs. When a health worker records a particular symptom or disease case in a patient’s EHR, the IA system could automatically send the data directly to CDC or other agencies, eliminating the administrative burden currently required for reporting. IA systems could also scan and interpret lab reports or clinical notes to uncover disease cases that might otherwise elude health officials and trigger reports to state and local authorities.
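The reporting trigger described above can be sketched with a simple rule. In the Python below, a new EHR entry is scanned for conditions on a reportable list and a report payload is drafted automatically. The condition list is a small illustrative subset (actual lists are set by each jurisdiction), and the EhrEntry structure, the field names, and the patient token are hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import date

# Illustrative subset only; real reportable-condition lists vary by jurisdiction.
REPORTABLE_CONDITIONS = {"measles", "pertussis", "tuberculosis"}


@dataclass
class EhrEntry:
    patient_token: str   # deidentified identifier, not a name
    recorded_on: date
    diagnosis: str = ""
    note: str = ""


def find_reportable(entry: EhrEntry) -> list[str]:
    """Return reportable conditions mentioned in the diagnosis or note."""
    text = f"{entry.diagnosis} {entry.note}".lower()
    return sorted(
        condition for condition in REPORTABLE_CONDITIONS
        if re.search(rf"\b{re.escape(condition)}\b", text)
    )


def build_reports(entry: EhrEntry) -> list[dict]:
    """Draft one report payload per reportable condition found."""
    return [
        {
            "condition": condition,
            "patient_token": entry.patient_token,
            "recorded_on": entry.recorded_on.isoformat(),
        }
        for condition in find_reportable(entry)
    ]


flagged = EhrEntry("tok-001", date(2024, 3, 15),
                   note="Paroxysmal cough for two weeks; suspect pertussis.")
routine = EhrEntry("tok-002", date(2024, 3, 15), diagnosis="Seasonal allergies")

print(build_reports(flagged))
print(build_reports(routine))
```

A keyword rule like this would also flag a note that reads "no evidence of measles". Handling negation and context is the kind of task that calls for NLP rather than simple pattern matching.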
IA not only automates predefined, repeated tasks, but also allows the system to learn and adapt. Powered by artificial intelligence and machine learning, an IA system for extracting data from unstructured text can go beyond simple optical character recognition, leveraging NLP to understand context, reduce noise, and improve accuracy.
By employing IA solutions, public health agencies can produce more complete and accurate assessments of disease burden and trends while simultaneously enhancing operational efficiency – eliminating manual, repetitive work and allowing human workers to focus on higher-value tasks.
An action plan for alignment and governance
As federal agencies define and implement a public health surveillance system that integrates rich, interoperable data to power robust analytical tools and IA solutions at scale, long-term success will hinge on alignment with key data partners and clear governance.
They can take these initial steps:
Define one or more discrete, priority use cases to demonstrate the value of data solutions.
Select data partners whose data sources can be integrated into a data mesh solution.
Create a participatory governance framework to address policy, technical, and operational considerations.
By including state and local agencies, HIEs, data aggregators, laboratories, and/or other data partners and focusing on discrete use cases, federal public health leaders can pursue an iterative approach to defining and testing solutions – while simultaneously supporting effective change management across public health stakeholders.
A vision for the future of public health surveillance
As public health agencies integrate – and act on – lessons from the COVID-19 pandemic, strengthening America’s surveillance system represents the highest priority.
By investing in next-generation infrastructure and expanding the universe of available and interoperable data, agencies can establish an analytical pipeline with unprecedented robustness. This pipeline would fuel models and simulations with sufficient power to derive real-time insights – for better policy and programs focused on prevention, control, and response. Armed with the power of intelligent automation, public health agencies can implement these advances without further taxing the workforce – effectively doing more with less.
These strategic investments hold the key to real-time surveillance data and insights that allow our leaders to understand disease burden, predict future risk, develop and evaluate prevention and control strategies, and – ultimately – save lives.