New Directions in AI for Public Health Surveillance

By Daniel B. Neill

NOTE: This is an overview of the entire article, which appeared in the January/February 2012 issue of the IEEE Intelligent Systems magazine.

Public health surveillance is the process of detecting, characterizing, tracking, and responding to disease outbreaks, other health threats (such as a bioterrorist attack, radiation leak, or contamination of the food or water supply), and other patterns relevant to the health of populations (such as obesity, drug abuse, mental health, or malnutrition). This surveillance takes place at the local, state, national, and global scales, and often requires coordination of multiple entities (for example, hospitals, pharmacies, and local, state, and federal public health organizations) to achieve a timely, focused, and effective response to emerging health events. The focus of this article is on the role that AI and machine learning methods can play in assisting public health through the early, automatic detection of emerging outbreaks and other health-relevant patterns.

The last decade has seen major advances in analytical methods for outbreak detection. Deployed systems have incorporated many of these methods, monitoring a variety of public health data sources such as Emergency Department visits and over-the-counter medication sales, and enabling more timely and accurate identification of disease outbreaks in practice. While most existing surveillance systems rely heavily on basic statistical methods such as time series analysis, together with the expert knowledge of public health practitioners, we believe that the disease surveillance field is entering a major paradigm shift driven by a dramatic increase in the number, size, and complexity of available data sources. Current disease surveillance systems rely increasingly on massive quantities of data from nontraditional sources, ranging from Internet search queries and user-generated Web content, to detailed electronic medical records, to continuous data streams from sensor networks, cellular telephones, and other location-aware devices.
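To make the "basic statistical methods" concrete, here is a minimal, illustrative sketch of one standard time series approach to outbreak detection: a one-sided CUSUM detector that flags days when daily case counts drift above a historical baseline. The data, baseline, and thresholds are hypothetical, and deployed systems use far more sophisticated variants.

```python
# Minimal sketch of a basic temporal surveillance method (illustrative only):
# a one-sided CUSUM detector on daily case counts. Counts and parameters
# below are synthetic, not from any real surveillance system.

def cusum_alerts(counts, baseline_mean, k=1.0, h=5.0):
    """Return indices of days whose cumulative upward deviation exceeds h.

    k is the allowance (slack) and h the decision threshold, both in units
    of cases; in practice both are tuned per data stream.
    """
    s, alerts = 0.0, []
    for day, c in enumerate(counts):
        s = max(0.0, s + (c - baseline_mean - k))  # accumulate excess over baseline
        if s > h:
            alerts.append(day)
            s = 0.0  # reset after an alarm
    return alerts

daily_counts = [10, 9, 11, 10, 12, 18, 22, 25, 11, 10]  # synthetic ED visits
print(cusum_alerts(daily_counts, baseline_mean=10.0))  # flags the elevated days
```

The same detector would typically be run independently on each monitored data stream (e.g., per syndrome group or per spatial region).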

Practitioners will increasingly rely on tools and systems that use advanced statistical methods to accurately distinguish relevant from irrelevant patterns, scalable algorithms to process massive quantities of complex, high-dimensional data, and machine learning approaches to continually improve system performance from user feedback. Thus, we believe that the next decade of disease surveillance research will require us to address three main challenges:

  1. Making disease surveillance systems more interactive, enabling users to make sense of the mass of available data.
  2. Exploiting the richness and complexity of novel data sources at the societal scale.
  3. Creating new methods which can scale up to massive quantities of data and can integrate information from large numbers of data sources.

This article discusses two current lines of research with potential to address some of these challenges.

While the future holds great potential for utilizing a multitude of data sources for outbreak detection, the fact that much of this data (including electronic health records and non-traditional public health data sources such as Twitter feeds) exists only as unstructured free text poses a major challenge. Many existing disease surveillance systems approach this problem by predefining a set of broad syndrome groupings (or “prodromes”), such as “respiratory illness,” “gastrointestinal illness,” “influenza-like illness,” and others, and monitoring Emergency Department (ED) visits and other data sources for unexpected increases in the number of cases for each prodrome. The analysis classifies each ED case into one of the existing set of prodromes (or “unknown”) by keyword matching, Bayesian network-based classification, or other text classification approaches, and then applies standard temporal or spatio-temporal surveillance methods to the resulting case counts. The prodrome-based approach has two main disadvantages. First, mapping specific chief complaints (such as “coughing up blood”) to a broader symptom category (“respiratory” or “hemorrhagic”) is likely to dilute the outbreak signal, delaying or preventing detection of an emerging outbreak. Second, approaches based on mapping cases to existing prodromes have little or no ability to detect novel outbreaks with previously unseen symptom patterns.
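The classification step above can be sketched in a few lines. This is a hedged, illustrative toy version of keyword-based prodrome mapping, not any deployed system's rules; the keyword lists and complaints are invented. Note how the specific complaint “coughing up blood” is absorbed into the broad “respiratory” category, illustrating the signal-dilution problem.

```python
# Toy sketch of prodrome-based classification: map each free-text chief
# complaint to a broad syndrome group by keyword matching, then tally counts
# for downstream temporal surveillance. Keywords here are illustrative only.

from collections import Counter

PRODROME_KEYWORDS = {
    "respiratory": ["cough", "shortness of breath", "wheez"],
    "gastrointestinal": ["nausea", "vomit", "diarrhea"],
    "influenza-like illness": ["fever", "chills", "body ache"],
}

def classify(chief_complaint):
    """Assign a complaint to the first matching prodrome, else 'unknown'."""
    text = chief_complaint.lower()
    for prodrome, keywords in PRODROME_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return prodrome
    return "unknown"

complaints = ["persistent cough", "vomiting since morning",
              "high fever and chills", "coughing up blood"]
print(Counter(classify(c) for c in complaints))
```

The resulting per-prodrome daily counts would then feed a temporal or spatio-temporal detection method; any outbreak whose symptoms fall outside the predefined keyword lists lands in “unknown” and is effectively invisible to this pipeline.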

In recent work, the author has addressed this problem by proposing a new text-based spatial event-detection approach, the “semantic scan statistic,” which uses free-text data from Emergency Department chief complaints to detect, localize, and characterize emerging outbreaks of disease. These scan methods dramatically outperformed the prodrome-based method for synthetically generated “novel” outbreaks and for symptoms that did not correspond to any of the pre-existing prodrome types, reducing the average time to detect (for a fixed false-positive rate of one per month) from 11 days to five.

A second major challenge of future disease surveillance systems will be dealing with the scale of the data. Many potential sources of health information contain massive amounts of data, and we wish to detect emerging disease patterns in such datasets in near real time. Additionally, future systems will integrate information from a large number of data sources to achieve more timely detection and improved situational awareness. Thus, future detection systems will require new, computationally efficient algorithms to scale up to the huge amounts of data.

One useful class of methods for solving large-scale detection problems is what the author calls “subset scanning” – treating the pattern-detection problem as a search over subsets of the data, finding those subsets which maximize some measure of interestingness or anomalousness (a “score function”), often subject to additional constraints. The resulting algorithms can scale to hundreds of thousands of data records and integrate information from hundreds of data streams.
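A minimal sketch of subset scanning, under stated assumptions: here the score function is the commonly used expectation-based Poisson log-likelihood ratio, F(S) = C·log(C/B) + B − C, where C and B are the observed and expected counts aggregated over subset S. The sketch exploits a known shortcut (priority-sorting records by their count/baseline ratio, so that the highest-scoring subset is a prefix of the sorted list), which reduces the search from 2^n subsets to n; the names, data, and specific score choice are illustrative and may differ from the author's deployed algorithms.

```python
# Illustrative subset-scanning sketch: score subsets of records with an
# expectation-based Poisson statistic and search only prefixes of records
# sorted by count/baseline ratio. Records below are synthetic.

import math

def score(C, B):
    """Expectation-based Poisson log-likelihood ratio score F(S)."""
    return C * math.log(C / B) + B - C if C > B else 0.0

def best_subset(records):
    """records: list of (name, observed_count, expected_count) tuples."""
    ranked = sorted(records, key=lambda r: r[1] / r[2], reverse=True)
    best, best_score = [], 0.0
    chosen, C, B = [], 0.0, 0.0
    for name, c, b in ranked:            # grow the prefix one record at a time
        C, B = C + c, B + b
        chosen.append(name)
        if score(C, B) > best_score:
            best_score, best = score(C, B), list(chosen)
    return best, best_score

records = [("zip_A", 30, 10), ("zip_B", 12, 10),
           ("zip_C", 9, 10), ("zip_D", 25, 11)]
subset, s = best_subset(records)
print(subset)  # the highest-scoring group of anomalous locations
```

In this toy example the scan picks out only the locations whose joint observed counts are most surprising relative to their baselines, rather than either flagging each location independently or averaging the anomaly over all locations.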

These approaches are only a first step toward addressing truly societal-scale data such as Internet search queries, user-generated Web content, and location and proximity data from cellular telephones. For these and other massive data sources, even algorithms that scale linearly with the size of the data will be insufficient. Enabling disease surveillance systems to scale to the data-driven public health practices of the future will require other techniques such as approximate subset scan algorithms that can sample aggregate data at multiple resolutions.


Daniel B. Neill is Assistant Professor of Information Systems in the Event and Pattern Detection Laboratory, H. John Heinz III College, Carnegie Mellon University.