The Critical Role of Signal Processing and Systems Theory in Systems Biology

By Ilya Shmulevich

Biology has become an exceptionally data-rich science. Powerful technologies now allow us to measure the states of living cells in great detail, producing massive data sets that contain information on DNA sequence, expression of genes or production of RNA, production of proteins and the states of their chemical modifications. These data sets can also include information on the interactions among proteins, DNA, and RNA, as well as the dynamics of such interactions in forming complex regulatory molecular networks. Imaging and microfluidic technologies also make it possible to generate measurements at the single-cell level and to manipulate cells in highly controlled and specific ways in a high-throughput manner, for instance, by silencing each of the genes in the genome and observing the effects at the molecular and cellular levels, over time, in large populations of cells. The integration of these data sets affords the possibility of constructing mathematical models that predict the behavior of the complex molecular regulatory networks that govern cellular functions and responses to environmental cues. When these networks become disrupted, for example by DNA mutations in cancer, mathematical models can be used to predict what molecular perturbations or interventions would be optimal for driving the molecular networks toward desired states that are of therapeutic benefit to the patient.

A systems biology approach will replace the current practice of drug target identification and design, which is largely based on high-throughput screening approaches, with approaches rooted in rational systems-based engineering principles. Such a view of living systems was already apparent to the father of modern systems theory, Norbert Wiener, who in his seminal book Cybernetics (1948), wrote: “As far back as four years ago, the group of scientists about Dr. Rosenblueth and myself had already become aware of the essential unity of the set of problems centering about communication, control, and statistical mechanics, whether in the machine or in living tissue.” Wiener’s vision, which predates Watson and Crick’s double-helix model of DNA, is today a reality. Indeed, the fields that Wiener so fundamentally impacted – signal processing and systems theory – play central roles in computational systems biology [1].

In the past few decades, numerous approaches for modeling the molecular interaction networks in cells have been developed, including continuous dynamical models such as systems of nonlinear ordinary differential equations and discrete dynamical models, such as Boolean networks. Stochastic or probabilistic counterparts of these models include stochastic differential equations (Langevin dynamics) or probabilistic Boolean networks. Other approaches for representing molecular interactions and dynamics are fully probabilistic, such as dynamic Bayesian networks or other graphical models.
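
To make the discrete case concrete, here is a minimal sketch of a synchronous Boolean network in Python. The three genes and their update rules are hypothetical and chosen only for illustration; they do not describe any real pathway or any model from the cited references.

```python
# Hypothetical 3-gene Boolean network; the rules are illustrative assumptions.
RULES = {
    "A": lambda s: s["C"],                 # A is switched on by C
    "B": lambda s: s["A"] and not s["C"],  # B needs A and is repressed by C
    "C": lambda s: not s["B"],             # C is repressed by B
}

def step(state):
    """Synchronously update every gene from the current state."""
    return {gene: bool(rule(state)) for gene, rule in RULES.items()}

def trajectory(state, n_steps):
    """Return the list of states visited, starting from `state`."""
    states = [state]
    for _ in range(n_steps):
        state = step(state)
        states.append(state)
    return states

start = {"A": False, "B": False, "C": False}
path = trajectory(start, 4)
for visited in path:
    print(visited)
```

Starting from all genes off, this toy network settles after two updates into a fixed point in which A and C are on and B is off. A probabilistic Boolean network would, roughly speaking, choose among several candidate rule sets at random at each update.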

A major challenge is the inference of model structure and parameters from heterogeneous measurement data, which is typically a highly underconstrained problem in that the variables or degrees of freedom greatly outnumber the observations. The issues of model selection, including Bayesian and information-theoretic approaches (e.g., minimum description length), complexity regularization, Markov chain Monte Carlo methods, error estimation, feature selection, and other tools from statistical and computational learning theory are of great relevance and require further intensive research efforts.

Once the model is constructed, there is a need to study its dynamics. This may include predicting the response of the system to a particular environmental input signal, analyzing its steady-state behavior, or determining the nature of the transitions among multiple stable steady states in order to understand cellular decision making or the effects of molecular noise. Stochastic processes, nonlinear dynamical systems theory [2], and time-frequency analysis [3] provide the tools for studying biomolecular system dynamics. In addition, the view that living cells are information processing systems makes it possible to employ information-theoretic methods, particularly those rooted in algorithmic information theory, to understand global molecular systems-level dynamical properties of living cells [4, 5] or relationships between the structure and dynamics of these systems [6].
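
As one illustration of multiple stable steady states, the sketch below integrates a two-gene mutual-repression ("toggle switch") model with the forward Euler method. The model form, the parameter values (alpha = 4, Hill coefficient n = 2), and the function names are assumptions made for this example; they are not taken from the cited references.

```python
def toggle_derivatives(u, v, alpha=4.0, n=2):
    """Mutual repression: each gene's production is inhibited by the other."""
    du = alpha / (1.0 + v ** n) - u
    dv = alpha / (1.0 + u ** n) - v
    return du, dv

def simulate(u0, v0, t_end=50.0, dt=0.01):
    """Integrate with forward Euler and return the final (u, v)."""
    u, v = u0, v0
    for _ in range(int(round(t_end / dt))):
        du, dv = toggle_derivatives(u, v)
        u, v = u + dt * du, v + dt * dv
    return u, v

# Two initial conditions on opposite sides of the u = v diagonal.
high_u = simulate(2.0, 0.1)
high_v = simulate(0.1, 2.0)
print("start with u ahead:", high_u)
print("start with v ahead:", high_v)
```

With these assumed parameters, whichever gene starts ahead stays ahead, so the two runs end in different steady states. That dependence on initial conditions is the simplest picture of a cellular "decision". Adding a noise term to each update would turn this into a crude stochastic simulation in which transitions between the two states become possible.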

A number of approaches have been developed for determining the optimal intervention or control strategies intended to drive the state of the system, such as a genetic regulatory network, towards some desired set of states or away from some undesired states. Developed primarily within the modeling framework of probabilistic Boolean networks, these optimal control strategies entail dynamic programming, Markov chain perturbation theory and steady-state analysis, and robust control [2]. The ultimate goal of these approaches is to develop personalized therapies to maximize efficacy for each patient.
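
The dynamic programming idea can be sketched on a toy problem. The example below assumes the network dynamics have already been reduced to a three-state Markov chain (desirable, intermediate, undesirable) with one optional intervention. All transition probabilities and costs are hypothetical, and the code is a generic finite-horizon backward induction, not the specific algorithms of [2].

```python
# Toy 3-state Markov chain: 0 = desirable, 1 = intermediate, 2 = undesirable.
# All probabilities and costs below are hypothetical.
P = {
    0: [[0.9, 0.1, 0.0], [0.1, 0.6, 0.3], [0.0, 0.1, 0.9]],  # action 0: no intervention
    1: [[0.9, 0.1, 0.0], [0.6, 0.3, 0.1], [0.3, 0.5, 0.2]],  # action 1: intervene
}
STATE_COST = [0.0, 1.0, 5.0]     # cost of occupying each state
ACTION_COST = {0: 0.0, 1: 1.0}   # cost of applying each action

def finite_horizon_control(horizon):
    """Backward induction: return expected costs-to-go and the optimal policy."""
    n = len(STATE_COST)
    values = [None] * (horizon + 1)
    policy = [None] * horizon
    values[horizon] = list(STATE_COST)  # terminal cost
    for t in range(horizon - 1, -1, -1):
        values[t], policy[t] = [], []
        for s in range(n):
            costs = {
                a: STATE_COST[s] + ACTION_COST[a]
                + sum(P[a][s][k] * values[t + 1][k] for k in range(n))
                for a in P
            }
            best = min(costs, key=costs.get)  # ties go to no intervention
            values[t].append(costs[best])
            policy[t].append(best)
    return values, policy

values, policy = finite_horizon_control(horizon=5)
print("policy at the last decision step:", policy[-1])
```

In this toy setting, the policy at the last decision step leaves the desirable state alone and intervenes in the other two, because there the expected reduction in future cost outweighs the assumed cost of intervening.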

The measurement data collected from patients can be used to classify or stratify patients into clinically distinct groups or to predict clinical outcomes or endpoints. This is particularly important when patient groups are clearly defined on the molecular level, but are not apparent on the phenotypic level, for instance with traditional histopathology [7]. Statistical pattern recognition or machine learning approaches can be harnessed for classifying cancers into distinct clinical subtypes, for predicting the response to a particular chemotherapeutic drug, and for diagnostic purposes.

For example, The Cancer Genome Atlas (TCGA) project, an NIH-funded project, generates petabytes of heterogeneous data from approximately 10,000 patients spanning over 20 tumor types. The wealth of molecular measurement data, including tumor and normal (germline) DNA sequence and histology image data, coupled with the clinical data available for each patient (e.g., tumor grade and subtype, vascular invasion, age of onset, distant metastasis and pathologic spread, survival, etc.), allows the discovery of multivariate associations among the molecular and clinical features. Many challenges arise due to large feature spaces relative to the number of observations, correlated or redundant features, noisy and missing data, and mixed data types such as continuous, discrete, or categorical data. It is important to design algorithms that can deal with these challenges. One such algorithm (RF-ACE), based on random forest regression, is being developed as a collaboration between the Computational Systems Biology group at Tampere University of Technology, Finland and the Institute for Systems Biology in Seattle, WA, which houses one of the Genome Data Analysis Centers within TCGA.

The next frontier in computational systems biology will be extending the modeling and analysis efforts across multiple biological scales of organization, from molecules, to cells, tissues, and organs. These multiscale models will capture the molecular network dynamics within individual cells of different types, governing cellular states and decision-making, the chemical and mechanical interactions among cells, the formation of multicellular structures (e.g. blood vessels), and the interactions with the extracellular matrix. Multiscale modeling efforts will require high-performance computing to simulate biological systems at multiple temporal and spatial scales at the level of billions of cells, which will make it possible to predict the effects of molecular perturbations on the tissue level. These predictions could be used to identify specific nodes in genetic regulatory networks that, when perturbed, will lead to a decrease in the amount of vasculature in a solid tumor. The continuing increase of complexity and scale of data underscores the critical role of systems theory and signal processing in the life sciences.

Ilya Shmulevich is a Professor at the Institute for Systems Biology and an Affiliate Professor in the Department of Bioengineering and the Department of Electrical Engineering, University of Washington. He is a Senior Member of the IEEE.

To find out more about this topic, view the presentation titled “Computational Systems Biology” given by the author at the IEEE Board of Directors meeting on June 26, 2011 in Bellevue, WA.

References
  1. I. Shmulevich and E. R. Dougherty, Genomic Signal Processing, Princeton University Press, 2007.
  2. I. Shmulevich and E. R. Dougherty, Probabilistic Boolean Networks: The Modeling and Control of Gene Regulatory Networks, SIAM Press, 2009.
  3. A. V. Ratushny, I. Shmulevich, J. D. Aitchison, “Trade-off between responsiveness and noise suppression in biomolecular system responses to environmental cues,” PLoS Computational Biology, Vol. 7, No. 6, e1002091, 2011.
  4. M. Nykter, N. D. Price, M. Aldana, S. A. Ramsey, S. A. Kauffman, L. Hood, O. Yli-Harja, I. Shmulevich, “Gene Expression Dynamics in the Macrophage Exhibit Criticality,” Proceedings of the National Academy of Sciences of the USA, Vol. 105, No. 6, pp. 1897-1900, 2008.
  5. D. J. Galas, M. Nykter, G. W. Carter, N. D. Price, I. Shmulevich, “Biological information as set-based complexity,” IEEE Transactions on Information Theory, Vol. 56, No. 2, pp. 667-677, 2010.
  6. M. Nykter, N. D. Price, A. Larjo, T. Aho, S. A. Kauffman, O. Yli-Harja, I. Shmulevich, “Critical networks exhibit maximal information diversity in structure-dynamics relationships,” Physical Review Letters, Vol. 100, 058702, 2008.
  7. N. D. Price, J. Trent, A. K. El-Naggar, D. Cogdell, E. Taylor, K. K. Hunt, R. E. Pollock, L. Hood, I. Shmulevich, W. Zhang, “Highly accurate two-gene classifier for differentiating gastrointestinal stromal tumors and leiomyosarcomas,” Proceedings of the National Academy of Sciences of the USA, Vol. 104, No. 9, pp. 3414-3419, 2007.