By Mathukumalli Vidyasagar
(This is the first article in a continuing feature of the Newsletter.)
“Computational biology” is a phrase that means different things to different people, ranging all the way from sequence alignment, to turning very short “reads” of DNA fragments into a complete genome, to predicting adverse responses in clinical trials. In this month’s article, I look at the role of machine learning in computational biology. Is it necessary – and is it sufficient, as it is currently implemented?
Welcome to the first of a series of monthly columns on the broad topic of computational biology.
This column is not meant to be archival. Rather, it represents my thoughts of the moment, in James Joyce’s “stream of consciousness” format. At some future date the column might become interactive, whereby readers will be able to post comments; but for the moment I shall be strictly in a “broadcast only” mode. Nevertheless, I welcome feedback of all kinds, including suggestions on topics for future columns.
“Computational biology” is a phrase that means different things to different people, ranging all the way from sequence alignment, to turning very short “reads” of DNA fragments into a complete genome, to predicting adverse responses in clinical trials. My own interests lie in those aspects that are closest to translational medicine. However, that is merely my personal preference, and I realize that computationally minded individuals can make a valuable contribution at all stages of the value chain in the drug discovery process.
In his 2002 Nobel Prize Lecture, Prof. Sydney Brenner commented that “We are all conscious today that we [the biologists] are drowning in a sea of data and starving for knowledge.” In my view, “turning data into knowledge” is the motto that we engineers should adopt in any collaboration with the biological community. However, in order for the collaboration to be successful, we as engineers need to go more than 50% of the way on the road to biology. My own experience has been that engineers can pick up the rudiments of biology far more easily than biologists can pick up the rudiments of mathematics. My own collaborators constantly joke: “Why do you think we became biologists? Because we’re scared of math!” Indeed, in a provocative statement published in Nature (Volume 438, page 1079, December 2005), Dr. Marvin Cassman, Director of the Institute for Bioengineering, Biotechnology and Quantitative Biomedical Engineering at UC San Francisco, said, “Unfortunately the translation of systems biology into a broader approach is complicated by the innumeracy of many biologists. Some modicum of mathematical training will be required, reversing the trend of the past 30 years, in which biology has become a discipline for people who want to do science without learning mathematics.” While I hope that the biology community will heed the admonition of one of their own kind, I am not holding my breath for biologists to overcome their math phobia anytime soon! So it is incumbent on us, the engineers, to travel most of the way down the road to meet the biologists.
The meaning of the phrase “computational biology” at a given point in time is a function of the comfort level that the biology community feels with computation at that time. Ten years ago a person whose only skill consisted of submitting queries to BLAST could pass himself or herself off as a “computational biologist”! I can recall from my undergraduate days (the era of keypunch cards, Hollerith codes and so on) that there were these exotic creatures known as “computer programmers” who were privy to arcane and mysterious rites for propitiating the digital computer in a way that the rest of us weren’t. So these high priests of programming were anointed to act as intermediaries between us mere mortals and the gods represented by digital computers. Fortunately those days are behind us now, as are the days when biologists depended on someone else to submit their queries to BLAST.
In future columns I hope to discuss several topics that are on my mind these days. For this, the initial column, let me briefly discuss the topic of machine learning, which, with its heavy reliance on probability theory, would appear to be about as far from biology as one can imagine. Nevertheless, I believe firmly that machine learning is just about the only way in which patterns can be unearthed from massive amounts of data. I do not believe in “visualization,” in which very high-dimensional data is projected onto a two-dimensional space for the comfort and convenience of human beings. It is quite easy to construct highly correlated high-dimensional data that looks completely random and uncorrelated when projected onto any two components. However, I believe equally strongly that entirely new approaches to machine learning need to be developed in order to analyze biological data. Conventional engineering applications of ML theory have large numbers of samples in a relatively low-dimensional feature space; recognition of speech or of handwritten characters is a canonical example. In contrast, in biological problems the data almost always consists of relatively few samples in an extremely high-dimensional feature space. For instance, a typical genome-wide expression study may consist of expression profiles of 20,000 or so genes, across a few hundred samples at most. Viewed as a matrix, the data set is obviously of very low rank. However, methods that would require us to compute a small number of linear combinations of all columns, such as principal component analysis or singular value decomposition, are not practical. Any practical approach to using such data for predictive purposes would require us to identify just a handful of “most informative” features and discard the rest. There are already several interesting papers on this topic, and they can be found using the search phrase “recursive feature elimination.”
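To make the idea concrete, the elimination loop can be sketched in a few lines of plain Python. Everything here is an illustrative assumption, not any particular published pipeline: a toy data set with few samples and many features (only the first three carry signal), a simple gradient-descent logistic fit standing in for the classifier, and a target panel size chosen arbitrarily.

```python
import math
import random

random.seed(0)

# Toy stand-in for an expression data set: few samples (n), many
# features (d), with only the first `panel_size` features informative.
# These numbers are illustrative assumptions.
n, d, panel_size = 30, 40, 3
X, y = [], []
for _ in range(n):
    x = [random.gauss(0.0, 1.0) for _ in range(d)]
    y.append(1 if sum(x[:panel_size]) > 0 else -1)
    X.append(x)

def fit_logistic(X, y, feats, epochs=100, lr=0.01):
    """Gradient-descent logistic fit restricted to the features in `feats`."""
    w = [0.0] * len(feats)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(w[j] * xi[f] for j, f in enumerate(feats))
            # d(loss)/dz for logistic loss; clamp the exponent for safety.
            g = -yi / (1.0 + math.exp(min(yi * z, 60.0)))
            for j, f in enumerate(feats):
                w[j] -= lr * g * xi[f]
    return w

# Recursive feature elimination: refit the model, discard the feature
# carrying the smallest absolute weight, and repeat until only a small
# panel of "most informative" features remains.
feats = list(range(d))
while len(feats) > panel_size:
    w = fit_logistic(X, y, feats)
    weakest = min(range(len(feats)), key=lambda j: abs(w[j]))
    feats.pop(weakest)

print(sorted(feats))  # the surviving candidate panel
```

In practice one would eliminate features in batches rather than one at a time, and wrap the whole loop in cross-validation so that the selected panel is not an artifact of a single train/test split; library implementations of recursive feature elimination do both.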
My own students have recently completed an exercise whereby they started with expression data on 17,814 genes in ovarian cancer tissues, and isolated a panel of just 28 genes that are able to predict whether a patient is likely to be a non-responder to front-line therapy, with greater than 90% accuracy. Since the a priori probability of a patient being a non-responder is only about 20%, identifying such likely non-responders beforehand so that the physician can have a “Plan B” ready is a significant achievement in my view; let me see if I have any success persuading the physicians!
There are many other such challenges that contemporary biological applications pose to us practicing engineers. In short, biology and computation will enrich each other, if pursued properly.
See you next month!