Thursday, July 10, 2008

A new kind of science, as the data deluge makes the scientific method obsolete...

An article by Chris Anderson in Wired Magazine, pointed out to me by my son Jon, argues that science as we have known it has ended. The argument is that the quest for knowledge that used to begin with grand theories now, in the petabyte age, begins with massive amounts of data. Google has set the new model for science. I show some clips here, followed by the counter-argument from John Timmer at Ars Technica:
Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day. And Google was right...Google's founding philosophy is that we don't know why this page is better than that one: If the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.
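To make the flavor of that concrete, here is a minimal, purely illustrative sketch of ranking pages from link statistics alone, in the spirit of a power-iteration PageRank. The toy graph below is invented for illustration; this is not Google's actual algorithm.

```python
# A minimal sketch of ranking pages purely from link statistics
# (a toy power iteration in the spirit of PageRank; the graph below
# is invented for illustration, not real data).

def rank_pages(links, damping=0.85, iterations=50):
    """links maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:          # dangling page: spread its score evenly
                for p in pages:
                    new[p] += damping * score[page] / n
            else:
                share = damping * score[page] / len(outgoing)
                for target in outgoing:
                    new[target] += share
        score = new
    return score

if __name__ == "__main__":
    toy_web = {
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
        "d": ["c"],
    }
    for page, s in sorted(rank_pages(toy_web).items(), key=lambda kv: -kv[1]):
        print(page, round(s, 3))
```

The point of the sketch is exactly Anderson's: nothing in it "knows" what any page is about, yet the link statistics alone produce a usable ranking.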

The hypothesize-model-test model of science is becoming obsolete...The models we were taught in school about "dominant" and "recessive" genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton's laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility...the more we learn about biology, the further we find ourselves from a model that can explain it...There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
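In miniature, "correlation is enough" looks something like the sketch below: hand a data matrix to a correlation routine and report the strongest pairs, with no hypothesis about what any column means. The data here are randomly generated stand-ins, with one relationship planted so the search has something to find.

```python
# A toy version of "let the algorithms find the patterns": compute all
# pairwise correlations in a data matrix and report the strongest ones,
# with no hypothesis about what the columns mean. The data are randomly
# generated stand-ins for real measurements.

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 8
data = rng.normal(size=(n_samples, n_features))
data[:, 3] = 0.7 * data[:, 1] + 0.3 * rng.normal(size=n_samples)  # plant one real relationship

corr = np.corrcoef(data, rowvar=False)          # feature-by-feature correlation matrix
pairs = [(abs(corr[i, j]), i, j)
         for i in range(n_features) for j in range(i + 1, n_features)]

for strength, i, j in sorted(pairs, reverse=True)[:5]:
    print(f"feature {i} vs feature {j}: |r| = {strength:.2f}")
```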

The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.

Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It's just data. By analyzing it with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation.

This kind of thinking is poised to go mainstream. In February, the National Science Foundation announced the Cluster Exploratory, a program that funds research designed to run on a large-scale distributed computing platform developed by Google and IBM in conjunction with six pilot universities. The cluster will consist of 1,600 processors, several terabytes of memory, and hundreds of terabytes of storage, along with the software, including Google File System, IBM's Tivoli, and an open source version of Google's MapReduce. Early CluE projects will include simulations of the brain and the nervous system and other biological research that lies somewhere between wetware and software.
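For readers who haven't met MapReduce, here is a single-machine toy of the pattern (map, shuffle by key, reduce), counting words as the classic example. Real clusters distribute these phases across many machines; this sketch only shows the shape of the computation and is not the Google or Hadoop implementation.

```python
# A single-machine imitation of the MapReduce pattern: map each record to
# key/value pairs, shuffle them together by key, then reduce each group.
# Word counting is the standard illustration.

from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

docs = ["the data deluge", "the petabyte age", "data data everywhere"]
print(reduce_phase(shuffle(map_phase(docs))))
```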
Here is the immediate rejoinder to this article from John Timmer at Ars Technica.
Every so often, someone (generally not a practicing scientist) suggests that it's time to replace science with something better. The desire often seems to be a product of either an exaggerated sense of the potential of new approaches, or a lack of understanding of what's actually going on in the world of science. This week's version, which comes courtesy of Chris Anderson, the Editor-in-Chief of Wired, manages to combine both of these features in suggesting that the advent of a cloud of scientific data may free us from the need to use the standard scientific method.

It's easy to see what has Anderson enthused. Modern scientific data sets are increasingly large, comprehensive, and electronic. Things like genome sequences tell us all there is to know about the DNA present in an organism's cells, while DNA chip experiments can determine every gene that's expressed by that cell. That data's also publicly available—out in the cloud, in the current parlance—and it's being mined successfully. That mining extends beyond traditional biological data, too, as projects like WikiProteins are also drawing on text-mining of the electronic scientific literature to suggest connections among biological activities.
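The text-mining idea can be sketched in a few lines: count which terms co-occur in the same abstract and surface the most frequent pairs. The "abstracts" and term list below are invented one-liners, not WikiProteins' actual method, which is far more sophisticated.

```python
# A toy sense of literature text-mining: count which terms of interest
# co-occur in the same abstract and report the most frequent pairs.
# The "abstracts" here are invented one-liners, not real papers.

from collections import Counter
from itertools import combinations

abstracts = [
    "protein kinase regulates cell cycle",
    "kinase inhibitor blocks cell cycle progression",
    "membrane protein binds kinase",
]

terms_of_interest = {"protein", "kinase", "cell", "cycle", "inhibitor", "membrane"}
pair_counts = Counter()
for text in abstracts:
    present = sorted(terms_of_interest & set(text.split()))
    pair_counts.update(combinations(present, 2))

for pair, count in pair_counts.most_common(3):
    print(pair, count)
```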

There is a lot to like about these trends, and little reason not to be enthused about them. They hold the potential to suggest new avenues of research that scientists wouldn't have identified based on their own analysis of the data. But Anderson appears to take the position that the new research part of the equation has become superfluous; simply having a good algorithm that recognizes the correlation is enough.

The source of this flight of fancy was apparently a quote by Google's research director, who repurposed a cliché that most scientists are aware of: "All models are wrong, and increasingly you can succeed without them." And Google clearly has. It doesn't need to develop a theory as to why a given pattern of links can serve as an indication of valuable information; all it needs to know is that an algorithm that recognizes specific link patterns satisfies its users. Anderson's argument distills down to the suggestion that science can operate on the same level—mechanisms, models, and theories are all dispensable as long as something can pick the correlations out of masses of data.

I can't possibly imagine how he comes to that conclusion. Correlations are a way of catching a scientist's attention, but the models and mechanisms that explain them are how we make the predictions that not only advance science, but generate practical applications. One only needs to look at a promising field that lacks a strong theoretical foundation—high-temperature superconductivity springs to mind—to see how badly the lack of a theory can impact progress. Put in more practical terms, would Anderson be willing to help test a drug that was based on a poorly understood correlation pulled out of a datamine? These days, we like our drugs to have known targets and mechanisms of action and, to get there, we need standard science.
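Timmer's caution is easy to demonstrate with a small simulation: scan enough unrelated variables against an outcome and strong-looking correlations appear by chance alone. Everything below is generated noise, so every "hit" is spurious by construction.

```python
# A small simulation of the caution above: with few samples and many
# unrelated variables, strong-looking correlations appear by chance.
# The "measurements" are pure noise, so every hit reported is spurious.

import numpy as np

rng = np.random.default_rng(42)
n_samples, n_variables = 30, 2000          # few samples, many variables
outcome = rng.normal(size=n_samples)       # e.g. a simulated patient response
measurements = rng.normal(size=(n_samples, n_variables))

r = np.array([np.corrcoef(outcome, measurements[:, j])[0, 1]
              for j in range(n_variables)])
hits = np.flatnonzero(np.abs(r) > 0.5)
print(f"{len(hits)} variables correlate with the outcome at |r| > 0.5, "
      "all of them by chance")
```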

Anderson does provide two examples that he feels support his position, but they actually appear to undercut it. He notes that we know quantum mechanics is wrong on some level, but have been unable to craft a replacement theory after decades of work. But he neglects to mention two key things: without the testable predictions made by the theory, we'll never be able to tell how precisely it is wrong and, in those decades where we've failed to find a replacement, the predictions of quantum mechanics have been used to create the modern electronics industry, with the data cloud being a consequence of that.

If anything, his second example is worse. We can now perform large-scale genetic surveys of the life present in remote environments, such as the far reaches of the Pacific. Doing so has informed us that there's a lot of unexplored biodiversity on the bacterial level; fragments of sequence hint at organisms we've never encountered under a microscope. But as Anderson himself notes, the only thing we can do is make a few guesses as to the properties of the organisms based on who their relatives are, an activity that actually requires a working scientific theory, namely evolution. To do more than that, we need to deploy models of metabolism and ecology against the bacteria themselves.
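The "guessing from relatives" step can also be sketched: assign an unknown DNA fragment to whichever known organism shares the most short subsequences (k-mers) with it. The sequences below are invented, and real metagenomic classification rests on much larger references and explicit models of evolution, which is exactly Timmer's point.

```python
# A toy illustration of "guessing from relatives": assign an unknown DNA
# fragment to the known organism whose sequence shares the most k-mers.
# The sequences are invented; real metagenomic classification uses far
# larger references and statistical models of evolution.

def kmers(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

references = {
    "known_cyanobacterium": "ATGGCGTACGTTAGCATCGGATCCGTTAACGT",
    "known_proteobacterium": "ATGTTTCCGGAAATCGCGTATTGCAGGCTTAA",
}

unknown_fragment = "TACGTTAGCATCGGATCC"   # hypothetical read from an ocean sample

def similarity(a, b):
    ka, kb = kmers(a), kmers(b)
    return len(ka & kb) / len(ka | kb)     # Jaccard similarity of k-mer sets

best = max(references, key=lambda name: similarity(unknown_fragment, references[name]))
print("closest known relative:", best)
```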

Overall, the foundation of the argument for a replacement for science is correct: the data cloud is changing science, and leaving us in many cases with a Google-level understanding of the connections between things. Where Anderson stumbles is in his conclusions about what this means for science. The fact is that we couldn't have even reached this Google-level understanding without the models and mechanisms that he suggests are doomed to irrelevance. But, more importantly, nobody, including Anderson himself if he had thought about it, should be happy with stopping at this level of understanding of the natural world.
