Monday, December 20, 2010

Culturomics

This is a bit mind-blowing. Here are the New York Times article, the Science summary by Bohannon, the abstract and article PDF of the collective effort by Google and academic researchers (including Steven Pinker and Martin Nowak), and the PDF of their supplement giving the details. The abstract:

We constructed a corpus of digitized texts containing about 4% of all books (5,195,769 digitized books) ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. "Culturomics" extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.
Clips from the Bohannon review:
The researchers have revealed 500,000 English words missed by all dictionaries, tracked the rise and fall of ideologies and famous people, and, perhaps most provocatively, identified possible cases of political suppression unknown to historians...tracking the ebb and flow of “Sigmund Freud” and “Charles Darwin” reveals an ongoing intellectual shift: Freud has been losing ground, and Darwin finally overtook him in 2005...the amount of data that Google Books offers...currently includes 2 trillion words from 15 million books, about 12% of every book in every language published since the Gutenberg Bible in 1450. By comparison, the human genome is a mere 3-billion-letter poem...the size of the English language has nearly doubled over the past century, to more than 1 million words. And vocabulary seems to be growing faster now than ever before.

2 comments:

  1. This is really cool stuff.

    I have always thought it would be fascinating to apply this kind of work to the blogosphere.

    Doing so would allow you to track, for any given subject, the rise of trends, collective opinions, attitudes, beliefs etc...

  2. It's a great opportunity, but not without its problems. This article discusses some problems with OCR'd books and the word lists they generate:

    http://searchengineland.com/when-ocr-goes-bad-googles-ngram-viewer-the-f-word-59181
