Thursday, May 31, 2012

Quantitative analysis of language evolution.

Modern computational techniques and the mass digitization of books have made possible the systematic analysis of one of humankind's most important cultural artifacts, its languages. An analysis by Hughes et al. is quite different from studies in the dating of literary works, the analysis of the coarse-grained structure of literary history (and the evolution of genre), and most notably, a recent analysis of Google Books that examined temporal trends in content-word usage. (One of the co-authors of the study is a polymath, David Krakauer, who recently became Director of our Wisconsin Institute of Discovery here at the University of Wisconsin and is also Co-Director of its Center for Complexity and Collective Computation.) Hughes et al. focus on the usage of content-free words as the basis of a first large-scale study of the similarity structure of literary style. Content-free words are the “syntactic glue” of a language: they carry little meaning on their own but form the bridge between words that convey meaning. Their joint frequency of usage is known to provide a useful stylistic fingerprint for authorship, and thus suggests a method of comparing author styles. The dataset was a subset of 537 authors in the Project Gutenberg database, composed of those who wrote after the year 1550, had at least five works in English in the Project Gutenberg collection, and for whom birth and death date information was available. The primary results of the analysis are that time provides the most coherent means of clustering work and that a trend of diminishing stylistic influence is observed as one moves forward in time. Such a finding is consistent with a simple evolutionary model for stylistic influence, which assumes that imitation attends preferentially to contemporary authors.
The authors uncover quantitative support for the previously purely anecdotal notion of a literary “style of a time.” They note that their findings suggest the utility of stylometric analysis in culturomics, and perhaps the creation of a new field devoted to it. Here is their abstract:
Literature is a form of expression whose temporal structure, both in content and style, provides a historical record of the evolution of culture. In this work we take on a quantitative analysis of literary style and conduct the first large-scale temporal stylometric study of literature by using the vast holdings in the Project Gutenberg Digital Library corpus. We find temporal stylistic localization among authors through the analysis of the similarity structure in feature vectors derived from content-free word usage, nonhomogeneous decay rates of stylistic influence, and an accelerating rate of decay of influence among modern authors. Within a given time period we also find evidence for stylistic coherence with a given literary topic, such that writers in different fields adopt different literary styles. This study gives quantitative support to the notion of a literary “style of a time” with a strong trend toward increasingly contemporaneous stylistic influence.
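To make the phrase "feature vectors derived from content-free word usage" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the procedure Hughes et al. used: the ten-word list stands in for their 307 content-free words, the function names are hypothetical, and cosine similarity is a swapped-in measure chosen only because it is simple.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stand-in for the study's list of 307 content-free words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "was"]


def style_vector(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word among all tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[word] / total for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical snippets standing in for two authors' collected works.
    author_a = style_vector("It was the best of times, it was the worst of times.")
    author_b = style_vector("Call me Ishmael. It is a way I have of driving off the spleen.")
    print(cosine_similarity(author_a, author_b))
```

In a real analysis each vector would be built from an author's full body of work, so that the frequencies reflect habit rather than the accidents of a single sentence.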
It seems a bit amazing that their analysis of the use of 307 content-free words (prepositions, articles, conjunctions, “to be” verbs, and some common nouns and pronouns) allowed them to cluster authors in time and by narrative theme. It is equally striking that content-free word frequencies were found to be fairly faithfully transmitted among authors of a similar period, even though conscious imitation at this level of textual resolution seems to be out of the question. Moving into the present, this imitation becomes increasingly localized to our contemporaries. Further edited clips:
We propose that for the earliest periods in our dataset, and through the early modern period, the number of published works remained relatively low. This allowed authors to have sufficient time to sample (read) very broadly from the full range of historically published works. Common phrasing, and norms of syntax and grammar, remain relatively unchanged for long periods of time. This generates decay rates in similarity as a function of temporal distance that are not significantly different from the average, because authors are influenced by models distributed uniformly in time. However, for more recent authors, the number of possible choices of books to read has increased dramatically, and with a finite amount of time, a subset of these works must be chosen, leading to rather heterogeneous reading patterns and a greater overall diversity of authored works. The pattern accelerates in the later modern period, with even more authors to choose from and selection dominated by contemporaneous authors. This suggests a simple evolutionary model for patterns of influence.
The negative influence of authors from a preceding generation in the period 1907–1952 could be explained by the Modernist movement. Modernist authors, who are contained within this time period, display a radical shift in style as they reject their immediate stylistic predecessors yet remain a part of a dominant movement that included many of their contemporaries. The contemporary influence of writing programs and their often close readings of contemporary works and feedback (sometimes called “reflexive modernism”) has also been suggested to contribute to this effect. The overall pattern that we find is that the stylistic influence of the past is diminishing at an increasing rate, which suggests that style itself is evolving at an accelerating pace.
The patterns of influence are a first discovery from the corpus. Implicit in this is a temporal clustering of similarity and quantitative support for the qualitative suggestions of a notion of a “style of a time.” It is also worth noting that the implicit temporal clustering of similarity is not an exclusively temporal phenomenon. A network representation of the authors reveals evidence of thematic clustering as well. Examples include interesting groupings of English poets and playwrights, military leaders, and a collection of important naturalists, social thinkers, and historians. This is suggestive and supportive of the hypothesis that word frequencies are not only typical of a given time but also of a field of inquiry. Historians and naturalists do not only write about different topics, they write about them differently. Taken together with the patterns of decay in influence this suggests that whereas authors of the 18th and 19th centuries continued to be influenced by previous centuries, authors of the late 20th century are strongly influenced by authors from their own decade. The so-called “anxiety of influence”, whereby authors are understood in terms of their response to canonical precursors, is becoming an “anxiety of impotence,” in which the past exerts a diminishing stylistic influence on the present. These results are consistent with many complex, scaling phenomena such as those found in urban and technological systems, where there has been an accelerating rate of change into the present. This is a rather intriguing pattern of short-term cultural evolution that is different from the constant rates of change reported for names and pottery or the reduced rates of lexical substitution of frequently used words over thousands of years. Further analysis will elucidate not only the transmission mechanisms generating temporally localized styles but additional stylistic factors that help differentiate the style of one author from that of another.
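The idea of similarity decaying with temporal distance, which runs through the clips above, can be illustrated with a short Python sketch. This is a toy under stated assumptions and not the authors' analysis: each author is reduced to a single hypothetical year and a made-up style vector, cosine similarity stands in for whatever measure the study used, and `similarity_by_gap` is a name invented here.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_by_gap(authors, bin_width=50):
    """Average pairwise similarity, grouped by the gap in years between authors.

    `authors` is a list of (year, style_vector) pairs. The result maps the
    start of each gap bin (0, bin_width, 2 * bin_width, ...) to the mean
    similarity of all author pairs whose year gap falls in that bin.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (year_a, vec_a), (year_b, vec_b) in combinations(authors, 2):
        gap_bin = (abs(year_a - year_b) // bin_width) * bin_width
        sums[gap_bin] += cosine_similarity(vec_a, vec_b)
        counts[gap_bin] += 1
    return {gap: sums[gap] / counts[gap] for gap in sorted(sums)}


if __name__ == "__main__":
    # Made-up data: three authors with two-dimensional style vectors.
    toy_authors = [(1800, [1, 0]), (1810, [1, 0]), (1900, [0, 1])]
    print(similarity_by_gap(toy_authors))
```

If style were strongly localized in time, the averages would fall off quickly as the gap grows; a flatter curve would correspond to the earlier periods described above, in which authors sampled broadly from the past.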
