Tenenbaum et al. offer an utterly fascinating review of attempts to understand cognitive development by reverse engineering. They offer a simple description of Bayesian or probabilistic approaches that even I can (finally) begin to understand. They state the problem:
For scientists studying how humans come to understand their world, the central challenge is this: How do our minds get so much from so little? We build rich causal models, make strong generalizations, and construct powerful abstractions, whereas the input data are sparse, noisy, and ambiguous—in every way far too limited. A massive mismatch looms between the information coming in through our senses and the outputs of cognition.
Here are several clips from the article (I can send a PDF of the whole article to interested readers). They start with an illustration:
Figure Legend: Human children learning names for object concepts routinely make strong generalizations from just a few examples. The same processes of rapid generalization can be studied in adults learning names for novel objects created with computer graphics. (A) Given these alien objects and three examples (boxed in red) of “tufas” (a word in the alien language), which other objects are tufas? Almost everyone selects just the objects boxed in gray. (B) Learning names for categories can be modeled as Bayesian inference over a tree-structured domain representation. Objects are placed at the leaves of the tree, and hypotheses about categories that words could label correspond to different branches. Branches at different depths pick out hypotheses at different levels of generality (e.g., Clydesdales, draft horses, horses, animals, or living things). Priors are defined on the basis of branch length, reflecting the distinctiveness of categories. Likelihoods assume that examples are drawn randomly from the branch that the word labels, favoring lower branches that cover the examples tightly; this captures the sense of suspicious coincidence when all examples of a word cluster in the same part of the tree. Combining priors and likelihoods yields posterior probabilities that favor generalizing across the lowest distinctive branch that spans all the observed examples (boxed in gray).
The authors continue by offering examples of hierarchical Bayesian models with different graphical matrices, and then argue that the Bayesian approach brings us closer to understanding cognition than older connectionist or neural network models.
“Bayesian” or “probabilistic” are merely placeholders for a set of interrelated principles and theoretical claims. The key ideas can be thought of as proposals for how to answer three central questions:
1) How does abstract knowledge guide learning and inference from sparse data?
2) What forms does abstract knowledge take, across different domains and tasks?
3) How is abstract knowledge itself acquired?
At heart, Bayes’s rule is simply a tool for answering question 1: How does abstract knowledge guide inference from incomplete data? Abstract knowledge is encoded in a probabilistic generative model, a kind of mental model that describes the causal processes in the world giving rise to the learner’s observations as well as unobserved or latent variables that support effective prediction and action if the learner can infer their hidden state. Generative models must be probabilistic to handle the learner’s uncertainty about the true states of latent variables and the true causal processes at work. A generative model is abstract in two senses: It describes not only the specific situation at hand, but also a broader class of situations over which learning should generalize, and it captures in parsimonious form the essential world structure that causes learners’ observations and makes generalization possible.
Bayesian inference gives a rational framework for updating beliefs about latent variables in generative models given observed data. Background knowledge is encoded through a constrained space of hypotheses H about possible values for the latent variables, candidate world structures that could explain the observed data. Finer-grained knowledge comes in the “prior probability” P(h), the learner’s degree of belief in a specific hypothesis h prior to (or independent of) the observations. Bayes’s rule updates priors to “posterior probabilities” P(h|d) conditional on the observed data d:
$$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{\sum_{h' \in H} P(d \mid h')\,P(h')} \propto P(d \mid h)\,P(h)$$
The posterior probability is proportional to the product of the prior probability and the likelihood P(d|h), measuring how expected the data are under hypothesis h, relative to all other hypotheses h′ in H.
To illustrate Bayes’s rule in action, suppose we observe John coughing (d), and we consider three hypotheses as explanations: John has h1, a cold; h2, lung disease; or h3, heartburn. Intuitively only h1 seems compelling. Bayes’s rule explains why. The likelihood favors h1 and h2 over h3: only colds and lung disease cause coughing and thus elevate the probability of the data above baseline. The prior, in contrast, favors h1 and h3 over h2: Colds and heartburn are much more common than lung disease. Bayes’s rule weighs hypotheses according to the product of priors and likelihoods and so yields only explanations like h1 that score highly on both terms.
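The coughing example can be worked through numerically. The sketch below is only an illustration: the article gives no numbers, so the prior and likelihood values are assumptions chosen to match its qualitative description (colds and heartburn common, lung disease rare; colds and lung disease make coughing likely, heartburn does not). The function name `posterior` is likewise made up for this example.

```python
def posterior(priors, likelihoods):
    """Apply Bayes's rule over a finite hypothesis space.

    priors:      dict mapping hypothesis -> P(h)
    likelihoods: dict mapping hypothesis -> P(d | h)
    Returns a dict mapping hypothesis -> P(h | d).
    """
    # Numerator of Bayes's rule for each hypothesis: P(d | h) * P(h)
    unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
    # Denominator: the same product summed over all hypotheses in H
    evidence = sum(unnormalized.values())
    return {h: score / evidence for h, score in unnormalized.items()}


# Assumed, illustrative values (not taken from the article).
priors = {"cold": 0.45, "lung disease": 0.05, "heartburn": 0.50}
# P(coughing | h); 0.1 stands in for the baseline rate of coughing.
likelihoods = {"cold": 0.8, "lung disease": 0.8, "heartburn": 0.1}

post = posterior(priors, likelihoods)
for h, p in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{h}: {p:.3f}")
```

With these made-up numbers the cold ends up with a posterior of about 0.80, while lung disease (penalized by its prior) and heartburn (penalized by its likelihood) each end up near 0.1. Only the hypothesis that scores well on both terms survives.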
The same principles can explain how people learn from sparse data. In concept learning, the data might correspond to several example objects (Fig. 1) and the hypotheses to possible extensions of the concept. Why, given three examples of different kinds of horses, would a child generalize the word “horse” to all and only horses (h1)? Why not h2, “all horses except Clydesdales”; h3, “all animals”; or any other rule consistent with the data? Likelihoods favor the more specific patterns, h1 and h2; it would be a highly suspicious coincidence to draw three random examples that all fall within the smaller sets h1 or h2 if they were actually drawn from the much larger h3. The prior favors h1 and h3, because as more coherent and distinctive categories, they are more likely to be the referents of common words in language. Only h1 scores highly on both terms. Likewise, in causal learning, the data could be co-occurrences between events; the hypotheses, possible causal relations linking the events. Likelihoods favor causal links that make the co-occurrence more probable, whereas priors favor links that fit with our background knowledge of what kinds of events are likely to cause which others; for example, a disease (e.g., cold) is more likely to cause a symptom (e.g., coughing) than the other way around.
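The horse example can be sketched the same way. Everything concrete here is an assumption made for illustration: the toy animal lists, the prior values, and the choice to model "drawn randomly" as independent uniform draws (with replacement) from a hypothesis's extension, so that the likelihood of n consistent examples is (1/size)^n. The article describes the idea only qualitatively.

```python
def likelihood(examples, extension):
    """P(examples | h), assuming each example is an independent uniform draw from h."""
    if not all(x in extension for x in examples):
        return 0.0  # hypothesis is inconsistent with the data
    return (1.0 / len(extension)) ** len(examples)


def posterior(priors, extensions, examples):
    """Bayes's rule over a finite set of candidate word meanings."""
    scores = {h: priors[h] * likelihood(examples, extensions[h]) for h in priors}
    total = sum(scores.values())
    return {h: s / total for h, s in scores.items()}


# Toy domain (hypothetical): 5 kinds of horse plus 15 other animals.
horses = {"arabian", "mustang", "shetland", "thoroughbred", "clydesdale"}
others = {"cow", "dog", "cat", "sheep", "pig", "goat", "hen", "duck",
          "fox", "owl", "frog", "bear", "deer", "seal", "crab"}

extensions = {
    "h1: all horses": horses,
    "h2: horses except Clydesdales": horses - {"clydesdale"},
    "h3: all animals": horses | others,
}
# Assumed priors: coherent, distinctive categories (h1, h3) are favored over h2.
priors = {
    "h1: all horses": 0.45,
    "h2: horses except Clydesdales": 0.10,
    "h3: all animals": 0.45,
}

one_example = posterior(priors, extensions, ["arabian"])
three_examples = posterior(priors, extensions, ["arabian", "mustang", "shetland"])
for h in priors:
    print(f"{h}: {one_example[h]:.3f} -> {three_examples[h]:.3f}")
```

Under these assumptions h1 gets the highest posterior. The likelihood rules out h3 more and more firmly as examples accumulate (the suspicious coincidence), and the prior holds down h2 even though its likelihood is slightly higher than h1's.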
...the Bayesian approach lets us move beyond classic either-or dichotomies that have long shaped and limited debates in cognitive science: “empiricism versus nativism,” “domain-general versus domain-specific,” “logic versus probability,” “symbols versus statistics.” Instead we can ask harder questions of reverse-engineering, with answers potentially rich enough to help us build more humanlike AI systems.