Monday, May 12, 2008

Your lips in my brain...

The title of the Kriegstein et al. article is: "Simulation of talking faces in the human brain improves auditory speech recognition." It turns out that observing a specific person talking for 2 min improves our subsequent auditory-only speech and speaker recognition for this person. This shows that, in auditory-only speech, the brain exploits previously encoded audiovisual correlations to optimize communication. The authors suggest that this optimization is based on speaker-specific audiovisual internal models, which are used to simulate a talking face. From the authors' introduction:

Human face-to-face communication works best when one can watch the speaker's face. This becomes obvious when someone speaks to us in a noisy environment, in which the auditory speech signal is degraded. Visual cues place constraints on what our brain expects to perceive in the auditory channel. These visual constraints improve the recognition rate for audiovisual speech, compared with auditory speech alone. Similarly, speaker identity recognition by voice can be improved by concurrent visual information. Accordingly, audiovisual models of human voice and face perception posit that there are interactions between auditory and visual processing streams.

Neurophysiological face processing studies indicate that distinct brain areas are specialized for processing time-varying information [facial movements; superior temporal sulcus (STS)] and time-constant information [face identity; fusiform face area (FFA)]. If speech and speaker recognition are neuroanatomically dissociable, and the improvement by audiovisual learning uses learned dependencies between audition and vision, the STS should underpin the improvement in speech recognition in both controls and prosopagnosics. A similar improvement in speaker recognition should be based on the FFA in controls but not in prosopagnosics. Such a neuroanatomical dissociation would imply that visual face processing areas are instrumental for improved auditory-only recognition.

The authors in fact obtained these results using functional magnetic resonance imaging (fMRI) to characterize the response properties of these two areas.