Ross, Choi and Purves (PDF here) offer a fascinating study of how vocal tract anatomy and the sounds of spoken language might explain why humans, across cultures, have created music using pitch intervals that divide octaves into the 12 tones of the chromatic scale:
Throughout history and across cultures, humans have created music using pitch intervals that divide octaves into the 12 tones of the chromatic scale. Why these specific intervals in music are preferred, however, is not known. In the present study, we analyzed a database of individually spoken English vowel phones to examine the hypothesis that musical intervals arise from the relationships of the formants in speech spectra that determine the perceptions of distinct vowels. Expressed as ratios, the frequency relationships of the first two formants in vowel phones represent all 12 intervals of the chromatic scale. Were the formants to fall outside the ranges found in the human voice, their relationships would generate either a less complete or a more dilute representation of these specific intervals. These results imply that human preference for the intervals of the chromatic scale arises from experience with the way speech formants modulate laryngeal harmonics to create different phonemes.
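To make the idea of "formant relationships expressed as ratios" concrete, here is a minimal Python sketch that maps a pair of frequencies onto the nearest chromatic interval. It is an illustration only, not the authors' procedure: the function name `nearest_chromatic_interval` is hypothetical, the example frequencies are invented, and equal-tempered semitones are used as a simple stand-in for whatever interval definitions the study relied on.

```python
import math

# Names of the 12 chromatic intervals, indexed by semitone count modulo 12.
INTERVAL_NAMES = [
    "unison/octave", "minor second", "major second", "minor third",
    "major third", "perfect fourth", "tritone", "perfect fifth",
    "minor sixth", "major sixth", "minor seventh", "major seventh",
]

def nearest_chromatic_interval(f1_hz, f2_hz):
    """Return (interval name, deviation in cents) for the ratio f2/f1.

    Assumption: equal-tempered semitones, with the ratio folded into
    a single octave by taking the semitone count modulo 12.
    """
    semitones = 12.0 * math.log2(f2_hz / f1_hz)
    nearest = round(semitones)
    deviation_cents = 100.0 * (semitones - nearest)
    return INTERVAL_NAMES[nearest % 12], deviation_cents

# Hypothetical formant pair with a 3:2 ratio (values invented for illustration).
name, cents = nearest_chromatic_interval(300.0, 450.0)
print(name, round(cents, 1))
```

A 3:2 ratio lands on the perfect fifth, about two cents away from the equal-tempered value, which shows how a simple formant ratio can line up with a familiar musical interval.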
The periodicity in speech sound stimuli is generated primarily by the repeating peaks of energy in the vocal air stream produced by oscillations of the vocal folds in the larynx. The intensity carried by the harmonic series produced in this way is altered, however, by the resonance frequencies of the rest of the vocal tract, which change dynamically in response to neurally controlled movements of the soft palate, tongue, lips and other articulators (see figure). These variable vocal tract resonances, called formants, modulate the harmonic series generated by the laryngeal oscillations by suppressing some harmonics more than others. When coupled with unvoiced speech sounds (consonants), this modulation by the formants creates the different voiced speech sounds that give rise to the semantic content in all human languages. With respect to vowel phones, only the first two formants have a major influence on the vowel perceived: artificially removing them from vowel phones makes vowel phonemes largely indistinguishable, whereas removing the higher formants has little effect on the perception of speech sounds. Indeed, the first and second formants of vowel sounds of all languages fall within well defined frequency ranges. The resonances of the first two formants are typically between approximately 200–1,000 Hz and approximately 800–3,000 Hz, respectively, their central values approximating the odd harmonics of the resonances of a tube approximately 17 cm in length open at one end, the usual physical model of the adult vocal tract in a relaxed state.
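The tube model mentioned above can be checked with a few lines of Python. This is a rough sketch of the standard quarter-wave resonator formula, f_n = (2n - 1) c / (4L), for a tube closed at one end and open at the other. The speed of sound (350 m/s, roughly that of warm, humid air) is an assumption not given in the text, and the function name `tube_resonances` is hypothetical.

```python
# Resonances of a uniform tube closed at one end (glottis) and open at
# the other (lips): only odd multiples of c / (4L) are supported.
SPEED_OF_SOUND_M_S = 350.0  # assumed value, not taken from the source
TUBE_LENGTH_M = 0.17        # "approximately 17 cm" per the text

def tube_resonances(length_m, count, speed=SPEED_OF_SOUND_M_S):
    """Return the first `count` resonance frequencies in Hz."""
    lowest = speed / (4.0 * length_m)
    return [(2 * n - 1) * lowest for n in range(1, count + 1)]

resonances = tube_resonances(TUBE_LENGTH_M, 3)
print([round(f) for f in resonances])
```

Under these assumptions the first two resonances come out near 515 Hz and 1,544 Hz, which sit inside the quoted ranges of roughly 200–1,000 Hz and 800–3,000 Hz for the first and second formants.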
Figure - Ranges of the peak harmonic in the first two formants (F1 and F2) for eight American English vowels uttered as single words in an emotionally neutral manner. (A) Diagram of the human larynx and vocal tract; see Introduction for explanation. (B) Distribution of the peak harmonics selected as the index for the first and second formant for the five male participants. (C) Distribution for the five female participants. The somewhat smaller harmonic ranges for females are due to the higher average fundamental frequency of female speech. The mean fundamental frequency for male speakers was 109 Hz (SD = 10) and for female speakers 171 Hz (SD = 20).
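The caption's remark about smaller harmonic ranges for female speakers follows from simple arithmetic: a higher fundamental frequency places fewer harmonics inside the same frequency band. The Python sketch below illustrates this using the mean fundamentals from the caption and the approximate formant ranges quoted earlier. It does not reproduce the figure's measured distributions, and the function name `harmonic_range` is hypothetical.

```python
import math

# Approximate formant frequency ranges (Hz) quoted in the text above.
FORMANT_RANGES_HZ = {"F1": (200.0, 1000.0), "F2": (800.0, 3000.0)}

# Mean fundamental frequencies (Hz) from the figure caption.
MEAN_F0_HZ = {"male": 109.0, "female": 171.0}

def harmonic_range(f0_hz, low_hz, high_hz):
    """Return (lowest, highest) harmonic numbers of f0 inside [low, high]."""
    lowest = math.ceil(low_hz / f0_hz)
    highest = math.floor(high_hz / f0_hz)
    return lowest, highest

for speaker, f0 in MEAN_F0_HZ.items():
    for formant, (low, high) in FORMANT_RANGES_HZ.items():
        lo_h, hi_h = harmonic_range(f0, low, high)
        print(f"{speaker} {formant}: harmonics {lo_h} to {hi_h}")
```

With these inputs the male fundamental of 109 Hz fits harmonics 2 through 9 into the first-formant band, while the female fundamental of 171 Hz fits only harmonics 2 through 5, consistent with the narrower ranges described in the caption.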