Musical mind control: Human speech takes on characteristics of background music
Department of Linguistics, University of Canterbury
20 Kirkwood Avenue, Upper Riccarton
Christchurch, NZ, 8041
Popular version of paper 1aNS4, “Musical mind control: Acoustic convergence to background music in speech production.”
Presented Monday morning, November 28, 2016
172nd ASA Meeting, Honolulu
People often adjust their speech to resemble that of their conversation partners, a phenomenon known as speech convergence. Broadly defined, convergence describes automatic synchronization to some external source, much like running to the beat of music playing at the gym without intentionally choosing to do so. Across a variety of studies, a general trend has emerged in which people automatically synchronize to various aspects of their environment 1,2,3. With specific regard to language use, convergence effects have also been observed in many linguistic domains such as sentence-formation4, word-formation 5, and vowel production6 (where differences in vowel production are well associated with perceived accentedness 7,8). This prevalence in linguistics raises many interesting questions about the extent to which speakers converge. This research uses a speech-in-noise paradigm to explore whether speakers also converge to non-linguistic signals in the environment. Specifically, will a speaker’s rhythm, pitch, or intensity (which is closely related to loudness) be influenced by fluctuations in background music, such that the speech echoes specific characteristics of that music? For example, if the tempo of background music slows down, will listeners unconsciously decrease their speech rate?
In this experiment, participants read passages aloud while hearing music through headphones. Background music was composed by the experimenter to be relatively stable with regard to pitch, tempo/rhythm, and intensity, so we could manipulate and test only one of these dimensions at a time within each test-condition. We imposed these manipulations gradually and consistently toward a target, as can be seen in Figure 1, and then returned them in the same way to the level at which they started. Between manipulated sessions, participants heard music with no experimental changes. (Examples of what participants heard in headphones are available as sound-files 1 and 2.)
Fig. 1 Using software designed for digital signal processing (analyzing and altering sound), manipulations were applied in a linear fashion (in a straight line) toward a target – this can be seen above as the blue line, which first rises and then falls. NOTE: After manipulations reach their target (the target is seen above as a dashed, vertical red line), the degree of manipulation would then return to the level at which it started in a similar linear fashion. Graphic captured while using Praat 9 to increase and then decrease the perceived loudness of the background music.
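The rise-and-return shape described in the caption can be sketched in a few lines of Python. This is only an illustration of a linear ramp toward a target followed by a linear return, not the Praat procedure used in the study; the function name `triangular_ramp`, the step count, and the target value are hypothetical.

```python
def triangular_ramp(n_steps, peak_index, target, baseline=0.0):
    """Return n_steps manipulation levels that rise linearly from
    baseline to target at peak_index, then fall linearly back to
    baseline at the final step. All names and values are illustrative."""
    if not 0 < peak_index < n_steps - 1:
        raise ValueError("peak_index must lie strictly inside the ramp")
    levels = []
    for i in range(n_steps):
        if i <= peak_index:
            # Rising segment: fraction of the way to the target.
            frac = i / peak_index
        else:
            # Falling segment: fraction of the way left before baseline.
            frac = (n_steps - 1 - i) / (n_steps - 1 - peak_index)
        levels.append(baseline + frac * (target - baseline))
    return levels


# Hypothetical example: five steps, peaking at a 6-unit change in the middle.
ramp = triangular_ramp(n_steps=5, peak_index=2, target=6.0)
print(ramp)
```

In an actual stimulus the levels would be applied to one dimension at a time (pitch, tempo, or intensity), with the unit depending on the dimension.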
Data from 15 native speakers of New Zealand English were analyzed using statistical tests that allow effects to vary somewhat for each participant. We observed significant convergence in both the Pitch and Intensity conditions; analysis of the Tempo condition has not yet been conducted. Interestingly, these effects appear to differ systematically based on a person’s previous musical training. While non-musicians demonstrate the predicted effect and follow the manipulations, musicians appear to invert the effect and reliably alter aspects of their pitch and intensity in the direction opposite to the manipulation (see Figure 2). Sociolinguistic research indicates that under certain conditions speakers will emphasize characteristics of their speech to distinguish themselves socially from conversation partners or groups, as opposed to converging with them6. It seems plausible then that, given a relatively heightened ability to recognize low-level variations of sound, musicians may on some cognitive level be more aware of the variation in their sound environment, and as a result similarly resist the more typical effect. However, more work is required to better understand this phenomenon.
Fig. 2 The above plots measure pitch on the y-axis (up and down on the left edge), and indicate the portions of background music that have been manipulated on the x-axis (across the bottom). The blue lines show that speakers generally lower their pitch as an un-manipulated condition progresses. However, the red lines show that when global pitch is lowered during a test-condition, such lowering is relatively more dramatic for non-musicians (left plot) and that the effect is reversed by those with musical training (right plot). NOTE: A follow-up model further accounts for the relatedness of Pitch and Intensity and shows much the same effect.
This work indicates that speakers are not only influenced by human speech partners in production, but also, to some degree, by noise within the immediate speech environment, which suggests that environmental noise may constantly be influencing certain aspects of our speech production in very specific and predictable ways. Human listeners are rather talented when it comes to recognizing subtle cues in speech 10, especially compared to computers and algorithms that can’t yet match this ability. Some language scientists argue these changes in speech occur to make understanding easier for those listening 11. That is why work like this is likely to resonate in both academia and the private sector, as a better understanding of how speech will change in different environments contributes to the development of more effective aids for the hearing impaired, as well as improvements to many devices used in global communications.
Sound-file 1. An example of what participants heard as a control condition (no experimental manipulation) in between test-conditions.
Sound-file 2. An example of what participants heard as a test condition (Pitch manipulation, which drops 200 cents/one full step).
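For readers unfamiliar with cents, the size of a 200-cent drop can be worked out directly: one octave is 1200 cents, so a shift of c cents multiplies frequency by 2^(c/1200). The short sketch below applies this standard relation; the 440 Hz starting frequency is an assumed example and is not taken from the study.

```python
def cents_to_ratio(cents):
    """Frequency ratio for a pitch shift in cents (1200 cents = one octave)."""
    return 2.0 ** (cents / 1200.0)


def shift_frequency(freq_hz, cents):
    """Frequency after shifting freq_hz by the given number of cents."""
    return freq_hz * cents_to_ratio(cents)


# Assumed example: a 440 Hz tone lowered by 200 cents.
print(round(cents_to_ratio(-200), 4))
print(round(shift_frequency(440.0, -200), 2))
```

A 200-cent drop therefore lowers every frequency by roughly 11 percent.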
1. Hill, A. R., Adams, J. M., Parker, B. E., & Rochester, D. F. (1988). Short-term entrainment of ventilation to the walking cycle in humans. Journal of Applied Physiology, 65(2), 570-578.
2. Will, U., & Berg, E. (2007). Brain wave synchronization and entrainment to periodic acoustic stimuli. Neuroscience Letters, 424(1), 55-60.
3. McClintock, M. K. (1971). Menstrual synchrony and suppression. Nature, 229, 244-245.
4. Branigan, H. P., Pickering, M. J., McLean, J. F., & Cleland, A. A. (2007). Syntactic alignment and participant role in dialogue. Cognition, 104(2), 163-197.
5. Beckner, C., Rácz, P., Hay, J., Brandstetter, J., & Bartneck, C. (2015). Participants Conform to Humans but Not to Humanoid Robots in an English Past Tense Formation Task. Journal of Language and Social Psychology, 0261927X15584682. Retrieved from: http://jls.sagepub.com.ezproxy.canterbury.ac.nz/content/early/2015/05/06/0261927X15584682.
6. Babel, M. (2012). Evidence for phonetic and social selectivity in spontaneous phonetic imitation. Journal of Phonetics, 40(1), 177-189.
7. Major, R. C. (1987). English voiceless stop production by speakers of Brazilian Portuguese. Journal of Phonetics, 15, 197—
8. Rekart, D. M. (1985). Evaluation of foreign accent using synthetic speech. Ph.D. dissertation, Louisiana State University.
9. Boersma, P., & Weenink, D. (2014). Praat: Doing phonetics by computer (Version 5.4.04) [Computer program]. Retrieved
10. Hay, J., Podlubny, R., Drager, K., & McAuliffe, M. (under review). Car-talk: Location-specific speech production and
11. Lane, H., & Tranel, B. (1971). The Lombard sign and the role of hearing in speech. Journal of Speech, Language, and Hearing Research, 14(4), 677-709.
University of Utah
201 Presidents Cir
Salt Lake City, UT
Popular version of paper 4pMU4 “How well can a human mimic the sound of a trumpet?”
Presented Thursday May 26, 2:00 pm, Solitude room
171st ASA Meeting Salt Lake City
Man-made musical instruments are sometimes designed or played to mimic the human voice, and likewise vocalists try to mimic the sounds of man-made instruments. If flutes and strings accompany a singer, a “brassy” voice is likely to produce mismatches in timbre (tone color or sound quality). Likewise, a “fluty” voice may not be ideal for a brass accompaniment. Thus, singers are looking for ways to color their voice with variable timbre.
Acoustically, brass instruments are close cousins of the human voice. It was discovered prehistorically that sending sound over long distances (to locate, be located, or warn of danger) is made easier when a vibrating sound source is connected to a horn. It is not known which came first – blowing hollow animal horns or sea shells with pursed and vibrating lips, or cupping the hands to extend the airway for vocalization. In both cases, however, airflow-induced vibration of soft tissue (vocal folds or lips) is enhanced by a tube that resonates the frequencies and radiates them (sends them out) to the listener.
Around 1840, theatrical singing by males went through a revolution. Men wanted to portray more masculinity and raw emotion with vocal timbre. “Do di Petto”, which is Italian for “C in chest voice”, was introduced by operatic tenor Gilbert Duprez in 1837 and soon became a phenomenon. A heroic voice in opera took on more of a brass-like quality than a flute-like quality. Similarly, in the early to mid-twentieth century (1920-1950), female singers were driven by the desire to sing with a richer timbre, one that matched brass and percussion instruments rather than strings or flutes. Ethel Merman became an icon in this revolution. This led to the theatre belt sound produced by females today, which has much in common with a trumpet sound.
Fig.1. Mouth opening to head-size ratio for Ethel Merman and corresponding frequency spectrum for the sound “aw” with a fundamental frequency fo (pitch) at 547 Hz and a second harmonic frequency 2 fo at 1094 Hz.
The length of an uncoiled trumpet horn is about 2 meters (including the full length of the valves), whereas the length of a human airway above the glottis (the space between the vocal cords) is only about 17 cm (Fig. 2). The vibrating lips and the vibrating vocal cords can produce similar pitch ranges, but the resonators have vastly different natural frequencies due to the more than 10:1 ratio in airway length. So, we ask, how can the voice produce a brass-like timbre in a “call” or “belt”?
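To give a feel for how much the length difference matters, the sketch below uses the textbook idealization of a uniform tube closed at one end and open at the other, whose resonances fall at odd multiples of c/(4L). This is an assumption made for illustration only: a real trumpet (with its bell and mouthpiece) and a real vocal tract are not uniform tubes, and the speed of sound of 350 m/s is an assumed round value for warm air.

```python
SPEED_OF_SOUND_M_S = 350.0  # assumed value for warm, humid air


def quarter_wave_resonance(length_m, n=1, c=SPEED_OF_SOUND_M_S):
    """n-th resonance (Hz) of an idealized uniform tube closed at one end:
    f_n = (2n - 1) * c / (4 * L)."""
    return (2 * n - 1) * c / (4.0 * length_m)


vocal_tract_m = 0.17  # airway length above the glottis, from the text
trumpet_m = 2.0       # uncoiled trumpet length, from the text

print(round(quarter_wave_resonance(vocal_tract_m), 1))  # first airway resonance
print(round(quarter_wave_resonance(trumpet_m), 2))      # same formula, 2 m tube
print(round(trumpet_m / vocal_tract_m, 2))              # length ratio
```

Under this idealization the lowest resonance of the short tube sits more than ten times higher than that of the long one, which is the mismatch the question above refers to.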
One structural similarity between the human instrument and the brass instrument is the shape of the airway directly above the glottis, a short and narrow tube formed by the epiglottis. It corresponds to the mouthpiece of brass instruments. This mouthpiece plays a major role in shaping the sound quality. A second structural similarity is created when a singer uses a wide mouth opening, simulating the bell of the trumpet. With these two structural similarities, the spectrum of tones produced by the two instruments can be quite similar, despite the huge difference in the overall length of the instrument.
Fig 2. Human airway and trumpet (not drawn to scale).
Acoustically, the call or belt-like quality is achieved by strengthening the second harmonic frequency 2fo in relation to the fundamental frequency fo. In the human instrument, this can be done by choosing a bright vowel like /æ/ that puts an airway resonance near the second harmonic. The fundamental frequency will then have significantly less energy than the second harmonic.
Why does that resonance adjustment produce a brass-like timbre? To understand this, we first recognize that, in brass-instrument playing, the tones produced by the lips are entrained (synchronized) to the resonance frequencies of the tube. Thus, the tones heard from the trumpet are the resonance tones. These resonance tones form a harmonic series, but the fundamental tone in this series is missing. It is known as the pedal tone. Thus, by design, the trumpet has a strong second harmonic frequency with a missing fundamental frequency.
Perceptually, an imaginary fundamental frequency may be produced by our auditory system when a series of higher harmonics (equally spaced overtones) is heard. Thus, the fundamental (pedal tone) may be perceptually present to some degree, but the highly dominant second harmonic determines the note that is played.
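The idea of an implied fundamental can be illustrated numerically. If the harmonics that are present are exact integer multiples of a common spacing, that spacing is their greatest common divisor. The sketch below is a simplification under that assumption; real pitch perception is more complex than a divisor calculation, and the harmonic values are derived for illustration from the 547 Hz fundamental quoted in Fig. 1 rather than measured.

```python
from functools import reduce
from math import gcd


def implied_fundamental(harmonics_hz):
    """Largest frequency (Hz) of which every listed harmonic is an integer
    multiple. Assumes idealized, integer-valued harmonic frequencies."""
    return reduce(gcd, harmonics_hz)


# Second, third and fourth harmonics of 547 Hz, with the fundamental absent.
present = [2 * 547, 3 * 547, 4 * 547]
print(present)
print(implied_fundamental(present))
```

Even though 547 Hz is not in the list, it is the only spacing consistent with all three components.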
In belting and loud calling, the fundamental is not eliminated, but suppressed relative to the second harmonic. The timbre of belt is related to the timbre of a trumpet due to this lack of energy in the fundamental frequency. There is a limit, however, in how high the pitch can be raised with this timbre. As pitch goes up, the first resonance of the airway has to be raised higher and higher to maintain the strong second harmonic. This requires ever more mouth opening, literally creating a trumpet bell (Fig. 3).
Fig 3. Mouth opening to head-size ratio for Idina Menzel and corresponding frequency spectrum for a belt sound with a fundamental frequency (pitch) at 545 Hz.
Note the strong second harmonic frequency 2fo in the spectrum of frequencies produced by Idina Menzel, a current musical theatre singer.
One final comment about the perceived pitch of a belt sound is in order. Pitch perception is not only related to the fundamental frequency, but the entire spectrum of frequencies. The strong second harmonic influences pitch perception. The belt timbre on a D5 (587 Hz) results in a higher pitch perception for most people than a classical soprano sound on the same note. This adds to the excitement of the sound.
14431 Ventura Blvd #200
Sherman Oaks, CA 91423
Popular version of paper 2aMU4
Presented Tuesday morning, May 24, 2016
The human vocal folds can vibrate in a number of ways, each creating unique sounds used in singing. The two most common vibrational patterns are called “chest voice” and “head voice”, with chest voice sounding like speaking or yelling and head voice sounding more flute-like or like screaming on high pitches. In the operatic singing tradition, men sing primarily in chest voice while women sing primarily in head voice. However, in rock singing, men often emit high screams using their head voice while female rock singers use their chest voice almost exclusively for high notes.
Differences in vocal fold vibrational pattern are only part of the story, though, since the shaping of the throat, mouth and nose (the vocal tract) plays a large part in the perception of the final sound. That means that head voice can be made to “sound” like chest voice on high screams using vocal tract shaping, and only the most experienced listener can determine whether the vocal register used was chest or head voice.
Using spectrographic analysis, differences and similarities between operatic and rock singers can be seen. One similarity between the two is the heightened output of a resonance commonly called “ring”. This resonance, when amplified by vocal tract shaping, creates a piercing sound that’s perceived by the listener as extremely loud. The amplified ring harmonics can be seen in the 3,000 Hz band in both the male opera sample and in rock singing samples:
MALE OPERA – HIGH B (B4…494 Hz) CHEST VOICE
MALE ROCK – HIGH E (E5…659 Hz) CHEST VOICE
MALE ROCK – HIGH G (G5…784 Hz) HEAD VOICE
Though each of these three male singers exhibits a unique frequency signature, each of them, whether singing in chest or head voice, uses the amplified ring strategy in the 3,000 Hz range to amplify his sound and create excitement.
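The frequencies quoted for the three sung notes above follow from the usual equal-tempered tuning relation, in which each semitone multiplies frequency by 2^(1/12). The sketch below assumes the common A4 = 440 Hz reference and the standard MIDI note-numbering convention; the helper name `note_to_freq` is illustrative, and it handles only single-digit octaves and sharps.

```python
# Semitone offsets of each pitch class above C within an octave.
NOTE_OFFSETS = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
                "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}


def note_to_freq(name, a4_hz=440.0):
    """Equal-tempered frequency (Hz) of a note name such as 'B4' or 'G5',
    assuming A4 = a4_hz and MIDI numbering (A4 = note 69)."""
    pitch_class, octave = name[:-1], int(name[-1])
    midi_number = 12 * (octave + 1) + NOTE_OFFSETS[pitch_class]
    return a4_hz * 2.0 ** ((midi_number - 69) / 12.0)


for note in ("B4", "E5", "G5"):
    print(note, round(note_to_freq(note)))
```

The rounded values agree with the 494 Hz, 659 Hz and 784 Hz figures given for the opera and rock samples.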
Popular version of paper 2aMU5, “Listener Ratings of Singer Expressivity in Musical Performance.”
Presented Tuesday, May 24, 2016, 10:20-10:35 am, Salon B/C, ASA meeting, Salt Lake City
Vocal fry is the lowest register of the human voice. Its distinct sound is characterized by a low rumble interspersed with uneven popping and crackling. The use of fry as a vocal mannerism is becoming increasingly common in American speech, fueling discussion about the implications of its use and how listeners perceive the speaker. Previous studies have suggested that listeners find vocal fry to be generally unpleasant in women’s speech, but associate it with positive characteristics in men’s speech.
As it has become more prevalent, fry has perhaps not surprisingly found its place in many commercial song styles as well. Many singers are implementing fry as a stylistic device at the onset or offset of a sung tone. This can be found very readily in popular musical styles, presumably to impact and amplify the emotion that the performer is attempting to convey.
Researchers at the University of Texas at San Antonio conducted a survey to analyze whether listeners’ ratings of a singer’s expressivity in musical samples in two contemporary commercial styles (pop and country) were affected by the presence of vocal fry, and to see if there was a difference in listener ratings according to the singer’s gender. A male and a female singer recorded musical samples for the study in a noise reduction booth. As can be seen in the table below, the singers were asked to sing most of the musical selections twice, once using vocal fry at phrase onsets, and once without fry, while maintaining the same vocal quality, tempo, dynamics, and stylization. Some samples were presented more than one time in the survey portion of the study to test listener reliability.
(Hit Me) Baby One More Time
If I Die Young
With and Without Fry
With and Without Fry
Thinking Out Loud
Without Fry Only
Amarillo By Morning
With and Without Fry
With and Without Fry
Across all listener ratings of all the songs, the recordings that included vocal fry were rated as only slightly more expressive than the recordings that contained no vocal fry. Comparing the male and female singers revealed some differences. The listeners rated the samples where the female singer used vocal fry higher (i.e., more expressive) than those without fry, which was surprising considering the negative association with women using vocal fry in speech. Conversely, the listeners rated the male samples without fry as more expressive than those with fry. Part of this preference pattern may also reflect the individual singers: the male singer was much more experienced with pop styles than the female singer, who is primarily classically trained. The overall expressivity ratings for the male singer were higher than those of the female singer by a statistically significant margin.
There were also trends in listener ratings across the age groups of participants. Younger listeners widened both preference gaps: the preference for the female singer’s performances with fry over those without, and the preference for the male singer’s performances without fry over those with. Presumably they are more attuned to the stylistic norms of current pop singers. However, this could also imply a gender bias in younger listeners. The older listener groups rated the mean expressivity of the performers lower than the younger listener groups did. Since most of the songs that we sampled are fairly recent in production, this may indicate a generational trend in preference. Perhaps listeners rate the style of vocal production that is most similar to what they listened to during their young adult years as the most expressive style of singing. These findings have raised many questions for further studies about vocal fry in pop and country music.
Anderson, R.C., Klofstad, C.A., Mayew, W.J., Venkatachalam, M. “Vocal Fry May Undermine the Success of Young Women in the Labor Market.” PLoS ONE, 2014. 9(5): e97506. doi:10.1371/journal.pone.0097506.
Yuasa, I. P. “Creaky Voice: A New Feminine Voice Quality for Young Urban-Oriented Upwardly Mobile American Women.” American Speech, 2010. 85(3): 315-337.
Effects of language and music experience on speech perception
T. Christina Zhao
Patricia K. Kuhl
Institute for Learning & Brain Sciences
University of Washington, BOX 357988
Seattle, WA, 98195
Popular version of paper 4aSC2, “Top-down linguistic categories dominate over bottom-up acoustics in lexical tone processing”
Presented Thursday morning, May 21st, 2015, 8:00 AM, Ballroom 2
169th ASA Meeting, Pittsburgh
Speech perception involves constant interplay between top-down and bottom-up processing. For example, to process phonemes (e.g. to distinguish ‘b’ from ‘p’), the listener must accurately process the acoustical information in the speech signals (i.e. bottom-up strategy) and assign these sounds efficiently to a category (i.e. top-down strategy). Listeners’ performance in speech perception tasks is influenced by their experience in either processing strategy. Here, we use lexical tone processing as a window to examine how extensive experience in both strategies influences speech perception.
Lexical tones are contrastive pitch contour patterns at the word level. That is, a small difference in the pitch contour can result in a different word meaning. Native speakers of a tonal language thus have extensive experience in using the top-down strategy to assign highly variable pitch contours to lexical tone categories. This top-down influence is reflected by the reduced sensitivity to acoustic differences within a phonemic category compared to across categories (Halle, Chang, & Best, 2004). On the other hand, individuals with extensive music training early in life exhibit enhanced sensitivities to pitch differences not only in music, but also in speech, reflecting stronger bottom-up influence. Such bottom-up influence is reflected by the enhanced sensitivity in detecting differences between lexical tones when the listeners are non-tonal language speakers (Wong, Skoe, Russo, Dees, & Kraus, 2007).
How does extensive experience in both strategies influence lexical tone processing? To address this question, native Mandarin speakers with extensive music training (N=17) completed a music pitch discrimination task and a lexical tone discrimination task. We compared their performance with that of individuals with extensive experience in only one of the processing strategies (i.e. Mandarin nonmusicians (N=20) and English musicians (N=20), data from Zhao & Kuhl (2015)).
Despite the Mandarin musicians’ enhanced performance in the music pitch discrimination task, their performance in the lexical tone discrimination task is similar to that of the Mandarin nonmusicians and different from that of the English musicians (Fig. 1, ‘Sensitivity across lexical tone continuum by group’). That is, they exhibited reduced sensitivities within phonemic categories (i.e. on either end of the line) compared to across categories (i.e. the middle of the line), and their overall performance is lower than that of the English musicians. This result strongly suggests a dominant effect of the top-down influence in processing lexical tone. Yet, further analyses revealed that Mandarin musicians and Mandarin nonmusicians may still be relying on different underlying mechanisms in the lexical tone discrimination task. In the Mandarin musicians, music pitch discrimination scores are correlated with lexical tone discrimination scores, suggesting a contribution of the bottom-up strategy to their lexical tone discrimination performance (Fig. 2, ‘Music pitch and lexical tone discrimination’, purple). This relation is similar to that in the English musicians (Fig. 2, peach) but very different from that in the Mandarin nonmusicians (Fig. 2, yellow). Specifically, for Mandarin nonmusicians, the music pitch discrimination scores do not correlate with the lexical tone discrimination scores, suggesting independent processes.
Halle, P. A., Chang, Y. C., & Best, C. T. (2004). Identification and discrimination of Mandarin Chinese tones by Mandarin Chinese vs. French listeners. Journal of Phonetics, 32(3), 395-421. doi: 10.1016/s0095-4470(03)00016-0
Wong, P. C. M., Skoe, E., Russo, N. M., Dees, T., & Kraus, N. (2007). Musical experience shapes human brainstem encoding of linguistic pitch patterns. Nat. Neurosci., 10(4), 420-422. doi: 10.1038/nn1872
Zhao, T. C., & Kuhl, P. K. (2015). Effect of musical experience on learning lexical tone categories. The Journal of the Acoustical Society of America, 137(3), 1452-1463. doi: 10.1121/1.4913457