5aSC3 – Children’s perception of their own speech: A perceptual study of Polish /s, ʂ, ɕ/

Marzena Żygis – zygis@leibniz-zas.de
Leibniz Centre – General Linguistics & Humboldt University, Berlin, Germany
Marek Jaskuła – Marek.Jaskula@zut.edu.pl
Westpomeranian University of Technology, Szczecin, Poland

Laura L. Koenig – koenig@haskins.yale.edu
Adelphi University, Garden City, New York, United States; Haskins Laboratories, New Haven, CT, United States

Popular version of paper 5aSC3, “Do children understand adults better or themselves? A perceptual study of Polish /s, ʂ, ɕ/”
Presented Friday morning, November 9, 2018, 8:30–11:30 AM, Upper Pavilion

Typically-developing children usually pronounce most sounds of their native language correctly by about 5 years of age, but for some “difficult” sounds the learning process may take longer.  One set of difficult sounds is called the sibilants, an example of which is /s/.  Polish has a complex three-way sibilant contrast (see Figure 1).  One purpose of this study was to explore acquisitional patterns of this unusual sibilant set.

Further, most past studies assessed children’s accuracy in listening to adult speech. Here, we explored children’s perception of their own voices as well as that of an adult. It might be that children’s speech contains cues that adults do not notice, i.e., that children can hear distinctions in their own speech that adults cannot.

We collected data from 75 monolingual Polish-speaking children, ages 35–95 months. The experiment had three parts. First, children named pictures displayed on a computer screen. The words differed only in the sibilant consonant; all other sounds were held constant (see Figure 1).

Figure 1:  Word examples and attached audio (left, adult; right, child)

Next, children listened to the words produced by an unknown adult and chose the picture corresponding to what they heard. Finally, they listened to their own productions, as recorded in the first part, and chose the corresponding picture.  Our computer setup, “Linguistino”, allowed us to obtain the children’s response times via button-press, and also provided for recording their words in part 1 and playing them back, in randomized order, in part 3.
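The sketch below is not Linguistino itself, just a minimal Python illustration of the part-3 logic, assuming the soundfile and sounddevice packages and three hypothetical recordings from part 1: the child’s own words are shuffled, played back one at a time, and a keyboard response stands in for the button box while the response time is logged.

```python
# Minimal sketch of a randomized self-playback trial loop (not the actual
# "Linguistino" software). File names and the keyboard prompt are placeholders.
import random
import time

import soundfile as sf      # assumed dependency: pip install soundfile
import sounddevice as sd    # assumed dependency: pip install sounddevice

# Hypothetical recordings of the child's own productions from part 1.
recordings = ["word_s.wav", "word_sz.wav", "word_si.wav"]
random.shuffle(recordings)   # play back in randomized order (part 3)

results = []
for wav in recordings:
    data, fs = sf.read(wav)
    sd.play(data, fs)
    sd.wait()                                           # wait for playback to end
    t0 = time.perf_counter()
    choice = input("Which picture matches? [1/2/3]: ")  # stand-in for button box
    results.append((wav, choice, time.perf_counter() - t0))

for wav, choice, rt in results:
    print(f"{wav}: picture {choice}, response time {rt:.3f} s")
```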

The results show three things. First, not surprisingly, children’s labeling becomes both more accurate and faster as they get older. The accuracy data, averaged over sounds, are shown in Figure 2.

Figure 2:  Labeling accuracy

Further, some sibilants are harder to discriminate than others.  Figure 3 shows that, across ages, children are fastest for the sound /ɕ/, and slowest for /ʂ/, for both adult and child productions. (The reaction times for /ʂ/ and /s/ were not significantly different, however).

Figure 3: Reaction time for choosing the sibilant /ɕ/, /s/, or /ʂ/ as a function of age.

Finally, and not as expected, children’s labeling is significantly worse when they label their own productions vs. those of the adult. One might think that children have considerable experience listening to themselves, so that they would most accurately label their own speech, but this is not what we find.

These results lend insight into the specifics of acquiring Polish as a native language, and may also contribute to an understanding of sibilant perception more broadly. They also suggest that children’s internal representations of these speech sounds are not built around their own speech patterns.

3pID2 – Yanny or Laurel? Acoustic and non-acoustic cues that influence speech perception

Brian B. Monson, monson@illinois.edu

Speech and Hearing Science
University of Illinois at Urbana-Champaign
901 S Sixth St
Champaign, IL 61820
USA

Popular version of paper 3pID2, “Yanny or Laurel? Acoustic and non-acoustic cues that influence speech perception”
Presented Wednesday afternoon, November 7, 1:25-1:45pm, Crystal Ballroom FE
176th ASA Meeting, Victoria, Canada

“What do you hear?” This question, which divided the masses earlier this year, highlights the complex nature of speech perception and, more generally, each individual’s perception of the world. From the Yanny vs. Laurel phenomenon, it should be clear that what we perceive depends not only upon the physics of the world around us, but also upon our individual anatomy and individual life experiences. For speech, this means our perception can be influenced greatly by individual differences in auditory anatomy, physiology, and function, but also by factors that may at first seem unrelated to speech.

In our research, we are learning that one’s ability (or inability) to hear at extended high frequencies can have substantial influence over one’s performance in common speech perception tasks.  These findings are striking because it has long been presumed that extended high-frequency hearing is not terribly useful for speech perception.

Extended high-frequency hearing is defined as the ability to hear at frequencies beyond 8,000 Hz.  These are the highest audible frequencies for humans, are not typically assessed during standard hearing exams, and are believed to be of little consequence when it comes to speech.  Notably, sensitivity to these frequencies is the first thing to go in most forms of hearing loss, and age-related extended high-frequency hearing loss begins early in life for nearly everyone.  (This is why the infamous “mosquito tone” ringtones are audible to most teenagers but inaudible to most adults.)
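As a rough self-test (our illustration, not part of the study), a few lines of Python with NumPy and SciPy can generate pure tones inside and beyond the standard audiometric range; played at a modest volume, the 16 kHz tone is the one many adults can no longer hear.

```python
# Generate test tones at 1 kHz, 8 kHz, and 16 kHz (the last is an
# extended-high-frequency, "mosquito tone"-range signal).
import numpy as np
from scipy.io import wavfile

fs = 48000                        # sample rate high enough to represent 16 kHz
t = np.arange(0, 3.0, 1 / fs)     # 3 seconds of samples

for freq in (1000, 8000, 16000):
    tone = 0.2 * np.sin(2 * np.pi * freq * t)     # keep the level modest
    wavfile.write(f"tone_{freq}Hz.wav", fs, tone.astype(np.float32))
```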

Previous research from our lab and others has revealed that a surprising amount of speech information resides in the highest audible frequency range for humans, including information about the location of a speech source, the consonants and vowels being spoken, and the sex of the talker. Most recently, we ran two experiments assessing what happens when we simulate extended high-frequency hearing loss. We found that one’s ability to detect the head orientation of a talker is diminished without extended high frequencies. Why might that be important? Knowing a talker’s head orientation (i.e., “Is this person facing me or facing away from me?”) helps to answer the question of whether a spoken message is intended for you or for someone else.

Relatedly, and most surprisingly, we found that restricting access to the extended high frequencies diminishes one’s ability to overcome the “cocktail party” problem. That is, extended high-frequency hearing improves one’s ability to “tune in” to a specific talker of interest when many interfering talkers are speaking simultaneously, as at a cocktail party or other noisy gathering. Do you seem to have a harder time understanding speech at a cocktail party than you used to? Are you middle-aged? It may be that typical age-related hearing loss at extended high frequencies is contributing to this problem. Our hope is that assessment of hearing at extended high frequencies will become a routine part of audiological exams. This would allow us to determine the severity of extended high-frequency hearing loss in the population and whether techniques such as hearing aids could be used to address it.
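One simple way to approximate extended high-frequency hearing loss in such experiments is to low-pass filter the speech at 8 kHz, leaving only the standard audiometric range. The sketch below shows this idea with SciPy; the cutoff, filter order, and the assumption of a 16-bit PCM file named speech.wav (sampled well above 16 kHz) are illustrative choices, not the exact processing used in our studies.

```python
# Simulate extended high-frequency (EHF) hearing loss by removing energy
# above 8 kHz from a speech recording.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

fs, speech = wavfile.read("speech.wav")          # assumed 16-bit PCM input
speech = speech.astype(np.float64) / 32768.0     # scale to roughly [-1, 1]

sos = butter(8, 8000, btype="lowpass", fs=fs, output="sos")
no_ehf = sosfiltfilt(sos, speech, axis=0)        # zero-phase low-pass at 8 kHz

wavfile.write("speech_no_ehf.wav", fs, no_ehf.astype(np.float32))
```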


Figure 1. Spectrographic representation of the phrase “Oh, say, can you see by the dawn’s early light.” While the majority of energy in speech lies below about 6,000 Hz (dotted line), extended high-frequency (EHF) energy beyond 8,000 Hz is audible and assists with speech detection and comprehension.

5aSC1 – Understanding how we speak using computational models of the vocal tract

Connor Mayer – connomayer@ucla.edu
Department of Linguistics – University of California, Los Angeles

Ian Stavness – ian.stavness@usask.ca
Department of Computer Science – University of Saskatchewan

Bryan Gick – gick@mail.ubc.ca
Department of Linguistics – University of British Columbia; Haskins Labs

Popular version of poster 5aSC1, “A biomechanical model for infant speech and aerodigestive movements”
Presented Friday morning, November 9, 2018, 8:30-11:30 AM, Upper Pavilion
176th ASA Meeting and 2018 Acoustics Week in Canada, Victoria, Canada

Speaking is arguably the most complex voluntary movement behaviour in the natural world. Speech is also uniquely human, making it an extremely recent innovation in evolutionary history. How did our species develop such a complex and precise system of movements in so little time? And how can human infants learn to speak long before they can tie their shoes, and with no formal training?

Answering these questions requires a deep understanding of how the human body makes speech sounds. Researchers have used a variety of techniques to understand the movements we make with our vocal tracts while we speak – acoustic analysis, ultrasound, brain imaging, and so on. While these approaches have increased our understanding of speech movements, they are limited. For example, the anatomy of the vocal tract is quite complex, and tools that measure muscle activation, such as EMG, are too invasive or imprecise to be used effectively for speech movements.

Computational modeling has become an increasingly promising method for understanding speech. The biomechanical modeling platform Artisynth (https://www.artisynth.org), for example, allows scientists to study realistic 3D models of the vocal tract that are built using anatomical and physiological data.

These models can be used to see aspects of speech that are hard to visualize with other tools. For example, we can see what shape the tongue takes when a specific set of muscles activates. Or we can have the model perform a certain action and measure aspects of the outcome, such as having the model produce the syllable “ba” and measuring how much the lips compress against each other during the /b/ closure. We can also predict how changes to typical vocal tract anatomy, such as the removal of part of the tongue in response to oral cancer, affect the ability to perform speech movements.

In our project at the 176th ASA Meeting, we present a model of the vocal tract of an 11-month-old infant. A detailed model of the adult vocal tract named ‘Frank’ has already been implemented in Artisynth, but the infant vocal tract has different proportions from those of an adult. Using Frank as a starting point, we modified the relative scale of the different structures based on measurements taken from CT scan images of an infant vocal tract (see Figure 1).
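Conceptually, the rescaling works like the schematic NumPy sketch below. The real adaptation was carried out on the Frank model inside ArtiSynth, and the measurements, axis assignments, and stand-in mesh here are invented purely for illustration: each infant measurement is divided by the corresponding adult measurement, and the resulting ratios scale the model geometry along the relevant dimensions.

```python
# Schematic illustration of measurement-based rescaling (not the actual
# ArtiSynth workflow); all numbers are hypothetical.
import numpy as np

adult_mm  = {"oral_cavity_length": 85.0, "pharynx_length": 90.0}   # adult model
infant_mm = {"oral_cavity_length": 55.0, "pharynx_length": 40.0}   # from CT scan

ratios = {name: infant_mm[name] / adult_mm[name] for name in adult_mm}

# Apply the ratios to the vertices of a stand-in adult mesh:
# x = front-back (oral cavity), y = up-down (pharynx), z left unscaled.
adult_vertices = np.random.rand(1000, 3) * 100.0   # placeholder geometry, in mm
scale = np.array([ratios["oral_cavity_length"], ratios["pharynx_length"], 1.0])
infant_vertices = adult_vertices * scale

print("scale factors (x, y, z):", scale)
```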

Going forward, we plan to use this infant vocal tract model (see Figure 2) to simulate both aerodigestive movements and speech movements. One of the hypotheses for how infants learn to speak so quickly is that they build on movements they can carry out at birth, such as swallowing or suckling. The results of these simulations will help supplement neurological, clinical, and kinematic evidence bearing on this hypothesis. In addition, the model will be generally useful for researchers interested in the infant vocal tract. 

Figure 1: Left: A cross-section of the Frank model of an adult vocal tract with measurement lines. Right: A cross-sectional CT scan image of an 11-month-old infant with measurement lines. The relative proportions of each vocal tract were compared to generate the infant model.

 Figure 2: A modified Frank vocal tract conforming to infant proportions.

1pSCb15 – Why your boot might not sound like my boot: Gender, ethnicity, and back-vowel fronting in Mississippi

Wendy Herd – wherd@english.msstate.edu
Joy Cariño – smc790@msstate.edu
Meredith Hilliard – mah838@msstate.edu
Emily Coggins – egc102@msstate.edu
Jessica Sherman – jls1790@msstate.edu

Linguistics Research Laboratory
Mississippi State University
Mississippi State, MS 39762

Popular version of paper 1pSCb15, “The role of gender, ethnicity, and rurality in Mississippi back-vowel fronting.”
Presented Monday afternoon, November 5, 2018, Upper Pavilion, 176th ASA Meeting, Victoria, Canada

We often notice differences in pronunciation between our own speech and that of other speakers. We even use differences, like the Southern pronunciation of hi or the Northeastern absence of ‘r’ in park, to guess where a given speaker is from. A speaker’s pronunciation also includes cues that tell us which social groups that speaker identifies with. For example, the way you pronounce words might give listeners information about where you are from, whether you identify with a specific cultural group, whether you identify as a man, a woman, or a non-binary gender, as well as other information.

Back-vowel fronting is a particular type of pronunciation change that affects American English vowels like the /u/ in boot and the /o/ in boat. While these two vowel sounds are canonically produced with the tongue raised in the back of the mouth, speakers from across the United States sometimes produce these vowels with the tongue closer to the front of the mouth, nearing the position of the tongue in words like beat. We can measure this difference in tongue position by analyzing F1 and F2, which represent the important frequency information that allows us to differentiate between different vowel sounds. As seen in Figure 1, F1 and F2 (i.e., the dark horizontal bars in the bottom portion of the images) are very close together when the [u] of boot is pronounced in the back of the mouth while F1 and F2 are far apart when the [u] of boot is pronounced in the front of the mouth. These differences in pronunciation can also be heard in the sound files corresponding to each image.

Audio: boot produced by (a) a Black male speaker, (b) a Black female speaker, (c) a White male speaker, and (d) a White female speaker.

Figure 1. Waveform (top) and spectrogram (bottom) of (a-b) boot pronounced with a back vowel by a Black male speaker and by a Black female speaker, and of (c-d) boot pronounced with a fronted vowel by a White male speaker and by a White female speaker.
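For readers who want to measure this in their own recordings, the F1/F2 measurement described above is easy to reproduce with Praat or, as sketched below, with the praat-parselmouth Python package (our illustration, not the analysis pipeline of the study); the file name and the single mid-word measurement point are placeholders.

```python
# Estimate F1 and F2 near the middle of a recorded word, using Praat's
# default Burg formant settings via parselmouth.
import parselmouth

snd = parselmouth.Sound("boot.wav")        # hypothetical recording of "boot"
formants = snd.to_formant_burg()

t = snd.duration / 2                       # a point near the middle of the vowel
f1 = formants.get_value_at_time(1, t)      # first formant, in Hz
f2 = formants.get_value_at_time(2, t)      # second formant, in Hz

print(f"F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz, F2 - F1 = {f2 - f1:.0f} Hz")
# A small F2 - F1 distance suggests a back [u]; a large distance, a fronted [u].
```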

Other studies have found fronting in words like boot and boat in almost every regional dialect across the United States; however, back-vowel fronting is still primarily associated with the speech of young women, and the research in this area still tends to be limited to the speech of White speakers [1, 2]. The few studies that focused on Black speakers have reported mixed results, either that Black speakers do not front back vowels [3] or that Black speakers do front back vowels but exhibit less extreme fronting than White speakers [4]. Note that in the case of the former study, only male speakers from North Carolina were recorded, and in the case of the latter, both male and female speakers were recorded, but they were all from Memphis, an urban area.

Our study is different in that it includes recordings of both men and women, and of both Black and White speakers, and in that it focuses on a specific geographic region, thus minimizing variation due to regional differences that might be confounded with variation due to gender and/or ethnicity. We recorded the speech of 73 volunteers from Mississippi, making sure to recruit similar numbers of volunteers from different regions and/or cities in the state. The study included 19 Black female speakers, 15 Black male speakers, 20 White female speakers, and 19 White male speakers, all of whom were between 18 and 22 years old. This allowed us to directly compare the speech of women and men as well as Black and White speakers within Mississippi.

As can be seen in Figure 2, we found that speakers who identified as White were much more likely to front their back vowels in boot and boat than speakers who self-identified as Black. However, we did not find any gender differences. With the exception of one speaker, women who identified as Black were just as resistant to back-vowel fronting as men. Likewise, men who identified as White were just as likely to front their back vowels as women.


Figure 2. Scatterplots of vowels produced in boot (yellow), boat (blue), beat (red), book (green), and bought (purple) by Black male speakers (top-left), Black female speakers (top-right), White male speakers (bottom-left), and White female speakers (bottom-right). Each point represents a different speaker. The words beat, book, and bought as well as the labels “high,” “low,” “front,” and “back” were included to illustrate the most front/back and high/low points in the mouth.

Why do Black speakers and White speakers pronounce the vowels in boot and boat differently? Speakers tend to pronounce vowels – like other speech sounds – the way others in their social group pronounce those sounds. As such, pronouncing a fronted /u/ or /o/ could serve as a cue that tells listeners that the speaker identifies with other speakers who also front those vowels, in this case White speakers, and vice versa. Note that while back-vowel fronting might be associated with a more feminine identity in other regional dialects, that may not be the case in Mississippi, because we found no gender differences. Finally, to learn more about how we use back-vowel fronting to align ourselves with social groups, it is necessary to look at the perception of fronted back vowels by speakers from different groups as well as at the degree of back-vowel fronting that occurs during spontaneous speech. What do you think? Do you front your back vowels? Can you hear the difference in the recordings above?

 

  1. Fridland, V. 2001. The social dimension of the Southern Vowel Shift: Gender, age and class. Journal of Sociolinguistics, 5(2), 233-253.
  2. Clopper, C., Pisoni, D., & de Jong, K. 2005. Acoustic characteristics of the vowel systems of six regional varieties of American English. Journal of the Acoustical Society of America, 118(3), 1661-1676.
  3. Holt, Y. 2018. Mechanisms of vowel variation in African American English. Journal of Speech, Language, and Hearing Research, 61, 197-209.
  4. Fridland, V. & Bartlett, K. 2006. The social and linguistic conditioning of back vowel fronting across ethnic groups in Memphis, Tennessee. English Language and Linguistics, 10(1), 1-22.

2aSC11 – Adult imitating child speech: A case study using 3D ultrasound

Colette Feehan – cmfeehan@iu.edu
Steven M. Lulich – slulich@iu.edu
Indiana University

Popular version of paper 2aSC11
Presented Tuesday morning, November 6, 2018
176th ASA meeting, Victoria

Many people do not realize that a lot of the “child” voices they hear in animated TV shows and movies are actually produced by adults [1]. The field of animation has a long tradition of using adults to voice child characters, as in Peter Pan (1953), The Jetsons (1962-63), Rugrats (1991-2004), The Wild Thornberrys (1998-2004), and The Boondocks (2005-2014), to name just a few [1]. Reasons for using adults include the fact that children are hard to direct, that they legally cannot work long hours, and that their voices change as they grow up [1]. If a series like The Simpsons (1989-) had used real children, it might be on Bart number seven by now, whereas with the talented Nancy Cartwright, Bart has maintained the same vocal spunk of his 1980s self [8].

Voice actors are an interesting population for linguistic study because they are essentially professional folk linguists [9]: without formal training in linguistics, they skillfully and reliably perform complex linguistic tasks. Previous studies of voice actors [10-17] investigated how changes in pitch, movement of the vocal tract, and voice quality (e.g., how breathy or scratchy a voice sounds) affect the way listeners and viewers understand and interpret an animated character. The current investigation uses 3D ultrasound data from an amateur voice actor to address the question: What do adult voice actors do with their vocal tracts in order to sound like a child?

Ultrasound works by emitting high-frequency sound and measuring the time it takes for the sound to echo back. For this study, an ultrasound probe (like what you use to see a baby) was placed under the participant’s chin and held in place using a customized helmet. The sound waves travel through the tissues of the face and tongue—a fairly dense medium—and when the waves come into contact with the air along the surface of the tongue—a much lower density medium—they echo back.
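The arithmetic behind the image is simple echo ranging: given the standard assumed speed of sound in soft tissue, roughly 1540 m/s, the depth of the tongue surface is that speed times half the round-trip delay. The delay in the sketch below is an arbitrary example value, not a measurement from this study.

```python
# Convert an ultrasound echo delay into the depth of the reflecting surface.
SPEED_OF_SOUND_TISSUE = 1540.0   # metres per second, standard soft-tissue value

def echo_depth(delay_seconds: float) -> float:
    """The pulse travels to the surface and back, hence the factor of 2."""
    return SPEED_OF_SOUND_TISSUE * delay_seconds / 2.0

delay = 78e-6                                   # example: 78 microseconds
print(f"{echo_depth(delay) * 100:.1f} cm")      # about 6.0 cm to the tongue
```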

These echoes are represented in ultrasound images as a bright line (see Figure 1).

Multiple images can be analyzed and assembled into 3D representations of the tongue surface (see Figure 2).

This study identified three strategies for imitating a child’s voice. First, the actor raised the hyoid bone (a tiny bone in your neck), which is visible as an acoustic “shadow” circled in Figure 3.

This gesture effectively shortens the vocal tract, helping the actor to sound like a smaller person. Second, the actor pushed tongue movements forward in the mouth (visible in Figure 4).

This gesture shortens the front part of the vocal tract, which also helps the actor to sound like a smaller person. Third, the actor produced a prominent groove down the middle of the tongue (visible in Figure 2), effectively narrowing the vocal tract. These three strategies together help voice actors sound like people with smaller vocal tracts, which is very effective when voicing an animated child character!

 

References

  1. Holliday, C. “Emotion Capture: Vocal Performances by Children in the Computer-Animated Film”. Alphaville: Journal of Film and Screen Media 3 (Summer 2012). Web. ISSN: 2009-4078.
  2. Disney, W. (Producer) Geronimi, C., Jackson, W., Luske, H. (Directors). (1953). Peter Pan [Motion Picture]. Burbank, CA: Walt Disney Productions.
  3. Hanna, W., & Barbera, J. (1962). The Jetsons. [Television Series] Los Angles, CA: Hanna Barbera Productions.
  4. Klasky, A., Csupo, G., Coffey, V., Germain, P., Harrington, M. (Executive Producers) (1991). Rugrats [Television Series]. Hollywood, CA: Klasky/Csupo, Inc.
  5. Klasky, A., & Csupo, G. (Executive Producers). (1998). The Wild Thornberrys [Television Series]. Hollywood, CA: Klasky/Csupo, Inc.
  6. McGruder, A., Hudlin, R., Barnes, R., Cowan, B., Jones, C. (Executive Producers). (2005). The Boondocks [Television Series] Culver City, CA: Adelaide Productions Television.
  7. Brooks, J., & Groening, M. (Executive Producers). (1989). The Simpsons [Television Series]. Los Angeles, CA: Gracie Films.
  8. Cartwright, N. (2001) My Life as a 10-Year-Old Boy. New York: Hyperion Books.
  9. Preston, D. R. (1993). Folk dialectology. American dialect research, 333-378.
  10. Starr, R. L. (2015). Sweet voice: The role of voice quality in a Japanese feminine style. Language in Society, 44(01), 1-34.
  11. Teshigawara, M. (2003). Voices in Japanese animation: a phonetic study of vocal stereotypes of heroes and villains in Japanese culture. Dissertation.
  12. Teshigawara, M. (2004). Vocally expressed emotions and stereotypes in Japanese animation: Voice qualities of the bad guys compared to those of the good guys. Journal of the Phonetic Society of Japan, 8(1), 60-76.
  13. Teshigawara, M., & Murano, E. Z. (2004). Articulatory correlates of voice qualities of good guys and bad guys in Japanese anime: An MRI study. In Proceedings of INTERSPEECH (pp. 1249-1252).
  14. Teshigawara, M., Amir, N., Amir, O., Wlosko, E., & Avivi, M. (2007). Effects of random splicing on listeners’ perceptions. In Proceedings of the 16th International Congress of Phonetic Sciences (ICPhS).
  15. Teshigawara, M. 2009. Vocal expressions of emotions and personalities in Japanese anime. In Izdebski, K. (ed.), Emotions of the Human Voice, Vol. III Culture and Perception. San Diego: Plural Publishing, 275-287.
  16. Teshigawara, K. (2011). Voice-based person perception: two dimensions and their phonetic properties. ICPhS XVII, 1974-1977.
  17. Uchida, T. (2007). Effects of F0 range and contours in speech upon the image of speakers’ personality. Proc. 19th ICA, Madrid. http://www.seaacustica.es/WEB_ICA_07/fchrs/papers/cas-03-024.pdf

2pSC34 – Distinguishing Dick from Jane: Children’s voices are more difficult to identify than adults’ voices

Natalie Fecher – natalie.fecher@utoronto.ca
Angela Cooper – angela.cooper@utoronto.ca
Elizabeth K. Johnson – elizabeth.johnson@utoronto.ca

University of Toronto
3359 Mississauga Rd.,
Mississauga, Ontario L5G 4K2 CANADA

Popular version of paper 2pSC34
Presented Tuesday afternoon, November 6, 2018, 2:00-5:00 PM, UPPER PAVILION (VCC)
176th ASA Meeting, Victoria, Canada

Parents will tell you that a two-year-old’s birthday party is a chaotic place—young children running around, parents calling out to their children. Amidst that chaos, if you heard a young child calling out, asking to go to the bathroom, would you be able to recognize who’s talking without seeing their face? Perhaps not as easily as you might expect, suggests new research from the University of Toronto.

Adults are very adept at recognizing other adults from their speech alone. However, children’s speech productions differ substantially from adults’, owing to differences in the size of their vocal tracts, in how well they can control their articulators (e.g., the tongue) to form speech sounds, and in their linguistic knowledge. As a result, a child may pronounce words like elephant and strawberry more like “ephant” and “dobby”. We know very little about how these differences between child and adult speech might affect our ability to recognize who’s talking. Previous work from our lab demonstrated that even mothers are not as accurate as you might expect at identifying their own child’s voice.

Audio: samples of 4 adult voices and 4 child voices producing the word ‘elephant’.

In this study, we used two tasks to shed light on differences between child and adult voice recognition. First, we presented adult listeners with pairs of either child or adult voices to determine if they could even tell them apart. Results revealed that listeners were substantially worse at differentiating child voices relative to adult voices.

The second task had new adult listeners complete a two-day voice-learning experiment, in which they were trained to identify a set of 4 child voices on one day and 4 adult voices on the other day. Listeners first heard each voice producing a set of words while seeing a cartoon image on the screen, so they could learn the association between the cartoon and the voice. During training, they heard a word and saw a pair of cartoon images, after which they selected who they thought was speaking and received feedback on their accuracy. Finally, at test, they heard a word, saw 4 cartoon images on the screen, and selected who they thought was speaking (Figure 1).


Figure 1. Paradigm for the voice learning task

Results showed that, with training, listeners can learn to identify children’s voices above chance, though child voice learning was still slower and less accurate than adult voice learning. Interestingly, no relationship was found between a listener’s voice-learning performance with adult voices and their performance with child voices: those who were relatively good at identifying adult voices were not necessarily also good at identifying child voices.

This may suggest that the information in the speech signal that we use to differentiate adult voices may not be as informative for identifying child voices. Successful child voice recognition may require re-tuning our perceptual system to pay attention to different cues. For example, it may be more helpful to attend to the fact that one child makes certain pronunciation errors, while another child makes a different set of pronunciation errors.
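As a rough illustration of the “no relationship” finding above, one can correlate each listener’s accuracy with adult voices against their accuracy with child voices; a near-zero correlation is what a lack of relationship looks like. The accuracy values in the sketch below are invented for the illustration (SciPy assumed), and chance in the four-alternative test is 0.25.

```python
# Correlate per-listener voice-learning accuracy for adult vs. child voices.
from scipy.stats import pearsonr

adult_accuracy = [0.90, 0.85, 0.95, 0.70, 0.88, 0.92, 0.80, 0.75]  # hypothetical
child_accuracy = [0.45, 0.60, 0.40, 0.55, 0.35, 0.50, 0.65, 0.42]  # hypothetical

r, p = pearsonr(adult_accuracy, child_accuracy)
print(f"r = {r:.2f}, p = {p:.3f}")   # a small r with a large p suggests no link
```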