1aSCb4 – Formant and voice quality changes as a function of age in women

Laura L. Koenig – koenig@haskins.yale.edu
Adelphi University
158 Cambridge Avenue
Garden City NY 11530

Susanne Fuchs – fuchs@leibniz-zas.de
Leibniz-Zentrum Allgemeine Sprachwissenschaft (ZAS)
Schützenstr. 18
10117 Berlin (Germany)

Annette Gerstenberg – gerstenberg@uni-potsdam.de
University of Potsdam, Department of Romance Studies
Am Neuen Palais 10
14467 Potsdam (Germany)

Moriah Rastegar – moriahrastegar@mail.adelphi.edu
Adelphi University
158 Cambridge Avenue
Garden City NY 11530

Popular version of the paper: 1aSCb4
Presented: December 7, 2020 at 10:15 AM – 11:00 AM EST

As we age, we change in many ways: how we look, the way we dress, and how we speak. Some of these changes are biological, and others are social. All are potentially informative to those we interact with.


Caption: “Younger (left) and older (right). Image obtained under a publicly available Creative Commons license. Aging manipulation courtesy of Jolanda Fuchs.”

******

The human voice is a rich source of information on speaker characteristics, and studies indicate that listeners are relatively accurate in judging the age of an unknown person they hear on the phone.  Vocal signals carry information on (a) the sizes of the mouth and throat cavities, which change as we produce different vowels and consonants; (b) the voice pitch, which reflects characteristics of the vocal folds; and (c) the voice quality, which also reflects vocal-fold characteristics, but in complex and multidimensional ways.  One voice quality dimension is breathiness, that is, whether a person speaks with a breathier voice.  Past studies on the acoustic effects of vocal aging have concentrated on formants, which reflect upper-airway cavity sizes, and fundamental frequency, which corresponds to voice pitch.  Few studies have assessed voice quality.
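
Readers curious about how such measures are obtained may find a concrete sketch helpful. The Python snippet below uses the Praat-based library parselmouth to estimate formant frequencies, mean fundamental frequency, and one breathiness-related measure (the harmonics-to-noise ratio). The file name, the analysis time point, and the choice of HNR as a voice-quality proxy are illustrative assumptions, not the measurement pipeline used in this study.

    # Illustrative only: estimate formants, f0, and an HNR-based breathiness proxy.
    import parselmouth
    from parselmouth.praat import call

    snd = parselmouth.Sound("speaker_recording.wav")   # hypothetical file name

    # (a) Cavity sizes: first two formant frequencies at an arbitrary time point
    formants = snd.to_formant_burg()
    t_mid = snd.duration / 2
    f1 = formants.get_value_at_time(1, t_mid)
    f2 = formants.get_value_at_time(2, t_mid)

    # (b) Voice pitch: mean fundamental frequency over the whole recording
    pitch = snd.to_pitch()
    mean_f0 = call(pitch, "Get mean", 0, 0, "Hertz")

    # (c) Voice quality: harmonics-to-noise ratio; breathier voices carry more
    #     aspiration noise and therefore tend to have lower HNR values
    hnr = call(snd.to_harmonicity_cc(), "Get mean", 0, 0)

    print(f"F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz, mean f0 = {mean_f0:.0f} Hz, HNR = {hnr:.1f} dB")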

Further, most past work investigated age by comparing people from different generations.  Cross-generational studies can be confounded by changes in human living conditions such as nutrition, employment settings, and exposure to risk factors.  To separate effects of aging from environmental factors, it is preferable to assess the same individuals at different time points.  Such work is rather rare given the demands of re-connecting with people over long periods of time.

Here, we take advantage of the French LangAge corpus (https://www.uni-potsdam.de/langage/).  Participants engaged in biographical interviews beginning in 2005 and were revisited in subsequent years.  Our analysis is based on four women recorded in 2005 and 2015.  We focus on women because biological aging may differ across the sexes.  For each speaker and time point, we selected two of that speaker's most frequently produced words that contained no voiceless sounds.

In the audio examples below, the numbers 049 and 016 identify two of the speakers, f = female, and the following value (e.g., 72) is the speaker's age at the time of recording.

049_f_72_LeGris.wav 016_f_71_chiens.wav
049_f_82_LeBaigneur.wav 016_f_81_chiens.wav

Our results show that all four speakers had lower cavity (formant) frequencies at older ages.  This may reflect lengthening of the upper airways; for example, the larynx descends somewhat over time.  Voice quality also changed: speakers were breathier at younger ages than at older ages.  However, speakers differed considerably in the magnitude of these changes and in which measures showed aging effects.
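
The link between airway length and formant frequencies can be made concrete with a textbook approximation (this is only a back-of-the-envelope illustration, not the paper's analysis): treating the vocal tract during a neutral vowel as a uniform tube closed at one end, the resonances fall at F_n = (2n - 1) * c / (4 * L), so a longer tube yields proportionally lower formants.

    # Uniform-tube approximation: resonances of a tube of length L closed at one end.
    C = 35000.0  # approximate speed of sound in warm, humid air, in cm/s

    def neutral_vowel_formants(length_cm, n=3):
        """Return the first n resonance frequencies (Hz) for a tube of length_cm."""
        return [(2 * k - 1) * C / (4 * length_cm) for k in range(1, n + 1)]

    print(neutral_vowel_formants(15.0))  # roughly [583, 1750, 2917] Hz
    # Lengthening the tract by 1 cm lowers every resonance by about 6%:
    print(neutral_vowel_formants(16.0))  # roughly [547, 1641, 2734] Hz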

In some cultures, a breathy vocal quality is a marker of gender. Lifestyle changes in later life could lead to a reduced need to demonstrate “female” qualities. In our dataset, the speaker with the largest changes in breathiness was widowed between recording times.  Along with physiological factors and social-communicative conditions, ongoing adaptation to gender roles as a person ages may also contribute to changes in voice quality.

2pSCb4 – The Science of Voice Acting

Colette Feehan – cmfeehan@iu.edu
Indiana University

Popular version of paper 2pSCb4
Presented Tuesday afternoon, December 8, 2020
179th ASA meeting, Virtually Everywhere

Many people do not realize that the “children” they hear in animation are actually voiced by adults [1]. There are several reasons for this: children cannot work long hours, are difficult to direct, and their voices change as they grow. Using an adult who can simulate a child voice bypasses these issues, but surprisingly, not all voice actors (VAs) can create a believable child voice.

Studying what VAs do can tell us about how the vocal tract works. They can speak intelligibly while contorting their mouths in unnatural ways. A few previous studies [2-10] have looked at the acoustics of VAs, that is, the sounds they produce, such as changes in pitch, voice quality (how raspy or breathy a voice sounds), and the regional dialects they use. This study uses 3D ultrasound and acoustic data from 3 professional and 3 amateur VAs to start answering the question: what do voice actors do with their vocal tracts to sound like a child? There are multiple strategies for making the vocal tract sound smaller, and different actors combine different strategies to create their child-like voices.

Looking at both the acoustics (the sounds they produce) and the ultrasound imaging of their vocal tracts, the strategies identified so far include gesture fronting and raising, and hyoid bone raising.

Gesture fronting and raising refers to the position of the tongue within the mouth while you speak. If you think about the location of your tongue when repeating “ta ka ta ka…”, you will notice that your tongue touches the roof of your mouth in different places to make each of those consonant sounds: farther forward in the mouth for “ta” and farther back for “ka”. The same is true for vowels. Figure 1, based on acoustic analysis of the recordings, shows that the position of the tongue for the adult versus child voice is quite different for the [i] and [ɑ] sounds for this subject. Given this information, we can then look at the ultrasound and see that the tongue positions are indeed farther forward (right) or higher in the mouth for the child voice; see Figure 2.
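
For readers who want a sense of how such an acoustic comparison is made, the sketch below plots a Figure-1-style vowel comparison from formant measurements. The file formants.csv and its columns (voice, vowel, f1, f2) are hypothetical placeholders, not the author's data or code.

    # Illustrative vowel-space plot: F1/F2 of [i] and [ɑ] in adult vs. child voice.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("formants.csv")   # hypothetical measurements

    fig, ax = plt.subplots()
    for (voice, vowel), grp in df.groupby(["voice", "vowel"]):
        ax.scatter(grp["f2"], grp["f1"], label=f"{vowel} ({voice})")

    # Conventional vowel-chart orientation: F2 decreases to the right and F1
    # downward, so the plot roughly mirrors tongue position (front/back, high/low).
    ax.invert_xaxis()
    ax.invert_yaxis()
    ax.set_xlabel("F2 (Hz)")
    ax.set_ylabel("F1 (Hz)")
    ax.legend()
    plt.show()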

The hyoid bone is a small bone above the larynx in your neck. This bone interrupts the ultrasound signal and prevents an image from showing up, but looking at the location of this hyoid “shadow” can still give us information. If the hyoid shadow is raised and fronted, as seen in Figure 3, it might be the case that the actor is shortening their vocal tract by contracting muscles in their throat.

Figure 4 shows that, for this VA, the hyoid bone shadow was higher throughout the entire utterance while doing a child voice, meaning that the actor might physically shorten the whole vocal tract for as long as they are speaking in that voice.

Data from VAs can help identify alternative pronunciations for speech sounds, which could help people with speech impediments and could also be used to help trans individuals sound closer to their identity.

References

  1. Holliday, C. (2012). Emotion capture: Vocal performances by children in the computer-animated film. Alphaville: Journal of Film and Screen Media, 3. ISSN: 2009-4078.
  2. Starr, R. L. (2015). Sweet voice: The role of voice quality in a Japanese feminine style. Language in Society, 44(1), 1-34.
  3. Teshigawara, M. (2003). Voices in Japanese animation: A phonetic study of vocal stereotypes of heroes and villains in Japanese culture. Dissertation.
  4. Teshigawara, M. (2004). Vocally expressed emotions and stereotypes in Japanese animation: Voice qualities of the bad guys compared to those of the good guys. Journal of the Phonetic Society of Japan, 8(1), 60-76.
  5. Teshigawara, M., & Murano, E. Z. (2004). Articulatory correlates of voice qualities of good guys and bad guys in Japanese anime: An MRI study. In Proceedings of INTERSPEECH (pp. 1249-1252).
  6. Teshigawara, M., Amir, N., Amir, O., Wlosko, E., & Avivi, M. (2007). Effects of random splicing on listeners’ perceptions. In Proceedings of the 16th International Congress of Phonetic Sciences (ICPhS).
  7. Teshigawara, M. (2009). Vocal expressions of emotions and personalities in Japanese anime. In K. Izdebski (Ed.), Emotions of the Human Voice, Vol. III: Culture and Perception (pp. 275-287). San Diego: Plural Publishing.
  8. Teshigawara, M. (2011). Voice-based person perception: Two dimensions and their phonetic properties. ICPhS XVII, 1974-1977.
  9. Uchida, T. (2007). Effects of F0 range and contours in speech upon the image of speakers’ personality. Proc. 19th ICA, Madrid. http://www.seaacustica.es/WEB_ICA_07/fchrs/papers/cas-03-024.pdf
  10. Lippi-Green, R. (2011). English with an accent: Language, ideology and discrimination in the United States. Retrieved from https://ebookcentral.proquest.com

1aSCa3 – Training effects on speech prosody production by Cantonese-speaking children with autism spectrum disorder

Si Chen – sarah.chen@polyu.edu.hk
Bei Li
Fang Zhou
Angel Wing Shan Chan
Tempo Po Yi Tang
Eunjin Chun
Phoebe Choi
Chakling Ng
Fiona Cheng
Xinrui Gou

Department of Chinese and Bilingual Studies
The Hong Kong Polytechnic University
11 Yuk Choi Road, Hung Hom, Hong Kong, China

Popular version of paper 1aSCa3
Presented Monday, December 07, 2020, 9:30 AM – 10:15 AM EST
179th ASA Meeting, Acoustics Virtually Everywhere

Speakers can utilize prosodic variations to express their intentions, states, and emotions. In particular, the relatively new information in an utterance, namely the focus, is often associated with an expanded range of prosodic cues. The main types of focus are broad, narrow, and contrastive focus. Broad focus places focus on a whole sentence (A: What did you say? B: [Emily ate an apple]FOCUS), whereas narrow focus emphasizes the one constituent asked about in the question (A: What did Emily eat? B: Emily ate an [apple]FOCUS). Contrastive focus rejects alternative statements (A: Did Emily eat an orange? B: (No,) Emily ate an [apple]FOCUS).

Children with autism spectrum disorder (ASD) have been reported to show difficulties in using speech prosody to mark focus. The research presented here tests whether speech training and sung speech training can improve the use of speech prosody to mark focus. Fifteen Cantonese-speaking children with ASD completed pre- and post-training speech production tasks and received either speech or sung speech training. In the pre- and post-training speech production tasks, we designed games to measure participants’ ability to mark focus in conversations. The training aimed to strengthen the mapping between acoustic cues and information-structure categories through a series of tasks. The conversations used in sung speech training were set to melodies that imitated the change of acoustic cues in speech.

Training.mp4, An example of congruous and incongruous conversation pairs in sung speech training

Both training methods consisted of three phases. In the first phase, participants listened attentively to congruous conversation pairs in a designed game. In the second phase, participants were told that the odd-numbered trials were incongruous (the focus type that the question elicited did not match that of the answer) and the even-numbered trials were congruous, and they needed to attend to the differences between the odd and even trials. In the third phase, all trials were presented in random order, and participants had to judge whether each pair was congruous or not. Instant feedback was provided after each response.

We extracted acoustic cues from ASD children’s speech before and after training and performed statistical analyses. Our pilot results showed that both speech and sung speech training might have improved the use of prosodic cues such as intensity and f0 in marking focus across various focus positions (e.g. meanF0.tiff). However, ASD children may still have difficulties in integrating all the prosodic cues across focus conditions.
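
As an illustration of the kind of pre/post comparison described above (a sketch, not the authors' actual statistical analysis), the snippet below computes each child's mean f0 per focus condition before and after training and runs a paired test; the file cues.csv and its column names are assumed for the example.

    # Illustrative pre/post comparison of one prosodic cue (mean f0) per focus type.
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("cues.csv")   # hypothetical: speaker, phase, focus, mean_f0

    for focus, grp in df.groupby("focus"):
        # Average each speaker's utterances within phase, then pair across phases.
        per_speaker = grp.groupby(["speaker", "phase"])["mean_f0"].mean().unstack("phase")
        t, p = stats.ttest_rel(per_speaker["pre"], per_speaker["post"])
        print(f"{focus}: pre = {per_speaker['pre'].mean():.1f} Hz, "
              f"post = {per_speaker['post'].mean():.1f} Hz, paired t = {t:.2f}, p = {p:.3f}")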

Mean f0 of narrow focus in the initial position before and after training

3pSC10 – Fearless Steps: Taking the Next Step towards Advanced Speech Technology for Naturalistic Audio

John H. L. Hansen – john.hansen@utdallas.edu
Aditya Joglekar – aditya.joglekar@utdallas.edu
Meena Chandra Shekar – meena.chandrashekar@utdallas.edu
Abhijeet Sangwan – abhijeet.sangwan@utdallas.edu
CRSS: Center for Robust Speech Systems;
University of Texas at Dallas (UTDallas),
Richardson, TX – 75080, USA

Popular version of paper 3pSC10 “Fearless Steps: Taking the Next Step towards Advanced Speech Technology for Naturalistic Audio”
Presented December 2-6, 2019,
178th ASA Meeting, San Diego

J.H.L. Hansen, A. Joglekar, A. Sangwan, L. Kaushik, C. Yu, M.M.C. Shekhar, “Fearless Steps: Taking the Next Step towards Advanced Speech Technology for Naturalistic Audio,” 178th Acoustical Society of America, Session: 3pSC12, (Wed., 1:00pm-4:00pm; Dec. 4, 2019), San Diego, CA, Dec. 2-6, 2019.

NASA’s Apollo program represents one of the greatest achievements of humankind in the 20th century. During a span of four years (from 1968 to 1972), nine lunar missions were launched, and 12 astronauts walked on the surface of the moon. To monitor and assess this massive team challenge, all communications between NASA personnel and the astronauts were recorded on 30-track, 1-inch analog audio tapes. NASA made these recordings so that they could be reviewed to determine best practices and improve the success of subsequent Apollo missions. This analog audio collection was essentially set aside when the Apollo program concluded with Apollo-17, and all tapes were stored in NASA’s tape archive. Clearly, there are opportunities for research on this audio for both technological and historical purposes. The audio from the entire Apollo program amounts to well over 150,000 hours. Through the Fearless Steps initiative, CRSS-UTDallas digitized 19,000 hours of audio data from the Apollo missions A-1, A-11, and A-13. The focus of the current effort is to contribute to the development of spoken language technology algorithms to analyze and understand various aspects of conversational speech. To achieve this goal, a new 30-track analog audio decoder was designed using NASA’s SoundScriber.

Figure 1: (left): The SoundScriber device used to decode 30 track analog tapes, and (right): The UTD-CRSS designed read-head decoder retrofitted to the SoundScriber [5]

To develop efficient speech technologies for analyzing conversations and interactions, multiple sources of data such as interviews, flight journals, debriefs, and other text sources, along with videos, were used [8, 12, 13]. This initial research direction allowed CRSS-UTDallas to develop a document-linking web application called ‘Explore Apollo’, in which a user can access particular moments and stories from the Apollo-11 mission. Tools such as exploreapollo.org enable us to display our findings in an interactive manner [10, 14]. A case in point is the chord diagram used to illustrate team communication dynamics. This diagram (Figure 2, right) shows how much conversation each astronaut has with each of the others during the mission, as well as the communication interactions with the capsule communicator (the only person communicating directly with the astronauts). Analyses such as these provide interesting insight into the interaction dynamics of astronauts in deep space.
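
To make concrete what lies behind a chord diagram like the one in Figure 2 (right), the sketch below tallies how often the conversational floor passes from one participant to another; the file turns.csv and its columns (start, end, speaker) are assumed placeholders, not the project's actual data format.

    # Illustrative bookkeeping behind a chord diagram: count speaker-to-speaker hand-offs.
    import pandas as pd

    turns = pd.read_csv("turns.csv").sort_values("start")   # hypothetical turn log

    hand_offs = pd.DataFrame({
        "from": turns["speaker"].shift(),   # who held the floor previously
        "to": turns["speaker"],             # who holds it now
    }).dropna()
    hand_offs = hand_offs[hand_offs["from"] != hand_offs["to"]]   # only real exchanges

    matrix = pd.crosstab(hand_offs["from"], hand_offs["to"])
    print(matrix)   # a chord-diagram library renders this matrix as ribbons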

Figure 2: (left): Explore Apollo Online Platform [14] and (right): Chord Diagram of Astronauts’ Conversations [9]

With this massive aggregated dataset, CRSS-UTDallas sought to explore the problem of automatic speech understanding using algorithmic strategies to answer four questions: (1) when were they talking; (2) who was talking; (3) what was being said; and (4) how were they talking. In technical terms, these questions correspond to the following tasks: (1) Speech Activity Detection [5], (2) Speaker Identification, (3) Automatic Speech Recognition and Speaker Diarization [6], and (4) Sentiment and Topic Understanding [7].

The combined task of recognizing who said what, and when, is often called the “diarization pipeline.” To answer these questions, CRSS-UTDallas developed solutions for automated diarization and transcript generation using deep learning strategies for speech recognition along with Apollo-mission-specific language models [9]. We further developed algorithms to help answer the other questions, including detecting speech activity and speaker identity for segments of the corpus [6, 8].
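
To give a flavor of what the speech activity detection task involves, here is a deliberately simple, energy-based sketch. It is only an illustration of the task under assumed file names and thresholds; the CRSS-UTDallas systems use trained deep learning models rather than anything this crude.

    # Illustrative energy-based speech activity detection on one audio channel.
    import numpy as np
    import soundfile as sf

    audio, sr = sf.read("apollo_channel.wav")   # hypothetical single-channel file
    frame = int(0.025 * sr)                     # 25 ms analysis frames
    hop = int(0.010 * sr)                       # 10 ms hop between frames

    rms = np.array([
        np.sqrt(np.mean(audio[i:i + frame] ** 2))
        for i in range(0, len(audio) - frame, hop)
    ])
    threshold = 3.0 * np.median(rms)            # crude threshold above the noise floor
    is_speech = rms > threshold

    # Collapse frame-level decisions into (start, end) segments in seconds.
    times = np.arange(len(rms)) * hop / sr
    segments, start = [], None
    for t, speech in zip(times, is_speech):
        if speech and start is None:
            start = t
        elif not speech and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, times[-1]))
    print(segments[:5])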

Figure 3: Illustration of the Apollo Transcripts using the Transcriber tool

These transcripts allow us to search through the 19,000 hours of data to find keywords, phrases, or any other points of interest in a matter of seconds, as opposed to listening to the audio for hours to find the answers [10, 11]. The transcripts, along with the complete Apollo-11 and Apollo-13 corpora, are now freely available at fearlesssteps.exploreapollo.org.

Audio: Air-to-ground communication during the Apollo-11 Mission

Figure 4: The Fearless Steps Challenge website

Phase one of the Fearless Steps Challenge [15] involved performing five challenge tasks on 100 hours of time and mission critical audio out of the 19,000 hours of the Apollo 11 mission. The five challenge tasks are:

  1. Speech Activity Detection
  2. Speaker Identification
  3. Automatic Speech Recognition
  4. Speaker Diarization
  5. Sentiment Detection

The goal of this Challenge was to evaluate the performance of state-of-the-art speech and language systems on naturalistic audio from large, task-oriented teams operating in challenging environments. In the future, we aim to digitize all of the Apollo missions and make them publicly available.

References

  1. A. Sangwan, L. Kaushik, C. Yu, J. H. L. Hansen, and D. W. Oard, “Houston, we have a solution: Using NASA Apollo program to advance speech and language processing technology,” INTERSPEECH, 2013.
  2. C. Yu, J. H. L. Hansen, and D. W. Oard, “‘Houston, We Have a Solution’: A case study of the analysis of astronaut speech during NASA Apollo 11 for long-term speaker modeling,” INTERSPEECH, 2014.
  3. D. W. Oard, J. H. L. Hansen, A. Sangwan, B. Toth, L. Kaushik, and C. Yu, “Toward access to multi-perspective archival spoken word content,” in Digital Libraries: Knowledge, Information, and Data in an Open Access Society, 10075:77-82. Cham: Springer International Publishing, 2016.
  4. A. Ziaei, L. Kaushik, A. Sangwan, J. H. L. Hansen, and D. W. Oard, “Speech activity detection for NASA Apollo space missions: Challenges and solutions,” pp. 1544-1548, INTERSPEECH, 2014.
  5. L. Kaushik, A. Sangwan, and J. H. L. Hansen, “Multi-channel Apollo mission speech transcripts calibration,” pp. 2799-2803, INTERSPEECH, 2017.
  6. C. Yu and J. H. L. Hansen, “Active learning based constrained clustering for speaker diarization,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 11, pp. 2188-2198, Nov. 2017. doi: 10.1109/TASLP.2017.2747097
  7. L. Kaushik, A. Sangwan, and J. H. L. Hansen, “Automatic sentiment detection in naturalistic audio,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 8, pp. 1668-1679, Aug. 2017.
  8. C. Yu and J. H. L. Hansen, “A study of voice production characteristics of astronaut speech during Apollo 11 for speaker modeling in space,” Journal of the Acoustical Society of America, vol. 141, no. 3, p. 1605, Mar. 2017.
  9. L. Kaushik, “Conversational Speech Understanding in Highly Naturalistic Audio Streams,” PhD dissertation, The University of Texas at Dallas, 2017.
  10. A. Joglekar, C. Yu, L. Kaushik, A. Sangwan, and J. H. L. Hansen, “Fearless Steps corpus: A review of the audio corpus for Apollo-11 space mission and associated challenge tasks,” NASA Human Research Program Investigators’ Workshop (HRP), 2018.
  11. L. Kaushik, A. Sangwan, and J. H. L. Hansen, “Apollo Archive Explorer: An online tool to explore and study space missions,” NASA Human Research Program Investigators’ Workshop (HRP), 2017.
  12. Apollo 11 Mission Overview: https://www.nasa.gov/mission_pages/apollo/missions/apollo11.html
  13. Apollo 11 Mission Reports: https://www.hq.nasa.gov/alsj/a11/a11mr.html
  14. Explore Apollo Document Linking Application: https://app.exploreapollo.org/
  15. J. H. L. Hansen, A. Joglekar, M. C. Shekar, V. Kothapally, C. Yu, L. Kaushik, and A. Sangwan, “The 2019 inaugural Fearless Steps Challenge: A giant leap for naturalistic audio,” in Proc. INTERSPEECH, 2019.

Additional News Releases from 2017-2019:
https://www.youtube.com/watch?v=CTJtRNMac0E&t
https://www.nasa.gov/feature/nasa-university-of-texas-at-dallas-reveal-apollo-11-behind-the-scenes-audio
https://www.foxnews.com/tech/engineers-restore-audio-recordings-from-apollo-11-mission
https://www.nbcnews.com/mach/science/nasa-releases-19-000-hours-audio-historic-apollo-11-mission-ncna903721
https://www.insidescience.org/news/thousands-hours-newly-released-audio-tell-backstage-story-apollo-11-moon-mission

4aSC19 – Consonant Variation in Southern Speech

Lisa Lipani – llipani@uga.edu
Michael Olsen – michael.olsen25@uga.edu
Rachel Olsen – rmm75992@uga.edu
Department of Linguistics
University of Georgia
142 Gilbert Hall
Athens, Georgia 30602

Popular version of paper 4aSC19
Presented Thursday morning, December 5, 2019
178th ASA Meeting, San Diego, CA

We all recognize that people from different areas of the United States have different ways of talking, especially in how they pronounce their vowels. Think, for example, about stereotypical Bostonians who might “pahk the cah in Havahd Yahd”. The field of sociolinguistics studies speech sounds from different groups of people to establish and understand regional American dialects.

While there are decades of research on vowels, sociolinguists have recently begun to ask whether consonants such as p, b, t, d, k, and g also vary depending on where people are from or what social groups they belong to. These consonants are known as “stop consonants” because the airflow “stops” due to a closure in the vocal tract. One acoustic characteristic of these consonants is voice onset time: the amount of time between the release of the closure in the vocal tract and the start of vocal fold (also known as vocal cord) vibration. We wanted to know whether some groups of speakers, say men versus women or Texans versus other Southern speakers, pronounced their consonants differently than other groups. To investigate this, we used the Digital Archive of Southern Speech (DASS), which contains 367 hours of recordings made across the southeastern United States between 1970 and 1983, consisting of approximately two million words of Southern speech.

The original DASS researchers were mostly interested in differences in language based on the age of speakers and their geographic location. In the interviews, people were asked about specific words that might indicate their dialect. For example, do you say “pail” or “bucket” for the thing you might borrow from Jack and Jill?

We used computational methods to investigate Southern consonants in DASS, looking at pronunciations of p, b, t, d, k, and g at the beginning of roughly 144,000 words. Our results show that ethnicity is a social factor in the production of these sounds. In our data, African American speakers had longer voice onset times, meaning that there was a longer period of time between the release of the stop consonant and the start of vocal fold vibration, even when we adjusted the data for speaking rate. This kind of research is important because as we describe differences in the way we speak, we can better understand how we express our social and regional identity.
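
As one concrete way to picture this kind of analysis (a hedged sketch, not the authors' actual pipeline), the snippet below fits a mixed-effects model that tests a group difference in voice onset time while adjusting for speaking rate, with speakers as a random effect; the file vot.csv and its column names are assumptions for the example.

    # Illustrative group comparison of voice onset time, adjusted for speaking rate.
    import pandas as pd
    import statsmodels.formula.api as smf

    tokens = pd.read_csv("vot.csv")   # hypothetical: vot_ms, ethnicity, speech_rate, speaker

    model = smf.mixedlm("vot_ms ~ ethnicity + speech_rate",
                        data=tokens, groups=tokens["speaker"])
    result = model.fit()
    print(result.summary())   # the ethnicity coefficient estimates the group difference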

5aSC8 – How head and eyebrow movements help make conversation a success

Samantha Danner – sfgordon@usc.edu
Dani Byrd – dbyrd@usc.edu
Department of Linguistics, University of Southern California
Grace Ford Salvatori Hall, Rm. 301
3601 Watt Way
Los Angeles, CA 90089-1693

Jelena Krivokapić – jelenak@umich.edu
Department of Linguistics, University of Michigan
440 Lorch Hall
611 Tappan Street
Ann Arbor, MI 48109-1220

Popular version of poster 5aSC8
Presented Friday morning, December 6, 2019
178th ASA Meeting, San Diego, CA

It’s easy to take for granted our ability to have a conversation, even with someone we’ve never met before. In fact, the human capacity for choreographing conversation is quite incredible. The average time from when one speaker stops speaking to when the next speaker starts is only about 200 milliseconds. Yet somehow, speakers are able to let their conversation partner know when they are ready to turn over the conversational ‘floor.’ Likewise, people somehow sense when it is their turn to start speaking. How, without any conscious effort, is this dance of conversation between two people so relatively smooth?

One possible answer to this question is that we use non-verbal communication to help move conversations along. The study described in this presentation looks at how movements of the eyebrows and head might be used by participants in conversation to help determine when to exchange the conversational floor with one another. For this research, speakers conversed in pairs, taking turns to collaboratively recite a well-known nursery rhyme like ‘Humpty Dumpty’ or ‘Jack and Jill.’ Using nursery rhymes allowed us to study spontaneous speech (speech that is not rehearsed or read) that offered many opportunities for the members of the pair to take turns speaking. We used an instrument called an electromagnetic articulograph to precisely track the eyebrow and head movements of the two conversing people. Their speech was also recorded, so that it was clear exactly when in the conversation each person’s brow and head movements were happening.
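
One simple way to quantify how often such movements occur around turn exchanges (a sketch under assumed data formats, not the authors' analysis) is to detect peaks in the tracked movement signal and count how many fall near turn boundaries identified from the audio:

    # Illustrative count of head-movement events near conversational turn exchanges.
    import numpy as np
    from scipy.signal import find_peaks

    fs = 100                                      # assumed articulograph sampling rate, Hz
    head_z = np.load("head_vertical.npy")         # hypothetical vertical head position signal
    turn_times = np.load("turn_times.npy")        # hypothetical turn-exchange times, in seconds

    # Treat prominent peaks in movement speed as "movement events".
    speed = np.abs(np.gradient(head_z)) * fs
    peaks, _ = find_peaks(speed, height=np.percentile(speed, 90))
    event_times = peaks / fs

    # Count events that fall within half a second of a turn exchange.
    near_turns = sum(np.any(np.abs(turn_times - t) < 0.5) for t in event_times)
    print(f"{near_turns} of {len(event_times)} movement events occur near turn exchanges")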

We wondered whether we would see more frequent movements of the eyebrows and head when someone is acting as a speaker as opposed to a listener during the conversation, and whether we would see more or less frequent movement at particular moments in the conversation, such as when one person yields the conversational floor to the other, or interrupts the other, or finds that they need to start speaking again after an awkward pause.

We found that listeners move their heads and brows more frequently than speakers. This may mean that people in conversation use face movements to show their engagement with what their partner is saying. We also found that the moment in conversation when movements are most frequent is at interruptions, indicating that listeners may use co-speech movements to signal that they are about to interrupt a speaker.

This research on spoken language helps linguists understand how humans can converse so easily and effectively, highlighting some of the many behaviors we use in talking to each other. Actions of the face and body facilitate the uniquely human capacity for language communication—we use so much more than just our voices to make a conversation happen.