3pSCb1 – Sound Teaching Online During COVID-19

Anne C. Balant – balanta@newpaltz.edu
State University of New York at New Paltz
1 Hawk Dr.
New Paltz, NY 12561

Popular version of lightning round talk 3pSCb1 “Lab kits for remote and socially distanced instruction in a GE Acoustics Course”
Presented Thursday afternoon, June 10, 2021
180th ASA Meeting, Acoustics in Focus
Read the article in Proceedings of Meetings on Acoustics

How do you give students in an online acoustics course a hands-on lab experience?

[Image: lab kit]

At the State University of New York (SUNY) at New Paltz, students in the online sections of “The World of Sound” use a lab kit that was designed by the instructor. Students pay for shipment of the kits to their homes at the start of the course and return them at the end. They submit photos or videos of their activities along with their completed lab reports.

These kits had been in use for several years in an online post-baccalaureate program that prepares students for graduate study in speech-language pathology when the COVID-19 pandemic radically changed the undergraduate on-campus version of the course.

“The World of Sound” is a four-credit general education lab science course. Undergraduates typically work in groups of three and share equipment within and across lab sections. By summer of 2020, it was clear that on-campus labs in the upcoming fall semester would have to meet social distancing requirements, with no sharing of materials, and that there could be a pivot to fully remote instruction at any time. Because of the fiscal impact of COVID-19, the cost of providing individual instructional materials was also a consideration. A revised lab kit was developed that contains everything needed for seven labs, costs under $30.00, and has a shipping weight of less than two pounds.

[Image: lab kit]

About one-fourth of the undergraduates in the course chose to study fully remotely during fall 2020. These students had their kits shipped to them and they attended a weekly virtual lab session. Each student in the seated course was issued an individual lab kit in a shipping box that was addressed to the department for ease of return shipment. Seated labs were conducted with all required precautions including face coverings and social distancing. The kits contained everything needed for each lab, including basic supplies, so no equipment had to be shared.

Although the college was able to keep COVID-19 rates low enough to stay open for the entire semester, about 15% of the students in the course transitioned to remote learning at least briefly for reasons such as illness or quarantine, missing a required COVID-19 test date, financial issues, or COVID-19-related family responsibilities or crises. Having their lab kits in their possession allowed these students to move seamlessly between seated and virtual lab sessions without falling behind. Every undergraduate who studied remotely for part or all of the semester completed the course successfully.

1aSC3 – Acoustic changes of speech during the later years of life

Benjamin Tucker – benjamin.tucker@ualberta.ca
Stephanie Hedges – shedges@ualberta.ca
Department of Linguistics
University of Alberta
Edmonton, Alberta T6G 2E7
Canada

Mark Berardi – mberardi@msu.edu
Eric Hunter – ejhunter@msu.edu
Department of Communicative Sciences and Disorders
Michigan State University
East Lansing, Michigan 48824

Popular version of paper 1aSC3
Presented Monday morning, 11:15 AM – 12:00 PM, December 7, 2020
179th ASA Meeting, Acoustics Virtually Everywhere

Research into the perception and production of the human voice has shown that the voice changes with age (e.g., Harnsberger et al., 2008). Most previous studies have investigated these changes by comparing groups of people of different ages, while only a few have tracked how an individual speaker’s voice changes over time. The present study investigates three male speakers and how their voices changed over the last 30 to 50 years of their lives.

We used publicly available archives of speeches given to large audiences on a semi-regular basis (generally with a couple of years between each instance). The speeches span the last 30-50 years of each speaker’s life, meaning that we have samples ranging from the speakers’ late 40s to early 90s. We extracted 5-minute samples (recordings and transcripts) from each speech. We then used the Penn forced-alignment system, which finds and marks the boundaries of individual speech sounds, to identify word and sound boundaries. Acoustic characteristics were extracted from the speech signal with a custom script written for the Praat software package.
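
As a rough illustration of this kind of acoustic extraction, the sketch below pulls a fundamental-frequency track from one audio file using the parselmouth Python wrapper around Praat; it is not the authors’ actual script, and the file name is hypothetical.

    # Sketch: extract F0 from one speech sample with parselmouth (Praat).
    import numpy as np
    import parselmouth

    snd = parselmouth.Sound("speech_sample_1975.wav")  # hypothetical file name
    pitch = snd.to_pitch(time_step=0.01)                # Praat pitch tracking every 10 ms
    f0 = pitch.selected_array["frequency"]              # Hz; 0.0 where no pitch was found
    f0 = f0[f0 > 0]                                      # keep voiced frames only
    print(f"mean F0 = {np.mean(f0):.1f} Hz, median F0 = {np.median(f0):.1f} Hz")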

In the present analysis, we investigate changes in the vowel space (the acoustic range of vowels a speaker produces), fundamental frequency (what a listener hears as pitch), the duration of words and sounds (segments), and speech rate. We model the acoustic characteristics of our speakers using Generalized Additive Models (Hastie & Tibshirani, 1990), which allow for an investigation of non-linear changes over time.
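
To give a concrete picture of what such a model does, the sketch below fits a Generalized Additive Model to made-up mean-F0 values across ages using the pygam package; the numbers are illustrative only, not the study’s data, and the authors’ actual modeling setup may differ.

    # Sketch: a GAM with one smooth term over age can capture a fall-then-rise pattern.
    import numpy as np
    from pygam import LinearGAM, s

    ages = np.array([48, 52, 56, 60, 64, 68, 70, 74, 78, 83, 88, 91], dtype=float)
    mean_f0 = np.array([118, 115, 112, 108, 105, 103, 101, 104, 108, 112, 117, 121], dtype=float)

    gam = LinearGAM(s(0)).fit(ages.reshape(-1, 1), mean_f0)  # smooth, non-linear fit over age
    grid = np.linspace(ages.min(), ages.max(), 50).reshape(-1, 1)
    print(gam.predict(grid)[:5])  # predicted mean F0 along the age grid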

The results are discussed in terms of vocal changes over the lifespan in the speakers’ later years. Figure 1 illustrates the change in one speaker’s vowel space as he ages. We find that for this speaker the vowel space shifts to lower frequencies as he ages.

Figure 1 – An animation of Speaker 1’s vowel space and how it changes over a period of 50 years. Each colored circle represents a different decade.

We also find a similar effect for fundamental frequency across all three speakers (Figure 2): the average fundamental frequency of their voices gets steadily lower as they age and then starts to rise after the age of 70. The same pattern holds for word and segment duration. We find that, on average, as our three speakers age their speech (at least when giving public speeches) gets faster and then slows down after around the age of 70.

Figure 2: Average fundamental frequency of our speakers’ speech as they age.

Figure 3: Average speech rate in syllables per second of our speakers’ speech as they age.

While on average our three speakers show a change in the progression of their speech at the age of 70, each speaker has their own unique speech trajectory. From a physiological standpoint, our data suggest that with age come not only laryngeal changes (changes to the voice) but also a decrease in respiratory health – especially expiratory volume – as has been reflected in previous studies.

1aSCb4 – Formant and voice quality changes as a function of age in women

Laura L. Koenig – koenig@haskins.yale.edu
Adelphi University
158 Cambridge Avenue
Garden City NY 11530

Susanne Fuchs – fuchs@leibniz-zas.de
Leibniz-Zentrum Allgemeine Sprachwissenschaft (ZAS)
Schützenstr. 18
10117 Berlin (Germany)

Annette Gerstenberg – gerstenberg@uni-potsdam.de
University of Potsdam, Department of Romance Studies
Am Neuen Palais 10
14467 Potsdam (Germany)

Moriah Rastegar – moriahrastegar@mail.adelphi.edu
Adelphi University
158 Cambridge Avenue
Garden City NY 11530

Popular version of the paper: 1aSCb4
Presented: December 7, 2020 at 10:15 AM – 11:00 AM EST

As we age, we change in many ways:  How we look, the way we dress, and how we speak.  Some of these changes are biological, and others are social.  All are potentially informative to those we interact with.


Caption: Younger (left) and older (right). Image obtained under a publicly available Creative Commons license. Aging manipulation courtesy of Jolanda Fuchs.

******

The human voice is a rich source of information on speaker characteristics, and studies indicate that listeners are relatively accurate in judging the age of an unknown person they hear on the phone.  Vocal signals carry information on (a) the sizes of the mouth and throat cavities, which change as we produce different vowels and consonants; (b) the voice pitch, which reflects characteristics of the vocal folds; and (c) the voice quality, which also reflects vocal-fold characteristics, but in complex and multidimensional ways.  One dimension of voice quality is breathiness, that is, whether a person speaks with a breathier voice.  Past studies on the acoustic effects of vocal aging have concentrated on formants, which reflect upper-airway cavity sizes, and fundamental frequency, which corresponds to voice pitch.  Few studies have assessed voice quality.
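
For readers curious how such measures are obtained in practice, the sketch below uses the parselmouth Python interface to Praat to compute mean formants, mean F0, and harmonics-to-noise ratio (one rough correlate of breathiness). This is only an assumed illustration, not the analysis pipeline used in this study, and the file name is hypothetical.

    # Sketch: formants (cavity sizes), F0 (pitch), and HNR (related to breathiness) via Praat.
    import parselmouth
    from parselmouth.praat import call

    snd = parselmouth.Sound("speaker_recording.wav")    # hypothetical recording

    formants = snd.to_formant_burg()                     # Burg formant tracking
    f1 = call(formants, "Get mean", 1, 0, 0, "hertz")    # mean F1 over the file
    f2 = call(formants, "Get mean", 2, 0, 0, "hertz")    # mean F2 over the file

    pitch = snd.to_pitch()
    mean_f0 = call(pitch, "Get mean", 0, 0, "Hertz")

    harmonicity = snd.to_harmonicity_cc()                # cross-correlation harmonicity
    hnr = call(harmonicity, "Get mean", 0, 0)            # dB; lower values often accompany breathier voice

    print(f"F1 {f1:.0f} Hz, F2 {f2:.0f} Hz, F0 {mean_f0:.0f} Hz, HNR {hnr:.1f} dB")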

Further, most past work investigated age by comparing people from different generations.  Cross-generational studies can be confounded by changes in human living conditions such as nutrition, employment settings, and exposure to risk factors.  To separate effects of aging from environmental factors, it is preferable to assess the same individuals at different time points.  Such work is rather rare given the demands of re-connecting with people over long periods of time.

Here, we take advantage of the French LangAge corpus (https://www.uni-potsdam.de/langage/).  Participants engaged in biographical interviews beginning in 2005 and were revisited in subsequent years.  Our analysis is based on four women recorded in 2005 and 2015.  We focus on women because biological aging may differ across the sexes.  From all the words, we selected two of the most frequent ones that were produced by each speaker at both time points and did not include voiceless sounds.

Numbers 049 and 016 identify the two speakers, f=female, and the following value (e.g. 72) is the age of the speaker.

049_f_72_LeGris.wav 016_f_71_chiens.wav
049_f_82_LeBaigneur.wav 016_f_81_chiens.wav

Our results show that all four speakers have lower cavity (formant) frequencies at older ages.  This may reflect lengthening of the upper airways, e.g., the larynx descends somewhat over time.  Voice quality also changed, with breathier vocal quality at younger ages than at older ages.  However, speakers differed considerably in the magnitude of these changes and in which measures demonstrated aging effects.

In some cultures, a breathy vocal quality is a marker of gender. Lifestyle changes in later life could lead to a reduced need to demonstrate “female” qualities. In our dataset, the speaker with the largest changes in breathiness was widowed between recording times.  Along with physiological factors and social-communicative conditions, ongoing adaptation to gender roles as a person ages may also contribute to changes in voice quality.

2pSCb4 – The Science of Voice Acting

Colette Feehan – cmfeehan@iu.edu
Indiana University

Popular version of paper 2pSCb4
Presented Tuesday afternoon, December 8, 2020
179th ASA meeting, Virtually Everywhere
Click here to read the abstract

Many people do not realize that the “children” they hear in animation are actually voiced by adults [1]. There are several reasons for this, including: children cannot work long hours, are difficult to direct, and their voices change as they grow. Using an adult who can simulate a child voice bypasses these issues, but surprisingly not all voice actors (VAs) can create a believable child voice.

Studying what VAs do can tell us about how the vocal tract works. They can speak intelligibly while contorting their mouths in unnatural ways. A few previous studies [2-10] have looked at the acoustics of VAs, that is, just the sounds that they produce, such as changes in pitch, voice quality (how raspy or breathy a voice sounds), and what kinds of regional dialects they use. This study uses 3D ultrasound and acoustic data from 3 professional and 3 amateur VAs to start answering the question: What do voice actors do with their vocal tracts to sound like a child? There are multiple strategies for making the vocal tract sound smaller, and different actors combine different strategies to create their child-like voices.

Looking at both the acoustics (the sounds they produce) and the ultrasound imaging of their vocal tracts, the strategies identified so far include gesture fronting and raising, and hyoid bone raising.

Gesture fronting and raising refers to the position of the tongue within the mouth while you speak. If you think about the location of your tongue when repeating “ta ka ta ka…” you will notice that your tongue touches the roof of your mouth in different places to make each of those consonant sounds—farther forward in the mouth for “ta” and farther back for “ka”—and the same is true for vowels. Figure 1 comes from analyzing the recordings of their speech and shows that the position of the tongue for the adult versus child voice is quite different for [i] and [ɑ] sounds for this subject. Given this information, we can then look at the ultrasound and see that the tongue positions are indeed farther forward (right) or higher in the mouth for the child voice (see Figure 2).
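
To make the vowel-space idea concrete, the short sketch below plots hypothetical F1/F2 values for [i] and [ɑ] in an adult voice versus a child-like voice; the numbers are invented for illustration and are not the measurements behind Figure 1.

    # Sketch: plot a two-vowel "vowel space" for adult vs. child-like voice.
    import matplotlib.pyplot as plt

    tokens = {
        "adult [i]": (310, 2200),   # (F1, F2) in Hz; hypothetical values
        "child [i]": (360, 2600),
        "adult [ɑ]": (750, 1100),
        "child [ɑ]": (830, 1350),
    }
    for label, (f1, f2) in tokens.items():
        marker = "o" if label.startswith("adult") else "^"
        plt.scatter(f2, f1, marker=marker, label=label)

    # Phoneticians usually flip both axes so the plot roughly mirrors tongue position.
    plt.gca().invert_xaxis()
    plt.gca().invert_yaxis()
    plt.xlabel("F2 (Hz)")
    plt.ylabel("F1 (Hz)")
    plt.legend()
    plt.show()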

The hyoid bone is a small bone above the larynx in your neck. This bone interrupts the ultrasound signal and prevents an image from showing up, but looking at the location of this hyoid “shadow” can still give us information. If the hyoid shadow is raised and fronted, as seen in Figure 3, it might be the case that the actor is shortening their vocal tract by contracting muscles in their throat.

Figure 4 shows that, for this VA, the hyoid bone shadow was higher throughout the entire utterance while doing a child voice, meaning that the actor might physically shorten the entire vocal tract for the whole time they are speaking.

Data from VAs can help identify alternative ways of producing speech sounds, which could help people with speech impediments and could also be used to help trans individuals sound closer to their identity.

References

  1. Holliday, C. “Emotion Capture: Vocal Performances by Children in the Computer-Animated Film”. Alphaville: Journal of Film and Screen Media 3 (Summer 2012). Web. ISSN: 2009-4078.
  2. Starr, R. L. (2015). Sweet voice: The role of voice quality in a Japanese feminine style. Language in Society, 44(01), 1-34.
  3. Teshigawara, M. (2003). Voices in Japanese animation: a phonetic study of vocal stereotypes of heroes and villains in Japanese culture. Dissertation.
  4. Teshigawara, M. (2004). Vocally expressed emotions and stereotypes in Japanese animation: Voice qualities of the bad guys compared to those of the good guys. Journal of the Phonetic Society of Japan, 8(1), 60-76.
  5. Teshigawara, M., & Murano, E. Z. (2004). Articulatory correlates of voice qualities of good guys and bad guys in Japanese anime: An MRI study. In Proceedings of INTERSPEECH (pp. 1249-1252).
  6. Teshigawara, M., Amir, N., Amir, O., Wlosko, E., & Avivi, M. (2007). Effects of random splicing on listeners’ perceptions. In 16th International Congress of Phonetic Sciences (ICPhS).
  7. Teshigawara, M. 2009. Vocal expressions of emotions and personalities in Japanese anime. In Izdebski, K. (ed.), Emotions of the Human Voice, Vol. III Culture and Perception. San Diego: Plural Publishing, 275-287.
  8. Teshigawara, K. (2011). Voice-based person perception: Two dimensions and their phonetic properties. ICPhS XVII, 1974-1977.
  9. Uchida, T. (2007). Effects of F0 range and contours in speech upon the image of speakers’ personality. Proc. 19th ICA, Madrid. http://www.seaacustica.es/WEB_ICA_07/fchrs/papers/cas-03-024.pdf
  10. Lippi-Green, R. (2011). English with an Accent: Language, Ideology and Discrimination in the United States. Retrieved from https://ebookcentral.proquest.com

1aSCa3 – Training effects on speech prosody production by Cantonese-speaking children with autism spectrum disorder

Si Chen – sarah.chen@polyu.edu.hk
Bei Li
Fang Zhou
Angel Wing Shan Chan
Tempo Po Yi Tang
Eunjin Chun
Phoebe Choi
Chakling Ng
Fiona Cheng
Xinrui Gou

Department of Chinese and Bilingual Studies
The Hong Kong Polytechnic University
11 Yuk Choi Road, Hung Hom, Hong Kong, China

Popular version of paper 1aSCa3
Presented Monday, December 07, 2020, 9:30 AM – 10:15 AM EST
179th ASA Meeting, Acoustics Virtually Everywhere

Speakers can utilize prosodic variations to express their intentions, states, and emotions. Specifically, the relatively new information of an utterance, namely the focus, is often associated with an expanded range of prosodic cues. The main types of focus are broad, narrow, and contrastive focus. Broad focus involves focus on a whole sentence (A: What did you say? B: [Emily ate an apple]FOCUS), whereas narrow focus emphasizes one constituent asked about in the question (A: What did Emily eat? B: Emily ate an [apple]FOCUS). Contrastive focus rejects alternative statements (A: Did Emily eat an orange? B: (No,) Emily ate an [apple]FOCUS).

Children with autism spectrum disorder (ASD) have been reported to show difficulties in using speech prosody to mark focus. The present research tests whether speech training and sung speech training can improve the use of speech prosody to mark focus. Fifteen Cantonese-speaking children with ASD completed pre- and post-training speech production tasks and received either speech or sung speech training. In the pre- and post-training speech production tasks, we designed games to measure participants’ ability to mark focus in conversations. The training aimed to improve participants’ mapping between acoustic cues and information-structure categories through a series of tasks. The conversations used in sung speech training were designed with melodies that imitated the change of acoustic cues in speech.

Training.mp4: An example of congruous and incongruous conversation pairs in sung speech training

Both training methods consisted of three phases. In the first phase, participants listened attentively to congruous conversation pairs in a designed game. In the second phase, participants were told that the odd trials were incongruous (the focus type that the question elicited did not match that of the answer) and the even trials were congruous; they needed to attend to the differences between the odd and even trials. In the third phase, all trials were presented in random order, and participants had to judge whether each pair was congruous or not. Instant feedback was provided after each response.
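
As a simplified picture of the third phase, the sketch below randomizes a few hypothetical trials, asks for a congruous/incongruous judgment, and gives instant feedback; it is not the actual experiment software, and the stimulus names are invented.

    # Sketch: randomized congruity-judgment trials with instant feedback.
    import random

    trials = [
        {"audio": "narrow_congruous_01.wav", "congruous": True},
        {"audio": "narrow_incongruous_01.wav", "congruous": False},
        {"audio": "contrastive_congruous_01.wav", "congruous": True},
        {"audio": "contrastive_incongruous_01.wav", "congruous": False},
    ]
    random.shuffle(trials)  # phase three presents all trials in random order

    for trial in trials:
        print(f"Playing {trial['audio']} ...")  # audio playback would go here
        answer = input("Congruous? (y/n): ").strip().lower() == "y"
        print("Correct!" if answer == trial["congruous"] else "Not quite.")  # instant feedback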

We extracted acoustic cues from ASD children’s speech before and after training and performed statistical analyses. Our pilot results showed that both speech and sung speech training might have improved the use of prosodic cues such as intensity and f0 in marking focus across various focus positions (e.g. meanF0.tiff). However, ASD children may still have difficulties in integrating all the prosodic cues across focus conditions.

Figure: Mean f0 of narrow focus in the initial position before and after training

3pSC10 – Fearless Steps: Taking the Next Step towards Advanced Speech Technology for Naturalistic Audio

John H. L. Hansen – john.hansen@utdallas.edu
Aditya Joglekar – aditya.joglekar@utdallas.edu
Meena Chandra Shekar – meena.chandrashekar@utdallas.edu
Abhijeet Sangwan – abhijeet.sangwan@utdallas.edu
CRSS: Center for Robust Speech Systems;
University of Texas at Dallas (UTDallas),
Richardson, TX – 75080, USA

Popular version of paper 3pSC10 “Fearless Steps: Taking the Next Step towards Advanced Speech Technology for Naturalistic Audio”
Presented December 2-6, 2019,
178th ASA Meeting, San Diego

J.H.L. Hansen, A. Joglekar, A. Sangwan, L. Kaushik, C. Yu, M.M.C. Shekhar, “Fearless Steps: Taking the Next Step towards Advanced Speech Technology for Naturalistic Audio,” 178th Acoustical Society of America, Session: 3pSC12, (Wed., 1:00pm-4:00pm; Dec. 4, 2019), San Diego, CA, Dec. 2-6, 2019.

NASA’s Apollo program represents one of the greatest achievements of humankind in the 20th century. During a span of four years (from 1968 to 1972), nine lunar missions were launched, and 12 astronauts walked on the surface of the moon. To monitor and assess this massive team challenge, all communications between NASA personnel and astronauts were recorded on 30-track 1-inch analog audio tapes. NASA recorded these communications in order to review them and determine best practices for improving success in subsequent Apollo missions. The analog audio collection was essentially set aside when the Apollo program concluded with Apollo 17, and all tapes were stored in NASA’s tape archive. Clearly there are opportunities for research on this audio for both technological and historical purposes. The audio from the entire Apollo program amounts to well over 150,000 hours. Through the Fearless Steps initiative, CRSS-UTDallas digitized 19,000 hours of audio data from Apollo missions A-1, A-11, and A-13. The focus of the current effort is to contribute to the development of spoken language technology algorithms that analyze and understand various aspects of conversational speech. To achieve this goal, a new 30-track analog audio decoder was designed using the NASA SoundScriber.

Figure 1: (left) The SoundScriber device used to decode 30-track analog tapes; (right) the UTD-CRSS-designed read-head decoder retrofitted to the SoundScriber [5]

To develop efficient speech technologies for analyzing conversations and interactions, multiple sources of data such as interviews, flight journals, debriefs, and other text sources, along with videos, were used [8, 12, 13]. This initial research direction allowed CRSS-UTDallas to develop a document-linking web application called ‘Explore Apollo’, in which a user can access particular moments and stories from the Apollo-11 mission. Tools such as exploreapollo.org enable us to display our findings in an interactive manner [10, 14]. A case in point is illustrating team communication dynamics via a chord diagram. This diagram (Figure 2, right) shows the amount of conversation each astronaut has with the others during the mission, and the communication interactions with the capsule communicator (the only personnel directly communicating with the astronauts). Analyses such as these provide interesting insight into the interaction dynamics of astronauts in deep space.

Figure 2: (left) Explore Apollo online platform [14]; (right) chord diagram of astronauts’ conversations [9]
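
The bookkeeping behind a chord diagram like the one in Figure 2 (right) is essentially a matrix of how much talk flows between each pair of speakers. The sketch below builds such a matrix from a hypothetical list of diarized utterances; the names and durations are illustrative, not the corpus data.

    # Sketch: accumulate speaker-to-addressee talk time for a chord diagram.
    from collections import defaultdict

    utterances = [
        ("Armstrong", "CAPCOM", 12.4),  # (speaker, addressee, seconds of speech); hypothetical
        ("CAPCOM", "Armstrong", 8.1),
        ("Aldrin", "CAPCOM", 6.7),
        ("Collins", "CAPCOM", 9.3),
    ]

    matrix = defaultdict(float)
    for speaker, addressee, seconds in utterances:
        matrix[(speaker, addressee)] += seconds

    for (speaker, addressee), seconds in sorted(matrix.items()):
        print(f"{speaker:>9} -> {addressee:<9} {seconds:5.1f} s")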

With this massive aggregated dataset, CRSS-UTDallas sought to explore the problem of automatic speech understanding using algorithmic strategies to answer the questions: (1) when were they talking; (2) who was talking; (3) what was being said; and (4) how were they talking. Formulated in technical terms, these questions correspond to the following tasks: (1) Speech Activity Detection [5], (2) Speaker Identification, (3) Automatic Speech Recognition and Speaker Diarization [6], and (4) Sentiment and Topic Understanding [7].
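
To illustrate the first of these tasks in its simplest possible form, the sketch below flags audio frames whose short-time energy rises above a threshold. The real Fearless Steps systems are far more sophisticated (the Apollo channels are extremely noisy), so this is only a toy baseline showing what “speech activity detection” means.

    # Sketch: a toy energy-threshold speech activity detector.
    import numpy as np

    def detect_speech(samples, sample_rate, frame_ms=25, threshold_db=-35.0):
        frame_len = int(sample_rate * frame_ms / 1000)
        decisions = []
        for i in range(len(samples) // frame_len):
            frame = samples[i * frame_len:(i + 1) * frame_len]
            rms = np.sqrt(np.mean(frame ** 2)) + 1e-12           # avoid log of zero
            decisions.append(20 * np.log10(rms) > threshold_db)  # True = speech-like frame
        return decisions

    # One second of synthetic audio: half a second of silence, then a tone burst.
    sr = 8000
    audio = np.concatenate([np.zeros(sr // 2),
                            0.3 * np.sin(2 * np.pi * 440 * np.arange(sr // 2) / sr)])
    print(detect_speech(audio, sr))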

The general task of recognizing what was being said, by whom, and at what time is called the “diarization pipeline”. In an effort to answer these questions, CRSS-UTDallas developed solutions for automated diarization and transcript generation using deep learning strategies for speech recognition along with Apollo mission-specific language models [9]. We further developed algorithms to help answer the other questions, including detecting speech activity and speaker identity for segments of the corpus [6, 8].

Figure 3: Illustration of the Apollo Transcripts using the Transcriber tool

These transcripts allow us to search through the 19,000 hours of data to find keywords, phrases, or any other points of interest in a matter of seconds, as opposed to listening to hours of audio to find the answers [10, 11]. The transcripts, along with the complete Apollo-11 and Apollo-13 corpora, are now freely available at fearlesssteps.exploreapollo.org.
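
The sketch below shows why time-aligned transcripts make such searches fast: scan transcript segments for a keyword and report the matching timestamps. The segment list and its fields are hypothetical and do not reflect the released file format.

    # Sketch: keyword search over time-aligned transcript segments.
    segments = [
        {"start": 362145.2, "speaker": "CAPCOM", "text": "Eagle, Houston. You are go for landing."},
        {"start": 362401.8, "speaker": "Aldrin", "text": "Contact light."},
        {"start": 362410.0, "speaker": "Armstrong", "text": "Houston, Tranquility Base here. The Eagle has landed."},
    ]

    def search(keyword):
        hits = [s for s in segments if keyword.lower() in s["text"].lower()]
        for s in hits:
            print(f"{s['start']:>9.1f} s  {s['speaker']:>10}: {s['text']}")
        return hits

    search("Tranquility")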

Audio: Air-to-ground communication during the Apollo-11 Mission

Figure 4: The Fearless Steps Challenge website

Phase one of the Fearless Steps Challenge [15] involved performing five challenge tasks on 100 hours of time- and mission-critical audio selected from the 19,000 hours of the Apollo-11 mission. The five challenge tasks are:

  1. Speech Activity Detection
  2. Speaker Identification
  3. Automatic Speech Recognition
  4. Speaker Diarization
  5. Sentiment Detection

The goal of this Challenge was to evaluate the performance of state-of-the-art speech and language systems on naturalistic audio from large, task-oriented teams working in challenging environments. In the future, we aim to digitize audio from all of the Apollo missions and make it publicly available.

  1. A. Sangwan, L. Kaushik, C. Yu, J. H. L. Hansen, and D. W. Oard, “Houston, we have a solution: Using NASA Apollo program to advance speech and language processing technology,” INTERSPEECH, 2013.
  2. C. Yu, J. H. L. Hansen, and D. W. Oard, “‘Houston, We Have a Solution’: A case study of the analysis of astronaut speech during NASA Apollo 11 for long-term speaker modeling,” INTERSPEECH, 2014.
  3. D. W. Oard, J. H. L. Hansen, A. Sangwan, B. Toth, L. Kaushik, and C. Yu, “Toward access to multi-perspective archival spoken word content,” in Digital Libraries: Knowledge, Information, and Data in an Open Access Society, vol. 10075, pp. 77-82. Cham: Springer International Publishing, 2016.
  4. A. Ziaei, L. Kaushik, A. Sangwan, J. H. L. Hansen, and D. W. Oard, “Speech activity detection for NASA Apollo space missions: Challenges and solutions,” INTERSPEECH, 2014, pp. 1544-1548.
  5. L. Kaushik, A. Sangwan, and J. H. L. Hansen, “Multi-channel Apollo mission speech transcripts calibration,” INTERSPEECH, 2017, pp. 2799-2803.
  6. C. Yu and J. H. L. Hansen, “Active learning based constrained clustering for speaker diarization,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 11, pp. 2188-2198, Nov. 2017. doi: 10.1109/TASLP.2017.2747097
  7. L. Kaushik, A. Sangwan, and J. H. L. Hansen, “Automatic sentiment detection in naturalistic audio,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 8, pp. 1668-1679, Aug. 2017.
  8. C. Yu and J. H. L. Hansen, “A study of voice production characteristics of astronaut speech during Apollo 11 for speaker modeling in space,” Journal of the Acoustical Society of America, vol. 141, no. 3, p. 1605, Mar. 2017.
  9. L. Kaushik, “Conversational Speech Understanding in Highly Naturalistic Audio Streams,” PhD dissertation, The University of Texas at Dallas, 2017.
  10. A. Joglekar, C. Yu, L. Kaushik, A. Sangwan, and J. H. L. Hansen, “Fearless Steps corpus: A review of the audio corpus for the Apollo-11 space mission and associated challenge tasks,” NASA Human Research Program Investigators’ Workshop (HRP), 2018.
  11. L. Kaushik, A. Sangwan, and J. H. L. Hansen, “Apollo Archive Explorer: An online tool to explore and study space missions,” NASA Human Research Program Investigators’ Workshop (HRP), 2017.
  12. Apollo 11 Mission Overview: https://www.nasa.gov/mission_pages/apollo/missions/apollo11.html
  13. Apollo 11 Mission Reports: https://www.hq.nasa.gov/alsj/a11/a11mr.html
  14. Explore Apollo Document Linking Application: https://app.exploreapollo.org/
  15. J. H. L. Hansen, A. Joglekar, M. C. Shekar, V. Kothapally, C. Yu, L. Kaushik, and A. Sangwan, “The 2019 inaugural Fearless Steps Challenge: A giant leap for naturalistic audio,” Proc. INTERSPEECH, 2019.

Additional News Releases from 2017-2019:
https://www.youtube.com/watch?v=CTJtRNMac0E&t
https://www.nasa.gov/feature/nasa-university-of-texas-at-dallas-reveal-apollo-11-behind-the-scenes-audio
https://www.foxnews.com/tech/engineers-restore-audio-recordings-from-apollo-11-mission
https://www.nbcnews.com/mach/science/nasa-releases-19-000-hours-audio-historic-apollo-11-mission-ncna903721
https://www.insidescience.org/news/thousands-hours-newly-released-audio-tell-backstage-story-apollo-11-moon-mission