Speech Communication | Acoustics.org

1aSC1 – Untangling the link between working memory and understanding speech

Adam Bosen – adam.bosen@boystown.org
Boys Town National Research Hospital
555 N. 30th St
Omaha, NE 68131

Popular version of paper 1aSC1 Reconsidering reading span as the sole measure of working memory in speech recognition research
Presented Tuesday morning, June 8th, 2021
180th ASA Meeting, Acoustics in Focus

Many patients with cochlear implants have difficulty understanding speech. Cochlear implants often do not convey all of the pieces of speech, so the patient often has to use their memory of what they heard to fill in the missing pieces. As a result, their ability to understand speech is correlated with their performance on working memory tests (O’Neill et al., 2019). Working memory is our ability to simultaneously remember some information while working on other information. For example, if you want to add 57 and 38 in your head you need to sum 7+8 and then hold the result in memory while you work on summing 50+30.
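The mental arithmetic example above can be written out step by step. This is only an illustrative sketch: the function name `add_in_two_steps` is made up for this example, and it simply mirrors the "hold one partial result while working on the other" idea described in the text.

```python
def add_in_two_steps(a, b):
    """Add two whole numbers by summing the ones and the tens separately."""
    ones_sum = a % 10 + b % 10              # e.g. 7 + 8, the result held in memory
    tens_sum = (a - a % 10) + (b - b % 10)  # e.g. 50 + 30, worked on next
    return ones_sum, tens_sum, ones_sum + tens_sum


ones_sum, tens_sum, total = add_in_two_steps(57, 38)
print(ones_sum, tens_sum, total)  # prints: 15 80 95
```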

The reading span test is a common tool for measuring working memory. In this test, people see lists of alternating sentences and letters and must decide whether each sentence makes sense while simultaneously remembering the letters. The reading span test is important because it often predicts how well people with hearing loss can understand speech.
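To make the two parts of the task concrete, here is a minimal sketch of how one reading span trial could be scored. The scoring rule (letters counted only when recalled in the right position) and all names are assumptions made for illustration; the source does not describe how the test was scored.

```python
def score_trial(sense_responses, sense_answers, recalled_letters, shown_letters):
    """Score one hypothetical trial: sentence judgments plus letter recall."""
    # Processing part: did the person judge each sentence correctly?
    hits = sum(r == a for r, a in zip(sense_responses, sense_answers))
    sentence_accuracy = hits / len(sense_answers)
    # Memory part: letters recalled in the same position they were shown
    # (an assumed scoring rule, not taken from the study).
    letters_correct = sum(r == s for r, s in zip(recalled_letters, shown_letters))
    return sentence_accuracy, letters_correct


# Three sentences judged correctly, two of three letters recalled in position.
print(score_trial([True, False, True], [True, False, True], "FKQ", "FKR"))
```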

What we do not know is why the reading span test is related to speech understanding. One idea is that the ability to simultaneously remember and work on interpreting what you heard is essential for understanding unclear speech. To test this idea, our lab asked young adults with normal hearing to try to understand unclear sentences. These sentences were mixed with two other people talking in the background and then processed to mimic the limited signal a cochlear implant provides.

[Vocoded Speech in Babble.mp3, An unclear recording of someone saying “If the farm is rented, the rent must be paid” with other people talking in the background.]

The participants also completed memory tests that did not require them to work on anything, such as remembering lists of spoken numbers or words on a screen. These tests were as good as reading span at predicting how well these participants could understand unclear speech. This finding indicates that the reading span test is just one way to assess the parts of memory that relate to understanding speech. We conclude that the ability to simultaneously remember and work on information is not the only part of memory that helps us understand unclear speech.

We also tested older adults with cochlear implants on their ability to understand sentences and their ability to remember lists of numbers. Surprisingly, we did not find a relationship between remembering lists of numbers and understanding speech, as we did in young adults with normal hearing. This finding indicates that age and/or hearing loss change which parts of working memory relate to understanding speech. Previous work suggests that some parts of working memory tend to decline with age, while others do not (Bopp & Verhaeghen, 2005; Oberauer, 2005). We conclude that further untangling the link between working memory and understanding speech requires measuring different parts of memory using multiple tests.

Bopp, K. L., & Verhaeghen, P. (2005). Aging and Verbal Memory Span: A Meta-Analysis. Journal of Gerontology, 60B(5), 223–233. https://doi.org/10.1093/geronb/60.5.P223

O’Neill, E. R., Kreft, H. A., & Oxenham, A. J. (2019). Cognitive factors contribute to speech perception in cochlear-implant users and age-matched normal-hearing listeners under vocoded conditions. The Journal of the Acoustical Society of America, 146(1), 195–210. https://doi.org/10.1121/1.5116009

Oberauer, K. (2005). Control of the contents of working memory – A comparison of two paradigms and two age groups. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31(4), 714–728. https://doi.org/10.1037/0278-7393.31.4.714

3pSCb1 – Sound Teaching Online During COVID19

Anne C. Balant – balanta@newpaltz.edu
State University of New York at New Paltz
1 Hawk Dr.
New Paltz, NY 12561

Popular version of lightning round talk 3pSCb1 “Lab kits for remote and socially distanced instruction in a GE Acoustics Course”
Presented Thursday afternoon, June 10, 2021
180th ASA Meeting, Acoustics in Focus
Read the article in Proceedings of Meetings on Acoustics

How do you give students in an online acoustics course a hands-on lab experience?

At the State University of New York (SUNY) at New Paltz, students in the online sections of “The World of Sound” use a lab kit that was designed by the instructor. Students pay for shipment of the kits to their homes at the start of the course and return them at the end. They submit photos or videos of their activities along with their completed lab reports.

These kits had been in use for several years in an online post-baccalaureate program that prepares students for graduate study in speech-language pathology when the COVID19 pandemic radically changed the undergraduate on-campus version of the course.

“The World of Sound” is a four-credit general education lab science course. Undergraduates typically work in groups of three and share equipment within and across lab sections. By summer of 2020, it was clear that on-campus labs in the upcoming fall semester would have to meet social distancing requirements, with no sharing of materials, and that there could be a pivot to fully remote instruction at any time. The cost of the needed individual instructional materials was a consideration due to the fiscal impact of COVID19. A revised lab kit was developed that contains everything needed for seven labs, costs under $30.00, and has a shipping weight of less than two pounds.

About one-fourth of the undergraduates in the course chose to study fully remotely during fall 2020. These students had their kits shipped to them and they attended a weekly virtual lab session. Each student in the seated course was issued an individual lab kit in a shipping box that was addressed to the department for ease of return shipment. Seated labs were conducted with all required precautions including face coverings and social distancing. The kits contained everything needed for each lab, including basic supplies, so no equipment had to be shared.

Although the college was able to keep COVID19 rates low enough to stay open for the entire semester, about 15% of the students in the course transitioned to remote learning at least briefly for reasons such as illness or quarantine, missing a required COVID19 test date, financial issues, or COVID19-related family responsibilities or crises. Having their lab kits in their possession allowed these students to move seamlessly between seated and virtual lab sessions without falling behind. Every undergraduate who studied remotely for part or all of the semester completed the course successfully.

1aSC3 – Acoustic changes of speech during the later years of life

Benjamin Tucker – benjamin.tucker@ualberta.ca
Stephanie Hedges – shedges@ualberta.ca
Department of Linguistics
University of Alberta
Edmonton, Alberta T6G 2E7
Canada

Mark Berardi – mberardi@msu.edu
Eric Hunter – ejhunter@msu.edu
Department of Communicative Sciences and Disorders
Michigan State University
East Lansing, Michigan 48824

Popular version of paper 1aSC3
Presented Monday morning 11:15 AM – 12:00 PM, December 7, 2020
179th ASA Meeting, Acoustics Virtually Everywhere

Research into the perception and production of the human voice has shown that the human voice changes with age (e.g., Harnsberger et al., 2008). Most of the previous studies have investigated speech changes over time using groups of people of different ages, while a few studies have tracked how an individual speaker’s voice changes over time. The present study investigates three male speakers and how their voices change over the last 30 to 50 years of their lives.

We used publicly available archives of speeches given to large audiences on a semi-regular basis (generally with a couple of years between each instance). The speeches were given during the last 30-50 years of each speaker’s life, meaning that we have samples ranging from the speakers’ late 40s to early 90s. We extracted 5-minute samples (recordings and transcripts) from each speech. We then used the Penn forced-alignment system, which finds and marks the boundaries of individual speech sounds, to identify word and sound boundaries. Acoustic characteristics were extracted from the speech signal with a custom script written for the Praat software package.

In the present analysis, we investigate changes in the vowel space (the acoustic range of vowels a speaker has produced), fundamental frequency (what a listener hears as pitch), the duration of words and sounds (segments), and speech rate. We model the acoustic characteristics of our speakers using Generalized Additive Models (Hastie & Tibshirani, 1990), which allow for an investigation of non-linear changes over time.
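One common way to put a number on the size of a vowel space is the area of the polygon spanned by the corner vowels in a plot of the first two formants. The sketch below uses the shoelace formula for that area. The formant values are hypothetical, and the source does not say that the study summarized the vowel space this way, so treat it as an illustration of the concept only.

```python
def polygon_area(points):
    """Area of a simple polygon from its ordered (x, y) vertices (shoelace formula)."""
    area = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0


# Hypothetical (F2, F1) values in Hz for four corner vowels, listed in order
# around the edge of the vowel space. These are not data from the study.
corner_vowels = [(2300, 300), (1700, 700), (1100, 750), (900, 350)]
print(polygon_area(corner_vowels))  # area in Hz squared
```

Comparing this area, or the average position of the corner vowels, across decades is one simple way to describe a vowel space that shrinks or shifts over time.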

The results are discussed in terms of vocal changes over the lifespan in the speakers’ later years. Figure 1 illustrates the change in one speaker’s vowel space as he ages. We find that for this speaker the vowel space shifts to lower frequencies as he ages.

Figure 1 – An animation of Speaker 1’s vowel space and how it changes over a period of 50 years. Each colored circle represents a different decade.

We find a similar effect for fundamental frequency across all three speakers (Figure 2): the average fundamental frequency of their voices gets lower as they age and then starts to rise after the age of 70. Word and segment duration show the same kind of turning point. On average, as our three speakers age, their speech (at least when giving public speeches) gets faster and then slows down after around the age of 70.

Figure 2: Average fundamental frequency of our speakers’ speech as they age.

Figure 3: Average speech rate in syllables per second of our speakers’ speech as they age.

While on average our three speakers show a change in the progression of their speech at the age of 70, each speaker has their own unique speech trajectory. From a physiological standpoint, our data suggest that with age come not only laryngeal changes (changes to the voice) but also a decrease in respiratory health – especially expiratory volume – as has been reflected in previous studies.

1aSCb4 – Formant and voice quality changes as a function of age in women

Laura L. Koenig – koenig@haskins.yale.edu
Adelphi University
158 Cambridge Avenue
Garden City NY 11530

Susanne Fuchs – fuchs@leibniz-zas.de
Leibniz-Zentrum Allgemeine Sprachwissenschaft (ZAS)
Schützenstr. 18
10117 Berlin (Germany)

Annette Gerstenberg – gerstenberg@uni-potsdam.de
University of Potsdam, Department of Romance Studies
Am Neuen Palais 10
14467 Potsdam (Germany)

Moriah Rastegar – moriahrastegar@mail.adelphi.edu
Adelphi University
158 Cambridge Avenue
Garden City NY 11530

Popular version of the paper: 1aSCb4
Presented: December 7, 2020 at 10:15 AM – 11:00 AM EST

As we age, we change in many ways: how we look, the way we dress, and how we speak. Some of these changes are biological, and others are social. All are potentially informative to those we interact with.

Captions: “Younger (left) and older (right). Image obtained under the publicly-available creative commons licence. Aging manipulation courtesy of Jolanda Fuchs.”

******

The human voice is a rich source of information on speaker characteristics, and studies indicate that listeners are relatively accurate in judging the age of an unknown person they hear on the phone. Vocal signals carry information on (a) the sizes of the mouth and throat cavities, which change as we produce different vowels and consonants; (b) the voice pitch, which reflects characteristics of the vocal folds; and (c) the voice quality, which also reflects vocal-fold characteristics, but in complex and multidimensional ways. One voice quality dimension is how breathy a person’s voice is. Past studies on the acoustic effects of vocal aging have concentrated on formants, which reflect upper-airway cavity sizes, and fundamental frequency, which corresponds to voice pitch. Few studies have assessed voice quality.

Further, most past work investigated age by comparing people from different generations. Cross-generational studies can be confounded by changes in human living conditions such as nutrition, employment settings, and exposure to risk factors. To separate effects of aging from environmental factors, it is preferable to assess the same individuals at different time points. Such work is rather rare given the demands of re-connecting with people over long periods of time.

Here, we take advantage of the French LangAge corpus (https://www.uni-potsdam.de/langage/). Participants engaged in biographical interviews beginning in 2005 and were revisited in subsequent years. Our analysis is based on four women recorded in 2005 and 2015. We focus on women because biological aging may differ across the sexes. From all the words recorded, we selected two of the most frequent ones that were produced by every speaker at both time points and that did not include voiceless sounds.

Numbers 049 and 016 identify the two speakers, f=female, and the following value (e.g. 72) is the age of the speaker.

049_f_72_LeGris.wav	016_f_71_chiens.wav
049_f_82_LeBaigneur.wav	016_f_81_chiens.wav
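The naming scheme of these audio samples can be unpacked with a few lines of code. This is a small sketch, not part of the study’s tooling: the function name is invented, and treating the last part of each name as a free-form label is an assumption, since the source only explains the speaker number, the sex marker, and the age.

```python
from pathlib import Path


def parse_sample_name(filename):
    """Split a name such as '049_f_72_LeGris.wav' into its labelled parts."""
    speaker, sex, age, label = Path(filename).stem.split("_", 3)
    # 'label' is whatever follows the age; its meaning is not stated in the text.
    return {"speaker": speaker, "sex": sex, "age": int(age), "label": label}


samples = [
    "049_f_72_LeGris.wav",
    "049_f_82_LeBaigneur.wav",
    "016_f_71_chiens.wav",
    "016_f_81_chiens.wav",
]
for name in samples:
    print(parse_sample_name(name))
```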

Our results show that all four speakers have a lower cavity (formant) frequency at older ages. This may reflect lengthening of the upper airways, e.g. the larynx descends somewhat over time. Voice quality also changed, with breathier vocal quality at younger ages than at older ages. However, speakers differed considerably in the magnitude of these changes and in which measures demonstrated aging effects.

In some cultures, a breathy vocal quality is a marker of gender. Lifestyle changes in later life could lead to a reduced need to demonstrate “female” qualities. In our dataset, the speaker with the largest changes in breathiness was widowed between recording times. Along with physiological factors and social-communicative conditions, ongoing adaptation to gender roles as a person ages may also contribute to changes in voice quality.

2pSCb4 – The Science of Voice Acting

Colette Feehan – cmfeehan@iu.edu
Indiana University

Popular version of paper 2pSCb4
Presented Tuesday afternoon, December 8, 2020
179th ASA Meeting, Acoustics Virtually Everywhere

Many people do not realize that the “children” they hear in animation are actually voiced by adults¹. There are several reasons for this, including: children cannot work long hours, are difficult to direct, and their voices change as they grow. Using an adult who can simulate a child voice bypasses these issues, but surprisingly not all voice actors (VAs) can create a believable child voice.

Studying what VAs do can tell us about how the vocal tract works. They can speak intelligibly while contorting their mouths in unnatural ways. A few previous studies²⁻¹⁰ have looked at the acoustics of VAs, or just the sounds that they produce, such as changes in pitch, voice quality (how raspy or breathy a voice sounds), and what kinds of regional dialects they use. This study uses 3D ultrasound and acoustic data from 3 professional and 3 amateur VAs to start answering the question: What do voice actors do with their vocal tracts to sound like a child? There are several strategies for making your vocal tract sound smaller, and different actors combine them in different ways to create their child-like voices.

Looking at both the acoustics (the sounds they produce) and the ultrasound imaging of their vocal tracts, the strategies identified so far include gesture fronting and raising, and hyoid bone raising.

Gesture fronting and raising refers to the position of the tongue within the mouth while you speak. If you think about the location of your tongue when repeating “ta ka ta ka…” you will notice that your tongue touches the roof of your mouth in different places to make each of those consonant sounds: farther forward in the mouth for “ta” and farther back for “ka.” The same is true for vowels. Figure 1 comes from analyzing the recordings of their speech and shows that, for this subject, the tongue position for the adult versus child voice differs noticeably for the [i] and [ɑ] sounds. Given this information, we can then look at the ultrasound and see that the tongue positions are indeed farther forward (right) or higher in the mouth for the child voice (see Figure 2).

The hyoid bone is a small bone above the larynx in your neck. This bone interrupts the ultrasound signal and prevents an image from showing up, but looking at the location of this hyoid “shadow” can still give us information. If the hyoid shadow is raised and fronted, as seen in Figure 3, it might be the case that the actor is shortening their vocal tract by contracting muscles in their throat.

Figure 4 shows that, for this VA, the hyoid bone shadow was higher throughout the entire utterance while doing a child voice, meaning that the actor might physically shorten the whole vocal tract the whole time while speaking.
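Why would a shorter vocal tract sound smaller? A standard textbook simplification treats the vocal tract as a uniform tube closed at one end, whose resonances are (2n − 1) · c / (4L) for tube length L and speed of sound c. The sketch below applies that formula to two hypothetical tract lengths. The lengths and the speed of sound are assumed round values, and this simplified model is not the analysis used in the study.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # rough textbook value for warm, humid air


def tube_resonances(length_cm, count=3):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4 * length_cm)
            for n in range(1, count + 1)]


adult_like = tube_resonances(17.5)  # hypothetical adult tract length in cm
shortened = tube_resonances(15.0)   # hypothetical shortened tract in cm
print(adult_like)
print(shortened)
```

In this model every resonance rises when the tube gets shorter, which is consistent with the idea that raising the hyoid and shortening the tract pushes the voice toward the higher resonances of a smaller speaker.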

Data from VAs can help identify alternative pronunciations for speech sounds, which could help people with speech impediments and could also help trans individuals sound closer to their identity.

References

1. Holliday, C. (2012). Emotion capture: Vocal performances by children in the computer-animated film. Alphaville: Journal of Film and Screen Media, 3 (Summer 2012). Web. ISSN: 2009-4078.
2. Starr, R. L. (2015). Sweet voice: The role of voice quality in a Japanese feminine style. Language in Society, 44(1), 1-34.
3. Teshigawara, M. (2003). Voices in Japanese animation: A phonetic study of vocal stereotypes of heroes and villains in Japanese culture. Dissertation.
4. Teshigawara, M. (2004). Vocally expressed emotions and stereotypes in Japanese animation: Voice qualities of the bad guys compared to those of the good guys. Journal of the Phonetic Society of Japan, 8(1), 60-76.
5. Teshigawara, M., & Murano, E. Z. (2004). Articulatory correlates of voice qualities of good guys and bad guys in Japanese anime: An MRI study. In Proceedings of INTERSPEECH (pp. 1249-1252).
6. Teshigawara, M., Amir, N., Amir, O., Wlosko, E., & Avivi, M. (2007). Effects of random splicing on listeners’ perceptions. In 16th International Congress of Phonetic Sciences (ICPhS).
7. Teshigawara, M. (2009). Vocal expressions of emotions and personalities in Japanese anime. In Izdebski, K. (ed.), Emotions of the Human Voice, Vol. III Culture and Perception. San Diego: Plural Publishing, 275-287.
8. Teshigawara, K. (2011). Voice-based person perception: Two dimensions and their phonetic properties. ICPhS XVII, 1974-1977.
9. Uchida, T. (2007). Effects of F0 range and contours in speech upon the image of speakers’ personality. Proc. 19th ICA Madrid. http://www.seaacustica.es/WEB_ICA_07/fchrs/papers/cas-03-024.pdf
10. Lippi-Green, R. (2011). English with an accent: Language, ideology and discrimination in the United States. Retrieved from https://ebookcentral.proquest.com

1aSCa3 – Training effects on speech prosody production by Cantonese-speaking children with autism spectrum disorder

Si Chen – sarah.chen@polyu.edu.hk
Bei Li
Fang Zhou
Angel Wing Shan Chan
Tempo Po Yi Tang
Eunjin Chun
Phoebe Choi
Chakling Ng
Fiona Cheng
Xinrui Gou

Department of Chinese and Bilingual Studies
The Hong Kong Polytechnic University
11 Yuk Choi Road, Hung Hom, Hong Kong, China

Popular version of paper 1aSCa3
Presented Monday, December 07, 2020, 9:30 AM – 10:15 AM EST
179th ASA Meeting, Acoustics Virtually Everywhere

Speakers can use prosodic variations to express their intentions, states and emotions. Specifically, the relatively new information of an utterance, namely the focus, is often associated with an expanded range of prosodic cues. The main types of focus include broad, narrow, and contrastive focus. Broad focus involves focus in a whole sentence (A: What did you say? B: [Emily ate an apple]_FOCUS), whereas narrow focus emphasizes one constituent asked in the question (A: What did Emily eat? B: Emily ate an [apple]_FOCUS). Contrastive focus rejects alternative statements (A: Did Emily eat an orange? B: (No,) Emily ate an [apple]_FOCUS).

Children with autism spectrum disorder (ASD) have been reported to show difficulties in using speech prosody to mark focus. The presented research aims to test whether speech training and sung speech training may improve the use of speech prosody to mark focus. Fifteen Cantonese-speaking ASD children finished pre- and post-training speech production tasks and received either speech or sung speech training. In the pre- and post-training speech production tasks, we designed games to measure participants’ ability to mark focus in conversations. In the training tasks, we aimed to improve the mapping between acoustic cues and information structure categories through a series of tasks. The conversations used in sung speech training were designed with melodies that imitated the change of acoustic cues in speech.

Training.mp4, An example of congruous and incongruous conversation pairs in sung speech training

Both training methods consisted of three phases. In the first phase, participants listened attentively to congruous conversation pairs in a designed game. In the second phase, participants were told that each odd trial was incongruous (the focus type that the question elicited did not match that of the answer) and each even trial was congruous. They needed to attend to the differences between the odd and even trials. In the third phase, all the trials were presented in a random order, and participants needed to judge whether a pair was congruous or not. Instant feedback was provided after each response.
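The ordering of trials in the second and third phases can be sketched as follows. The number of trials, the fixed random seed, and the wording of the feedback are all assumptions for illustration; the source describes only the odd/even rule, the random order, and the presence of instant feedback.

```python
import random


def phase_two_labels(n_trials):
    """Phase 2: odd-numbered trials are incongruous, even-numbered ones congruous."""
    return ["incongruous" if trial % 2 == 1 else "congruous"
            for trial in range(1, n_trials + 1)]


def phase_three_labels(n_trials, seed=0):
    """Phase 3: the same mix of trials, presented in a random order."""
    labels = phase_two_labels(n_trials)
    random.Random(seed).shuffle(labels)  # seed is fixed only to make the demo repeatable
    return labels


def feedback(response, truth):
    """Instant feedback after each judgment (wording is hypothetical)."""
    return "correct" if response == truth else "incorrect"


print(phase_two_labels(4))
print(phase_three_labels(4))
print(feedback("congruous", "incongruous"))
```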

We extracted acoustic cues from ASD children’s speech before and after training and performed statistical analyses. Our pilot results showed that both speech and sung speech training might have improved the use of prosodic cues such as intensity and f0 in marking focus across various focus positions (e.g., mean f0, shown below). However, ASD children may still have difficulties in integrating all the prosodic cues across focus conditions.

Mean f0 of narrow focus in the initial position before and after training
