–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–
Language is a uniquely human capacity. Members of other species communicate, but those communications are neither as complex nor as interactional as human language. In spite of its greater complexity, however, human language evolved within the constraints of the mammalian auditory system, and for individual children, spoken language must develop within the constraints of their own auditory systems. Although the great majority of children can hear sounds at birth, a tremendous amount of auditory development takes place after birth, extending through puberty. This development happens in the central auditory pathways, so the ability to perform more complex operations on acoustic signals does not mature until near puberty. A reasonable proposal, then, is that any condition that delays the development of a child’s auditory system can disrupt language development, especially for the aspects of language that depend most on sophisticated auditory functions. This study explored that proposal. It also explored the idea that two conditions already known to affect language development negatively, poverty and premature birth, may exert some of that influence by disturbing the normal timing of auditory development.
Developmental scientists have long searched for the roots of the delays in language acquisition exhibited by children living in poverty. That work has focused on language models in the child’s environment, which are fewer in quantity and poorer in quality than what a middle-class child hears. But although this factor has been found to explain some of the effects of poverty on children’s language abilities, the relationships are never very strong. This means that some other factor or factors must also be contributing.
Children born prematurely are known to have delayed language development, and the usual explanation is that the auditory environment in the neonatal intensive care unit is at once too noisy and too devoid of the human voice, which is available in utero. Again, those explanations might account for some of the deficit, but animal studies show that the simple act of being removed from the womb before full gestation leads to neurodevelopmental challenges. Obviously, those challenges for animals do not include language acquisition, but for human children born too early, language acquisition can be a challenge.
Our primary findings are:
Relatively strong relationships exist between measures of auditory function and language measures, and these relationships were strongest for the most complex language skills.
Socioeconomic status and gestational age at birth were related to measures of both auditory and language development.
Effects on language development of both socioeconomic status and gestational age at birth could be explained by their effects on auditory function, to at least some extent.
These results mean that delays in the development of the biological structures and functions that underlie language are happening long before a language problem can be diagnosed. We need to provide intensive interventions right from birth, focused not only on discrete language targets but on the whole child.
Instagram: @karenperta
Elmhurst University, Elmhurst, IL, 60126, United States
Zhaoyan Zhang, UCLA School of Medicine, Los Angeles, CA, United States.
Donna Erickson, Haskins Laboratories, New Haven, CT, United States.
Ryoko Hayashi, Kobe University, Kobe, Japan.
Toshiyuki Sadanobu, Kyoto University, Kyoto, Japan.
–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–
Most people can recall a day so bad that it ended with screaming into a pillow. Emotional vocalization is a critical part of human communication. People scream when having fun at sporting events and theme parks, for safety, or to be heard in noisy environments. However, not all screaming and yelling is the same. Some may lose their voice after one night at a concert; others can protest on the picket lines for days without a problem. Why is this?
The purpose of this study is to analyze and compare angry, emotional screaming with trained, “healthy” yelling using magnetic resonance imaging (MRI) and acoustic measures. The MRI shows movements inside the vocal tract so we can understand exactly how these sounds are created. In this study, a single vocally trained female participant produced angry screaming versus “healthy” belting. Here is a look inside the vocal tract during these sounds:
Figure 1. MRI images of Scream versus Belt (courtesy of authors).
Acoustic measures help characterize the differences between the sounds and provide further insight into how they are produced. Both MRI and acoustic analyses help determine the features that are harmful to the vocal folds versus the features that allow the voice to be heard safely. Here is a power spectrum view that shows frequency (x-axis) and intensity (y-axis) of the sounds as one snapshot in time:
Figure 2. Power spectrum of Scream versus Belt (courtesy of authors).
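For readers curious how a view like Figure 2 is produced, here is a minimal Python sketch that computes a power spectrum from one short “snapshot in time” of a recording. The file name, window length, and plotting choices are illustrative assumptions, not the authors’ actual analysis settings.

```python
# Minimal sketch: computing a power spectrum from a short audio snapshot,
# similar in spirit to Figure 2. The file name and 50 ms window are
# illustrative, not the authors' actual analysis settings.
import numpy as np
import soundfile as sf
import matplotlib.pyplot as plt

signal, sample_rate = sf.read("scream_snapshot.wav")  # hypothetical recording
if signal.ndim > 1:
    signal = signal.mean(axis=1)                      # mix down to mono

# Take one short analysis frame (about 50 ms) and apply a window
n = int(0.05 * sample_rate)
frame = signal[:n] * np.hanning(n)

# Power spectrum: squared magnitude of the FFT, shown on a dB scale
spectrum = np.abs(np.fft.rfft(frame)) ** 2
freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
power_db = 10 * np.log10(spectrum + 1e-12)            # small offset avoids log(0)

plt.plot(freqs, power_db)
plt.xlabel("Frequency (Hz)")                          # x-axis: frequency
plt.ylabel("Power (dB)")                              # y-axis: intensity
plt.title("Power spectrum of one analysis frame")
plt.show()
```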
Based on the MRI measures, we determined that Scream was produced with 1) the highest position of the larynx, 2) the largest mouth opening, and 3) the smallest throat space. Belt was produced with 1) a high larynx position, though to a less extreme degree, 2) a smaller mouth opening, and 3) more open space in the throat. Scream was also produced at an extremely high pitch, twice that of Belt.
During Scream, the tight throat space led to prolonged contact and strong compression of the vocal folds. This allowed Scream to reach higher intensity (stronger harmonic peaks in the spectrum) at high frequencies (above 6 kHz) than Belt. However, this high-intensity production came at the cost of vocal fold injury. The Scream caused the participant to develop small vocal fold lesions that took about two weeks to resolve:
Figure 3. Participant vocal fold lesions following scream (courtesy of authors).
In conclusion, Scream is a primitive vocalization that is produced with a very constrictive action that is similar to swallowing. During swallowing, the vocal tract and vocal folds squeeze and compress in order to keep food and liquid from going into the airway. In contrast, Belt is a learned, trained behavior that is less constrictive and “overrides” innate tendencies for squeezing the vocal tract and pressing the vocal folds. During screaming, the highly constrictive actions of the vocal tract put extra strain and force on the vocal folds that contribute to vocal fold injury. Though it may take some practice, safe yelling should not be tight, feel painful, or cause voice loss. Use caution. Happy yelling!
–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–
Artificial intelligence is now remarkably good at cloning human voices, but can it convincingly imitate a disordered voice? Our findings suggest that while AI excels at copying healthy speech, it still struggles to capture the acoustic complexity of dysphonia, a condition that makes the voice sound rough, strained, or breathy.
Dysphonia affects millions of people and often reduces speech intelligibility, especially in noisy environments. Because collecting large amounts of patient data can be difficult, researchers wondered whether AI voice-cloning technologies might one day help them simulate disordered speech for training, education, or early-stage clinical research.
To test this idea, the team recorded 12 speakers (six with healthy voices and six with dysphonia) and used a commercial AI system to create a digital “voice clone” of each person. These AI voices were trained using about one minute of recorded speech for each speaker. More than 60 listeners participated in three online experiments designed to evaluate whether the AI-generated voice clones truly preserved the qualities of disordered speech.
Watch the short video below to see exactly how the experiment worked.
In the listening tasks, participants heard pairs of sentences. Sometimes both sentences were from the real speaker, sometimes both were AI-generated, and sometimes one was real and one was AI. In some trials, listeners tried to decide whether the two voices came from the same person. In others, they had to identify which sentence (if any) was produced by AI. A third task tested how well listeners understood real and AI-generated dysphonic speech in background noise.
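For the third task, speech-in-noise stimuli are typically built by mixing each sentence with background noise at a fixed signal-to-noise ratio (SNR). The sketch below shows that standard mixing step; the file names, noise type, and 0 dB SNR are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch: mixing a speech recording with background noise at a chosen
# signal-to-noise ratio (SNR), the usual way speech-in-noise stimuli are built.
# File names, noise type, and the 0 dB SNR are illustrative assumptions;
# the paper does not specify its exact mixing parameters.
import numpy as np
import soundfile as sf

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db."""
    noise = noise[:len(speech)]                       # assume noise is at least as long
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + noise

speech, sr = sf.read("sentence.wav")                  # hypothetical mono speech file
noise, _ = sf.read("babble_noise.wav")                # hypothetical mono noise file
mixture = mix_at_snr(speech, noise, snr_db=0)         # 0 dB SNR as an example
sf.write("sentence_in_noise.wav", mixture, sr)
```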
In the first experiment, as shown in Figure 1, listeners were very accurate when both samples were real. Here, accuracy refers to the proportion of trials in which listeners correctly judged whether the two voice samples were from the same or different speakers. Accuracy dropped slightly when both samples were AI-generated. But when one sample was real and the other AI-generated, performance fell sharply, especially for healthy voices, where the AI clones often sounded strikingly similar to the real person.
Figure 1. Bar plot showing the percentage of correct speaker discrimination responses (same vs. different speaker) across conditions for normal and dysphonic voices. Bars represent mean percentages with 95% confidence intervals. Note: RL = real speech; AI = AI-generated speech.
Figure 2. Bar plot showing the percentage of correct AI identification responses across conditions for normal and dysphonic voices. Bars represent mean percentages with 95% confidence intervals. Note: RL = real speech; AI = AI-generated speech.
A second experiment asked listeners to identify which sentences were AI-generated. For healthy voices, AI was difficult to detect. For dysphonic voices, however, listeners were more successful — suggesting the AI system smoothed out or failed to reproduce key features of dysphonia. The results are shown in Figure 2.
The final experiment delivered the strongest finding: AI-generated dysphonic voices were significantly more intelligible than real dysphonic voices when played in background noise. In other words, the AI unintentionally “cleaned up” the voice disorder, creating speech that sounded clearer and easier to understand than the real dysphonic voices. The results are shown in Figure 3.
These results demonstrate that while AI voice cloning is impressively realistic for healthy speech, it does not yet capture the natural irregularities of disordered voices. For now, real patient recordings remain essential. However, this research highlights the exciting potential of improved AI tools in the future.
Figure 3. Mean intelligibility scores (IS) of normal and dysphonic groups in real and AI-generated voice conditions. The IS values vary from 0 to 1. Error bars indicate standard errors. Note: RL = real speech; AI = AI-generated speech.
University of Texas at Arlington, 701 S Nedderman Dr, Arlington, TX, 76019, United States
Abby Walker
Cynthia Clopper
Popular version of 3pSC2 – Effects of dialect familiarity and dialect exposure on cross-dialect lexical processing
Presented at the 188th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0037947
–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–
Some of us grow up hearing mostly one dialect, while others of us have substantial exposure to multiple dialects, maybe because we’re in a pretty bidialectal community, or because we’ve moved between dialect regions. In our work, we’re investigating whether these differences in exposure to pronunciation variation impact how people recognize words.
Word recognition is a bit like a race in your head: there are lots of potential contenders, and your brain’s job is to sift through them really quickly. One thing that makes it easier to recognize a word is if it’s been recently activated: so if you hear “bed” then see <BED>, you’ll be really quick to recognize the written word, compared to if you had just heard a completely unrelated word, like “hat.” One thing that makes it harder to recognize a word is if you’ve just heard a competitor (a word that is pretty similar and therefore confusable with the target word), in this case, hearing something like “bad” before <BED>. We think activating these competitors makes recognition harder because when you hear the word “bad,” you suppress or inhibit competitor words like “bed.”
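To make the priming and inhibition idea concrete, here is a toy Python sketch of the “race” described above. The numbers and the similarity rule are invented purely for illustration; they are not the authors’ model or the actual experimental data.

```python
# Toy sketch of the priming/inhibition idea described above. The activation
# values and similarity rule are invented for illustration only; they are not
# the authors' model or the experimental results.

def sounds_similar(word_a, word_b):
    """Crude competitor check: same length and differing by exactly one letter."""
    if len(word_a) != len(word_b):
        return False
    return sum(a != b for a, b in zip(word_a, word_b)) == 1

def recognition_ease(prime, target, boost=0.4, inhibition=0.3):
    """Rough 'ease of recognition' for a written target after a spoken prime:
    higher means the written word is recognized faster."""
    baseline = 1.0
    if prime == target:                 # identity prime: hear "bed", see BED
        return baseline + boost
    if sounds_similar(prime, target):   # competitor prime: hear "bad", see BED
        return baseline - inhibition    # the competitor was suppressed, so it's slower
    return baseline                     # unrelated prime: hear "hat"

for prime in ("bed", "bad", "hat"):
    print(f"hear '{prime}', see BED -> ease = {recognition_ease(prime, 'bed'):.2f}")
```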
Figure 1: Map of the USA showing three major dialect regions: Northern, Midland, and Southern. Image courtesy of Cynthia Clopper; boundaries are based on Labov, Ash & Boberg (2006).
Okay, so how does exposure to variability impact all this? In our experiments, participants heard words from different dialects and then matched them to written words. What we’ve been finding across a few studies with American English listeners is that people who have lived in multiple dialect regions (specifically moving between those highlighted in Figure 1) get less of a boost for matching words (“bed” > BED), and more robustly, less of a cost for competitor words (“bad” > BED). Why would this be the case? We think that if you’ve been exposed to lots of variation in pronunciation, you need to be more flexible as a listener: being too certain about what you heard (“oh, that’s definitely ‘bed’, not ‘bad’”) could make it difficult to recover when you’re wrong, and if there are lots of dialects around, there’s more room for you to be wrong! Importantly, we don’t see one style of listening as better or worse than another; rather, it looks like how we process words adapts to the particular challenges of the speech communities we grow up in!
University of Illinois Urbana-Champaign, Champaign, IL 61820, United States
Carly Wingfield2, Charlie Nudelman1, Joshua Glasner3, Yvonne Gonzales Redman1,2
Department of Speech and Hearing Science, University of Illinois, Urbana-Champaign
School of Music, University of Illinois Urbana-Champaign
School of Graduate Studies, Delaware Valley University
Popular version of 2aAAa1 – Does Virtual Reality Match Reality? Vocal Performance Across Environments
Presented at the 188th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0037496
–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–
Singers often perform in very different spaces than where they practice—sometimes in small, dry rooms and later in large, echoey concert halls. Many singers have shared that this mismatch can affect how they sing. Some say they end up singing too loudly because they can’t hear themselves well, while others say they hold back because the room makes them sound louder than they are. Singers have to adapt their voices to unfamiliar concert halls, and often they have very little rehearsal time to adjust.
While research has shown that instrumentalists adjust their playing depending on the room they are in, there’s been less work looking specifically at singers. Past studies have found that different rooms can change how singers use their voices, including how their vibrato (the small, natural variation in pitch) changes depending on the room’s echo and clarity.
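Vibrato is usually quantified by its rate (how fast the pitch wobbles, in Hz) and its extent (how far it swings, in cents). The sketch below estimates both from a pitch (F0) contour; the synthetic contour, frame rate, and 5.5 Hz modulation are stand-ins for real data and are not from this study.

```python
# Minimal sketch: estimating vibrato rate and extent from a pitch (F0) contour.
# A real analysis would start from an F0 track extracted by a pitch tracker;
# the synthetic contour below (5.5 Hz vibrato, ~2% depth) stands in for data.
import numpy as np

frame_rate = 100.0                                     # F0 frames per second
t = np.arange(0, 2.0, 1.0 / frame_rate)                # 2 seconds of a sustained note
f0 = 440.0 * (1 + 0.02 * np.sin(2 * np.pi * 5.5 * t))  # synthetic vibrato contour

cents = 1200 * np.log2(f0 / np.mean(f0))               # pitch deviation in cents
spectrum = np.abs(np.fft.rfft(cents - cents.mean()))
freqs = np.fft.rfftfreq(len(cents), d=1.0 / frame_rate)

vibrato_rate = freqs[np.argmax(spectrum)]              # dominant modulation frequency
vibrato_extent = (cents.max() - cents.min()) / 2       # half the peak-to-peak swing
print(f"vibrato rate ~ {vibrato_rate:.1f} Hz, extent ~ {vibrato_extent:.0f} cents")
```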
At the University of Illinois, our research team from the School of Music and the Department of Speech and Hearing Science is studying whether virtual reality (VR) can help singers train for different acoustic environments. The big question: can a virtual concert hall give singers the same experience as a real one?
To explore this, we created virtual versions of three real performance spaces on campus (Figure 1).
Figure 1. 360-degree images of the three performance spaces investigated.
Singers wore open-backed headphones and a VR headset while singing into a microphone in a sound booth. As they sang, their voices were processed in real time to sound like they were in one of the real venues, and this audio was sent back to them through the headphones. In the video (Video 1), you can see a singer performing in the sound booth where the acoustic environments were recreated virtually. In the audio file (Audio 1), you can hear exactly what the singer heard: the real-time, acoustically processed sound being sent back to their ears through the open-backed headphones.
Video 1. Singer performing in the virtual environment.
Audio 1. Example of real-time auralized feedback.
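A common way to recreate a hall’s acoustics, and plausibly the core operation behind the processed feedback described above, is to convolve the dry voice signal with the room’s measured impulse response. The offline Python sketch below shows that operation; the file names are hypothetical, and the actual system works in real time with low latency rather than on saved files.

```python
# Offline sketch of auralization: convolving a dry voice recording with a room
# impulse response (RIR) so it sounds as if it were sung in that hall. The real
# feedback system does this in real time; file names here are hypothetical.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry_voice, sr = sf.read("dry_voice.wav")          # voice captured in the sound booth
rir, sr_rir = sf.read("concert_hall_rir.wav")     # measured impulse response of the hall
assert sr == sr_rir, "voice and RIR must share a sample rate"

wet_voice = fftconvolve(dry_voice, rir)           # apply the hall's acoustics
wet_voice /= np.max(np.abs(wet_voice))            # normalize to avoid clipping
sf.write("voice_in_hall.wav", wet_voice, sr)
```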
Ten trained singers performed in both the actual venues (Figure 2) and in virtual versions of those same spaces.
Figure 2. Singer performing in the real environment.
We then compared how they sang and how they felt during each performance. The results showed no significant differences in how the singers used their voices or how they perceived the experience between real and virtual environments.
This is an exciting finding because it suggests that virtual reality could become a valuable tool in voice training. If a singer can’t practice in a real concert hall, a VR simulation could help them get used to the sound and feel of the space ahead of time. This technology could give students greater access to performance preparation and allow voice teachers to guide students through the process in a more flexible and affordable way.
Cleveland Hearing and Speech Center
6001 Euclid Avenue Suite 100
Cleveland, OH, 44103
United States
Popular version of 2aSC4 – From intention to understanding and back again: How a simple message of ‘Catch and Pass’ can build language in children
Presented at the 187th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0035171
–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–
Project ELLA (Early Language and Literacy for All) is an exciting new program designed to boost early language and literacy skills in young children. The program uses a simple yet powerful message, “Catch and Pass,” to teach parents, grandparents, daycare teachers and other caregivers the importance of having back-and-forth conversations with children from birth. These interactions help build and strengthen the brain’s language pathways, setting the foundation for lifelong learning.
Developed by the Cleveland Hearing & Speech Center, Project ELLA focuses on helping children in the greater Cleveland area, especially those in under-resourced communities. Community health workers visit neighborhoods to build trust with neighbors, raise awareness about the importance of responsive interactions for language development, and help empower families to put their children on track for later literacy (see Video 1). They also identify children who may need more help through speech and language screenings. For children identified as needing more help, Project ELLA offers free speech-language therapy and support for caregivers at Cleveland Hearing & Speech Center.
The success of the project is measured by tracking the number of children and families served, the progress of children in therapy, the knowledge and skills of caregivers and teachers, and the partnerships established in the community (see Figure 1). Project ELLA is a groundbreaking model that has the potential to transform language and literacy development in Cleveland and beyond.