How Online Meetings Change Your Voice—and How We Measure It

Akira Takeuchi – takeuchi.akira@studio-infinity.co.jp

Instagram: @akira_reference_
Studio Infinity
Tokyo, Minato-ku, 107-0061
Japan

Additional Authors
Yixuan Huang, Miki Morinaga, Satoshi Tsuboya, Yuto Hosoya, and Sungyoung Kim

Popular version of 1pCA5 – Evaluating speech quality for automatic transcription in videoconferencing
Presented at the 189th ASA Meeting
Read the abstract at https://eppro02.ativ.me/appinfo.php?page=Session&project=ASAASJ25&id=3983372&server=eppro02.ativ.me

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

Ghosts in Online Meetings: Why Clear Voices Sometimes Get Lost
Have you ever noticed that voices suddenly sound unclear during an online meeting—even though the speaker believes they are speaking clearly? You may find yourself straining to listen, missing words, or misunderstanding what was said. These problems are surprisingly common and can be difficult to fix on the spot, especially when meeting participants are not familiar with the technical details of videoconference systems.

We study this hidden problem by developing a machine learning–based system that can evaluate speech quality without interrupting the meeting. Our goal is to detect sound problems automatically, before they become frustrating for listeners.

AI Transcription vs. Human Listening
Humans are remarkably good at understanding speech, even when parts of it are missing or covered by noise. When a word is unclear, listeners often guess the meaning from context and still understand the overall message.

Automatic speech transcription, which is now widely used to record and summarize meetings, works very differently. AI systems analyze sound exactly as it is received. If speech is distorted, masked by noise, or partially missing, transcription accuracy drops sharply.

We turn this weakness into a strength. By measuring how much transcription quality degrades, we use AI transcription accuracy as an indicator of speech quality. In other words, if the transcription struggles, listeners are likely struggling too.
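
To make the idea concrete, here is a minimal sketch (not the actual evaluation pipeline used in our study): it scores an ASR transcript of degraded audio against a reference transcript using the standard word error rate (WER), implemented from scratch so no particular toolkit is assumed; the example sentences are hypothetical.

```python
# Minimal sketch (not the study's pipeline): use transcription accuracy as a
# proxy for speech quality by comparing an ASR transcript of degraded audio
# against a reference transcript of the clean speech.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate computed as edit distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example: the higher the WER, the more listeners likely struggled too.
clean_reference = "please send the updated report before friday"
asr_of_degraded = "please send the dated report for friday"
print(f"WER = {word_error_rate(clean_reference, asr_of_degraded):.2f}")  # ~0.29
```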

Causes of sound deterioration
Sound deterioration during online meetings can be grouped into four main causes (Figure 1):

  • Speech factors
    • How and what the speaker says, such as speaking speed or clarity.
  • Acoustic factors
    • Background noise or room reverberation that affects sound before it reaches the microphone.
  • System factors
    • Problems with microphones, cables, or audio hardware quality.
  • Communication factors
    • Network issues that occur after sound is converted into digital data, such as data compression or packet loss.

Our research focuses on communication factors, which are especially important in videoconference systems and differ from those in traditional phone calls.

Figure 1. Causes of sound deterioration

Packet loss simulation
Online meetings send sound over the internet in small pieces called packets; sometimes these packets are lost during transmission, causing brief gaps or distortions in the sound. In our experiments, we use the SILK audio codec, a system commonly used to convert speech into a format suitable for network transmission.

To study this effect, we intentionally simulate packet loss and create artificially degraded speech. This allows us to generate large amounts of training data and teach machine learning models what poor communication quality sounds like.
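
As a simplified illustration of what packet-loss simulation means in practice (a sketch only, not our SILK-based pipeline; real codecs also apply loss concealment), the snippet below silences randomly chosen 20-millisecond frames of a waveform. All parameters are placeholders.

```python
# Simplified sketch (not the SILK-based pipeline used in the study): imitate
# packet loss by silencing randomly chosen 20 ms frames of a speech waveform.
import numpy as np

def simulate_packet_loss(audio: np.ndarray, sample_rate: int,
                         loss_rate: float = 0.1, packet_ms: float = 20.0,
                         seed: int = 0) -> np.ndarray:
    """Return a copy of `audio` with roughly `loss_rate` of its packets zeroed out."""
    rng = np.random.default_rng(seed)
    packet_len = int(sample_rate * packet_ms / 1000)  # samples per packet
    degraded = audio.copy()
    for start in range(0, len(audio), packet_len):
        if rng.random() < loss_rate:
            degraded[start:start + packet_len] = 0.0  # this packet was "lost"
    return degraded

# Hypothetical usage on a one-second, 16 kHz test tone.
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 220 * t)
degraded = simulate_packet_loss(clean, sr, loss_rate=0.1)
```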

Figures 2 and 3 compare a clean speech signal with a packet-loss-simulated version, showing how missing data changes the sound structure.

Figure 2. Spectrogram of clean speech (click image to listen)

Figure 3. Spectrogram of packet loss simulated speech (click image to listen)

Why This Matters
As online meetings become a permanent part of work and education, unnoticed sound degradation can silently reduce communication quality. By automatically detecting these problems, our approach helps make virtual meetings clearer, fairer, and less tiring—so no one’s voice turns into a “ghost” in the meeting.

More details can be found on our R&D webpage.

How Sound Moves on Mars

Understanding acoustic propagation within the Martian environment helps scientists understand the planet and will inform future missions. #ASA_ASJ2025 #ASA189

HONOLULU, Dec. 4, 2025 — Acoustic signals have been important markers during NASA’s Mars missions. Measurements of sound can provide information both about Mars itself — such as turbulence in its atmosphere, changes in its temperature, and its surface conditions — and about the movement of the Mars rovers.

Making the best use of these sound measurements requires an accurate understanding of how sound propagates on Mars. Charlie Zheng, a professor of mechanical and aerospace engineering at Utah State University, and his doctoral student Hayden Baird, who is partially sponsored by the Utah Space Grant Consortium Graduate Fellowship, will present their work simulating sound propagation on Mars Thursday, Dec. 4, at 8:25 a.m. HST as part of the Sixth Joint Meeting of the Acoustical Society of America and Acoustical Society of Japan, running Dec. 1-5 in Honolulu, Hawaii.

A graph showing the simulated sound propagation on Mars. Credit: Charlie Zheng

“We expect that the study will provide deeper insight into weather and terrain effects on acoustic propagation in environments that are not easily measured,” said Zheng. “The Martian environment is obviously one of them.”

Baird and Zheng’s work uses NASA’s measurements of the atmospheric conditions and terrain on Mars, most of which have been previously modeled at meter-scale resolutions. They also had access to decades of data about the red planet’s atmospheric composition and properties, as well as seismic studies that measure the ground porosity — all factors that play into how sound propagates.
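
As a small illustration of how atmospheric properties feed into such modeling (a back-of-the-envelope calculation for this article, not the presenters' simulation code), the ideal-gas sound speed in Mars' mostly carbon dioxide atmosphere can be estimated from temperature alone:

```python
# Back-of-the-envelope illustration (not the presenters' simulation code):
# ideal-gas sound speed c = sqrt(gamma * R * T / M) for a mostly-CO2 atmosphere.
import math

GAMMA_CO2 = 1.3    # approximate adiabatic index of CO2 at Martian temperatures
R = 8.314          # universal gas constant, J/(mol K)
M_CO2 = 0.044      # molar mass of CO2, kg/mol

def sound_speed(temperature_k: float) -> float:
    return math.sqrt(GAMMA_CO2 * R * temperature_k / M_CO2)

# A typical Martian surface temperature near 210 K gives roughly 230 m/s,
# well below the ~340 m/s familiar at Earth's surface.
print(f"{sound_speed(210.0):.0f} m/s")
```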

“The setup of the simulation model used in this study relies heavily on previous results from multiple scientific disciplines,” said Baird.

Focusing on the Jezero crater, the 2021 landing and exploration site of NASA’s Perseverance rover and its attached Ingenuity helicopter, the researchers simulated how sound moves through and scatters off the region’s complex terrains, whether it comes from a moving or stationary source. This will help them understand how other atmospheres compare to our own.

The researchers hope their model will help identify signals and patterns that indicate specific Martian atmospheric events. In the longer term, it may even help with sensor designs for future missions to other planets or moons to study atmospheric conditions.

“This study is a beginning to dive into many potential areas of planetary research,” said Zheng.

Contact:
AIP Media
+1 301-209-3090
media@aip.org

——————— MORE MEETING INFORMATION ———————

Main Meeting Website: https://acousticalsociety.org/honolulu-2025/
Technical Program: https://eppro02.ativ.me/web/planner.php?id=ASAASJ25

ASA PRESS ROOM
In the coming weeks, ASA’s Press Room will be updated with newsworthy stories and the press conference schedule at https://acoustics.org/asa-press-room/.

LAY LANGUAGE PAPERS
ASA will also share dozens of lay language papers about topics covered at the conference. Lay language papers are summaries (300-500 words) of presentations written by scientists for a general audience. They will be accompanied by photos, audio, and video. Learn more at https://acoustics.org/lay-language-papers/.

PRESS REGISTRATION
ASA will grant free registration to credentialed and professional freelance journalists. If you are a reporter and would like to attend the meeting and/or press conferences, contact AIP Media Services at media@aip.org. For urgent requests, AIP staff can also help with setting up interviews and obtaining images, sound clips, or background information.

ABOUT THE ACOUSTICAL SOCIETY OF AMERICA
The Acoustical Society of America is the premier international scientific society in acoustics devoted to the science and technology of sound. Its 7,000 members worldwide represent a broad spectrum of the study of acoustics. ASA publications include The Journal of the Acoustical Society of America (the world’s leading journal on acoustics), JASA Express Letters, Proceedings of Meetings on Acoustics, Acoustics Today magazine, books, and standards on acoustics. The society also holds two major scientific meetings each year. See https://acousticalsociety.org/.

ABOUT THE ACOUSTICAL SOCIETY OF JAPAN
ASJ publishes a monthly journal in Japanese, the Journal of the Acoustical Society of Japan, as well as a bimonthly journal in English, Acoustical Science and Technology, which is available online at no cost at https://www.jstage.jst.go.jp/browse/ast. These journals include technical papers and review papers. Special issues are occasionally organized and published. The Society also publishes textbooks and reference books to promote acoustics across a range of topics. See https://acoustics.jp/en/.

Sound(e)scape: Can a Sonic Break Improve Cognitive Performance?

Alaa Algargoosh – algargoosh@vt.edu

Virginia Polytechnic Institute and State University (Virginia Tech), Perry St, Blacksburg, VA, 24061, United States

Megan Wysocki
Virginia Polytechnic Institute and State University (Virginia Tech)

Amneh Hamida
RWTH Aachen University

Popular version of 1pNSa4 – Cognitive Restoration in Virtual Interactions with Indoor Acoustic Environments
Presented at the 189th ASA Meeting
Read the abstract at https://eppro02.ativ.me//web/index.php?page=Session&project=ASAASJ25&id=3977035

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

People often associate restorative experiences with nature: the sound of birds, wind, or flowing water. But what if indoor spaces could offer their own kind of mental escape, not through what we see, but through how we interact with sound?

This idea began with a simple observation. When you walk into a space and notice how your footsteps and voice are reflected back to you, the echoes create a subtle sense of awe. According to Attention Restoration Theory, experiences that evoke fascination and effortless engagement can help replenish mental resources. We wanted to explore whether these moments of acoustic interaction between a person and a space could invite gentle attention and, in turn, support cognitive restoration. In Attention Restoration Theory, this is referred to as soft fascination, a type of stimulus that is engaging but not overwhelming.

Exploring Echoes as a Path to Mental Restoration:
During a live demonstration at the MIT Museum, we used auralization, a technology that lets you hear your voice as if you were in a different place by applying that place's sound signature, or impulse response. A volunteer hummed into the acoustic signature of Hagia Sophia. Later, the entire audience hummed together and reflected on their experiences. The conversation pointed to the potential of such acoustic interaction to support a meditative state by affecting the sense of space, time, and self.
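
At its core, this kind of auralization is a convolution of a "dry" recording with the measured impulse response of a space. The sketch below shows the basic operation with NumPy/SciPy; the file names are hypothetical, and this is not the software used in the demonstration.

```python
# Basic auralization sketch (hypothetical file names; not the demonstration's
# software): convolve a dry voice recording with a room impulse response.
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

sr_voice, voice = wavfile.read("dry_humming.wav")              # hypothetical mono recording
sr_ir, impulse_response = wavfile.read("hagia_sophia_ir.wav")  # hypothetical impulse response
assert sr_voice == sr_ir, "resample so both files share one sample rate"

# Convolution "places" the voice inside the measured space.
wet = fftconvolve(voice.astype(np.float64), impulse_response.astype(np.float64))
wet /= np.max(np.abs(wet))  # normalize to avoid clipping

wavfile.write("voice_in_hagia_sophia.wav", sr_voice, (wet * 32767).astype(np.int16))
```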

This inspired a controlled experiment to study the restorative potential of indoor acoustic environments. We asked people to experience different sound environments (Figure 1) and measured their cognitive activity before and after each interaction. Early results suggest that interactive acoustics may support attention restoration, depending on the acoustic characteristics, opening a new way of thinking about how sound affects us indoors.

Figure 1: Virtual interaction with an acoustic environment during the experiment, where a person hears their own voice transformed through the acoustic signature of another space.

Why does this matter?
We spend most of our time indoors, yet discussions of restorative environments often focus on natural settings. This is especially relevant for workplaces and schools, where mental fatigue is common. It may also hold meaningful promise for neurodivergent individuals, including those with ADHD, who often benefit from environments that support attention without overstimulating it.
We imagine applications in immersive restorative spaces where people can interact with sound to reset and return to their activities with greater clarity. We also envision subtle integration into transitional spaces such as staircases, corridors, and building entrances that provide gentle cognitive relief as people move throughout their day.

Sound(e)scape reframes acoustics not as background, but as a tool for well-being. By understanding how interactive sound shapes attention and cognition, we can design buildings that do not simply avoid harmful noise. They can actively help the mind take a restorative break.

Figure 2: Visualization of interacting with different acoustic environments. Left: Max Addae vocalizing in an office environment (MIT Media Lab). Middle: “Hagia Sophia – Muhammad, Allah, Abu Bakr” by Rabe!, licensed under CC BY-SA 3.0 (https://commons.wikimedia.org/wiki/File:Hagia_Sophia_-_Muhammad,_Allah,_Abu_Bakr.jpg) Cropped and one person (Max Addae) added by Alaa Algargoosh. Right: Max Addae vocalizing in Boston Symphony Hall.

Sound recordings:
1. Vocalizing in an office environment (MIT Media Lab). (Voice: Max Addae)
2. Virtual vocalization in Hagia Sophia. (Voice: Max Addae)
3. Virtual vocalization in Boston Symphony Hall. (Voice: Max Addae)
The virtual vocalizations were generated using impulse responses available in the ODEON software library.

Laying the Groundwork to Diagnose Speech Impairments in Children with Clinical AI #ASA188

Building large AI datasets can help experts provide faster, earlier diagnoses.

Media Contact:
AIP Media
301-209-3090
media@aip.org

PedzSTAR
pedzstarpr@mekkymedia.com

NEW ORLEANS, May 19, 2025 – Speech and language impairments affect over a million children every year, and identifying and treating these conditions early is key to helping these children overcome them. Clinicians struggling with time, resources, and access are in desperate need of tools to make diagnosing speech impairments faster and more accurate.

Marisha Speights, assistant professor at Northwestern University, built a data pipeline to train clinical artificial intelligence tools for childhood speech screening. She will present her work Monday, May 19, at 8:20 a.m. CT as part of the joint 188th Meeting of the Acoustical Society of America and 25th International Congress on Acoustics, running May 18-23.

Children at a childcare center. Credit: GETTY CC BY-SA

AI-based speech recognition and clinical diagnostic tools have been in use for years, but these tools are typically trained and used exclusively on adult speech. That makes them unsuitable for clinical work involving children. New AI tools must be developed, but there are no large datasets of recorded child speech for these tools to be trained on, in part because building these datasets is uniquely challenging.

“There’s a common misconception that collecting speech from children is as straightforward as it is with adults — but in reality, it requires a much more controlled and developmentally sensitive process,” said Speights. “Unlike adult speech, child speech is highly variable, acoustically distinct, and underrepresented in most training corpora.”

To remedy this, Speights and her colleagues began collecting and analyzing large volumes of child speech recordings to build such a dataset. However, they quickly realized a problem: The collection, processing, and annotation of thousands of speech samples is difficult without exactly the kind of automated tools they were trying to build.

“It’s a bit of a catch-22,” said Speights. “We need automated tools to scale data collection, but we need large datasets to train those tools.”

In response, the researchers built a computational pipeline to turn raw speech data into a useful dataset for training AI tools. They collected a representative sample of speech from children across the country, verified transcripts and enhanced audio quality using their custom software, and provided a platform that will enable detailed annotation by experts.
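
The team's pipeline itself is custom software not shown here. Purely to illustrate the kind of automated screening step such a pipeline can include before recordings reach human annotators, here is a hypothetical sketch; all thresholds and file names are invented for the example.

```python
# Hypothetical illustration only (the study's pipeline is custom software):
# a simple screening pass that flags recordings with obvious quality problems
# before they are sent to expert annotators.
import numpy as np
from scipy.io import wavfile

def screen_recording(path: str) -> dict:
    sr, audio = wavfile.read(path)
    audio = audio.astype(np.float64)
    if audio.ndim > 1:                       # mix multi-channel recordings down to mono
        audio = audio.mean(axis=1)
    peak = np.max(np.abs(audio)) or 1.0
    audio /= peak
    rms = np.sqrt(np.mean(audio ** 2))
    clipped_fraction = np.mean(np.abs(audio) > 0.99)
    return {
        "path": path,
        "duration_s": len(audio) / sr,
        "too_quiet": rms < 0.01,             # invented threshold
        "likely_clipped": clipped_fraction > 0.01,
    }

# Hypothetical usage: route flagged files for re-recording, pass the rest to annotators.
report = screen_recording("child_sample_001.wav")
needs_review = report["too_quiet"] or report["likely_clipped"]
```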

The result is a high-quality dataset that can be used to train clinical AI, giving experts access to a powerful set of tools to make diagnosing speech impairments much easier.

“Speech-language pathologists, health care clinicians and educators will be able to use AI-powered systems to flag speech-language concerns earlier, especially in places where access to specialists is limited,” said Speights.

——————— MORE MEETING INFORMATION ———————
Main Meeting Website: https://acousticalsociety.org/new-orleans-2025/
Technical Program: https://eppro01.ativ.me/src/EventPilot/php/express/web/planner.php?id=ASAICA25

ASA PRESS ROOM
In the coming weeks, ASA’s Press Room will be updated with newsworthy stories and the press conference schedule at https://acoustics.org/asa-press-room/.

LAY LANGUAGE PAPERS
ASA will also share dozens of lay language papers about topics covered at the conference. Lay language papers are summaries (300-500 words) of presentations written by scientists for a general audience. They will be accompanied by photos, audio, and video. Learn more at https://acoustics.org/lay-language-papers/.

PRESS REGISTRATION
ASA will grant free registration to credentialed and professional freelance journalists. If you are a reporter and would like to attend the meeting and/or press conferences, contact AIP Media Services at media@aip.org. For urgent requests, AIP staff can also help with setting up interviews and obtaining images, sound clips, or background information.

ABOUT THE ACOUSTICAL SOCIETY OF AMERICA
The Acoustical Society of America is the premier international scientific society in acoustics devoted to the science and technology of sound. Its 7,000 members worldwide represent a broad spectrum of the study of acoustics. ASA publications include The Journal of the Acoustical Society of America (the world’s leading journal on acoustics), JASA Express Letters, Proceedings of Meetings on Acoustics, Acoustics Today magazine, books, and standards on acoustics. The society also holds two major scientific meetings each year. See https://acousticalsociety.org/.

ABOUT THE INTERNATIONAL COMMISSION FOR ACOUSTICS
The purpose of the International Commission for Acoustics (ICA) is to promote international development and collaboration in all fields of acoustics including research, development, education, and standardization. ICA’s mission is to be the reference point for the acoustic community, becoming more inclusive and proactive in our global outreach, increasing coordination and support for the growing international interest and activity in acoustics. Learn more at https://www.icacommission.org/.

Can You Hear Me Now? Fixing Speech Recognition Tech So It Works for Every Child

Vishal Shrivastava – shrivastava_vishal@outlook.com

Northwestern University, School of Communication, Department of Communication Sciences and Disorders, Frances Searle Building, 2240 Campus Drive, Evanston, Illinois, 60208-3550, United States

Marisha Speights, Akangkshya Pathak

Popular version of 1aCA3 – Inclusive automatic speech recognition: A framework for equitable speech recognition in children with disorders
Presented at the 188th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0037269

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

Imagine a child says, “My chest hurts,” but the computer hears “My test works.”
In critical moments, mistranscriptions like this can have serious consequences.

Today’s voice recognition tools—like those behind Siri, Alexa, or educational apps—work well for most adults, but often struggle with children’s voices—especially when speech is accented, disordered, or still developing.

We set out to change that by fine-tuning existing systems to ensure every child’s voice is heard clearly, fairly, and without bias.

The Problem: When Technology Leaves Children Behind
Automatic Speech Recognition (ASR) turns spoken words into text. It powers voice commands, transcription tools, and increasingly, educational apps and therapies. But there’s a hidden flaw: these systems are trained mostly on adult speech.

Here’s why that matters:

  • Children’s voices are different—higher-pitched, more variable, and constantly evolving.
  • There’s less data. Collecting labeled child speech—especially from children with disorders—is hard, costly, and ethically complex.
  • Bias creeps in. When systems hear mostly one kind of speech (like adult American English), they treat that as “normal.”

Everything else—like a 6-year-old with a stutter—gets mistaken for noise.

This isn’t just a technical problem. It’s an equity problem. The very tools meant to support children in learning, therapy, or communication often fail to understand them.

Our Approach: Teaching AI to Listen Fairly

Figure: Fine-tuning Whisper ASR with domain classifiers and a gradient reversal layer

We fine-tuned OpenAI’s Whisper ASR to better understand how children speak—not just by adding more data, but by teaching it to focus on what matters. Like other speech models, Whisper doesn’t only learn the words being said; it also picks up on who is speaking—age, accent, gender, and speech disorders. These cues are baked into the audio, and because Whisper was trained mostly on clear adult speech, it often misinterprets child or disordered speech, treating it as noise.

To fix this, we added a second learning objective—imagine two students in training. One transcribes speech; the other tries to guess traits like the speaker’s age or gender, using only the first student’s notes. Now we challenge the first: transcribe accurately, but reveal nothing about who’s speaking. The better they hide those clues while getting the words right, the better they’ve learned.

That’s the heart of adversarial debiasing. During fine-tuning, we added a domain classifier—like the second student—trained to detect speaker traits from Whisper’s internal audio features. We then inserted a gradient reversal layer to make that job harder, forcing the encoder to scrub away identity cues. All the while, the model continued learning to transcribe—only now, it did so without relying on speaker-specific shortcuts.
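
For readers curious what a gradient reversal layer looks like in code, here is a minimal PyTorch sketch of the general technique (an illustration only, not our exact architecture; the dimensions and training step are placeholders):

```python
# Minimal PyTorch sketch of adversarial debiasing with gradient reversal
# (an illustration of the general technique, not the authors' exact model).
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainClassifier(nn.Module):
    """The 'second student': tries to guess speaker traits from the encoder's features."""
    def __init__(self, feature_dim: int, num_trait_classes: int, lam: float = 1.0):
        super().__init__()
        self.lam = lam
        self.head = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, num_trait_classes)
        )

    def forward(self, encoder_features: torch.Tensor) -> torch.Tensor:
        # Pool over time, then reverse gradients so the encoder learns to hide speaker
        # traits while the transcription objective keeps pushing it to get the words right.
        pooled = encoder_features.mean(dim=1)
        return self.head(GradientReversal.apply(pooled, self.lam))

# Placeholder training step: total loss = transcription loss + adversarial trait loss.
# loss = transcription_loss + trait_criterion(domain_classifier(encoder_out), trait_labels)
```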

Figure: Architecture of the end-to-end domain adversarial fine-tuning

The Result: Technology That Includes Every Voice
By learning to ignore traits that shouldn’t affect understanding—like age, accent, or disordered articulation—Whisper becomes more robust, accurate, and fair. It no longer gets tripped up by voices that don’t match what it was originally trained on. That means fewer errors for children who speak differently, and a step closer to voice technology that works for everyone—not just the majority.

Finding the Right Tools to Interpret Crowd Noise at Sporting Events with AI

Jason Bickmore – jbickmore17@gmail.com

Instagram: @jason.bickmore
Brigham Young University, Department of Physics and Astronomy, Provo, Utah, 84602, United States

Popular version of 1aCA4 – Feature selection for machine-learned crowd reactions at collegiate basketball games
Presented at the 188th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0037270

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

A mixture of traditional and custom tools is enabling AI to make sense of an unexplored frontier: crowd noise at sporting events.

The unique link between a crowd’s emotional state and its sound makes crowd noise a promising way to capture feedback about an event continuously and in real-time. Transformed into feedback, crowd noise would help venues improve the experience for fans, sharpen advertisements, and support safety.

To capture this feedback, we turned to machine learning, a popular strategy for making tricky connections. While the tools required to teach AI to interpret speech from a single person are well-understood (think Siri), the tools required to make sense of crowd noise are not.

To find the best tools for this job, we began with a simpler task: teaching an AI model to recognize applause, chanting, distracting the other team, and cheering at college basketball and volleyball games (Fig. 1).

Figure 1: Machine learning identifies crowd behaviors from crowd noise. We helped machine learning models recognize four behaviors: applauding, chanting, cheering, and distracting the other team. Image courtesy of byucougars.com.

We began with a large list of tools, called features, some drawn from traditional speech processing and others created using a custom strategy. After applying five methods to eliminate all but the most powerful features, a blend of traditional and custom features remained. A model trained with these features recognized the four behaviors with at least 70% accuracy.
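
As a rough sketch of this general workflow (assuming librosa and scikit-learn; the features, selection method, and file names below are placeholders rather than the study's actual choices), extracting traditional features, keeping the strongest ones, and training a classifier might look like this:

```python
# Rough sketch of the general workflow (not the study's exact features,
# selection methods, or data): extract traditional audio features, keep the
# strongest ones, and train a classifier on labeled crowd-noise clips.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

def traditional_features(path: str) -> np.ndarray:
    """MFCC means plus a few spectral statistics for one audio clip."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    rms = librosa.feature.rms(y=y).mean()
    return np.concatenate([mfcc, [centroid, rms]])

# Hypothetical placeholders; in practice many labeled clips per behavior are needed.
paths = ["applause_01.wav", "chant_01.wav", "cheer_01.wav", "distraction_01.wav"]
labels = ["applause", "chant", "cheer", "distraction"]

X = np.stack([traditional_features(p) for p in paths])
selector = SelectKBest(f_classif, k=8)          # keep only the most informative features
X_selected = selector.fit_transform(X, labels)
model = RandomForestClassifier().fit(X_selected, labels)
```
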

Based on these results, we concluded that, when interpreting crowd noise, both traditional and custom features have a place. Even though crowd noise is not the situation the traditional tools were designed for, they are still valuable. The custom tools are useful too, complementing the traditional tools and sometimes outperforming them. The tools’ success at recognizing the four behaviors indicates that a similar blend of traditional and custom tools could enable AI models to navigate crowd noise well enough to translate it into real-time feedback. In future work, we will investigate the robustness of these features by checking whether they enable AI to recognize crowd behaviors equally well at events other than college basketball and volleyball games.