Alaa Algargoosh – algargoosh@vt.edu
Virginia Polytechnic Institute and State University (Virginia Tech), Perry St, Blacksburg, VA, 24061, United States
Megan Wysocki
Virginia Polytechnic Institute and State University (Virginia Tech)
Amneh Hamida
RWTH Aachen University
Popular version of 1pNSa4 – Cognitive Restoration in Virtual Interactions with Indoor Acoustic Environments
Presented at the 189th ASA Meeting
Read the abstract at https://eppro02.ativ.me//web/index.php?page=Session&project=ASAASJ25&id=3977035
–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–
People often associate restorative experiences with nature: the sound of birds, wind, or flowing water. But what if indoor spaces could offer their own kind of mental escape, not through what we see, but through how we interact with sound?
This idea began with a simple observation. When you walk into a space and notice how your footsteps and voice are reflected back to you, the echoes create a subtle sense of awe. According to Attention Restoration Theory, experiences that evoke fascination and effortless engagement can help replenish mental resources. We wanted to explore whether these moments of acoustic interaction between a person and a space could invite gentle attention and, in turn, support cognitive restoration. In Attention Restoration Theory, this is referred to as soft fascination, a type of stimulus that is engaging but not overwhelming.
Exploring Echoes as a Path to Mental Restoration
During a live demonstration at the MIT Museum, we used auralization, a technology that lets you hear your voice as if you were in a different place by applying that place’s sound signature, or impulse response. A volunteer hummed into the acoustic signature of Hagia Sophia. Later, the entire audience hummed together and reflected on their experiences. The conversation pointed to the potential of such acoustic interaction to support a meditative state by shaping the sense of space, time, and self.
This inspired a controlled experiment to study the restorative potential of indoor acoustic environments. We asked people to experience different sound environments (Figure 1) and measured their cognitive activity before and after each interaction. Early results suggest that interactive acoustics may support attention restoration, depending on the acoustic characteristics, opening a new way of thinking about how sound affects us indoors.
Figure 1: Virtual interaction with an acoustic environment during the experiment, where a person hears their own voice transformed through the acoustic signature of another space.
Why does this matter?
We spend most of our time indoors, yet discussions of restorative environments often focus on natural settings. Restorative indoor acoustics are especially relevant for workplaces and schools, where mental fatigue is common. They may also hold meaningful promise for neurodivergent individuals, including those with ADHD, who often benefit from environments that support attention without overstimulating it.
We imagine applications in immersive restorative spaces where people can interact with sound to reset and return to their activities with greater clarity. We also envision subtle integration into transitional spaces such as staircases, corridors, and building entrances, providing gentle cognitive relief as people move through their day.
Sound(e)scape reframes acoustics not as background, but as a tool for well-being. By understanding how interactive sound shapes attention and cognition, we can design buildings that do not simply avoid harmful noise. They can actively help the mind take a restorative break.
Figure 2: Visualization of interacting with different acoustic environments. Left: Max Addae vocalizing in an office environment (MIT Media Lab). Middle: “Hagia Sophia – Muhammad, Allah, Abu Bakr” by Rabe!, licensed under CC BY-SA 3.0 (https://commons.wikimedia.org/wiki/File:Hagia_Sophia_-_Muhammad,_Allah,_Abu_Bakr.jpg); cropped, with one person (Max Addae) added by Alaa Algargoosh. Right: Max Addae vocalizing in Boston Symphony Hall.
Sound recordings:
1. Vocalizing in an office environment (MIT Media Lab). (Voice: Max Addae)
2. Virtual vocalization in Hagia Sophia. (Voice: Max Addae)
3. Virtual vocalization in Boston Symphony Hall. (Voice: Max Addae)
The virtual vocalizations were generated using impulse responses available in the ODEON software library.
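For readers curious how a virtual vocalization like these can be produced, below is a minimal sketch of the underlying convolution step, assuming Python with the NumPy, SciPy, and soundfile libraries. The file names are placeholders, and this offline sketch is not the authors' exact pipeline (a live demonstration uses real-time auralization).

```python
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

# Load a "dry" (close-miked, low-reverberation) voice recording and the
# measured impulse response of the target space. File names are placeholders.
voice, fs = sf.read("dry_voice.wav")
room_ir, fs_ir = sf.read("hagia_sophia_ir.wav")
assert fs == fs_ir, "voice and impulse response must share a sample rate"
assert voice.ndim == 1 and room_ir.ndim == 1, "sketch assumes mono signals"

# Convolution stamps the space's reflections and reverberation onto the voice,
# so it sounds as if it were produced in that space.
auralized = fftconvolve(voice, room_ir)
auralized /= np.max(np.abs(auralized))  # normalize to avoid clipping

sf.write("voice_in_hagia_sophia.wav", auralized, fs)
```

The same voice convolved with different impulse responses (an office, Hagia Sophia, a concert hall) yields the contrasting environments heard in the recordings above.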
Vishal Shrivastava – shrivastava_vishal@outlook.com
Northwestern University, School of Communication, Department of Communication Sciences and Disorders, Frances Searle Building, 2240 Campus Drive, Evanston, Illinois, 60208-3550, United States
Marisha Speights, Akangkshya Pathak
Popular version of 1aCA3 – Inclusive automatic speech recognition: A framework for equitable speech recognition in children with disorders
Presented at the 188th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0037269
–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–
Imagine a child says, “My chest hurts,” but the computer hears “My test works.”
In critical moments, mistranscriptions like this can have serious consequences.
Today’s voice recognition tools, like those behind Siri, Alexa, or educational apps, work well for most adults but often struggle with children’s voices, especially when speech is accented, disordered, or still developing.
We set out to change that by fine-tuning existing systems to ensure every child’s voice is heard clearly, fairly, and without bias.
The Problem: When Technology Leaves Children Behind
Automatic Speech Recognition (ASR) turns spoken words into text. It powers voice commands, transcription tools, and increasingly, educational apps and therapies. But there’s a hidden flaw: these systems are trained mostly on adult speech.
Here’s why that matters:
- Children’s voices are different—higher-pitched, more variable, and constantly evolving.
- There’s less data. Collecting labeled child speech—especially from children with disorders—is hard, costly, and ethically complex.
- Bias creeps in. When systems hear mostly one kind of speech (like adult American English), they treat that as “normal.”
Everything else—like a 6-year-old with a stutter—gets mistaken for noise.
This isn’t just a technical problem. It’s an equity problem. The very tools meant to support children in learning, therapy, or communication often fail to understand them.
Our Approach: Teaching AI to Listen Fairly
Fine-tuning Whisper ASR with domain classifiers and gradient reversal layer
We fine-tuned OpenAI’s Whisper ASR to better understand how children speak—not just by adding more data, but by teaching it to focus on what matters. Like other speech models, Whisper doesn’t only learn the words being said; it also picks up on who is speaking—age, accent, gender, and speech disorders. These cues are baked into the audio, and because Whisper was trained mostly on clear adult speech, it often misinterprets child or disordered speech, treating it as noise.
To fix this, we added a second learning objective—imagine two students in training. One transcribes speech; the other tries to guess traits like the speaker’s age or gender, using only the first student’s notes. Now we challenge the first: transcribe accurately, but reveal nothing about who’s speaking. The better they hide those clues while getting the words right, the better they’ve learned.
That’s the heart of adversarial debiasing. During fine-tuning, we added a domain classifier—like the second student—trained to detect speaker traits from Whisper’s internal audio features. We then inserted a gradient reversal layer to make that job harder, forcing the encoder to scrub away identity cues. All the while, the model continued learning to transcribe—only now, it did so without relying on speaker-specific shortcuts.
Architecture of the end-to-end domain adversarial fine-tuning
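As a rough illustration of the approach described above, here is a minimal sketch of domain-adversarial fine-tuning with a gradient reversal layer, assuming PyTorch and the Hugging Face transformers implementation of Whisper. The class names, the number of speaker-trait classes, and the batch fields (input_features, labels, domain) are illustrative assumptions, not the authors' actual code.

```python
import torch
import torch.nn as nn
from transformers import WhisperForConditionalGeneration


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; flips and scales gradients on the way back."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class DomainClassifier(nn.Module):
    """The 'second student': guesses speaker traits from the encoder's features."""
    def __init__(self, hidden_size, num_domains):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 256), nn.ReLU(), nn.Linear(256, num_domains)
        )

    def forward(self, encoder_states, lambd=1.0):
        pooled = encoder_states.mean(dim=1)              # average over time frames
        reversed_feats = GradientReversal.apply(pooled, lambd)
        return self.net(reversed_feats)


model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
domain_head = DomainClassifier(model.config.d_model, num_domains=4)  # hypothetical trait classes


def training_step(batch, lambd=0.5):
    # Standard ASR objective: transcribe the audio.
    outputs = model(input_features=batch["input_features"], labels=batch["labels"])
    # Adversarial objective: the domain head tries to recover speaker traits,
    # but the reversed gradients push the encoder to hide those cues.
    domain_logits = domain_head(outputs.encoder_last_hidden_state, lambd)
    domain_loss = nn.functional.cross_entropy(domain_logits, batch["domain"])
    return outputs.loss + domain_loss  # minimized jointly by one optimizer
```

The key design choice is that a single optimizer minimizes both losses: because the gradient reversal layer negates the domain classifier's gradients before they reach the encoder, the encoder is simultaneously rewarded for accurate transcription and penalized for leaking speaker identity.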
The Result: Technology That Includes Every Voice
By learning to ignore traits that shouldn’t affect understanding—like age, accent, or disordered articulation—Whisper becomes more robust, accurate, and fair. It no longer gets tripped up by voices that don’t match what it was originally trained on. That means fewer errors for children who speak differently, and a step closer to voice technology that works for everyone—not just the majority.
Jason Bickmore – jbickmore17@gmail.com
Instagram: @jason.bickmore
Brigham Young University, Department of Physics and Astronomy, Provo, Utah, 84602, United States
Popular version of 1aCA4 – Feature selection for machine-learned crowd reactions at collegiate basketball games
Presented at the 188th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0037270
–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–
A mixture of traditional and custom tools is enabling AI to make sense of an unexplored frontier: crowd noise at sporting events.
The unique link between a crowd’s emotional state and its sound makes crowd noise a promising way to capture feedback about an event continuously and in real-time. Transformed into feedback, crowd noise would help venues improve the experience for fans, sharpen advertisements, and support safety.
To capture this feedback, we turned to machine learning, a popular strategy for making tricky connections. While the tools required to teach AI to interpret speech from a single person are well-understood (think Siri), the tools required to make sense of crowd noise are not.
To find the best tools for this job, we began with a simpler task: teaching an AI model to recognize applause, chanting, distracting the other team, and cheering at college basketball and volleyball games (Fig. 1).
Figure 1: Machine learning identifies crowd behaviors from crowd noise. We helped machine learning models recognize four behaviors: applauding, chanting, cheering, and distracting the other team. Image courtesy of byucougars.com.
We began with a large list of tools, called features, some drawn from traditional speech processing and others created using a custom strategy. After applying five methods to eliminate all but the most powerful features, a blend of traditional and custom features remained. A model trained with these features recognized the four behaviors with at least 70% accuracy.
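To make that workflow concrete, here is an illustrative sketch of feature selection followed by model training, assuming Python with pandas and scikit-learn. The file name, the feature table, and the single selection method shown (recursive feature elimination) are placeholders; the study applied five elimination methods and included its own custom crowd-noise features.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

# Each row is one labeled audio segment; columns are traditional speech features
# plus custom crowd-noise features; "behavior" is the label
# (applause, chant, distraction, or cheer). File name is a placeholder.
data = pd.read_csv("crowd_noise_features.csv")
X = data.drop(columns=["behavior"])
y = data["behavior"]

# One possible elimination method: recursive feature elimination keeps only
# the features that matter most to a given classifier.
selector = RFE(RandomForestClassifier(n_estimators=200), n_features_to_select=20)
X_selected = selector.fit_transform(X, y)
print("Kept features:", list(X.columns[selector.support_]))

# Train and evaluate a model on the reduced feature set.
scores = cross_val_score(RandomForestClassifier(n_estimators=200), X_selected, y, cv=5)
print(f"Mean accuracy: {scores.mean():.2f}")
```

Comparing which features survive across several such elimination methods is one way to identify a robust blend of traditional and custom features.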
Based on these results, we concluded that, when interpreting crowd noise, both traditional and custom features have a place. Even though crowd noise is not the situation the traditional tools were designed for, they are still valuable. The custom tools are useful too, complementing the traditional tools and sometimes outperforming them. The tools’ success at recognizing the four behaviors indicates that a similar blend of traditional and custom tools could enable AI models to navigate crowd noise well enough to translate it into real-time feedback. In future work, we will investigate the robustness of these features by checking whether they enable AI to recognize crowd behaviors equally well at events other than college basketball and volleyball games.