Listening for ice: Teaching AI to detect ice using sound

Leora Robinson – leorarobinson13@gmail.com

Brigham Young University, Provo, UT, 84602, United States

Tracianne Neilsen

Popular version of 3pUW6 – Acoustic binary classification of ice cover conditions using deep learning
Presented at the 190th ASA Meeting
Read the abstract at https://eppro01.ativ.me/web/index.php?page=session&project=ASASPRING2026&id=4082831

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

Some Arctic animals don’t need to see ice to find it—they can hear it. Species like the beluga whale use sound to navigate through icy waters where visibility is limited, finding breathing holes in the ice without ever seeing them. This project asks a simple question: Can a computer learn to do the same? By analyzing acoustic signals, we show that a neural network can detect ice without relying on visual information.

Initial experiments were conducted in a laboratory tank (Figure 1) at the Brigham Young University Department of Physics and Astronomy. We took sound recordings when ice was and was not present on the surface of the water. Then, we trained a machine learning classifier to label the recordings as ‘ice’ or ‘no ice.’

Figure 1. Laboratory tank (side view). The image shows robotic arms mounted above a large transparent tank filled with water in a laboratory setting, with control screens attached.

For these experiments, we placed an underwater loudspeaker (transmitter) and an underwater microphone (hydrophone) in the tank. The transmitter produced ultrasonic chirps of increasing frequency, both with and without ice present. We added about 600 pounds of block ice to the tank and took one-second recordings before ice was added, while it was present, and after it melted. We took two additional sets of recordings for testing the neural network: one using block ice and one using pebble ice.
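As a rough illustration of a chirp of increasing frequency, the sketch below generates a sine sweep. The linear sweep shape, the frequency range, the duration, and the sample rate are all assumptions chosen for illustration; the article does not state the actual chirp parameters.

```python
import math

def linear_chirp(f_start_hz, f_end_hz, duration_s, sample_rate_hz):
    """Return samples of a sine sweep whose frequency rises linearly in time."""
    n_samples = int(round(duration_s * sample_rate_hz))
    sweep_rate = (f_end_hz - f_start_hz) / duration_s  # Hz per second
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz
        # Phase is 2*pi times the integral of f(t) = f_start + sweep_rate * t.
        phase = 2.0 * math.pi * (f_start_hz * t + 0.5 * sweep_rate * t * t)
        samples.append(math.sin(phase))
    return samples

# Hypothetical values: a 10 ms sweep from 30 kHz to 60 kHz sampled at 500 kHz.
chirp = linear_chirp(30_000.0, 60_000.0, 0.010, 500_000.0)
print(len(chirp), "samples")
```

The sample rate is set well above twice the highest frequency so the sweep is represented without aliasing.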

After we acquired the recordings, we needed to label them. We did this using camera footage of the tank (Figure 2). Recordings with about 5% or more ice coverage between the transmitter and the hydrophone were labeled ‘ice,’ and recordings with less than 5% coverage were labeled ‘no ice.’ We chose this 5% threshold to differentiate between negligible and non-negligible ice cover. We converted each labeled recording into a time-frequency spectrogram and used the spectrograms to train a machine learning classifier.
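The labeling rule described above can be sketched as a single threshold check. The function name and the idea of passing in a coverage fraction are hypothetical; the article does not describe how coverage was estimated from the camera footage.

```python
ICE_COVERAGE_THRESHOLD = 0.05  # the 5% cutoff described in the text

def label_recording(ice_coverage_fraction):
    """Map an estimated ice coverage (0.0 to 1.0) to 'ice' or 'no ice'."""
    if not 0.0 <= ice_coverage_fraction <= 1.0:
        raise ValueError("coverage must be a fraction between 0 and 1")
    if ice_coverage_fraction >= ICE_COVERAGE_THRESHOLD:
        return "ice"
    return "no ice"

# Hypothetical coverage estimates for three recordings.
for coverage in (0.0, 0.04, 0.30):
    print(coverage, "->", label_recording(coverage))
```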

Figure 2. Camera footage of the laboratory tank for labeling. The overhead view shows two robotic arms and multiple white pieces of ice floating in a clear water tank.

For the machine learning classifier, we selected a convolutional neural network (CNN) because it can detect important features indicating the presence of ice. We passed the spectrograms and their associated labels through the classifier for training, where the CNN learned to associate certain features of spectrograms with their labels. Ten classifiers were trained to provide a statistical representation of performance.
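To give a feel for how a convolutional layer picks out features in a spectrogram, the sketch below slides one small filter over a toy grid. It is a minimal illustration of the sliding-window operation only, not the network used in the study; the toy values and the filter are made up.

```python
def correlate2d_valid(image, kernel):
    """Slide a kernel over a 2D grid (no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Toy "spectrogram": rows are frequency bins, columns are time frames.
# Energy switches on at the third time frame.
toy_spectrogram = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
# Hypothetical filter that responds where energy increases from one frame to the next.
onset_kernel = [[-1.0, 1.0]]
response = correlate2d_valid(toy_spectrogram, onset_kernel)
print(response)
```

In a trained CNN the filter values are learned from the labeled spectrograms rather than chosen by hand.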

Figure 3. Roadmap of how each audio recording was processed and classified. The diagram shows audio signals converted to spectrograms and processed by a CNN classifier that labels the presence or absence of ice.

Once the ten classifiers were trained, we tested their performance on two other datasets that they were not trained on. We did this to see how well the CNN could generalize to other conditions. This generalizability is important because, in practical applications, the ocean environment is always changing: no two recordings will ever have identical conditions. The mean labeling accuracy across the ten classifiers on the testing block ice dataset was 93.5% ± 0.9%. On the pebble ice dataset, the classifiers achieved 94.3% ± 1.4% accuracy. These tests show that the CNNs can generalize well to new conditions.
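A summary of the form "mean ± spread" across several trained classifiers can be computed as sketched below. This assumes the ± value is a sample standard deviation, which the article does not state, and the accuracy values are placeholders for illustration only, not the study's per-classifier results.

```python
import statistics

def summarize_accuracies(accuracies_percent):
    """Return the mean and sample standard deviation of a list of accuracies."""
    mean = statistics.mean(accuracies_percent)
    spread = statistics.stdev(accuracies_percent)
    return mean, spread

# Placeholder accuracies for ten hypothetical classifiers (NOT measured data).
placeholder_runs = [80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0]
mean, spread = summarize_accuracies(placeholder_runs)
print(f"{mean:.1f}% ± {spread:.1f}%")
```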

The high accuracy of these initial experiments indicates that a CNN can use sound to detect the presence of ice. Just as the beluga whale listens for audio cues to find breathing holes in the ice, the neural network extracts important information from the sound to determine whether ice is present.

How Online Meetings Change Your Voice—and How We Measure It

Akira Takeuchi – takeuchi.akira@studio-infinity.co.jp

Instagram: @akira_reference_
Studio Infinity
Tokyo, Minato-ku, 107-0061
Japan

Additional Authors
Yixuan Huang, Miki Morinaga, Satoshi Tsuboya, Yuto Hosoya, and Sungyoung Kim

Popular version of 1pCA5 – Evaluating speech quality for automatic transcription in videoconferencing
Presented at the 189th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0040073

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

Ghosts in Online Meetings: Why Clear Voices Sometimes Get Lost
Have you ever noticed that voices suddenly sound unclear during an online meeting—even though the speaker believes they are speaking clearly? You may find yourself straining to listen, missing words, or misunderstanding what was said. These problems are surprisingly common and can be difficult to fix on the spot, especially when meeting participants are not familiar with the technical details of videoconference systems.

We study this hidden problem by developing a machine learning–based system that can evaluate speech quality without interrupting the meeting. Our goal is to detect sound problems automatically, before they become frustrating for listeners.

AI Transcription vs. Human Listening
Humans are remarkably good at understanding speech, even when parts of it are missing or covered by noise. When a word is unclear, listeners often guess the meaning from context and still understand the overall message.

Automatic speech transcription, which is now widely used to record and summarize meetings, works very differently. AI systems analyze sound exactly as it is received. If speech is distorted, masked by noise, or partially missing, transcription accuracy drops sharply.

We turn this weakness into a strength. By measuring how much transcription quality degrades, we use AI transcription accuracy as an indicator of speech quality. In other words, if the transcription struggles, listeners are likely struggling too.
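One common way to quantify how much a transcript has degraded is the word error rate: the number of word substitutions, deletions, and insertions needed to turn the transcript into a reference text, divided by the reference length. The sketch below is an illustration under that assumption; the article does not say which metric the authors use, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)

# Invented example: one word of five is missing from the transcript.
clean = "the meeting starts at noon"
degraded = "the meeting at noon"
print(word_error_rate(clean, degraded))
```

A higher value means the transcription struggled more, which under this approach would flag poorer speech quality.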

Causes of sound deterioration
Sound deterioration during online meetings can be grouped into four main causes (Figure 1):

  • Speech factors
    • How and what the speaker says, such as speaking speed or clarity.
  • Acoustic factors
    • Background noise or room reverberation that affects sound before it reaches the microphone.
  • System factors
    • Problems with microphones, cables, or audio hardware quality.
  • Communication factors
    • Network issues that occur after sound is converted into digital data, such as data compression or packet loss.

Our research focuses on communication factors, which are especially important in videoconference systems and differ from traditional phone calls.

Figure 1. Causes of sound deterioration

Packet loss simulation
Online meetings send sound over the internet in small pieces called packets. We use the SILK audio codec, a common system for converting speech into a format suitable for network transmission. Sometimes, these packets are lost during transmission, causing brief gaps or distortions in the sound.

To study this effect, we intentionally simulate packet loss and create artificially degraded speech. This allows us to generate large amounts of training data and teach machine learning models what poor communication quality sounds like.
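The idea of simulated packet loss can be sketched in a few lines. This is a deliberately simplified illustration: it drops fixed-size blocks of raw samples at random and fills the gaps with silence. It does not use the SILK codec, and real codecs typically apply loss concealment; the packet size, loss probability, and function name are assumptions.

```python
import random

def simulate_packet_loss(samples, packet_size, loss_probability, seed=0):
    """Zero out randomly chosen packets; return the degraded signal and loss count."""
    rng = random.Random(seed)  # seeded so the same losses can be reproduced
    degraded = []
    lost_packets = 0
    for start in range(0, len(samples), packet_size):
        packet = samples[start:start + packet_size]
        if rng.random() < loss_probability:
            degraded.extend([0.0] * len(packet))  # lost packet becomes silence
            lost_packets += 1
        else:
            degraded.extend(packet)
    return degraded, lost_packets

# Hypothetical signal: 1,000 samples of constant amplitude, 100 samples per packet.
signal = [1.0] * 1000
degraded, lost = simulate_packet_loss(signal, packet_size=100, loss_probability=0.2)
print(lost, "of 10 packets lost")
```

Sweeping the loss probability produces speech at many degradation levels from a single clean recording, which is what makes this kind of simulation useful for generating training data.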

Figures 2 and 3 compare a clean speech signal with a packet-loss-simulated version, showing how missing data changes the sound structure.

Figure 2. Spectrogram of clean speech

Figure 3. Spectrogram of packet loss simulated speech

Why This Matters
As online meetings become a permanent part of work and education, unnoticed sound degradation can silently reduce communication quality. By automatically detecting these problems, our approach helps make virtual meetings clearer, fairer, and less tiring—so no one’s voice turns into a “ghost” in the meeting.

More details can be found on our R&D webpage.

Understanding Why Engine Noise Feels Loud in Hybrid Vehicles with AI

Shinichi Suganuma – shinichi_suganuma@camal.mech.chuo-u.ac.jp

Graduate School of Science and Engineering
Chuo University
1-13-27 Kasuga
Bunkyo-ku, Tokyo, 112-8551
Japan

Shimpei Nagae
Nissan Motor Co., Ltd.
Kanagawa, Japan

Takeshi Toi
Chuo University
Tokyo, Japan

Popular version of 4aNSa2 – Development of a Machine Learning Model to Predict Engine Noise Perception Considering Regional and Driving Environment Differences
Presented at the 189th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0041106

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

When driving a hybrid vehicle, many people notice the moment when the quiet electric drive suddenly switches to the engine — and the engine can feel “loud,” even when the actual sound level is modest. Why does this happen? And does the way drivers perceive this noise differ across countries? In this study, we used machine learning to predict how people judge engine noise annoyance and to uncover insights that may help make future hybrid vehicles more comfortable.

Figure 1. AI Model for Predicting Engine Noise Perception
Video 1. On-Road Driving Example for Data Collection

We conducted on-road evaluations in Japan, the United States, and the United Kingdom. During each test, we simultaneously recorded in-cabin sound, vehicle parameters, and drivers’ ratings of engine noise on a three-level scale (“Not noisy,” “Noisy,” “Very noisy”), creating a dataset for AI training. In Japan, we used the series-hybrid Nissan Note e-POWER. In the U.S., where this model is not sold, we reproduced its engine sound on the Nissan Ariya EV, and in the U.K. we used the Qashqai e-POWER engine sound played on the Ariya. Because vehicles, drivers, and road environments differed across regions, the study provided a stringent test of model generality.

Figure 2. On-Road Evaluation Conditions in Japan, the U.S., and the U.K.

First, we tested how accurately AI could predict annoyance using only in-cabin sound metrics such as loudness and sharpness. In Japan, prediction accuracy reached about 57%. When we added three vehicle parameters (engine speed, acceleration torque, and vehicle speed), accuracy increased to 67%, demonstrating that driving conditions, not just sound, play an important role in annoyance perception. The same trend was observed in the U.S. and the U.K.

Figure 3. Prediction Accuracy Improvements Using Vehicle Data and Time History

However, the relative importance of the three vehicle parameters differed by region. In Japan and the U.S., engine speed contributed most strongly to predictions. In contrast, in the U.K., acceleration torque was the most influential factor. This likely reflects the presence of many roundabouts in the U.K. test route, where frequent acceleration and deceleration lead drivers to value the coherence between engine sound and vehicle motion. This aligns with the author’s own experience living in the U.K. for three years.

Next, we incorporated several seconds of engine-speed history into the vehicle parameters. In all regions, adding this short-term history improved prediction accuracy. Although the optimal history length differed slightly (around 5.5 seconds in Japan and 6.5 seconds in the U.S.), the common finding was clear: people judge engine noise not from a single moment but from the pattern of change over several seconds.
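One way to picture "adding engine-speed history" is to take the last few seconds of engine speed before each rating and summarize them as extra model inputs. The sketch below is hypothetical: the article does not describe how the history was encoded, and the summary statistics, sample rate, and values here are illustrative assumptions.

```python
def engine_speed_history(engine_speed_rpm, index, history_s, sample_rate_hz):
    """Return the samples from history_s seconds before index up to index."""
    n_history = int(round(history_s * sample_rate_hz))
    if n_history < 1:
        raise ValueError("history must span at least one sample")
    start = index - n_history
    if start < 0:
        raise ValueError("not enough samples before this index")
    return engine_speed_rpm[start:index + 1]

def history_features(window):
    """Summarize a window of engine speeds as a few simple numbers."""
    steps = [later - earlier for earlier, later in zip(window, window[1:])]
    return {
        "current": window[-1],
        "mean": sum(window) / len(window),
        "net_change": window[-1] - window[0],
        "max_step": max(abs(step) for step in steps),
    }

# Hypothetical engine speeds sampled once per second.
rpm = [1000, 1200, 1400, 1600, 1800, 2000]
window = engine_speed_history(rpm, index=5, history_s=3.0, sample_rate_hz=1.0)
features = history_features(window)
print(features)
```

Varying `history_s` and comparing prediction accuracy is one way an optimal history length, such as the values reported above, could be searched for.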

Figure 4. Prediction Improvement When Engine-Speed History Is Added

Despite differences in vehicles, traffic environments, and evaluation routes, considering “vehicle operating conditions” together with “recent temporal changes” consistently improved the AI’s ability to predict annoyance across all regions. These findings provide valuable clues for designing hybrid vehicles that feel smoother and more comfortable for drivers around the world.

Finding the Right Tools to Interpret Crowd Noise at Sporting Events with AI

Jason Bickmore – jbickmore17@gmail.com

Instagram: @jason.bickmore
Brigham Young University, Department of Physics and Astronomy, Provo, Utah, 84602, United States

Popular version of 1aCA4 – Feature selection for machine-learned crowd reactions at collegiate basketball games
Presented at the 188th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0037270

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

A mixture of traditional and custom tools is enabling AI to make meaning in an unexplored frontier: crowd noise at sporting events.

The unique link between a crowd’s emotional state and its sound makes crowd noise a promising way to capture feedback about an event continuously and in real time. Transformed into feedback, crowd noise would help venues improve the experience for fans, sharpen advertisements, and support safety.

To capture this feedback, we turned to machine learning, a popular strategy for making tricky connections. While the tools required to teach AI to interpret speech from a single person are well understood (think Siri), the tools required to make sense of crowd noise are not.

To find the best tools for this job, we began with a simpler task: teaching an AI model to recognize four crowd behaviors (applauding, chanting, cheering, and distracting the other team) at college basketball and volleyball games (Fig. 1).

Figure 1: Machine learning identifies crowd behaviors from crowd noise. We helped machine learning models recognize four behaviors: applauding, chanting, cheering, and distracting the other team. Image courtesy of byucougars.com.

We began with a large list of tools, called features, some drawn from traditional speech processing and others created using a custom strategy. After applying five methods to eliminate all but the most powerful features, a blend of traditional and custom features remained. A model trained with these features recognized the four behaviors with at least 70% accuracy.

Based on these results, we concluded that, when interpreting crowd noise, both traditional and custom features have a place. Even though crowd noise is not the situation the traditional tools were designed for, they are still valuable. The custom tools are useful too, complementing the traditional tools and sometimes outperforming them. The tools’ success at recognizing the four behaviors indicates that a similar blend of traditional and custom tools could enable AI models to navigate crowd noise well enough to translate it into real-time feedback. In future work, we will investigate the robustness of these features by checking whether they enable AI to recognize crowd behaviors equally well at events other than college basketball and volleyball games.

Enhancing Speech Recognition in Healthcare

Andrzej Czyzewski – andczyz@gmail.com

Gdańsk University of Technology, Faculty of Electronics, Telecommunications and Informatics, Multimedia Systems Department, Gdańsk, Pomerania, 80-233, Poland

Popular version of 1aSP6 – Strategies for Preprocessing Speech to Enhance Neural Model Efficiency in Speech-to-Text Applications
Presented at the 187th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0034984

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–


Effective communication in healthcare is essential, as accurate information can directly impact patient care. This paper discusses research aimed at improving speech recognition technology to help medical professionals document patient information more effectively. By using advanced techniques, we can make speech-to-text systems more reliable for healthcare, ensuring they accurately capture spoken information.

In healthcare settings, professionals often need to quickly and accurately record patient interactions. Traditional typing can be slow and error-prone, while speech recognition allows doctors to dictate notes directly into electronic health records (EHRs), saving time and reducing miscommunication.

The main goal of our research was to test various ways of enhancing speech-to-text accuracy in healthcare. We compared several methods to help the system understand spoken language more clearly. These methods included different ways of analyzing sound, like looking at specific sound patterns or filtering background noise.

In this study, we recorded around 80,000 voice samples from medical professionals. These samples were then processed to highlight important speech patterns, making it easier for the system to learn and recognize medical terms. We used a method called Principal Component Analysis (PCA) to keep the data simple while ensuring essential information was retained.
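To illustrate what PCA does, the sketch below finds the single direction along which a small dataset varies most and projects each data point onto it. It is a toy illustration using power iteration in plain Python, not the configuration used in the study; practical work would normally rely on a numerical library, and the data points are made up.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return the top principal direction and each row's score along it."""
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two rows")
    means = [sum(row[k] for row in rows) / n for k in range(d)]
    centered = [[row[k] - means[k] for k in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by the covariance and renormalize.
    # The arbitrary start vector must not be orthogonal to the answer.
    vec = [float(k + 1) for k in range(d)]
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break  # data has no variance; keep the current vector
        vec = [v / norm for v in nxt]
    scores = [sum(c[k] * vec[k] for k in range(d)) for c in centered]
    return vec, scores

# Made-up 2D points lying on a line: one number per point captures all variation.
points = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
direction, scores = first_principal_component(points)
print(direction, scores)
```

Keeping only the scores along the strongest directions is how PCA reduces the number of values per sample while retaining most of the variation.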

Our findings showed that combining several techniques to capture speech patterns improved system performance. On average, accuracy improved, with fewer word and character recognition errors.

The potential benefits of this work are significant:

  • Smoother documentation: Medical staff can record notes more efficiently, freeing up time for patient care.
  • Improved accuracy: Patient records become more reliable, reducing the chance of miscommunication.
  • Better healthcare outcomes: Enhanced communication can improve the quality of care.

This study highlights the promise of advanced speech recognition in healthcare. With further development, these systems can support medical professionals in delivering better patient care through efficient and accurate documentation.

Figure 1. Front page of the ADMEDVOICE corpus, which contains medical texts and their spoken equivalents