Akira Takeuchi – takeuchi.akira@studio-infinity.co.jp

Instagram: @akira_reference_
Studio Infinity
Tokyo, Minato-ku, 107-0061
Japan

Additional Authors
Yixuan Huang, Miki Morinaga, Satoshi Tsuboya, Yuto Hosoya, and Sungyoung Kim

Popular version of 1pCA5 – Evaluating speech quality for automatic transcription in videoconferencing
Presented at the 189th ASA Meeting
Read the abstract at https://eppro02.ativ.me/appinfo.php?page=Session&project=ASAASJ25&id=3983372&server=eppro02.ativ.me

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

Ghosts in Online Meetings: Why Clear Voices Sometimes Get Lost
Have you ever noticed that voices suddenly sound unclear during an online meeting—even though the speaker believes they are speaking clearly? You may find yourself straining to listen, missing words, or misunderstanding what was said. These problems are surprisingly common and can be difficult to fix on the spot, especially when meeting participants are not familiar with the technical details of videoconference systems.

We study this hidden problem by developing a machine learning–based system that can evaluate speech quality without interrupting the meeting. Our goal is to detect sound problems automatically, before they become frustrating for listeners.

AI Transcription vs. Human Listening
Humans are remarkably good at understanding speech, even when parts of it are missing or covered by noise. When a word is unclear, listeners often guess the meaning from context and still understand the overall message.

Automatic speech transcription, which is now widely used to record and summarize meetings, works very differently. AI systems analyze sound exactly as it is received. If speech is distorted, masked by noise, or partially missing, transcription accuracy drops sharply.

We turn this weakness into a strength. By measuring how much transcription accuracy degrades, we use it as an indicator of speech quality. In other words, if the transcription struggles, listeners are likely struggling too.
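As a rough illustration of this idea, the sketch below computes the word error rate (WER) between a reference transcript and an automatic transcript; a higher WER suggests more degraded audio. The example sentences and the code itself are hypothetical and are not the exact metric or implementation used in the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example: a clean sentence vs. what a transcriber might return
# when parts of the audio are lost mid-utterance.
reference = "please review the quarterly budget before friday"
hypothesis = "please view the quarter budget fray day"
print(f"WER = {word_error_rate(reference, hypothesis):.2f}")  # higher WER -> worse audio
```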

Causes of Sound Deterioration
Sound deterioration during online meetings can be grouped into four main causes (Figure 1):

  • Speech factors
    • How and what the speaker says, such as speaking speed or clarity.
  • Acoustic factors
    • Background noise or room reverberation that affects sound before it reaches the microphone.
  • System factors
    • Problems with microphones, cables, or audio hardware quality.
  • Communication factors
    • Network issues that occur after sound is converted into digital data, such as data compression or packet loss.

Our research focuses on communication factors, which are especially important in videoconference systems and differ from those in traditional phone calls.

Figure 1. Causes of sound deterioration

Packet Loss Simulation
Online meetings send sound over the internet in small pieces called packets. Sometimes these packets are lost during transmission, causing brief gaps or distortions in the sound. In our work, we use the SILK audio codec, a common system for converting speech into a format suitable for network transmission.

To study this effect, we intentionally simulate packet loss and create artificially degraded speech. This allows us to generate large amounts of training data and teach machine learning models what poor communication quality sounds like.
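As a simplified picture of what packet loss does to a waveform, the sketch below drops randomly chosen 20 ms frames from an audio signal and replaces them with silence. Real codecs such as SILK apply packet loss concealment rather than leaving pure silence, so this is only a rough illustration; the frame length, loss rate, and test signal are assumptions, not the study's actual settings.

```python
import numpy as np

def simulate_packet_loss(signal, sample_rate, frame_ms=20, loss_rate=0.1, seed=0):
    """Zero out randomly chosen frames to mimic lost audio packets (rough sketch)."""
    rng = np.random.default_rng(seed)
    frame_len = int(sample_rate * frame_ms / 1000)    # samples per packet-sized frame
    degraded = signal.copy()
    for start in range(0, len(signal), frame_len):
        if rng.random() < loss_rate:                  # this packet is "lost"
            degraded[start:start + frame_len] = 0.0   # silence instead of concealment
    return degraded

# Hypothetical usage with a 1-second 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t).astype(np.float32)
lossy = simulate_packet_loss(clean, sr, loss_rate=0.1)
```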

Figures 2 and 3 compare a clean speech signal with a packet-loss-simulated version, showing how missing data changes the sound structure.

Figure 2. Spectrogram of clean speech (click image to listen)

Figure 3. Spectrogram of packet-loss-simulated speech (click image to listen)

Why This Matters
As online meetings become a permanent part of work and education, unnoticed sound degradation can silently reduce communication quality. By automatically detecting these problems, our approach helps make virtual meetings clearer, fairer, and less tiring—so no one’s voice turns into a “ghost” in the meeting.

More details can be found on our R&D webpage.
