Vishal Shrivastava – shrivastava_vishal@outlook.com

Northwestern University, School of Communication
Department of Communication Sciences and Disorders, Frances Searle Building, 2240 Campus Drive
Evanston, Illinois 60208-3550
United States

Marisha Speights, Akangkshya Pathak

Popular version of 1aCA3 – Inclusive automatic speech recognition: A framework for equitable speech recognition in children with disorders
Presented at the 188th ASA Meeting
Read the abstract at https://eppro01.ativ.me//web/index.php?page=Session&project=ASAICA25&id=3867184

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

Imagine a child says, “My chest hurts,” but the computer hears “My test works.”
In critical moments, mistranscriptions like this can have serious consequences.

Today’s voice recognition tools—like those behind Siri, Alexa, or educational apps—work well for most adults, but often struggle with children’s voices—especially when speech is accented, disordered, or still developing.

We set out to change that by fine-tuning existing systems to ensure every child’s voice is heard clearly, fairly, and without bias.

The Problem: When Technology Leaves Children Behind
Automatic Speech Recognition (ASR) turns spoken words into text. It powers voice commands, transcription tools, and increasingly, educational apps and therapies. But there’s a hidden flaw: these systems are trained mostly on adult speech.

Here’s why that matters:

  • Children’s voices are different—higher-pitched, more variable, and constantly evolving.
  • There’s less data. Collecting labeled child speech—especially from children with disorders—is hard, costly, and ethically complex.
  • Bias creeps in. When systems hear mostly one kind of speech (like adult American English), they treat that as “normal.”

Everything else—like a 6-year-old with a stutter—gets mistaken for noise.

This isn’t just a technical problem. It’s an equity problem. The very tools meant to support children in learning, therapy, or communication often fail to understand them.

Our Approach: Teaching AI to Listen Fairly

Figure: Fine-tuning Whisper ASR with domain classifiers and a gradient reversal layer

We fine-tuned OpenAI’s Whisper ASR to better understand how children speak—not just by adding more data, but by teaching it to focus on what matters. Like other speech models, Whisper doesn’t only learn the words being said; it also picks up on who is speaking—age, accent, gender, and speech disorders. These cues are baked into the audio, and because Whisper was trained mostly on clear adult speech, it often misinterprets child or disordered speech, treating it as noise.

To fix this, we added a second learning objective—imagine two students in training. One transcribes speech; the other tries to guess traits like the speaker’s age or gender, using only the first student’s notes. Now we challenge the first: transcribe accurately, but reveal nothing about who’s speaking. The better they hide those clues while getting the words right, the better they’ve learned.

That’s the heart of adversarial debiasing. During fine-tuning, we added a domain classifier—like the second student—trained to detect speaker traits from Whisper’s internal audio features. We then inserted a gradient reversal layer to make that job harder, forcing the encoder to scrub away identity cues. All the while, the model continued learning to transcribe—only now, it did so without relying on speaker-specific shortcuts.
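For readers curious about what this looks like in code, here is a minimal PyTorch sketch of the two pieces just described: a gradient reversal layer and a small domain classifier that reads the encoder's pooled audio features. The feature width, the number of trait classes (n_domains), and the reversal strength (lambd) are placeholder values for illustration, not the settings used in our experiments.

```python
import torch
import torch.nn as nn


class GradientReversal(torch.autograd.Function):
    """Acts as the identity on the forward pass; flips (and scales) gradients on the way back."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Reversing the gradient pushes the encoder to hide the very cues
        # the domain classifier is trying to detect.
        return -ctx.lambd * grad_output, None


class DomainClassifier(nn.Module):
    """Tries to guess a speaker trait (e.g. an age group) from the encoder's features."""

    def __init__(self, d_model=512, n_domains=4, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.head = nn.Sequential(
            nn.Linear(d_model, 256),
            nn.ReLU(),
            nn.Linear(256, n_domains),
        )

    def forward(self, encoder_states):          # shape: (batch, time, d_model)
        pooled = encoder_states.mean(dim=1)     # average over time frames
        reversed_feats = GradientReversal.apply(pooled, self.lambd)
        return self.head(reversed_feats)        # trait logits
```

The classifier head itself learns normally, so it keeps getting better at guessing speaker traits; only the encoder, sitting upstream of the reversal, is pushed in the opposite direction.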

Figure: Architecture of the end-to-end domain-adversarial fine-tuning
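To show how the pieces connect end to end, here is a rough sketch of a single training step, assuming the Hugging Face implementation of Whisper and the DomainClassifier from the sketch above. The batch fields, the choice of trait label (an age-group id here), and the equal weighting of the two losses are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import WhisperForConditionalGeneration

# DomainClassifier is the class defined in the previous sketch.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
domain_clf = DomainClassifier(d_model=model.config.d_model, n_domains=4)
optimizer = torch.optim.AdamW(
    list(model.parameters()) + list(domain_clf.parameters()), lr=1e-5
)
ce = nn.CrossEntropyLoss()


def training_step(batch):
    # Standard Whisper fine-tuning: log-mel audio in, transcription loss out.
    outputs = model(input_features=batch["input_features"],
                    labels=batch["labels"])
    asr_loss = outputs.loss

    # The domain classifier reads the encoder's features through the reversal
    # layer: its head learns to guess the speaker trait, while the reversed
    # gradient pushes the encoder to erase that information.
    domain_logits = domain_clf(outputs.encoder_last_hidden_state)
    domain_loss = ce(domain_logits, batch["domain_labels"])   # e.g. age-group ids

    loss = asr_loss + domain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```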

The Result: Technology That Includes Every Voice
By learning to ignore traits that shouldn’t affect understanding—like age, accent, or disordered articulation—Whisper becomes more robust, accurate, and fair. It is far less likely to be tripped up by voices that don’t match what it was originally trained on. That means fewer errors for children who speak differently, and a step closer to voice technology that works for everyone—not just the majority.
