Can Artificial Intelligence Accurately Clone Dysphonic Voices?

Pasquale Bottalico – pb81@illinois.edu

University of Illinois at Urbana-Champaign
Champaign, Illinois, 61801
United States

Additional Authors
Charles J. Nudelman
Daniel Fogerty
Virginia Tardini
Keiko Ishikawa

Popular version of 2aSCa8 – Can Artificial Intelligence Accurately Clone Dysphonic Voices? A Perceptual and Intelligibility Assessment
Presented at the 189th ASA Meeting
Read the abstract at https://eppro02.ativ.me//web/index.php?page=Session&project=ASAASJ25&id=3981555

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

Artificial intelligence is now remarkably good at cloning human voices, but can it convincingly imitate a disordered voice? Our findings suggest that while AI excels at copying healthy speech, it still struggles to capture the acoustic complexity of dysphonia, a condition that makes the voice sound rough, strained, or breathy.

Dysphonia affects millions of people and often reduces speech intelligibility, especially in noisy environments. Because collecting large amounts of patient data can be difficult, researchers wondered whether AI voice-cloning technologies might one day help them simulate disordered speech for training, education, or early-stage clinical research.

To test this idea, the team recorded 12 speakers (six with healthy voices and six with dysphonia) and used a commercial AI system to create a digital “voice clone” of each person. These AI voices were trained using about one minute of recorded speech for each speaker. More than 60 listeners participated in three online experiments designed to evaluate whether the AI-generated voice clones truly preserved the qualities of disordered speech.

Watch the short video below to see exactly how the experiment worked.

In the listening tasks, participants heard pairs of sentences. Sometimes both sentences were from the real speaker, sometimes both were AI-generated, and sometimes one was real and one was AI. In some trials, listeners tried to decide whether the two voices came from the same person. In others, they had to identify which sentence (if any) was produced by AI. A third task tested how well listeners understood real and AI-generated dysphonic speech in background noise.

In the first experiment, as shown in Figure 1, listeners were very accurate when both samples were real. Here, accuracy refers to the proportion of trials in which listeners correctly judged whether the two voice samples were from the same or different speakers. Accuracy dropped slightly when both samples were AI-generated. But when one sample was real and the other AI-generated, performance fell sharply, especially for healthy voices, where the AI clones often sounded strikingly similar to the real person.
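For readers curious about how such a score is tallied, the sketch below shows one minimal way to compute accuracy per condition in Python. The trial records, condition labels, and field names are illustrative placeholders, not the study’s actual data or analysis code.

```python
from collections import defaultdict

# Hypothetical trial records: each lists the pairing condition, whether the two
# samples really came from the same speaker, and the listener's judgment.
trials = [
    {"condition": "real-real", "same_speaker": True,  "response": "same"},
    {"condition": "real-AI",   "same_speaker": True,  "response": "different"},
    {"condition": "AI-AI",     "same_speaker": False, "response": "different"},
    # ... one entry per listener judgment
]

def accuracy_by_condition(trials):
    """Proportion of correct same/different judgments in each pairing condition."""
    correct, total = defaultdict(int), defaultdict(int)
    for t in trials:
        truth = "same" if t["same_speaker"] else "different"
        total[t["condition"]] += 1
        correct[t["condition"]] += t["response"] == truth
    return {cond: correct[cond] / total[cond] for cond in total}

print(accuracy_by_condition(trials))
```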

Figure 1. Bar plot showing the percentage of correct same/different speaker judgments across conditions for normal and dysphonic voices. Bars represent mean percentages with 95% confidence intervals. Note: RL = real speech; AI = AI-generated speech.

Figure 2. Bar plot showing the percentage of correct AI identification responses across conditions for normal and dysphonic voices. Bars represent mean percentages with 95% confidence intervals. Note: RL = real speech; AI = AI-generated speech.

A second experiment asked listeners to identify which sentences were AI-generated. For healthy voices, AI was difficult to detect. For dysphonic voices, however, listeners were more successful, suggesting the AI system smoothed out or failed to reproduce key features of dysphonia. The results are shown in Figure 2.

The final experiment delivered the strongest finding: AI-generated dysphonic voices were significantly more intelligible than real dysphonic voices when played in background noise. In other words, the AI unintentionally “cleaned up” the voice disorder, creating speech that sounded clearer and easier to understand than the real dysphonic voices. The results are shown in Figure 3.
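The paper summarized here does not detail how the background noise was added, but speech-in-noise tests are commonly built by scaling a noise recording to a chosen signal-to-noise ratio (SNR) before mixing it with the speech. The sketch below illustrates that general approach with synthetic signals; the SNR, noise type, and sampling rate are arbitrary placeholders rather than the study’s settings.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
    noise = noise[: len(speech)]                       # match lengths
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_noise_power = p_speech / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_noise_power / p_noise)

# Placeholder signals: one second of a speech-like tone plus white noise.
fs = 16000
t = np.arange(fs) / fs
speech = 0.1 * np.sin(2 * np.pi * 220 * t)
noise = np.random.randn(fs)

mixture = mix_at_snr(speech, noise, snr_db=0)  # 0 dB SNR: speech and noise equally strong
```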

These results demonstrate that while AI voice cloning is impressively realistic for healthy speech, it does not yet capture the natural irregularities of disordered voices. For now, real patient recordings remain essential. However, this research highlights the exciting potential of improved AI tools in the future.

Figure 3. Mean intelligibility scores (IS) of normal and dysphonic groups in real and AI-generated voice conditions. The IS values vary from 0 to 1. Error bars indicate standard errors. Note: RL = real speech; AI = AI-generated speech.

Designing Museum Spaces That Sound as Good as They Look

Milena Jonas Bem – jonasm@rpi.edu
School of Architecture, Rensselaer Polytechnic Institute
Greene Bldg, 110 8th St
Troy, NY 12180
United States

Popular version of 2pAAa7 – Acoustic Design in Contemporary Museums: Balancing Architectural Aesthetics and Auditory Experience
Presented at the 188th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0037653

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

Museums are designed to dazzle the eyes but often fail the ears. Imagine standing in a stunning gallery with high ceilings and gleaming floors, only to struggle to hear the tour guide over the echoes. Later, you pause before a painting, hoping for quiet reflection, but you get distracted by nearby chatter. Our research shows how simple design choices, like swapping concrete floors for carpet or adding acoustic ceilings, can transform visitor experiences by improving the acoustic environment.

The Acoustic Challenge in Museums
Contemporary museums often embrace a “white box” aesthetic, where minimalist architecture puts art center stage. Usually, this approach relies on hard, highly reflective finishes like glass, concrete, and masonry, paired with high ceilings and open-plan layouts. While visually striking, these designs rarely account for their acoustic side effects, creating echo chambers that distract from the art they’re meant to highlight.

Testing “What if?” in Real Galleries


Figure 1. Room-impulse-response measurement in progress: a dodecahedral loudspeaker (left) emits test signals while a microphone records the gallery’s acoustic “fingerprint.” Photo: Aleksandr Tsurupa

To solve this, we visited museum galleries and recorded how sound traveled in each space, capturing each room’s “acoustic fingerprint,” known as the room impulse response. Using these recordings, we built virtual models to test how different materials (e.g., carpet vs. concrete) changed the sound in the space. We evaluated three levels of sound absorption (low, medium, and high) on the floor, ceiling, and walls, then assessed how these choices affected key acoustic metrics: how long sound lingers (reverberation time, or RT), how intelligible speech is (Speech Transmission Index, or STI), and how far away you can still understand a conversation clearly (distraction distance).
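For the technically inclined: reverberation time is usually estimated from a measured impulse response with Schroeder backward integration, where the squared impulse response is summed from the end backward to form an energy-decay curve, and a line fitted to part of that decay is extrapolated to a 60 dB drop. The sketch below applies that textbook procedure to a synthetic decay; it illustrates the general method, not the authors’ exact measurement chain or software.

```python
import numpy as np

def rt60_from_impulse_response(ir, fs, fit_range_db=(-5.0, -25.0)):
    """Estimate RT60 from an impulse response via Schroeder backward integration."""
    edc = np.cumsum((ir ** 2)[::-1])[::-1]            # energy-decay curve
    edc_db = 10 * np.log10(edc / edc[0])
    hi, lo = fit_range_db                             # fit the -5 dB to -25 dB portion
    idx = np.where((edc_db <= hi) & (edc_db >= lo))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)   # decay rate in dB per second
    return -60.0 / slope                              # time for a 60 dB drop

# Synthetic impulse response: exponentially decaying noise with a ~1.5 s decay.
fs = 16000
t = np.arange(int(2 * fs)) / fs
ir = np.random.randn(len(t)) * np.exp(-3 * np.log(10) * t / 1.5)

print(f"Estimated RT60: {rt60_from_impulse_response(ir, fs):.2f} s")
```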

Key Findings

1. More Absorption Always Helps: Our first big finding is that adding more absorption always helps, with no exceptions. Increasing from low → medium → high absorption consistently cut reverberation time in half or more, boosted speech clarity by 0.05–0.10 STI points, and made speech levels drop off faster with distance (good for privacy).

2. Placement Matters: where you put that absorption makes a practical difference:

    • Floors yield the single biggest improvement: swapping concrete for carpet cut reverberation time by 1.8 seconds. However, floor treatment alone does not guarantee ideal results; supplemental ceiling or wall treatments may still be needed to reach target RT, clarity, and privacy levels.
    • Ceilings delivered the largest jumps in STI and clarity, producing the greatest overall increase in distraction distance and better sound attenuation with distance. Moving from a fully reflective ceiling to wood, and then to microperforated ceiling panels, is especially effective for intelligibility.
    • Walls emerged as the ultimate privacy tool. Only high-absorption plaster walls drove conversation levels at 4 m below 52 dB and created the steepest drop-off, perfect for whisper-quiet exhibits or multimedia spaces.

3. A Simple STI-Prediction Shortcut: Measuring speech intelligibility typically requires specialized equipment and complex calculations. We distilled our data into a simple formula that predicts STI from just a room’s volume and total absorption, with no advanced math required (STI ranges from 0 to 1; closer to 1 means better intelligibility). A rough worked example follows Figure 2 below.

Figure 2. Predicted Speech Transmission Index (STI) across room volume and total absorption area. Warm colors indicate higher STI in smaller, highly absorptive spaces; cool colors indicate lower STI in large, reflective rooms. The overlaid equation estimates STI from volume, absorption, and reverberation time. Source: Authors
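The fitted equation itself appears only in Figure 2 and is not reproduced here, but the underlying link between volume, total absorption, and reverberation time is the classic Sabine relation, which is the usual starting point for this kind of shortcut. The sketch below computes the Sabine RT for a hypothetical gallery; turning that into an STI estimate would require the authors’ fitted coefficients, which are deliberately left out.

```python
def sabine_rt60(volume_m3, absorption_m2):
    """Classic Sabine formula: RT60 = 0.161 * V / A, in seconds."""
    return 0.161 * volume_m3 / absorption_m2

# Hypothetical gallery: 2000 m^3 of volume and 250 m^2 of equivalent absorption area.
rt = sabine_rt60(2000, 250)
print(f"Sabine RT60: {rt:.2f} s")   # about 1.3 s

# The study's shortcut (overlaid on Figure 2) would then map volume, absorption,
# and RT to an STI value; its coefficients are not reproduced in this summary.
```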

Hear the Difference: Auralizations from Williams College Museum
Below is one of the rooms that was used as a case study (Figure 3). Using auralizations (audio simulations that let you “hear” a space before it’s built), you can experience these changes yourself. Click each scenario below to hear the differences!

Figure 3. Museum gallery (photo) and its calibrated 3D model. The highlighted gallery “W1” served as a case study for virtually swapping floor, wall, and ceiling finishes to predict acoustic outcomes. Source: Authors
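Auralization itself rests on a simple operation: a “dry” (near-anechoic) recording is convolved with the room impulse response of a space, measured or modeled, so the recording takes on that room’s acoustic character. The sketch below shows that basic step with toy signals; the actual auralizations were generated from the calibrated gallery models, not from the synthetic impulse response used here.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 16000
t = np.arange(fs) / fs

# Placeholder "dry" signal: one second of a speech-like tone.
dry = 0.1 * np.sin(2 * np.pi * 220 * t)

# Toy room impulse response: exponentially decaying noise, roughly 1 s of reverberation.
ir = np.random.randn(fs) * np.exp(-6.9 * t)

# Auralization: convolve the dry signal with the impulse response, then normalize.
auralized = fftconvolve(dry, ir)
auralized /= np.max(np.abs(auralized))   # keep playback level below clipping
```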

Note: The weighted absorption coefficient (αw) ranges from 0 to 1; higher values mean more sound is absorbed.

Wall: [audio examples]

Ceiling: [audio examples]

The takeaway?
Start with sound-absorbing floors to reduce echoes, add ceiling panels to sharpen speech, and use high-performance walls where privacy matters most. These steps do not require sacrificing aesthetics—materials like sleek microperforated wood or acoustic plaster blend seamlessly into designs. By considering acoustics early, designers can create museums that are as comfortable to hear as they are to see.