Using Physics to Solve the Cocktail Party Problem
Keith McElveen – firstname.lastname@example.org
151 King Street
Charleston, SC USA 29401
Popular version of paper ‘Robust speech separation in underdetermined conditions by estimating Green’s functions’
Presented Thursday morning, June 10th, 2021
180th ASA Meeting, Acoustics in Focus
Nearly seventy years ago, a hearing researcher named Colin Cherry wrote: “One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem.’ No machine has been constructed to do just this, to filter out one conversation from a number jumbled together.”
Despite many claims of success over the years, the Cocktail Party Problem has resisted solution. The present research investigates a new approach that blends tricks used by human hearing with laws of physics. With this approach, it is possible to isolate a voice based on where it must have come from – somewhat like visualizing balls moving around a billiard table after being struck, except in reverse, and in 3D. This approach is shown to be highly effective in extremely challenging real-world conditions with as few as four microphones – the same number as found in many smart speakers and pairs of hearing aids.
The first “trick” is something that hearing scientists call “glimpsing”. Humans subconsciously piece together audible “glimpses” of a desired voice as it momentarily rises above the level of competing sounds. After gathering enough glimpses, our brains “learn” how the desired voice moves through the room to our ears and use this knowledge to ignore the other sounds.
The second “trick” is based on how humans use sounds that arrive “late” because they bounced off one or more large surfaces along the way. Human hearing somehow combines these reflected “copies” of the talker’s voice with the direct version to help us hear more clearly.
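To make the idea of combining reflected copies concrete, here is a minimal sketch in Python. It is not the method of the paper and not a model of human hearing. It assumes the room response is already known and uses a textbook matched filter, with made-up delays, gains and noise level, and a single click standing in for the voice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical room response: direct sound plus two reflections.
# Delays (in samples) and gains are made-up values for illustration.
h = np.zeros(64)
h[0], h[23], h[47] = 1.0, 0.7, 0.5

# A single click stands in for the talker's voice.
source = np.zeros(256)
source[100] = 1.0

# The microphone picks up the click, its echoes, and unrelated noise.
clean = np.convolve(source, h)
noise = 0.3 * rng.standard_normal(clean.size)
mic = clean + noise

def combine_copies(signal, response):
    """Correlate with the room response so every copy lands on the direct arrival."""
    return np.correlate(signal, response, mode="valid") / np.sum(response**2)

combined = combine_copies(mic, h)
# The filter is linear, so this difference is the filtered noise alone.
residual = combined - combine_copies(clean, h)

print("noise power at the microphone:", np.var(noise))
print("noise power after combining:  ", np.var(residual))
```

In this toy case the click returns at full height at sample 100, while the noise power is expected to drop by a factor equal to the summed squared gains (1 + 0.49 + 0.25 = 1.74). The filter also leaves smaller side peaks where the copies only partly line up, which is one reason a practical system needs more than this.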
The present research mimics human hearing by using glimpses to build a detailed physics model – called a Green’s Function – of how sound travels from the talker to each of several microphones. It then uses the Green’s Function to reject all sounds that arrived via different paths and to reassemble the direct and reflected copies into the desired speech. The accompanying sound file illustrates typical results this approach achieves.
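The sketch below illustrates the general idea in Python, under heavy assumptions. It works on one frequency band only, simulates four microphones and two talkers with made-up transfer vectors, and is told which frames are glimpses (a real system would have to detect them from the mixture). The transfer vector is estimated as the dominant eigenvector of the glimpse frames, and a conventional minimum-variance beamformer is swapped in to pass that path and suppress everything else. None of these names or choices come from the paper, whose own estimator is not described in this summary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames = 2000

def complex_noise(*shape):
    """Unit-variance complex Gaussian samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

# Made-up transfer vectors (one complex gain per microphone) for one frequency band.
# Each gain stands for the combined effect of the direct path and all reflections.
a_target = np.array([1.0, 0.8, 0.6, 0.9]) * np.exp(1j * np.array([0.0, 0.9, 1.7, 2.8]))
a_other = np.array([1.0, 0.9, 0.7, 0.8]) * np.exp(1j * np.array([0.0, -1.2, 2.1, 0.4]))

# Speech is intermittent: each talker is active in roughly half of the frames.
target_on = rng.random(n_frames) < 0.5
other_on = rng.random(n_frames) < 0.5
s_target = target_on * complex_noise(n_frames)
s_other = other_on * complex_noise(n_frames)

# Microphone signals: both talkers plus a little sensor noise, shape (4, n_frames).
X = np.outer(a_target, s_target) + np.outer(a_other, s_other) + 0.05 * complex_noise(4, n_frames)

# Assumption: an oracle marks the glimpses (target active, competitor silent).
glimpse = target_on & ~other_on

# Step 1: estimate the target's transfer vector from the glimpse frames.
Xg = X[:, glimpse]
R_glimpse = Xg @ Xg.conj().T / Xg.shape[1]
_, eigvecs = np.linalg.eigh(R_glimpse)      # eigenvalues in ascending order
a_hat = eigvecs[:, -1]                      # dominant direction, unit norm

# Step 2: minimum-variance filter that passes a_hat and suppresses other paths.
R_all = X @ X.conj().T / n_frames
Ri_a = np.linalg.solve(R_all, a_hat)
w = Ri_a / np.vdot(a_hat, Ri_a)
y = w.conj() @ X                            # separated target, one value per frame

def similarity(u, v):
    """Magnitude of the normalized inner product: 1.0 means identical up to a scale."""
    return abs(np.vdot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))

print("estimated vs. true transfer vector:", similarity(a_hat, a_target))
print("one microphone vs. target speech:  ", similarity(X[0], s_target))
print("separated output vs. target speech:", similarity(y, s_target))
```

A full system would repeat this for every frequency band and convert the result back to audio. The transfer vector here is recovered only up to an overall scale and phase, which is why the comparison uses a normalized similarity rather than a direct difference.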
Sound file: McElveen_Before_Then_Near_Then_Far_Talkers.wav. Original cocktail party recording, followed by the separated nearest talker, then the separated farthest talker.
While prior approaches have struggled to equal human hearing in a realistic cocktail party babel, even at close distances, the results presented here imply that it is now possible not only to equal but to exceed human hearing and solve the Cocktail Party Problem, even with a small number of microphones in no particular arrangement.
The many implications of this research include improved conference call systems, hearing aids, automotive voice command systems, and other voice assistants – such as smart speakers. Our future research plans include further testing as well as devising intuitive user interfaces that can take full advantage of this capability.
No one knows exactly how human hearing solves the Cocktail Party Problem, but it would be very interesting indeed if it were found to use its own version of a Green’s Function.