Finding the Right Tools to Interpret Crowd Noise at Sporting Events with AI

Jason Bickmore – jbickmore17@gmail.com

Instagram: @jason.bickmore
Brigham Young University, Department of Physics and Astronomy, Provo, Utah, 84602, United States

Popular version of 1aCA4 – Feature selection for machine-learned crowd reactions at collegiate basketball games
Presented at the 188th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0037270

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

A mixture of traditional and custom tools is enabling AI to make sense of an unexplored frontier: crowd noise at sporting events.

The unique link between a crowd’s emotional state and its sound makes crowd noise a promising way to capture feedback about an event continuously and in real time. Transformed into feedback, crowd noise would help venues improve the experience for fans, sharpen advertisements, and support safety.

To capture this feedback, we turned to machine learning, a popular strategy for making tricky connections. While the tools required to teach AI to interpret speech from a single person are well-understood (think Siri), the tools required to make sense of crowd noise are not.

To find the best tools for this job, we began with a simpler task: teaching an AI model to recognize applause, chanting, distracting the other team, and cheering at college basketball and volleyball games (Fig. 1).

Figure 1: Machine learning identifies crowd behaviors from crowd noise. We helped machine learning models recognize four behaviors: applauding, chanting, cheering, and distracting the other team. Image courtesy of byucougars.com.

We began with a large list of tools, called features, some drawn from traditional speech processing and others created using a custom strategy. After applying five methods to eliminate all but the most powerful features, a blend of traditional and custom features remained. A model trained with these features recognized the four behaviors with at least 70% accuracy.
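To make the idea concrete, here is a minimal sketch of what such a pipeline could look like in Python. The specific features, selection method, and classifier shown here are illustrative assumptions (using the librosa and scikit-learn libraries), not the exact tools used in the study.

```python
# Hypothetical sketch: blending "traditional" speech features with simple custom
# descriptors, keeping only the most informative ones, and training a classifier.
# None of these choices are claimed to match the study's actual pipeline.
import numpy as np
import librosa
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def clip_features(path, sr=22050):
    """Extract a small mixed feature vector from one crowd-noise clip."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)    # traditional speech features
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()    # traditional: brightness
    rms = librosa.feature.rms(y=y).mean()                              # custom-style: overall loudness
    flatness = librosa.feature.spectral_flatness(y=y).mean()           # custom-style: noisiness
    return np.hstack([mfcc, centroid, rms, flatness])

# clips: list of audio file paths; labels: matching behavior names such as
# "applause", "chant", "cheer", "distraction" (assumed to come from your own labeled data).
X = np.vstack([clip_features(p) for p in clips])
y = np.array(labels)

# Keep only the most informative features, then train and score a classifier.
model = make_pipeline(SelectKBest(f_classif, k=8),
                      RandomForestClassifier(n_estimators=200, random_state=0))
print(cross_val_score(model, X, y, cv=5).mean())   # mean accuracy across folds
```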

Based on these results, we concluded that, when interpreting crowd noise, both traditional and custom features have a place. Even though crowd noise is not the situation the traditional tools were designed for, they are still valuable. The custom tools are useful too, complementing the traditional tools and sometimes outperforming them. The tools’ success at recognizing the four behaviors indicates that a similar blend of traditional and custom tools could enable AI models to navigate crowd noise well enough to translate it into real-time feedback. In future work, we will investigate the robustness of these features by checking whether they enable AI to recognize crowd behaviors equally well at events other than college basketball and volleyball games.

Enhancing Speech Recognition in Healthcare

Andrzej Czyzewski – andczyz@gmail.com

Gdańsk University of Technology, Faculty of Electronics, Telecommunications and Informatics, Multimedia Systems Department, Gdańsk, Pomerania, 80-233, Poland

Popular version of 1aSP6 – Strategies for Preprocessing Speech to Enhance Neural Model Efficiency in Speech-to-Text Applications
Presented at the 187th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0034984

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–


Effective communication in healthcare is essential, as accurate information can directly impact patient care. This paper discusses research aimed at improving speech recognition technology to help medical professionals document patient information more effectively. By using advanced techniques, we can make speech-to-text systems more reliable for healthcare, ensuring they accurately capture spoken information.

In healthcare settings, professionals often need to quickly and accurately record patient interactions. Traditional typing can be slow and error-prone, while speech recognition allows doctors to dictate notes directly into electronic health records (EHRs), saving time and reducing miscommunication.

The main goal of our research was to test various ways of enhancing speech-to-text accuracy in healthcare. We compared several methods to help the system understand spoken language more clearly. These methods included different ways of analyzing sound, like looking at specific sound patterns or filtering background noise.

In this study, we recorded around 80,000 voice samples from medical professionals. These samples were then processed to highlight important speech patterns, making it easier for the system to learn and recognize medical terms. We used a method called Principal Component Analysis (PCA) to keep the data simple while ensuring essential information was retained.
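As a rough illustration of the dimensionality-reduction step (the matrix size and variance threshold below are made-up stand-ins, not the study's values), PCA can be applied in a few lines of Python:

```python
# Illustrative sketch of using PCA to compress a large table of speech features.
import numpy as np
from sklearn.decomposition import PCA

# One row per recording, one column per acoustic measurement.
features = np.random.rand(80000, 300)     # stand-in for the real feature matrix

pca = PCA(n_components=0.95)              # keep enough components to explain 95% of the variance
compact = pca.fit_transform(features)
print(compact.shape)                      # far fewer columns, most of the information retained
```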

Our findings showed that combining several techniques for capturing speech patterns improved system performance, on average reducing both word-level and character-level recognition errors.
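Those errors are usually scored with the word error rate (WER) and character error rate (CER). The small sketch below shows the standard WER calculation; it is the conventional definition, not code taken from the study.

```python
# Sketch of the standard word error rate (WER); the character error rate (CER)
# is the same computation applied to characters instead of words.
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions to turn ref into hyp."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

print(wer("patient reports acute chest pain", "patient report acute chest pain"))  # 0.2
```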

The potential benefits of this work are significant:

  • Smoother documentation: Medical staff can record notes more efficiently, freeing up time for patient care.
  • Improved accuracy: Patient records become more reliable, reducing the chance of miscommunication.
  • Better healthcare outcomes: Enhanced communication can improve the quality of care.

This study highlights the promise of advanced speech recognition in healthcare. With further development, these systems can support medical professionals in delivering better patient care through efficient and accurate documentation.

Figure 1. Front page of the ADMEDVOICE corpus, containing medical texts and their spoken equivalents

Artificial intelligence in music production: controversy & opportunity

Joshua Reiss – joshua.reiss@qmul.ac.uk
Twitter: @IntelSoundEng

Queen Mary University of London, Mile End Road, London, England, E1 4NS, United Kingdom

Popular version of 3aSP1 – Artificial intelligence in music production: controversy and opportunity
Presented at the 183rd ASA Meeting

Music production
In music production, one typically has many sources. Each needs to be heard simultaneously, yet they can all be created in different ways, in different environments and with different attributes. The mix should keep every source distinct while blending them into a clean, cohesive whole. Achieving this is labour-intensive and requires a professional engineer. Modern production systems help, but they’re incredibly complex and still require manual manipulation. As the technology has grown, it has become more functional but not simpler for the user.

Intelligent music production
Intelligent systems could analyse all the incoming signals and determine how they should be modified and combined. This has the potential to revolutionise music production, in effect putting a robot sound engineer inside every recording device, mixing console or audio workstation. Could this be achieved? This question gets to the heart of what is art and what is science, of the role of the music producer, and of why we prefer one mix over another.

Figure 1: The architecture of an automatic mixing system. [Image courtesy of the author]
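In spirit, the loop in Figure 1 is: analyse each incoming track, decide on control settings, apply them, and sum the result. The toy Python sketch below does this with nothing more than a loudness-matching gain per track; it illustrates the idea, not the system in the figure.

```python
# Toy automatic mixer: measure each source's loudness, derive a gain that brings
# it to a common level, apply the gains and sum. Real systems analyse and control
# far more than level, but the analyse-decide-apply loop is the same.
import numpy as np

def auto_mix(tracks, target_rms=0.1):
    """tracks: list of 1-D numpy arrays with the same length and sample rate."""
    gains = [target_rms / (np.sqrt(np.mean(t ** 2)) + 1e-12) for t in tracks]
    mix = sum(g * t for g, t in zip(gains, tracks))
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix     # avoid clipping the summed mix

# Two synthetic "sources": a quiet vocal-like tone and a loud guitar-like tone.
sr = 44100
t = np.arange(sr) / sr
quiet_vocal = 0.01 * np.sin(2 * np.pi * 220 * t)
loud_guitar = 0.5 * np.sin(2 * np.pi * 110 * t)
mix = auto_mix([quiet_vocal, loud_guitar])       # both now contribute at a similar level
```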

Perception of mixing
But there is little understanding of how we perceive audio mixes. Almost all studies have been restricted to lab conditions, like measuring the perceived level of a tone in the presence of background noise. This tells us very little about real-world cases. It doesn’t say how well one can hear the lead vocals over guitar, bass and drums.

Best practices
And we don’t know why one production will sound dull while another makes you laugh and cry, even though both are of the same piece of music, performed by competent sound engineers. So we needed to establish what good production is, how to translate it into rules, and how to exploit those rules within algorithms. We needed to step back and explore more fundamental questions, filling gaps in our understanding of production and perception.

Knowledge engineering
We used an approach that incorporated one of the earliest machine learning methods, knowledge engineering. It’s so old school that it’s gone out of fashion. It assumes experts have already figured things out; they are experts, after all. So let’s capture best practices as a set of rules and processes. But this is no easy task. Most sound engineers don’t know what they did. Ask a famous producer what he or she did on a hit song and you often get an answer like “I turned the knob up to 11 to make it sound phat.” How do you turn that into a mathematical equation? Or worse, they say it was magic and can’t be put into words.
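To give a feel for what “capturing a rule” means, here is a made-up example in Python. The rule itself (keep low-frequency sources centred in the stereo image) is a common rule of thumb chosen for illustration, not a rule taken from our system.

```python
# Knowledge engineering in miniature: one mixing rule of thumb written down as code.
# The rule is an illustrative example, not one extracted from the study.
def panning_rule(tracks):
    """Keep low-frequency sources centred; alternate the rest left and right."""
    pans = {}
    side = -0.5
    for name, centre_freq_hz in tracks.items():
        if centre_freq_hz < 200:       # bass and kick drum anchor the centre
            pans[name] = 0.0
        else:                          # spread everything else to reduce overlap
            pans[name] = side
            side = -side
    return pans

print(panning_rule({"kick": 60, "bass": 90, "guitar": 800, "vocal": 1000}))
# {'kick': 0.0, 'bass': 0.0, 'guitar': -0.5, 'vocal': 0.5}
```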

We systematically tested all the assumptions about best practices and supplemented them with listening tests that helped us understand how people perceive complex sound mixtures. We also curated multitrack audio, with detailed information about how it was recorded, multiple mixes and evaluations of those mixes.

This enabled us to develop intelligent systems that automate much of the music production process.

Video Caption: An automatic mixing system based on a technology we developed.

Transformational impact
I gave a talk about this once in a room that had panel windows all around. These talks are usually half full. But this time it was packed, and I could see faces outside pressed up against the windows. They all wanted to find out about this idea of automatic mixing. It’s a unique opportunity for academic research to have transformational impact on an entire industry. It addresses the fact that music production technologies are often not fit for purpose. Intelligent systems open up new opportunities. Amateur musicians can create high quality mixes of their content, small venues can put on live events without needing a professional engineer, time and preparation for soundchecks could be drastically reduced, and large venues and broadcasters could significantly cut manpower costs.

Taking away creativity
It’s controversial. We entered an automatic mix in a student recording competition as a sort of Turing Test. Technically we cheated, because the mixes were supposed to be made by students, not by an ‘artificial intelligence’ (AI) created by a student. Afterwards I asked the judges what they thought of the mix. The first two were surprised and curious when I told them how it was done. The third judge offered useful comments while he thought it was a student mix. But when I told him that it was an ‘automatic mix’, he suddenly switched and said it was rubbish and he could tell all along.

Mixing is a creative process where stylistic decisions are made. Is this taking away creativity? Is it taking away jobs? Such questions come up time and time again with new technologies, going back to the 19th-century protests by the Luddites, textile workers who feared that the time spent on their skills and craft would be wasted as machines replaced their role in industry.

Not about replacing sound engineers
These are valid concerns, but it’s important to see other perspectives. A tremendous amount of music production work is technical, and audio quality would be improved by addressing these problems. As the graffiti artist Banksy said, “All artists are willing to suffer for their work. But why are so few prepared to learn to draw?”

Creativity still requires technical skills. To achieve something wonderful when mixing music, you first have to achieve something pretty good and address issues with masking, microphone placement, level balancing and so on.

Video Caption: Time offset (comb filtering) correction, a technical problem in music production solved by an intelligent system.
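One common way to attack that particular problem is to estimate the delay between two microphone signals by cross-correlation and then shift the earlier signal to match the later one. The sketch below shows the estimation step on synthetic signals; it illustrates the principle, not the system shown in the video.

```python
# Rough sketch: estimate the time offset between two microphone signals via
# cross-correlation. Delaying the earlier signal by this amount before summing
# removes the comb filtering their misaligned sum would otherwise produce.
import numpy as np

def estimate_offset(reference, delayed):
    """Samples by which `delayed` lags behind `reference` (positive = later)."""
    corr = np.correlate(delayed, reference, mode="full")
    return int(np.argmax(corr)) - (len(reference) - 1)

rng = np.random.default_rng(0)
source = rng.standard_normal(4000)                        # short wide-band test signal
delayed = np.concatenate([np.zeros(37), source])[:4000]   # second mic hears it 37 samples later

print(estimate_offset(source, delayed))                   # 37
```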

The real benefit is not replacing sound engineers. It’s dealing with all those situations where a talented engineer is not available: the band practicing in the garage, the small restaurant venue that does not provide any support, or game audio, where dozens of sounds need to be mixed and there is no miniature sound engineer living inside the games console.