Hear This! Transforming Health Care with Speech-to-Text Technology #ASA187

Researchers study the importance of enunciation in medical speech-to-text software

Media Contact:
AIP Media
301-209-3090
media@aip.org

MELVILLE, N.Y., Nov. 21, 2024 – Speech-to-text (STT) programs are becoming more popular for everyday tasks like hands-free dictation, helping people who are visually impaired, and transcribing speech for those who are hard of hearing. These tools have many uses, and researcher Bożena Kostek from Gdańsk University of Technology is exploring how STT can be better used in the medical field. By studying how clear speech affects STT accuracy, she hopes to improve its usefulness for health care professionals.

“Automating note-taking for patient data is crucial for doctors and radiologists, as it gives the doctors more face-to-face time with patients and allows for better data collection,” Kostek says.

Enunciation may have a crucial role to play in the accuracy of medical record dictation. This image was created with DALL-E 2. Credit: Bozena Kostek

Kostek also explains the challenges they face in this work.

“STT models often struggle with medical terms, especially in Polish, since many have been trained mainly on English. Also, most resources focus on simple language, not specialized medical vocabulary. Noisy hospital environments make it even harder, as health care providers may not speak clearly due to stress or distractions.”

To tackle these issues, a detailed audio dataset was created with Polish medical terms spoken by doctors and specialists in areas like cardiology and pulmonology. This dataset was analyzed using an Automatic Speech Recognition model, technology that converts speech into text, for transcription. Several metrics, such as Word Error Rate and Character Error Rate, were used to evaluate the quality of the speech recognition. This analysis helps understand how speech clarity and style affect the accuracy of STT.
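As a rough illustration of how these metrics work (not the exact evaluation pipeline used in the study), the Word Error Rate counts the minimum number of word substitutions, insertions, and deletions needed to turn the recognized transcript into the reference transcript, divided by the number of words in the reference; the Character Error Rate applies the same idea to characters. A minimal sketch in Python:

```python
# Illustrative sketch of Word Error Rate (WER): the minimum number of word
# substitutions, insertions, and deletions needed to turn the ASR output into
# the reference transcript, divided by the number of reference words.
# Character Error Rate (CER) is the same calculation applied to characters.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# Example with a Polish medical phrase ("myocardial infarction"):
print(wer("zawał mięśnia sercowego", "zawał mięśnia serca"))  # 1/3 ≈ 0.33
```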

Kostek will present this data Thursday, Nov. 21, at 3:25 p.m. ET as part of the virtual 187th Meeting of the Acoustical Society of America, running Nov. 18-22, 2024.

“Medical jargon can be tricky, especially with abbreviations that differ across specialties. This is an even more difficult task when we refer to realistic hospital situations in which the room is not acoustically prepared,” Kostek says.

Currently, the focus is on Polish, but there are plans to expand the research to other languages, like Czech. Collaborations are being established with the University Hospital in Brno to develop medical term resources, aiming to enhance the use of STT technology in health care.

“Even though artificial intelligence is helpful in many situations, many problems should be investigated analytically rather than holistically, focusing on breaking a whole picture into individual parts.”

———————– MORE MEETING INFORMATION ———————–
Main Meeting Website: https://acousticalsociety.org/asa-virtual-fall-2024/
Technical Program: https://eppro01.ativ.me/src/EventPilot/php/express/web/planner.php?id=ASAFALL24

ASA PRESS ROOM
In the coming weeks, ASA’s Press Room will be updated with newsworthy stories and the press conference schedule at https://acoustics.org/asa-press-room/.

LAY LANGUAGE PAPERS
ASA will also share dozens of lay language papers about topics covered at the conference. Lay language papers are summaries (300-500 words) of presentations written by scientists for a general audience. They will be accompanied by photos, audio, and video. Learn more at https://acoustics.org/lay-language-papers/.

PRESS REGISTRATION
ASA will grant free registration to credentialed and professional freelance journalists. If you are a reporter and would like to attend the virtual meeting and/or press conferences, contact AIP Media Services at media@aip.org. For urgent requests, AIP staff can also help with setting up interviews and obtaining images, sound clips, or background information.

ABOUT THE ACOUSTICAL SOCIETY OF AMERICA
The Acoustical Society of America is the premier international scientific society in acoustics devoted to the science and technology of sound. Its 7,000 members worldwide represent a broad spectrum of the study of acoustics. ASA publications include The Journal of the Acoustical Society of America (the world’s leading journal on acoustics), JASA Express Letters, Proceedings of Meetings on Acoustics, Acoustics Today magazine, books, and standards on acoustics. The society also holds two major scientific meetings each year. See https://acousticalsociety.org/.

Shhh! Smart Tech at Work: Zoning in on Target Sounds Amid the Noise

Jingya Yang – jing.ya161@gmail.com

Department of Power Mechanical Engineering, National Tsing Hua University, Hsinchu 300, Taiwan

Popular version of 1aSP2 – Target-Direction Sound Extraction Using a Hybrid DSP/Deep Learning Approach
Presented at the 187th ASA Meeting
Read the abstract at https://eppro01.ativ.me//web/index.php?page=IntHtml&project=ASAFALL24&id=3771518

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–


In a noisy world, capturing clear audio from specific directions can be a game-changer. Imagine a system that can zero in on a target sound, even amid background noise. This is the goal of Target Directional Sound Extraction (TDSE), a process designed to isolate sounds from a particular direction, while filtering out unwanted noise.

Our team has developed an innovative TDSE system that combines Digital Signal Processing (DSP) and deep learning. Traditional sound extraction relies on signal processing, but it struggles when multiple sounds come from various directions or when using fewer microphones. Deep learning can help, but it sometimes results in distorted audio. By integrating DSP-based spatial filtering with a deep neural network (DNN), our system extracts clear target audio with minimal interference, even with limited microphones.

The system relies on spatial filtering techniques like beamforming and blocking. Beamforming serves as a signal estimator, enhancing sounds from the target direction, while blocking acts as a noise estimator, suppressing sounds from the target direction and leaving other unwanted noises intact. Using a deep learning model, our system processes spatial features and sound embeddings (unique characteristics of the target sound), yielding clear, isolated audio. In our tests, this method improved sound quality by 3-9 dB and performed well with different microphone setups, even those not used during training.
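The full system architecture is described in the meeting presentation; the sketch below shows only the general idea of the DSP front end, using a simple delay-and-sum beamformer that aligns and averages the microphone signals toward an assumed target direction. The array geometry, sampling rate, and steering angle are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

# Minimal delay-and-sum beamformer for a uniform linear microphone array.
# All parameters (spacing, sampling rate, steering angle) are illustrative
# assumptions and do not reproduce the authors' hybrid DSP/DNN system.

C = 343.0           # speed of sound in air, m/s
FS = 16000          # sampling rate, Hz
MIC_SPACING = 0.05  # 5 cm between adjacent microphones

def delay_and_sum(signals: np.ndarray, angle_deg: float) -> np.ndarray:
    """signals: (num_mics, num_samples); angle_deg: target direction measured
    from broadside of the array. Returns the enhanced (spatially filtered) signal."""
    num_mics, num_samples = signals.shape
    out = np.zeros(num_samples)
    for m in range(num_mics):
        # Extra travel time to microphone m relative to microphone 0
        delay_sec = m * MIC_SPACING * np.sin(np.deg2rad(angle_deg)) / C
        delay_samples = int(round(delay_sec * FS))
        # Advance each channel so the target direction adds up coherently
        # (np.roll wraps a few samples at the edges; acceptable for a sketch)
        out += np.roll(signals[m], -delay_samples)
    return out / num_mics

# A "blocking" branch (noise estimate) can be built by subtracting pairs of
# aligned channels instead of summing them, which cancels the target direction.
```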

Audio 1 & Audio 2

TDSE could transform various industries, from virtual meetings to entertainment, by enhancing audio clarity in real time. Our system’s design offers flexibility, making it adaptable for real-world applications where clear directional audio is crucial.

This approach is an exciting step toward more robust, adaptive audio processing systems, allowing users to capture target sounds even in challenging environments.

The Trump Rally Shooting: Listening to an Assassination Attempt

Robert C Maher – rmaher@montana.edu

Montana State University, Electrical and Computer Engineering Department, PO Box 173780, Bozeman, MT, 59717-3780, United States

Popular version of 3pSP10 – Interpreting user-generated recordings from the Trump assassination attempt on July 13, 2024
Presented at the 187th ASA Meeting
Read the abstract at https://eppro01.ativ.me//web/index.php?page=IntHtml&project=ASAFALL24&id=3771549

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–


On Saturday, July 13, 2024, thousands of supporters attended an outdoor rally held by presidential candidate Donald J. Trump at the Butler Farm Show grounds in Butler, Pennsylvania. Shortly after Mr. Trump began speaking, gunshots rang out. Several individuals in the crowd were seriously wounded, and one was killed.

While the gunfire was clearly audible to thousands at the scene—and soon millions online—many significant details of the incident could only be discerned by the science of audio forensic analysis. More than two dozen mobile phone videos from rally attendees provided an unprecedented amount of relevant audio information for quick forensic analysis. Audio forensic science identified a total of ten gunshots: eight shots from a single location later determined to be the perpetrator’s perch, and two shots from law enforcement rifles.

In our era of rapid spread of speculative rumors on the internet, the science of audio forensics was critically important in quickly documenting and confirming the actual circumstances from the Trump rally scene.

 

Where did the shots come from?

Individuals near the stage described hearing pop pop pop noises that they reported to be “small-arms fire.” However, scientific audio forensic examination of the audio picked up by the podium microphones immediately revealed that the gunshot sounds were not small-arms fire as the earwitnesses had reported, but instead showed the characteristic sounds of supersonic bullets from a rifle.

When a bullet travels faster than sound, it creates a small sonic boom that moves with the bullet as it travels down range. A microphone near the bullet’s path will pick up the “crack” of the bullet passing by, and then a fraction of a second later, the familiar “bang” of the gun’s muzzle blast arrives at the microphone (see Figure 1).

Figure 1: Sketch depicting the position of the supersonic bullet’s shock wave and the firearm’s muzzle blast.

 

From the Trump rally, audio forensic analysis of the first audible shots in the podium microphone recording showed the “crack” sound due to the supersonic bullet passing the microphone, followed by the “bang” sound of the firearm’s muzzle blast. Only a small fraction of a second separated the “crack” and the “bang” for each audible shot, but the audio forensic measurement of those tiny time intervals (see Figure 2) was sufficient to estimate that the shooter was 130 meters from the microphone—a little more than the length of a football field away. The acoustic prediction was soon confirmed when the body of the presumed perpetrator was found on a nearby rooftop, precisely that distance from the podium.

Figure 2: Stereo audio waveform and spectrogram from podium microphone recording showing the first three shots (A, B, C), with manual annotation.
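The geometry behind that distance estimate can be sketched with a simple back-of-the-envelope calculation: the muzzle blast travels to the microphone at the speed of sound, while the supersonic bullet covers roughly the same distance much faster, so the gap between “crack” and “bang” grows with distance. The bullet speed and time gap below are illustrative assumptions, not the values measured from the rally recordings.

```python
# Simplified "crack-bang" geometry (illustrative numbers, not case data).
# The muzzle blast ("bang") travels the distance D at the speed of sound c.
# The shock wave ("crack") arrives roughly when the bullet, moving at speed v,
# has covered the same distance, so the measured gap is
#     dt = D/c - D/v   =>   D = dt * c * v / (v - c)

c = 343.0   # speed of sound in air, m/s (assumed)
v = 900.0   # assumed average bullet speed for a rifle round, m/s
dt = 0.24   # hypothetical crack-to-bang gap, in seconds

D = dt * c * v / (v - c)
print(f"Estimated shooter distance: {D:.0f} m")  # about 130 m for these numbers
```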

 

How many shots were fired?

The availability of nearly two dozen video and audio recordings of the gunfire from bystanders at locations all around the venue offered a remarkable audio forensic opportunity, and our audio forensic analysis identified a total of ten gunshots, labeled A-J in Figure 3.

Figure 3: User-generated mobile phone recording from a location near the sniper’s position, showing the ten audible gunshots.

 

The audio forensic analysis revealed that the first eight shots (labeled A-H) came from the identified perpetrator’s location, because all the available recordings gave the same time sequence between each of those first eight shots. This audio forensic finding was confirmed later when officials released evidence that eight spent shell casings had been recovered from the perpetrator’s location on the rooftop.
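A simplified illustration of that reasoning: recordings made at different places capture the shots at different absolute times, but the gaps between successive shots from a single, fixed source should agree closely across all recordings. The timestamps in the sketch below are invented for illustration only.

```python
import numpy as np

# Invented arrival times (seconds) of the same five shots as heard in two
# hypothetical recordings made at different spots around the venue.
recording_1 = np.array([0.00, 0.35, 0.71, 1.42, 1.80])
recording_2 = np.array([2.10, 2.45, 2.81, 3.52, 3.90])

# Inter-shot intervals do not depend on when each recording started or on the
# (fixed) propagation delay to each recorder, so for a single shooting
# position they should match across recordings.
intervals_1 = np.diff(recording_1)
intervals_2 = np.diff(recording_2)

print(np.allclose(intervals_1, intervals_2, atol=0.01))  # True: consistent
```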

Comparing the multiple audio recordings, the two additional audible shots (I and J) did not come from the perpetrator’s location, but from two different locations. Audio forensic analysis placed shot “I” as coming from a location northeast of the podium. Matching the audio forensic analysis, officials later confirmed that shot “I” came from a law enforcement officer firing toward the perpetrator from the ground northeast of the podium. The final audible shot “J” came from a location south of the podium. Again, consistent with the audio forensic analysis, officials confirmed that shot “J” was the fatal shot at the perpetrator by a Secret Service counter-sniper located on the roof of a building southeast of the podium.

Analysis of sounds from the Trump rally accurately described the location and characteristics of the audible gunfire, and helped limit the spread of rumors and speculation after the incident. While the unique audio forensic viewpoint cannot answer every question, this incident demonstrated that many significant details of timing, sound identification, and geometric orientation can be discerned and documented using the science of audio forensic analysis.

Please feel free to contact the author for more information.

Enhancing Speech Recognition in Healthcare

Andrzej Czyzewski – andczyz@gmail.com

Gdańsk University of Technology, Faculty of Electronics, Telecommunications and Informatics, Multimedia Systems Department, Gdańsk, Pomerania, 80-233, Poland

Popular version of 1aSP6 – Strategies for Preprocessing Speech to Enhance Neural Model Efficiency in Speech-to-Text Applications
Presented at the 187th ASA Meeting
Read the abstract at https://eppro01.ativ.me//web/index.php?page=IntHtml&project=ASAFALL24&id=3771522

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–


Effective communication in healthcare is essential, as accurate information can directly impact patient care. This paper discusses research aimed at improving speech recognition technology to help medical professionals document patient information more effectively. By using advanced techniques, we can make speech-to-text systems more reliable for healthcare, ensuring they accurately capture spoken information.

In healthcare settings, professionals often need to quickly and accurately record patient interactions. Traditional typing can be slow and error-prone, while speech recognition allows doctors to dictate notes directly into electronic health records (EHRs), saving time and reducing miscommunication.

The main goal of our research was to test various ways of enhancing speech-to-text accuracy in healthcare. We compared several methods to help the system understand spoken language more clearly. These methods included different ways of analyzing sound, like looking at specific sound patterns or filtering background noise.

In this study, we recorded around 80,000 voice samples from medical professionals. These samples were then processed to highlight important speech patterns, making it easier for the system to learn and recognize medical terms. We used a method called Principal Component Analysis (PCA) to keep the data simple while ensuring essential information was retained.
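As a rough sketch of this kind of dimensionality reduction (the actual features and settings used in the study are not reproduced here), PCA projects each high-dimensional acoustic feature vector onto a smaller set of directions that retain most of the variance in the data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative only: random stand-ins for acoustic feature vectors
# (e.g., spectral features extracted from the recorded voice samples).
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 120))  # 1000 samples, 120 features each

# Keep just enough principal components to retain 95% of the variance.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(features)

print(features.shape, "->", reduced.shape)
print("variance retained:", round(pca.explained_variance_ratio_.sum(), 3))
```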

Our findings showed that combining several techniques to capture speech patterns improved system performance. We saw an average accuracy improvement, with fewer word and character recognition errors.

The potential benefits of this work are significant:

  • Smoother documentation: Medical staff can record notes more efficiently, freeing up time for patient care.
  • Improved accuracy: Patient records become more reliable, reducing the chance of miscommunication.
  • Better healthcare outcomes: Enhanced communication can improve the quality of care.

This study highlights the promise of advanced speech recognition in healthcare. With further development, these systems can support medical professionals in delivering better patient care through efficient and accurate documentation.

 

Figure 1. Front page of the ADMEDVOICE corpus containing medical texts and their spoken equivalents

Unlocking the Secrets of Ocean Dynamics: Insights from ALMA

Florent Le Courtois – florent.lecourtois@gmail.com

DGA Tn, Toulon, Var, 83000, France

Samuel Pinson, École Navale, Rue du Poulmic, 29160 Lanvéoc, France
Victor Quilfen, Shom, 13 Rue de Châtellier, 29200 Brest, France
Gaultier Real, CMRE, Viale S. Bartolomeo, 400, 19126 La Spezia, Italy
Dominique Fattaccioli, DGA Tn, Avenue de la Tour Royale, 83000 Toulon, France

Popular version of 4aUW7 – The Acoustic Laboratory for Marine Applications (ALMA) applied to fluctuating environment analysis
Presented at the 186th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0027503

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

Ocean dynamics happen at various spatial and temporal scales. They cause the displacement and mixing of water bodies of different temperatures. Acoustic propagation is strongly affected by these fluctuations, as sound speed depends mainly on the underwater temperature. Monitoring underwater acoustic propagation and its fluctuations remains a scientific challenge, especially at mid-frequency (typically on the order of 1 to 10 kHz). Dedicated measurement campaigns have to be conducted to better understand the fluctuations and their impact on acoustic propagation, and thus to develop appropriate localization processing.

The Acoustic Laboratory for Marine Applications (ALMA) was proposed by the French MOD Procurement Agency (DGA) in 2014 to conduct research for passive and active sonar, in support of future sonar array design and processing. Since its inception, ALMA has undergone remarkable transformations, evolving from a modest array of hydrophones to a sophisticated system equipped with 192 hydrophones and advanced technology. With each upgrade, ALMA’s capabilities have expanded, allowing us to delve deeper into the secrets of the sea.


Figure 1. Evolution of the ALMA array configuration, from 2014 to 2020. Real and Fattaccioli, 2018

A bulletin of sea temperature to understand acoustic propagation
The 2016 campaign took place Nov. 7–17, 2016, off the western coast of Corsica in the Mediterranean Sea, at the location marked by the blue dot in Fig. 2 (around 42.4°N and 9.5°E). We analyzed signals from a controlled acoustic source and temperature recordings, corresponding to approximately 14 hours of data.

Figure 2. Map of surface temperature during the campaign. Heavy rains in the previous days caused a vortex north of Corsica. Pinson et al., 2022

The map of sea temperature during the campaign was computed; it is similar to a weather bulletin for the sea. Heavy rains in the previous days caused a general cooling over the area. A vortex appeared in the Ligurian Sea between Italy and the north of Corsica. The cold waters then traveled southward along Corsica’s western coast to reach the measurement area. The cooling was also measured by the thermometers. The main objective was to understand the changes in the echo pattern in relation to the temperature change. Echoes can characterize the acoustic paths: we are mainly interested in the amplitude, the time of travel, and the angle of arrival of echoes to describe the acoustic path between the source and the ALMA array.

All echoes extracted by processing ALMA data are plotted as dots in 3D. They depend on the time during the campaign, the angle of arrival, and the time of flight. The loudness of each echo is indicated by the color scale. The 3D image is sliced in Fig. 3 a), b), and c) for better readability. The directions of the last reflection are estimated in Fig. 3 a): positive angles come from the surface reflection, while negative angles come from seabed reflection. The global cooling of the waters caused a slowly increasing fluctuation of the time of flight between the source and the array in Fig. 3 b). A surprising result was a group of spooky arrivals, which appeared briefly during the campaign, between 3 and 12 AM, at an angle close to 0° in Fig. 3 b) and c).


Figure 3. Evolution of the acoustic paths during the campaign. Each path is a dot defined by the time of flight and the angle of arrival during the period of the campaign. Pinson et al., 2022

The acoustic paths were computed using the bulletin of sea temperature. A more focused map of the depth of separation between cold and warm waters, also called the mixing layer depth (MLD), is plotted in Fig. 4. We noticed that, when the mixing layer depth is below the depth of the source, the cooling causes acoustic paths to be trapped by bathymetry in the lower part of the water column. This explains the appearance of the spooky echoes. Trapped paths are plotted in blue and regular paths in black in Fig. 5.

Figure 4. Evolution of the depth of separation between cold and warm water during the campaign. Pinson et al., 2022

Figure 5. Example of acoustic paths in the area: black lines indicate regular propagation of the sound; blue lines indicate the trapped paths of the spooky echoes. Pinson et al., 2022

Overview
The ALMA system and the associated tools made it possible to illustrate practical ocean acoustics phenomena. ALMA has been deployed in five campaigns, representing 50 days at sea, mostly in the Western Mediterranean Sea, but also in the Atlantic to tackle other complex physical problems.

Tools for shaping the sound of the future city in virtual reality

Christian Dreier – cdr@akustik.rwth-aachen.de

Institute for Hearing Technology and Acoustics
RWTH Aachen University
Aachen, North Rhine-Westphalia 52064
Germany

– Christian Dreier (lead author, LinkedIn: Christian Dreier)
– Rouben Rehman
– Josep Llorca-Bofí (LinkedIn: Josep Llorca Bofí, X: @Josepllorcabofi, Instagram: @josep.llorca.bofi)
– Jonas Heck (LinkedIn: Jonas Heck)
– Michael Vorländer (LinkedIn: Michael Vorländer)

Popular version of 3aAAb9 – Perceptual study on combined real-time traffic sound auralization and visualization
Presented at the 186th ASA Meeting
Read the abstract at https://doi.org/10.1121/10.0027232

–The research described in this Acoustics Lay Language Paper may not have yet been peer reviewed–

“One man’s noise is another man’s signal.” This famous quote by Edward Ng from a 1990s New York Times article captures a major lesson from noise research. A rule of thumb in noise research states that the community response to noise, when people are asked for “annoyance” ratings, is statistically explained only to about one third by acoustic factors (like the well-known A-weighted sound pressure level, which can be found on household devices as “dB(A)”). Referring to Ng’s quote, another third is explained by non-acoustic, personal, or social variables, whereas the last third cannot be explained according to the current state of research.

Noise reduction in built urban environments is an important goal for urban planners, as noise is not only a cause of cardiovascular disease but also affects learning and work performance in schools and offices. To achieve this goal, a number of solutions are available, ranging from switching to electrified public transport, speed limits, and traffic flow management to masking annoying noise with pleasant sounds, for example fountains.

In our research, we develop a tool for making the sound of virtual urban scenery audible and visible. Visually, the result is comparable to a computer game; the difference is that the acoustic simulation is physics-based, a technique called auralization. The research software “Virtual Acoustics” simulates the entire physical “history” of a sound wave to produce an audible scene: the sonic characteristics of traffic sound sources (cars, motorcycles, aircraft) are modeled, the sound wave’s interaction with different materials at building and ground surfaces is calculated, and human hearing is taken into account.

You might have noticed that a lightning strike sounds dull when far away and bright when close. The same applies to aircraft sound. In a corresponding study, we auralized the sound of an aircraft for different weather conditions. A 360° video compares how the same aircraft typically sounds during summer, autumn, and winter when the acoustic changes due to the weather conditions are taken into account (use headphones for the full experience!).

In another work, we prepared a freely available project template for using Virtual Acoustics. For this, we acoustically and graphically modeled the IHTApark, which is located next to the Institute for Hearing Technology and Acoustics (IHTA): https://www.openstreetmap.org/#map=18/50.78070/6.06680.

In our latest experiment, we focused on the perception of especially annoying traffic sound events. We presented the traffic situations using virtual reality headsets and asked the participants to assess them. How (un)pleasant would the drone be for you during a walk in the IHTApark?