Text-to-Audio Models Make Music from Scratch #ASA183

Much like machine learning can create images from text, it can also generate sounds.

Media Contact:
Ashley Piccone
AIP Media

NASHVILLE, Tenn., Dec. 7, 2022 – Type a few words into a text-to-image model, and you’ll end up with a weirdly accurate, completely unique picture. While this tool is fun to play with, it also opens up avenues of creative application and exploration and provides workflow-enhancing tools for visual artists and animators. For musicians, sound designers, and other audio professionals, a text-to-audio model would do the same.

The algorithm transforms a text prompt into audio. Credit: Zach Evans

As part of the 183rd Meeting of the Acoustical Society of America, Zach Evans, of Stability AI, will present progress toward this end in his talk, “Musical audio samples generated from joint text embeddings.” The presentation will take place on Dec. 7 at 10:45 a.m. Eastern U.S. in the Rail Yard room, as part of the meeting running Dec. 5-9 at the Grand Hyatt Nashville Hotel.

“Text-to-image models use deep neural networks to generate original, novel images based on learned semantic correlations with text captions,” said Evans. “When trained on a large and varied dataset of captioned images, they can be used to create almost any image that can be described, as well as modify images supplied by the user.”

A text-to-audio model would be able to do the same, but with music as the end result. Among other applications, it could be used to create sound effects for video games or samples for music production.

But training these deep learning models is more difficult than training their image counterparts.

“One of the main difficulties with training a text-to-audio model is finding a large enough dataset of text-aligned audio to train on,” said Evans. “Outside of speech data, research datasets available for text-aligned audio tend to be much smaller than those available for text-aligned images.”

Evans and his team, including Belmont University’s Dr. Scott Hawley, have shown early success in generating coherent and relevant music and sound from text. They employed data compression methods to generate the audio with reduced training time and improved output quality.
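The release does not specify the compression scheme, so the following is a purely illustrative numpy sketch of why compressing audio first helps: frames of raw samples are projected onto a few principal components, so a generative model would train on short latent vectors instead of long raw waveforms. All names and numbers here are hypothetical.

```python
import numpy as np

# Purely illustrative: the release does not detail Stability AI's method.
# Idea: compress audio frames to a low-dimensional latent code (here, PCA),
# so a generative model handles far fewer numbers per second of audio.

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(2 * fs) / fs
audio = np.sin(2 * np.pi * 220 * t) + 0.1 * rng.standard_normal(t.size)

frame = 512
frames = audio[: audio.size // frame * frame].reshape(-1, frame)

# PCA via SVD: keep k components as the per-frame "latent code"
mean = frames.mean(axis=0)
U, S, Vt = np.linalg.svd(frames - mean, full_matrices=False)
k = 8
latents = (frames - mean) @ Vt[:k].T      # 512 samples -> 8 numbers per frame
recon = latents @ Vt[:k] + mean           # approximate decompression

err = np.linalg.norm(recon - frames) / np.linalg.norm(frames)
print(f"compression {frame // k}x, relative reconstruction error {err:.3f}")
```

Real systems use learned autoencoders rather than PCA, but the trade-off is the same: a much shorter representation at the cost of a small reconstruction error.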

The researchers plan to expand to larger datasets and release their model as an open-source option for other researchers, developers, and audio professionals to use and improve.

Main meeting website: https://acousticalsociety.org/asa-meetings/
Technical program: https://eppro02.ativ.me/web/planner.php?id=ASAFALL22&proof=true

In the coming weeks, ASA’s Press Room will be updated with newsworthy stories and the press conference schedule at https://acoustics.org/asa-press-room/.

ASA will also share dozens of lay language papers about topics covered at the conference. Lay language papers are 300 to 500 word summaries of presentations written by scientists for a general audience. They will be accompanied by photos, audio, and video. Learn more at https://acoustics.org/lay-language-papers/.

ASA will grant free registration to credentialed and professional freelance journalists. If you are a reporter and would like to attend the meeting or virtual press conferences, contact AIP Media Services at media@aip.org.  For urgent requests, AIP staff can also help with setting up interviews and obtaining images, sound clips, or background information.

The Acoustical Society of America (ASA) is the premier international scientific society in acoustics devoted to the science and technology of sound. Its 7,000 members worldwide represent a broad spectrum of the study of acoustics. ASA publications include The Journal of the Acoustical Society of America (the world’s leading journal on acoustics), JASA Express Letters, Proceedings of Meetings on Acoustics, Acoustics Today magazine, books, and standards on acoustics. The society also holds two major scientific meetings each year. See https://acousticalsociety.org/.

Assessment of road surfaces using sound analysis

Andrzej Czyzewski – andcz@multimed.org

Multimedia Systems, The Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Gdansk, Pomorskie, 80-233, Poland

Jozef Kotus – Multimedia Systems, The Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology
Grzegorz Szwoch – Multimedia Systems, The Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology
Bozena Kostek – Audio Acoustics Lab., Gdansk Univ. of Technology, Gdansk, Poland

Popular version of 3pPAb1-Assessment of road surface state with acoustic vector sensor, presented at the 183rd ASA Meeting.

Have you ever listened to the sound of road vehicles passing by? Perhaps you’ve noticed that the sound differs depending on whether the road surface is dry or wet (for example, after the rain). This observation is the basis of the presented algorithm that assesses the road surface state using sound analysis.

Listen to the sound of a car moving on a dry road.
And this is the sound of a car on a wet road.

A wet road surface not only sounds different, but it also affects road safety for drivers and pedestrians. Knowing the state of the road (dry/wet), it is possible to notify the drivers about dangerous road conditions, for example, using signs displayed on the road.

There are various methods of assessing the road surface. For example, there are optical (laser) sensors, but they are expensive. Therefore, we have decided to develop an acoustic sensor that “listens” to the sound of vehicles moving along the road and determines whether the surface is dry or wet.

The task may seem simple, but we must remember that the sensor records the sound of road vehicles and other environmental sounds (people speaking, aircraft, animals, etc.). Therefore, instead of a single microphone, we use a special acoustic sensor built from six miniature digital microphones mounted on a small cube (10 mm side length). With this sensor, we can select sounds incoming from the road, ignoring sounds from other directions, and also detect the direction in which a vehicle moves.
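The paper does not give the sensor's algorithm, but the building block of such directional sensing is estimating a sound's direction of arrival from the time delay between microphones. A minimal two-microphone sketch (with a hypothetical 0.2 m spacing, not the sensor's 10 mm cube) using cross-correlation:

```python
import numpy as np

# Hypothetical two-microphone example (0.2 m spacing, not the paper's
# six-mic 10 mm cube): estimate direction of arrival from the time
# difference of arrival (TDOA) found by cross-correlation.

fs, c, d = 48000, 343.0, 0.2          # sample rate, speed of sound, spacing
true_angle = 60.0                      # degrees from the array axis
delay_s = d * np.cos(np.radians(true_angle)) / c
delay_n = int(round(delay_s * fs))    # integer-sample delay for the demo

rng = np.random.default_rng(1)
s = rng.standard_normal(fs // 2)      # broadband source (e.g. tire noise)
x1 = s
x2 = np.roll(s, delay_n)              # second mic hears it delay_n later

# Cross-correlate and find the lag of the correlation peak
corr = np.correlate(x2, x1, mode="full")
lag = np.argmax(corr) - (len(x1) - 1)
est_angle = np.degrees(np.arccos(np.clip(lag / fs * c / d, -1, 1)))
print(f"true {true_angle:.0f} deg, estimated {est_angle:.1f} deg")
```

With six microphones on a cube, the same principle yields a full 3-D direction, which is what lets the sensor accept sound from the road and reject other directions.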

Since the sounds of road vehicles moving on dry and wet surfaces differ, frequency analysis of the vehicle sounds is a natural approach.

The figures below present how the sound spectrum changes in time when a vehicle moves on a dry surface (left figure) and a wet surface (right figure). It is evident that, in the case of a wet surface, the spectrum is expanded towards higher frequencies (the upper part of the plot) compared with the dry surface plot. Colors on the plot represent the direction of arrival of the sound generated by a passing vehicle (the angle in degrees), so you can observe how the vehicles moved in relation to the sensor.

Plots of the sound spectrum for cars moving on a dry road (left) and a wet road (right). Color denotes the sound source azimuth. In both cases, two vehicles moving in opposite directions were observed.
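The time-frequency analysis behind such plots is a short-time Fourier transform (spectrogram). A minimal sketch on a synthetic "pass-by" signal (a real recording from the sensor would be loaded instead):

```python
import numpy as np

# Spectrogram sketch: split the signal into overlapping windowed frames
# and take the magnitude spectrum of each frame.

fs = 16000
t = np.arange(fs * 2) / fs
rng = np.random.default_rng(2)
# Toy "pass-by": broadband noise that swells and fades around t = 1 s
signal = rng.standard_normal(t.size) * (0.2 + np.exp(-((t - 1.0) ** 2) / 0.05))

win, hop = 512, 256
window = np.hanning(win)
n_frames = (signal.size - win) // hop + 1
spec = np.empty((n_frames, win // 2 + 1))
for i in range(n_frames):
    frame = signal[i * hop : i * hop + win] * window
    spec[i] = np.abs(np.fft.rfft(frame))  # magnitude spectrum per frame

freqs = np.fft.rfftfreq(win, 1 / fs)
print(spec.shape, freqs[-1])  # (frames, frequency bins), Nyquist frequency
```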

In our algorithm, we have developed a parameter that describes the amount of water on the road. The parameter value is low for a dry surface and increases as the road surface becomes wetter during rainfall.
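The exact formula of the parameter is not given here. A plausible sketch, consistent with the spectra above, is the fraction of spectral energy above a cutoff frequency, which grows as tire spray adds high-frequency content; the function and cutoff below are hypothetical.

```python
import numpy as np

def wetness_ratio(frame, fs, cutoff=4000.0):
    """Hypothetical surface-state measure (not the authors' exact formula):
    fraction of spectral energy above `cutoff` Hz in one audio frame."""
    mag2 = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(frame.size, 1 / fs)
    return mag2[freqs >= cutoff].sum() / mag2.sum()

fs = 16000
rng = np.random.default_rng(3)
t = np.arange(4096) / fs
dry = np.sin(2 * np.pi * 800 * t) + 0.05 * rng.standard_normal(t.size)
wet = dry + 0.8 * rng.standard_normal(t.size)  # extra broadband "spray" noise

print(wetness_ratio(dry, fs), wetness_ratio(wet, fs))  # wet > dry
```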

The results obtained from our algorithm were verified by comparing them with data from a professional road surface sensor that measures the thickness of a water layer on the road using a laser beam (VAISALA Remote Road Surface State Sensor DSC111). The plot below shows the results from analyzing sounds recorded from 1200 road vehicles passing by the sensor, compared with data obtained from the reference sensor. The data were obtained from a continuous 6-hour observation period, starting from a dry surface, then observing rainfall until the road surface had dried.

A surface state measure calculated with the proposed algorithm and obtained from the reference device.

As one can see, the results obtained from our algorithm are consistent with data from the professional device. The results are promising, and the inexpensive sensor is easy to install at multiple points within a road network, which makes the proposed solution an attractive method of road condition assessment for intelligent road management systems.

Detecting a drone and estimating its range simply from its audio emissions

Kaliappan Gopalan – kgopala@pnw.edu

Purdue University Northwest, Hammond, IN, 46323, United States

Brett Y. Smolenski, North Point Defense, Rome, NY, USA
Darren Haddad, Information Exploitation Branch, Air Force Research Laboratory, Rome, NY, USA

Popular version of 1ASP8-Detection and Classification of Drones using Fourier-Bessel Series Representation of Acoustic Emissions, presented at the 183rd ASA Meeting.

With the proliferation of drones – from medical supply and hobbyist to surveillance, fire detection and illegal drug delivery, to name a few – of various sizes and capabilities flying day or night, it is imperative to detect their presence and estimate their range for security, safety and privacy reasons.

Our paper describes a technique for detecting the presence of a drone, as opposed to environmental noise such as that from birds and moving vehicles, simply from the audio emissions of the drone’s motors, propellers and mechanical vibrations. By applying a feature extraction technique that separates a drone’s distinct audio spectrum from that of atmospheric noise, and employing machine learning algorithms, we were able to identify drones from three different classes flying outdoors with the correct class in over 78 % of cases. Additionally, we estimated the range of a drone from the observation point to within ±50 cm in over 85 % of cases.

We evaluated unique features characterizing each type of drone using a mathematical technique known as the Fourier-Bessel series expansion. Using these features, which differentiated not only the drone class but also the drone range, we trained a deep learning network with ground truth values of drone type, or of range as a discrete variable at intervals of 50 cm. When the trained network was tested with new, unseen features, it returned the correct type of drone – with a nonzero range – and a range class within ±50 cm of the actual range.
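The full feature pipeline is more involved than shown here, but the Fourier-Bessel expansion itself can be sketched briefly: a signal x(t) on [0, a] is represented by coefficients over the Bessel basis J0(λ_m t / a), where λ_m is the m-th positive zero of J0.

```python
import numpy as np
from scipy.special import jn_zeros, jv

# Minimal Fourier-Bessel (FB) series sketch (illustrative, not the authors'
# full feature extractor). Coefficients follow the standard formula
# c_m = 2 / (a^2 * J1(lam_m)^2) * integral_0^a t x(t) J0(lam_m t / a) dt.

def fb_coefficients(x, t, a, n_terms):
    lam = jn_zeros(0, n_terms)          # first n_terms positive zeros of J0
    dt = t[1] - t[0]
    c = np.empty(n_terms)
    for m in range(n_terms):
        basis = jv(0, lam[m] * t / a)
        c[m] = 2.0 / (a**2 * jv(1, lam[m]) ** 2) * np.sum(t * x * basis) * dt
    return c

# Sanity check: expanding a pure basis function recovers a single coefficient
a = 1.0
t = np.linspace(0.0, a, 4000)
x = jv(0, jn_zeros(0, 5)[4] * t / a)    # the 5th FB basis function
c = fb_coefficients(x, t, a, 8)
print(np.round(c, 3))                    # ~[0, 0, 0, 0, 1, 0, 0, 0]
```

In practice such coefficients, computed over short audio frames, form the feature vectors fed to the classifier.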

Any point along the main diagonal line indicates correct range class, that is, within ±50 cm of actual range, while off-diagonal values correspond to false classification error.

To identify more than three types of drones, we tested seven different models, namely, DJI S1000, DJI M600, Phantom 4 Pro, Phantom 4 QP with a quieter set of propellers, Mavic Pro Platinum, Mavic 2 Pro, and Mavic Pro, all tethered in an anechoic chamber in an Air Force laboratory and controlled by an operator to go through a series of propeller maneuvers (idle, left roll, right roll, pitch forward, pitch backward, left yaw, right yaw, half throttle, and full throttle) to fully capture the array of sounds the craft emit. Our trained deep learning network correctly identified the drone type in 84 % of our test cases. Figure 1 shows the results of range classification for each outdoor drone flying at line-of-sight ranges from 0 (no drone) to 935 m.

3pSP4 – Imaging Watermelons

Dr. David Joseph Zartman
Zartman Inc., L.L.C.,
Loveland, Colorado

Popular version of 3pSP4 – Imaging watermelons
Presented Wednesday afternoon, May 25, 2022
182nd ASA Meeting, Denver
Click here to read the abstract

When imaging watermelons, everything can be simplified down to measuring a variable called ripeness, which characterizes the internal medium of the watermelon, rather than looking for internal reflections from contents such as seeds. The optimal acoustic approach is thus a through-transmission measurement: exciting a wave on one side of the fruit and measuring the result on the other.

Before investigating the acoustic properties, it is useful to examine watermelons’ ripening properties from a material perspective.  As the fruit develops, it starts off very hard and fibrous with a thick skin. Striking an object like this would be similar to hitting a rock, or possibly a stick given the fibrous nature of the internal contents of the watermelon.

As the watermelon ripens, this solid fiber starts to contain more and more liquid, which also sweetens over time. This process continues and transforms the fruit from something too fibrous and bitter into something juicy and sweet. Most people have their own preference for exactly how crunchy versus sweet they like it. The skin also thins throughout this process. As the fibers continue to break down beyond optimal ripeness, the fruit becomes mostly fluid, possibly overly sweet, and very thin-skinned. Striking the fruit at this stage would be similar to hitting some sort of water balloon. While the sweet juice sounds like a positive, the overall texture at this stage is usually not considered desirable.

In review, as watermelons ripen, they transform from something extremely solid into something more closely resembling a liquid-filled water balloon. These are the under-ripe and over-ripe conditions; the personal ideal exists somewhere between the two. Some choose to focus on the crunchy earlier stage at the cost of some sweetness – perhaps also preferable for anyone struggling with blood sugar issues – in contrast to those who maximize the sweet, juicy nature of the later stages at the cost of crunchy texture.

The common form of acoustic measurement in this situation is to simply strike the surface of the watermelon with a knuckle and listen to the sound. More accuracy is possible by feeling with fingertips on the opposite side of the watermelon when it is struck. Neither very young nor very old fruit gives much response: the young fruit is too hard, returning an immediate sharp response and being more painful to the striking finger, while the old fruit is mostly liquid and thus harder to excite acoustically. A young watermelon may make a sound described as a hard ‘tink’, while an old one could be described more as a soft ‘phlub’. In between, it is possible to feel the fibers in the liquid vibrating for a period of time, creating a sound more like a ‘toong’. A shorter resonance, ‘tong’, indicates younger fruit, while more difficulty getting a sound through, ‘tung’, indicates older fruit.
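The tink/toong/phlub distinction can be made quantitative. As a purely illustrative sketch (not from the paper), a tap response can be characterized by its dominant frequency and by how long it rings; both signals below are synthetic.

```python
import numpy as np

# Illustrative only: characterize a tap response by its peak frequency and
# ring time. A crisp "tink" (under-ripe) is high-pitched with a fast decay;
# a "toong" rings lower and longer; a dull "phlub" barely rings at all.

def tap_features(x, fs):
    freqs = np.fft.rfftfreq(x.size, 1 / fs)
    peak_hz = freqs[np.argmax(np.abs(np.fft.rfft(x)))]
    env = np.abs(x)
    # Crude ring-time proxy: fraction of the recording where the signal
    # stays above 10% of its maximum amplitude
    ring = np.mean(env > 0.1 * env.max())
    return peak_hz, ring

fs = 16000
t = np.arange(int(0.5 * fs)) / fs
tink = np.sin(2 * np.pi * 900 * t) * np.exp(-t / 0.02)   # high, fast decay
toong = np.sin(2 * np.pi * 250 * t) * np.exp(-t / 0.15)  # lower, rings longer

f1, r1 = tap_features(tink, fs)
f2, r2 = tap_features(toong, fs)
print(f1, r1, f2, r2)
```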

An optimal watermelon can thus be chosen by feeling or hearing the resonant properties of the fruit when it is struck and selecting according to preference.

1aSPa5 – Saving Lives During Disasters by Using Drones

Macarena Varela – macarena.varela@fkie.fraunhofer.de
Wulf-Dieter Wirth – wulf-dieter.wirth@fkie.fraunhofer.de
Fraunhofer FKIE/ Department of Sensor Data and Information Fusion (SDF)
Fraunhoferstr. 20
53343 Wachtberg, Germany

Popular version of ‘1aSPa5 Bearing estimation of screams using a volumetric microphone array mounted on a UAV’
Presented Tuesday morning 9:30 AM – 11:15 AM, June 8, 2021
180th ASA Meeting, Acoustics in Focus
Read the abstract by clicking here.

During disasters, such as earthquakes or shipwrecks, every minute counts to find survivors.

Unmanned Aerial Vehicles (UAVs), also called drones, can reach and cover inaccessible and larger areas better than rescuers on the ground or other types of vehicles, such as Unmanned Ground Vehicles. Nowadays, UAVs can be equipped with state-of-the-art technology to provide quick situational awareness and support rescue teams in locating victims during disasters.

[Video: Field experiment using the MEMS system mounted on the drone to hear impulsive sounds produced by a potential victim.mp4]

Survivors typically plead for help by producing impulsive sounds, such as screams. Therefore, an accurate acoustic system mounted on a drone is currently being developed at Fraunhofer FKIE, focused on localizing those potential victims.

The system filters out environmental and UAV noise in order to reliably detect human screams and other impulsive sounds. It uses a particular type of microphone array, called a “Crow’s Nest Array” (CNA), combined with advanced signal processing techniques (beamforming) to provide accurate locations of the specific sounds produced by missing people (see Figure 1). The spatial distribution and number of microphones in the array have a crucial influence on the achievable localization accuracy, so it is important to select them properly.

Figure 1: Conceptual diagram to localize victims
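The beamforming principle can be sketched compactly. The array geometry below is a hypothetical four-microphone square, not FKIE's actual Crow's Nest Array: delay-and-sum beamforming scans candidate azimuths and reports the bearing where the aligned, summed microphone signals have the most power.

```python
import numpy as np

# Delay-and-sum beamforming sketch on a hypothetical 4-mic square array
# (not the CNA geometry). A far-field plane wave reaches each microphone
# with a direction-dependent delay; undoing the right delays makes the
# microphone signals add coherently.

fs, c = 16000, 343.0
mics = 0.05 * np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])  # metres, x-y

def steering_delays(az_deg):
    u = np.array([np.cos(np.radians(az_deg)), np.sin(np.radians(az_deg))])
    return mics @ u / c            # per-mic delay for a plane wave from az_deg

# Simulate an impulsive, scream-like burst arriving from 130 degrees
rng = np.random.default_rng(4)
n = 2048
s = rng.standard_normal(n) * np.exp(-np.arange(n) / 300.0)
S = np.fft.rfft(s)
f = np.fft.rfftfreq(n, 1 / fs)
X = S * np.exp(-2j * np.pi * f * steering_delays(130.0)[:, None])  # (4, bins)

# Scan azimuths: undo each candidate's delays, sum mics, measure power
azimuths = np.arange(0, 360)
power = [
    np.sum(np.abs(np.sum(
        X * np.exp(2j * np.pi * f * steering_delays(a)[:, None]), axis=0)) ** 2)
    for a in azimuths
]
est = int(azimuths[int(np.argmax(power))])
print(f"estimated bearing: {est} degrees")
```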

The system components are minimized in quantity, weight and size so that they can be mounted on a drone. With this in mind, the microphone array is composed of a large number of tiny digital Micro-Electro-Mechanical-Systems (MEMS) microphones to find the locations of the victims. In addition, one supplementary condenser microphone covering a larger frequency spectrum is used to obtain a more precise signal for detection and classification purposes.

Figure 2: Acoustic system mounted on a drone

Different experiments, including open field experiments, have successfully been conducted, demonstrating the good performance of the ongoing project.

3aSP1 – Using Physics to Solve the Cocktail Party Problem

Keith McElveen – keith.mcelveen@wavesciencescorp.com
Wave Sciences
151 King Street
Charleston, SC USA 29401

Popular version of paper ‘Robust speech separation in underdetermined conditions by estimating Green’s functions’
Presented Thursday morning, June 10th, 2021
180th ASA Meeting, Acoustics in Focus

Nearly seventy years ago, a hearing researcher named Colin Cherry said, “One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem.’ No machine has been constructed to do just this, to filter out one conversation from a number jumbled together.”

Despite many claims of success over the years, the Cocktail Party Problem has resisted solution.  The present research investigates a new approach that blends tricks used by human hearing with laws of physics. With this approach, it is possible to isolate a voice based on where it must have come from – somewhat like visualizing balls moving around a billiard table after being struck, except in reverse, and in 3D. This approach is shown to be highly effective in extremely challenging real-world conditions with as few as four microphones – the same number as found in many smart speakers and pairs of hearing aids.

The first “trick” is something that hearing scientists call “glimpsing”. Humans subconsciously piece together audible “glimpses” of a desired voice as it momentarily rises above the level of competing sounds. After gathering enough glimpses, our brains “learn” how the desired voice moves through the room to our ears and use this knowledge to ignore the other sounds.

The second “trick” is based on how humans use sounds that arrive “late”, because they bounced off of one or more large surfaces along the way. Human hearing somehow combines these reflected “copies” of the talker’s voice with the direct version to help us hear more clearly.

The present research mimics human hearing by using glimpses to build a detailed physics model – called a Green’s Function – of how sound travels from the talker to each of several microphones. It then uses the Green’s Function to reject all sounds that arrived via different paths and to reassemble the direct and reflected copies into the desired speech. The accompanying sound file illustrates typical results this approach achieves.
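Wave Sciences' actual algorithm is not published here, but the core step, estimating the acoustic path from a talker to a microphone, can be sketched as a least-squares channel (impulse response) estimate; the signal lengths and filter order below are hypothetical.

```python
import numpy as np

# Toy sketch (not Wave Sciences' algorithm): estimate the impulse response
# ("Green's function") from a talker to one microphone by least squares,
# using a stretch of audio where that talker dominates (a "glimpse").

rng = np.random.default_rng(5)
n, L = 4000, 64                      # samples of speech, filter taps
s = rng.standard_normal(n)           # talker's (glimpsed) signal
h_true = rng.standard_normal(L) * np.exp(-np.arange(L) / 10.0)  # room path
x = np.convolve(s, h_true)[:n] + 0.01 * rng.standard_normal(n)  # mic signal

# Convolution matrix: column k is the talker signal delayed by k samples,
# so A @ h reproduces the direct sound plus its reflections
A = np.zeros((n, L))
for k in range(L):
    A[k:, k] = s[: n - k]
h_est, *_ = np.linalg.lstsq(A, x, rcond=None)

err = np.linalg.norm(h_est - h_true) / np.linalg.norm(h_true)
print(f"relative filter error: {err:.4f}")
```

Given such a path estimate for each microphone, a separator can keep the sound that is consistent with those paths and reject sound that arrived via different ones.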

Original Cocktail Party Sound File, Followed by Separated Nearest Talker, then Farthest

While prior approaches have struggled to equal human hearing in realistic cocktail party babble, even at close distances, the research results we are presenting imply that it is now possible not only to equal but to exceed human hearing and solve the Cocktail Party Problem, even with a small number of microphones in no particular arrangement.

The many implications of this research include improved conference call systems, hearing aids, automotive voice command systems, and other voice assistants – such as smart speakers. Our future research plans include further testing as well as devising intuitive user interfaces that can take full advantage of this capability.

No one knows exactly how human hearing solves the Cocktail Party Problem, but it would be very interesting indeed if it is found to use its own version of a Green’s Function.