Machine learning allows automatic recognition of everyday sounds

Tuomas Virtanen
Tampere University of Technology
Korkeakoulunkatu 1
FI-33720 Tampere

Popular version of keynote talk “Computational analysis of acoustic events in everyday environments”.

To be presented on Sunday morning, June 25, 2017, 8:20, Ballroom B

173rd Meeting of the Acoustical Society of America

Sound carries information about physical events in an environment. For example, when a car is passing by, we can perceive the approximate size and speed of the car by its sounds. The automatic recognition of sound events in everyday environments (see Figure 1) through signal processing and computer algorithms would therefore enable several new applications. For instance, robots, cars, and mobile devices could become aware of physical events in their surroundings. New surveillance applications could automatically detect dangerous events, such as glass breaking, and multimedia databases could be queried based on their content.

Figure 1: Sound event recognition methods automatically analyse events within an audio signal. Credit: Heittola/TUT.

However, automatic detection of sounds is difficult, since the acoustic characteristics of different sources can be very similar, and there is no single, specific acoustic property that could be utilised for recognition. Furthermore, in realistic environments, there are typically multiple sound sources present simultaneously, and their sounds interfere with one another, forming a complex mixture (see Figure 2).


Figure 2: Sound carries lots of information about physical events in everyday environments. Realistic sound scenes consist of multiple sources which form a complex mixture of sounds. Credit: Arpingstone.


In my keynote speech, I present a generic approach to sound event recognition that can be used to detect many different types of sound events in realistic, everyday environments. In this technique, computer algorithms use machine learning to compile a model for each detectable sound type based on a database of example sounds. Once these models have been obtained at the development stage, they can be deployed to provide an estimate of the sound events that are present in any input audio signal.
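As a minimal sketch of this two-stage idea (not the method from the talk, which relies on deep neural networks), the toy Python example below learns one model per sound type from a tiny, made-up database and then applies those models to new input. The two-dimensional feature values, the use of a mean feature vector as the "model", and the distance threshold are all assumptions chosen purely for illustration.

```python
import math


def train_models(database):
    """Development stage: build one model per sound type.

    Here a "model" is simply the mean feature vector of the examples,
    a deliberately simple stand-in for a learned model.
    """
    models = {}
    for label, examples in database.items():
        dim = len(examples[0])
        models[label] = [
            sum(example[i] for example in examples) / len(examples)
            for i in range(dim)
        ]
    return models


def detect(models, features, threshold=1.0):
    """Usage stage: list every sound type whose model is close to the input.

    Each sound type is tested independently, so zero, one, or several
    sound types can be reported for the same input.
    """
    active = [
        label
        for label, centroid in models.items()
        if math.dist(features, centroid) <= threshold
    ]
    return sorted(active)


# Hypothetical example database: sound type -> list of feature vectors.
EXAMPLE_DATABASE = {
    "car": [[1.0, 0.2], [1.2, 0.0], [0.8, 0.1]],
    "footsteps": [[0.1, 1.0], [0.0, 1.2], [0.2, 0.8]],
}

models = train_models(EXAMPLE_DATABASE)
print(detect(models, [0.9, 0.1]))    # input close to the "car" examples
print(detect(models, [0.55, 0.55]))  # input close to both models
print(detect(models, [5.0, 5.0]))    # input far from every model
```

Testing each sound type separately, rather than picking a single best match, is what lets such a system report several overlapping sounds at once.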

State-of-the-art machine-learning models and algorithms are based on deep neural networks [1]. They mimic the processing in the human auditory system by feeding the input audio signal through a sequence of layers that automatically learn hierarchical representations of input sounds (see Figure 3). In turn, these representations can be used to estimate and recognise which sounds are present in the input.
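The layered processing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the network has a single hidden layer, the weights are hand-picked rather than learned, and the three input features and two sound labels are hypothetical. A real system would learn far larger weight matrices from example recordings.

```python
import math


def relu(x):
    """Rectified linear unit, a common hidden-layer nonlinearity."""
    return max(0.0, x)


def sigmoid(x):
    """Squash a value into the range 0..1 so it can be read as a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases, activation):
    """One layer: weighted sums of the inputs followed by a nonlinearity."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical, hand-picked parameters (a trained network would learn these).
HIDDEN_W = [[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]
HIDDEN_B = [0.0, 0.0]
OUTPUT_W = [[4.0, 0.0], [0.0, 4.0]]
OUTPUT_B = [-2.0, -2.0]
LABELS = ["car", "footsteps"]


def recognise(features, threshold=0.5):
    """Pass the features through both layers and list the active sounds."""
    hidden = dense_layer(features, HIDDEN_W, HIDDEN_B, relu)
    probabilities = dense_layer(hidden, OUTPUT_W, OUTPUT_B, sigmoid)
    # One independent sigmoid output per sound type allows several
    # sounds to be reported as present at the same time.
    return [label for label, p in zip(LABELS, probabilities) if p > threshold]


print(recognise([2.0, 1.0, 0.0]))
print(recognise([1.0, 0.0, 0.0]))
print(recognise([0.0, 0.0, 0.0]))
```

Each layer turns its input into a new representation, and the final layer turns the last representation into one presence estimate per sound type.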

Figure 3: The processing layers of a deep neural network produce representations of the input having different abstraction levels. Credit: Çakır et al./TUT


A recent study conducted at Tampere University of Technology shows that it is possible to recognise 61 different types of everyday sounds, such as footsteps, cars, and doors, correctly 70% of the time in everyday environments, such as offices, streets, and shops (see Figure 4) [2]. The study also demonstrates how advanced neural network architectures can provide significant improvements in recognition accuracy. The accompanying video (Video 1) illustrates the recognition output of the methods in comparison to manually annotated sound events.

Figure 4: Automatic recognition accuracy (percentages) of some common sound events in a study conducted at Tampere University of Technology. Credit: Çakır et al./TUT


Video 1: Sound events automatically recognised from a street scene (orange bars). Manually annotated target sounds are shown with blue bars, and cases where the automatic recognition and the manual annotation coincide are shown with brown bars. Credit: Çakır et al./TUT.
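The comparison shown in Video 1 can be approximated with a short Python sketch. It assumes that both the automatic output and the manual annotation for one sound type are given as (start, end) intervals in whole seconds, and it splits time into one-second frames. The interval values are invented for illustration, and this is not necessarily the evaluation procedure used in the study.

```python
def to_frames(intervals):
    """Expand (start, end) intervals in whole seconds into one-second frames."""
    return {t for start, end in intervals for t in range(start, end)}


def compare(detected, annotated):
    """Split frames into the three groups shown by the coloured bars."""
    det = to_frames(detected)
    ann = to_frames(annotated)
    return {
        "coincide": sorted(det & ann),        # both agree (brown bars)
        "detected_only": sorted(det - ann),   # automatic output only (orange)
        "annotated_only": sorted(ann - det),  # manual annotation only (blue)
    }


# Hypothetical intervals for a single sound type, e.g. "car".
detected_car = [(0, 4), (7, 9)]
annotated_car = [(1, 5)]

result = compare(detected_car, annotated_car)
print(result)
```

Frames in the first group are correct detections, while the other two groups correspond to false alarms and missed events, respectively.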



[1] Y. Bengio. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2 (1): 1–127, 2009.




[2] E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen. Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Volume 25, Issue 6, 2017.



