1pPPB – Emotion Recognition from Speaker-dependent low-level acoustic features

Tejal Udhan – tu13b@my.fsu.edu
Shonda Bernadin – bernadin@eng.famu.fsu.edu
FAMU-FSU College of Engineering,
Department of Electrical and Computer Engineering
2525 Pottsdamer Street, Tallahassee, Florida 32310

Popular version of paper 1pPPB: ‘Speaker-dependent low-level acoustic feature extraction for emotion recognition’
Presented Monday afternoon May 7, 2018
175th ASA Meeting, Minneapolis

Speech is the most common and fastest means of communication between humans. This fact has compelled researchers to study acoustic signals as a fast and efficient means of interaction between humans and machines. For authentic human-machine interaction, machines must have sufficient intelligence to recognize human voices and their emotional state. Speech emotion recognition, extracting the emotional state of speakers from acoustic data, plays an important role in enabling machines to be ‘intelligent’. Audio and speech processing offers solutions that are noninvasive and easier to acquire than other biomedical signals such as electrocardiograms (ECG) and electroencephalograms (EEG).

Speech is an informative source for the perception of emotions. For example, we talk in a loud voice when feeling very happy, speak in an uncharacteristically high-pitched voice when greeting a desirable person, and show vocal tremor when experiencing something fearful or sad. This cognitive recognition of emotions indicates that listeners are able to infer the emotional state of the speaker reasonably accurately even in the absence of visual information [1]. This theory of cognitive emotion inference forms the basis for speech emotion recognition. Acoustic emotion recognition finds many applications in the modern world, ranging from interactive entertainment systems to medical therapy and monitoring to human safety devices.

We conducted preliminary experiments to classify four human emotions, anger, happiness, sadness and neutral (no emotion), in male and female speakers. We chose two simple acoustic features, pitch and intensity, for this analysis. The choice of features is based on readily available tools for their calculation. Pitch is the relative highness or lowness of a tone as perceived by the ear, and intensity is the energy contained in speech as it is produced. Since these are one-dimensional features, they can be easily analyzed by any acoustic emotion recognition system. We designed a decision-tree-based algorithm in MATLAB to perform emotion classification, using samples from the LDC Emotional Prosody dataset [2]. One sample of each emotion for one male and one female speaker is given below.

{audio missing}
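To make the pipeline concrete, here is a minimal Python sketch of the same idea (the study itself used MATLAB): each utterance is reduced to two hypothetical per-utterance features, median pitch in Hz and median intensity in dB, and a decision tree is trained on them. The feature values below are placeholders for illustration, not the LDC data.

```python
# Minimal sketch (not the authors' MATLAB code): classify emotions from two
# per-utterance features, median pitch (Hz) and median intensity (dB), with a
# decision tree. All feature values below are made up for illustration only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training features: [median_pitch_Hz, median_intensity_dB]
X_train = np.array([
    [210, 72],  # anger (male, placeholder)
    [195, 68],  # happy (male, placeholder)
    [165, 55],  # sad (male, placeholder)
    [175, 60],  # neutral (male, placeholder)
    [320, 74],  # anger (female, placeholder)
    [400, 73],  # happy (female, placeholder)
    [240, 58],  # sad (female, placeholder)
    [260, 62],  # neutral (female, placeholder)
])
y_train = ["anger", "happy", "sad", "neutral"] * 2

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Classify a new utterance from its two features.
print(clf.predict([[330, 75]]))  # high pitch and intensity: a high-arousal emotion
```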

We observed that the male speaker does not show much variation in pitch across the emotions; the pitch is consistently similar for any given emotion. The median intensity over each emotion class, though changing, remains consistently similar to the training data values. As a result, emotion recognition for the male speaker has an accuracy of 88% on acoustic test signals. Although the pitch is almost the same, there is a clear distinction in intensity between the emotions happy and sad, and this dissimilarity in intensity produced the higher recognition accuracy on the male speaker data. For the female speaker, the pitch ranges anywhere from 230 Hz to 435 Hz across three different emotions, namely happy, sad and angry. Hence, the median intensity becomes the sole criterion for emotion recognition. The intensities for the emotions happy and angry are almost the same, since both are high-arousal emotions. This resulted in a lower emotion recognition accuracy for the female speaker of about 63%. The overall accuracy of emotion recognition using this method is 75%.

Fig. 1. Emotion Recognition Accuracy Comparison

Our algorithm successfully recognized emotions in the male speaker. Since the pitch is consistent within each emotion for the male speaker, the selected features, pitch and intensity, resulted in better emotion recognition accuracy. For the female acoustic data, the selected features are insufficient to describe the emotions; hence, future work will evaluate features that are independent of voice quality, such as prosodic, formant or spectral features.

[1] Fonagy, I., "Emotions, voice and music," in J. Sundberg (Ed.), Research Aspects on Singing, Royal Swedish Academy of Music and Payot, Stockholm and Paris, pp. 51–79, 1981.
[2] Liberman, Mark, et al., Emotional Prosody Speech and Transcripts LDC2002S28, Web Download, Philadelphia: Linguistic Data Consortium, 2002.

3aPA8 – High Altitude Venus Operational Concept (HAVOC)

Adam Trahan – ajt6261@louisiana.edu
Andi Petculescu – andi@louisiana.edu

University of Louisiana at Lafayette
Physics Department
240 Hebrard Blvd., Broussard Hall
Lafayette, LA 70503-2067

Popular version of paper 3aPA8
Presented Wednesday morning, May 9, 2018
175th ASA Meeting, Minneapolis, MN

Artist’s rendition of the envisioned HAVOC mission. (Credit: NASA Systems Analysis and Concepts Directorate, sacd.larc.nasa.gov/smab/havoc)

The motivation for this research stems from NASA’s proposed High Altitude Venus Operational Concept (HAVOC), which, if successful, would lead to a possible month-long human presence above the cloud layer of Venus.

The atmosphere of Venus is composed primarily of carbon dioxide, with small amounts of nitrogen and other trace molecules at parts-per-million levels. With surface temperatures about 2.5 times Earth’s and pressures roughly 100 times higher, the Venusian surface is quite a hostile environment. Higher in the atmosphere, however, the environment becomes relatively benign, with temperatures and pressures similar to those at Earth’s surface. In the 40-70 km region, condensational sulfuric acid clouds prevail, which contribute to the so-called “runaway greenhouse” effect.

The main condensable species on Venus is a binary mixture of sulfuric acid dissolved in water. The existence of aqueous sulfuric acid droplets is restricted to a thin region of Venus’ atmosphere, namely 40-70 km above the surface. Above and below this main cloud layer, nothing more than a light haze can exist in liquid form, because the droplets evaporate. Inside the cloud layer there are three sublayers: the upper cloud layer is produced using energy from the sun, while the lower and middle cloud layers are produced via condensation. The goal of this research is to determine how the lower and middle condensational cloud layers affect the propagation of sound waves as they travel through the atmosphere.

For most waves to travel, a medium must be present; electromagnetic waves (light) are the exception, able to travel through the vacuum of space. For sound waves, a fluid (gas or liquid) is necessary to support the wave. The presence of tiny particles affects the propagation of acoustic waves via energy loss processes; these effects have been well studied in Earth’s atmosphere. Using theoretical and numerical techniques, we are able to predict how much an acoustic wave would be weakened (attenuated) for every kilometer traveled in Venus’ clouds.
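To make "weakened for every kilometer traveled" concrete, the short sketch below (an illustration only, not the cloud model in the paper) converts an assumed attenuation coefficient into the decibel loss accumulated over one kilometer, using the standard exponential amplitude decay p(x) = p0 exp(-ax) and the neper-to-decibel conversion.

```python
# Illustration only (not the Venus cloud model described in the paper):
# convert an assumed attenuation coefficient into dB lost per kilometer.
import numpy as np

alpha = 2.0e-3            # assumed attenuation coefficient, nepers per meter
distance_m = 1000.0       # one kilometer of propagation

# Pressure amplitude decays as p(x) = p0 * exp(-alpha * x)
amplitude_ratio = np.exp(-alpha * distance_m)

# Convert the amplitude ratio to a decibel loss (1 neper = 20*log10(e) ~ 8.686 dB)
loss_dB = -20.0 * np.log10(amplitude_ratio)   # equals 8.686 * alpha * distance_m

print(f"Amplitude ratio after 1 km: {amplitude_ratio:.3f}")
print(f"Attenuation over 1 km: {loss_dB:.1f} dB")
```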

Figure 2. The frequency dependence of the wave attenuation coefficient. The attenuation is stronger at high frequencies, with a large transition region between 1 and 100 Hz.

Figure 2 shows how the attenuation parameter changes with frequency. At higher frequencies (greater than 100 Hz), the attenuation is larger than at lower frequencies, due primarily to the motion of the liquid cloud droplets as they react to the passing acoustic wave. In the lower frequency region, the attenuation is lower and is due primarily to evaporation and condensation processes, which require energy from the acoustic wave.

For the present study, the cloud environment was treated as a perfect (ideal) gas, which assumes the gas molecules behave like billiard balls, simply bouncing off one another. This assumption is valid for low-frequency sound waves. To complete the model, real-gas effects are added to obtain the background attenuation in the surrounding atmosphere. This will enable us to predict the net amount of loss an acoustic wave is likely to experience at the projected HAVOC altitudes.
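As a small illustration of what the perfect-gas treatment provides, the sound speed of an ideal gas follows directly from c = sqrt(γRT/M). The sketch below evaluates this for carbon dioxide at a roughly Earth-like cloud-level temperature; the numerical values are assumptions for illustration, not results from the paper.

```python
# Illustration of the ideal-gas assumption (values are assumptions, not from the paper):
# sound speed in a perfect gas, c = sqrt(gamma * R * T / M), evaluated for CO2.
import math

R = 8.314          # universal gas constant, J/(mol K)
gamma = 1.29       # assumed ratio of specific heats for CO2
M = 0.04401        # molar mass of CO2, kg/mol
T = 300.0          # assumed temperature (K), roughly Earth-like cloud-level conditions

c = math.sqrt(gamma * R * T / M)
print(f"Ideal-gas sound speed in CO2 at {T:.0f} K: {c:.0f} m/s")
```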

The results of this study could prove valuable for guiding the development of acoustic sensors designed to investigate atmospheric properties on Venus.

This research was sponsored by a grant from the Louisiana Space Consortium (LaSPACE).

2pNS8 – Noise Dependent Coherence-Super Gaussian based Dual Microphone Speech Enhancement for Hearing Aid Application using Smartphone

Nikhil Shankar – nxs162330@utdallas.edu
Gautam Shreedhar Bhat – gxs160730@utdallas.edu
Chandan K A Reddy – cxk131330@utdallas.edu
Dr. Issa M S Panahi – imp015000@utdallas.edu
Statistical Signal Processing Laboratory (SSPRL)
The University of Texas at Dallas
800 W Campbell Road,
Richardson, TX 75080, USA

Popular version of paper 2pNS8, “Noise dependent coherence-super Gaussian based dual microphone speech enhancement for hearing aid application using smartphone”
Presented Tuesday afternoon, May 8, 2018, 3:25 – 3:40 PM, Nicollet D3
175th ASA Meeting, Minneapolis

Records from the National Institute on Deafness and Other Communication Disorders (NIDCD) indicate that nearly 15% of adults aged 18 and over in the United States (37 million people) report some kind of hearing loss. Worldwide, 360 million people suffer from hearing loss.

Over the past decade, researchers have developed many feasible solutions for the hearing impaired in the form of Hearing Aid Devices (HADs) and Cochlear Implants (CIs). However, the performance of HADs degrades in the presence of different types of background noise, and, because of design constraints, the devices lack the computational power to handle the necessary signal processing algorithms. Lately, HAD manufacturers have been offering external microphones, worn as a pen or a necklace, to capture speech and transmit the signal and data by wire or wirelessly to the HADs. The expense of these auxiliary devices is a limitation. An alternative solution is to use a smartphone, which can capture the noisy speech with its two microphones, perform the complex computations of a speech enhancement algorithm, and transmit the enhanced speech to the HADs.

In this work, the coherence between speech and noise signals [1] is used to obtain a Speech Enhancement (SE) gain function, which is combined with a Super-Gaussian Joint Maximum a Posteriori (SGJMAP) [2,3] single-microphone SE gain function. The weighted combination of these two gain functions strikes a balance between noise suppression and speech distortion. The idea behind the coherence method is that the speech captured by the two microphones is correlated, while the noise is uncorrelated with the speech. A block diagram of the proposed method is shown in Figure 1.

Fig. 1. Block diagram of proposed SE method.
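The weighting idea can be sketched in a few lines of Python. This is a simplified illustration of the combination step only, not the authors' implementation: the coherence-based gain and the SGJMAP gain are assumed to have been computed per time-frequency bin by their respective estimators (omitted here), and are blended with a single weight before being applied to the noisy spectrum.

```python
# Simplified sketch of the idea (not the authors' implementation): blend a
# coherence-based gain with a single-microphone SGJMAP gain, per STFT bin,
# and apply the result to the noisy spectrum from the primary microphone.
import numpy as np

def enhance_frame(noisy_spectrum, gain_coherence, gain_sgjmap, weight=0.5):
    """noisy_spectrum: complex STFT of one frame from the primary microphone.
    gain_coherence, gain_sgjmap: real-valued gains in [0, 1] per frequency bin,
    produced by the two estimators (their computation is omitted here).
    weight: trade-off between noise suppression and speech distortion."""
    gain = weight * gain_coherence + (1.0 - weight) * gain_sgjmap
    gain = np.clip(gain, 0.0, 1.0)
    return gain * noisy_spectrum  # enhanced spectrum; inverse STFT would follow

# Toy usage with random placeholder values for one 257-bin frame:
rng = np.random.default_rng(0)
frame = rng.standard_normal(257) + 1j * rng.standard_normal(257)
g_coh = rng.uniform(0.0, 1.0, 257)
g_map = rng.uniform(0.0, 1.0, 257)
enhanced = enhance_frame(frame, g_coh, g_map, weight=0.6)
```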

For an objective measure of speech quality, we use the Perceptual Evaluation of Speech Quality (PESQ). The Coherence Speech Intelligibility Index (CSII) is used to measure the intelligibility of speech. PESQ ranges between 0.5 and 4, with 4 being high speech quality. CSII ranges between 0 and 1, with 1 being high intelligibility. Figure 2 plots PESQ and CSII versus SNR for two noise types and compares the performance of the proposed SE method with the conventional coherence and log-MMSE SE methods.

Fig. 2. Objective measures of speech quality and intelligibility
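For readers who want to compute a comparable quality score on their own recordings, the sketch below uses the third-party Python packages pesq and soundfile (an assumption here, not part of the authors' toolchain); CSII has no equally standard package and would require a custom implementation.

```python
# Sketch of scoring an enhanced file against its clean reference with PESQ.
# Assumes the third-party 'pesq' and 'soundfile' packages and 16 kHz mono WAV
# files with the names below; this is not the authors' evaluation code.
import soundfile as sf
from pesq import pesq

clean, fs = sf.read("clean.wav")       # reference speech
enhanced, _ = sf.read("enhanced.wav")  # output of the speech enhancement

# 'wb' selects the wideband mode; it requires fs == 16000.
score = pesq(fs, clean, enhanced, 'wb')
print(f"PESQ (wideband): {score:.2f}")  # higher scores indicate better quality
```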

Along with objective measures, we performed Mean Opinion Score (MOS) tests on 20 normal-hearing subjects, both male and female. The subjective test results are shown in Figure 3, which illustrates the effectiveness of the proposed method in various background noises.

Fig. 3. Subjective test results

Please refer to our lab website https://www.utdallas.edu/ssprl/hearing-aid-project/ for video demos; sample audio files are attached below.

Audio samples:

Noisy

Enhanced

Key References:
[1] N. Yousefian and P. Loizou, “A Dual-Microphone Speech Enhancement algorithm based on the Coherence Function,” IEEE Trans. Audio, Speech, and Lang. Processing, vol. 20, no.2, pp. 599-609, Feb 2012.
[2] T. Lotter and P. Vary, “Speech Enhancement by MAP Spectral Amplitude Estimation using a super-Gaussian speech model,” EURASIP Journal on Applied Sig. Process, pp. 1110-1126, 2005.
[3] C. Karadagur Ananda Reddy, N. Shankar, G. Shreedhar Bhat, R. Charan and I. Panahi, “An Individualized Super-Gaussian Single Microphone Speech Enhancement for Hearing Aid Users With Smartphone as an Assistive Device,” in IEEE Signal Processing Letters, vol. 24, no. 11, pp. 1601-1605, Nov. 2017.

*This work was supported by the National Institute on Deafness and Other Communication Disorders (NIDCD) of the National Institutes of Health (NIH) under grant number 5R01DC015430-02. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. The authors are with the Statistical Signal Processing Research Laboratory (SSPRL), Department of Electrical and Computer Engineering, The University of Texas at Dallas.

Full paper: https://acoustics.org/wp-content/uploads/2018/05/Shankar-LLP-2.docx

1pPA – Assessment of Learning Algorithms to Model Perception of Sound

Menachem Rafaelof
National Institute of Aerospace (NIA)

Andrew Schroeder
NASA Langley Research Center (NIFS intern, summer 2017)

175th Meeting of the
Acoustical Society of America
Minneapolis, Minnesota
7-11 May 2018
1pPA, Novel Methods in Computational Acoustics II

Sound and its Perception
Sound waves are basically fluctuations of air pressure at points in space. While this simple physical description captures what sound is, its perception is much more complicated, involving physiological and psychological processes.

Physiological processes involve a number of functions during transmission of sound through the outer, middle and inner ear before transduction into neural signals. Examples of these processes include amplification due to resonance within the outer ear, substantial attenuation at low frequencies within the inner ear and frequency component separation within the inner ear. Central processing of sound is based on neural impulses (counts of electrical signals) transferred to the auditory center of the brain. This transformation occurs at different levels in the brain. A major component in this processing is the auditory cortex, where sound is consciously perceived as being, for example, loud, soft, pleasing, or annoying.

Motivation
Currently, an effort is underway to develop and put to use “air taxis”, vehicles for on-demand passenger transport. A major concern with these plans is the operation of air vehicles close to the public and the potential negative impact of their noise. This concern motivates the development of an approach to predict human perception of sound. Such a capability will enable designers to compare different vehicle configurations and their sounds, and to address design factors that are important to noise perception.

Approach
Supervised learning algorithms are a class of machine learning algorithms capable of learning from examples. During the learning stage, samples of input data and matching responses are used to construct a predictive model. This work compared the performance of four supervised learning algorithms (Linear Regression (LR), Support Vector Machines (SVM), Decision Trees (DTs) and Random Forests (RFs)) in predicting human annoyance from sounds. Construction of the predictive models included three stages: 1) sample sounds for training are analyzed in terms of loudness (N), roughness (R), sharpness (S), tone prominence ratio (PR) and fluctuation strength (FS); these parameters quantify various subjective attributes of sound and serve as predictors within the model. 2) Each training sound is presented to a group of test subjects and their annoyance response (Y in Figure 1) to each sound is gathered. 3) A predictive model (H-hat) is constructed using a machine learning algorithm and is used to predict the annoyance of new sample sounds (Y-hat).

Figure 1: Construction of a model (H-hat) to predict the annoyance of sound. Path a: training sounds are presented to subjects and their annoyance rating (Y) is gathered. Subject rating of training samples and matching predictors are used to construct the model, H-hat. Path b: annoyance of a new sound is estimated using H-hat.
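A compact way to picture the comparison is the scikit-learn sketch below. It is an illustration under assumptions, not the study's code: the feature matrix would hold the five predictors (N, R, S, PR, FS) for each training sound and y the corresponding annoyance ratings, but here both are random placeholders.

```python
# Illustrative sketch (not the study's code): compare four regressors on a
# feature matrix of the five predictors (N, R, S, PR, FS) against annoyance
# ratings. Real data would replace the random placeholders below.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(103, 5))       # 103 sounds x 5 predictors (placeholder)
y = rng.uniform(0, 10, size=103)     # mean annoyance ratings (placeholder)

models = {
    "LR": LinearRegression(),
    "SVM": SVR(),
    "DT": DecisionTreeRegressor(random_state=0),
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
}

for name, model in models.items():
    mae = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    print(f"{name}: cross-validated MAE = {mae:.2f}")
```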

Findings
In this work the performance of four models, or learning algorithms, was examined. Construction of these models relied on the annoyance responses of 38 subjects to 103 sounds from 10 different sound sources grouped in four categories: road vehicles, unmanned aerial vehicles for package delivery, distributed electric propulsion aircraft and a simulated quadcopter. Comparison of these algorithms in terms of prediction accuracy (see Figure 2), model interpretability, versatility and computation time points to Random Forests as the best algorithm for the task. These results are encouraging considering the precision demonstrated using a low-dimensional model (only five predictors) and the variety of sounds used.

Future Work
• Account for variance in human response data and establish a target error tolerance.
• Explore the use of one or two additional predictors (e.g., impulsiveness and audibility)
• Develop an inexpensive, standard, process to gather human response data
• Collect additional human response data
• Establish an annoyance scale for air taxi vehicles

Figure 2: Prediction accuracy for the algorithms examined. Accuracy here is expressed as the fraction of points predicted within error tolerance (in terms of Mean Absolute Error (MAE)) vs. error tolerance or absolute deviation. For each case, Area Over the Curve (AOC) represents the total MAE.
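The accuracy measure in Figure 2 can be reproduced from a vector of absolute prediction errors: for each tolerance, count the fraction of predictions whose error falls within it. The minimal sketch below uses made-up errors; the area over the resulting curve numerically recovers the MAE, consistent with the AOC interpretation in the caption.

```python
# Sketch of the accuracy-vs-tolerance curve in Figure 2, using made-up errors.
# For each tolerance t, accuracy(t) is the fraction of predictions whose
# absolute error is <= t; the area over that curve approximates the MAE.
import numpy as np

abs_errors = np.abs(np.random.default_rng(0).normal(0.0, 1.0, size=500))

tolerances = np.linspace(0.0, abs_errors.max(), 200)
accuracy = np.array([(abs_errors <= t).mean() for t in tolerances])

# Left Riemann sum of (1 - accuracy) over the tolerance axis.
area_over_curve = np.sum(np.diff(tolerances) * (1.0 - accuracy[:-1]))
print(f"MAE directly:        {abs_errors.mean():.3f}")
print(f"Area over the curve: {area_over_curve:.3f}")  # numerically close to MAE
```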

1pPP – Trends that are shaping the future of hearing aid technology

Brent Edwards – Brent.Edwards@nal.gov.au

Popular version of paper 1pPPa, “Trends that are shaping the future of hearing aid technology”
Presented Monday afternoon, May 7, 2018, 1:00PM, Nicollet D2 Room
175th ASA Meeting, Minneapolis

Hearing aid technology is changing faster than at any time in its history. A primary reason is its convergence with consumer electronics, which has accelerated the pace of innovation and changed its nature from incremental to disruptive.

Hearables and wearables are non-medical devices that use sensors to measure and report the user’s biometric data in addition to providing other sensory information. Since hearing aids are worn every day and the ear is an ideal location for many of these sensors, hearing aids have the potential to become the ideal form factor for consumer wearables. Conversely, hearable devices that augment and enhance audio for normal-hearing consumers while also measuring their biometric data have the potential to become a new form of hearing aid for people with hearing loss, combining the medical functionality of hearing-loss compensation with consumer functionality such as speech recognition and always-on access to Siri. The photo below shows one hearable on the market that allows the wearer to measure their hearing with a smartphone app and adjust the device’s audibility to personalise the sound for their own hearing ability, a process that has similarities to the fitting of a traditional hearing aid by an audiologist.

Accelerating this convergence between medical and consumer hearing technologies is the recently passed congressional bill that mandates the creation of a new class of over-the-counter hearing aids that consumers can purchase in a store and fit to their own prescription. E-health technologies already exist that allow consumers to measure their own hearing loss and apply clinically validated prescriptions to their hearable devices. This technology development will explode once over-the-counter hearing aids are a reality.

Deep science is also impacting hearing aid innovation. The integration of cognitive function with hearing aid technology will continue to be one of the strongest trends in the field. Neural measures of the brain using EEG have the potential to be used to fit hearing devices and also to demonstrate hearing aid benefit by showing how wearing devices affects activity in the brain. Brain sensors have been proven able to determine which talker a person is listening to, a capability that could be included in future hearing aids to enhance the speech from the desired talker and suppress all other sounds. Finally, science continues to advance our understanding of how hearing aid technology can benefit cognitive function. These scientific and other medical developments such as light-driven hearing aids will advance hearing aid benefit through the more traditional medical channel, complementing the advances on the consumer side of the healthcare delivery spectrum.