2pSC14 – Improving the Accuracy of Automatic Detection of Emotions From Speech
Reza Asadi and Harriet Fell
Popular version of poster 2pSC14 “Improving the accuracy of speech emotion recognition using acoustic landmarks and Teager energy operator features.”
Presented Tuesday afternoon, May 19, 2015, 1:00 pm – 5:00 pm, Ballroom 2
169th ASA Meeting, Pittsburgh
“You know, I can feel the fear that you carry around and I wish there was… something I could do to help you let go of it because if you could, I don’t think you’d feel so alone anymore.”
— Samantha, a computer operating system in the movie “Her”
Computers that can recognize human emotions could react appropriately to a user’s needs and provide more human like interactions. Emotion recognition can also be used as a diagnostic tool for medical purposes, onboard car driving systems to keep the driver alert if stress is detected, a similar system in aircraft cockpits, and also electronic tutoring and interaction with virtual agents or robots. But is it really possible for computers to detect the emotions of their users?
During the past fifteen years, computer and speech scientists have worked on the automatic detection of emotion in speech. In order to interpret emotions from speech the machine will gather acoustic information in the form of sound signals, then extract related information from the signals and find patterns which relate acoustic information to the emotional state of speaker. In this study new combinations of acoustic feature sets were used to improve the performance of emotion recognition from speech. Also a comparison of feature sets for detecting different emotions is provided.
Three sets of acoustic features were selected for this study: Mel-Frequency Cepstral Coefficients, Teager Energy Operator features and Landmark features.
Mel-Frequency Cepstral Coefficients:
In order to produce vocal sounds, vocal cords vibrate and produce periodic pulses which result in glottal wave. The vocal tract starting from the vocal cords and ending in the mouth and nose acts as a filter on the glottal wave. The Cepstrum is a signal analysis tool which is useful in separating source from filter in acoustic waves. Since the vocal tract acts as a filter on a glottal wave we can use the cepstrum to extract information only related to the vocal tract.
The mel scale is a perceptual scale for pitches as judged by listeners to be equal in distance from one another. Using mel frequencies in cepstral analysis approximates the human auditory system’s response more closely than using the linearly-spaced frequency bands. If we map frequency powers of energy in original speech wave spectrum to mel scale and then perform cepstral analysis we get Mel-Frequency Cepstral Coefficients (MFCC). Previous studies use MFCC for speaker and speech recognition. It has also been used to detect emotions.
Teager Energy Operator features:
Another approach to modeling speech production is to focus on the pattern of airflow in the vocal tract. While speaking in emotional states of panic or anger, physiological changes like muscle tension alter the airflow pattern and can be used to detect stress in speech. It is difficult to mathematically model the airflow, therefore Teager proposed the Teager Energy Operators (TEO), which computes the energy of vortex-flow interaction at each instance of time. Previous studies show that TEO related features contain information which can be used to determine stress in speech.
Acoustic landmarks are locations in the speech signal where important and easily perceptible speech properties are rapidly changing. Previous studies show that the number of landmarks in each syllable might reflect underlying cognitive, mental, emotional, and developmental states of the speaker.
|Figure 1 – Spectrogram (top) and acoustic landmarks (bottom) detected in neutral speech sample||
Sound File 1 – A speech sample with neutral emotion
Figure 2 – Spectrogram (top) and acoustic landmarks (bottom) detected in anger speech sample
Sound File 2 – A speech sample with anger emotion
The data used in this study came from the Linguistic Data Consortium’s Emotional Prosody and Speech Transcripts. In this database four actresses and three actors, all in their mid-20s, read a series of semantically neutral utterances (four-syllable dates and numbers) in fourteen emotional states. A description for each emotional state was handed over to the participants to be articulated in the proper emotional context. Acoustic features described previously were extracted from the speech samples in this database. These features were used for training and testing Support Vector Machine classifiers with the goal of detecting emotions from speech. The target emotions included anger, fear, disgust, sadness, joy, and neutral.
The results of this study show an average detection accuracy of approximately 91% among these six emotions. This is 9% better than a previous study conducted at CMU on the same data set.
Specifically TEO features resulted in improvements in detecting anger and fear and landmark features improved the results for detecting sadness and joy. The classifier had the highest accuracy, 92%, in detecting anger and the lowest, 87%, in detecting joy.