Tejal Udhan – tu13b@my.fsu.edu
Shonda Bernadin – bernadin@eng.famu.fsu.edu
FAMU-FSU College of Engineering,
Department of Electrical and Computer Engineering
2525 Pottsdamer Street, Tallahassee, FL 32310

Popular version of paper 1pPPB: ‘Speaker-dependent low-level acoustic feature extraction for emotion recognition’
Presented Monday afternoon May 7, 2018
175th ASA Meeting, Minneapolis

Speech is the most common and fastest means of communication between humans. This fact has compelled researchers to study acoustic signals as a fast and efficient means of interaction between humans and machines. Authentic human-machine interaction requires that machines have sufficient intelligence to recognize human voices and their emotional states. Speech emotion recognition, the task of extracting the emotional state of a speaker from acoustic data, plays an important role in making machines ‘intelligent’. Audio and speech processing also offers a noninvasive, easier-to-acquire alternative to other biomedical signals such as electrocardiograms (ECG) and electroencephalograms (EEG).

Speech is an informative source for the perception of emotions. For example, we talk in a loud voice when feeling very happy, speak in an uncharacteristically high-pitched voice when greeting a desirable person, or exhibit vocal tremor after experiencing something fearful or sad. This cognitive recognition of emotions indicates that listeners can infer the emotional state of a speaker reasonably accurately even in the absence of visual information [1]. This theory of cognitive emotion inference forms the basis for speech emotion recognition. Acoustic emotion recognition has many applications in the modern world, ranging from interactive entertainment systems to medical therapies, patient monitoring, and various human safety devices.

We conducted preliminary experiments to classify four human emotions, namely anger, happiness, sadness, and neutral (no emotion), in male and female speakers. We chose two simple acoustic features, pitch and intensity, for this analysis; the choice was based on the tools readily available for their calculation. Pitch is the relative highness or lowness of a tone as perceived by the ear, and intensity is the energy contained in speech as it is produced. Since these are one-dimensional features, they can be easily analyzed by any acoustic emotion recognition system. We designed a decision-tree-based algorithm in MATLAB to perform the emotion classification, using samples from the LDC Emotional Prosody dataset [2]. One sample of each emotion for one male and one female speaker is given below.

{audio missing}
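To give a concrete sense of the two features, the sketch below estimates intensity as frame root-mean-square energy in decibels and pitch from the autocorrelation peak of a voiced frame. This is only a minimal illustration of the general technique, written in Python rather than the MATLAB used in the study; the frame length, pitch search range, and the synthetic test tone are our own assumptions, not details from the paper.

```python
import numpy as np

def intensity_db(frame):
    """Root-mean-square energy of a frame, in dB relative to full scale."""
    rms = np.sqrt(np.mean(frame ** 2))
    return 20 * np.log10(rms + 1e-12)

def pitch_autocorr(frame, fs, fmin=75, fmax=500):
    """Estimate pitch (Hz) of a voiced frame by autocorrelation peak picking.

    fmin/fmax bound the plausible speaking pitch range (an assumption here).
    """
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / lag

# Demo on a synthetic 220 Hz tone standing in for one 40 ms voiced frame.
fs = 16000
t = np.arange(0, 0.04, 1 / fs)
frame = 0.5 * np.sin(2 * np.pi * 220 * t)
print(pitch_autocorr(frame, fs), intensity_db(frame))
```

In a full system these two values would be computed per frame across an utterance and then summarized (for example as the median), which is the form of feature the classification results below refer to.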

We observed that the male speaker showed little pitch variation: the pitch was consistently similar within any given emotion, and the median intensity for each emotion class, though changing, remained consistently close to the training-data values. As a result, emotion recognition for the male speaker reached 88% accuracy on the acoustic test signals. Although the pitch values were similar, there was a clear distinction between the intensities of happy and sad speech, and this dissimilarity produced the higher recognition accuracy on the male speaker's data. For the female speaker, the pitch ranged anywhere from 230 Hz to 435 Hz across three different emotions, namely happy, sad, and angry, so the median intensity became the sole criterion for emotion recognition. The intensities of happy and angry speech were almost identical, since both are high-arousal emotions. This resulted in a lower recognition accuracy of about 63% for the female speaker. The overall accuracy of emotion recognition with this method was 75%.
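The decision logic just described can be pictured as a small tree of threshold tests on the two median features. The sketch below is a toy version of that idea; the threshold values and the exact branch order are illustrative placeholders, not the rules fitted in the study, where thresholds were derived from the training data.

```python
def classify_emotion(median_pitch, median_intensity,
                     pitch_hi=200.0, int_hi=65.0, int_lo=55.0):
    """Toy decision tree over median pitch (Hz) and median intensity (dB).

    All three thresholds are hypothetical values chosen for illustration;
    a real system would learn them from labeled training samples.
    """
    if median_intensity >= int_hi:
        # High-arousal emotions (angry, happy) carry more energy;
        # pitch separates them when it is distinctive enough.
        return "angry" if median_pitch < pitch_hi else "happy"
    elif median_intensity <= int_lo:
        return "sad"
    else:
        return "neutral"

print(classify_emotion(150, 70))  # low pitch, high energy -> "angry"
```

This structure also makes the reported accuracy gap plausible: when happy and angry frames have nearly identical intensity and pitch is unreliable, as with the female speaker's data, the first branch can no longer separate them.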


Fig. 1. Emotion Recognition Accuracy Comparison

Our algorithm successfully recognized emotions in the male speaker: because the pitch was consistent within each emotion, the selected features, pitch and intensity, yielded better recognition accuracy. For the female acoustic data, the selected features were insufficient to describe the emotions; in future research we will therefore evaluate other features that are independent of voice quality, such as prosodic, formant, or spectral features.

[1] Fonagy, I. Emotions, voice and music. In: Sundberg, J. (Ed.), Research Aspects on Singing. Royal Swedish Academy of Music and Payot, Stockholm and Paris, pp. 51–79, 1981.
[2] Liberman, Mark, et al. Emotional Prosody Speech and Transcripts LDC2002S28. Web download. Philadelphia: Linguistic Data Consortium, 2002.
