Study of Acoustic Correlates Associated with Emotional Speech
Serdar Yildirim - yildirim@usc.edu
Sungbok Lee, Murtaza Bulut, Chul Min Lee, Abe Kazemzadeh, Carlos Busso, Shrikanth
Narayanan
University of Southern California
Los Angeles, CA, 90089
Popular version of paper 1aSC10
Presented Monday morning, November 15, 2004
148th ASA Meeting, San Diego, CA
Human speech carries information about both the linguistic content and
the emotional/attitudinal state of the speaker. This study investigates the
acoustic characteristics of four different emotions expressed in speech. The
goal is to obtain detailed acoustic knowledge on how the speech signal is modulated
by changes from an emotionally neutral state to a specific emotionally aroused
state. Such knowledge is necessary for the automatic assessment of emotional
content and strength, as well as for emotional speech synthesis, both of which should
help develop more efficient and user-friendly man-machine communication systems.
For instance, consider an automated call center application: depending
on the emotional state of the user detected during the interaction -- such as
displeasure or anger due to errors in understanding the user's requests -- the system
could transfer the troubled user to a human operator before the man-machine
dialogue breaks down prematurely. Similarly, speech synthesis systems capable
of producing emotional speech will enable the computer to interact with the user more
naturally, for example by adopting a tone of voice appropriate to the
situation at hand.
In this study, emotional speech data obtained from two semi-professional actresses
are analyzed and compared. Each subject produced 211 sentences in four different
emotions: neutral, sad, angry, and happy. We analyze changes in acoustic
parameters such as the magnitude and variability of segmental (i.e., phonemic)
duration, fundamental frequency, the first three formant frequencies, and RMS energy
as a function of emotion type. Segmental duration here means the duration of each
spoken phoneme. RMS energy is correlated with the loudness of speech, and the
fundamental frequency and formant frequencies are related to the speaker's
individual voice characteristics. The changes of these acoustic parameters over
time are known to be correlated not only with what is said but also with how it
is said. Therefore, a change in emotion is expected to be reflected in changes in
these parameters relative to neutral speech. Acoustic differences among the emotions
are also explored through mutual information computation, multidimensional scaling,
and acoustic likelihood comparison with neutral speech; these are mathematical
methods used to quantify or visualize similarity among objects.
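As a concrete illustration of how such parameters can be measured, the sketch
below extracts frame-level F0 and RMS energy from a single recording. It assumes
a Python environment with the librosa library and a hypothetical file name, and
is not the analysis pipeline actually used in the study.

    # Sketch: extract fundamental frequency (F0) and RMS energy tracks from
    # a speech recording. The file name and pitch range are assumptions.
    import numpy as np
    import librosa

    y, sr = librosa.load("emotion_utterance.wav", sr=16000)

    # Frame-level F0 via the pYIN estimator; unvoiced frames come back as NaN.
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=75, fmax=400, sr=sr)

    # Frame-level RMS energy, a correlate of perceived loudness.
    rms = librosa.feature.rms(y=y)[0]

    print("mean F0 (Hz):", np.nanmean(f0))
    print("F0 std  (Hz):", np.nanstd(f0))
    print("mean RMS energy:", rms.mean())

Repeating this for each utterance and grouping the values by emotion label gives
the kind of per-emotion statistics reported in the sections that follow.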
Current results indicate that speech associated with anger and happiness is
characterized by longer segmental durations, shorter inter-word silences, and higher
pitch and energy with a wider dynamic range. Sadness is distinguished from the other
emotions by lower energy and longer inter-word silences. Interestingly, the difference
in formant pattern between happiness/anger and neutral/sadness is better
reflected in back vowels such as /a/ (as in "father") than in front vowels. Some detailed
results on segmental duration, fundamental frequency and formant frequencies,
and energy are given below.
Duration
Statistical data analysis (using analysis of variance, ANOVA) showed
that the effect of emotion on duration parameters such as utterance
duration, inter-word silence/speech ratio, speaking rate, and average
vowel duration is significant. Moreover, our results showed that angry
and happy speech have longer average utterance and vowel durations
compared to neutral and sad speech. In terms of the inter-word
silence/speech ratio, sad speech contains more pauses between words
than the other emotions. Our analysis also indicated that sad,
angry, and happy speech show greater variability in speaking rate than
neutral speech.
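As an illustration, a one-way ANOVA of this kind can be run with standard
statistical tooling. The sketch below assumes per-utterance average vowel
durations have already been collected for each emotion; the numbers are
placeholders, not values from the study.

    # Sketch: one-way ANOVA testing whether emotion affects average vowel
    # duration. The duration lists (in seconds) are placeholders only.
    from scipy.stats import f_oneway

    vowel_durations = {
        "neutral": [0.082, 0.090, 0.078, 0.085, 0.088],
        "sad":     [0.091, 0.088, 0.095, 0.087, 0.093],
        "angry":   [0.102, 0.110, 0.098, 0.105, 0.107],
        "happy":   [0.100, 0.108, 0.103, 0.097, 0.106],
    }

    f_stat, p_value = f_oneway(*vowel_durations.values())
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")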
Fundamental Frequency and Formant Frequencies
Our statistical analysis indicated that the effect of emotion on fundamental frequency
(F0) is significant (p < 0.001). The mean (standard deviation) of
F0 is 188 (49) Hz for neutral, 195 (66) Hz for sad, 233 (84) Hz for angry,
and 237 (83) Hz for happy. Earlier studies report that
mean F0 is lower in sad speech than in neutral
speech [Murray93]; this tendency is not observed for this
particular subject. However, it is confirmed that angry and happy
speech have higher F0 values and greater variation compared to
neutral speech. As we can observe from Figure 1, mean vowel F0
values for neutral speech are lower than those of the other emotion
categories. It is also observed that angry/happy and sad/neutral
show similar F0 values on average, suggesting that F0 is modulated
similarly within each of these two emotion groups.
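The figures above are simple summary statistics over voiced frames pooled by
emotion. A minimal sketch of that aggregation, assuming F0 values have already
been extracted per utterance (e.g., as in the earlier sketch); the arrays below
are placeholders, not study data:

    # Sketch: per-emotion mean and standard deviation of F0 over voiced
    # frames. The arrays hold placeholder values, not study data.
    import numpy as np

    f0_by_emotion = {
        "neutral": np.array([182.0, 190.5, 176.3, 201.2, 188.9]),
        "sad":     np.array([188.4, 210.7, 173.9, 205.0, 196.2]),
        "angry":   np.array([225.1, 250.6, 214.8, 240.3, 233.5]),
        "happy":   np.array([230.2, 255.4, 221.0, 244.7, 236.8]),
    }

    for emotion, f0 in f0_by_emotion.items():
        print(f"{emotion:8s} mean F0 = {f0.mean():6.1f} Hz, "
              f"std = {f0.std(ddof=1):5.1f} Hz")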
Energy
Our analysis based on RMS energy showed that sad speech has a lower
median value and a smaller spread in energy than the other emotions,
while angry and happy speech have higher median values and a greater spread in energy.
ANOVA again indicates that the effect of emotion is significant (p < 0.001).
According to our statistical analysis, RMS energy is the best single
parameter for separating the emotion classes.
Emotion            | Neutral | Sad    | Angry  | Happy
Mutual Info (bits) | 0.4810  | 0.5202 | 0.8189 | 0.7988
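For reference, mutual information values of this kind can be estimated by
discretizing an acoustic parameter (for example, frame-level RMS energy) and
measuring its statistical dependence on an emotion label. The sketch below shows
one plausible formulation (a binary "angry vs. other" label and scikit-learn's
mutual_info_score on synthetic placeholder data); it only illustrates the form of
the computation, not the exact method behind the table above.

    # Sketch: mutual information between a binned acoustic feature and a
    # binary "angry vs. other" label. Data are synthetic placeholders and
    # the formulation is an assumption, not the study's exact method.
    import numpy as np
    from sklearn.metrics import mutual_info_score

    rng = np.random.default_rng(0)
    energy = np.concatenate([rng.normal(0.05, 0.01, 500),    # non-angry frames
                             rng.normal(0.09, 0.02, 500)])   # angry frames
    is_angry = np.array([0] * 500 + [1] * 500)

    # Discretize energy into 10 bins, estimate MI (in nats), convert to bits.
    bins = np.digitize(energy, np.histogram_bin_edges(energy, bins=10))
    mi_bits = mutual_info_score(is_angry, bins) / np.log(2)
    print(f"mutual information = {mi_bits:.3f} bits")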