Popular version of poster 2pSC14 “Improving the accuracy of speech emotion recognition using acoustic landmarks and Teager energy operator features.”
Presented Tuesday afternoon, May 19, 2015, 1:00 pm – 5:00 pm, Ballroom 2
169th ASA Meeting, Pittsburgh
“You know, I can feel the fear that you carry around and I wish there was… something I could do to help you let go of it because if you could, I don’t think you’d feel so alone anymore.”
— Samantha, a computer operating system in the movie “Her”
Computers that can recognize human emotions could react appropriately to a user’s needs and provide more human-like interactions. Emotion recognition could also serve as a diagnostic tool in medicine, help onboard systems in cars and aircraft cockpits keep a driver or pilot alert when stress is detected, and support electronic tutoring and interaction with virtual agents or robots. But is it really possible for computers to detect the emotions of their users?
During the past fifteen years, computer and speech scientists have worked on the automatic detection of emotion in speech. To interpret emotions from speech, a machine gathers acoustic information in the form of sound signals, extracts relevant features from those signals, and finds patterns that relate the acoustic information to the emotional state of the speaker. In this study, new combinations of acoustic feature sets were used to improve the performance of emotion recognition from speech. A comparison of feature sets for detecting different emotions is also provided.
Three sets of acoustic features were selected for this study: Mel-Frequency Cepstral Coefficients, Teager Energy Operator features and Landmark features.
Mel-Frequency Cepstral Coefficients:
To produce vocal sounds, the vocal cords vibrate and produce periodic pulses, which result in a glottal wave. The vocal tract, starting at the vocal cords and ending at the mouth and nose, acts as a filter on the glottal wave. The cepstrum is a signal analysis tool that is useful for separating the source from the filter in acoustic waves. Since the vocal tract acts as a filter on the glottal wave, we can use the cepstrum to extract information related only to the vocal tract.
The mel scale is a perceptual scale of pitches judged by listeners to be equally spaced from one another. Using mel frequencies in cepstral analysis approximates the human auditory system’s response more closely than using linearly spaced frequency bands. If we map the energy in the spectrum of the original speech wave onto the mel scale and then perform cepstral analysis, we get Mel-Frequency Cepstral Coefficients (MFCC). Previous studies have used MFCC for speaker and speech recognition, and they have also been used to detect emotions.
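The mapping onto the mel scale can be sketched in a few lines. This is a minimal illustration, assuming the widely used conversion mel = 2595 log10(1 + f/700). The study does not say which mel formula or how many bands it used, so the function names, the 10 bands, and the 8 kHz upper edge below are illustrative assumptions only.

```python
import math

def hz_to_mel(f_hz):
    # Common mel conversion (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_low_hz, f_high_hz, n_bands):
    # n_bands overlapping bands need n_bands + 2 edge points,
    # spaced equally on the mel scale and converted back to Hz.
    lo, hi = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]

# Hypothetical analysis range: 0 Hz to 8 kHz, 10 bands.
edges = mel_band_edges(0.0, 8000.0, 10)
```

The band edges come out closely spaced at low frequencies and widely spaced at high frequencies, which is the sense in which the mel scale follows hearing more closely than linearly spaced bands.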
Teager Energy Operator features:
Another approach to modeling speech production is to focus on the pattern of airflow in the vocal tract. While a person speaks in an emotional state such as panic or anger, physiological changes like muscle tension alter the airflow pattern, and these changes can be used to detect stress in speech. Because the airflow is difficult to model mathematically, Teager proposed the Teager Energy Operator (TEO), which computes the energy of vortex-flow interaction at each instant of time. Previous studies show that TEO-related features contain information that can be used to detect stress in speech.
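For a sampled signal, the operator is commonly written in the discrete form x[n]^2 - x[n-1]*x[n+1]. The sketch below assumes that standard discrete form; the poster summary does not spell out which TEO-based features were computed, so this shows only the raw operator, and the test tone is a made-up signal.

```python
import math

def teager_energy(x):
    # Discrete Teager energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].
    # The first and last samples have no two-sided neighbours and are skipped.
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# Hypothetical test tone: amplitude 2, digital frequency 0.3 rad/sample.
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n + 0.7) for n in range(50)]
energy = teager_energy(tone)
```

For a pure tone the output is constant and equal to amplitude^2 * sin(omega)^2, so the operator responds to both the amplitude and the frequency of the signal using only three neighbouring samples.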
Landmark features:
Acoustic landmarks are locations in the speech signal where important and easily perceptible speech properties are rapidly changing. Previous studies show that the number of landmarks in each syllable might reflect underlying cognitive, mental, emotional, and developmental states of the speaker.
Sound File 1 – A speech sample with neutral emotion
Sound File 2 – A speech sample with anger emotion
Figure 1 – Spectrogram (top) and acoustic landmarks (bottom) detected in neutral speech sample
Figure 2 – Spectrogram (top) and acoustic landmarks (bottom) detected in anger speech sample
The data used in this study came from the Linguistic Data Consortium’s Emotional Prosody and Speech Transcripts. In this database four actresses and three actors, all in their mid-20s, read a series of semantically neutral utterances (four-syllable dates and numbers) in fourteen emotional states. A description for each emotional state was handed over to the participants to be articulated in the proper emotional context. Acoustic features described previously were extracted from the speech samples in this database. These features were used for training and testing Support Vector Machine classifiers with the goal of detecting emotions from speech. The target emotions included anger, fear, disgust, sadness, joy, and neutral.
The results of this study show an average detection accuracy of approximately 91% among these six emotions. This is 9% better than a previous study conducted at CMU on the same data set.
Specifically, TEO features improved the detection of anger and fear, and landmark features improved the detection of sadness and joy. The classifier had its highest accuracy, 92%, in detecting anger and its lowest, 87%, in detecting joy.
Hollow vs. Foam-filled racket: Feel-good vibrations
Kritika Vayur – firstname.lastname@example.org
Dr. Daniel A. Russell – email@example.com
Pennsylvania State University
201 Applied Science Building
State College, PA, 16802
Popular version of paper 3aSA11, “Vibrational analysis of hollow and foam-filled graphite tennis rackets”
Presented Wednesday morning, May 20, 2015, 11:15 AM in room Kings 3
169th ASA Meeting, Pittsburgh
Tennis Rackets and Injuries
The typical modern tennis racket has a light-weight, hollow graphite frame with a large head. Though these rackets are easier to swing, there seems to be an increase in the number of players experiencing injuries commonly known as “tennis elbow”. Recently, even notable professional players such as Rafael Nadal, Victoria Azarenka, and Novak Djokovic have withdrawn from tournaments because of wrist, elbow or shoulder injuries.
A recent solid foam-filled graphite racket design claims to reduce the risk of injury. Previous testing has suggested that these foam-filled rackets are less stiff and damp vibrations more than hollow rackets do, thus reducing the shock delivered to the player’s arm and the risk of injury [1]. Figure 1 shows cross-sections of the handles of hollow and foam-filled versions of the same model racket.
The preliminary study reported in this paper was an attempt to identify the vibrational characteristics that might explain why foam-filled rackets improve feel and reduce risk of injury.
Figure 1: Cross-section of the handle of a foam-filled racket (left) and a hollow racket (right).
The first vibrational characteristic we set out to identify was the damping associated with the first few bending and torsional vibrations of the racket frame. A higher damping rate means the unwanted vibration dies away faster and results in a less painful vibration delivered to the hand, wrist, and arm. Previous research on handheld sports equipment (baseball and softball bats and field hockey sticks) has demonstrated that bats and sticks with higher damping feel better and minimize painful sting [2,3,4].
We measured the damping rates of 20 different tennis rackets by suspending each racket from the handle with rubber bands, striking the racket frame in the head region, and measuring the resulting vibration at the handle using an accelerometer. Damping rates were obtained from the frequency response of the racket using a frequency analyzer. We note that suspending the racket from rubber bands is a free boundary condition, but other research has shown that this free boundary condition more closely reproduces the vibrational behavior of a hand-held racket than does a clamped-handle condition [5,6].
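One common way to read damping from a frequency response is the half-power bandwidth method, sketched below. This is an illustration only: the text does not state which method the frequency analyzer applied, and the sketch returns a dimensionless damping ratio rather than whatever damping-rate unit appears in Fig. 2. The single-mode response, the 150 Hz resonance, and the 0.01 damping ratio are synthetic, hypothetical values.

```python
import math

def frf_magnitude(f_hz, f0_hz, zeta):
    # Magnitude of a single-mode (one degree of freedom) response, normalised.
    r = f_hz / f0_hz
    return 1.0 / math.sqrt((1.0 - r * r) ** 2 + (2.0 * zeta * r) ** 2)

def _crossing(f_a, m_a, f_b, m_b, level):
    # Linear interpolation of the frequency where the magnitude equals level.
    return f_a + (level - m_a) * (f_b - f_a) / (m_b - m_a)

def half_power_damping_ratio(freqs, mags):
    # Find the peak, then the two frequencies where the magnitude has
    # dropped to peak / sqrt(2); damping ratio ~ bandwidth / (2 * f_peak).
    peak = max(range(len(mags)), key=lambda k: mags[k])
    level = mags[peak] / math.sqrt(2.0)
    lo = peak
    while lo > 0 and mags[lo] > level:
        lo -= 1
    hi = peak
    while hi < len(mags) - 1 and mags[hi] > level:
        hi += 1
    if mags[lo] > level or mags[hi] > level:
        raise ValueError("frequency range does not cover both half-power points")
    f_low = _crossing(freqs[lo], mags[lo], freqs[lo + 1], mags[lo + 1], level)
    f_high = _crossing(freqs[hi - 1], mags[hi - 1], freqs[hi], mags[hi], level)
    return (f_high - f_low) / (2.0 * freqs[peak])

# Synthetic example: one mode at 150 Hz with damping ratio 0.01 (made-up values).
freqs = [100.0 + 0.01 * k for k in range(10001)]
mags = [frf_magnitude(f, 150.0, 0.01) for f in freqs]
zeta_est = half_power_damping_ratio(freqs, mags)
```

The estimate is reliable only for lightly damped, well-separated resonance peaks; closely spaced modes would call for a curve-fitting approach instead.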
Measured damping rates for the first bending mode, shown in Fig. 2, indicate no difference between the damping and decay rates for hollow and foam-filled graphite rackets. Similar results were obtained for other bending and torsional modes. This result suggests that the benefit of or preference for foam-filled rackets is not due to a higher damping that could cause unwanted vibrations to decay more quickly.
Figure 2: Damping rates of the first bending mode for 20 rackets, hollow (open circles) and foam-filled (solid squares). A higher damping rate means the vibration will have a lower amplitude and will decay more quickly.
Vibrational Mode Shapes and Frequencies
Experimental modal analysis is a common method to determine how the racket vibrates with various mode shapes at its resonance frequencies [7]. In this experiment, two rackets were tested, a hollow and a foam-filled racket of the same make and model. Both rackets were freely suspended by rubber bands, as shown in Fig. 3. An accelerometer, fixed at one location, measured the vibrational response to a force hammer impact at each of approximately 180 locations around the frame and strings of the racket. The resulting Frequency Response Functions for each impact location were post-processed with modal analysis software to extract vibrational mode shapes and resonance frequencies. An example of the vibrational mode shapes for a hollow graphite tennis racket may be found on Dr. Russell’s website.
Figure 3: Modal analysis set up for a freely suspended racket.
Figure 4 compares the first and third bending modes and the first torsional mode for a hollow and a foam-filled racket. The only difference between the two rackets is that one was hollow and the other was foam-filled. In the figure, the pink and green regions represent motion in opposite directions, and the white regions, called nodes, indicate where no vibration occurs. The sweet spot of a tennis racket is often identified as being at the center of the nodal line of the first bending mode shape in the head region [8]. An impact from an incoming ball at this location results in zero vibration at the handle, and therefore a better “feel” for the player. The data in Fig. 4 show that there are very few differences between the mode shapes of the hollow and foam-filled rackets. The frequencies at which the mode shapes occur are slightly higher for the foam-filled racket than for the hollow racket, but the differences in shape between the two types are negligible.
Figure 4: Contour maps representing the out-of-plane vibration amplitude for the first bending (left), first torsional (middle), and third bending (right) modes for a hollow (top) and a foam-filled racket (bottom) of the same make and model.
This preliminary study shows that damping rates for this particular design of foam-filled rackets are not higher than those of hollow rackets. The modal analysis gives a closer, yet non-conclusive, look at the intrinsic properties of the hollow and foam-filled rackets. The benefit of using this racket design is perhaps related to a larger impact shock, but additional testing is needed to test this conjecture.
[1] Ferrara, L., & Cohen, A. (2013). A mechanical study on tennis racquets to investigate design factors that contribute to reduced stress and improved vibrational dampening. Procedia Engineering, 60, 397-402.
[2] Russell, D.A. (2012). Vibration damping mechanisms for the reduction of sting in baseball bats. In 164th meeting of the Acoustical Society of America, Kansas City, MO, Oct 22-26. Journal of the Acoustical Society of America, 132(3) Pt. 2, 1893.
[3] Russell, D.A. (2012). Flexural vibration and the perception of sting in hand-held sports implements. In Proceedings of InterNoise 2012, August 19-22, New York City, NY.
[4] Russell, D.A. (2006). Bending modes, damping, and the sensation of sting in baseball bats. In Proceedings 6th IOMAC Conference, 1, 11-16.
[5] Banwell, G.H., Roberts, J.R., & Halkon, B.J. (2014). Understanding the dynamic behavior of a tennis racket under play conditions. Experimental Mechanics, 54, 527-537.
[6] Kotze, J., Mitchell, S.R., & Rothberg, S.J. (2000). The role of the racket in high-speed tennis serves. Sports Engineering, 3, 67-84.
[7] Schwarz, B.J., & Richardson, M.H. (1999). Experimental modal analysis. CSI Reliability Week, 35(1), 1-12.
[8] Cross, R. (2004). Center of percussion of hand-held implements. American Journal of Physics, 72, 622-630.
Chucri A. Kardous – firstname.lastname@example.org
Peter B. Shaw – email@example.com
National Institute for Occupational Safety and Health
Centers for Disease Control and Prevention
1090 Tusculum Avenue
Cincinnati, Ohio 45226
Popular version of paper 2pNSb, “Use of smartphone sound measurement apps for occupational noise assessments”
Presented Tuesday May 19, 2015, 3:55 PM, Ballroom 1
169th ASA Meeting, Pittsburgh, PA
Our world is getting louder. Excessive noise is a public health problem and can cause a range of health issues; noise exposure can induce hearing impairment, cardiovascular disease, hypertension, sleep disturbance, and a host of other psychological and social behavior problems. The World Health Organization (WHO) estimates that there are 360 million people with disabling hearing loss. Occupational hearing loss is the most common work-related illness in the United States; the National Institute for Occupational Safety and Health (NIOSH) estimates that approximately 22 million U.S. workers are exposed to hazardous noise.
Smartphone users are expected to hit the 2 billion mark in 2015. The ubiquity of smartphones and the sophistication of current sound measurement applications (apps) present a great opportunity to revolutionize the way we look at noise and its effects on our hearing and overall health. Through the use of crowdsourcing techniques, people around the world may be able to collect and share noise exposure data using their smartphones. Scientists and public health professionals could rely on such shared data to promote better hearing health and prevention efforts. In addition, the ability to acquire and display real-time noise exposure data raises people’s awareness about their work (and off-work) environment and allows them to make informed decisions about hazards to their hearing and overall well-being. For instance, the European Environment Agency (EEA) developed the Noise Watch app, which allows citizens around the world to make noise measurements at work or during leisure activities and upload the data to a database in real time. The app uses the smartphone’s GPS capabilities to construct a map of the noisiest places and sources in their environment.
However, not all smartphone sound measurement apps are equal. Some are basic and not very accurate, while others are much more sophisticated. NIOSH researchers conducted a study of 192 smartphone sound measurement apps to examine the accuracy and functionality of such apps. We conducted the study in our acoustics laboratory and compared the results to a professional sound level meter. Only 10 apps met our selection criteria, and of those only 4 met our accuracy requirement of being within ±2 decibels (dB) of a type 1 professional sound level meter. Apps developed for the iOS platform were more advanced, in both functionality and performance, than Android apps. You can read more about our original study on our NIOSH Science Blog at: http://blogs.cdc.gov/niosh-science-blog/2014/04/09/sound-apps/ or download our JASA paper at: http://scitation.aip.org/content/asa/journal/jasa/135/4/10.1121/1.4865269.
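The ±2 dB criterion can be illustrated with a short check of app readings against a reference meter. This is a minimal sketch: it assumes the comparison is made on the mean difference across test levels, which the text does not specify, and the readings below are invented for illustration and are not data from the study.

```python
def mean_difference_db(app_levels_db, reference_levels_db):
    # Mean of (app reading - reference reading) over paired test levels, in dB.
    if len(app_levels_db) != len(reference_levels_db) or not app_levels_db:
        raise ValueError("need the same non-zero number of readings from both devices")
    diffs = [a - r for a, r in zip(app_levels_db, reference_levels_db)]
    return sum(diffs) / len(diffs)

def meets_tolerance(app_levels_db, reference_levels_db, tolerance_db=2.0):
    # True when the mean difference lies within +/- tolerance_db.
    return abs(mean_difference_db(app_levels_db, reference_levels_db)) <= tolerance_db

# Hypothetical readings (dBA) at five test levels; NOT measured data.
reference = [65.0, 70.0, 75.0, 80.0, 85.0]
app_a = [65.8, 70.5, 76.1, 80.9, 86.2]   # reads slightly high
app_b = [61.0, 66.2, 71.5, 76.0, 81.3]   # reads several dB low

app_a_ok = meets_tolerance(app_a, reference)
app_b_ok = meets_tolerance(app_b, reference)
```

With these made-up numbers the first app passes and the second does not. A stricter check could require every individual reading, not just the mean, to fall inside the tolerance.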
Figure 1. Testing the SoundMeter app on the iPhone 5 and iPhone 4S against a ½” Larson-Davis 2559 random incidence reference microphone
Today we will present our additional efforts to examine the accuracy of smartphone sound measurement apps using external microphones that can be calibrated. Several external microphones are available, mostly for the consumer market. Although they vary greatly in price, they all possess similar acoustical specifications and have performed similarly in our laboratory tests. Preliminary results showed even greater agreement with professional sound measurement instruments (± 1 dB) over our testing range.
Figure 2. Calibrating the SPLnFFT app with MicW i436 external microphone using the Larson-Davis CAL250 acoustic calibrator (114 dB SPL @ 250Hz)
Figure 3. Laboratory testing of 4 iOS devices using MicW i436 and comparing the measurements to a Larson-Davis type 831 sound level meter (pink noise at 75 dBA)
We will also discuss our plans to develop and distribute a free NIOSH Sound Level Meter app in an effort to facilitate future occupational research efforts and build a noise job exposure database.
Challenges remain with using smartphones to collect and document noise exposure data. Some of the main issues encountered in recent studies relate to privacy and collection of personal data, sustained motivation to participate in such studies, bad or corrupted data, and mechanisms for storing and accessing such data.
Speech: An eye and ear affair!
Pamela Trudeau-Fisette – firstname.lastname@example.org
Lucie Ménard – email@example.com
Université du Quebec à Montréal
320 Ste-Catherine E.
Montréal, H3C 3P8
Popular version of poster session 2aSC, “Auditory feedback perturbation of vowel production: A comparative study of congenitally blind speakers and sighted speakers”
Presented Tuesday morning, May 19, 2015, Ballroom 2, 8:00 AM – 12:00 noon
169th ASA Meeting, Pittsburgh
When learning to speak, young infants and toddlers use auditory and visual cues to correctly associate speech movements with a specific speech sound. In doing so, typically developing children compare their own speech with that of their ambient language to build and improve the relationship between what they hear, see and feel, and how to produce it.
In many day-to-day situations, we exploit the multimodal nature of speech: in noisy environments, such as a cocktail party, we look at our interlocutor’s face and use lip reading to recover speech sounds. When speaking clearly, we open our mouths wider to make ourselves sound more intelligible. Sometimes, just seeing someone’s face is enough to communicate!
What happens in cases of congenital blindness? Despite the fact that blind speakers learn to produce intelligible speech, they do not quite speak like sighted speakers do. Since they do not perceive others’ visual cues, blind speakers do not produce visible labial movements as much as their sighted peers do.
The French vowel “ou” (similar to the vowel in “cool”) produced by a sighted adult speaker (on the left) and a congenitally blind adult speaker (on the right). We can clearly see that the articulatory movements of the lips are more explicit for the sighted speaker.
Therefore, blind speakers put more weight on what they hear (auditory feedback) than sighted speakers, because one sensory input is lacking. How does that affect the way blind individuals speak?
To answer this question, we conducted an experiment in which we asked congenitally blind adult speakers and sighted adult speakers to produce multiple repetitions of the French vowel “eu”. While they were producing the 130 utterances, we gradually altered their auditory feedback through headphones, without their knowing it, so that they were not hearing the exact sound they were saying. Consequently, they needed to modify the way they produced the vowel to compensate for the acoustic manipulation, so that they could hear the vowel they were asked to produce (and the one they thought they were saying all along!).
We were interested in whether blind speakers and sighted speakers would react differently to this auditory manipulation. Because blind speakers cannot rely on visual feedback, we hypothesized that they would give more weight to their auditory feedback and, therefore, compensate to a greater extent for the acoustic manipulation.
To explore this matter, we observed the acoustic (produced sounds) and articulatory (lip and tongue movements) differences between the two groups at three distinct time points during the experiment.
As predicted, congenitally blind speakers compensated for the altered auditory feedback to a greater extent than their sighted peers. More specifically, even though both speaker groups adapted their productions, the blind group compensated more than the control group did, as if they were integrating the auditory information more strongly. We also found that the two speaker groups used different articulatory strategies to respond to the applied manipulation: blind participants made more use of their tongue (which is not visible when you speak) to compensate. This latter observation is not surprising considering that blind speakers do not use their lips (which are visible when you speak) as much as their sighted peers do.
Acoustic multi-pole source inversions of volcano infrasound
Keehoon Kim – firstname.lastname@example.org
University of Alaska Fairbanks
Wilson Infrasound Observatory, Alaska Volcano Observatory, Geophysical Institute
903 Koyukuk Drive, Fairbanks, Alaska 99775
David Fee – email@example.com
University of Alaska Fairbanks
Wilson Infrasound Observatory, Alaska Volcano Observatory, Geophysical Institute
903 Koyukuk Drive, Fairbanks, Alaska 99775
Akihiko Yokoo – firstname.lastname@example.org
Institute for Geothermal Sciences
Jonathan M. Lees – email@example.com
University of North Carolina Chapel Hill
Department of Geological Sciences
104 South Road, Chapel Hill, North Carolina 27599
Mario Ruiz – firstname.lastname@example.org
Escuela Politecnica Nacional
Popular version of paper 4aPA4, “Acoustic multipole source inversions of volcano infrasound”
Presented Thursday morning, May 21, 2015, at 9:30 AM in room Kings 1
169th ASA Meeting, Pittsburgh
Volcanoes are outstanding natural sources of infrasound (low-frequency acoustic waves below 20 Hz). In the last few decades local infrasound networks have become an essential part of geophysical monitoring systems for volcanic activity. Unlike seismic networks dedicated to monitoring subsurface activity (e.g., magma or fluid transport), infrasound monitoring facilitates detecting and characterizing eruption activity at the Earth’s surface. Figure 1a shows Sakurajima Volcano in southern Japan and an infrasound network deployed in July 2013. Figure 1b is an image of a typical explosive eruption during the field experiment, which produces loud infrasound.
Figure 1. a) A satellite image of Sakurajima Volcano, adapted from Kim and Lees (2014). Five stand-alone infrasound sensors were deployed around Showa Crater in July 2013, indicated by inverted triangles. b) An image of a typical explosive eruption observed during the field campaign.
Source of volcano infrasound
One of the major sources of volcano infrasound is a volume change in the atmosphere. Mass discharge from volcanic eruptions displaces the atmosphere near and around the vent, and this displacement propagates into the atmosphere as acoustic waves. Infrasound signals can, therefore, represent a time history of the atmospheric volume change during eruptions. Volume flux inferred from infrasound data can be further converted into mass eruption rate using the density of the erupting mixture. Mass eruption rate is a critical parameter for forecasting ash-cloud dispersal during eruptions and is consequently important for aviation safety. One of the problems associated with volume flux estimation is that observed infrasound signals can be affected by propagation path effects between the source and receivers. Hence, these path effects must be appropriately accounted for and removed from the signals in order to obtain accurate source parameters.
Infrasound propagation modeling
Figure 2 shows the results of numerical modeling of sound propagation from the vent of Sakurajima Volcano. The sound propagation is simulated by solving the acoustic wave equation using a Finite-Difference Time-Domain method that takes volcanic topography into account. The synthetic wavefield is excited by a Gaussian-like source time function (with 1 Hz corner frequency) inserted at the center of Showa Crater (Figure 2a). A homogeneous atmosphere is assumed, since atmospheric heterogeneity should have limited influence in this local range (< 7 km). The numerical modeling demonstrates that both the amplitude and the waveform of infrasound are significantly affected by the local topography. In Figure 2a, Sound Pressure Level (SPL) relative to the source amplitude is calculated at each computational grid node on the ground surface. The SPL map indicates an asymmetric radiation pattern of acoustic energy. Propagation paths to the northwest of Showa Crater are obstructed by the summit of the volcano (Minamidake), and as a result acoustic shadow zones are created northwest of the summit. Infrasound waveforms also show significant variation across the network. In Figure 2b, synthetic infrasound signals computed at the station positions (ARI - SVO) show bipolar pulses followed by oscillations in pressure, while the pressure time history at the source location exhibits only a positive unipolar pulse. This result indicates that oscillatory infrasound waveforms can be produced not only by source effects but also by propagation path effects. Hence, this waveform distortion must be considered for source parameter inversion.
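The idea behind a Finite-Difference Time-Domain simulation can be shown with a heavily simplified one-dimensional sketch. It is not the 3-D model with topography used in the study: it assumes a uniform medium, a Gaussian pressure pulse as the initial condition, and made-up grid parameters (10 m cells, a 340 m/s sound speed, Courant number 0.5). It also computes SPL relative to the source peak, the quantity mapped in Figure 2a.

```python
import math

def simulate_1d(n_cells=401, n_steps=200, c=340.0, rho=1.2, dx=10.0, courant=0.5):
    # Staggered-grid leapfrog update of pressure p and particle velocity v.
    dt = courant * dx / c
    centre = n_cells // 2
    sigma = 5.0  # pulse width in cells (hypothetical)
    p = [math.exp(-0.5 * ((i - centre) / sigma) ** 2) for i in range(n_cells)]
    v = [0.0] * (n_cells - 1)  # v[i] sits between p[i] and p[i + 1]
    for _ in range(n_steps):
        for i in range(n_cells - 1):
            v[i] -= dt / (rho * dx) * (p[i + 1] - p[i])
        for i in range(1, n_cells - 1):  # end cells are left untouched
            p[i] -= rho * c * c * dt / dx * (v[i] - v[i - 1])
    return p

def spl_relative_db(p_peak, p_source_peak):
    # Sound pressure level in dB relative to the peak pressure at the source.
    return 20.0 * math.log10(abs(p_peak) / abs(p_source_peak))

pressure = simulate_1d()
right_half = pressure[201:]
right_peak = max(right_half)
right_peak_index = 201 + right_half.index(right_peak)
level_db = spl_relative_db(right_peak, 1.0)
```

In this flat one-dimensional case the initial pulse simply splits into two half-amplitude pulses travelling in opposite directions, so the level at the right-going peak is about 6 dB below the source. The asymmetric radiation and shadow zones described above arise only once topography is included.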
Figure 2. a) Sound pressure level in dB relative to the peak pressure at the source position. b) Variation of infrasound waveforms across the network caused by propagation path effects.
Volume flux estimates
Because the wavelengths of volcano infrasound are usually longer than the dimension of the source region, the acoustic sources are typically treated as a monopole, which is a point source approximation of volume expansion or contraction. Infrasound data then represent the convolution of the volume flux history at the source with the response of the propagation medium, called the Green’s function. The volume flux history can be obtained by deconvolving the Green’s functions from the data. The Green’s functions can be obtained in two different ways: 3-D numerical modeling considering local topography (Case 1) and the analytic solution in a half-space neglecting volcanic topography (Case 2). The resulting volume flux histories for a selected infrasound event are compared in Figure 3. Case 1 results in a gradually decreasing volume flux curve, but Case 2 shows pronounced oscillation in volume flux. In Case 2, propagation path effects are not appropriately removed from the data, leading to misinterpretation of the source effect.
A proper Green’s function is critical for accurate estimation of the volume flux history. We obtained a reasonable volume flux history using the 3-D numerical Green’s function. In this study only a simple source model (monopole) was considered for volcanic explosions. A more general representation can be obtained by a multipole expansion of acoustic sources. In our 169th ASA Meeting presentation, we will further discuss the source complexity of volcano infrasound, which requires the higher-order terms of the multipole series.
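The relationship between data, Green’s function and volume flux can be illustrated with a toy discrete example. This is a minimal sketch, not the inversion used in the study: it assumes noise-free samples and a Green’s function whose first sample is non-zero, so that deconvolution reduces to a sample-by-sample recursion. The sequences green and flux are made-up numbers.

```python
def convolve(g, s):
    # Discrete convolution: d[n] = sum over k of g[k] * s[n - k].
    out = [0.0] * (len(g) + len(s) - 1)
    for i, gi in enumerate(g):
        for j, sj in enumerate(s):
            out[i + j] += gi * sj
    return out

def deconvolve(d, g, n_source):
    # Recover s from d = g * s by solving the convolution sum one sample
    # at a time. Exact only for noise-free data with g[0] != 0.
    if g[0] == 0:
        raise ValueError("first Green's function sample must be non-zero")
    s = []
    for n in range(n_source):
        acc = d[n]
        for k in range(1, min(n, len(g) - 1) + 1):
            acc -= g[k] * s[n - k]
        s.append(acc / g[0])
    return s

# Hypothetical oscillatory path response and a smooth one-sided source pulse.
green = [1.0, -0.6, 0.3, -0.1]
flux = [0.0, 1.0, 3.0, 2.0, 1.0, 0.5, 0.0]
data = convolve(green, flux)
recovered = deconvolve(data, green, len(flux))
```

Deconvolving with the correct response returns the smooth pulse, whereas deconvolving with a response that omits the oscillatory terms would leave those oscillations in the estimated source, which is the kind of misinterpretation described for Case 2. With real, noisy records the recursion is unstable, and inversions typically rely on regularized or frequency-domain methods instead.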
Figure 3. Volume flux history inferred from infrasound data. In Case 1, the Green’s function is computed by 3-D numerical modeling considering volcanic topography. In Case 2, the analytic solution of the wave equation in a half-space is used, neglecting the topography.
Kim, K. and J. M. Lees (2014). Local Volcano Infrasound and Source Localization Investigated by 3D Simulation. Seismological Research Letters, 85, 1177-1186