1pSC2 – Deciding to go (or not to go) to the party may depend as much on your memory as on your hearing

Kathy Pichora-Fuller – k.pichora.fuller@utoronto.ca
Department of Psychology, University of Toronto,
3359 Mississauga Road,
Mississauga, Ontario, CANADA L5L 1C6

Sherri Smith – Sherri.Smith@va.gov
Audiologic Rehabilitation Laboratory, Veterans Affairs Medical Center,
Mountain Home, Tennessee, UNITED STATES 37684

Popular version of paper 1pSC2 Effects of age, hearing loss and linguistic complexity on listening effort as measured by working memory span
Presented Monday afternoon, May 18, 2015 (Session: Listening Effort II)
169th ASA Meeting, Pittsburgh

Understanding conversation in noisy everyday situations can be a challenge for listeners, especially individuals who are older and/or hard-of-hearing. Listening in some everyday situations (e.g., at dinner parties) can be so challenging that people might even decide that they would rather stay home than go out. Eventually, avoiding these situations can damage relationships with family and friends and reduce enjoyment of and participation in activities. What are the reasons for these difficulties and why are some people affected more than other people?

How easy or challenging it is to listen may vary from person to person because some people have better hearing abilities and/or cognitive abilities compared to other people. The hearing abilities of some people may be affected by the degree or type of their hearing loss. The cognitive abilities of some people, for example how well they can attend to and remember what they have heard, can also affect how easy it is for them to follow conversation in challenging listening situations. In addition to hearing abilities, cognitive abilities seem to be particularly relevant because in many everyday listening situations people need to listen to more than one person talking at the same time and/or they may need to listen while doing something else such as driving a car or crossing a busy street. The auditory demands that a listener faces in a situation increase as background noise becomes louder or as more interfering sounds combine with each other. The cognitive demands in a situation increase when listeners need to keep track of more people talking or to divide their attention as they try to do more tasks at the same time. Both auditory and cognitive demands could result in the situation becoming very challenging and these demands may even totally overload a listener.

One way to measure information overload is to see how much a person remembers after they have completed a set of tasks. For several decades, cognitive psychologists have been interested in ‘working memory’, or a person’s limited capacity to process information while doing tasks and to remember information after the tasks have been completed. Like a bank account, the more cognitive capacity is spent on processing information while doing tasks, the less cognitive capacity will remain available for remembering and using the information later. Importantly, some people have bigger working memories than other people and people who have a bigger working memory are usually better at understanding written and spoken language. Indeed, many researchers have measured working memory span for reading (i.e., a task involving the processing and recall of visual information) to minimize ‘contamination’ from the effects of hearing loss that might be a problem if they measured working memory span for listening. However, variations in difficulty due to hearing loss may be critically important in assessing how the demands of listening affect different individuals when they are trying to understand speech in noise. Some researchers have studied the effects of the acoustical properties of speech and interfering noises on listening, but less is known about how variations in the type of language materials (words, sentences, stories) might alter listening demands for people who have hearing loss. Therefore, to learn more about why some people cope better when listening to conversation in noise, we need to discover how both their auditory and their cognitive abilities come into play during everyday listening for a range of spoken materials.

We predicted that speech understanding would be more highly associated with working memory span for listening than with working memory span for reading, especially when more realistic language materials are used to measure speech understanding. To test these predictions, we conducted listening and reading tests of working memory and we also measured memory abilities using five other measures (three auditory memory tests and two visual memory tests). Speech understanding was measured with six tests (two tests with words, one in quiet and one in noise; three tests with sentences, one in quiet and two in noise; one test with stories in quiet). The tests of speech understanding using words and sentences were selected from typical clinical tests and involved simple immediate repetition of the words or sentences that were heard. The test using stories has been used in laboratory research and involved comprehension questions after the end of the story. Three groups with 24 people in each group were tested: one group of younger adults (mean age = 23.5 years) with normal hearing and two groups of older adults with hearing loss (one group with mean age = 66.3 years and the other with mean age = 74.3 years).

There was a wide range in performance on the listening test of working memory, but performance on the reading test of working memory was more limited and poorer. Overall, there was a significant correlation between the results on the reading and listening working memory measures. However, when correlations were conducted for each of the three groups separately, the correlation reached significance only for the oldest listeners with hearing loss; this group had lower mean scores on both tests. Surprisingly, for all three groups, there were no significant correlations among the working memory and speech understanding measures. To further investigate this surprising result, a factor analysis was conducted. The results of the factor analysis suggest that there was one factor including age, hearing test results and performance on speech understanding measures when the speech-understanding task was simply to repeat words or sentences – these seem to reflect auditory abilities. In addition, separate factors were found for performance on the speech understanding measures involving the comprehension of discourse or the use of semantic context in sentences – these seem to reflect linguistic abilities. Importantly, the majority of the memory measures were distinct from both kinds of speech understanding measures, and also from a more basic and less cognitively demanding memory measure involving only the repetition of sets of numbers. Taken together, these findings suggest that working memory measures reflect differences between people in cognitive abilities that are distinct from those tapped by the sorts of simple measures of hearing and speech understanding that have been used in the clinic. By testing working memory, especially working memory for listening, useful information could be gained, above and beyond current clinical tests, about why some people cope better than others in everyday challenging listening situations.

tags: age, hearing, memory, linguistics, speech

2pSC14 – Improving the Accuracy of Automatic Detection of Emotions From Speech

Reza Asadi and Harriet Fell

Popular version of poster 2pSC14 “Improving the accuracy of speech emotion recognition using acoustic landmarks and Teager energy operator features.”
Presented Tuesday afternoon, May 19, 2015, 1:00 pm – 5:00 pm, Ballroom 2
169th ASA Meeting, Pittsburgh

“You know, I can feel the fear that you carry around and I wish there was… something I could do to help you let go of it because if you could, I don’t think you’d feel so alone anymore.”
— Samantha, a computer operating system in the movie “Her”

Introduction
Computers that can recognize human emotions could react appropriately to a user’s needs and provide more human-like interactions. Emotion recognition could also serve as a diagnostic tool in medicine, in onboard driving systems that keep the driver alert when stress is detected, in similar systems in aircraft cockpits, and in electronic tutoring and interaction with virtual agents or robots. But is it really possible for computers to detect the emotions of their users?

Over the past fifteen years, computer and speech scientists have worked on the automatic detection of emotion in speech. To interpret emotions from speech, the machine gathers acoustic information in the form of sound signals, extracts relevant features from those signals, and finds patterns that relate the acoustic information to the emotional state of the speaker. In this study, new combinations of acoustic feature sets were used to improve the performance of emotion recognition from speech, and the feature sets were compared for their ability to detect different emotions.

Methodology
Three sets of acoustic features were selected for this study: Mel-Frequency Cepstral Coefficients, Teager Energy Operator features and Landmark features.

Mel-Frequency Cepstral Coefficients:
To produce vocal sounds, the vocal cords vibrate, producing periodic pulses that form the glottal wave. The vocal tract, running from the vocal cords to the mouth and nose, acts as a filter on this glottal wave. The cepstrum is a signal-analysis tool that is useful for separating the source from the filter in acoustic waves. Since the vocal tract acts as a filter on the glottal wave, we can use the cepstrum to extract information related only to the vocal tract.

The mel scale is a perceptual scale on which pitches judged by listeners to be equally spaced are an equal distance apart. Using mel-spaced frequency bands in cepstral analysis approximates the human auditory system’s response more closely than linearly-spaced bands do. If we map the power spectrum of the speech signal onto the mel scale and then perform cepstral analysis, we obtain Mel-Frequency Cepstral Coefficients (MFCCs). Previous studies have used MFCCs for speaker and speech recognition, and they have also been used to detect emotions.
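As a rough illustration of this pipeline (windowed power spectrum, mel filterbank, logarithm, then a cosine transform), here is a simplified single-frame sketch in Python. The function names and parameter values are our own illustration, not the configuration used in the study:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_filters=26, n_coeffs=13, n_fft=512):
    """Compute MFCCs for one frame of speech (simplified sketch)."""
    # Power spectrum of the Hamming-windowed frame
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # Triangular filters with centers equally spaced on the mel scale
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    # Log filterbank energies, then a DCT to decorrelate them
    energies = np.log(fbank @ spectrum + 1e-10)
    return dct(energies, type=2, norm='ortho')[:n_coeffs]

fs = 16000
t = np.arange(400) / fs               # one 25 ms frame at 16 kHz
frame = np.sin(2 * np.pi * 300 * t)   # synthetic "voiced" tone
coeffs = mfcc_frame(frame, fs)        # 13 cepstral coefficients
```

In practice these coefficients are computed for every overlapping frame of an utterance, and their statistics over the utterance become the feature vector.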

Teager Energy Operator features:
Another approach to modeling speech production is to focus on the pattern of airflow in the vocal tract. When someone speaks in an emotional state such as panic or anger, physiological changes like muscle tension alter the airflow pattern, and this can be used to detect stress in speech. Because the airflow is difficult to model mathematically, Teager proposed the Teager Energy Operator (TEO), which computes the energy of the vortex-flow interaction at each instant in time. Previous studies show that TEO-related features contain information that can be used to detect stress in speech.
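In discrete time the operator is remarkably simple: Ψ[x(n)] = x(n)² − x(n−1)·x(n+1). A minimal numpy sketch (the function name and test signal are our own illustration, not the study’s code):

```python
import numpy as np

def teager_energy(x):
    """Discrete-time Teager Energy Operator:
    psi[n] = x[n]^2 - x[n-1] * x[n+1],
    computed for the interior samples of the signal."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure sinusoid A*sin(w*n), the operator is exactly constant
# at A^2 * sin(w)^2 -- it tracks amplitude and frequency jointly,
# which is what makes it sensitive to changes in vocal effort.
fs = 8000
n = np.arange(200)
tone = 0.5 * np.sin(2 * np.pi * 440 * n / fs)
psi = teager_energy(tone)
```

Features derived from such TEO profiles (means, variances, band-wise statistics) are what feed the classifier.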

Acoustic landmarks:
Acoustic landmarks are locations in the speech signal where important and easily perceptible speech properties are rapidly changing. Previous studies show that the number of landmarks in each syllable might reflect underlying cognitive, mental, emotional, and developmental states of the speaker.
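Real landmark detectors track abrupt changes in energy across several frequency bands; the following deliberately simplified sketch conveys the flavor by marking frames where broadband energy jumps sharply. All names and thresholds here are hypothetical illustrations, not the detector used in the study:

```python
import numpy as np

def simple_landmarks(signal, frame_len=160, threshold_db=9.0):
    """Toy landmark detector: mark frames where the broadband
    energy (in dB) jumps by more than `threshold_db` between
    adjacent frames -- a rough stand-in for the multi-band
    abrupt-change detection used by real landmark detectors."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    jumps = np.abs(np.diff(energy_db))
    return np.nonzero(jumps > threshold_db)[0] + 1  # landmark frame indices

# Silence followed by a tone burst: one landmark at the onset
fs = 16000
sig = np.concatenate([np.zeros(1600),
                      0.5 * np.sin(2 * np.pi * 200 * np.arange(1600) / fs)])
marks = simple_landmarks(sig)
```

Counting such landmarks per syllable gives the landmark-density features mentioned above.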

Figure 1 – Spectrogram (top) and acoustic landmarks (bottom) detected in a neutral speech sample

Sound File 1 – A speech sample with neutral emotion


Figure 2 – Spectrogram (top) and acoustic landmarks (bottom) detected in an angry speech sample

Sound File 2 – A speech sample with angry emotion


Classification:
The data used in this study came from the Linguistic Data Consortium’s Emotional Prosody Speech and Transcripts corpus. In this database, four actresses and three actors, all in their mid-20s, read a series of semantically neutral utterances (four-syllable dates and numbers) in fourteen emotional states. Participants were given a description of each emotional state and asked to articulate the utterances in the proper emotional context. The acoustic features described above were extracted from the speech samples in this database and used to train and test Support Vector Machine classifiers, with the goal of detecting emotions from speech. The target emotions included anger, fear, disgust, sadness, joy, and neutral.
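A sketch of this train-and-classify step, assuming scikit-learn is available. Because the LDC corpus is licensed and cannot be redistributed, synthetic feature vectors stand in here for the real MFCC, TEO, and landmark features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
emotions = ["anger", "fear", "disgust", "sadness", "joy", "neutral"]

# Synthetic stand-in for per-utterance feature vectors
# (e.g., MFCC + TEO + landmark statistics): one cluster per emotion.
X = np.vstack([rng.normal(loc=i * 2.0, scale=0.5, size=(40, 10))
               for i in range(len(emotions))])
y = np.repeat(np.arange(len(emotions)), 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Standardize the features, then train an RBF-kernel SVM
scaler = StandardScaler().fit(X_train)
clf = SVC(kernel="rbf", C=1.0).fit(scaler.transform(X_train), y_train)
accuracy = clf.score(scaler.transform(X_test), y_test)
```

With real emotional-speech features the classes overlap far more than these synthetic clusters do, which is why feature-set choice matters so much to the reported accuracies.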

Results
The results of this study show an average detection accuracy of approximately 91% among these six emotions. This is 9% better than a previous study conducted at CMU on the same data set.

Specifically, TEO features improved the detection of anger and fear, while landmark features improved the detection of sadness and joy. The classifier was most accurate at detecting anger (92%) and least accurate at detecting joy (87%).

4aAAa1 – Speech-in-noise recognition as both an experience- and signal-dependent process

Ann Bradlow – abradlow@northwestern.edu
Department of Linguistics
Northwestern University
2016 Sheridan Road
Evanston, IL 60208
Popular version of paper 4aAAa1

Presented Thursday morning, October 30, 2014
168th ASA Meeting, Indianapolis

Real-world speech understanding in naturally “crowded” auditory soundscapes is a complex operation that acts upon an integrated speech-plus-noise signal. Does all of the auditory “clutter” that surrounds speech make its way into our heads along with the speech? Or do we perceptually isolate and discard background noise at an early stage of processing, based on general acoustic properties that distinguish sounds from non-speech noise sources from sounds produced by human vocal tracts (i.e., speech)?

We addressed these questions by first examining the ability to tune into speech while simultaneously tuning out noise. Is this ability influenced by properties of the listener (their experience-dependent knowledge) as well as by properties of the signal (factors that make it more or less difficult to separate a given target from a given masker)? Listeners were presented with English sentences in a background of competing speech that was either English (matched-language, English-in-English recognition) or another language (mismatched-language, e.g. English-in-Mandarin recognition). Listeners were either native or non-native listeners of English and were either familiar or unfamiliar with the language of the to-be-ignored, background speech (English, Mandarin, Dutch, or Croatian). Overall, we found that matched-language speech-in-speech understanding (English-in-English) is significantly harder than mismatched-language speech-in-speech understanding (e.g. English-in-Mandarin). Importantly, listener familiarity with the background language modulated the magnitude of the mismatched-language benefit. On a smaller time scale of experience, we also found that this benefit is modulated by short-term adaptation to a consistent background language within a test session. Thus, we conclude that speech understanding in conditions that involve competing background speech engages experience-dependent knowledge in addition to signal-dependent processes of auditory stream segregation.

Experiment Series 2 then asked if listeners’ memory traces for spoken words with concurrent background noise remain associated in memory with the background noise. Listeners were presented with a list of spoken words and for each word they were asked to indicate if the word was “old” (i.e. had occurred previously in the test session) or “new” (i.e. had not been presented over the course of the experiment). All words were presented with concurrent noise that was either aperiodic in a limited frequency band (i.e. like wind in the trees) or a pure tone. Importantly, both types of noise were clearly from a sound source that was very different from the speech source. In general, words were more likely to be correctly recognized as previously-heard if the noise on the second presentation matched the noise on the first presentation (e.g. pure tone on both first and second presentations of the word). This suggests that the memory trace for spoken words that have been presented in noisy backgrounds includes an association with the specific concurrent noise. That is, even sounds that quite clearly emanate from an entirely different source remain integrated with the cognitive representation of speech rather than being permanently discarded during speech processing.

These findings suggest that real-world speech understanding in naturally “crowded” auditory soundscapes involves an integrated speech-plus-noise signal at various stages of processing and representation. All of the auditory “clutter” that surrounds speech somehow makes its way into our heads along with the speech, leaving us with exquisitely detailed auditory memories from which we build rich representations of our unique experiences.

Important note: The work in this presentation was conducted in a highly collaborative laboratory at Northwestern University. Critical contributors to this work are former group members Susanne Brouwer (now at Utrecht University, Netherlands), Lauren Calandruccio (now at UNC-Chapel Hill), and Kristin Van Engen (now at Washington University, St. Louis), and current group member, Angela Cooper.