4aSC2 – Effects of language and music experience on speech perception

T. Christina Zhao — zhaotc@uw.edu
Patricia K. Kuhl — pkkuhl@uw.edu
Institute for Learning & Brain Sciences
University of Washington, BOX 357988
Seattle, WA, 98195

Popular version of paper 4aSC2, “Top-down linguistic categories dominate over bottom-up acoustics in lexical tone processing”
Presented Thursday morning, May 21st, 2015, 8:00 AM, Ballroom 2
169th ASA Meeting, Pittsburgh

Speech perception involves a constant interplay between top-down and bottom-up processing. For example, to distinguish phonemes (e.g. ‘b’ from ‘p’), the listener must accurately process the acoustic information in the speech signal (i.e. bottom-up strategy) and efficiently assign these sounds to a category (i.e. top-down strategy). Listeners’ performance in speech perception tasks is influenced by their experience with either processing strategy. Here, we use lexical tone processing as a window to examine how extensive experience with both strategies influences speech perception.

Lexical tones are contrastive pitch contour patterns at the word level. That is, a small difference in the pitch contour can result in a different word meaning. Native speakers of a tonal language thus have extensive experience in using the top-down strategy to assign highly variable pitch contours to lexical tone categories. This top-down influence is reflected in a reduced sensitivity to acoustic differences within a phonemic category compared to across categories (Halle, Chang, & Best, 2004). On the other hand, individuals with extensive music training early in life exhibit enhanced sensitivity to pitch differences not only in music, but also in speech, reflecting a stronger bottom-up influence. This bottom-up influence is reflected in an enhanced sensitivity to differences between lexical tones even in listeners who speak a non-tonal language (Wong, Skoe, Russo, Dees, & Kraus, 2007).

How does extensive experience with both strategies influence lexical tone processing? To address this question, native Mandarin speakers with extensive music training (N=17) completed a music pitch discrimination task and a lexical tone discrimination task. We compared their performance with that of individuals with extensive experience in only one of the processing strategies: Mandarin nonmusicians (N=20) and English musicians (N=20), with data from Zhao & Kuhl (2015).

Despite their enhanced performance in the music pitch discrimination task, the Mandarin musicians performed similarly to the Mandarin nonmusicians in the lexical tone discrimination task, and differently from the English musicians (Fig. 1, ‘Sensitivity across the lexical tone continuum by group’).
Figure 1 – Sensitivity across the lexical tone continuum by group
That is, they exhibited reduced sensitivity within phonemic categories (i.e. at either end of the continuum) compared to across categories (i.e. in the middle of the continuum), and their overall performance was lower than that of the English musicians. This result strongly suggests a dominant effect of top-down influence in processing lexical tones. Yet further analyses revealed that Mandarin musicians and Mandarin nonmusicians may still rely on different underlying mechanisms when performing the lexical tone discrimination task. In the Mandarin musicians, music pitch discrimination scores correlated with lexical tone discrimination scores, suggesting a contribution of the bottom-up strategy to their lexical tone discrimination performance (Fig. 2, ‘Music pitch and lexical tone discrimination’, purple). This relation is similar to that of the English musicians (Fig. 2, peach) but very different from that of the Mandarin nonmusicians (Fig. 2, yellow). Specifically, for Mandarin nonmusicians, music pitch discrimination scores did not correlate with lexical tone discrimination scores, suggesting independent processes.

Figure 2 – Music pitch and lexical tone discrimination
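The group comparison above rests on a simple per-group correlation between the two discrimination scores. As a rough illustration only (the function and variable names below are hypothetical, and no study data are shown), such a correlation could be computed like this in Python:

```python
import numpy as np
from scipy.stats import pearsonr

def score_correlation(music_pitch_scores, lexical_tone_scores):
    """Pearson correlation between two sets of per-participant scores."""
    r, p = pearsonr(np.asarray(music_pitch_scores), np.asarray(lexical_tone_scores))
    return r, p

# Run separately for each group (e.g., Mandarin musicians, English musicians,
# Mandarin nonmusicians) to compare the strength of the relation across groups.
```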

Halle, P. A., Chang, Y. C., & Best, C. T. (2004). Identification and discrimination of Mandarin Chinese tones by Mandarin Chinese vs. French listeners. Journal of Phonetics, 32(3), 395-421. doi: 10.1016/s0095-4470(03)00016-0
Wong, P. C. M., Skoe, E., Russo, N. M., Dees, T., & Kraus, N. (2007). Musical experience shapes human brainstem encoding of linguistic pitch patterns. Nat. Neurosci., 10(4), 420-422. doi: 10.1038/nn1872
Zhao, T. C., & Kuhl, P. K. (2015). Effect of musical experience on learning lexical tone categories. The Journal of the Acoustical Society of America, 137(3), 1452-1463. doi: 10.1121/1.4913457

2aSC – Speech: An eye and ear affair!

Pamela Trudeau-Fisette – ptrudeaufisette@gmail.com
Lucie Ménard – menard.lucie@uqam.ca
Université du Québec à Montréal
320 Ste-Catherine E.
Montréal, H3C 3P8

Popular version of poster session 2aSC, “Auditory feedback perturbation of vowel production: A comparative study of congenitally blind speakers and sighted speakers”
Presented Tuesday morning, May 19, 2015, Ballroom 2, 8:00 AM – 12:00 noon
169th ASA Meeting, Pittsburgh
When learning to speak, young infants and toddlers use auditory and visual cues to correctly associate speech movements with specific speech sounds. In doing so, typically developing children compare their own speech with the speech of their ambient language to build and refine the relationship between what they hear, see, and feel, and how to produce it.

In many day-to-day situations, we exploit the multimodal nature of speech: in noisy environments, such as a cocktail party, we look at our interlocutor’s face and use lip reading to recover speech sounds. When speaking clearly, we open our mouths wider to make ourselves more intelligible. Sometimes, just seeing someone’s face is enough to communicate!

What happens in cases of congenital blindness? Despite the fact that blind speakers learn to produce intelligible speech, they do not quite speak like sighted speakers do. Since they do not perceive others’ visual cues, blind speakers do not produce visible labial movements as much as their sighted peers do.

Production of the French vowel “ou” (similar to the vowel in “cool”) by a sighted adult speaker (on the left) and a congenitally blind adult speaker (on the right). The articulatory movements of the lips are clearly more pronounced for the sighted speaker.

Therefore, blind speakers put more weight on what they hear (auditory feedback) than sighted speakers, because one sensory input is lacking. How does that affect the way blind individuals speak?

To answer this question, we conducted an experiment in which we asked congenitally blind adult speakers and sighted adult speakers to produce multiple repetitions of the French vowel “eu”. While they were producing the 130 utterances, we gradually altered their auditory feedback through headphones – without their knowledge – so that they were not hearing the exact sound they were producing. Consequently, they needed to modify the way they produced the vowel in order to compensate for the acoustic manipulation, so that they could hear the vowel they were asked to produce (and the one they thought they were saying all along!).
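For readers curious about how such a manipulation unfolds over a session, the sketch below lays out a generic baseline–ramp–hold schedule for a feedback perturbation experiment. Only the 130-utterance total comes from the text; the phase lengths, the ramp shape, and the idea of expressing the alteration as a fraction of a maximum shift are illustrative assumptions, not the parameters of this study.

```python
import numpy as np

def perturbation_schedule(n_trials=130, n_baseline=20, n_ramp=50, max_shift=1.0):
    """Illustrative baseline-ramp-hold schedule (assumed phase lengths).

    Returns, for each trial, the fraction of the maximum acoustic shift
    applied to the speaker's auditory feedback.
    """
    shift = np.zeros(n_trials)
    ramp_end = n_baseline + n_ramp
    shift[n_baseline:ramp_end] = np.linspace(0.0, max_shift, n_ramp)  # gradual alteration
    shift[ramp_end:] = max_shift  # hold phase: full perturbation for remaining trials
    return shift
```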

We were interested in whether blind speakers and sighted speakers would react differently to this auditory manipulation. Because the blind speakers cannot rely on visual feedback, we hypothesized that they would place more weight on their auditory feedback and, therefore, compensate to a greater extent for the acoustic manipulation.

To explore this question, we examined the acoustic (produced sounds) and articulatory (lip and tongue movements) differences between the two groups at three distinct time points across the phases of the experiment.

As predicted, congenitally blind speakers compensated for the altered auditory feedback to a greater extent than their sighted peers. More specifically, even though both speaker groups adapted their productions, the blind group compensated more than the control group did, as if they were weighting the auditory information more strongly. We also found that the two speaker groups used different articulatory strategies to respond to the manipulation: blind participants relied more on their tongue (which is not visible when you speak) to compensate. This latter observation is not surprising given that blind speakers do not use their lips (which are visible when you speak) as much as their sighted peers do.

Tags: speech, language, learning, vision, blindness

1pSC2 – Deciding to go (or not to go) to the party may depend as much on your memory as on your hearing

Kathy Pichora-Fuller – k.pichora.fuller@utoronto.ca
Department of Psychology, University of Toronto,
3359 Mississauga Road,
Mississauga, Ontario, CANADA L5L 1C6

Sherri Smith – Sherri.Smith@va.gov
Audiologic Rehabilitation Laboratory, Veterans Affairs Medical Center,
Mountain Home, Tennessee, UNITED STATES 37684

Popular version of paper 1pSC2, “Effects of age, hearing loss and linguistic complexity on listening effort as measured by working memory span”
Presented Monday afternoon, May 18, 2015 (Session: Listening Effort II)
169th ASA Meeting, Pittsburgh

Understanding conversation in noisy everyday situations can be a challenge for listeners, especially individuals who are older and/or hard-of-hearing. Listening in some everyday situations (e.g., at dinner parties) can be so challenging that people might even decide that they would rather stay home than go out. Eventually, avoiding these situations can damage relationships with family and friends and reduce enjoyment of and participation in activities. What are the reasons for these difficulties and why are some people affected more than other people?

How easy or challenging it is to listen may vary from person to person because some people have better hearing abilities and/or cognitive abilities compared to other people. The hearing abilities of some people may be affected by the degree or type of their hearing loss. The cognitive abilities of some people, for example how well they can attend to and remember what they have heard, can also affect how easy it is for them to follow conversation in challenging listening situations. In addition to hearing abilities, cognitive abilities seem to be particularly relevant because in many everyday listening situations people need to listen to more than one person talking at the same time and/or they may need to listen while doing something else such as driving a car or crossing a busy street. The auditory demands that a listener faces in a situation increase as background noise becomes louder or as more interfering sounds combine with each other. The cognitive demands in a situation increase when listeners need to keep track of more people talking or to divide their attention as they try to do more tasks at the same time. Both auditory and cognitive demands could result in the situation becoming very challenging and these demands may even totally overload a listener.

One way to measure information overload is to see how much a person remembers after they have completed a set of tasks. For several decades, cognitive psychologists have been interested in ‘working memory’, or a person’s limited capacity to process information while doing tasks and to remember information after the tasks have been completed. Like a bank account, the more cognitive capacity is spent on processing information while doing tasks, the less cognitive capacity will remain available for remembering and using the information later. Importantly, some people have bigger working memories than other people and people who have a bigger working memory are usually better at understanding written and spoken language. Indeed, many researchers have measured working memory span for reading (i.e., a task involving the processing and recall of visual information) to minimize ‘contamination’ from the effects of hearing loss that might be a problem if they measured working memory span for listening. However, variations in difficulty due to hearing loss may be critically important in assessing how the demands of listening affect different individuals when they are trying to understand speech in noise. Some researchers have studied the effects of the acoustical properties of speech and interfering noises on listening, but less is known about how variations in the type of language materials (words, sentences, stories) might alter listening demands for people who have hearing loss. Therefore, to learn more about why some people cope better when listening to conversation in noise, we need to discover how both their auditory and their cognitive abilities come into play during everyday listening for a range of spoken materials.

We predicted that speech understanding would be more highly associated with working memory span for listening than with working memory span for reading, especially when more realistic language materials are used to measure speech understanding. To test these predictions, we conducted listening and reading tests of working memory, and we also measured memory abilities using five other measures (three auditory memory tests and two visual memory tests). Speech understanding was measured with six tests (two tests with words, one in quiet and one in noise; three tests with sentences, one in quiet and two in noise; and one test with stories in quiet). The tests of speech understanding using words and sentences were selected from typical clinical tests and involved simple immediate repetition of the words or sentences that were heard. The test using stories has been used in laboratory research and involved comprehension questions after the end of the story. Three groups of 24 people each were tested: one group of younger adults with normal hearing (mean age = 23.5 years) and two groups of older adults with hearing loss (one group with mean age = 66.3 years and the other with mean age = 74.3 years).

There was a wide range in performance on the listening test of working memory, but performance on the reading test of working memory was more limited and poorer. Overall, there was a significant correlation between the results on the reading and listening working memory measures. However, when correlations were conducted for each of the three groups separately, the correlation reached significance only for the oldest listeners with hearing loss; this group had lower mean scores on both tests. Surprisingly, for all three groups, there were no significant correlations between the working memory and speech understanding measures. To further investigate this surprising result, a factor analysis was conducted. The results suggest that there was one factor including age, hearing test results, and performance on the speech understanding measures when the speech-understanding task was simply to repeat words or sentences – these seem to reflect auditory abilities. In addition, separate factors were found for performance on the speech understanding measures involving the comprehension of discourse or the use of semantic context in sentences – these seem to reflect linguistic abilities. Importantly, the majority of the memory measures were distinct from both kinds of speech understanding measures, and also from a more basic and less cognitively demanding memory measure involving only the repetition of sets of numbers. Taken together, these findings suggest that working memory measures reflect differences between people in cognitive abilities that are distinct from those tapped by the sorts of simple measures of hearing and speech understanding that have been used in the clinic. By testing working memory, especially listening working memory, clinicians could gain useful information, above and beyond current clinical tests, about why some people cope better than others in challenging everyday listening situations.
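As a rough sketch of the kind of analysis described above (per-group correlations followed by a factor analysis), the Python fragment below uses hypothetical column names such as "listening_span" and "reading_span"; it illustrates the general approach, not the study's actual analysis code.

```python
import pandas as pd
from scipy.stats import pearsonr
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

def per_group_correlations(df, x="listening_span", y="reading_span", group="group"):
    """Pearson correlation between two working-memory measures within each group."""
    return {g: pearsonr(sub[x], sub[y]) for g, sub in df.groupby(group)}

def factor_loadings(df, measure_cols, n_factors=3):
    """Fit a factor analysis to standardized measures and return the loadings."""
    X = StandardScaler().fit_transform(df[measure_cols])
    fa = FactorAnalysis(n_components=n_factors).fit(X)
    return pd.DataFrame(fa.components_.T, index=measure_cols,
                        columns=[f"factor_{i+1}" for i in range(n_factors)])
```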

tags: age, hearing, memory, linguistics, speech

2pSC14 – Improving the Accuracy of Automatic Detection of Emotions From Speech

Reza Asadi and Harriet Fell

Popular version of poster 2pSC14 “Improving the accuracy of speech emotion recognition using acoustic landmarks and Teager energy operator features.”
Presented Tuesday afternoon, May 19, 2015, 1:00 pm – 5:00 pm, Ballroom 2
169th ASA Meeting, Pittsburgh

“You know, I can feel the fear that you carry around and I wish there was… something I could do to help you let go of it because if you could, I don’t think you’d feel so alone anymore.”
— Samantha, a computer operating system in the movie “Her”

Introduction
Computers that can recognize human emotions could react appropriately to a user’s needs and provide more human-like interactions. Emotion recognition could also be used as a diagnostic tool for medical purposes, in onboard car systems that keep the driver alert if stress is detected, in similar systems in aircraft cockpits, and in electronic tutoring and interaction with virtual agents or robots. But is it really possible for computers to detect the emotions of their users?

During the past fifteen years, computer and speech scientists have worked on the automatic detection of emotion in speech. In order to interpret emotions from speech, the machine gathers acoustic information in the form of sound signals, extracts related features from the signals, and finds patterns that relate the acoustic information to the emotional state of the speaker. In this study, new combinations of acoustic feature sets were used to improve the performance of emotion recognition from speech. A comparison of feature sets for detecting different emotions is also provided.

Methodology
Three sets of acoustic features were selected for this study: Mel-Frequency Cepstral Coefficients, Teager Energy Operator features and Landmark features.

Mel-Frequency Cepstral Coefficients:
In order to produce vocal sounds, the vocal cords vibrate and produce periodic pulses, which result in the glottal wave. The vocal tract, starting at the vocal cords and ending at the mouth and nose, acts as a filter on the glottal wave. The cepstrum is a signal analysis tool that is useful for separating the source from the filter in acoustic waves. Since the vocal tract acts as a filter on the glottal wave, we can use the cepstrum to extract information related only to the vocal tract.

The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. Using mel frequencies in cepstral analysis approximates the human auditory system’s response more closely than using linearly spaced frequency bands. If we map the spectral energy of the original speech signal onto the mel scale and then perform cepstral analysis, we get Mel-Frequency Cepstral Coefficients (MFCCs). Previous studies have used MFCCs for speaker and speech recognition, and they have also been used to detect emotions.
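As a minimal sketch of how MFCCs might be extracted in practice (using the librosa library; the file path, the number of coefficients, and the summary statistics are illustrative choices, not necessarily those used in this study):

```python
import numpy as np
import librosa

def mfcc_features(path, n_mfcc=13):
    """Extract MFCCs from a speech file and summarize them per utterance."""
    y, sr = librosa.load(path, sr=None)                     # keep native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    # One fixed-length vector per utterance: mean and std of each coefficient.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```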

Teager Energy Operator features:
Another approach to modeling speech production is to focus on the pattern of airflow in the vocal tract. When speaking in emotional states such as panic or anger, physiological changes like muscle tension alter the airflow pattern and can be used to detect stress in speech. Because it is difficult to model the airflow mathematically, Teager proposed the Teager Energy Operator (TEO), which computes the energy of vortex–flow interaction at each instant in time. Previous studies show that TEO-related features contain information that can be used to detect stress in speech.
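In discrete time the operator itself has a very simple form, Ψ[x[n]] = x[n]² − x[n−1]·x[n+1]. A minimal NumPy sketch follows; note that the TEO-based feature sets used in the study are more elaborate than this bare operator.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[n] = x[n]**2 - x[n-1]*x[n+1]."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]   # defined for the interior samples only
```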

Acoustic landmarks:
Acoustic landmarks are locations in the speech signal where important and easily perceptible speech properties are rapidly changing. Previous studies show that the number of landmarks in each syllable might reflect underlying cognitive, mental, emotional, and developmental states of the speaker.
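Landmark detection algorithms are considerably more involved than can be shown here, but the core idea (flagging moments where energy in a frequency band changes abruptly) can be illustrated with a toy sketch; the band limits, frame settings, and threshold below are arbitrary assumptions, and this is not the detector used in the study.

```python
import numpy as np
import librosa

def abrupt_change_frames(path, band_hz=(800, 2500), thresh_db=9.0):
    """Toy illustration: frames where band energy rises or falls sharply."""
    y, sr = librosa.load(path, sr=None)
    S = np.abs(librosa.stft(y, n_fft=512, hop_length=160)) ** 2  # power spectrogram
    freqs = librosa.fft_frequencies(sr=sr, n_fft=512)
    in_band = (freqs >= band_hz[0]) & (freqs < band_hz[1])
    band_db = 10.0 * np.log10(S[in_band].sum(axis=0) + 1e-10)    # band energy in dB
    delta = np.diff(band_db)                                     # frame-to-frame change
    return np.where(np.abs(delta) > thresh_db)[0]                # candidate frames
```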

Figure 1 – Spectrogram (top) and acoustic landmarks (bottom) detected in a neutral speech sample

Sound File 1 – A speech sample with neutral emotion

Figure 2 – Spectrogram (top) and acoustic landmarks (bottom) detected in an angry speech sample

Sound File 2 – A speech sample with anger emotion


Classification:
The data used in this study came from the Linguistic Data Consortium’s Emotional Prosody Speech and Transcripts corpus. In this database, four actresses and three actors, all in their mid-20s, read a series of semantically neutral utterances (four-syllable dates and numbers) in fourteen emotional states. A description of each emotional state was given to the participants so that the utterances could be articulated in the proper emotional context. The acoustic features described above were extracted from the speech samples in this database and used to train and test Support Vector Machine classifiers with the goal of detecting emotions from speech. The target emotions were anger, fear, disgust, sadness, joy, and neutral.
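A minimal sketch of the classification step using scikit-learn is shown below; the kernel, regularization settings, and cross-validation scheme are illustrative assumptions rather than the configuration reported in the study.

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate_classifier(X, y):
    """X: one row of concatenated acoustic features (MFCC, TEO, landmark
    statistics) per utterance; y: emotion labels such as 'anger' or 'joy'."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validated accuracy
    return scores.mean()
```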

Results
The results of this study show an average detection accuracy of approximately 91% among these six emotions. This is 9% better than a previous study conducted at CMU on the same data set.

Specifically, TEO features improved the detection of anger and fear, and landmark features improved the detection of sadness and joy. The classifier had the highest accuracy, 92%, in detecting anger and the lowest, 87%, in detecting joy.

4aAAa1 – Speech-in-noise recognition as both an experience- and signal-dependent process

Ann Bradlow – abradlow@northwestern.edu
Department of Linguistics
Northwestern University
2016 Sheridan Road
Evanston, IL 60208
Popular version of paper 4aAAa1, “Speech-in-noise recognition as both an experience- and signal-dependent process”

Presented Thursday morning, October 30, 2014
168th ASA Meeting, Indianapolis

Real-world speech understanding in naturally “crowded” auditory soundscapes is a complex operation that acts upon an integrated speech-plus-noise signal. Does all of the auditory “clutter” that surrounds speech make its way into our heads along with the speech? Or do we perceptually isolate and discard background noise at an early stage of processing, based on general acoustic properties that distinguish sounds produced by non-speech noise sources from those produced by human vocal tracts (i.e. speech)?

We addressed these questions by first examining the ability to tune into speech while simultaneously tuning out noise. Is this ability influenced by properties of the listener (their experience-dependent knowledge) as well as by properties of the signal (factors that make it more or less difficult to separate a given target from a given masker)? Listeners were presented with English sentences in a background of competing speech that was either English (matched-language, English-in-English recognition) or another language (mismatched-language, e.g. English-in-Mandarin recognition). Listeners were either native or non-native listeners of English, and were either familiar or unfamiliar with the language of the to-be-ignored background speech (English, Mandarin, Dutch, or Croatian). Overall, we found that matched-language speech-in-speech understanding (English-in-English) is significantly harder than mismatched-language speech-in-speech understanding (e.g. English-in-Mandarin). Importantly, listener familiarity with the background language modulated the magnitude of the mismatched-language benefit. On a smaller time scale of experience, we also found that this benefit was modulated by short-term adaptation to a consistent background language within a test session. Thus, we conclude that speech understanding in conditions involving competing background speech engages experience-dependent knowledge in addition to signal-dependent processes of auditory stream segregation.

Experiment Series 2 then asked if listeners’ memory traces for spoken words with concurrent background noise remain associated in memory with the background noise. Listeners were presented with a list of spoken words and for each word they were asked to indicate if the word was “old” (i.e. had occurred previously in the test session) or “new” (i.e. had not been presented over the course of the experiment). All words were presented with concurrent noise that was either aperiodic in a limited frequency band (i.e. like wind in the trees) or a pure tone. Importantly, both types of noise were clearly from a sound source that was very different from the speech source. In general, words were more likely to be correctly recognized as previously-heard if the noise on the second presentation matched the noise on the first presentation (e.g. pure tone on both first and second presentations of the word). This suggests that the memory trace for spoken words that have been presented in noisy backgrounds includes an association with the specific concurrent noise. That is, even sounds that quite clearly emanate from an entirely different source remain integrated with the cognitive representation of speech rather than being permanently discarded during speech processing.

These findings suggest that real-world speech understanding in naturally “crowded” auditory soundscapes involves an integrated speech-plus-noise signal at various stages of processing and representation. All of the auditory “clutter” that surrounds speech somehow makes its way into our heads along with the speech, leaving us with exquisitely detailed auditory memories from which we build rich representations of our unique experiences.

Important note: The work in this presentation was conducted in a highly collaborative laboratory at Northwestern University. Critical contributors to this work are former group members Susanne Brouwer (now at Utrecht University, Netherlands), Lauren Calandruccio (now at UNC-Chapel Hill), and Kristin Van Engen (now at Washington University, St. Louis), and current group member, Angela Cooper.