Popular version of poster presentation 2pSCb11, “Effect of menstrual phase on dichotic listening”
Presented Tuesday afternoon, November 3, 2015, 3:30 PM, Grand Ballroom 8
How speech is processed by the brain has long been of interest to researchers and clinicians. One method to evaluate how the two sides of the brain work when hearing speech is called a dichotic listening task. In a dichotic listening task two words are presented simultaneously to a participant’s left and right ears via headphones. One word is presented to the left ear and a different one to the right ear. These words are spoken at the same pitch and loudness levels. The listener then indicates what word was heard. If the listener regularly reports hearing the words presented to one ear, then there is an ear advantage. Since most language processing occurs in the left hemisphere of the brain, most listeners attend more closely to the right ear. The regular selection of the word presented to the right ear is termed a right ear advantage (REA).
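One common way to turn ear reports into a single number is a laterality index. The text above does not say how the ear advantage was scored in this study, so the formula and the function name `laterality_index` below are assumptions used only for illustration. A minimal sketch:

```python
def laterality_index(right_reports, left_reports):
    """Return 100 * (R - L) / (R + L).

    Positive values indicate a right ear advantage, negative values a
    left ear advantage. This scoring convention is an assumption; the
    study summarized here does not state its own formula.
    """
    total = right_reports + left_reports
    if total == 0:
        raise ValueError("no ear reports to score")
    return 100.0 * (right_reports - left_reports) / total


# Hypothetical session: 18 right-ear reports and 12 left-ear reports.
index = laterality_index(18, 12)
print(f"laterality index: {index:.1f}")
```

With these made-up counts the index is positive, which would be read as a right ear advantage for that session.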
Previous researchers reported different responses from males and females to dichotic presentation of words. Those investigators found that males more consistently heard the word presented to the right ear and demonstrated a stronger REA. The female listeners in those studies exhibited more variability as to the ear of the word that was heard. Further research seemed to indicate that women exhibit different lateralization of speech processing at different phases of their menstrual cycle. In addition, data from recent studies indicate that the degree to which women can focus on the input to one ear or the other varies with their menstrual cycle.
However, the previous studies used a small number of participants. The purpose of the present study was to complete a dichotic listening study with a larger sample of female participants. In addition, the previous studies focused on women who did not take oral contraceptives, because women taking oral contraceptives were assumed to have smaller shifts in the lateralization of speech processing. Although this assumption is reasonable, it needs to be tested. For this study, it was hypothesized that the women would exhibit a greater REA during the days that they menstruate than during other days of their menstrual cycle. This hypothesis was based on the previous research reports. In addition, it was hypothesized that the women taking oral contraceptives would exhibit smaller fluctuations in the lateralization of their speech processing.
Participants in the study were 64 females, 19-25 years of age. Among the women 41 were taking oral contraceptives (OC) and 23 were not. The participants listened to the sound files during nine sessions that occurred once per week. All of the women were in good general health and had no speech, language, or hearing deficits.
The dichotic listening task was executed using the Alvin software package for speech perception research. The sound file consisted of consonant-vowel syllables composed of the six plosive consonants /b/, /d/, /g/, /p/, /t/, and /k/ paired with the vowel “ah”. The listeners heard the syllables over stereo headphones. Each listener set the loudness of the syllables to a comfortable level.
At the beginning of the listening session, each participant wrote down the date of the initiation of her most recent menstrual period on a participant sheet identified by her participant number. Then she heard the recorded syllables and indicated the consonant heard by striking that key on the computer keyboard. Each listening session consisted of three presentations of the syllables. There were different randomizations of the syllables for each presentation. In the first presentation, the stimuli were presented in a non-forced condition. In this condition the listener indicated the plosive that she heard most clearly. After the first presentation, the experimental files were presented in a manner referred to as a forced left or forced right condition. In these two conditions the participant was directed to focus on the signal in the left or right ear. The order of the forced left and forced right conditions was counterbalanced over the sessions.
The statistical analyses of the listeners’ responses revealed that no significant differences occurred between the women using oral contraceptives and those who did not. In addition, correlations between the day of the women’s menstrual cycle and their responses were consistently low. However, some patterns did emerge for the women’s responses across the experimental sessions as opposed to the days of their menstrual cycle. The participants in both groups exhibited a higher REA and lower percentage of errors for the final sessions in comparison to earlier sessions.
The results from the current subjects differ from those previously reported. Possibly the larger sample size of the current study, the additional month of data collection, or the data recording method affected the results. The larger sample size might have better represented how most women respond to dichotic listening tasks. The additional month of data collection may have allowed the women to learn how to respond to the task and then respond in a more consistent manner. The shorter data collection period of the previous studies may have confounded learning to respond to a novel task with a hormonally dependent response. Finally, previous studies had the experimenter record the subjects’ responses. That method of data recording may have added bias to the data collection. Further studies with large data sets and multiple months of data collection are needed to determine any effects of sex and oral contraceptive use on REA.
Popular version of poster 2pSC14 “Improving the accuracy of speech emotion recognition using acoustic landmarks and Teager energy operator features.”
Presented Tuesday afternoon, May 19, 2015, 1:00 pm – 5:00 pm, Ballroom 2
169th ASA Meeting, Pittsburgh
“You know, I can feel the fear that you carry around and I wish there was… something I could do to help you let go of it because if you could, I don’t think you’d feel so alone anymore.”
— Samantha, a computer operating system in the movie “Her”
Computers that can recognize human emotions could react appropriately to a user’s needs and provide more human-like interactions. Emotion recognition can also be used as a diagnostic tool for medical purposes, in onboard car driving systems that keep the driver alert if stress is detected, in similar systems in aircraft cockpits, and in electronic tutoring and interaction with virtual agents or robots. But is it really possible for computers to detect the emotions of their users?
During the past fifteen years, computer and speech scientists have worked on the automatic detection of emotion in speech. In order to interpret emotions from speech, the machine gathers acoustic information in the form of sound signals, then extracts related information from the signals and finds patterns that relate the acoustic information to the emotional state of the speaker. In this study new combinations of acoustic feature sets were used to improve the performance of emotion recognition from speech. A comparison of feature sets for detecting different emotions is also provided.
Three sets of acoustic features were selected for this study: Mel-Frequency Cepstral Coefficients, Teager Energy Operator features and Landmark features.
Mel-Frequency Cepstral Coefficients:
In order to produce vocal sounds, the vocal cords vibrate and produce periodic pulses which result in a glottal wave. The vocal tract, starting from the vocal cords and ending in the mouth and nose, acts as a filter on the glottal wave. The cepstrum is a signal analysis tool which is useful in separating source from filter in acoustic waves. Since the vocal tract acts as a filter on a glottal wave, we can use the cepstrum to extract information related only to the vocal tract.
The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. Using mel frequencies in cepstral analysis approximates the human auditory system’s response more closely than using linearly spaced frequency bands. If we map the energy in the spectrum of the original speech wave onto the mel scale and then perform cepstral analysis, we get Mel-Frequency Cepstral Coefficients (MFCC). Previous studies have used MFCC for speaker and speech recognition. They have also been used to detect emotions.
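The mapping from ordinary frequency to the mel scale can be sketched in a few lines. The text does not say which mel formula the study used; the logarithmic form below (2595 · log10(1 + f/700)) is one widely used convention and is an assumption here, as are the function names. The sketch covers only the frequency mapping, not the full MFCC computation.

```python
import math


def hz_to_mel(hz):
    """Convert frequency in Hz to mels (assumed 2595*log10 convention)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_centers(low_hz, high_hz, n_bands):
    """Center frequencies (Hz) of n_bands bands equally spaced in mels."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_bands + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(n_bands)]


# Hypothetical analysis range of 0 to 8000 Hz split into 10 bands.
centers = mel_spaced_centers(0.0, 8000.0, 10)
for center in centers:
    print(f"{center:8.1f} Hz")
```

Bands that are evenly spaced in mels come out closely packed at low frequencies and progressively wider apart at high frequencies, which is the sense in which the mel scale departs from linearly spaced bands.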
Teager Energy Operator features:
Another approach to modeling speech production is to focus on the pattern of airflow in the vocal tract. While speaking in emotional states of panic or anger, physiological changes like muscle tension alter the airflow pattern, and these changes can be used to detect stress in speech. Because it is difficult to model the airflow mathematically, Teager proposed the Teager Energy Operator (TEO), which computes the energy of vortex-flow interaction at each instant of time. Previous studies show that TEO-related features contain information which can be used to determine stress in speech.
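For sampled signals, the operator is commonly written in the discrete form x[n]² − x[n−1]·x[n+1]. The sketch below applies that form to a synthetic tone; the function name and the test signal are illustrative assumptions, and the TEO-based features used in the study are not specified here and are probably more elaborate than this raw operator.

```python
import math


def teager_energy(samples):
    """Discrete Teager energy: x[n]**2 - x[n-1] * x[n+1].

    The first and last samples have no two-sided neighbors, so the
    result is two values shorter than the input.
    """
    return [
        samples[n] ** 2 - samples[n - 1] * samples[n + 1]
        for n in range(1, len(samples) - 1)
    ]


# Hypothetical test signal: a pure tone with amplitude 0.5 and a
# digital frequency of 0.3 radians per sample.
amplitude, omega = 0.5, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(50)]
energy = teager_energy(tone)
print(min(energy), max(energy))
```

For a pure tone the output is constant and equal to amplitude² · sin²(omega), so the operator tracks both how large and how fast the oscillation is using only three neighboring samples.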
Landmark features:
Acoustic landmarks are locations in the speech signal where important and easily perceptible speech properties are rapidly changing. Previous studies show that the number of landmarks in each syllable might reflect underlying cognitive, mental, emotional, and developmental states of the speaker.
Sound File 1 – A speech sample with neutral emotion
Sound File 2 – A speech sample with anger emotion
Figure 1 – Spectrogram (top) and acoustic landmarks (bottom) detected in neutral speech sample
Figure 2 – Spectrogram (top) and acoustic landmarks (bottom) detected in anger speech sample
The data used in this study came from the Linguistic Data Consortium’s Emotional Prosody and Speech Transcripts. In this database four actresses and three actors, all in their mid-20s, read a series of semantically neutral utterances (four-syllable dates and numbers) in fourteen emotional states. A description for each emotional state was handed over to the participants to be articulated in the proper emotional context. Acoustic features described previously were extracted from the speech samples in this database. These features were used for training and testing Support Vector Machine classifiers with the goal of detecting emotions from speech. The target emotions included anger, fear, disgust, sadness, joy, and neutral.
The results of this study show an average detection accuracy of approximately 91% among these six emotions. This is 9% better than a previous study conducted at CMU on the same data set.
Specifically, TEO features resulted in improvements in detecting anger and fear, and landmark features improved the results for detecting sadness and joy. The classifier had the highest accuracy, 92%, in detecting anger and the lowest, 87%, in detecting joy.
Speech: An eye and ear affair!
Pamela Trudeau-Fisette – firstname.lastname@example.org
Lucie Ménard – email@example.com
Université du Québec à Montréal
320 Ste-Catherine E.
Montréal, H3C 3P8
Popular version of poster session 2aSC, “Auditory feedback perturbation of vowel production: A comparative study of congenitally blind speakers and sighted speakers”
Presented Tuesday morning, May 19, 2015, Ballroom 2, 8:00 AM – 12:00 noon
169th ASA Meeting, Pittsburgh
When learning to speak, young infants and toddlers use auditory and visual cues to correctly associate speech movements with a specific speech sound. In doing so, typically developing children compare their own speech with that of their ambient language to build and improve the relationship between what they hear, see and feel, and how to produce it.
In many day-to-day situations, we exploit the multimodal nature of speech: in noisy environments, such as a cocktail party, we look at our interlocutor’s face and use lip reading to recover speech sounds. When speaking clearly, we open our mouths wider to make ourselves more intelligible. Sometimes, just seeing someone’s face is enough to communicate!
What happens in cases of congenital blindness? Despite the fact that blind speakers learn to produce intelligible speech, they do not quite speak like sighted speakers do. Since they do not perceive others’ visual cues, blind speakers do not produce visible labial movements as much as their sighted peers do.
Production of the French vowel “ou” (similar to the vowel in “cool”) by a sighted adult speaker (on the left) and a congenitally blind adult speaker (on the right). We can clearly see that the articulatory movements of the lips are more pronounced for the sighted speaker.
Therefore, blind speakers put more weight on what they hear (auditory feedback) than sighted speakers, because one sensory input is lacking. How does that affect the way blind individuals speak?
To answer this question, we conducted an experiment during which we asked congenitally blind adult speakers and sighted adult speakers to produce multiple repetitions of the French vowel “eu”. While they were producing the 130 utterances, we gradually altered their auditory feedback through headphones, without their knowing it, so that they were not hearing the exact sound they were saying. Consequently, they needed to modify the way they produced the vowel in order to compensate for the acoustic manipulation, so they could hear the vowel they were asked to produce (and the one they thought they were saying all along!).
What we were interested in is whether blind speakers and sighted speakers would react differently to this auditory manipulation. Because blind speakers cannot rely on visual feedback, we hypothesized that they would place more importance on their auditory feedback and, therefore, compensate to a greater extent for the acoustic manipulation.
To explore this matter, we observed the acoustic (produced sounds) and articulatory (lip and tongue movements) differences between the two groups at three distinct time points during the experiment.
As predicted, congenitally blind speakers compensated for the altered auditory feedback to a greater extent than their sighted peers. More specifically, even though both speaker groups adapted their productions, the blind group compensated more than the control group did, as if they were integrating the auditory information more strongly. Also, we found that the two speaker groups used different articulatory strategies to respond to the applied manipulation: blind participants used their tongue (which is not visible when you speak) more to compensate. This latter observation is not surprising considering that blind speakers do not use their lips (which are visible when you speak) as much as their sighted peers do.
Effects of language and music experience on speech perception
T. Christina Zhao — firstname.lastname@example.org
Patricia K. Kuhl — email@example.com
Institute for Learning & Brain Sciences
University of Washington, BOX 357988
Seattle, WA, 98195
Popular version of paper 4aSC2, “Top-down linguistic categories dominate over bottom-up acoustics in lexical tone processing”
Presented Thursday morning, May 21st, 2015, 8:00 AM, Ballroom 2
169th ASA Meeting, Pittsburgh
Speech perception involves constant interplay between top-down and bottom-up processing. For example, to process phonemes (e.g. to distinguish ‘b’ from ‘p’), the listener must accurately process the acoustical information in the speech signals (i.e. bottom-up strategy) and assign these sounds efficiently to a category (i.e. top-down strategy). Listeners’ performance in speech perception tasks is influenced by their experience in either processing strategy. Here, we use lexical tone processing as a window to examine how extensive experience in both strategies influences speech perception.
Lexical tones are contrastive pitch contour patterns at the word level. That is, a small difference in the pitch contour can result in different word meaning. Native speakers of a tonal language thus have extensive experience in using the top-down strategy to assign highly variable pitch contours into lexical tone categories. This top-down influence is reflected by the reduced sensitivity to acoustic differences within a phonemic category compared to across categories (Halle, Chang, & Best, 2004). On the other hand, individuals with extensive music training early in life exhibit enhanced sensitivities to pitch differences not only in music, but also in speech, reflecting stronger bottom-up influence. Such bottom-up influence is reflected by the enhanced sensitivity in detecting differences between lexical tones when the listeners are non-tonal language speakers (Wong, Skoe, Russo, Dees, & Kraus, 2007).
How does extensive experience in both strategies influence lexical tone processing? To address this question, native Mandarin speakers with extensive music training (N=17) completed a music pitch discrimination task and a lexical tone discrimination task. We compared their performance with individuals with extensive experience in only one of the processing strategies (i.e. Mandarin nonmusicians (N=20) and English musicians (N=20), data from Zhao & Kuhl (2015)).
Despite the enhanced performance in the music pitch discrimination task in Mandarin musicians, their performance in the lexical tone discrimination task is similar to the performance of the Mandarin nonmusicians, and different from the English musicians’ performance (Fig. 1, ‘Sensitivity across lexical tone continuum by group’). That is, they exhibited reduced sensitivities within phonemic categories (i.e. on either end of the line) compared to across categories (i.e. the middle of the line), and their overall performance is lower than that of the English musicians. This result strongly suggests a dominant effect of the top-down influence in processing lexical tone. Yet, further analyses revealed that Mandarin musicians and Mandarin nonmusicians may still be relying on different underlying mechanisms for performing in the lexical tone discrimination task. In the Mandarin musicians, music pitch discrimination scores are correlated with lexical tone discrimination scores, suggesting a contribution of the bottom-up strategy in their lexical tone discrimination performance (Fig. 2, ‘Music pitch and lexical tone discrimination’, purple). This relation is similar to that in the English musicians (Fig. 2, peach) but very different from that in the Mandarin nonmusicians (Fig. 2, yellow). Specifically, for Mandarin nonmusicians, the music pitch discrimination scores do not correlate with the lexical tone discrimination scores, suggesting independent processes.
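The relation between two sets of scores within a group can be summarized with a correlation coefficient. The text does not state which coefficient was used, so the Pearson coefficient below is an assumption, and the scores are invented purely to show the calculation; they are not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-listener scores for one group (not from the study).
music_pitch = [0.62, 0.70, 0.75, 0.81, 0.88, 0.93]
lexical_tone = [0.55, 0.61, 0.60, 0.72, 0.74, 0.80]
print(f"r = {pearson_r(music_pitch, lexical_tone):.2f}")
```

A value near 1 would mean that listeners who score well on one task tend to score well on the other, while a value near 0 would be consistent with the two tasks drawing on independent processes.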
Halle, P. A., Chang, Y. C., & Best, C. T. (2004). Identification and discrimination of Mandarin Chinese tones by Mandarin Chinese vs. French listeners. Journal of Phonetics, 32(3), 395-421. doi: 10.1016/s0095-4470(03)00016-0
Wong, P. C. M., Skoe, E., Russo, N. M., Dees, T., & Kraus, N. (2007). Musical experience shapes human brainstem encoding of linguistic pitch patterns. Nat. Neurosci., 10(4), 420-422. doi: 10.1038/nn1872
Zhao, T. C., & Kuhl, P. K. (2015). Effect of musical experience on learning lexical tone categories. The Journal of the Acoustical Society of America, 137(3), 1452-1463. doi: 10.1121/1.4913457
Understanding conversation in noisy everyday situations can be a challenge for listeners, especially individuals who are older and/or hard-of-hearing. Listening in some everyday situations (e.g., at dinner parties) can be so challenging that people might even decide that they would rather stay home than go out. Eventually, avoiding these situations can damage relationships with family and friends and reduce enjoyment of and participation in activities. What are the reasons for these difficulties and why are some people affected more than other people?
How easy or challenging it is to listen may vary from person to person because some people have better hearing abilities and/or cognitive abilities compared to other people. The hearing abilities of some people may be affected by the degree or type of their hearing loss. The cognitive abilities of some people, for example how well they can attend to and remember what they have heard, can also affect how easy it is for them to follow conversation in challenging listening situations. In addition to hearing abilities, cognitive abilities seem to be particularly relevant because in many everyday listening situations people need to listen to more than one person talking at the same time and/or they may need to listen while doing something else such as driving a car or crossing a busy street. The auditory demands that a listener faces in a situation increase as background noise becomes louder or as more interfering sounds combine with each other. The cognitive demands in a situation increase when listeners need to keep track of more people talking or to divide their attention as they try to do more tasks at the same time. Both auditory and cognitive demands could result in the situation becoming very challenging and these demands may even totally overload a listener.
One way to measure information overload is to see how much a person remembers after they have completed a set of tasks. For several decades, cognitive psychologists have been interested in ‘working memory’, or a person’s limited capacity to process information while doing tasks and to remember information after the tasks have been completed. Like a bank account, the more cognitive capacity is spent on processing information while doing tasks, the less cognitive capacity will remain available for remembering and using the information later. Importantly, some people have bigger working memories than other people and people who have a bigger working memory are usually better at understanding written and spoken language. Indeed, many researchers have measured working memory span for reading (i.e., a task involving the processing and recall of visual information) to minimize ‘contamination’ from the effects of hearing loss that might be a problem if they measured working memory span for listening. However, variations in difficulty due to hearing loss may be critically important in assessing how the demands of listening affect different individuals when they are trying to understand speech in noise. Some researchers have studied the effects of the acoustical properties of speech and interfering noises on listening, but less is known about how variations in the type of language materials (words, sentences, stories) might alter listening demands for people who have hearing loss. Therefore, to learn more about why some people cope better when listening to conversation in noise, we need to discover how both their auditory and their cognitive abilities come into play during everyday listening for a range of spoken materials.
We predicted that speech understanding would be more highly associated with working memory span for listening than with working memory span for reading, especially when more realistic language materials are used to measure speech understanding. To test these predictions, we conducted listening and reading tests of working memory and we also measured memory abilities using five other measures (three auditory memory tests and two visual memory tests). Speech understanding was measured with six tests (two tests with words, one in quiet and one in noise; three tests with sentences, one in quiet and two in noise; one test with stories in quiet). The tests of speech understanding using words and sentences were selected from typical clinical tests and involved simple immediate repetition of the words or sentences that were heard. The test using stories has been used in laboratory research and involved comprehension questions after the end of the story. Three groups with 24 people in each group were tested: one group of younger adults (mean age = 23.5 years) with normal hearing and two groups of older adults with hearing loss (one group with mean age = 66.3 years and the other group with mean age = 74.3 years).
There was a wide range in performance on the listening test of working memory, but performance on the reading test of working memory was more limited and poorer. Overall, there was a significant correlation between the results on the reading and listening working memory measures. However, when correlations were conducted for each of the three groups separately, the correlation reached significance only for the oldest listeners with hearing loss; this group had lower mean scores on both tests. Surprisingly, for all three groups, there were no significant correlations among the working memory and speech understanding measures. To further investigate this surprising result, a factor analysis was conducted. The results of the factor analysis suggest that there was one factor including age, hearing test results and performance on speech understanding measures when the speech-understanding task was simply to repeat words or sentences – these seem to reflect auditory abilities. In addition, separate factors were found for performance on the speech understanding measures involving the comprehension of discourse or the use of semantic context in sentences – these seem to reflect linguistic abilities. Importantly, the majority of the memory measures were distinct from both kinds of speech understanding measures, and also from a more basic and less cognitively demanding memory measure involving only the repetition of sets of numbers. Taken together, these findings suggest that working memory measures reflect differences between people in cognitive abilities that are distinct from those tapped by the sorts of simple measures of hearing and speech understanding that have been used in the clinic. Above and beyond current clinical tests, testing working memory, especially listening working memory, could yield useful information about why some people cope better than others in challenging everyday listening situations.
Presentation #1pSC2 “Effect of age, hearing loss, and linguistic complexity on listening effort as measured by working memory span” by Margaret K. Pichora-Fuller and Sherri L. Smith will take place on Monday, May 18, 2015, at 1:55 PM in Kings 4 at the Wyndham Grand Pittsburgh Downtown Hotel. The abstract can be found by searching for the presentation number here: