Kristin Van Engen – kvanengen@wustl.edu Avanti Dey Nichole Runge Mitchell Sommers Brent Spehar Jonathan E. Peelle
Washington University in St. Louis 1 Brookings Drive St. Louis, MO 63130
Popular version of paper 4aSC12 Presented Thursday morning, May 10, 2018 175th ASA Meeting, Minneapolis, MN
How hard is it to recognize a spoken word?
Well, that depends. Are you old or young? How is your hearing? Are you at home or in a noisy restaurant? Is the word one that is used often, or one that is relatively uncommon? Does it sound similar to lots of other words in the language?
As people age, understanding speech becomes more challenging, especially in noisy situations like parties or restaurants. This is perhaps unsurprising, given the large proportion of older adults who have some degree of hearing loss. However, hearing measurements do not actually do a very good job of predicting the difficulty a person will have with speech recognition, and older adults tend to do worse than younger adults even when their hearing is good.
We also know that some words are more difficult to recognize than others. Words that are used rarely are more difficult than common words, and words that sound similar to many other words in the language are recognized less accurately than unique-sounding words. Relatively little is known, however, about how these kinds of challenges interact with background noise to affect the process of word recognition or how such effects might change across the lifespan.
In this study, we used eye tracking to investigate how noise and word frequency affect the process of understanding spoken words. Listeners were shown a computer screen displaying four images, and listened to the instruction “Click on the” followed by a target word (e.g., “Click on the dog.”). As the speech signal unfolds, the eye tracker records the moment-by-moment direction of the person’s gaze (60 times per second). Since listeners direct their gaze toward the visual information that matches incoming auditory information, this allows us to observe the process of word recognition in real time.
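As a rough illustration of how gaze samples of this kind can be turned into a time course of word recognition, the sketch below computes the proportion of samples that land on the target image in successive time bins. Only the 60 Hz sampling rate comes from the text; the function name, the region labels, and the 100 ms bin width are assumptions made for the example, not details of the study's analysis.

```python
from collections import defaultdict

SAMPLE_RATE_HZ = 60  # from the text: gaze is recorded 60 times per second


def target_fixation_curve(trials, bin_ms=100):
    """Return {bin_start_ms: proportion of gaze samples on the target}.

    `trials` is a list of trials. Each trial is a list of region labels,
    one per gaze sample, starting at the onset of the target word
    (hypothetical labels: "target", "distractor", or None for off-image).
    """
    samples_per_bin = max(1, round(bin_ms * SAMPLE_RATE_HZ / 1000))
    on_target = defaultdict(int)
    total = defaultdict(int)
    for trial in trials:
        for i, region in enumerate(trial):
            bin_start = (i // samples_per_bin) * bin_ms
            total[bin_start] += 1
            if region == "target":
                on_target[bin_start] += 1
    return {b: on_target[b] / total[b] for b in sorted(total)}


# Two made-up trials: one listener finds the target after 100 ms,
# the other is already looking at it from the start.
example_trials = [
    ["distractor"] * 6 + ["target"] * 6,
    ["target"] * 12,
]
curve = target_fixation_curve(example_trials)
```

A curve that rises toward 1.0 sooner would correspond to faster recognition, which is the kind of comparison the conditions in this study (noise, word frequency, age group) call for.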
Our results indicate that word recognition is slower in noise than in quiet, slower for low-frequency words than high-frequency words, and slower for older adults than younger adults. Interestingly, young adults were more slowed down by noise than older adults. The main difference, however, was that young adults were considerably faster to recognize words in quiet conditions. That is, word recognition by older adults didn’t differ much from quiet to noisy conditions, but young listeners looked like older listeners when tasked with listening to speech in noise.
Rosario Signorello – rsignorello@ucla.edu Department of Head and Neck Surgery 31-20 Rehab Center, Los Angeles, CA 90095-1794 Phone: +1 (323) 703-9549
Popular version of paper 1pSC26 “Acoustics and Perception of Charisma in Bilingual English-Spanish 2016 United States Presidential Election Candidates” Presented at the 171st Meeting on Monday May 23, 1:00 pm – 5:00 pm, Salon F, Salt Lake Marriott Downtown at City Creek Hotel, Salt Lake City, Utah
Charisma is the set of leadership characteristics, such as vision, emotions, and dominance, used by leaders to share beliefs, persuade listeners and achieve goals. Politicians use voice to convey charisma and appeal to voters to gain social positions of power. “Charismatic voice” refers to the ensemble of vocal acoustic patterns used by speakers to convey personality traits and arouse specific emotional states in listeners. The ability to manipulate charismatic voice results from speakers’ universal and learned strategies to use specific vocal parameters (such as vocal pitch, loudness, phonation types, pauses, pitch contours, etc.) to convey their biological features and their social image (see Ohala, 1984; Signorello, 2014a, 2014b; Puts et al., 2007). Listeners’ perception of the physical, psychological and social characteristics of the leader is influenced by universal ways to emotionally respond to vocalizations (see Ohala, 1984; Signorello, 2014a, 2014b) combined with specific, culturally mediated habits to manifest emotional response in public (Matsumoto, 1990; Signorello, 2014a).
Politicians manipulate vocal acoustic patterns (adapting them to the culture, language, social status, educational background and gender of the voters) to convey specific types of leadership that fulfill everyone’s expectation of what charisma is. But what happens to leaders’ voices when they use different languages to address voters? This study investigates speeches of bilingual politicians to find out how leaders’ vocal acoustics differ when they speak in different languages. It also investigates how those acoustic differences can influence listeners’ perception of the type of leadership and the emotional states aroused by leaders’ voices.
We selected vocal samples from two bilingual American-English/American-Spanish politicians who participated in the 2016 United States presidential primaries: Jeb Bush and Marco Rubio. We chose words with similar vocal characteristics in terms of average vocal pitch, vocal pitch range, and loudness range. We asked listeners to rate the type of charismatic leadership perceived and to assess the emotional states aroused by those voices. Finally, we asked participants how the different vocal patterns would affect their voting preference.
Preliminary statistical analyses show that English words like “terrorism” (voice sample 1) and “security” (voice sample 2), characterized by mid vocal pitch frequencies, wide vocal pitch ranges, and wide loudness ranges, convey an intimidating, arrogant, selfish, aggressive, witty, overbearing, lazy, dishonest, and dull type of charismatic leadership. Listeners from different language and cultural backgrounds also reported these vocal stimuli triggered emotional states like contempt, annoyance, discomfort, irritation, anxiety, anger, boredom, disappointment, and disgust. The listeners who were interviewed considered themselves politically liberal and they responded that they would probably vote for a politician with the vocal characteristics listed above.
Results also show that Spanish words like “terrorismo” (voice sample 3) and “ilegal” (voice sample 4), characterized by an average of mid-low vocal pitch frequencies, mid vocal pitch ranges, and narrow loudness ranges, convey a personable, relatable, kind, caring, humble, enthusiastic, witty, stubborn, extroverted, understanding, but also weak and insecure type of charismatic leadership. Listeners from different language and cultural backgrounds also reported these vocal stimuli triggered emotional states like happiness, amusement, relief, and enjoyment. The listeners who were interviewed considered themselves politically liberal and they responded that they would probably vote for a politician with the vocal characteristics listed above.
Voice is a very dynamic non-verbal behavior used by politicians to persuade the audience and manipulate voting preference. The results of this study show how acoustic differences in voice convey different types of leadership and arouse different emotional states in listeners. The voice samples studied show how speakers Jeb Bush and Marco Rubio adapt their vocal delivery to audiences of different backgrounds. The two politicians voluntarily manipulate their voice parameters while speaking in order to appear as if they were endowed with different leadership qualities. The vocal pattern used in English conveys the threatening and dark side of their charisma, inducing the arousal of negative emotions, which triggers a positive voting preference in listeners. The vocal pattern used in Spanish conveys the charming and caring side of their charisma, inducing the arousal of positive emotions, which triggers a negative voting preference in listeners.
The manipulation of voice arouses emotional states that will induce voters to consider a certain type of leadership as more appealing. Experiencing emotions helps voters to assess the effectiveness of a political leader. If the emotional arousal matches voters’ expectation of how a charismatic leader should make them feel, then voters will help the charismatic speaker to become their leader.
References
Signorello, R. (2014a). La Voix Charismatique : Aspects Psychologiques et Caractéristiques Acoustiques. PhD Thesis. Université de Grenoble, Grenoble, France and Università degli Studi Roma Tre, Rome, Italy.
Signorello, R. (2014b). The biological function of fundamental frequency in leaders’ charismatic voices. The Journal of the Acoustical Society of America 136 (4), 2295-2295.
Ohala, J. (1984). An ethological perspective on common cross-language utilization of F0 of voice. Phonetica, 41(1):1–16.
Puts, D. A., Hodges, C. R., Cárdenas, R. A. and Gaulin, S. J. C. (2007). Men’s voices as dominance signals: vocal fundamental and formant frequencies influence dominance attributions among men. Evolution and Human Behavior, 28(5):340–344.
Lisa Popeil – lisa@popeil.com Voiceworks® 14431 Ventura Blvd #200 Sherman Oaks, CA 91423
Popular version of paper 2aMU4 Presented Tuesday morning, May 24, 2016
The human vocal folds can vibrate in a number of ways, each creating unique sounds used in singing. The two most common vibrational patterns of the vocal folds are known as “chest voice” and “head voice”, with chest voice sounding like speaking or yelling and head voice sounding more flute-like or like screaming on high pitches. In the operatic singing tradition, men sing primarily in chest voice while women sing primarily in their head voice. However, in rock singing, men often emit high screams using their head voice while female rock singers use almost exclusively their chest voice for high notes.
Vocal fold vibrational pattern differences are only a part of the story though, since the shaping of the throat, mouth and nose (the vocal tract) plays a large part in the perception of the final sound. That means that head voice can be made to “sound” like chest voice on high screams using vocal tract shaping, and only the most experienced listener can determine whether the vocal register used was chest or head voice.
Using spectrographic analysis, differences and similarities between operatic and rock singers can be seen. One similarity between the two is the heightened output of a resonance commonly called “ring”. This resonance, when amplified by vocal tract shaping, creates a piercing sound that’s perceived by the listener as extremely loud. The amplified ring harmonics can be seen in the 3,000 Hz band in both the male opera sample and in rock singing samples:
Figure 1. MALE OPERA – HIGH B (B4…494 Hz) CHEST VOICE
Figure 2. MALE ROCK – HIGH E (E5…659 Hz) CHEST VOICE
Figure 3. MALE ROCK – HIGH G (G5…784 Hz) HEAD VOICE
Though each of these three male singers exhibits a unique frequency signature, each one, whether singing in chest or head voice, is using the amplified ring strategy in the 3,000 Hz range to amplify his sound and create excitement.
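To make the link between the sung pitches and the 3,000 Hz region concrete, the short sketch below lists which harmonics (whole-number multiples of the sung pitch) fall near that region for the three notes in the figures. The note frequencies come from the figure labels; the band edges of 2,500 to 3,500 Hz and the function name are assumptions chosen for illustration, since the text does not say how wide the "3,000 Hz band" is.

```python
def harmonics_in_band(f0_hz, low_hz=2500.0, high_hz=3500.0):
    """List (harmonic number, frequency in Hz) pairs of f0 inside the band.

    The default band edges are illustrative assumptions, not values
    taken from the spectrographic analysis described in the text.
    """
    result = []
    n = 1
    while n * f0_hz <= high_hz:
        if n * f0_hz >= low_hz:
            result.append((n, n * f0_hz))
        n += 1
    return result


# Sung pitches taken from the three figure labels.
notes = {"B4": 494.0, "E5": 659.0, "G5": 784.0}
ring_candidates = {name: harmonics_in_band(f0) for name, f0 in notes.items()}
for name, found in ring_candidates.items():
    print(name, found)
```

The higher the sung pitch, the more widely spaced the harmonics are, so fewer of them are available to be boosted in any given band.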
Mackenzie Parrott – mackenzie.lanae@gmail.com John Nix – john.nix@utsa.edu
Popular version of paper 2aMU5, “Listener Ratings of Singer Expressivity in Musical Performance.” Presented Tuesday, May 24, 2016, 10:20-10:35 am, Salon B/C, ASA meeting, Salt Lake City
Vocal fry is the lowest register of the human voice. Its distinct sound is characterized by a low rumble interspersed with uneven popping and crackling. The use of fry as a vocal mannerism is becoming increasingly common in American speech, fueling discussion about the implications of its use and how listeners perceive the speaker [1]. Previous studies have suggested that listeners find vocal fry to be generally unpleasant in women’s speech, but associate it with positive characteristics in men’s speech [2].
As it has become more prevalent, fry has perhaps not surprisingly found its place in many commercial song styles as well. Many singers are implementing fry as a stylistic device at the onset or offset of a sung tone. This can be found very readily in popular musical styles, presumably to impact and amplify the emotion that the performer is attempting to convey.
Researchers at the University of Texas at San Antonio conducted a survey to analyze whether listeners’ ratings of a singer’s expressivity in musical samples in two contemporary commercial styles (pop and country) were affected by the presence of vocal fry, and to see if there was a difference in listener ratings according to the singer’s gender. A male and a female singer recorded musical samples for the study in a noise reduction booth. As can be seen in the table below, the singers were asked to sing most of the musical selections twice, once using vocal fry at phrase onsets, and once without fry, while maintaining the same vocal quality, tempo, dynamics, and stylization. Some samples were presented more than one time in the survey portion of the study to test listener reliability.
| Song | Singer Gender | Vocal Mode |
| --- | --- | --- |
| (Hit Me) Baby One More Time | Female | Fry Only |
| If I Die Young | Female | With and Without Fry |
| National Anthem | Female | With and Without Fry |
| Thinking Out Loud | Male | Without Fry Only |
| Amarillo By Morning | Male | With and Without Fry |
| National Anthem | Male | With and Without Fry |
Across all listener ratings of all the songs, the recordings which included vocal fry were rated as being only slightly more expressive than the recordings which contained no vocal fry. When comparing the use of fry between the male and female singer, there were some differences between the genders. The listeners rated the samples where the female singer used vocal fry higher (i.e., more expressive) than those without fry, which was surprising considering the negative association with women using vocal fry in speech. Conversely, the listeners rated the male samples without fry as being more expressive than those with fry. Part of this preference pattern may also have reflected the individual singers: the male singer was much more experienced with pop styles than the female singer, who is primarily classically trained. The overall expressivity ratings for the male singer were higher than those of the female singer by a statistically significant margin.
There were also trends in listener ratings across the age groups of participants. Younger listeners showed the widest preference gaps, both for the female singer’s performances with fry over those without and for the male singer’s performances without fry over those with. Presumably they are more attuned to the stylistic norms of current pop singers. However, this could also imply a gender bias in younger listeners. The older listener groups gave lower mean expressivity ratings than the younger listener groups did. Since most of the songs that we sampled are fairly recent in production, this may indicate a generational trend in preference. Perhaps listeners rate the style of vocal production that is most similar to what they listened to during their young adult years as the most expressive style of singing. These findings have raised many questions for further studies about vocal fry in pop and country music.
[1] Anderson, R.C., Klofstad, C.A., Mayew, W.J., Venkatachalam, M. “Vocal Fry May Undermine the Success of Young Women in the Labor Market.” PLoS ONE, 2014. 9(5): e97506. doi:10.1371/journal.pone.0097506.
[2] Yuasa, I. P. “Creaky Voice: A New Feminine Voice Quality for Young Urban-Oriented Upwardly Mobile American Women.” American Speech, 2010. 85(3): 315-337.
Human and Intelligent Agent Integration Branch (HIAI) Human Research and Engineering Directorate U.S. Army Research Laboratory Building 520 Aberdeen Proving Ground, MD
Lay language paper 1aPP44, “Speech recognition performance of listeners with normal hearing, sensorineural hearing loss, and sensorineural hearing loss and bothersome tinnitus when using air and bone conduction communication headsets” Presented Monday Morning, May 23, 2016, 8:00 – 12:00, Salon E/F 171st ASA Meeting, Salt Lake City
Military personnel are at high risk for noise-induced hearing loss due to the unprecedented proportion of blast-related acoustic trauma experienced during deployment from high-level impulsive and continuous noise (e.g., transportation vehicles, weaponry, blast exposure). In fact, noise-induced hearing loss is the primary injury of United States Soldiers returning from Afghanistan and Iraq. Ear injuries, including tympanic membrane perforation, hearing loss, and tinnitus, greatly affect a Soldier’s hearing acuity and, as a result, reduce situational awareness and readiness. Hearing protection devices are accessible to military personnel; however, it has been noted that many troops forgo the use of protection, believing it may decrease their responsiveness to their surroundings during combat.
Noise-induced hearing loss is highly associated with tinnitus, the experience of perceiving sound that is not produced by a source outside of the body. Chronic tinnitus causes functional impairment that may lead a tinnitus sufferer to seek help from an audiologist or other healthcare professional. Intervention and management are the only options for those individuals suffering from chronic tinnitus, as there is no cure for this condition. Tinnitus affects every aspect of an individual’s life, including sleep, daily tasks, relaxation, and conversation, to name only a few. In 2011, the United States Government Accountability Office report on noise indicated that tinnitus was the most prevalent service-connected disability. The combination of noise-induced hearing loss and the perception of tinnitus could greatly impact a Soldier’s ability to rapidly and accurately process speech information under high-stress situations.
The prevalence of hearing loss and tinnitus within the military population suggests that Soldier use of hearing protection is extremely important. Integrating hearing protection into reliable communication devices will increase the probability of use among Soldiers. Military communication devices using air and bone conduction provide clear two-way audio communications through a headset and a microphone.
Air conduction headsets offer passive hearing protection from high ambient noise, and talk-through microphones allow the user to engage in face-to-face conversation and hear ambient environmental sounds, preserving situational awareness. Bone-conduction technology utilizes the bone-conduction pathway and presents auditory information differently than air-conduction devices (see Figure 1). Because headsets with bone conduction transducers do not cover the ears, they allow the user to hear the surrounding environment while retaining the option to communicate over a radio network. Worn with or without hearing protection, bone conduction devices are inconspicuous and fit easily under the helmet. Bone conduction communication devices have been used in the past; however, as newer devices have been designed, they have not been widely adopted for military applications.
Figure 1. Air and bone conduction headsets used during the study: a) Invisio X5 dual in-ear headset and X50 control unit and b) Aftershockz Sports 2 headset.
Since many military personnel operate in high noise environments and with some degree of noise-induced hearing damage and/or tinnitus, it is important to understand how speech recognition performance might be altered as a function of headset use. This is an important subject to evaluate as there are two auditory pathways (i.e., air-conduction pathway and bone-conduction pathway) that are responsible for hearing perception. Comparing the differences between the air and bone-conduction devices on different hearing populations will help to describe the overall effects of not only hearing loss, an extremely common disability within the military population, but the effect of tinnitus on situational awareness as well. Additionally, if there are differences between the two types of headsets, this information will help to guide future communication device selection for each type of population (normal hearing [NH] vs. sensorineural hearing loss [SNHL] vs. SNHL with tinnitus).
Based on findings from the literature on speech understanding in noise, communication devices do have a negative effect on speech intelligibility within the military population when noise is present. However, it is uncertain how hearing loss and/or tinnitus affects speech intelligibility and situational awareness in high-level noise environments. This study looked at recognition of words presented over air-conduction (AC) and bone-conduction (BC) headsets in three groups of listeners: those with normal hearing, those with sensorineural hearing loss, and those with sensorineural hearing loss and tinnitus. Three speech-to-noise ratios (SNR = 0, -6, and -12) were created by embedding speech items in pink noise. Overall, performance was marginally but significantly better for the Aftershockz bone conduction headset (Figure 2). As would be expected, performance increases as the speech-to-noise ratio increases (Figure 3).
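For readers curious how speech is embedded in noise at a chosen speech-to-noise ratio, here is a minimal sketch. It assumes the SNR values are in decibels and defined on root-mean-square (RMS) levels, which is the usual convention but is not stated in the text. The function names and the toy sine-wave signals are hypothetical stand-ins; a real stimulus would use a recorded word and pre-generated pink noise.

```python
import math


def rms(signal):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then mix.

    Returns (mixture, scaled_noise). Assumes SNR is measured in dB on
    RMS levels: snr_db = 20 * log10(rms(speech) / rms(scaled_noise)).
    """
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


# Toy signals standing in for a recorded word and pink noise.
speech = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
noise = [0.3 * math.sin(2 * math.pi * 17 * t / 100 + 1.0) for t in range(100)]

for snr_db in (0, -6, -12):  # the three conditions named in the text
    _, scaled = mix_at_snr(speech, noise, snr_db)
    achieved = 20 * math.log10(rms(speech) / rms(scaled))
    print(snr_db, round(achieved, 6))
```

At 0 the speech and noise have equal RMS level; each 6 dB step downward roughly doubles the noise amplitude relative to the speech.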
Figure 2. Mean rationalized arcsine units measured for each of the TCAPS (tactical communication and protective systems) under test.
Figure 3. Mean rationalized arcsine units measured as a function of speech to noise ratio.
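The figures report scores in rationalized arcsine units (RAU) rather than raw percent correct. The sketch below shows the form of the transform usually attributed to Studebaker; whether the study used exactly this form is an assumption, and the function name and the 50-item list length are chosen only for illustration. The transform stretches scores near 0% and 100% so that differences are more comparable across the range, while staying close to percent correct in the middle.

```python
import math


def rau(correct, total):
    """Rationalized arcsine units for `correct` items out of `total`.

    Uses the commonly cited form: theta is the sum of two arcsine terms,
    then a linear rescaling maps mid-range scores close to percent correct.
    """
    theta = (math.asin(math.sqrt(correct / (total + 1)))
             + math.asin(math.sqrt((correct + 1) / (total + 1))))
    return (146 / math.pi) * theta - 23


# Hypothetical scores out of a 50-word list.
for correct in (5, 25, 45):
    print(correct, round(rau(correct, 50), 1))
```

Note that RAU values can fall below 0 or above 100 at the extremes, which is expected behavior for this transform.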
One of the most fascinating things about the data is that although the effect of hearing profile was statistically significant, it was not practically so: the means for the Normal Hearing, Hearing Loss and Tinnitus groups were 65, 61, and 63, respectively (Figure 4). Nor was there any interaction with any of the other variables under test. One might conclude from the data that if the listener can control the level of presentation, the speech-to-noise ratio has about the same effect, regardless of hearing loss. There was no difference in performance with the TCAPS due to one’s hearing profile; however, the Aftershockz headset provided better speech intelligibility for all listeners.
Figure 4. Mean rationalized arcsine units observed as a function of the hearing profile of the listener.