3pSC10 – Does increasing the playback speed of men’s and women’s voices reduce their intelligibility by the same amount?

Eric M. Johnson – eric.martin.johnson@utah.edu
Sarah Hargus Ferguson – sarah.ferguson@hsc.utah.edu

Department of Communication Sciences and Disorders
University of Utah
390 South 1530 East, Room 1201
Salt Lake City, UT 84112

Popular version of poster 3pSC10, “Gender and rate effects on speech intelligibility.”
Presented Wednesday afternoon, May 25, 2016, 1:00, Salon G
171st ASA Meeting, Salt Lake City

Older adults seeking hearing help often report having an especially hard time understanding women’s voices. However, this anecdotal observation doesn’t always agree with the findings from scientific studies. For example, Ferguson (2012) found that male and female talkers were equally intelligible for older adults with hearing loss. Moreover, several studies have found that young people with normal hearing actually understand women’s voices better than men’s voices (e.g. Bradlow et al., 1996; Ferguson, 2004). In contrast, Larsby et al. (2015) found that, when listening in background noise, groups of listeners with and without hearing loss were better at understanding a man’s voice than a woman’s voice. The Larsby et al. data suggest that female speech might be more affected by distortion like background noise than male speech is, which could explain why women’s voices may be harder to understand for some people.

We were interested to see if another type of distortion, speeding up the speech, would have an equal effect on the intelligibility of men and women. Speech that has been sped up (or time-compressed) has been shown to be less intelligible than unprocessed speech (e.g. Gordon-Salant & Friedman, 2011), but no studies have explored whether time compression causes an equal loss of intelligibility for male and female talkers. If an increase in playback speed causes women’s speech to be less intelligible than men’s, it could reveal another possible reason why so many older adults with hearing loss report difficulty understanding women’s voices. To this end, our study tested whether the intelligibility of time-compressed speech decreases for female talkers more than it does for male talkers.

Using 32 listeners with normal hearing, we measured how much the intelligibility of two men and two women went down when the playback speed of their speech was increased by 50%. These four talkers were selected based on their nearly equivalent conversational speaking rates. We used digital recordings of each talker and made two different versions of each sentence they spoke: a normal-speed version and a fast version. The software we used allowed us to speed up the recordings without making them sound high-pitched.

Audio sample 1: A sentence at its original speed.

Audio sample 2: The same sentence sped up to 50% faster than its original speed.
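The two audio samples above illustrate the effect. For readers curious about the processing itself, the sketch below shows one common way to speed speech up by 50% without raising its pitch, using an off-the-shelf phase-vocoder time stretch in Python. The library and file names are illustrative assumptions, not the software used in the study.

# Minimal sketch: time-compress a recording by 50% without changing its pitch.
# Input/output file names are hypothetical.
import librosa
import soundfile as sf

y, sr = librosa.load("sentence_original.wav", sr=None)  # load at native sampling rate

# rate > 1 shortens the signal: 1.5 means playback is 50% faster at the same pitch.
y_fast = librosa.effects.time_stretch(y, rate=1.5)

sf.write("sentence_fast.wav", y_fast, sr)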

All of the sentences were presented to the listeners in background noise. We found that the men and women were essentially equally intelligible when listeners heard the sentences at their original speed. Speeding up the sentences made all of the talkers harder to understand, but the effect was much greater for the female talkers than the male talkers. In other words, there was a significant interaction between talker gender and playback speed. The results suggest that time-compression has a greater negative effect on the intelligibility of female speech than it does on male speech.
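The paper does not spell out the statistical model here, but a standard way to test such a talker-gender by playback-speed interaction is a two-way repeated-measures ANOVA on the listeners' percent-correct scores. The sketch below uses synthetic, clearly made-up numbers purely to show the mechanics; it is not the study's data or analysis code.

# Illustrative only: synthetic scores standing in for per-listener percent correct.
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
rows = []
for listener in range(32):
    for gender in ("male", "female"):
        for rate in ("normal", "fast"):
            # Toy effect sizes: time compression hurts both genders,
            # but the female talkers more (mirroring the reported pattern).
            penalty = 0.0 if rate == "normal" else (30.0 if gender == "female" else 15.0)
            rows.append({"listener": listener, "gender": gender, "rate": rate,
                         "score": 85.0 - penalty + rng.normal(0, 5)})
df = pd.DataFrame(rows)

# The gender:rate term in the resulting table tests the interaction.
print(AnovaRM(df, depvar="score", subject="listener", within=["gender", "rate"]).fit())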


Figure 1: Overall percent correct key-word identification performance for male and female talkers in unprocessed and time-compressed conditions. Error bars indicate 95% confidence intervals.

These results confirm the negative effects of time-compression on speech intelligibility and imply that audiologists should counsel the communication partners of their patients to avoid speaking excessively fast, especially if the patient complains of difficulty understanding women’s voices. This counsel may be even more important for the communication partners of patients who experience particular difficulty understanding speech in noise.

 

  1. Bradlow, A. R., Torretta, G. M., and Pisoni, D. B. (1996). “Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics,” Speech Commun. 20, 255-272.
  2. Ferguson, S. H. (2004). “Talker differences in clear and conversational speech: Vowel intelligibility for normal-hearing listeners,” J. Acoust. Soc. Am. 116, 2365-2373.
  3. Ferguson, S. H. (2012). “Talker differences in clear and conversational speech: Vowel intelligibility for older adults with hearing loss,” J. Speech Lang. Hear. Res. 55, 779-790.
  4. Gordon-Salant, S., and Friedman, S. A. (2011). “Recognition of rapid speech by blind and sighted older adults,” J. Speech Lang. Hear. Res. 54, 622-631.
  5. Larsby, B., Hällgren, M., Nilsson, L., and McAllister, A. (2015). “The influence of female versus male speakers’ voice on speech recognition thresholds in noise: Effects of low- and high-frequency hearing impairment,” Speech Lang. Hear. 18, 83-90.

2aSC7 – Effects of aging on speech breathing

Simone Graetzer, PhD. – sgraetz@msu.edu
Eric J. Hunter, PhD. – ejhunter@msu.edu

Voice Biomechanics and Acoustics Laboratory
Department of Communicative Sciences and Disorders
College of Communication Arts & Sciences
Michigan State University
1026 Red Cedar Road
East Lansing, MI 48824

Popular version of paper 2aSC7, entitled: “A longitudinal study of the effects of aging on speech breathing: Evidence of decreased expiratory volume in speech recordings”
Presented Tuesday morning, May 24, 2016, 8:00 – 11:30 AM, Salon F
171st ASA Meeting, Salt Lake City

Content
Older adults are the fastest-growing segment of the population, and some voice, speech, and breathing disorders occur more frequently as individuals age. For example, lung capacity diminishes in older age due to loss of lung elasticity, which places an upper limit on utterance duration. Further, decreased lung and diaphragm elasticity and muscle strength can occur, and the rib cage can stiffen, leading to reductions in lung pressure and the volume of air that can be expelled by the lungs (‘expiratory volume’). In the laryngeal system, tissues can break down and cartilages can harden, causing more voice breaks, increased hoarseness or harshness, reduced loudness, and pitch changes.

Our study attempted to identify the normal speech and respiratory changes that accompany aging in healthy individuals. Specifically, we examined how long individuals could speak in a single breath group, using a series of speeches from six individuals (three women and three men) recorded over many years (spanning 18 to 49 years per speaker). All speakers had been recorded in similar environments giving long monologue speeches. All but one speaker gave their addresses at a podium using a microphone, and most recordings were longer than 30 minutes. The speakers’ ages ranged from 43 years at the earliest recordings (51 on average) to 98 years at the latest (84 on average). Five-minute samples were extracted from each recording, and for each subject three raters then identified the durations of exhalations during speech in these samples.
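In the study the exhalation durations were marked by human raters. A rough automated approximation nonetheless helps illustrate what a breath group is: a stretch of speech energy bounded by pauses long enough for an inhalation. The sketch below is an assumption-laden stand-in (hypothetical file name, arbitrary thresholds), not the authors' procedure.

# Rough, illustrative breath-group detection from a speech recording.
import numpy as np
import librosa

y, sr = librosa.load("speech_sample.wav", sr=16000)   # hypothetical 5-minute sample

hop = int(0.010 * sr)                                 # 10 ms hop
rms = librosa.feature.rms(y=y, frame_length=int(0.025 * sr), hop_length=hop)[0]
times = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=hop)

voiced = rms > 0.05 * rms.max()   # crude energy threshold for "speech present"
min_pause = 0.25                  # seconds of silence treated as a breath pause

durations, start, last_voiced = [], None, None
for t, v in zip(times, voiced):
    if v:
        if start is None:
            start = t
        last_voiced = t
    elif start is not None and t - last_voiced >= min_pause:
        durations.append(last_voiced - start)   # close the current breath group
        start = None
if start is not None:
    durations.append(last_voiced - start)

print(f"Mean breath-group duration: {np.mean(durations):.2f} s")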

Two figures illustrate how the breath groups changed with age for one of the women (Figure 1) and one of the men (Figure 2). We found a change in speech breathing that might be caused by a less flexible rib cage and the loss of vital capacity and expiratory volume. In males especially, it may also have been caused by poor closure of the vocal folds, resulting in more air leakage during speech. Specifically, we found a decreased breath group duration for all male subjects after 70 years, with overall durations averaging between 1 and 3.5 seconds; the point of change appeared to occur between 60 and 65 years of age. For the female subjects, this change occurred later, between 60 and 70 years, with durations averaging between 1.5 and 3.5 seconds.


Figure 1 For one of the women talkers, speech breath groups were measured and their durations plotted as a function of age. The breath-group durations begin to decrease at about 68 years of age.


Figure 2 For one of the men talkers, speech breath groups were measured and their durations plotted as a function of age. The breath-group durations begin to decrease at about 66 years of age.

The study results indicate decreases in speech breath group duration for most individuals as their age increased (especially from 65 years onwards), consistent with the age-related decline in expiratory volume reported in other studies. Typically, the speech breath group duration of the six subjects decreased from ages 65 to 70 years onwards. There was some variation between individuals in the point at which the durations started to decrease. The decreases indicate that, as they aged, speakers could not sustain the same number of words in a breath group and needed to inhale more frequently while speaking.

Future studies involving more participants may further our understanding of normal age-related changes vs. pathology, but such a corpus of recordings must first be constrained on the basis of communicative intent, venues, knowledge of vocal coaching, and related information.

References
Hunter, E. J., Tanner, K., & Smith, M. E. (2011), Gender differences affecting vocal health of women in vocally demanding careers. Logopedics Phoniatrics Vocology, 36(3), 128-136.

Janssens, J. P., Pache, J. C., and Nicod, L. P. (1999). Physiological changes in respiratory function associated with ageing. European Respiratory Journal, 13, 197–205.

Acknowledgements
We acknowledge the efforts of Amy Kemp, Lauren Glowski, Rebecca Wallington, Allison Woodberg, Andrew Lee, Saisha Johnson, and Carly Miller. Research was in part supported by the National Institute On Deafness And Other Communication Disorders of the National Institutes of Health under Award Number R01DC012315. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

2aBAa5 – Sound Waves Help Assess Bone Condition

Max Denis – denis.max@mayo.edu
507-266-7449

Leighton Wan – wan.leighton@mayo.edu
Matthew Cheong – cheong.matthew@mayo.edu
Mostafa Fatemi – fatemi.mostafa@mayo.edu
Azra Alizad – alizad.azra@mayo.edu
507-254-5970

Mayo Clinic College of Medicine
200 1st St SW
Rochester, MN 55905

Popular version of paper 2aBAa5, “Bone demineralization assessment using acoustic radiation force”
Presented Tuesday morning, May 24, 2016, 9:00 AM in Snowbird/Brighton room
171st ASA Meeting, Salt Lake City, Utah

Assessing skeletal health is important across the lifespan, from newborn infants to the elderly. Each year in the United States, approximately fifty percent of the 550,000 premature newborn infants suffer from bone-metabolism disorders such as osteopenia, which affect bone development into childhood. As we age through adulthood, bone mass declines because of an imbalance in the bone-remodeling process, leading to bone diseases such as osteoporosis and putting a person at risk for fractures of the neck, hip, and forearm.

Current bone assessment tools include dual-energy X-ray absorptiometry (DEXA) and quantitative ultrasound (QUS). DEXA is the leading clinical tool for assessing bone quality and can detect small changes in bone mineral content and density. However, DEXA uses ionizing radiation for imaging, exposing patients to low radiation doses. This can be problematic when frequent clinical visits are needed to monitor the efficacy of prescribed medications and therapies.

QUS has been sought as a nonionizing, noninvasive alternative to DEXA. It measures ultrasonic waves traveling between a transmitting and a receiving transducer aligned in parallel along the bone surface, and the speed of sound (SOS) of the received signal is used to characterize the bone’s material properties. However, the SOS estimate is sensitive to the amount of soft tissue between the skin surface and the bone. We therefore propose using a high-intensity ultrasonic wave known as a “push beam” to exert a force on the bone surface and set it vibrating, which minimizes the influence of the soft tissue. The sound waves radiated by these vibrations are captured and used to analyze the bone’s mechanical properties.
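As a concrete illustration of the SOS idea, the sketch below estimates the speed of sound from the arrival-time difference of a wave recorded at two receive positions a known distance apart, using the peak of their cross-correlation. All numbers and signals here are placeholders for illustration, not the authors' data or processing chain.

# Illustrative SOS estimate from a time delay between two receive positions.
import numpy as np
from scipy.signal import correlate

fs = 5_000_000   # sampling rate in Hz (assumed)
distance = 0.02  # separation between the two positions in meters (assumed)

# Placeholder signals: sig_b is sig_a delayed by 27 samples plus a little noise.
sig_a = np.random.randn(4096)
sig_b = np.roll(sig_a, 27) + 0.05 * np.random.randn(4096)

corr = correlate(sig_b, sig_a, mode="full")
lag = np.argmax(corr) - (len(sig_a) - 1)   # delay in samples
delay = lag / fs                           # delay in seconds

print(f"Estimated speed of sound: {distance / delay:.0f} m/s")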

This work demonstrates the feasibility of evaluating bone mechanical properties from sound waves due to bone vibrations. Under an approved protocol by the Mayo Clinic Institutional Review Board (IRB), human volunteers were recruited to undergo our noninvasive bone assessment technique. Our cohort consisted of clinically confirmed osteopenia and osteoporosis patients, as well as normal volunteers without a history of bone fractures. An ultrasound probe and hydrophone were placed along the volunteers’ tibia bone (Figure 1a). A B-mode ultrasound was used to guide the placement of our push beam focal point onto the bone surface underneath the skin layer (Figure 1b). The SOS was obtained from the measurements.


Figure 1. (a) Probe and hydrophone alignment along the tibia bone. (b) Diagram of an image-guided push beam focal point excitation on the bone surface.

In total, 14 volunteers have been recruited in our ongoing study. A boxplot comparison of SOS between the normal and bone-diseased (osteopenic and osteoporotic) volunteers in Figure 2 shows that sound typically travels faster in healthy bone than in osteopenic or osteoporotic bone, with median SOS values (red lines) of 3733 m/s and 2566 m/s, respectively. Hence, our technique may be useful as a noninvasive method for monitoring the skeletal health of premature infants and the aging population.


Figure 2. Speed-of-sound comparison between normal and bone-diseased volunteers.

This ongoing project is being done under an approved protocol by Mayo Institutional Review Board.

2pAAa10 – Turn around when you’re talking to me!

Jennifer Whiting – jkwhiting@physics.byu.edu
Timothy Leishman, PhD – tim_leishman@physics.byu.edu
K.J. Bodon – joshuabodon@gmail.com

Brigham Young University
N283 Eyring Science Center
Provo, UT 84602

Popular version of paper 2pAAa10, “High-resolution measurements of speech directivity”
Presented Tuesday afternoon, November 3, 2015, 4:40 PM, Grand Ballroom 3
170th ASA Meeting, Jacksonville

Introduction
Most sources of sound do not radiate equally in all directions, and the human voice is no exception. How strongly sound is radiated in a given direction at a specific frequency, or pitch, is called directivity. While many researchers have studied the directivity of speaking and singing voices, some important details are missing. The research reported in this presentation measured the directivity of live speech at higher angular and frequency resolutions than previous studies, in an effort to capture those missing details.

Measurement methods
The approach uses a semicircular array of 37 microphones spaced at five-degree polar-angle increments (see Figure 1). A subject sits on a computer-controlled rotating chair with his or her mouth aligned with the axis of rotation and the circular center of the microphone array. He or she repeats a series of phonetically balanced sentences at each of 72 five-degree azimuthal-angle increments. This results in 2,522 measurement points on a sphere around the subject.

[MISSING Figure 1. A subject and the measurement array]
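As a sanity check on that number, the count works out if the two microphones at the ends of the arc (on the rotation axis) are counted only once, since rotating the chair does not move them: 35 x 72 + 2 = 2,522. This reading of the geometry is our assumption; the snippet below just verifies the bookkeeping using the five-degree spacings described above.

# Counting the unique measurement points: 37 polar angles x 72 rotations,
# with the two on-axis microphones contributing one point each.
import numpy as np

polar = np.arange(0, 181, 5)     # 37 microphone positions, 0-180 degrees
azimuth = np.arange(0, 360, 5)   # 72 chair rotations, 5-degree steps

points = {(theta, 0 if theta in (0, 180) else phi) for theta in polar for phi in azimuth}
print(len(points))               # 2522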

Analysis
The measurements are based on audio recordings of the subject, who tries to repeat the sentences with exactly the same timing and inflection at each rotation. To account for the inevitable differences between repetitions, a transfer function and the coherence between a reference microphone near the subject and a measurement microphone on the semicircular array are computed. The coherence is used to assess the quality of each measurement, and the transfer functions across all measurement points make up the directivity. To visualize the results, each measurement is plotted on a sphere, where the color and radius indicate how strongly sound is radiated in that direction at a given frequency. Animations of these spherical plots show how the directivity changes with frequency.
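For those interested in the signal processing, transfer-function and coherence estimates of this kind can be computed from Welch-averaged spectra, as in the sketch below. The sampling rate, segment length, and placeholder signals are assumptions for illustration, not the authors' actual analysis code.

# Minimal sketch: transfer function (H1 estimate) and coherence between a
# reference microphone and one array microphone.
import numpy as np
from scipy.signal import csd, welch, coherence

fs = 48000                     # assumed sampling rate
ref = np.random.randn(5 * fs)  # placeholder for the reference-mic recording
mic = np.random.randn(5 * fs)  # placeholder for one array-mic recording

nperseg = 4096
f, S_ref = welch(ref, fs=fs, nperseg=nperseg)            # reference auto-spectrum
_, S_cross = csd(ref, mic, fs=fs, nperseg=nperseg)       # cross-spectrum
H = S_cross / S_ref                                      # transfer function; |H| gives directivity magnitude
_, gamma2 = coherence(ref, mic, fs=fs, nperseg=nperseg)  # coherence, 0 to 1

# Low coherence flags frequencies where the estimate is unreliable (e.g., poor SNR).
print(f"Median coherence: {np.median(gamma2):.2f}")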

[MISSING Figure 2. Balloon plot for male speech directivity at 500 and 1000 Hz.]
[MISSING Figure 3. Balloon plot for female speech directivity at 500 and 1000 Hz.]
[MISSING Animation 1. Male Speech Directivity, animated]
[MISSING Animation 2. Female Speech Directivity, animated]

Results and Conclusions
Some unique results are visible in the animations. Most importantly, as frequency increases, one can see that most of the sound is radiated in the forward direction. This is one reason why it’s hard to hear someone talking in the front of a car when you’re sitting in the back, unless they turn around to talk to you. One can also see that as frequency increases, and most of the sound radiates forward, there is poor coherence in the back area. This doesn’t necessarily indicate a poor measurement, just a poor signal-to-noise ratio, since there is little sound energy in that direction. It’s also interesting to see that the polar angle of the strongest radiation changes with frequency. At some frequencies the sound is radiated strongly downward and to the sides, but at other frequencies the sound is radiated strongly upward and forward. Male and female directivities are similar in shape but at different frequencies, since male and female fundamental frequencies differ considerably.

A more complete understanding of speech directivity would benefit several industries. For example, hearing aid companies can use speech directivity patterns to decide where to aim the microphones in hearing aids so they pick up the clearest speech for a wearer having a conversation. Microphone placement in cell phones can be adjusted to capture a clearer signal from the person talking into the phone. The theater and audio industries can use directivity patterns to help position actors on stage or to place microphones where they record the most spectrally rich speech. The scientific community can develop more complete models of human speech based on these measurements. Further study will allow researchers to refine the measurement and analysis techniques, understand the results more fully, and generalize them to other speech containing phonemes similar to those measured here.

2aSCb3 – How would you sketch a sound with your hands?

Hugo Scurto – Hugo.Scurto@ircam.fr
Guillaume Lemaitre – Guillaume.Lemaitre@ircam.fr
Jules Françoise – Jules.Francoise@ircam.fr
Patrick Susini – Patrick.Susini@ircam.fr
Frédéric Bevilacqua – Frederic.Bevilacqua@ircam.fr
Ircam
1 place Igor Stravinsky
75004 Paris, France

Popular version of paper 2aSCb3, “Combining gestures and vocalizations to imitate sounds”
Presented Tuesday morning, November 3, 2015, 10:30 AM in Grand Ballroom 8
170th ASA Meeting, Jacksonville


Figure 1. A person hears the sound of a door squeaking and imitates it with vocalizations and gestures. Can the other person understand what he means?

Have you ever listened to an old Car Talk show? Here is what it sounded like on NPR back in 2010:

“So, when you start it up, what kind of noises does it make?
– It just rattles around for about a minute. […]
– Just like budublu-budublu-budublu?
– Yeah! It’s definitely bouncing off something, and then it stops”

As the example illustrates, it is often very complicated to describe a sound with words, but it is really easy to imitate it with our built-in sound-making system: the voice! In fact, we have observed in earlier work that this is exactly what people do: when we ask a person to communicate a sound to another person, she will very quickly try to recreate the noise with her voice – and also use a lot of gestures.

And this works! Communicating sounds with voice and gesture is much more effective than describing them with words and sentences. Imitations of sounds are fun, expressive, spontaneous, widespread in human communication, and very effective. These non-linguistic vocal utterances have been little studied, but nevertheless have the potential to provide researchers with new insights into several important questions in domains such as articulatory phonetics and auditory cognition.

The study we are presenting at this ASA meeting is part of a larger European project on how people imitate sounds with voice and gestures: SkAT-VG (“Sketching Audio Technologies with Voice and Gestures”, http://www.skatvg.eu). How do people produce vocal imitations (phonetics)? What are imitations made of (acoustics and gesture analysis)? How do other people interpret them (psychology)? The ultimate goal is to create “sketching” tools for sound designers (the people who create the sounds of everyday products). If you are an architect and want to sketch a house, you can simply draw it on a sketchpad. But what do you do if you are a sound designer and want to rapidly sketch the sound of a new motorbike? Well, all that is available today are cumbersome pieces of software. Instead, the SkAT-VG project aims to offer sound designers new tools that are as intuitive as a sketchpad: they simply use their voice and gestures to control complex sound-design tools. To that end, the SkAT-VG project also conducts research in machine learning and sound synthesis and studies how sound designers work.

Here at the ASA meeting, we are presenting one part of this work, in which we asked: “What do people use gestures for when they imitate a sound?” People use a lot of gestures, but we do not know what information these gestures convey: Are they redundant with the voice? Do they convey specific pieces of information that the voice cannot represent?

We first collected a huge database of vocal and gestural imitations: we asked 50 participants to come to our lab and make vocal and gestural imitations for several hours. We recorded their voices, filmed them with a high-speed camera, and used a depth camera and accelerometers to measure their gestures. This resulted in a database of about 8,000 imitations, an unprecedented amount of material that now allows us to study in detail how people combine voice and gestures when imitating sounds.

We first analyzed the database qualitatively, by watching and annotating the videos. From this analysis, several hypotheses about the combination of gestures and vocalizations were drawn. Then, to test these hypotheses, we asked 20 participants to imitate 25 specially synthesized sounds with their voice and gestures.

The results showed a quantitative advantage of voice over gesture for communicating rhythmic information: the voice can accurately reproduce faster tempos than gestures can, and it is more precise when reproducing complex rhythmic patterns. We also found that people often use gestures in a metaphorical way, whereas the voice reproduces some acoustic features of the sound. For instance, people shake their hands very rapidly whenever a sound is stable and noisy. This type of gesture does not really follow a feature of the sound: it simply means that the sound is noisy.

Overall, our study reveals the metaphorical function of gestures during sound imitation. Rather than following an acoustic characteristic, gestures expressively emphasize the vocalization and signal the most salient features. These results will inform the specifications of the SkAT-VG tools and make the tools more intuitive.