4aSCb8 – How do kids communicate in challenging conditions? – Valerie Hazan

Kids learn to speak fluently at a young age, and we expect young teenagers to communicate as effectively as adults. However, researchers are increasingly realizing that certain aspects of speech communication follow a slower developmental path. For example, as adults we are highly skilled at adapting the way that we speak to the needs of the situation. When we are speaking a predictable message in good listening conditions, we can relax our articulation and expend less effort. However, in poor listening conditions, or when transmitting new information, we increase the effort that we make to enunciate speech clearly in order to be more easily understood.

In our project, we investigated whether 9 to 14 year olds (divided into three age bands) were able to make such skilled adaptations when speaking in challenging conditions. We recorded 96 pairs of friends of the same age and gender while they carried out a simple picture-based ‘spot the difference’ game (See Figure 1).
Figure 1: one of the picture pairs in the DiapixUK ‘spot the difference’ task.

The two friends were seated in different rooms and spoke to each other via headphones; they had to find 12 differences between their two pictures without seeing each other or the other picture. In the ‘easy communication’ condition, both friends could hear each other normally, while in the ‘difficult communication’ condition, we made it difficult for one of the friends (‘Speaker B’) to hear the other by heavily distorting the speech of ‘Speaker A’ using a vocoder (See Figure 2 and sound demos 1 and 2). Both kids had received some training in understanding this type of distorted speech. We investigated what adaptations Speaker A, who was hearing normally, made in order to be understood by the friend with ‘impaired’ hearing, so that they could complete the task successfully.
Figure 2: The recording set up for the ‘easy communication’ (NB) and ‘difficult communication’ (VOC) conditions.
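The vocoder distortion applied to Speaker A's speech can be illustrated in miniature. Below is a crude noise-vocoder sketch in Python with NumPy: the signal is split into a few frequency bands, each band's amplitude envelope is extracted, and the envelopes are re-imposed on band-limited noise. The paper does not specify the vocoder settings actually used, so the band count, band edges, and smoothing window here are purely illustrative.

```python
import numpy as np

def noise_vocode(x, fs, n_bands=4, fmin=100.0, fmax=4000.0):
    """Crude noise vocoder: split the signal into log-spaced bands,
    extract each band's amplitude envelope, and use it to modulate
    band-limited noise. Fewer bands means more degraded speech."""
    edges = np.geomspace(fmin, fmax, n_bands + 1)
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    noise = np.random.default_rng(0).standard_normal(len(x))
    N = np.fft.rfft(noise)
    win = max(1, int(0.01 * fs))          # ~10 ms envelope smoothing
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        # band-limit the speech via a brick-wall FFT mask
        band = np.fft.irfft(np.where(mask, X, 0), n=len(x))
        # envelope: rectify, then moving-average smooth
        env = np.convolve(np.abs(band), np.ones(win) / win, mode="same")
        # same band of noise, modulated by the speech envelope
        carrier = np.fft.irfft(np.where(mask, N, 0), n=len(x))
        out += env * carrier
    return out
```

The result keeps the slow amplitude fluctuations of the original speech in each band but discards the fine spectral detail, which is why it is intelligible only with effort and some training.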

Sound 1: Here, you will hear an excerpt from the diapix task between two 10-year-olds in the ‘difficult communication’ condition, from the viewpoint of the talker hearing normally. Hear how she attempts to clarify her speech when her friend has difficulty understanding her.

Sound 2: Here, you will hear the same excerpt but from the viewpoint of the talker hearing the heavily degraded (vocoded) speech. Even though you will find this speech very difficult to understand, even 10-year-olds get better at perceiving it after a bit of training. However, they still have difficulty understanding what is being said, which forces their friend to make a greater effort to communicate.

We looked at the time it took to find the differences between the pictures as a measure of communication efficiency. We also carried out analyses of the acoustic aspects of the speech to see how these varied when communication was easy or difficult.
We found that when communication was easy, the child groups did not differ from adults in the average time that it took to find a difference in the picture, showing that 9 to 14 year olds were communicating as efficiently as adults. When the speech of Speaker A was heavily distorted, all groups took longer to do the task, but only the 9-10 year old group took significantly longer than adults (See Figure 3). The additional problems experienced by younger kids are likely to be due both to greater difficulty for Speaker B in understanding degraded speech and to Speaker A being less skilled at compensating for these difficulties. The results obtained for children aged 11 and older suggest that they were using good strategies to compensate for the difficulties imposed on the communication.
Figure 3: Average time taken to find one difference in the picture task. The four talker groups do not differ when communication is easy (blue bars); in the ‘difficult communication’ condition (green bars), the 9-10 years olds take significantly longer than the adults but the other child groups do not.

In terms of the acoustic characteristics of their speech, the 9 to 14 year olds differed in certain aspects from adults in the ‘easy communication’ condition. All child groups produced more distinct vowels and used a higher pitch than adults; kids younger than 11-12 also spoke more slowly and more loudly than adults. They hadn’t learnt to ‘reduce’ their speaking effort in the way that adults would do when communication was easy. When communication was made difficult, the 9 to 14 year olds were able to make adaptations to their speech for the benefit of their friend hearing the distorted speech, even though they themselves were having no hearing difficulties. For example, they spoke more slowly (See Figure 4) and more loudly. However, some of these adaptations differed from those produced by adults.
Figure 4: Speaking rate changes with age and communication difficulty. 9-10 year olds spoke more slowly than adults in the ‘easy communication’ condition (blue bars). All speaker groups slowed down their speech as a strategy to help their friend understand them in the ‘difficult communication’ (vocoder) condition (green bars).

Overall, therefore, even in the second decade of life, there are changes taking place in the conversational speech produced by young people. Some of these changes are due to physiological reasons such as growth of the vocal apparatus, but increasing experience with speech communication and cognitive developments occurring in this period also play a part.

Younger kids may experience greater difficulty than adults when communicating in difficult conditions and even though they can make adaptations to their speech, they may not be as skilled at compensating for these difficulties. This has implications for communication within school environments, where noise is often an issue, and for communication with peers with hearing or language impairments.


Valerie Hazan – v.hazan@ucl.ac.uk
University College London (UCL)
Speech Hearing and Phonetic Sciences
Gower Street, London WC1E 6BT, UK

Michèle Pettinato – Michele.Pettinato@uantwerpen.be
University College London (UCL)
Speech Hearing and Phonetic Sciences
Gower Street, London WC1E 6BT, UK

Outi Tuomainen – o.tuomainen@ucl.ac.uk
University College London (UCL)
Speech Hearing and Phonetic Sciences
Gower Street, London WC1E 6BT, UK
Sonia Granlund – s.granlund@ucl.ac.uk
University College London (UCL)
Speech Hearing and Phonetic Sciences
Gower Street, London WC1E 6BT, UK
Popular version of paper 4aSCb8
Presented Thursday morning, October 30, 2014
168th ASA Meeting, Indianapolis

4aSCb16 – How your genes may help you learn another language – Han-Gyol Yi

For many decades, speech scientists have marveled at the complexity of speech sounds. In English, the relatively simple task of distinguishing “bat” from “pat” can involve as many as 16 different sound cues. Also, English vowels are pronounced so differently across speakers that one person’s “Dan” can sound like another’s “done”. Despite all this, most adult native English speakers are able to understand English speech sounds rapidly, effortlessly, and accurately. In contrast, learning a new language is not an easy task, partly because the characteristics of foreign speech sounds are unfamiliar to us. For instance, Mandarin Chinese is a tonal language, which means that the pitch pattern used to produce each syllable can change the meaning of the word. Therefore, the word “ma” can mean “mother”, “hemp”, “horse”, or “to scold,” depending on whether the word is produced with a flat, rising, dipping, or falling pitch pattern. It is no surprise that many native English speakers struggle to learn Mandarin Chinese. At the same time, some seem to master these new speech sounds with relative ease. With our research, we seek to discover the neural and genetic bases of this individual variability in language learning success. In this paper, we focus on genes that affect activity in two distinct neural regions: the prefrontal cortex and the striatum.

Recent advances in speech science research strongly suggest that for adults, learning speech sounds for the first time is a cognitively challenging task. What this means is that every time you hear a new speech sound, a region of your brain called the prefrontal cortex – the part of the cerebral cortex that sits right under your forehead – must do extra work to extract relevant sound patterns and parse them according to learned rules. Such activity in the prefrontal cortex is driven by dopamine, which is one of the many chemicals that the cells in your brain use to communicate with each other. In general, higher dopamine activity in the prefrontal cortex means better performance in complex and difficult tasks.

Interestingly, there is a well-studied gene called COMT that affects the dopamine activity level in the prefrontal cortex. Everybody has a COMT gene, although with different subtypes. Individuals with a subtype of the COMT gene that promotes dopamine activity perform hard tasks better than do those with other subtypes. In our study, we found that the native English speakers with the dopamine-promoting subtype of the COMT gene (40 out of 169 participants) learned Mandarin Chinese speech sounds better than those with different subtypes. This means that, by assessing your COMT gene profile, you might be able to predict how well you will learn a new language.

However, this is only half the story. While new learners may initially use their prefrontal cortex to discern foreign speech sound contrasts, expert learners are less likely to do so. As with any other skill, speech perception becomes more rapid, effortless, and accurate with practice. At this stage, your brain can bypass all that burdensome cognitive reasoning in the prefrontal cortex. Instead, it can use the striatum – a deep structure within the brain – to directly decode the speech sounds. We find that the striatum is more active for expert learners of new speech sounds. Furthermore, individuals with a subtype of a gene called FOXP2 that promotes flexibility of the striatum to new experiences (31 out of 204 participants) were found to learn Mandarin Chinese speech sounds better than those with other subtypes.

Our research suggests that learning speech sounds in a foreign language involves multiple neural regions, and that genetic variations which affect the activity within those regions lead to better or worse learning. In other words, your genetic framework may be contributing to how well you learn to understand a new language. What we do not know at this point is how these variables interact with other sources of variability, such as prior experience. Previous studies have shown that extensive musical training, for example, can enhance learning speech sounds of a foreign language. We are a long way from cracking the code of how the brain, a highly complex organ, functions. We hope that a neurocognitive genetic approach may help bridge the gap between biology and language.


Han-Gyol Yi – gyol@utexas.edu
W. Todd Maddox – maddox@psy.utexas.edu
The University of Texas at Austin
2504A Whitis Ave. (A1100)
Austin, TX 78712

Valerie S. Knopik – valerie_knopik@brown.edu
Rhode Island Hospital
593 Eddy Street
Providence, RI 02093

John E. McGeary – john_mcgeary@brown.edu
Providence Veterans Affairs Medical Center
830 Chalkstone Avenue
Providence, RI 02098

Bharath Chandrasekaran – bchandra@utexas.edu
The University of Texas at Austin
2504A Whitis Ave. (A1100)
Austin, TX 78712

Popular version of paper 4aSCb16
Presented Thursday morning, October 30, 2014
168th ASA Meeting, Indianapolis


4pAAa12 – Hearing voices in the high frequencies: What your cell phone isn’t telling you – Brian B. Monson

Ever noticed how, or wondered why, people sound different on your cell phone than in person? You might already know that this is because a cell phone doesn’t transmit all of the sounds that the human voice creates. Specifically, cell phones don’t transmit very low-frequency sounds (below about 300 Hz) or high-frequency sounds (above about 3,400 Hz). The voice can and typically does make sounds at very high frequencies in the “treble” audio range (from about 6,000 Hz up to 20,000 Hz) in the form of vocal overtones and noise from consonants. Your cell phone cuts all of this out, however, leaving it up to your brain to “fill in” if you need it.



Figure 1. A spectrogram showing acoustical energy up to 20,000 Hz (on a logarithmic axis) created by a male human voice. The current cell phone bandwidth (dotted line) only transmits sounds between about 300 and 3400 Hz. High-frequency energy (HFE) above 6000 Hz (solid line) has information potentially useful to the brain when perceiving singing and speech.


What are you missing out on? One way to answer this question is to have individuals listen to only the high frequencies and report what they hear. We can do this using conventional signal processing methods: cut out everything below 6,000 Hz, thereby transmitting only the sounds above 6,000 Hz to the listener’s ear. When we do this, some listeners hear only chirps and whistles, but most normal-hearing listeners report hearing voices in the high frequencies. Strangely, some voices are very easy to hear out in the high frequencies, while others are quite difficult. The reason for this difference is not yet clear. You might experience this phenomenon if you listen to the following clips of high frequencies from several different voices. (You’ll need a good set of high-fidelity headphones or speakers to ensure you’re getting the high frequencies.)
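Stimuli of this kind can be approximated in a few lines of signal processing. The sketch below uses a brick-wall FFT high-pass filter at the 6,000 Hz cutoff mentioned above; the authors' actual filter design is not described here, so treat this as an illustrative stand-in rather than their method.

```python
import numpy as np

def highpass_6k(x, fs, cutoff=6000.0):
    """Zero out all spectral content below `cutoff` Hz, leaving only
    the high-frequency band presented to listeners."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    X[freqs < cutoff] = 0.0               # brick-wall high-pass
    return np.fft.irfft(X, n=len(x))
```

Played back, a signal filtered this way contains none of the "telephone band" energy at all, yet, as the clips demonstrate, listeners can still often identify voices, words, and even singing.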



Until recently, these treble frequencies were only thought to affect some aspects of voice quality or timbre. If you try playing with the treble knob on your sound system you’ll probably notice the change in quality. We now know, however, that it’s more than just quality (see Monson et al., 2014). In fact, the high frequencies carry a surprising amount of information about a vocal sound. For example, could you tell the gender of the voices you heard in the examples? Could you tell whether they were talking or singing? Could you tell what they were saying or singing? (Hint: the words are lyrics to a familiar song.) Most of our listeners could accurately report all of these things, even when we added noise to the recordings.



Figure 2. A frequency spectrum (on a linear axis) showing the energy in the high frequencies combined with speech-shaped low-frequency noise.



What does this all mean? Cell phone and hearing aid technology is now attempting to include transmission of the high frequencies. It is tempting to speculate how inclusion of the high frequencies in cell phones, hearing aids, and even cochlear implants might benefit listeners. Lack of high-frequency information might be why we sometimes experience difficulty understanding someone on our phones, especially when sitting on a noisy bus or at a cocktail party. High frequencies might be of most benefit to children who tend to have better high-frequency hearing than adults. And what about quality? High frequencies certainly play a role in determining voice quality, which means vocalists and sound engineers might want to know the optimal amount of high-frequency energy for the right aesthetic. Some voices naturally produce higher amounts of high-frequency energy, and this might contribute to how well you like that voice. These possibilities give rise to many research questions we hope to pursue in our study of the high frequencies.




Monson, B. B., Hunter, E. J., Lotto, A. J., and Story, B. H. (2014). “The perceptual significance of high-frequency energy in the human voice,” Frontiers in Psychology, 5, 587, doi: 10.3389/fpsyg.2014.00587.



Brian B. Monson – bmonson@research.bwh.harvard.edu

Department of Pediatric Newborn Medicine

Brigham and Women’s Hospital

Harvard Medical School

75 Francis St

Boston, MA 02115


Popular version of paper 4pAAa12

Presented Thursday afternoon, October 30, 2014

168th ASA Meeting, Indianapolis



2aSC8 – Some people are eager to be heard: anticipatory posturing in speech production – Sam Tilsen

Consider a common scenario in a conversation: your friend is in the middle of asking you a question, and you already know the answer. To be polite, you wait to respond until your friend finishes the question. But what are you doing while you are waiting?

You might think that you are passively waiting for your turn to speak, but the results of this study suggest that you may be more impatient than you think. In analogous circumstances recreated experimentally, speakers move their vocal organs—i.e. their tongues, lips, and jaw—to positions that are appropriate for the sounds that they intend to produce in the near future. Instead of waiting passively for their turn to speak, they are actively preparing to respond.

To examine how speakers control their vocal organs prior to speaking, this study used real-time magnetic resonance imaging of the vocal tract. This recently developed technology takes a picture of the tissue in the middle of the vocal tract, much like an x-ray, about 200 times every second. This allows for measurement of rapid changes in the positions of the vocal organs before, during, and after people speak.

A video is available online (http://youtu.be/h2_NFsprEF0).

To understand how changes in the positions of vocal organs are related to different speech sounds, it is helpful to think of your mouth and throat as a single tube, with your lips at one end and the vocal folds at the other. When your vocal folds vibrate, they create sound waves that resonate in this tube. By using your lips and tongue to make closures or constrictions in the tube, you can change the frequencies of the resonating sound waves. You can also use an organ called the velum to control whether sound resonates in your nasal cavity. These relations between vocal tract postures and sounds provide a basis for extracting articulatory features from images of the vocal tract. For example, to make a “p” sound you close your lips, to make an “m” sound you close your lips and lower your velum, and to make a “t” sound you press the tip of the tongue against the roof of your mouth.
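The tube analogy can be made quantitative. A uniform tube closed at one end (the vibrating vocal folds) and open at the other (the lips) resonates at odd multiples of c/4L. This is a textbook simplification, not the study's model: a real vocal tract is far from uniform, so the numbers below are only ballpark figures.

```python
def tube_resonances(length_m=0.17, n=3, c=343.0):
    """Resonant frequencies (Hz) of a closed-open tube of length L:
    f_k = (2k - 1) * c / (4 * L), for k = 1, 2, ..., n."""
    return [(2 * k - 1) * c / (4 * length_m) for k in range(1, n + 1)]
```

For a 17 cm vocal tract this gives roughly 500, 1500, and 2500 Hz, close to the formants of a neutral vowel; constricting the tube with the tongue or lips shifts these resonances, which is precisely what distinguishes one vowel from another.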
Participants in this study produced simple syllables with a consonant and vowel (such as “pa” and “na”) in several different conditions. In one condition, speakers knew ahead of time what syllable to produce, so that they could prepare their vocal tract specifically for the response. In another condition, they produced the syllable immediately without any time for response-specific preparation. The experiment also manipulated whether speakers were free to position their vocal organs however they wanted before responding, or whether they were constrained by the requirement to produce the vowel “ee” before their response.
All of the participants in the study adopted a generic “speech-ready” posture prior to making a response, but only some of them adjusted this posture specifically for the upcoming response. This response-specific anticipation only occurred when speakers knew ahead of time exactly what response to produce. Some examples of anticipatory posturing are shown in the figures below.

Figure 2. Examples of anticipatory postures for “p” and “t” sounds. The lips are closer together in anticipation of “p” and the tongue tip is raised in anticipation of “t”.
Figure 3. Examples of anticipatory postures for “p” and “m” sounds. The velum is raised in anticipation of “p” and lowered in anticipation of “m”.
The surprising finding of this study was that only some speakers anticipatorily postured their vocal tracts in a response-specific way, and that speakers differed greatly in which vocal organs they used for this purpose. Furthermore, some of the anticipatory posturing that was observed facilitates production of an upcoming consonant, while other anticipatory posturing facilitates production of an upcoming vowel. The figure below summarizes these results.
Figure 4. Summary of anticipatory posturing effects, after controlling for generic speech-ready postures.
Why do some people anticipate vocal responses while others do not? Unfortunately, we don’t know: the finding that different speakers use different vocal organs to anticipate different sounds in an upcoming utterance is challenging to explain with current models of speech production. Future research will need to investigate the mechanisms that give rise to anticipatory posturing and the sources of variation across speakers.


Sam Tilsen – tilsen@cornell.edu
Peter Doerschuk – pd83@cornell.edu
Wenming Luh – wl358@cornell.edu
Robin Karlin – rpk83@cornell.edu
Hao Yi – hy433@cornell.edu
Cornell University
Ithaca, NY 14850

Pascal Spincemaille – pas2018@med.cornell.edu
Bo Xu – box2001@med.cornell.edu
Yi Wang – yiwang@med.cornell.edu
Weill Medical College
New York, NY 10065

Popular version of paper 2aSC8
Presented Tuesday morning, October 28, 2014
168th ASA Meeting, Indianapolis

4pAAa1 – Auditory Illusions of Supernatural Spirits: Archaeological Evidence and Experimental Results – Steven J. Waller

Introduction: Auditory illusions
The ear can be tricked by ambiguous sounds, just as the eye can be fooled by optical illusions. Sound reflection, whisper galleries, reverberation, ricochets, and interference patterns were perceived in the past as eerie sounds attributed to invisible echo spirits, thunder gods, ghosts, and sound-absorbing bodies. These beliefs in the supernatural were recorded in ancient myths, and expressed in tangible archaeological evidence as canyon petroglyphs, cave paintings, and megalithic stone circles including Stonehenge. Controlled experiments demonstrate that certain ambiguous sounds cause blindfolded listeners to believe in the presence of phantom objects.


Figure 1. This prehistoric pictograph of a ghostly figure in Utah’s Horseshoe Canyon will answer you back.


1. Echoes = Answers from Echo Spirits (relevant to canyon petroglyphs)
Voices coming out of solid rock gave our ancestors the impression of echo spirits calling out from the rocks. Just as light reflection in a mirror gives an illusion of yourself duplicated as a virtual image, sound waves reflecting off a surface are mathematically identical to sound waves emanating from a virtual sound source behind a reflecting plane such as a large cliff face. This can result in an auditory illusion of somebody answering you from deep within the rock. It struck me that canyon petroglyphs might have been made in response to hearing echoes and believing that the echo spirits dwelt in rocky places. Ancient myths contain descriptions of echo spirits that match prehistoric petroglyphs, including witches that hide in sheep bellies and snakeskins. My acoustic measurements have shown that the artists chose to place their art precisely where they could hear the strongest echoes.
Listen to an echo at a rock art site in the Grand Canyon (click here).
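The virtual-source geometry has a simple consequence: a caller standing a distance d from a flat cliff hears their voice return after the round trip 2d/c. A minimal sketch (assuming a speed of sound of 343 m/s; the distances are illustrative, not measurements from any site):

```python
def echo_delay(distance_m, c=343.0):
    """Round-trip delay (seconds) of an echo from a reflecting surface
    `distance_m` away -- equivalently, the delay from a virtual source
    mirrored the same distance behind the surface."""
    return 2.0 * distance_m / c
```

A cliff about 171.5 m away "answers" a full second later; at around 17 m the reply comes back in about a tenth of a second, still long enough to be heard as a distinct repetition rather than blending into reverberation.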

Watch a video of an echoing rock art site in Utah


Figure 2. This figure on the Pecos River in Texas is painted in a shallow shelter with interesting acoustics.
2. Whisper Galleries = Disembodied Voices (relevant to parabolic shelters)
Just as light reflected in a concave mirror can focus to give a “real image” floating in front of the surface, a shallow rock shelter can focus sound waves like a parabolic dish. Sounds from unseen sources miles away can be focused to result in an auditory illusion of disembodied voices coming from thin air right next to you. Such rock shelters were often considered places of power, and were decorated with mysterious paintings. These shelters can also act like loudspeakers to broadcast sounds outward, such that listeners at great distances would wonder why they could not see who was making the sounds.


Figure 3. This stampede of hoofed animals is painted in a cave with thunderous reverberation in central India.

3. Reverberation = Thunder from Hoofed Animals (relevant to cave paintings)
Echoes of percussion noises can sound like hoof beats. Multiple echoes of a simple clap in a cavern blur together into thunderous reverberation, which mimics the sound of the thundering herds of stampeding hoofed animals painted in prehistoric caves. Ancient myths describe thunder as the hoof beats of supernatural gods. I realized that the reverberation in caves must have given the auditory illusion of thunder, and thus inspired cave paintings depicting the idea that the same mythical hoofed thunder gods who cause thunder in the sky also cause thunder in the underworld.
Listen to thunderous reverberation of a percussion sound in a prehistoric cave in France (click here).



4. Ricochets = “Boo-o-o!” (relevant to ghostly hauntings)
Can you hear the ricochet reminiscent of a ghostly “Boo” in this recording of a clap in a highly reverberant room?



Figure 4. A petroglyph of a flute player in an echoing location within Dinosaur National Monument.
5. Resonance = spritely music (relevant to cave and canyon paintings)
Listen to the difference between a flute being played in a non-echoing environment, then how haunting it sounds if played in a cave (click here);

it is as if spirit musicians are in accompaniment. (Thanks to Simon Wyatt for the flute music, to which half-way through I added cave acoustics via the magic of a convolution reverberation program.)
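The "magic" of a convolution reverberation program is ordinary convolution: the dry flute recording is convolved with an impulse response measured (or simulated) in the cave. A minimal NumPy sketch, using a synthetic exponentially decaying noise tail as a stand-in for a real cave impulse response:

```python
import numpy as np

def convolve_reverb(dry, ir):
    """Apply a room/cave impulse response `ir` to a dry recording.
    Every sample of the dry signal launches its own scaled copy of the
    impulse response; their sum is the reverberant signal."""
    return np.convolve(dry, ir)

# Synthetic 1-second "cave" impulse response: a direct-sound spike plus
# an exponentially decaying diffuse tail (a real one would be measured).
fs = 8000
t = np.arange(fs) / fs
ir = np.zeros(fs)
ir[0] = 1.0                                    # direct sound
ir += 0.3 * np.random.default_rng(1).standard_normal(fs) * np.exp(-3 * t)
```

Convolving any dry recording with such an impulse response makes it sound as though it had been performed in the space the response came from, which is exactly the trick used on the flute clip above.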


Figure 5. An interference pattern from two sound sources such as bagpipes can cause the auditory illusion that the silent zones are acoustic shadows from a megalithic stone circle, and vice versa.
6. Interference Patterns = Acoustic Shadows of a Ring of Pillars (relevant to Stonehenge and Pipers’ Stones)
Mysterious silent zones in an empty field can give the impression of a ring of large phantom objects casting acoustic shadows. Two sound sources, such as bagpipes playing the same tone, can produce an interference pattern. Zones of silence radiating outward occur where the high pressure of sound waves from one source cancels out the low pressure of sound waves from the other source. Blindfolded participants hearing an interference pattern in controlled experiments attributed the dead zones to the presence of acoustic obstructions in an arrangement reminiscent of Stonehenge.
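The geometry of those silent zones follows directly from wave superposition. For two equal, in-phase sources, the summed amplitude at any point depends only on the path difference: it peaks where the difference is a whole number of wavelengths and vanishes where it is an odd half-multiple. The sketch below ignores 1/r spreading, and the source spacing and wavelength are arbitrary illustrative values, not measurements from the experiments:

```python
import numpy as np

def two_source_amplitude(pos, s1, s2, wavelength):
    """Summed amplitude of two equal in-phase sources at s1 and s2,
    heard at `pos`; equals 2*|cos(pi * path_difference / wavelength)|."""
    k = 2.0 * np.pi / wavelength
    r1 = np.hypot(*(np.asarray(pos, float) - s1))
    r2 = np.hypot(*(np.asarray(pos, float) - s2))
    return abs(np.exp(1j * k * r1) + np.exp(1j * k * r2))

# Two "pipers" 1 m apart, playing a tone with a 0.5 m wavelength:
s1, s2 = np.array([-0.5, 0.0]), np.array([0.5, 0.0])
# a listener walking along a line past the pair sweeps through
# alternating loud and near-silent zones
amps = [two_source_amplitude((x, 1.0), s1, s2, 0.5)
        for x in np.linspace(-3, 3, 601)]
```

On the perpendicular bisector the two paths are equal and the sound is at its loudest; walking off to either side the amplitude drops to near zero and recovers again, exactly the alternation a blindfolded listener might attribute to a ring of sound-blocking pillars.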
These experimental results demonstrate that regions of low sound intensity due to destructive interference of sound waves from musical instruments can be misperceived as an auditory illusion of acoustic shadows cast by a ring of large rocks:

Figure 6. Drawing by participant C. Fuller after hearing interference pattern blindfolded, as envisioned from above (shown on left), and in perspective from ground level (shown on right).

I then visited the U.K. and made measurements of the actual acoustic shadows radiating out from Stonehenge and other megalithic stone circles, and demonstrated that the pattern of alternating loud and quiet zones recreates a dual source sound wave interference pattern. My theory that musical interference patterns served as blueprints for megalithic stone circles – many of which are named “Pipers’ Stones” – is supported by ancient legends that two magic pipers enticed maidens to dance in a circle and they all turned to stone.
Listen for yourself to the similarity between sound wave interference as I walk around two flutes in an empty field (click here), and acoustic shadows as I walk around a megalithic Pipers’ Stone circle (click here); both have similar modulations between loud and quiet. How would you have explained this if you couldn’t see what was “blocking” the sound?


Complex behaviors of sound such as reflection and interference (which scientists today explain by sound wave theory and dismiss as acoustical artifacts) can experimentally give rise to psychoacoustic misperceptions in which such unseen sonic phenomena are attributed to the invisible or supernatural. The significance of this research is that it can help explain the motivation for some of mankind’s most mysterious behaviors and greatest artistic achievements.

This research has several implications and applications. It shows that acoustical phenomena were culturally significant to ancient peoples, leading to the immediate conclusion that the natural soundscapes of archaeological sites should be preserved in their natural state for further study and greater appreciation. It also demonstrates that even today sensory input can be used to manipulate perception and produce spooky illusions inconsistent with scientific reality, which could have interesting practical applications for virtual reality and special effects in entertainment media.

A key point to learn from my research is that objectivity is questionable, since a given set of data can be used to support multiple conclusions. An echo, for example, can be used as “proof” for either an echo spirit or sound wave reflection. Also, just based on their interpretation of sounds heard in an empty field, people can be made to believe there is a ring of huge rocks taller than themselves. The history of humanity is full of misinterpretations, such as the visual illusion that the sun propels itself across the sky above the flat earth. Sound, being invisible and complex in its behavior, can lead to auditory illusions of the supernatural. This leads to a more general question: what other perceptual illusions are we currently living under, due to other phenomena that we are currently misinterpreting?


See https://sites.google.com/site/rockartacoustics/ for further detail.

Steven J. Waller — wallersj@yahoo.com
Rock Art Acoustics
5415 Lake Murray Boulevard #8
La Mesa, CA 91942

Popular version of paper 4pAAa1
Presentation Thursday afternoon, October 30, 2014
Session: “Acoustic Trick-or-Treat: Eerie Noises, Spooky Speech, and Creative Masking”
168th Acoustical Society of America Meeting, Indianapolis, IN