2aSC8 – Some people are eager to be heard: anticipatory posturing in speech production – Sam Tilsen

Consider a common scenario in a conversation: your friend is in the middle of asking you a question, and you already know the answer. To be polite, you wait to respond until your friend finishes the question. But what are you doing while you are waiting?

You might think that you are passively waiting for your turn to speak, but the results of this study suggest that you may be more impatient than you think. In analogous circumstances recreated experimentally, speakers move their vocal organs—i.e. their tongues, lips, and jaw—to positions that are appropriate for the sounds that they intend to produce in the near future. Instead of waiting passively for their turn to speak, they are actively preparing to respond.

To examine how speakers control their vocal organs prior to speaking, this study used real-time magnetic resonance imaging of the vocal tract. This recently developed technology takes a picture of the tissue in the middle of the vocal tract, much like an x-ray, about 200 times every second. This allows for measurement of rapid changes in the positions of the vocal organs before, during, and after people speak.

A video is available online (http://youtu.be/h2_NFsprEF0).

To understand how changes in the positions of vocal organs are related to different speech sounds, it is helpful to think of your mouth and throat as a single tube, with your lips at one end and the vocal folds at the other. When your vocal folds vibrate, they create sound waves that resonate in this tube. By using your lips and tongue to make closures or constrictions in the tube, you can change the frequencies of the resonating sound waves. You can also use an organ called the velum to control whether sound resonates in your nasal cavity. These relations between vocal tract postures and sounds provide a basis for extracting articulatory features from images of the vocal tract. For example, to make a “p” sound you close your lips, to make an “m” sound you close your lips and lower your velum, and to make a “t” sound you press the tip of the tongue against the roof of your mouth.
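The tube analogy can be made concrete with a simple quarter-wave resonator model (a uniform tube closed at the vocal folds and open at the lips). This is a standard textbook simplification, not part of the study itself; the function name and the speed-of-sound value are illustrative choices:

```python
# Simplified model: a uniform tube, closed at the vocal folds and open
# at the lips, resonates at odd quarter-wavelength frequencies:
#   F_n = (2n - 1) * c / (4 * L)
SPEED_OF_SOUND_CM_S = 35000  # approx. speed of sound in warm, moist air (cm/s)

def tube_resonances(length_cm, n_resonances=3):
    """Return the first few resonance frequencies (Hz) of a uniform tube."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4 * length_cm)
            for n in range(1, n_resonances + 1)]

# A typical adult male vocal tract is roughly 17.5 cm long.
print(tube_resonances(17.5))  # -> [500.0, 1500.0, 2500.0]
```

Moving the tongue and lips effectively changes the local shape of the tube, shifting these resonances away from the uniform-tube values, which is what makes different vowels and consonants sound distinct.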
Participants in this study produced simple syllables with a consonant and vowel (such as “pa” and “na”) in several different conditions. In one condition, speakers knew ahead of time what syllable to produce, so that they could prepare their vocal tract specifically for the response. In another condition, they produced the syllable immediately without any time for response-specific preparation. The experiment also manipulated whether speakers were free to position their vocal organs however they wanted before responding, or whether they were constrained by the requirement to produce the vowel “ee” before their response.
All of the participants in the study adopted a generic “speech-ready” posture prior to making a response, but only some of them adjusted this posture specifically for the upcoming response. This response-specific anticipation only occurred when speakers knew ahead of time exactly what response to produce. Some examples of anticipatory posturing are shown in the figures below.

Figure 2. Examples of anticipatory postures for “p” and “t” sounds. The lips are closer together in anticipation of “p” and the tongue tip is raised in anticipation of “t”.
Figure 3. Examples of anticipatory postures for “p” and “m” sounds. The velum is raised in anticipation of “p” and lowered in anticipation of “m”.
The surprising finding of this study was that only some speakers anticipatorily postured their vocal tracts in a response-specific way, and that speakers differed greatly in which vocal organs they used for this purpose. Furthermore, some of the anticipatory posturing that was observed facilitates production of an upcoming consonant, while other anticipatory posturing facilitates production of an upcoming vowel. The figure below summarizes these results.
Figure 4. Summary of anticipatory posturing effects, after controlling for generic speech-ready postures.
Why do some people anticipate vocal responses while others do not? Unfortunately, we don’t know: the finding that different speakers use different vocal organs to anticipate different sounds in an upcoming utterance is challenging to explain with current models of speech production. Future research will need to investigate the mechanisms that give rise to anticipatory posturing and the sources of variation across speakers.


Sam Tilsen – tilsen@cornell.edu
Peter Doerschuk – pd83@cornell.edu
Wenming Luh – wl358@cornell.edu
Robin Karlin – rpk83@cornell.edu
Hao Yi – hy433@cornell.edu
Cornell University
Ithaca, NY 14850

Pascal Spincemaille – pas2018@med.cornell.edu
Bo Xu – box2001@med.cornell.edu
Yi Wang – yiwang@med.cornell.edu
Weill Medical College
New York, NY 10065

Popular version of paper 2aSC8
Presented Tuesday morning, October 28, 2014
168th ASA Meeting, Indianapolis

4pAAa10 – Eerie voices: Odd combinations, extremes, and irregularities. – Brad Story

The human voice is a pattern of sound generated by both the mind and body, and carries with it information about a speaker’s mental and physical state. Qualities such as gender, age, physique, dialect, health, and emotion are often embedded in the voice, and can produce sounds that are comforting and pleasant, intense and urgent, sad and happy, and so on. The human voice can also project a sense of eeriness when the sound contains qualities that are human-like, but not necessarily typical of the speech that is heard on a daily basis. A person with an unusually large head and neck, for example, may produce highly intelligible speech, but it will be oddly dominated by low frequency sounds that betray the atypical size of the talker. Excessively slow or fast speaking rates, strangely timed and irregular speech, as well as breathiness and tremor, may all contribute to eeriness when produced outside the boundaries of typical speech.

The sound pattern of the human voice is produced by the respiratory system, the larynx, and the vocal tract. The larynx, located at the bottom of the throat, is comprised of a left and right vocal fold (often referred to as vocal cords) and a surrounding framework of cartilage and muscle. During breathing the vocal folds are spread far apart to allow for an easy flow of air to and from the lungs. To generate sound they are brought together firmly, allowing air pressure to build up below them. This forces the vocal folds into vibration, creating the sound waves that are the “raw material” to be formed into speech by the vocal tract. The length and mass of the vocal folds largely determine the vocal pitch and vocal quality. Small, light vocal folds will generally produce a high-pitched sound, whereas a low pitch typically originates from large, heavy vocal folds.
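The qualitative relation between fold size and pitch can be sketched with the ideal vibrating-string formula, f0 = sqrt(T/mu) / (2L). This is a toy model that greatly simplifies real vocal fold vibration, and every parameter value below is purely illustrative:

```python
import math

def string_f0(length_m, tension_n, mass_per_length_kg_m):
    """Fundamental frequency (Hz) of an ideal vibrating string:
    f0 = sqrt(T / mu) / (2 * L)."""
    return math.sqrt(tension_n / mass_per_length_kg_m) / (2 * length_m)

# Halving the length doubles the pitch (one octave up), and increasing
# the mass per unit length lowers it -- shorter, lighter folds sound
# higher; longer, heavier folds sound lower.
f_long = string_f0(0.016, 2.0, 0.002)   # longer "string"
f_short = string_f0(0.008, 2.0, 0.002)  # half the length -> twice the f0
```

The same trade-offs (length, tension, mass) are what a singer manipulates when sliding between low and high notes.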

The vocal tract is the airspace created by the throat and the mouth, whose shape at any instant of time depends on the positions of the tongue, jaw, lips, velum, and larynx. During speech it is a continuously changing tube-like structure that “sculpts” the raw sound produced by the vocal folds into a stream of vowels and consonants. The size and shape of the vocal tract impose another layer of information about the talker. A long throat and large mouth may transmit the impression of a large body, while more subtle characteristics, like the contour of the roof of the mouth, may add characteristics that are unique to the talker.

For this study, speech was simulated with a mathematical representation of the vocal folds and vocal tract. Such simulations allow for modifications of size and shape of structures, as well as temporal aspects of speech. The goal was to simulate extremes in vocal tract length, unusual timing patterns of speech movements, and odd combinations of breathiness and tremor. The result can be both eerie and amusing because the sounds produced are almost human, but not quite.

Three examples are included to demonstrate these effects. The first is a set of seven simulations of the word “abracadabra” produced while gradually decreasing the vocal tract length from 22 cm to 6.6 cm, increasing the vocal pitch from very low to very high, and increasing the speaking rate from slow to fast. The longest and shortest vocal tracts are shown in Figure 1 and are both configured as “ah” vowels; for production of the entire word, the vocal tract shape continuously changes. The set of simulations can be heard in sound sample 1.

Although it may be tempting to assume that the changes present in sound sample 1 are similar to simply increasing the playback speed of the audio, the changes are based on physiological scaling of the vocal tract and vocal folds, as well as an increase in the speaking rate. Sound sample 2 contains the same seven simulations except that the speaking rate is exactly the same in each case, eliminating the sense of increased playback speed.

The third example demonstrates the effects of modifying the timing of the vowels and consonants within the word “abracadabra” while simultaneously adding a shaky or tremor-like quality, and an increased amount of breathiness. A series of six simulations can be heard in sound sample 3; the first three versions of the word are based on the structure of an unusually large male talker, whereas the second three are representative of an adult female talker.

The simulation model used for these demonstrations has been developed for the purpose of studying and understanding human speech production and speech development. Using the model to investigate extreme cases of structure and unusual timing patterns is useful for better understanding the limits of human speech.



Figure 1 caption:
Unnaturally long and short tube-like representations of the human vocal tract. Each vocal tract is configured as an “ah” vowel (as in “hot”), but during speech the vocal tract continuously changes shape. Vocal tract lengths for typical adult male and adult female talkers are approximately 17.5 cm and 15 cm, respectively. Thus, the 22 cm tract would be representative of a person with an unusually large head and neck, whereas the 6.6 cm tract is even shorter than that of a typical infant.


Brad Story – bstory@email.arizona.edu
Dept. of Speech, Language, and Hearing Sciences
University of Arizona
P.O. Box 210071
Tucson, AZ 85712

Popular version of paper 4pAAa10
Presented Thursday afternoon, October 30, 2014
168th ASA Meeting, Indianapolis


4aSCb8 – How do kids communicate in challenging conditions? – Valerie Hazan

Kids learn to speak fluently at a young age, and we expect young teenagers to communicate as effectively as adults. However, researchers are increasingly realizing that certain aspects of speech communication follow a slower developmental path. For example, as adults, we are highly skilled at adapting the way we speak to the needs of the communicative situation. When speaking a predictable message in good listening conditions, we do not need to enunciate clearly and can expend less effort. In poor listening conditions, however, or when transmitting new information, we increase our effort to enunciate speech clearly in order to be more easily understood.

In our project, we investigated whether 9 to 14 year olds (divided into three age bands) were able to make such skilled adaptations when speaking in challenging conditions. We recorded 96 pairs of friends of the same age and gender while they carried out a simple picture-based ‘spot the difference’ game (See Figure 1).
Figure 1: one of the picture pairs in the DiapixUK ‘spot the difference’ task.

The two friends were seated in different rooms and spoke to each other via headphones; they had to try to find 12 differences between their two pictures without seeing each other or the other picture. In the ‘easy communication’ condition, both friends could hear each other normally, while in the ‘difficult communication’ condition, we made it difficult for one of the friends (‘Speaker B’) to hear the other by heavily distorting the speech of ‘Speaker A’ using a vocoder (See Figure 2 and sound demos 1 and 2). Both kids had received some training at understanding this type of distorted speech. We investigated what adaptations Speaker A, who was hearing normally, made to his or her speech in order to make themselves understood by their friend with ‘impaired’ hearing, so that they could complete the task successfully.
Figure 2: The recording set up for the ‘easy communication’ (NB) and ‘difficult communication’ (VOC) conditions.

Sound 1: Here, you will hear an excerpt from the diapix task between two 10 year olds in the ‘difficult communication’ condition from the viewpoint of the talker hearing normally. Hear how she attempts to clarify her speech when her friend has difficulty understanding her.

Sound 2: Here, you will hear the same excerpt but from the viewpoint of the talker hearing the heavily degraded (vocoded) speech. Even though you will find this speech very difficult to understand, even 10 year olds get better at perceiving it after a bit of training. However, they still have difficulty understanding what is being said, which forces their friend to make a greater effort to communicate.

We looked at the time it took to find the differences between the pictures as a measure of communication efficiency. We also carried out analyses of the acoustic aspects of the speech to see how these varied when communication was easy or difficult.
We found that when communication was easy, the child groups did not differ from adults in the average time that it took to find a difference in the picture, showing that 9 to 14 year olds were communicating as efficiently as adults. When the speech of Speaker A was heavily distorted, all groups took longer to do the task, but only the 9-10 year old group took significantly longer than adults (See Figure 3). The additional problems experienced by younger kids are likely due both to Speaker B having greater difficulty understanding degraded speech and to Speaker A being less skilled at compensating for these difficulties. The results obtained for children aged 11 and older suggest that they were using good strategies to compensate for the difficulties imposed on the communication (See Figure 3).
Figure 3: Average time taken to find one difference in the picture task. The four talker groups do not differ when communication is easy (blue bars); in the ‘difficult communication’ condition (green bars), the 9-10 years olds take significantly longer than the adults but the other child groups do not.

In terms of the acoustic characteristics of their speech, the 9 to 14 year olds differed in certain aspects from adults in the ‘easy communication’ condition. All child groups produced more distinct vowels and used a higher pitch than adults; kids younger than 11-12 also spoke more slowly and more loudly than adults. They hadn’t learnt to ‘reduce’ their speaking effort in the way that adults would do when communication was easy. When communication was made difficult, the 9 to 14 year olds were able to make adaptations to their speech for the benefit of their friend hearing the distorted speech, even though they themselves were having no hearing difficulties. For example, they spoke more slowly (See Figure 4) and more loudly. However, some of these adaptations differed from those produced by adults.
Figure 4: Speaking rate changes with age and communication difficulty. 9-10 year olds spoke more slowly than adults in the ‘easy communication’ condition (blue bars). All speaker groups slowed down their speech as a strategy to help their friend understand them in the ‘difficult communication’ (vocoder) condition (green bars).

Overall, therefore, even in the second decade of life, there are changes taking place in the conversational speech produced by young people. Some of these changes are due to physiological reasons such as growth of the vocal apparatus, but increasing experience with speech communication and cognitive developments occurring in this period also play a part.

Younger kids may experience greater difficulty than adults when communicating in difficult conditions and even though they can make adaptations to their speech, they may not be as skilled at compensating for these difficulties. This has implications for communication within school environments, where noise is often an issue, and for communication with peers with hearing or language impairments.


Valerie Hazan – v.hazan@ucl.ac.uk
University College London (UCL)
Speech Hearing and Phonetic Sciences
Gower Street, London WC1E 6BT, UK

Michèle Pettinato – Michele.Pettinato@uantwerpen.be
University College London (UCL)
Speech Hearing and Phonetic Sciences
Gower Street, London WC1E 6BT, UK

Outi Tuomainen – o.tuomainen@ucl.ac.uk
University College London (UCL)
Speech Hearing and Phonetic Sciences
Gower Street, London WC1E 6BT, UK
Sonia Granlund – s.granlund@ucl.ac.uk
University College London (UCL)
Speech Hearing and Phonetic Sciences
Gower Street, London WC1E 6BT, UK
Popular version of paper 4aSCb8
Presented Thursday morning, October 30, 2014
168th ASA Meeting, Indianapolis

1pAA1 – Audible Simulation in the Canadian Parliament – Ronald Eligator

If the MPs’ speeches don’t put you to sleep, at least you should be able to understand what they are saying.

Using state-of-the-art audible simulations, a design team of acousticians, architects and sound system designers is working to ensure that speech within the new House of Commons chamber of the Parliament of Canada, now in design, will be intelligible in either French or English.

The new chamber for the House of Commons is being built in a glass-topped atrium in the courtyard of the West Block building on Parliament Hill in Ottawa. The chamber will be the temporary home of the House of Commons while their traditional location in the Centre Block building is being renovated and restored.

The skylit atrium in the West Block will have about six times the volume of the existing room, resulting in significant challenges for ensuring speech intelligibility.


Figure 1: Existing Chamber of the House of Commons, Parliament of Canada

The existing House chamber is 21 meters (70 feet) long and 16 meters (53 feet) wide, with seats for the current 308 Members of Parliament (to increase to 338 in 2015) and 580 people in the upper gallery that runs around the second level of the room. Most surfaces are wood, although the floor is carpeted, and there is an adjustable curtain at the rear of the MP seating area on both sides of the room. The ceiling, a painted stretched-linen canvas, is 14.7 meters (48.5 feet) above the Commons floor, resulting in a room volume of approximately 5000 cubic meters.

The new House chamber is being infilled into an existing courtyard that is 44 meters (145 feet) long, 39 meters (129 feet) wide, and 18 meters (59 feet) high. The meeting space itself will retain the same basic footprint as the existing room, including the upper gallery seating, but will be open to the sound reflective glass roof and stone and glass side walls of the courtyard. In the absence of any acoustic treatments, the high level of reverberant sound would make it very difficult to understand speech in the room.



Figure 2: Early Design Rendering of Chamber in West Block

In order to help Public Works and Government Services Canada (PWGSC) and the House of Commons understand the acoustic differences between the existing House chamber and the one under design, and to assure them that excellent speech intelligibility will be achieved in the new chamber, Acoustic Distinctions, the New York-based acoustic consultant, created computer models of both the new and existing chambers and performed acoustic tests in the existing chamber. AD also compared the two rooms using detailed data analysis, producing tables of data and graphic maps of speech intelligibility in each space.

An early design iteration, for example, included significant areas of sound absorptive materials at the sides of the ceiling areas, as well as sound absorptive materials integrated into the branches of the tree-like structure which supports the roof:




Figure 3: Computer Model of Room Finishes
The dark areas of the image show the location of sound absorptive materials, including triangularly-shaped wedges integrated into the structure which supports the roof.

Using the Speech Transmission Index (STI), a standardized measure of speech intelligibility, AD estimated a score of 0.65 for this design, where a minimum of 0.75 was needed to ensure excellent intelligibility.
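The thresholds at work here follow the conventional STI qualification bands (after IEC 60268-16). A minimal sketch of that lookup, with the function name chosen for illustration:

```python
# Commonly cited STI qualification bands (after IEC 60268-16):
# each tuple is (upper bound of the band, descriptive rating).
STI_BANDS = [(0.30, "bad"), (0.45, "poor"), (0.60, "fair"),
             (0.75, "good"), (1.00, "excellent")]

def sti_rating(sti):
    """Map an STI value in [0, 1] to its conventional descriptive rating."""
    for upper, label in STI_BANDS:
        if sti <= upper:
            return label
    raise ValueError("STI must be between 0 and 1")

print(sti_rating(0.65))  # -> good
print(sti_rating(0.85))  # -> excellent
```

On this scale, the early design's score falls in the "good" band, just short of the "excellent" band the design team was targeting.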

The computer analysis done by Acoustic Distinctions also produced color maps of the degree of speech intelligibility to be expected:


Figure 4: Speech Transmission Index, single person speaking, no reinforcement
Talker at lower left; Listener at lower right
Dark blue to black color indicates fair to good intelligibility

While these numerical and graphical tools were useful in understanding the acoustic conditions of the new room, Acoustic Distinctions also produced computer simulations of speech within it, enabling the client and design team to hear how the room will sound when complete and to more easily appreciate the consultant's acoustic recommendations.

This approach, known as audible simulation or auralization, has been used to analyze a variety of room design options, and as the design progresses, new analysis and simulations are produced.

This first audible simulation is made using the room model shown above. The talker is an MP standing near the center of the bright yellow area in the STI map above. The listener is an MP seated in the opposite corner of the room, where the dark blue to black color confirms the STI value of just less than 0.70, corresponding to “good” intelligibility.
Audio file 1: Speech without Sound System. STI 0.68


To increase intelligibility above the 0.75 minimum design goal, we added the sound system, designed by Engineering Harmonics, to our model. With the sound system operating, the STI value for the talker/listener pair above increases to 0.85. Speech will sound like this:
Audio file 2: Speech with Sound System. STI 0.85


While these examples clearly show the benefit of a speech reinforcement system in the Chamber, the design and client team were not satisfied with the extent of sound absorptive materials in the ceiling required to achieve excellent intelligibility. An additional goal was expressed: to reduce the total amount of sound absorptive material in the room, making the structure and skylight more visible and prominent.

Acoustic Distinctions therefore made changes to the model, strategically removing sound absorptive materials from specific ceiling locations, and reconfiguring the absorptive materials within the upper reaches of the structure supporting the roof. Computer models were again developed, and the resulting images showed that with careful design, excellent intelligibility would be achieved with reduced absorption.

Figure 5: Speech Transmission Index, single person speaking, with sound reinforcement
Talker at upper left; Listener at lower right
Bright pink to red color indicates excellent intelligibility

Not surprisingly, these results had to be communicated to the design team and the House of Commons in a way that provided a high level of confidence. We again used audible simulations to demonstrate the results:
Audio file 3: Speech with Sound System, reduced absorption. STI 0.82


The rendering below shows the space configuration associated with the latest results:


Figure 6: Rendering, House of Commons, West Block, Parliament Hill
Proposed Design Configuration, showing sound absorptive panels
integrated into laylight and structure supporting roof





Audible Simulation in the Canadian Parliament
The impact of auralization on design decisions for the House of Commons

Ronald Eligator – religator@ad-ny.com

Acoustic Distinctions, Inc.
145 Huguenot Street
New Rochelle, NY 10801

Popular version of paper 1pAA1
Presented Monday morning, October 27, 2014

168th ASA Meeting, Indianapolis


1aSC9 – Challenges when using mobile phone speech recordings as evidence in a court of law – Balamurali B. T. Nair

When Motorola’s vice president, Martin Cooper, made the first call from a handheld mobile phone, a device that sold for about four thousand dollars when it reached the market in 1983, one could not have imagined that in just a few decades mobile phones would become a crucial and ubiquitous part of everyday life. Not surprisingly, this technology is also being increasingly misused by the criminal fraternity to coordinate their activities, which range from threatening calls and ransom demands to bank fraud and robbery.

Recordings of mobile phone conversations can sometimes be presented as major pieces of evidence in a court of law. However, identifying a criminal by their voice is not a straightforward task and poses many challenges. Unlike DNA and fingerprints, an individual’s voice is far from constant and exhibits changes as a result of a wide range of factors. For example, the health condition of a person can substantially change his/her voice, and as a result the same words spoken on one occasion would sound different on another.

The process of comparing voice samples and then presenting the outcome to a court of law is technically known as forensic voice comparison. This process begins by extracting a set of features from the available speech recordings of an offender, whose identity obviously is unknown, in order to capture information that is unique to their voice. These features are then compared using various procedures with those of the suspect charged with the offence.

One approach that is becoming widely accepted nowadays amongst forensic scientists for undertaking forensic voice comparison is known as the likelihood ratio framework. The likelihood ratio addresses two different hypotheses and estimates their associated probabilities. First is the prosecution hypothesis which states that suspect and offender voice samples have the same origin (i.e., suspect committed the crime). Second is the defense hypothesis that states that the compared voice samples were spoken by different people who just happen to sound similar.
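As a toy illustration of the framework, a likelihood ratio for a single Gaussian-distributed voice feature might be computed as follows. Real forensic systems model many features with far more sophisticated statistical models; every number and function name here is hypothetical:

```python
import math

def gaussian_pdf(x, mean, std):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def likelihood_ratio(evidence, suspect_mean, suspect_std,
                     population_mean, population_std):
    """LR = P(evidence | same origin) / P(evidence | different origins)."""
    return (gaussian_pdf(evidence, suspect_mean, suspect_std) /
            gaussian_pdf(evidence, population_mean, population_std))

# An offender feature value close to the suspect's model and away from
# the wider population's yields LR > 1, supporting the prosecution
# hypothesis; LR < 1 supports the defense hypothesis.
lr = likelihood_ratio(120.0, 118.0, 5.0, 130.0, 20.0)
```

The strength of the evidence is then expressed by how far the LR departs from 1 in either direction, rather than as a binary match/no-match decision.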

When undertaking this task of comparing voice samples, forensic practitioners might erroneously assume that mobile phone recordings can all be treated in the same way, irrespective of which mobile phone network they originated from. But this is not the case. There are two major mobile phone technologies currently in use today: the Global System for Mobile Communications (GSM) and Code Division Multiple Access (CDMA), and these two technologies are fundamentally different in the way they process speech. One difference, for example, is that the CDMA network incorporates a procedure for reducing the effect of background noise picked up by the sending-end mobile microphone, whereas the GSM network does not. Therefore, the impact of these networks on voice samples is going to be different, which in turn will impact the accuracy of any forensic analysis undertaken.

Having two mobile phone recordings, one for the suspect and another for the offender, that originate from different networks represents a typical scenario in forensic casework. This situation is normally referred to as a mismatched condition (see Figure 1). Researchers at the University of Auckland, New Zealand, have conducted a number of experiments to investigate in what ways and to what extent such mismatched conditions can impact the accuracy and precision of a forensic voice comparison. This study used speech samples from 130 speakers, where the voice of each speaker had been recorded on three occasions, separated by one month intervals. This was important in order to account for the variability in a person’s voice which naturally occurs from one occasion to another. In these experiments the suspect and offender speech samples were processed using the same speech codecs as used in the GSM and CDMA networks. Mobile phone networks use these codecs to compress speech in order to minimize the amount of data required for each call. Moreover, the speech codec interacts dynamically with the network, changing its operation in response to changes occurring in the network. The codecs in these experiments were set to operate in a manner similar to what happens in a real, dynamically changing, mobile phone network.


Figure 1: Typical scenario in forensic casework

The results suggest that the degradation in the accuracy of a forensic analysis under mismatch conditions can be very significant (as high as 150%). Surprisingly, though, these results also suggest that the precision of a forensic analysis might actually improve. Nonetheless, precise but inaccurate results are clearly undesirable. The researchers have proposed a strategy for lessening the impact of mismatch by passing the suspect’s speech samples through the same speech codec as the offender’s (i.e., either GSM or CDMA) prior to forensic analysis. This strategy has been shown to improve the accuracy of a forensic analysis by about 70%, but performance is still not as good as analysis under matched conditions.



Balamurali B. T. Nair – bbah005@aucklanduni.ac.nz
Esam A. Alzqhoul – ealz002@aucklanduni.ac.nz
Bernard J. Guillemin – bj.guillemin@auckland.ac.nz

Dept. of Electrical & Computer Engineering,
Faculty of Engineering,
The University of Auckland,
Private Bag 92019, Auckland Mail Centre,
Auckland 1142, New Zealand.

Phone: (09) 373 7599 Ext. 88190
DDI: (09) 923 8190
Fax: (09) 373 7461


Popular version of paper 1aSC9

Presented Monday morning, October 27, 2014

168th ASA Meeting, Indianapolis



4aAAa1 – Speech-in-noise recognition as both an experience- and signal-dependent process – Ann Bradlow

Real-world speech understanding in naturally “crowded” auditory soundscapes is a complex operation that acts upon an integrated speech-plus-noise signal. Does all of the auditory “clutter” that surrounds speech make its way into our heads along with the speech? Or do we perceptually isolate and discard background noise at an early stage of processing, based on general acoustic properties that differentiate sounds from non-speech noise sources from those produced by human vocal tracts (i.e. speech)?

We addressed these questions by first examining the ability to tune into speech while simultaneously tuning out noise. Is this ability influenced by properties of the listener (their experience-dependent knowledge) as well as by properties of the signal (factors that make it more or less difficult to separate a given target from a given masker)? Listeners were presented with English sentences in a background of competing speech that was either English (matched-language, English-in-English recognition) or another language (mismatched-language, e.g. English-in-Mandarin recognition). Listeners were either native or non-native listeners of English and were either familiar or unfamiliar with the language of the to-be-ignored, background speech (English, Mandarin, Dutch, or Croatian). Overall, we found that matched-language speech-in-speech understanding (English-in-English) is significantly harder than mismatched-language speech-in-speech understanding (e.g. English-in-Mandarin). Importantly, listener familiarity with the background language modulated the magnitude of the mismatched-language benefit. On a smaller time scale of experience, we also found that this benefit is modulated by short-term adaptation to a consistent background language within a test session. Thus, we conclude that speech understanding in conditions that involve competing background speech engages experience-dependent knowledge in addition to signal-dependent processes of auditory stream segregation.

Experiment Series 2 then asked if listeners’ memory traces for spoken words with concurrent background noise remain associated in memory with the background noise. Listeners were presented with a list of spoken words and for each word they were asked to indicate if the word was “old” (i.e. had occurred previously in the test session) or “new” (i.e. had not been presented over the course of the experiment). All words were presented with concurrent noise that was either aperiodic in a limited frequency band (i.e. like wind in the trees) or a pure tone. Importantly, both types of noise were clearly from a sound source that was very different from the speech source. In general, words were more likely to be correctly recognized as previously-heard if the noise on the second presentation matched the noise on the first presentation (e.g. pure tone on both first and second presentations of the word). This suggests that the memory trace for spoken words that have been presented in noisy backgrounds includes an association with the specific concurrent noise. That is, even sounds that quite clearly emanate from an entirely different source remain integrated with the cognitive representation of speech rather than being permanently discarded during speech processing.

These findings suggest that real-world speech understanding in naturally “crowded” auditory soundscapes involves an integrated speech-plus-noise signal at various stages of processing and representation. All of the auditory “clutter” that surrounds speech somehow makes its way into our heads along with the speech, leaving us with exquisitely detailed auditory memories from which we build rich representations of our unique experiences.


Speech-in-noise recognition as both an experience- and signal-dependent process


Ann Bradlow – abradlow@northwestern.edu
Department of Linguistics
Northwestern University
2016 Sheridan Road
Evanston, IL 60208
Popular version of paper 4aAAa1

Presented Thursday morning, October 30, 2014

168th ASA Meeting, Indianapolis

Important note: The work in this presentation was conducted in a highly collaborative laboratory at Northwestern University. Critical contributors to this work are former group members Susanne Brouwer (now at Utrecht University, Netherlands), Lauren Calandruccio (now at UNC-Chapel Hill), and Kristin Van Engen (now at Washington University, St. Louis), and current group member, Angela Cooper.