4aSC19 – Consonant Variation in Southern Speech

Lisa Lipani – llipani@uga.edu
Michael Olsen – michael.olsen25@uga.edu
Rachel Olsen – rmm75992@uga.edu
Department of Linguistics
University of Georgia
142 Gilbert Hall
Athens, Georgia 30602

Popular version of paper 4aSC19
Presented Thursday morning, December 5, 2019
178th ASA Meeting, San Diego, CA

We all recognize that people from different areas of the United States have different ways of talking, especially in how they pronounce their vowels. Think, for example, about stereotypical Bostonians who might “pahk the cah in Hahvahd Yahd”. The field of sociolinguistics studies speech sounds from different groups of people to establish and understand regional American dialects.

While there are decades of research on vowels, sociolinguists have only recently begun to ask whether consonants such as p, b, t, d, k, and g also vary depending on where people are from or what social groups they belong to. These consonants are known as “stop consonants” because the airflow “stops” during a complete closure in the vocal tract. One acoustic characteristic of stop consonants is voice onset time: the amount of time between the release of that closure and the start of vocal fold (also known as vocal cord) vibration. We wanted to know whether some groups of speakers, say men versus women or Texans versus other Southern speakers, pronounced their consonants differently than other groups. To investigate this, we used the Digital Archive of Southern Speech (DASS), which contains 367 hours of recordings, approximately two million words of Southern speech, made across the southeastern United States between 1970 and 1983.

The original DASS researchers were mostly interested in differences in language based on the age of speakers and their geographic location. In the interviews, people were asked about specific words that might indicate their dialect. For example, do you say “pail” or “bucket” for the thing you might borrow from Jack and Jill?

We used computational methods to investigate Southern consonants in DASS, looking at pronunciations of p, b, t, d, k, and g at the beginning of roughly 144,000 words. Our results show that ethnicity is a social factor in the production of these sounds. In our data, African American speakers had longer voice onset times, meaning a longer lag between the release of the stop closure and the start of vocal fold vibration, even when we adjusted the data for speaking rate. This kind of research is important because as we describe differences in the way we speak, we can better understand how we express our social and regional identity.
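For readers curious what a voice onset time measurement looks like in practice, here is a minimal sketch in Python. The token list and the rate adjustment are hypothetical stand-ins, not the actual DASS pipeline or the statistical model used in the study; a real analysis would work from thousands of annotated tokens and would model speaking rate statistically.

```python
# Minimal sketch: comparing voice onset time (VOT) across speaker groups.
# The tokens below are hypothetical; a real analysis would read thousands
# of annotated word-initial stops from the corpus recordings.

from statistics import mean

tokens = [
    # (speaker_group, burst_release_s, voicing_onset_s, speaking_rate_syll_per_s)
    ("group_a", 1.204, 1.262, 4.1),
    ("group_a", 3.410, 3.455, 5.0),
    ("group_b", 0.982, 1.051, 3.8),
    ("group_b", 2.117, 2.190, 4.4),
]

def vot_ms(burst_release, voicing_onset):
    """VOT: the lag from the release of the stop closure to the onset of voicing."""
    return (voicing_onset - burst_release) * 1000.0

by_group = {}
for group, burst, voicing, rate in tokens:
    # Toy rate normalization (scale by syllables per second relative to 4.0);
    # the study adjusted for speaking rate statistically, not like this.
    by_group.setdefault(group, []).append(vot_ms(burst, voicing) * rate / 4.0)

for group, vots in sorted(by_group.items()):
    print(f"{group}: mean rate-adjusted VOT = {mean(vots):.1f} ms (n = {len(vots)})")
```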

5aSC8 – How head and eyebrow movements help make conversation a success

Samantha Danner – sfgordon@usc.edu
Dani Byrd – dbyrd@usc.edu
Department of Linguistics, University of Southern California
Grace Ford Salvatori Hall, Rm. 301
3601 Watt Way
Los Angeles, CA 90089-1693

Jelena Krivokapić – jelenak@umich.edu
Department of Linguistics, University of Michigan
440 Lorch Hall
611 Tappan Street
Ann Arbor, MI 48109-1220

Popular version of poster 5aSC8
Presented Friday morning, December 6, 2019
178th ASA Meeting, San Diego, CA

It’s easy to take for granted our ability to have a conversation, even with someone we’ve never met before. In fact, the human capacity for choreographing conversation is quite incredible. The average time from when one speaker stops speaking to when the next speaker starts is only about 200 milliseconds. Yet somehow, speakers are able to let their conversation partner know when they are ready to turn over the conversational ‘floor.’ Likewise, people somehow sense when it is their turn to start speaking. How, without any conscious effort, does this dance of conversation between two people go so smoothly?

One possible answer to this question is that we use non-verbal communication to help move conversations along. The study described in this presentation looks at how movements of the eyebrows and head might be used by participants in a conversation to help determine when to exchange the conversational floor with one another. For this research, speakers conversed in pairs, taking turns to collaboratively recite a well-known nursery rhyme like ‘Humpty Dumpty’ or ‘Jack and Jill.’ Using nursery rhymes allowed us to study spontaneous speech (speech that is not rehearsed or read) while providing many opportunities for the members of each pair to take turns speaking. We used an instrument called an electromagnetic articulograph to precisely track the eyebrow and head movements of the two conversing people. Their speech was also recorded, so that it was clear exactly when in the conversation each person’s brow and head movements were happening.
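To give a concrete, simplified sense of how kinematic traces like these can be turned into movement counts, the sketch below differentiates a synthetic head-position signal and counts upward velocity crossings of a threshold as movement events. It illustrates the general approach, not the analysis used in the study, and the signal and threshold are invented.

```python
# Illustrative sketch: counting head/brow movement events from a position trace.
# A velocity-threshold crossing count is one simple way to quantify how often
# someone moves; the trace below is synthetic, not real articulograph data.

import numpy as np

fs = 100.0                      # sampling rate in Hz (assumed)
t = np.arange(0, 10, 1 / fs)    # 10 seconds of data
# Synthetic vertical head position: slow drift plus a few discrete nods.
position = 0.5 * np.sin(2 * np.pi * 0.2 * t)
for nod_time in (2.0, 5.5, 8.0):
    position += 3.0 * np.exp(-((t - nod_time) ** 2) / 0.02)

velocity = np.gradient(position, 1 / fs)           # signed velocity
threshold = 5.0 * np.median(np.abs(velocity))      # ad hoc threshold

# A movement "event" starts wherever velocity crosses the threshold upward.
above = velocity > threshold
onsets = np.flatnonzero(above[1:] & ~above[:-1])
print(f"Detected {len(onsets)} movement events in {t[-1]:.0f} s")
```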

We wondered whether we would see more frequent movements of the eyebrows and head when someone is acting as a speaker as opposed to a listener during the conversation, and whether we would see more or less frequent movement at particular moments in the conversation, such as when one person yields the conversational floor to the other, or interrupts the other, or finds that they need to start speaking again after an awkward pause.

We found that listeners move their heads and brows more frequently than speakers. This may mean that people in conversation use face movements to show their engagement with what their partner is saying. We also found that the moment in conversation when movements are most frequent is at interruptions, indicating that listeners may use co-speech movements to signal that they are about to interrupt a speaker.

This research on spoken language helps linguists understand how humans can converse so easily and effectively, highlighting some of the many behaviors we use in talking to each other. Actions of the face and body facilitate the uniquely human capacity for language communication—we use so much more than just our voices to make a conversation happen.

3pID2 – Communication Between Native and Non-Native Speakers

Melissa Baese-Berk – mbaesebe@uoregon.edu

1290 University of Oregon
Eugene, OR 97403

Popular version of 3pID2
Presented Wednesday afternoon, December 4, 2019
178th Meeting of the Acoustical Society of America, San Diego, CA

Communication is critically important in society. Operations of business, government, and the legal system rely on communication, as do more personal ventures like human relationships. Therefore, understanding how individuals produce and understand speech is important for understanding how our society functions. For decades, researchers have asked questions about how people produce and perceive speech. However, the bulk of this prior research has used an idealized, monolingual speaker-listener as its model. Of course, this model is unrealistic in a society where, globally, most individuals speak more than one language and frequently communicate in a language that is not their native language. This is especially true with the rise of English as a lingua franca, or common language of communication – currently, non-native speakers of English outnumber native speakers of the language.

Real-world communication between individuals who do not share a language background (e.g., a native and a non-native speaker of English) can result in challenges for successful communication. For example, communication between individuals who do not share a native language background can be less efficient than communication between individuals who do share a language background. However, the sources of those miscommunications are not well-understood.

For many years, research in this domain has focused on how to help non-native listeners acquire a second or third language. Indeed, an industry of language teaching and learning apps, classes, and tools has developed. However, only in the last decade has research on how a native listener might improve their ability to understand non-native speech begun to expand rapidly.

It has long been understood that myriad factors, both social and cognitive, impact how non-native languages are learned. Our recent work demonstrates that this is also true when we ask how native listeners can better understand non-native speech. For example, a variety of cognitive factors (e.g., memory abilities) can impact how listeners understand unfamiliar speech in general. However, social factors, such as listeners’ attitudes, also impact perception of and adaptation to unfamiliar speech. By better understanding these factors, we can improve education and dialog around issues of native and non-native communication. This has implications for businesses and governmental organizations dealing with international communication, as well as for individuals who work across language boundaries in their professional or personal relationships.

In this talk, I address issues of communication between native and non-native speakers in their capacities as both speakers and listeners. Specifically, I describe the current state of knowledge about how non-native speakers understand and produce speech in their second (or third) language, how native speakers understand non-native speech, and how both parties can improve their abilities at these tasks. I argue that awareness of the issues informing communication between native and non-native speakers is required to truly understand the processes that underlie speech communication, broadly.

4pSC34 – Social contexts do not affect how listeners perceive personality traits of gay and heterosexual male talkers

Erik C. Tracy – erik.tracy@uncp.edu
University of North Carolina Pembroke
Pembroke, NC 28372

Popular version of Poster 4pSC34
Presented in the afternoon on Thursday, December 5, 2019
178th ASA Meeting, San Diego, CA

Researchers have found that different social contexts change how listeners perceive a talker’s emotional state. For example, a scream while watching a football game could be perceived as excitement, while a scream at a haunted house could be perceived as fear. The current experiment examined whether listeners would more strongly associate certain personality traits with a talker if they knew the talker’s sexual orientation (i.e., greater social context) than if they did not (i.e., less social context). For example, if a listener knew that a talker was gay, they might perceive the talker as more outgoing.

In the first phase of the experiment, listeners heard a gay or heterosexual male talker and then rated, on a 7-point scale with 7 being the strongest, how much they associated the talker with a personality trait. Here, listeners did not know the talkers’ sexual orientation. Listeners associated certain personality traits (e.g., confident, mad, stuck-up, and outgoing) with gay talkers and other personality traits (e.g., boring, old, and sad) with heterosexual talkers. The second phase was similar to the first, with one key difference: listeners were aware of the talkers’ sexual orientation. Listeners again heard a gay or heterosexual talker and rated the talker along the same 7-point scale, and on each trial the talker’s sexual orientation was presented next to the scale. The results of the second phase were similar to those of the first. Even when listeners knew the talkers’ sexual orientation, they still perceived gay talkers as more confident, mad, stuck-up, and outgoing, and heterosexual talkers as more boring, old, and sad. As an example, the “Outgoing” chart below shows how listeners rated how outgoing a talker was when they did and did not know the talkers’ sexual orientation.

[Chart: “Outgoing” ratings when listeners did and did not know the talkers’ sexual orientation]
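As a rough sketch of how ratings like these might be summarized, the snippet below averages 7-point scores for each trait with and without knowledge of the talkers’ orientation. The numbers are invented for illustration and are not the study’s data.

```python
# Hypothetical summary of 7-point trait ratings, averaged by condition.
# The values are invented for illustration; they are not the study's data.

ratings = [
    # (trait, orientation_known, rating_1_to_7)
    ("outgoing", False, 5), ("outgoing", False, 6), ("outgoing", True, 5),
    ("outgoing", True, 6), ("boring", False, 3), ("boring", True, 3),
]

means = {}
for trait, known, score in ratings:
    means.setdefault((trait, known), []).append(score)

for (trait, known), scores in sorted(means.items()):
    label = "orientation known" if known else "orientation unknown"
    print(f"{trait:>9} | {label:>20}: mean = {sum(scores) / len(scores):.2f}")
```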

In conclusion, knowing the talkers’ sexual orientation (i.e., greater social context) did not strengthen listeners’ associations between gay or heterosexual talkers and certain personality traits.

4pSC15 – Reading aloud in a clear speaking style may interfere with sentence recognition memory

Sandie Keerstock – keerstock@utexas.edu
Rajka Smiljanic – rajka@austin.utexas.edu
Department of Linguistics, The University of Texas at Austin
305 E 23rd Street, B5100, Austin, TX 78712

Popular version of paper 4pSC15
Presented Thursday afternoon, May 16, 2019
177th ASA Meeting, Louisville, KY

Can you improve your memory by speaking clearly? If, for example, you are rehearsing for a presentation, what speaking style will better enhance your memory of the material: reading aloud in a clear speaking style, or reciting the words casually, as if speaking with a friend?

When conversing with a non-native listener or someone with a hearing problem, talkers spontaneously switch to clear speech: they slow down, speak louder, use a wider pitch range, and hyper-articulate their words. Compared to more casual speech, clear speech enhances a listener’s ability to understand speech in a noisy environment. Listeners also better recognize previously heard sentences and recall what was said if the information was spoken clearly.

Figure 1. Illustration of the procedure of the recognition memory task.

In this study, we set out to examine whether talkers, too, have better memory of what they said if they pronounced it clearly. In the training phase of the experiment, 60 native and 30 non-native English speakers were instructed to read aloud and memorize 60 sentences containing high-frequency words, such as “The hot sun warmed the ground,” as they were presented one by one on a screen. Each screen directed the subject with regard to speaking style, alternating between “clear” and “casual” every ten slides. During the test phase, they were asked to identify as “old” or “new” 120 sentences written on the screen one at a time: 60 they had read aloud in either style, and 60 they had not.


Figure 2. Average of d’ (discrimination sensitivity index) for native (n=60) and non-native English speakers (n=30) for sentences produced in clear (light blue) and casual (dark blue) speaking styles. Higher d’ scores denote enhanced accuracy during the recognition memory task. Error bars represent standard error.

Unexpectedly, both native and non-native talkers in this experiment showed enhanced recognition memory for sentences they had read aloud in a casual style. Unlike in perception, where hearing clearly spoken sentences improved listeners’ memory, the findings from the present study point to a memory cost when talkers themselves produced clear sentences. The perception and production effects on memory, although they run in opposite directions, may be related to the same underlying mechanism, the Effortfulness Hypothesis (McCoy et al., 2005). In perception, more cognitive resources are used to process harder-to-understand casual speech, leaving fewer resources available for storing information in memory. Conversely, cognitive resources may be more depleted during the production of hyper-articulated clear sentences, which could lead to poorer memory encoding. This study suggests that the benefit of clear speech may be limited to listeners’ retention of spoken information in long-term memory, and may not extend to talkers.
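For readers curious how the d’ scores in Figure 2 are obtained, the sketch below applies the standard signal-detection formula, d’ = z(hit rate) − z(false-alarm rate), to hypothetical response counts from the old/new recognition test.

```python
# Computing d' (discrimination sensitivity) from recognition-memory responses.
# Counts below are hypothetical; d' = z(hit rate) - z(false-alarm rate).

from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    # A small correction keeps rates away from 0 and 1, where z is undefined.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Hypothetical participant: 60 "old" sentences and 60 "new" sentences.
print(f"d' = {d_prime(hits=48, misses=12, false_alarms=10, correct_rejections=50):.2f}")
```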

2aSC3 – Studying Vocal Fold Non-Stationary Behavior during Connected Speech Using High-Speed Videoendoscopy

Maryam Naghibolhosseini – naghib@msu.edu
Dimitar D. Deliyski – ddd@msu.edu
Department of Communicative Sciences and Disorders, Michigan State University
1026 Red Cedar Rd.
East Lansing, MI 48824

Stephanie R.C. Zacharias – Zacharias.Stephanie@mayo.edu
Department of Otolaryngology Head & Neck Surgery, Mayo Clinic
13400 E Shea Blvd.
Scottsdale, AZ 85259

Alessandro de Alarcon – alessandro.dealarcon@cchmc.org
Division of Pediatric Otolaryngology, Cincinnati Children’s Hospital Medical Center
3333 Burnet Ave
Cincinnati, OH 45229

Robert F. Orlikoff – orlikoffr16@ecu.edu
College of Allied Health Sciences, East Carolina University
2150 West 5th St.
Greenville, NC 27834

Popular version of paper 2aSC3
Presented Tuesday morning, Nov 6, 2018
176th ASA Meeting, Victoria, BC, Canada

You can feel the vibrations of your vocal folds if you place your hand on your neck while saying /a/. This vibratory behavior can be studied to learn about the mechanisms of voice production, and a better understanding of both typical and disordered voice production could help improve voice assessment and treatment strategies. One technique for studying vocal fold function is laryngeal imaging. The most sophisticated tool for laryngeal imaging is high-speed videoendoscopy (HSV), which records vocal fold vibrations with high temporal resolution (thousands of frames per second, fps). The recent advance of coupling HSV systems with flexible nasolaryngoscopes has made it possible, for the first time, to record vocal fold vibrations during connected speech.

In this study, HSV data were obtained from a vocally normal 38-year-old female while she read the “Rainbow Passage,” using a custom-built HSV system running at 4,000 fps. At this frame rate, the 29.14-second recording comprises 116,543 frames in total. The following video shows one second of the recorded HSV at a playback speed of 30 fps.

Video 1

The HSV dataset is large: it would take about 32 hours just to look through the data, spending one second per image frame! With a dataset this size, manual analysis is not feasible, and automated computerized methods are required. The goal of this research project is to develop automatic algorithms for analyzing HSV of running speech in order to extract meaningful information about vocal fold function. How the vibration of the vocal folds starts and ends during phonation is critical for studying the pathophysiology of voice disorders. Hence, this project focuses on the onsets and offsets of phonation, which show non-stationary behavior.

We have developed the following automated algorithms: temporal segmentation, motion compensation, spatial segmentation, and onset/offset measurements. The temporal segmentation algorithm determines the onset and offset timestamps of phonation. To do so, we measured the glottal area waveform, that is, the area of the dark region between the vocal folds over time; this area changes as the vocal folds vibrate. The waveform can be converted to an acoustic signal that we can listen to. In the following video, you can follow the “Rainbow Passage” text while listening to the audio extracted from the glottal area waveform. Note that this audio signal was extracted solely from the HSV images; no acoustic signal was recorded from the subject.

Video 2
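As a simplified illustration of the idea behind the glottal area waveform (not the authors’ actual algorithm), one can count the dark pixels inside a region around the glottis in each frame; phonation then shows up as oscillation in that count. The frames below are synthetic.

```python
# Simplified sketch of a glottal area waveform: count dark pixels per frame
# inside a region of interest around the glottis. Frames here are synthetic
# grayscale arrays; a real pipeline would read the HSV recording.

import numpy as np

rng = np.random.default_rng(0)
n_frames, height, width = 400, 64, 64
frames = rng.uniform(0.6, 1.0, size=(n_frames, height, width))  # bright tissue

# Paint a dark glottal gap whose width oscillates at ~200 Hz (4,000 fps).
for i in range(n_frames):
    gap = int(6 * (1 + np.sin(2 * np.pi * 200 * i / 4000)))  # 0..12 pixels wide
    if gap > 0:
        frames[i, 10:54, 32 - gap // 2: 32 + gap // 2] = 0.1

roi = frames[:, 10:54, 20:44]          # region of interest around the glottis
area = (roi < 0.3).sum(axis=(1, 2))    # dark-pixel count per frame
voiced = area > 0.2 * area.max()       # crude "glottis visibly open" flag
print(f"Frames with a visible glottal gap: {voiced.sum()} of {n_frames}")
```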

A motion compensation algorithm was developed to align the vocal folds across frames, compensating for the movement of laryngeal tissue during connected speech. In the following video, you can see that after motion compensation, the location of the vocal folds stays nearly the same across the cropped frames.

Video 3
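The sketch below illustrates one generic way to align frames, as a stand-in for the idea of motion compensation rather than the method used in the study: estimate each frame’s displacement relative to a reference frame with FFT-based cross-correlation, then shift it back.

```python
# Generic frame-alignment sketch: estimate each frame's translation relative
# to a reference frame via FFT cross-correlation, then undo it. This is a
# stand-in for the idea of motion compensation, not the authors' algorithm.

import numpy as np

def corrective_shift(reference, frame):
    """Return the (row, col) shift that best realigns `frame` to `reference`."""
    corr = np.fft.ifft2(np.fft.fft2(reference) * np.conj(np.fft.fft2(frame))).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Convert large positive indices to negative (upward/leftward) shifts.
    if dy > reference.shape[0] // 2:
        dy -= reference.shape[0]
    if dx > reference.shape[1] // 2:
        dx -= reference.shape[1]
    return dy, dx

rng = np.random.default_rng(1)
reference = rng.uniform(size=(64, 64))                # stand-in for one HSV frame
moved = np.roll(reference, (3, -5), axis=(0, 1))      # simulated tissue motion

dy, dx = corrective_shift(reference, moved)
aligned = np.roll(moved, (dy, dx), axis=(0, 1))       # integer-pixel toy case
print(f"Corrective shift: ({dy}, {dx}); residual = {np.abs(aligned - reference).max():.3f}")
```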

Spatial segmentation was then performed to extract the edges of the vibrating vocal folds from HSV kymograms. Each kymogram was built by taking a line of pixels across the medial section of every frame and stacking those lines over time, capturing the vocal fold vibrations along that line. An active contour modeling approach was applied to the HSV kymogram of each vocalized segment to provide an analytic description of the vocal fold edges across frames. The result of spatial segmentation for one vocalization is shown in the following figure.


Figure 1
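As a side note on how a kymogram is assembled (a simplified sketch, not the authors’ code): the same pixel row is taken from every frame and the rows are stacked over time.

```python
# Building a kymogram: stack the same line of pixels from every frame over
# time. `frames` is a placeholder array standing in for motion-compensated
# HSV frames; the vocal fold edges would then be traced in the kymogram,
# e.g. with an active contour (snake) model.

import numpy as np

def kymogram(frames, row):
    """Return a (time x width) image of pixel row `row` across all frames."""
    return frames[:, row, :]

frames = np.random.default_rng(2).uniform(size=(400, 64, 64))
kymo = kymogram(frames, row=32)   # a line through the medial glottis
print(f"Kymogram shape (frames x pixels): {kymo.shape}")
```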

The glottal attack time (the time from the first vocal fold oscillation to the first vocal fold contact), the offset time (the time from the last vocal fold contact to the last oscillation), the amplification ratio, and the damping ratio were measured from the spatially segmented kymogram shown in the figure. The amplification ratio describes how the oscillation grows at the onset of phonation, and the damping ratio quantifies how the oscillation dies away at the offset. These measures help describe the laryngeal dynamics of voice production.
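To make the attack-time measure concrete, here is a toy sketch on a synthetic glottal width trace. The signal, thresholds, and amplitudes are invented stand-ins for values that would come from the segmented kymogram; offset-time, amplification-ratio, and damping-ratio measures would be computed analogously at the end of a vocalization.

```python
# Toy sketch of the glottal attack time: the lag between the first detectable
# vocal fold oscillation and the first vocal fold contact. The glottal width
# trace below is synthetic; in the study these events are read from the
# spatially segmented kymogram.

import numpy as np

fps = 4000
t = np.arange(0, 0.25, 1 / fps)
# Synthetic glottal width: a 150 Hz oscillation whose amplitude ramps up and
# back down, clipped at zero whenever the folds are in contact.
envelope = np.clip(np.minimum(t / 0.05, (0.25 - t) / 0.05), 0, 1)
width = np.clip(0.3 + envelope * np.sin(2 * np.pi * 150 * t), 0, None)

oscillating = np.abs(width - 0.3) > 0.05   # width departs from its rest value
contact = width == 0                        # folds touch (width clipped to 0)

attack_frames = np.argmax(contact) - np.argmax(oscillating)
print(f"Glottal attack time: {attack_frames / fps * 1000:.1f} ms")
```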