Kristin Van Engen – kvanengen@wustl.edu, Avanti Dey, Nichole Runge, Mitchell Sommers, Brent Spehar, Jonathan E. Peelle
Washington University in St. Louis 1 Brookings Drive St. Louis, MO 63130
Popular version of paper 4aSC12 Presented Thursday morning, May 10, 2018 175th ASA Meeting, Minneapolis, MN
How hard is it to recognize a spoken word?
Well, that depends. Are you old or young? How is your hearing? Are you at home or in a noisy restaurant? Is the word one that is used often, or one that is relatively uncommon? Does it sound similar to lots of other words in the language?
As people age, understanding speech becomes more challenging, especially in noisy situations like parties or restaurants. This is perhaps unsurprising, given the large proportion of older adults who have some degree of hearing loss. However, hearing measurements do not actually do a very good job of predicting the difficulty a person will have with speech recognition, and older adults tend to do worse than younger adults even when their hearing is good.
We also know that some words are more difficult to recognize than others. Words that are used rarely are more difficult than common words, and words that sound similar to many other words in the language are recognized less accurately than unique-sounding words. Relatively little is known, however, about how these kinds of challenges interact with background noise to affect the process of word recognition or how such effects might change across the lifespan.
In this study, we used eye tracking to investigate how noise and word frequency affect the process of understanding spoken words. Listeners were shown a computer screen displaying four images, and listened to the instruction “Click on the” followed by a target word (e.g., “Click on the dog.”). As the speech signal unfolds, the eye tracker records the moment-by-moment direction of the person’s gaze (60 times per second). Since listeners direct their gaze toward the visual information that matches incoming auditory information, this allows us to observe the process of word recognition in real time.
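For readers curious how such gaze recordings are turned into word-recognition curves, here is a minimal sketch (not our actual analysis code) of how the proportion of looks to the target image can be computed over time from 60-Hz gaze samples; the column names are hypothetical.

```python
# Minimal sketch of a "proportion of looks to target" analysis.
# Column names ("trial", "time_ms", "fixated_image", "target_image") are hypothetical.
import pandas as pd

def fixation_proportions(gaze: pd.DataFrame, bin_ms: int = 50) -> pd.Series:
    """Proportion of gaze samples on the target image, per time bin."""
    gaze = gaze.copy()
    gaze["on_target"] = gaze["fixated_image"] == gaze["target_image"]
    gaze["time_bin"] = (gaze["time_ms"] // bin_ms) * bin_ms
    # Average across trials within each bin (samples arrive roughly every 16.7 ms).
    return gaze.groupby("time_bin")["on_target"].mean()
```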
Our results indicate that word recognition is slower in noise than in quiet, slower for low-frequency words than for high-frequency words, and slower for older adults than for younger adults. Interestingly, younger adults were slowed by noise more than older adults were. The main difference, however, was that young adults were considerably faster at recognizing words in quiet conditions. That is, word recognition by older adults did not differ much between quiet and noisy conditions, but young listeners looked like older listeners when tasked with listening to speech in noise.
Mara Logerquist1, Alisha Martell1, Hyuna Mia Kim2, Benjamin Munson1 (contact author, munso005@umn.edu, +1 612 619 7724), Jan Edwards2,3,4
1Department of Speech-Language-Hearing Sciences, University of Minnesota, Twin Cities, 2Department of Communication Sciences and Disorders, University of Wisconsin, Madison, 3Department of Hearing and Speech Sciences, University of Maryland, College Park, 4Language Science Center, University of Maryland, College Park
Lay-language version of paper Growth in the Accuracy of Preschool Children’s /r/ Production: Evidence from a Longitudinal Study, poster 5aSC, presented in the session Speech Production, Friday, May 11, 8 am – 12 pm.
Few would dispute that language acquisition is a fascinating and remarkable feat. Children progress from their first coos and cries to saying full sentences in a matter of just a few years. Given all that is involved in spoken language, it seems almost unreal that children could accomplish this Herculean task in such a short time. Even the seemingly simple task of learning to pronounce sounds is, on closer examination, rather tough. Children have to listen to the adults around them to figure out what they should sound like. Then, they have to approximate the adult productions that they hear with the very different vocal instrument that they have: children’s vocal tracts are about half the size of an adult’s. Not surprisingly, specific difficulty in learning speech sounds is one of the most common communication disorders.
The English /r/ sound is a particularly interesting topic in speech sound acquisition. It is one of the last sounds to be acquired. For many children with developmental speech disorders, /r/ errors (which usually sound like the /w/ sound) are very persistent, even when other speech errors have been successfully corrected. Perhaps because they are so common, /r/ errors are very socially salient. We can easily find examples of portrayals of children’s speech in TV shows that have /r/ errors, such as Ming-Ming the duck’s catchphrase “This is serious” (with a production of /r/ that sounds like a /w/) on the show Wonder Pets (https://www.youtube.com/watch?v=bjmYee2ZfSk).
The sound /r/ has a distinctive acoustic signature, as illustrated by productions of the words rock and walk (which rhyme in the speech of many people in Minnesota). These illustrations are spectrograms, which are a type of acoustic record of a sound. Spectrograms allow us to measure fine-grained detail in speech. The red dots on these spectrograms are estimates of which of the many frequencies that are present in the speech signal are the loudest. In the production of rock, the third-lowest peak frequency (which we call the third formant [F3]) is low (about 1500 Hz). In the production of walk, it is much higher (about 2500 Hz).
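For the technically inclined, here is one way such formant estimates can be obtained, sketched with the Parselmouth Python interface to Praat; the file name and measurement time are placeholders, and this is an illustration rather than the measurement procedure used for our figures.

```python
# Illustrative F3 estimation with Parselmouth (a Python interface to Praat).
import parselmouth

snd = parselmouth.Sound("rock.wav")                     # placeholder file name
formants = snd.to_formant_burg(max_number_of_formants=5, maximum_formant=5500)

t = snd.duration / 2                                    # word midpoint, for illustration
f3 = formants.get_value_at_time(3, t)                   # third formant, in Hz
print(f"F3 at {t:.3f} s: {f3:.0f} Hz")                  # a low F3 (around 1500 Hz) suggests an /r/-like sound
```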
The last two authors of this study, along with a third collaborator, Dr. Mary E. Beckman, recently finished a longitudinal study of relationships among speech perception, speech production, and word learning in children. As part of this study, we collected numerous productions of late-acquired sounds in word-initial position (like the /r/ sound in rocking). The ultimate goal of that study is to understand how speech production and perception early in life set the stage for vocabulary growth throughout the preschool years, and how vocabulary growth helps children refine their knowledge of speech sounds. The study collected a treasure trove of data that we can use to analyze other secondary questions. Our ASA poster does just that. In it, we ask whether we can identify predictors of which children with inaccurate /r/ productions at our second time point (TP2, when children were between 39 and 52 months old, following the first time point, when they were 28 to 39 months old) improved their /r/ production by our third time point (TP3, when the children were between 52 and 64 months old), and which did not. Our candidate measures were taken from a battery of standardized and non-standardized tests of speech perception, vocabulary knowledge, and nonlinguistic cognitive skills.
Our first stab at answering this question involved looking at phonetic transcriptions of children’s productions of /r/ and /w/. We picked /w/ as a comparison sound because most of children’s /r/ errors sound like /w/. Phonetic transcription was completed by trained phonetic transcribers using a very strict protocol. We calculated the accuracy of /r/ and /w/ at both TP2 and TP3. As the plot below (in which each dot represents a single child’s performance) shows, children’s performance generally improved: most of the children are above the y=x line.
We examined predictors of how much growth in accuracy occurred from TP2 to TP3 for the subset of children whose accuracy of /r/ was below 50% at TP2. Surprisingly, the results did not help us understand why some children improved more than others. In general, we found that the children who had the most improvement were those with low speech perception and vocabulary scores at TP2. A naïve interpretation might be that low vocabulary is associated with positive speech acquisition—an unlikely story! Closer inspection showed that this relationship was because the children who had the lowest accuracy scores at TP2 (that is, the children with the most room to grow) were those who had the lowest vocabulary and speech perception scores.
We then went back to our data and asked whether we could get a finer-grained measure than the one we get from phonetic transcriptions. We know from our own previous work, and from the work of others (especially Tara McAllister and colleagues, who have worked on /r/ extensively), that speech sounds are acquired gradually. A child learning /s/ (the “s” sound) over the course of development gradually learns how to produce /s/ differently from other similar sounds (like the “th” sound and the “sh” sound). McAllister and colleagues showed this to be the case with /r/, too. Measures like phonetic transcription don’t do a very good job of capturing this gradual acquisition, since a transcription says only that a sound is correct or incorrect. It doesn’t track degrees of accuracy or inaccuracy.
To examine gradual learning of /r/, we first tried to look at acoustic measures, like the F3 measures that are useful in characterizing adults’ /r/ productions. A quick look at two spectrograms of children’s productions reveals how hard this endeavor actually is. Both of these are the first 175 ms of two kids’ productions of the word rocking. Both of them were transcribed as /w/. In both of them, the F3 is hard to find. It’s not nearly as clear as it is in the adults’ productions shown above. In both cases, the algorithm that tracks formant frequencies gives values that are rather suspicious. In short, we would need to carefully code these by hand to get anything approximating an accurate F3, and some tokens wouldn’t be amenable to any kind of acoustic analysis. Given our large number of productions in this study (nearly 3,000!), this would take many hundreds of hours of work.
Spectrograms of the first 175 ms of rocking as produced by Subject 679L and Subject 671L.
To remedy this, we decided to abandon acoustics. Instead, we presented brief clips of children’s speech (the first 175 ms of words starting with the /r/ and /w/ sounds) to naïve listeners, where “naïve” means “without any specialized training in speech-language pathology, phonetics, or acoustics.” We asked them to rate the children’s productions on a continuous scale, by clicking on an arrow like this:
The “r” sound  ↔  The “w” sound
When we examine listener ratings, we find quite a bit of variation across kids in how their sounds are rated. Consider the ratings for the productions above. Each one of the “r” symbols represents a rating by an individual. The higher the rating, the more /w/-like it was judged to sound.
Listener ratings for the productions by Subject 679L and Subject 671L.
Listen to these sounds yourself, and ask where you would click on the line. Do your judgments match those of our listeners?
We find that the production by 679L (which was rated by 125 listeners) is perceived as much more /w/-like than was the production by 671L (which was rated by 20 listeners). How do these data help us understand growth in /r/? In our ongoing analyses, we are examining growth by using pooled listener ratings instead of phonetic transcription. Our hope is that these finer-grained measures will help us better understand individual differences in how children learn the /r/ sound.
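As a rough illustration of what “pooled listener ratings” means, the sketch below (with made-up numbers and hypothetical column names) averages the continuous ratings that many naive listeners gave to each child’s production.

```python
# Sketch of pooling continuous listener ratings per production (made-up values).
import pandas as pd

ratings = pd.DataFrame({
    "subject": ["679L"] * 3 + ["671L"] * 3,
    "rating":  [0.91, 0.84, 0.88, 0.22, 0.35, 0.30],  # higher = judged more /w/-like
})

# Mean, spread, and number of ratings for each child's production.
pooled = ratings.groupby("subject")["rating"].agg(["mean", "std", "count"])
print(pooled)
```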
Seth Wiener – sethw1@cmu.edu Carnegie Mellon University 160 Baker Hall, 5000 Forbes Avenue Pittsburgh, PA 15213
Popular version of paper 4pSC11, “The role of familiarity in audiovisual speech perception” Presented Thursday afternoon, December 7, 2017, 1:00-4:00 PM, Studios Foyer 174th ASA Meeting, New Orleans
When we listen to someone talk, we hear not only the content of the spoken message, but also the speaker’s voice carrying the message. Although understanding content does not require identifying a specific speaker’s voice, familiarity with a speaker has been shown to facilitate speech perception (Nygaard & Pisoni, 1998) and spoken word recognition (Lee & Zhang, 2017).
Because we often communicate with a visible speaker, what we hear is also affected by what we see. This is famously demonstrated by the McGurk effect (McGurk & MacDonald, 1976). For example, an auditory “ba” paired with a visual “ga” usually elicits a perceived “da” that is not present in the auditory or the visual input.
Since familiarity with a speaker’s voice affects auditory perception, does familiarity with a speaker’s face similarly affect audiovisual perception? Walker, Bruce, and O’Malley (1995) found that familiarity with a speaker reduced the occurrence of the McGurk effect. This finding supports the “unity” assumption of intersensory integration (Welch & Warren, 1980), but challenges the proposal that processing facial speech is independent of processing facial identity (Bruce & Young, 1986; Green, Kuhl, Meltzoff, & Stevens, 1991).
In this study, we explored audiovisual speech perception by investigating how familiarity with a speaker affects the perception of English fricatives “s” and “sh”. These two sounds are useful because they contrast visibly in lip rounding. In particular, the lips are usually protruded for “sh” but not “s”, meaning listeners can potentially identify the contrast based on visual information.
Listeners were asked to watch/listen to stimuli that were audio-only, visual-only, audiovisual-congruent, or audiovisual-incongruent (e.g., audio “save” paired with visual “shave”). The listeners’ task was to identify whether the first sound of each stimulus was “s” or “sh”. We tested two groups of native English listeners – one familiar with the speaker who produced the stimuli and one unfamiliar with the speaker.
The results showed that listeners familiar with the speaker identified the fricatives faster in all conditions (Figure 1) and more accurately in the visual-only condition (Figure 2). That is, listeners familiar with the speaker were more efficient in identifying the fricatives overall, and were more accurate when visual input was the only source of information.
We also examined whether visual familiarity affects the occurrence of the McGurk effect. Listeners were asked to identify syllable-initial stops (“b”, “d”, “g”) from stimuli that were audiovisual-congruent or incongruent (e.g., audio “ba” paired with visual “ga”). A blended (McGurk) response was indicated by a “da” response to an auditory “ba” paired with a visual “ga”.
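For concreteness, here is a small sketch (with invented responses and hypothetical column names) of how the proportion of McGurk responses can be tallied from the incongruent trials.

```python
# Sketch: proportion of "da" responses to audio "ba" paired with visual "ga".
import pandas as pd

trials = pd.DataFrame({
    "audio":    ["ba", "ba", "ba", "ba"],
    "visual":   ["ga", "ga", "ga", "ga"],
    "response": ["da", "ba", "da", "da"],   # invented responses
})

incongruent = trials[(trials["audio"] == "ba") & (trials["visual"] == "ga")]
mcgurk_rate = (incongruent["response"] == "da").mean()
print(f"Proportion of McGurk responses: {mcgurk_rate:.2f}")
```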
Contrary to the “s”-“sh” findings reported earlier, the results from our identification task showed no difference between the familiar and unfamiliar listeners in the proportion of McGurk responses. This finding did not replicate Walker, Bruce, and O’Malley (1995).
In sum, familiarity with a speaker facilitated the speed of identifying fricatives from audiovisual stimuli. Familiarity also improved the accuracy of fricative identification when visual input was the only source of information. Although we did not find an effect of familiarity on the McGurk responses, our findings from the fricative task suggest that processing audiovisual speech is affected by speaker identity.
Figure 1- Reaction time of fricative identification from stimuli that were audio-only, visual-only, audiovisual-congruent, or audiovisual-incongruent. Error bars indicate 95% confidence intervals.
Figure 2- Accuracy of fricative identification (d’) from stimuli that were audio-only, visual-only, audiovisual-congruent, or audiovisual-incongruent (e.g., audio “save” paired with visual “shave”). Error bars indicate 95% confidence intervals.
Figure 3- Proportion of McGurk response (“da” response to audio “ba” paired with visual “ga”).
Video 1 – Example of an audiovisual-incongruent stimulus (audio “save” paired with visual “shave”).
Video 2 – Example of an audiovisual-incongruent stimulus (audio “ba” paired with visual “ga”).
References:
Bruce, V., & Young, A. (1986). Understanding face recognition. British Journal of Psychology, 77, 305-327.
Green, K. P., Kuhl, P. K., Meltzoff, A. N., & Stevens, E. B. (1991). Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception & Psychophysics, 50, 524-536.
Lee, C.-Y., & Zhang, Y. (in press). Processing lexical and speaker information in repetition and semantic/associative priming. Journal of Psycholinguistic Research.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.
Nygaard, L. C., & Pisoni, D. B. (1998). Talker-specific learning in speech perception. Perception & Psychophysics, 60, 355-376.
Walker, S., Bruce, V., & O’Malley, C. (1995). Facial identity and facial speech processing: Familiar faces and voices in the McGurk effect. Perception & Psychophysics, 57, 1124-1133.
Welch, R. B., & Warren, D. H. (1980). Immediate perceptual response to intersensory discrepancy. Psychological Bulletin, 88, 638-667.
Popular version of paper 2pSC7, “Seeing is treating: 3D electromagnetic midsagittal articulography (EMA) visual biofeedback for the remediation of residual speech errors” Presented at the 173rd ASA Meeting
Lips, teeth, and cheeks are the key ingredients of every great smile. For a speech therapist, however, they get in the way of seeing how the tongue moves during speech. This creates a challenge for treating speech sound errors in the clinic. Carefully coordinated movements of the tongue shape sound into speech. If you have ever heard someone say what sounds like a “w” for the “r” sound or make the “s” sound with a lisp, you have heard what can happen when the tongue does not move toward the right shape to create the sounds. When these errors persist into adolescence and adulthood, there can be major social consequences.
Preschool children and toddlers often make speech errors with the tongue that can make it difficult for them to be understood. Sounds like “k,” “s,” and “r” commonly have errors because saying them accurately requires high-level coordination of the muscles in the tongue. In fact, 15% of children aged 4-6 years have some delay or error in speech sound production without any known cause.
Traditional speech therapy for these errors can include games, drills, and some work to describe or show how the sound should be produced. Many times, these errors can be resolved with less than a year of treatment. Sometimes, even with strong work in speech therapy, the speech errors continue into adolescence and adulthood. In these cases, being able to see how the tongue is moving and to provide a visualization of how it should move would be especially useful.
Opti-Speech is a technology that provides this visualization in the speech therapy room. With it, the patient’s tongue movement is displayed in real-time as he or she talks. The speech therapist can see how the tongue is moving and provide target shapes that help the client produce speech sounds correctly. It was developed by a team that included speech therapists, computer scientists, animators, and biomedical engineers in collaboration with a tech company, Vulintus, using hardware created by Northern Digital, Incorporated.
Tiny sensors are placed on the tongue and their positions are tracked in an electromagnetic field. The positions of the sensors are generated as a 3D animated tongue on a display screen. Using the animation, the speech therapist can identify how the target sound is produced in error, which is not possible without this visualization.
In 2008 in Dallas, Texas, I started working with a talented team that included an electrical engineer, a computer scientist, an animator, and two other speech therapists to create the Opti-Speech therapy technology. We imagined software that could show an animated version of a tongue, driven in real-time by the motion of the client’s tongue, that could be used to “show” clients how to produce the sounds better. Similar technology is used to improve golf swings — by showing the aspiring golfers an image of their swing superimposed on an ideal swing, improvements come more rapidly.
Why couldn’t this same concept be applied to speech, we wondered. With this in mind, the Opti-Speech project began. The engineers and animators worked on the software, the speech therapists tested early versions, and in my lab at Case Western Reserve University, we set out to better understand what the targets for speech sounds might be.
Figure 1: The motion of 5 sensors glued to the tongue animate the movement of the avatar tongue in real-time. Credit: Vick/CHSC
Just eight years later, I am proud that Cleveland Hearing and Speech Center was included in an NIH-funded phase II clinical trial of Opti-Speech. The technology uses the captured motion of sensors on the tongue to animate a real-time 3-D avatar tongue (see Figure 1). The speech therapist can set spherical targets to “show” the client how to shape the tongue for particular speech sounds. For those who have not had success with traditional speech therapy, the Opti-Speech clinical trial may be a great alternative.
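To give a sense of the idea behind a spherical target, here is a simplified sketch (not the Opti-Speech implementation): a tongue sensor’s position counts as a “hit” when it falls inside a target sphere; the coordinates and radius below are made up.

```python
# Simplified illustration of a spherical articulatory target (not Opti-Speech code).
import numpy as np

def in_target(sensor_xyz, target_center, radius_mm):
    """True if the tongue sensor lies within the spherical target."""
    return np.linalg.norm(np.asarray(sensor_xyz) - np.asarray(target_center)) <= radius_mm

# Example: a tongue-tip sensor position versus a hypothetical target for /r/ (values in mm).
print(in_target((12.0, -3.5, 18.0), target_center=(10.0, -4.0, 20.0), radius_mm=5.0))
```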
It has been almost 18 months since CHSC started the Opti-Speech trial. Rebecca Mental, a CHSC staff speech-language pathologist and doctoral student at CWRU, designed the treatment sessions and is running them. To date, she has completed the treatment with eleven participants who range in age from 8 to 22 years. Each and every one of these individuals has put in many hours across 13 sessions to help us understand if Opti-Speech will be a treatment that will be beneficial to our clients.
With these cases behind us, I am pleased to report that I believe we have a powerful new approach for treating the speech sound errors that are most resistant to improvement. All of the Opti-Speech participants were previously enrolled in speech therapy without resolving their speech errors. Many of these individuals came to us frustrated, expecting to encounter yet another unsuccessful run in therapy.
With Opti-Speech, most of these participants experienced a transformation in how they make speech sounds. The key to the success of Opti-Speech is giving the client an additional “sense” for producing speech. In addition to feeling the tongue move and hearing the sound, Opti-Speech clients can “see” the movements of the tongue and know, right away, if they have produced the sound correctly.
The Opti-Speech story is best told through the experience of one of our first participants. Nancy, as I will call her, was 22 years old and had been in speech therapy throughout most of her early school years to work on the “r” sound. It was her junior year of high school when Nancy first became aware that her peers were making fun of her speech. As this continued, she started to notice that teachers had a difficult time understanding her. Before long, she started to question her own competence and abilities. Nancy is a server at a local restaurant. Her boyfriend said she frequently returned home from work in tears. Nancy says, “When I have to say an ‘r’ word, I try to mumble it so that people won’t hear the error, but then they ask me to repeat myself which makes me feel even more embarrassed.” Frustrated, Nancy again enrolled in speech therapy, trying a few different clinics, but she did not have any success changing her “r” sound. Her boyfriend began researching options on the internet and found out about the Opti-Speech clinical trial at CHSC. Nancy was soon enrolled in the trial. As her boyfriend said, “I feel like we wasted so much time trying other things and then we came here and, BAM, 10 sessions and she can say “r” like anyone else!” He says he could hear a difference in Nancy’s speech after two or three sessions. Nancy has remarked that the change has made her job so much easier. “I can actually tell people that I am a server now. I used to avoid it because of the “r” sound. And at work, I can say ‘rare’ and ‘margarita’ and customers can understand me!”
It has been three months since Nancy “graduated” from Opti-Speech treatment and everything is going great for her. She is enrolled in classes at community college and working as a server at a high-end restaurant. While she is incredibly proud of her new speech, she is, understandably, self-conscious about how her speech used to sound. While listening to a recording of her speech before Opti-Speech, tears fell from her eyes. Looking back on the past gave her such an incredible sense of how far she’s come. I am exhilarated to have met and talked with Nancy. It made me realize the power of imagination and collaboration for solving some of the greatest challenges we encounter in the clinic.
Every year at Cleveland Hearing and Speech Center, we see countless clients who have speech production errors that Opti-Speech may improve. We have a strong affiliation with researchers at CWRU and we have a talented team of speech therapists who can help to run the trial. In other words, CHSC is unique in the world in its ability to test new technologies for speech, hearing, and deafness. This is why CHSC is the only site in the world currently running the Opti-Speech clinical trial. Almost a century of collaboration, community support, and philanthropy has helped to create the perfect environment for bringing the most cutting-edge speech therapy to our region.
Keiichi Tajima – tajima@hosei.ac.jp Dept. of Psychology Hosei University 2-17-1 Fujimi, Chiyoda-ku Tokyo 102-8160 Japan
Stefanie Shattuck-Hufnagel – sshuf@mit.edu Research Laboratory of Electronics Massachusetts Institute of Technology 77 Massachusetts Avenue Cambridge, MA 02139 USA
Popular version of paper 1pSC12 Presented Sunday afternoon, June 25, 2017 173rd ASA Meeting, Boston
Learning pronunciation and listening skills in a second language is a challenging task. Languages vary not only in the vowels and consonants that are used, but also in how the vowels and consonants combine to form syllables and words. For example, syllables in Japanese are relatively simple, often consisting of just a consonant plus a vowel, but syllables in English tend to be more complex, containing several consonants in a row. Because of these differences, learning the syllable structure of a second language may be difficult.
For example, when Japanese learners of English pronounce English words such as “stress,” they often pronounce it as “sutoresu,” inserting what are called epenthetic vowels (underlined) between adjacent consonants and at the end of words [1]. Similarly, when asked to count the number of syllables in spoken English words, Japanese learners often over-estimate the number of syllables, saying, for example, that the one-syllable word, play, contains 2 syllables [2].
This may be because Japanese listeners “hear” an epenthetic vowel between adjacent consonants even if no vowel is physically present. That is, they may hear “play” as something like “puh-lay,” thus reporting to have heard two syllables in the word. In fact, a study has shown that when Japanese speakers are presented with a nonsense word like “ebzo,” they report hearing an “illusory” epenthetic vowel between the b and z; that is, they report hearing “ebuzo” rather than “ebzo,” even though the vowel u was not in the speech signal [3].
These tendencies suggest the possibility that Japanese learners may have difficulty distinguishing between English words that differ in syllable count, or the presence or absence of a vowel, e.g. blow vs. below, sport vs. support. Furthermore, if listeners tend to hear an extra vowel between consonants, then they might be expected to misperceive blow as below more often than below as blow.
To test these predictions, we conducted a listening experiment with 42 Japanese learners of English as participants. The stimuli consisted of 76 pairs of English words that differed in the presence or absence of a vowel. Each pair had a “CC word” that contained a consonant-consonant sequence, like blow, and a “CVC word” that had a vowel within that sequence, like below. On each trial, listeners saw one pair of words on the computer screen, and heard one of them through headphones, as pronounced by a male native English speaker. The participants’ task was to pick which word they think they heard by clicking on the appropriate button. A control group of 14 native English participants also took part in the experiment.
Figure 1 shows the percentage of correct responses for CC words and CVC words for the Japanese learners of English (left half) and for the native English listeners (right half). The right half of Figure 1 clearly shows that the native listeners were very good at identifying the words; they were correct about 98% of the time. In contrast, the left half of Figure 1 shows that the Japanese listeners were less accurate; they were correct about 75 to 85% of the time. Interestingly, their accuracy was higher for CC words (85.7%) than for CVC words (73.4%), contrary to the prediction based on vowel epenthesis.
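For readers who want to see how such accuracy figures are tabulated, here is a minimal sketch with invented responses and hypothetical column names; it simply averages correct/incorrect responses by listener group and word type, as in Figure 1.

```python
# Sketch: percent correct by listener group and word type (invented data).
import pandas as pd

responses = pd.DataFrame({
    "group":     ["JP", "JP", "JP", "EN", "EN", "EN"],   # JP = Japanese learners, EN = native English
    "word_type": ["CC", "CVC", "CC", "CC", "CVC", "CVC"],
    "correct":   [1, 0, 1, 1, 1, 1],
})

accuracy = (responses.groupby(["group", "word_type"])["correct"]
                     .mean()
                     .mul(100)
                     .rename("percent_correct"))
print(accuracy)
```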
Figure 1. Percent correct identification rate for CC words, e.g. blow, and CVC words, e.g. below, for Japanese learners of English (left half) and native English listeners (right half). Credit: Tajima/ Shattuck-Hufnagel
To find out why Japanese listeners’ performance was lower for CVC words than for CC words, we further analyzed the data based on phonetic properties of the target words. It turned out that Japanese listeners’ performance was especially poor when the target word contained a particular type of sound, namely, a liquid consonant such as “l” and “r”. Figure 2 shows Japanese listeners’ identification accuracy for target words that contained a liquid consonant (left half), like blow-below, prayed-parade, scalp-scallop, course-chorus, and for target words that did not contain a liquid consonant (right half), like ticked-ticket, camps-campus, sport-support, mint-minute.
The left half of Figure 2 shows that while Japanese listeners’ accuracy for CC words that contained a liquid consonant, like blow, prayed, was about 85%, their accuracy for the CVC counterparts, e.g. below, parade, was about 51%, which is at chance (guessing) level. In contrast, the right half of Figure 2 shows that Japanese listeners’ performance on words that did not contain a liquid sound was around 85%, with virtually no difference between CC and CVC words.
Figure 2. Percent correct identification rate for word pairs that contained a liquid consonant, e.g. blow-below, prayed-parade (left half) and word pairs that did not contain a liquid consonant, e.g. ticked-ticket, camp-campus. Credit: Tajima/ Shattuck-Hufnagel
Why was Japanese listeners’ performance poor for words that contained a liquid consonant? One possible explanation is that liquid consonants are acoustically similar to vowel sounds. Compared to other kinds of consonants such as stops, fricatives, and nasals, liquid consonants generally have greater intensity, making them similar to vowels. Liquid consonants also generally have a clear formant structure similar to vowels, i.e. bands of salient energy stemming from resonant properties of the oral cavity.
Because of these similarities, liquid consonants are more confusable with vowels than are other consonant types, and this may have led some listeners to interpret words with vowel + liquid sequences such as below and parade as containing just a liquid consonant without a preceding vowel, thus leading them to misperceive the words as blow and prayed. Given that the first vowel in words such as below and parade is a weak, unstressed vowel, which is short and relatively low in intensity, such misperceptions would be all the more likely.
Another possible explanation for why Japanese listeners were poorer with CVC words than CC word may have to do with the listeners’ familiarity with the target words and their pronunciation. That is, listeners may have felt reluctant to select words which they were not familiar with or did not know how to pronounce. When the Japanese listeners were asked to rate their subjective familiarity with each of the English words used in this study using a 7-point scale, from 1 (not familiar at all) to 7 (very familiar), it turned out that their ratings were higher on average for CC words (4.8) than for CVC words (4.1).
Furthermore, identification accuracy showed a moderate positive correlation (r = 0.45) with familiarity rating, indicating that words that were more familiar to Japanese listeners tended to be more correctly identified. These results suggest that listeners’ performance in the identification task was partly affected by how familiar they were with the English words.
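As an illustration of the familiarity analysis, the sketch below computes a Pearson correlation between per-word familiarity ratings and identification accuracy; the numbers are made up and are not our data.

```python
# Sketch: correlating word familiarity with identification accuracy (made-up values).
from scipy.stats import pearsonr

familiarity = [5.2, 4.8, 3.1, 6.0, 2.7, 4.4]        # mean 1-7 familiarity rating per word
accuracy    = [0.90, 0.85, 0.60, 0.95, 0.55, 0.80]  # proportion correct per word

r, p = pearsonr(familiarity, accuracy)
print(f"r = {r:.2f}, p = {p:.3f}")
```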
Put together, the present study suggests that Japanese learners of English indeed have difficulty correctly identifying spoken English words that are distinguished by the presence vs. absence of a vowel. From a theoretical standpoint, the results are intriguing because they are not in accord with predictions based on vowel epenthesis, and they suggest that detailed properties of the target words affect the results in subtle ways. From a practical standpoint, the results suggest that it would be worthwhile to develop ways to improve learners’ skills in listening to these distinctions.
References
[1] Tajima, K., Erickson, D., and Nagao, K. (2003). Production of syllable structure in a second language: Factors affecting vowel epenthesis in Japanese-accented English. In Burleson, D., Dillon, C., and Port, R. (eds.), Indiana University Working Papers in Linguistics 4, Speech Prosody and Timing: Dynamic Aspects of Speech. IULC Publications.
[2] Tajima, K. (2004). Stimulus-related effects on the perception of syllables in second-language speech. Bulletin of the Faculty of Letters, vol. 49, Hosei University.
[3] Dupoux, E., Kakehi, K., Hirose, Y., Pallier, C., and Mehler, J. (1999). Epenthetic vowels in Japanese: A perceptual illusion? Journal of Experimental Psychology: Human Perception and Performance, 25, 1568-1578.
Takashi Saito – saito@sc.shonan-it.ac.jp Shonan Institute of Technology 1-1-25 Tsujido-Nishikaigan, Fujisawa, Kanagawa, JAPAN
Popular version of paper 2pSC, “Prosodic analysis of storytelling speech in Japanese fairy tale” Presented Tuesday afternoon, November 29, 2016 172nd ASA Meeting, Honolulu
Recent advances in speech synthesis technology have brought us relatively high-quality synthetic speech; smartphones, for example, now routinely read out spoken messages. The acoustic quality, in particular, sometimes comes close to that of a human voice. The prosodic aspects, that is, the patterns of rhythm and intonation, still leave considerable room for improvement, however. Messages generated by speech synthesis systems tend to sound somewhat awkward and monotonous; in other words, they lack the expressiveness of human speech. One reason is that most systems use a one-sentence synthesis scheme in which each sentence is generated independently and the sentences are simply concatenated to form the message. This lack of expressiveness may limit the range of applications for speech synthesis. Storytelling is a typical application in which synthesis needs a control mechanism that extends beyond a single sentence in order to sound truly vivid and expressive. This work investigates the actual storytelling strategies of expert human narrators, with the ultimate goal of reflecting them in more expressive speech synthesis.
A popular Japanese fairy tale, titled “The Inch-High Samurai” in its English translation, was the storytelling material in this study. It is a short story taking about six minutes to tell verbally. The story consists of four elements typically found in simple fairy tales: introduction, build-up, climax, and ending. These common features suit the story well for observing prosodic changes in the story’s flow. The story was told by six narration experts (four female and two male narrators), and their narrations were recorded. First, we were interested in what they were thinking while telling the story, so we interviewed them about their actual reading strategies after the recording. We found that they usually did not adopt fixed reading techniques for each sentence, but tried to enter the world of the story and form a clear image of the characters appearing in it, as an actor would. They also reported paying attention to the following aspects of the scenes associated with the story elements: in the introduction, featuring the birth of the little Samurai character, they started to speak slowly and gently in an effort to grasp the hearts of listeners; in the story’s climax, depicting the extermination of the devil character, they tried to express a tense feeling through a quick rhythm and tempo; finally, in the ending, they gradually changed their reading style to signal to the audience that the happy ending was coming soon.
For all six speakers, a baseline segmentation of the speech into words and accentual phrases was carried out semi-automatically. We then used a multi-layered prosodic tagging method, applied manually, to mark various changes of “story state” relevant to impersonation, emotional involvement, and scene-flow control. Figure 1 shows an example of the labeled speech data. The Wavesurfer [1] software served as our speech visualization and labeling tool. The example utterance contains part of the storyteller’s speech (the phrase “oniwa bikkuridesu,” meaning “the devil was surprised”) and the devil’s part (“ta ta tasukekuree,” meaning “please help me!”), shown in the top label pane for characters (chrlab). The second label pane (evelab) shows event labels such as scene changes and emotional involvement (desire, joy, fear, etc.). In this example, a “fear” event is attached to the devil’s utterance. The dynamic pitch movement can be observed in the pitch contour pane at the bottom of the figure.
How are the scene-change and emotional-involvement events provided by the narrators manifested in the speech data? We investigated four prosodic parameters for every breath group in the data: speed (speech rate, in morae per second), pitch (in Hz), power (in dB), and preceding pause length (in seconds). A breath group is a stretch of speech uttered consecutively without pausing. Figures 2, 3, and 4 show these parameters at a scene-change event (Figure 2), a desire event (Figure 3), and a fear event (Figure 4). The axis on the left of the figures shows the ratio of each parameter to its average value. Each event has its own distinct tendency in the prosodic parameters, as seen in the figures, and these tendencies appear fairly common to all speakers. For instance, the scene-change and desire events differ in the amount of preceding pause and in how much the other three parameters contribute. The fear event shows a quite different tendency from the other events, but it is again common to all speakers, though the degree of parameter movement differs between them. Figure 5 shows how the narrators express character differences with the three parameters when impersonating the story’s characters: in short, speed and pitch are changed dynamically for impersonation, and this is a common tendency across all speakers.
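To make the normalization concrete, here is a small sketch (with invented values and hypothetical column names) of how each breath group’s prosodic parameters can be expressed as a ratio to the speaker’s average, as plotted in Figures 2 through 4.

```python
# Sketch: prosodic parameters per breath group, expressed as ratios to each speaker's mean.
import pandas as pd

breath_groups = pd.DataFrame({
    "speaker":         ["A", "A", "A"],
    "rate_mora_per_s": [7.2, 5.1, 8.4],
    "pitch_hz":        [210.0, 245.0, 190.0],
    "power_db":        [62.0, 66.0, 58.0],
    "pause_s":         [0.30, 0.85, 0.12],
})

params = ["rate_mora_per_s", "pitch_hz", "power_db", "pause_s"]
ratios = breath_groups[params].div(
    breath_groups.groupby("speaker")[params].transform("mean"))
print(ratios.round(2))
```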
Based on the findings from these human narrations, we are designing a framework for mapping story events, through scene changes and emotional involvement, to prosodic parameters. At the same time, additional databases need to be built to refine and validate the story-event descriptions and the mapping framework.