How do narration experts provide expressive storytelling in Japanese fairy tales?
Takashi Saito – firstname.lastname@example.org
Shonan Institute of Technology
Fujisawa, Kanagawa, JAPAN
Popular version of paper 2pSC, “Prosodic analysis of storytelling speech in Japanese fairy tale”
Presented Tuesday afternoon, November 29, 2016
172nd ASA Meeting, Honolulu
Recent advances in speech synthesis technologies bring us relatively high quality synthetic speech, as smartphones today often provide it with speech message output. The acoustic sound quality especially seems to sometimes be particularly close to that of human voices. Prosodic aspects, or the patterns of rhythm and intonation, however, still have large room for improvement. The overall speech messages generated by speech synthesis systems sound somewhat awkward and monotonous. In other words, those messages lack expressiveness of speech compared with human speech. One of the reasons for this is that most systems use a one-sentence speech synthesis scheme in which each sentence in the message is generated independently, connected just to construct the message. The lack of expressiveness might hinder widening the range of applications for speech synthesis. Storytelling is a typical application to expect speech synthesis to be capable of having a control mechanism beyond just one sentence to provide really vivid and expressive storytelling. This work attempts to investigate the actual storytelling strategies of human narration experts for the purpose of ultimately reflecting them on the expressiveness of speech synthesis.
A Japanese popular fairy tale titled, “The Inch-High Samurai,” in its English translation was the storytelling material in this study. It is a short story taking about six minutes to tell verbally. The story consists of four elements typically found in simple fairy tales: introduction, build-up, climax, and ending. These common features suit the story well for observing prosodic changes in the story’s flow. The story was told by six narration experts (four female and two male narrators) and were recorded. First, we were interested in what they were thinking while telling the story, so we interviewed them on their actual reading strategies after the recording. We found they usually did not adopt fixed reading techniques for each sentence, but tried to go into the world of the story, and make a clear image of characters appearing in the story, as would an actor. They also reported paying attention to the following aspects of the scenes associated with the story elements: In the introduction, featuring the birth of the little Samurai character, they started to speak slowly and gently in effort to grasp the hearts of listeners. In the story’s climax, depicting the extermination of the devil character, they tried to express a tense feeling through a quick rhythm and tempo. Finally, in the ending, they gradually changed their reading styles to make the audience understand that the happy ending is coming soon.
For all six speakers a baseline speech segmentation was conducted for words, and accentual phrases in a semi-automatic way. We then used a multi-layered prosodic tagging method, performed manually, to provide information on various changes of “story states” relevant to impersonation, emotional involvement and scene flow control. Figure 1 shows an example of the labeled speech data. Wavesurfer  software served as our speech visualization and labelling tool. The example utterance contains a part of the storyteller’s speech (containing the phrase “oniwa bikkuridesu” meaning, “the devil was surprised,” and devil’s part, “ta ta tasukekuree,” meaning, “please help me!”) and is shown in the top label pane for characters (chrlab). The second top label pane (evelab) shows event labels such as scene changes and emotional involvement (desire, joy, fear, etc…). In this example, a “fear” event is attached to the devil’s utterance part. The dynamic pitch movement can be observed in the pitch contour pane located at the bottom of the figure.
How are the events of scene change or emotional involvement provided by human narrators manifested in speech data? Prosodic parameters of speed, measured in speech rate or mora/sec; pitch, measured in Hz; power, measured in dB; and preceding pause length, measured in seconds, are investigated for all the breath groups in the speech data. Breath group refers to a speech segment which is uttered consecutively without pausing. Figure 2, 3 and 4 show these parameters at a scene-change event (Figure 2), desire event (Figure 3), and fear event (Figure 4). The axis on the left of the figures shows the ratio of the parameter to its average value. Each event has its own distinct tendency on prosodic parameters, also seen in the figures, which seems to be fairly common to all speakers. For instance, the differences between the scene-change event and the desire event are the amount of preceding pause and the degree of the contributions from the other three parameters. The fear event shows a quite different tendency from other events, but it is common to all speakers though the degree of the parameter movement differs between speakers. Figure 5 shows how to expresses character differences, when the reader impersonates the story’s characters, with the three parameters. In short, speed and pitch are changed dynamically for impersonation, and this is a common tendency of all speakers.
Based on findings obtained from these human narrations, we are designing a framework of mapping story events through scene changes and emotional involvement to prosodic parameters. Simultaneously, it is necessary to build additional databases to ensure and reinforce story event description and mapping framework.
Musical mind control: Human speech takes on characteristics of background music
Department of Linguistics, University of Canterbury
20 Kirkwood Avenue, Upper Riccarton
Christchurch, NZ, 8041
Popular version of paper 1aNS4, “Musical mind control: Acoustic convergence to background music in speech production.”
Presented Monday morning, November 28, 2016
172nd ASA Meeting, Honolulu
People often adjust their speech to resemble that of their conversation partners – a phenomenon known as speech convergence. Broadly defined, convergence describes automatic synchronization to some external source, much like running to the beat of music playing at the gym without intentionally choosing to do so. Through a variety of studies a general trend has emerged where we find people automatically synchronizing to various aspects of their environment 1,2,3. With specific regard to language use, convergence effects have also been observed in many linguistic domains such as sentence-formation4, word-formation 5, and vowel production6 (where differences in vowel production are well associated with perceived accentedness 7,8). This prevalence in linguistics raises many interesting questions about the extent to which speakers converge. This research uses a speech-in-noise paradigm to explore whether or not speakers also converge to non-linguistic signals in the environment: Specifically, will a speaker’s rhythm, pitch, or intensity (which is closely related to loudness) be influenced by fluctuations in background music such that the speech echoes specific characteristics of that background music (for example, if the tempo of background music slows down, will that influence those listening to unconsciously decrease their speech rate)?
In this experiment participants read passages aloud while hearing music through headphones. Background music was composed by the experimenter to be relatively stable with regard to pitch, tempo/rhythm, and intensity, so we could manipulate and test only one of these dimensions at a time, within each test-condition. We imposed these manipulations gradually and consistently toward a target, which can be seen in Figure 1, and would similarly return to the level at which they started after reaching that target. We played the participants music with no experimental changes in between all manipulated sessions. (Examples of what participants heard in headphones are available as sound-files 1 and 2]
Fig. 1 Using software designed for digital signal processing (analyzing and altering sound), manipulations were applied in a linear fashion (in a straight line) toward a target – this can be seen above as the blue line, which first rises and then falls. NOTE: After manipulations reach their target (the target is seen above as a dashed, vertical red line), the degree of manipulation would then return to the level at which it started in a similar linear fashion. Graphic captured while using Praat 9 to increase and then decrease the perceived loudness of the background music.
Data from 15 native speakers of New Zealand English were analyzed using statistical tests that allow effects to vary somewhat for each participant where we observed significant convergence in both the pitch and intensity conditions. Analysis of the Tempo condition, however, has not yet been conducted. Interestingly, these effects appear to differ systematically based on a person’s previous musical training. While non-musicians demonstrate the predicted effect and follow the manipulations, musicians appear to invert the effect and reliably alter aspects of their pitch and intensity in the opposite direction of the manipulation (see Figure 2). Sociolinguistic research indicates that under certain conditions speakers will emphasize characteristics of their speech to distinguish themselves socially from conversation partners or groups, as opposed to converging with them6. It seems plausible then that, given a relatively heightened ability to recognize low-level variations of sound, musicians may on some cognitive level be more aware of the variation in their sound environment, and as a result similarly resist the more typical effect. However, more work is required to better understand this phenomenon.
Fig. 2 The above plots measure pitch on the y-axis (up and down on the left edge), and indicate the portions of background music that have been manipulated on the x- axis (across the bottom). The blue lines show that speakers generally lower their pitch as an un-manipulated condition progresses. However the red lines show that when global pitch is lowered during a test-condition, such lowering is relatively more dramatic for non-musicians (left plot) and that the effect is reversed by those with musical training (right plot). NOTE: A follow-up model further accounts for the relatedness of Pitch and Intensity and shows much the same effect.
This work indicates that speakers are not only influenced by human speech partners in production, but also, to some degree, by noise within the immediate speech environment, which suggests that environmental noise may constantly be influencing certain aspects of our speech production in very specific and predictable ways. Human listeners are rather talented when it comes to recognizing subtle cues in speech 10, especially compared to computers and algorithms that can’t yet match this ability. Some language scientists argue these changes in speech occur to make understanding easier for those listening 11. That is why work like this is likely to resonate in both academia and the private sector, as a better understanding of how speech will change in different environments contributes to the development of more effective aids for the hearing impaired, as well as improvements to many devices used in global communications.
Sound-file 1. An example of what participants heard as a control condition (no experimental manipulation) in between test-conditions.
Sound-file 2. An example of what participants heard as a test condition (Pitch manipulation, which drops 200 cents/one full step).
1. Hill, A. R., Adams, J. M., Parker, B. E., & Rochester, D. F. (1988). Short-term entrainment of ventilation to the walking cycle in humans. Journal of Applied Physiology, 65(2), 570-578.
2. Will, U., & Berg, E. (2007). Brain wave synchronization and entrainment to periodic acoustic stimuli. Neuroscience letters, 424(1), 55-60.
3. McClintock, M. K. (1971). Menstrual synchrony and suppression. Nature, Vol 229, 244-245.
4. Branigan, H. P., Pickering, M. J., McLean, J. F., & Cleland, A. A. (2007). Syntactic alignment and participant role in dialogue. Cognition, 104(2), 163-197.
5. Beckner, C., Rácz, P., Hay, J., Brandstetter, J., & Bartneck, C. (2015). Participants Conform to Humans but Not to Humanoid
Robots in an English Past Tense Formation Task. Journal of Language and Social Psychology, 0261927X15584682.
Retreived from: http://jls.sagepub.com.ezproxy.canterbury.ac.nz/content/early/2015/05/06/0261927X15584682.
6. Babel, M. (2012). Evidence for phonetic and social selectivity in spontaneous phonetic imitation. Journal of Phonetics, 40(1), 177-189.
7. Major, R. C. (1987). English voiceless stop production by speakers of Brazilian Portuguese. Journal of Phonetics, 15, 197—
8. Rekart, D. M. (1985) Evaluation of foreign accent using synthetic speech. Ph.D. dissertation, the Lousiana State University.
9. Boersma, P., & Weenink, D. (2014). Praat: Doing phonetics by computer (Version 5.4.04) [Computer program]. Retrieved
10. Hay, J., Podlubny, R., Drager, K., & McAuliffe, M. (under review). Car-talk: Location-specific speech production and
11. Lane, H., & Tranel, B. (1971). The Lombard sign and the role of hearing in speech. Journal of Speech, Language, and Hearing Research, 14(4), 677-709.
Popular version of paper, 5aSC43, “Appropriateness of acoustic characteristics on perception of disaster warnings.”
Presented Friday morning, December 2, 2016
172nd ASA Meeting, Honolulu
As you might know, Japan has often been hit by natural disasters, such as typhoons, earthquakes, flooding, landslides, and volcanic eruptions. According to the Japan Institute of Country-ology and Engineering , 20.5% of all the M6 and greater earthquakes in the world occurred in Japan, and 0.3% of deaths caused by natural disasters worldwide were in Japan. These numbers seem quite high compared with the fact that Japan occupies only 0.28% of the world’s land mass.
Municipalities in Japan issue and announce evacuation calls to local residents through the community wireless system or home receiver when a disaster is approaching; however, there have been many cases reported in which people did not evacuate even after they heard the warnings . This is because people tend to not believe and disregard warnings due to a normalcy bias . Facing this reality, it is necessary to find a way to make evacuation calls more effective and trustworthy. This study focused on the influence of acoustic characteristics (voice gender, pitch, and speaking rate) of a warning call on the listeners’ perception of the call and tried to make suggestions for better communication.
Three short warnings were created: 1) Kyoo wa ame ga furimasu. Kasa wo motte dekakete kudasai. ‘It’s going to rain today. Please take an umbrella with you.’ 2) Ookina tsunami ga kimasu. Tadachini hinan shitekudasai. ‘A big tsunami is coming. Please evacuate immediately.’ and 3) Gakekuzure no kiken ga arimasu. Tadachini hinan shitekudasai. ‘There is a risk of landslide. Please evacuate immediately.’ A female and a male native speaker of Japanese, who both have relatively clear voices and good articulation, read the warnings out aloud at a normal speed (see Table 1 for the acoustic information of the utterances), and their utterances were recorded in a sound attenuated booth with a high quality microphone and recording device. Each of the female and male utterances was modified using the acoustic analysis software PRAAT  to create stimuli with 20% higher or lower pitch and 20% faster or slower speech rate. The total number of tokens created was 54 (3 warning types x 2 genders x 3 pitch levels x 3 speech rates), but only 4 of the warning 1) tokens were used in the perception experiment as practice stimuli.
Table 1: Acoustic Data of Normal Tokens
34 university students listened to each stimulus through the two speakers placed on the right and left front corners in a classroom (930cm x 1,500cm). Another group of 42 students and 11 people from the public listened to the same stimuli through one speaker placed on the front in a lab (510cm x 750cm). All of the participants rated each token on 1-to-5 scale (1: lowest, 5: highest) in terms of Intelligibility, Reliability, and Urgency.
Figure 1 summarizes the evaluation responses (n=87) in a bar chart, with the average scores calculated from the ratings on a 1-5 scale for each combination of the acoustic conditions. Taking Intelligibility, for example, the average score was the highest when the calls were spoken with a female voice, with normal speed and normal pitch. Similar results are seen for Reliability as well. On the other hand, respondents felt a higher degree of Urgency for both faster speed and higher pitch.
Figure 1. Evaluation responses (bar graph, in percent) and Average scores (data labels and
line graph on 1 – 5 scale)
The data were then analyzed with an analysis of variance (ANOVA, Table 2). Figure 2 illustrates the same results as bar charts. It was confirmed that for all of Intelligibility, Reliability, and Urgency, the main effect of speaking speed was the most dominant. In particular, Urgency can be influenced by the speed factor alone by up to 43%.
Table 2: ANOVA results
Figure 2: Decomposed variances in stacked bar charts based on the ANOVA results
Finally, we calculated the expected average evaluation scores, with respect to different levels of speed, to find out how much influence speed has on Urgency, with a female speaker and normal pitch (Figure 3). Indeed, by setting speed to fast, the perceived Urgency can be raised to the highest level, even at the expense of Intelligibility and Reliability to some degrees. Based on these results, we argue that the speech rate may effectively be varied depending on the purpose of an evacuation call, whether it prioritizes Urgency, or Intelligibility and Reliability.
Figure 3: Expected average evaluation scores on 1-5 scale, setting female voice and normal
Japan Institute of Country-ology and Engineering (2015). Kokudo wo shiru [To know the national land]. Retrieved from: http://www.jice.or.jp/knowledge/japan/commentary09.
2. Nakamura, Isao. (2008). Dai 6 sho Hinan to joho, dai 3 setsu Hinan to jyuumin no shinri [Chapter 6 Evacuation and Information, Section 3 Evacuation and Residents’ Mind]. In H. Yoshii & A. Tanaka (Eds.), Saigai kiki kanriron nyuumon [Introduction to Disaster Management Theory] (pp.170-176). Tokyo: Kobundo.
Drabek, Thomas E. (1986). Human System Responses to Disaster: An Inventory of Sociological Findings. NY: Springer-Verlag New York Inc.
Boersma, Paul & Weenink, David (2013). Praat: doing phonetics by computer [Computer program]. Retrieved from: http://www.fon.hum.uva.nl/praat/.