Takashi Saito – email@example.com
Shonan Institute of Technology
Fujisawa, Kanagawa, JAPAN
Popular version of paper 2pSC, “Prosodic analysis of storytelling speech in Japanese fairy tale”
Presented Tuesday afternoon, November 29, 2016
172nd ASA Meeting, Honolulu
Recent advances in speech synthesis technology have given us relatively high-quality synthetic speech, as the spoken message output of today’s smartphones shows. The acoustic quality in particular can sometimes come very close to that of a human voice. Prosodic aspects, that is, the patterns of rhythm and intonation, still leave considerable room for improvement, however. Messages generated by speech synthesis systems tend to sound somewhat awkward and monotonous; in other words, they lack the expressiveness of human speech. One reason is that most systems use a one-sentence synthesis scheme, in which each sentence in a message is generated independently and the sentences are simply strung together to form the message. This lack of expressiveness may limit the range of applications for speech synthesis. Storytelling is a typical application in which speech synthesis needs a control mechanism that reaches beyond the single sentence in order to deliver vivid, expressive narration. This work investigates the actual storytelling strategies of expert human narrators, with the ultimate aim of reflecting them in more expressive speech synthesis.
A popular Japanese fairy tale, known in English translation as “The Inch-High Samurai,” served as the storytelling material in this study. It is a short story that takes about six minutes to tell aloud. The story consists of four elements typically found in simple fairy tales: introduction, build-up, climax, and ending. These common features make the story well suited for observing prosodic changes along the story’s flow. The story was told by six narration experts (four female and two male), and their narrations were recorded. First, we were interested in what the narrators were thinking while telling the story, so we interviewed them after the recording about their actual reading strategies. We found that they usually did not adopt fixed reading techniques for each sentence, but instead tried to enter the world of the story and form a clear image of the characters appearing in it, as an actor would. They also reported paying attention to the following aspects of the scenes associated with the story elements. In the introduction, which features the birth of the little Samurai character, they began speaking slowly and gently in an effort to capture the hearts of the listeners. In the climax, which depicts the extermination of the devil character, they tried to express a tense feeling through a quick rhythm and tempo. Finally, in the ending, they gradually changed their reading style so that the audience would sense that the happy ending was near.
For all six speakers, a baseline segmentation of the speech into words and accentual phrases was carried out semi-automatically. We then used a multi-layered prosodic tagging method, performed manually, to record the various changes of “story states” relevant to impersonation, emotional involvement, and scene flow control. Figure 1 shows an example of the labeled speech data. Wavesurfer software served as our speech visualization and labeling tool. The example utterance contains part of the storyteller’s narration (“oniwa bikkuridesu,” meaning “the devil was surprised”) and the devil’s line (“ta ta tasukekuree,” meaning “please help me!”); these parts are marked in the top label pane, which is for characters (chrlab). The second label pane from the top (evelab) shows event labels such as scene changes and emotional involvement (desire, joy, fear, etc.). In this example, a “fear” event is attached to the devil’s utterance. The dynamic pitch movement can be observed in the pitch contour pane at the bottom of the figure.
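The multi-layered tagging described above can be pictured as several time-aligned label tiers laid over the same recording. The following is only a minimal sketch under stated assumptions: the tier names chrlab and evelab come from the figure description, but the `Label` class, the helper functions, and all time values are hypothetical and are not taken from the study's data or from Wavesurfer's file format.

```python
from dataclasses import dataclass


@dataclass
class Label:
    """One labeled interval on a tier (times are illustrative, not measured)."""
    start_s: float
    end_s: float
    text: str


def labels_at(tier, t):
    """Return the label texts on one tier that cover time t (in seconds)."""
    return [lab.text for lab in tier if lab.start_s <= t < lab.end_s]


def story_state(tiers, t):
    """Combine all tiers into one 'story state' at time t."""
    return {name: labels_at(tier, t) for name, tier in tiers.items()}


# Hypothetical tiers for the example utterance: the narrator's part is
# followed by the devil's line, which also carries a "fear" event.
tiers = {
    "chrlab": [Label(0.0, 1.2, "narrator"), Label(1.2, 2.5, "devil")],
    "evelab": [Label(1.2, 2.5, "fear")],
}

state = story_state(tiers, 1.8)
print(state)
```

Keeping characters and events on separate tiers means either layer can change without disturbing the other, so an event label can cover only part of a character's speech.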
How do the scene changes and emotional involvement provided by human narrators show up in the speech data? We investigated four prosodic parameters for every breath group in the speech data: speed, measured as speech rate in morae per second; pitch, measured in Hz; power, measured in dB; and preceding pause length, measured in seconds. A breath group is a speech segment that is uttered continuously without pausing. Figures 2, 3, and 4 show these parameters at a scene-change event (Figure 2), a desire event (Figure 3), and a fear event (Figure 4). The axis on the left of each figure shows the ratio of the parameter to its average value. As the figures show, each event has its own distinct tendency in the prosodic parameters, and that tendency seems to be fairly common to all speakers. For instance, the scene-change event and the desire event differ in the amount of preceding pause and in the degree to which the other three parameters contribute. The fear event shows a quite different tendency from the other events; this tendency is common to all speakers, although the degree of parameter movement differs between speakers. Figure 5 shows how character differences are expressed with the three parameters when the reader impersonates the story’s characters. In short, speed and pitch are changed dynamically for impersonation, and this is a common tendency across all speakers.
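The ratio-to-average measure used in the figures can be sketched as follows. This is a hedged illustration, not the authors' analysis code: the `BreathGroup` class, its field names, and every number below are hypothetical. It assumes each parameter is divided by that speaker's mean over all breath groups, and it divides dB values directly, as the figure axis description suggests.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class BreathGroup:
    """One breath group with hypothetical, illustrative measurements."""
    event: str                # e.g. "fear", "scene_change", or "" for no event
    n_morae: int              # number of morae in the breath group
    duration_s: float         # spoken duration in seconds
    mean_pitch_hz: float      # average pitch in Hz
    mean_power_db: float      # average power in dB
    preceding_pause_s: float  # pause before the breath group, in seconds

    @property
    def speed(self) -> float:
        """Speech rate in morae per second."""
        return self.n_morae / self.duration_s


def ratios_to_average(groups):
    """For each breath group, divide each parameter by the speaker's average.

    Assumes every average is non-zero (true for the sample data below).
    """
    params = {
        "speed": [g.speed for g in groups],
        "pitch": [g.mean_pitch_hz for g in groups],
        "power": [g.mean_power_db for g in groups],
        "pause": [g.preceding_pause_s for g in groups],
    }
    averages = {name: mean(values) for name, values in params.items()}
    return [
        {name: values[i] / averages[name] for name, values in params.items()}
        for i in range(len(groups))
    ]


# Made-up data for one speaker; not taken from the recordings in the study.
groups = [
    BreathGroup("", 10, 2.0, 200.0, 60.0, 0.5),
    BreathGroup("fear", 12, 1.5, 300.0, 66.0, 0.25),
    BreathGroup("scene_change", 8, 2.0, 160.0, 54.0, 1.5),
]

ratios = ratios_to_average(groups)
for group, ratio in zip(groups, ratios):
    print(group.event or "(none)", {k: round(v, 2) for k, v in ratio.items()})
```

A ratio above 1 means the parameter is higher than that speaker's average for the breath group in question. Normalizing per speaker is what allows tendencies to be compared across narrators with different natural pitch and tempo.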
Based on the findings obtained from these human narrations, we are designing a framework that maps story events, expressed as scene changes and emotional involvement, to prosodic parameters. At the same time, additional databases need to be built to verify and strengthen both the story event descriptions and the mapping framework.
Wavesurfer: http://www.speech.kth.se/wavesurfer/