2pSC – How do narration experts provide expressive storytelling in Japanese fairy tales?

Takashi Saito – saito@sc.shonan-it.ac.jp
Shonan Institute of Technology
1-1-25 Tsujido-Nishikaigan,
Fujisawa, Kanagawa, JAPAN

Popular version of paper 2pSC, “Prosodic analysis of storytelling speech in Japanese fairy tale”
Presented Tuesday afternoon, November 29, 2016
172nd ASA Meeting, Honolulu

Recent advances in speech synthesis technology have brought us relatively high-quality synthetic speech, as heard in the spoken messages that today’s smartphones often provide. The acoustic sound quality in particular can sometimes come close to that of a human voice. Prosodic aspects, meaning the patterns of rhythm and intonation, still leave considerable room for improvement, however. Messages generated by speech synthesis systems sound somewhat awkward and monotonous overall; in other words, they lack the expressiveness of human speech. One reason is that most systems use a one-sentence synthesis scheme, in which each sentence in the message is generated independently and the sentences are simply concatenated to construct the message. This lack of expressiveness might hinder efforts to widen the range of applications for speech synthesis. Storytelling is a typical application in which speech synthesis would need a control mechanism that reaches beyond the single sentence in order to provide truly vivid and expressive narration. This work investigates the actual storytelling strategies of human narration experts, with the ultimate aim of reflecting them in the expressiveness of speech synthesis.

The storytelling material in this study was a popular Japanese fairy tale known in English translation as “The Inch-High Samurai.” It is a short story that takes about six minutes to tell aloud. The story consists of four elements typically found in simple fairy tales: introduction, build-up, climax, and ending. These common features make the story well suited to observing prosodic changes in the story’s flow. The story was told by six narration experts (four female and two male), and their performances were recorded. First, we were interested in what they were thinking while telling the story, so we interviewed them about their actual reading strategies after the recording. We found that they usually did not adopt fixed reading techniques for each sentence, but instead tried to enter the world of the story and form a clear image of its characters, as an actor would. They also reported paying attention to the following aspects of the scenes associated with the story elements. In the introduction, featuring the birth of the little Samurai character, they began by speaking slowly and gently in an effort to capture the hearts of listeners. In the story’s climax, depicting the extermination of the devil character, they tried to express tension through a quick rhythm and tempo. Finally, in the ending, they gradually changed their reading style to let the audience sense that the happy ending was coming soon.

For all six speakers, a baseline segmentation of the speech into words and accentual phrases was conducted in a semi-automatic way. We then used a multi-layered prosodic tagging method, performed manually, to provide information on various changes of “story states” relevant to impersonation, emotional involvement, and scene flow control. Figure 1 shows an example of the labeled speech data. Wavesurfer [1] software served as our speech visualization and labeling tool. The example utterance contains part of the storyteller’s speech (the phrase “oniwa bikkuridesu,” meaning “the devil was surprised”) and the devil’s part (“ta ta tasukekuree,” meaning “please help me!”), as shown in the top label pane for characters (chrlab). The second label pane from the top (evelab) shows event labels such as scene changes and emotional involvement (desire, joy, fear, etc.). In this example, a “fear” event is attached to the devil’s utterance. The dynamic pitch movement can be observed in the pitch contour pane at the bottom of the figure.


How are the scene-change and emotional-involvement events provided by human narrators manifested in the speech data? Four prosodic parameters were investigated for all the breath groups in the speech data: speed, measured as speech rate in mora/sec; pitch, measured in Hz; power, measured in dB; and preceding pause length, measured in seconds. A breath group is a speech segment that is uttered continuously without pausing. Figures 2, 3 and 4 show these parameters at a scene-change event (Figure 2), a desire event (Figure 3), and a fear event (Figure 4). The axis on the left of each figure shows the ratio of the parameter to its average value. As the figures show, each event has its own distinct tendency in the prosodic parameters, and that tendency seems to be fairly common to all speakers. For instance, the scene-change event and the desire event differ in the amount of preceding pause and in the degree of contribution from the other three parameters. The fear event shows a quite different tendency from the other events; the tendency itself is common to all speakers, although the degree of parameter movement differs between speakers. Figure 5 shows how character differences are expressed with the three parameters when the reader impersonates the story’s characters. In short, speed and pitch are changed dynamically for impersonation, and this tendency is common to all speakers.
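The normalization described above, in which each parameter is expressed as a ratio to its average value, can be sketched as follows. This is a minimal illustration only: the breath-group records, the field names, and the helper functions `speech_rate` and `ratios_to_average` are hypothetical and are not taken from the study's data or tools.

```python
from statistics import mean

# Hypothetical breath-group records for one speaker (values invented for
# illustration): mora count, duration in seconds, and mean pitch in Hz.
breath_groups = [
    {"morae": 12, "duration_s": 2.0, "pitch_hz": 200.0},
    {"morae": 9, "duration_s": 1.0, "pitch_hz": 260.0},
    {"morae": 10, "duration_s": 2.0, "pitch_hz": 170.0},
]


def speech_rate(group):
    """Speech rate of one breath group in mora/sec."""
    return group["morae"] / group["duration_s"]


def ratios_to_average(values):
    """Express each value as a ratio to the average over all breath groups."""
    average = mean(values)
    return [value / average for value in values]


rates = [speech_rate(g) for g in breath_groups]
rate_ratios = ratios_to_average(rates)
pitch_ratios = ratios_to_average([g["pitch_hz"] for g in breath_groups])

for index, (rate, pitch) in enumerate(zip(rate_ratios, pitch_ratios), start=1):
    print(f"breath group {index}: speed x{rate:.2f}, pitch x{pitch:.2f}")
```

A ratio above 1 means the breath group is faster or higher-pitched than that speaker's average, which makes tendencies comparable across speakers with different baseline voices.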

Based on the findings obtained from these human narrations, we are designing a framework that maps story events, through scene changes and emotional involvement, to prosodic parameters. At the same time, additional databases need to be built to validate and reinforce the story event description and the mapping framework.

[Figures 2, 3, 4 and 5]

[1] Wavesurfer: http://www.speech.kth.se/wavesurfer/

2aNS – How virtual reality technologies can enable better soundscape design

W.M. To – wmto@ipm.edu.mo
Macao Polytechnic Institute, Macao SAR, China.
A. Chung – ac@smartcitymakter.com
Smart City Maker, Denmark.
B. Schulte-Fortkamp – b.schulte-fortkamp@tu-berlin.de
Technische Universität Berlin, Berlin, Germany.

Popular version of paper 2aNS, “How virtual reality technologies can enable better soundscape design”
Presented Tuesday morning, November 29, 2016
172nd ASA Meeting, Honolulu

Quality of life, including good sound quality, has been sought by community members as part of the smart city initiative. While many governments have paid special attention to waste management and to air and water pollution, their approach to the acoustic environment in cities has focused on the control of noise, in particular transportation noise. Governments that care about tranquility in cities rely primarily on setting so-called acceptable noise levels, i.e., mere quantities for compliance and improvement [1]. Sound quality is most often ignored. Recently, the International Organization for Standardization (ISO) released a standard on soundscape [2]. However, sound quality is a subjective matter and depends heavily on human perception in different contexts [3]. For example, China’s public parks are well known to be rather noisy in the morning due to the activities of boisterous amateur musicians and dancers, many of them retirees and housewives, or “Da Ma” [4]. These activities would cause numerous complaints if they happened in other parts of the world, but in China they are part of everyday life.

According to the ISO soundscape guideline, people can use sound walks, questionnaire surveys, and even lab tests to determine sound quality during a soundscape design process [3]. We believe that current virtual reality technology enables us to create an application that immerses designers and community stakeholders in a simulated environment, so that they can perceive and compare changes in sound quality and provide feedback on different soundscape designs. An app has been developed specifically for this purpose. Figure 1 shows a simulated environment in which a student or visitor arrives at the school’s campus, walks across the lawn, passes a multifunctional court, and enters an open area with table tennis tables. She or he can experience different ambient sounds and can click an object to increase or decrease the volume of the sound from that object. After hearing sounds from different sources at different locations, the person can evaluate the level of acoustic comfort at each location and express their feelings about the overall soundscape. She or he can rate the sonic environment on its perceived loudness and its pleasantness using a 5-point scale from 1 = ‘heard nothing/not at all pleasant’ to 5 = ‘very loud/very pleasant’. In addition, because of the multi-dimensional nature of the sonic environment, she or he can describe the acoustic environment and soundscape in free words.


Figure 1. A simulated soundwalk in a school campus.

  1. To, W. M., Mak, C. M., and Chung, W. L.. Are the noise levels acceptable in a built environment like Hong Kong? Noise and Health, 2015. 17(79): 429-439.
  2. ISO. ISO 12913-1:2014 Acoustics – Soundscape – Part 1: Definition and Conceptual Framework, Geneva: International Organization for Standardization, 2014.
  3. Kang, J. and Schulte-Fortkamp, B. (Eds.). Soundscape and the Built Environment, CRC Press, 2016.
  4. Buckley, C. and Wu, A. In China, the ‘Noisiest Park in the World’ Tries to Tone Down Rowdy Retirees, NYTimes.com, from http://www.nytimes.com/2016/07/04/world/asia/china-chengdu-park-noise.html , 2016.


4aEA1 – Aero-Acoustic Noise and Control Lab

Aero-Acoustic Noise and Control Lab – Seoryong Park – tjfyd11@snu.ac.kr

School of Mechanical and Aerospace Eng., Seoul National University
301-1214, 1 Gwanak-ro, Gwanak-gu, Seoul 151-742, Republic of Korea

Popular version of paper 4aEA1, “Integrated simulation model for prediction of acoustic environment of launch vehicle”
Presented Thursday morning, December 1, 2016
172nd ASA Meeting, Honolulu

Literally speaking, a “sound” is a pressure fluctuation of the air. When we hear a bus passing, for example, our ears are sensing the pressure fluctuation that the bus created. In daily life, pressure fluctuations in the air rarely become much larger than those of common noises, but in special cases they do. Movies commonly show windows breaking when someone screams loudly or at a high pitch. This is usually exaggerated, but it is not outside the realm of physical possibility.

The pressure fluctuations caused by sound can create engineering problems for structures exposed to loud noise, such as rockets. Because sound waves are pressure waves, louder sounds mean larger pressure fluctuations, which can cause more damage. Rocket launches are particularly loud, and the resulting pressure changes in the air act on the surface of the launch vehicle as a force, as shown in Figure 1.

Figure 1. The Magnitude of Acoustic Loads on the Launch Vehicle

As the vehicle is launched (Figure 2), the sound level exceeds 180 dB, which corresponds to a pressure change of about 20,000 Pascals. This is about 20% of atmospheric pressure, which is considered very large. Because of the pressure change during launch, communication equipment and antenna panels can be damaged, causing the fairing, the protective cone covering the satellite, to malfunction. In engineering, the load created by launch noise is called the acoustic load, and many studies related to it are in progress.
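The conversion from decibels to Pascals quoted above can be checked with a short calculation. This sketch assumes the level is a sound pressure level referenced to 20 micropascals, the usual convention for airborne sound, and a standard atmosphere of 101,325 Pa; neither value is stated in the text, and the function name `spl_to_pressure` is illustrative.

```python
P_REF_PA = 20e-6          # assumed reference pressure for airborne sound (20 micropascals)
ATMOSPHERE_PA = 101_325.0  # assumed standard atmospheric pressure


def spl_to_pressure(spl_db):
    """Convert a sound pressure level in dB to a pressure amplitude in Pa."""
    return P_REF_PA * 10 ** (spl_db / 20)


pressure_pa = spl_to_pressure(180)
fraction = pressure_pa / ATMOSPHERE_PA

print(f"180 dB corresponds to about {pressure_pa:,.0f} Pa")
print(f"that is roughly {fraction:.0%} of atmospheric pressure")
```

Because the decibel scale is logarithmic, every additional 20 dB multiplies the pressure by ten, which is why a launch at 180 dB is so far beyond everyday noise.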

Rocket engineers categorize studies of the relationship between a launch vehicle and its acoustic load under “prediction and control.” Prediction is divided into two aspects: internal acoustic load and external acoustic load. Internal acoustic load refers to sound transmitted from outside the vehicle to its interior, while external acoustic load is the noise coming directly from the jet exhaust. There are two ways to predict the external acoustic load: an empirical method and a numerical method. The empirical method was developed by NASA in 1972 and draws on information collected from various studies. The numerical method employs mathematical formulas related to noise and electric waves, solved through computer modeling. As computers become more powerful, this method continues to gain favor. However, because numerical methods require so much calculation time, they often require dedicated computing centers. Our team instead focused on the more efficient and faster empirical method.

Figure 3. External Acoustic Loads Prediction Result (Spectrum)

Figure 3 shows the results of our calculations, depicting the expected sound spectrum. We can account for various physical phenomena involved in lift-off, such as sound reflection, diffraction, and impingement, that could affect the results of the original empirical method.

Meanwhile, our team used a statistical energy analysis method to predict the internal acoustic load caused by the predicted external acoustic load. This method is often used to predict internal noise environments, including the internal noise of launch vehicles, aircraft, and automobiles. Our research team used a program called VA One SEA to predict these noise effects, as shown in Figure 4.

Figure 4. Modeling of the Payloads and Forcing of the External Acoustic Loads

After predicting the internal acoustic load, we conducted an internal noise control study aimed at decreasing it. A common way to do this is to attach noise-reducing material to the structure. However, the extra weight of the noise-reducing material can decrease performance. To overcome this side effect, we are also conducting a study of active noise control, which is still in progress. Active noise control reduces noise by generating antiphase sound waves that cancel it. Figure 5 shows the experimental results of applying single-input single-output (SISO) noise control; the noise reduction is significant, especially at low frequencies.
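The antiphase principle behind active noise control can be illustrated with a toy calculation. This is only a sketch of the cancellation idea under simplifying assumptions: a single 100 Hz tone, an anti-noise signal in exact antiphase at 90% of the noise amplitude, and hypothetical helper names. It is not the team's control algorithm; a real controller has to estimate the anti-noise signal adaptively from measurements.

```python
import math

SAMPLE_RATE = 8000  # samples per second (illustrative choice)


def tone(freq_hz, amplitude, phase_rad, n_samples):
    """Generate a sampled sine tone."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE + phase_rad)
        for n in range(n_samples)
    ]


def rms(signal):
    """Root-mean-square value of a sampled signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def reduction_db(before, after):
    """Noise reduction in dB achieved by going from `before` to `after`."""
    return 20 * math.log10(rms(before) / rms(after))


# Primary noise: a 100 Hz tone lasting 0.1 s (ten full periods).
noise = tone(100, 1.0, 0.0, 800)
# Anti-noise: same frequency, shifted by half a period, slightly too weak.
anti_noise = tone(100, 0.9, math.pi, 800)
# What a listener (or a sensor) would receive: the sum of both waves.
residual = [a + b for a, b in zip(noise, anti_noise)]

print(f"reduction: {reduction_db(noise, residual):.1f} dB")
```

Even with a 10% amplitude mismatch the residual is a tenth of the original pressure, a 20 dB reduction. Phase errors matter more at short wavelengths, which is consistent with active control working best at low frequencies.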

Figure 5. Experimental Results of SISO Active Noise Control

Our research team applied the acoustic load prediction and control methods to the Korean launch vehicle KSR-III. Through this application, we developed an improved empirical prediction method that is more accurate than previous methods, and we confirmed the usefulness of active noise control by establishing the algorithm best suited to our experimental facilities.

1aNS5 – Noise, vibration, and harshness (NVH) of smartphones

Inman Jang – kgpbjim@yonsei.ac.kr
Tae-Young Park – pty0948@yonsei.ac.kr
Won-Suk Ohm – ohm@yonsei.ac.kr
Yonsei University
50, Yonsei-ro, Seodaemun-gu
Seoul 03722

Heungkil Park – heungkil.park@samsung.com
Samsung Electro Mechanics Co., Ltd.
150, Maeyeong-ro, Yeongtong-gu
Suwon-si, Gyeonggi-do 16674

Popular version of paper 1aNS5, “Controlling smartphone vibration and noise”
Presented Monday morning, November 28, 2016
172nd ASA Meeting, Honolulu

Noise, vibration, and harshness, also known as NVH, refers to the comprehensive engineering of the noise and vibration of a device through the stages of their production, transmission, and human perception. NVH is a primary concern in the car and home appliance industries because many consumers take the quality of noise into account when making buying decisions. For example, a car that sounds too quiet (unsafe) or too loud (uncomfortable) is a definite turnoff. That said, a smartphone may strike you as an acoustically innocuous device (unless you are not a big fan of Metallica ringtones), for which the application of NVH seems unwarranted. After all, who would expect the roar of a Harley from a smartphone? But think again. Albeit small in amplitude (less than 30 dB), smartphones emit an audible buzz that, because of the close proximity to the ear, can degrade call quality and cause annoyance.


Figure 1: Smartphone noise caused by MLCCs

The major culprit behind smartphone noise is the collective vibration of tiny electronic components known as multi-layered ceramic capacitors (MLCCs). An MLCC is basically a condenser made of piezoelectric ceramics, which expand and contract when a voltage is applied (hence piezoelectric). A typical smartphone has a few hundred MLCCs soldered to the circuit board inside. The almost simultaneous pulsations of these MLCCs are transmitted to and amplified by the circuit board, whose vibration eventually produces the distinct buzzing noise shown in Fig. 1. (Imagine a couple hundred rambunctious little kids jumping up and down on a floor almost in unison!) The problem has been further exacerbated by the recent trend in which the name of the game is “The slimmer the better”: because a slimmer circuit board is much easier to flex, it transmits and produces more vibration and noise.

Recently, Yonsei University and Samsung Electro-Mechanics in South Korea joined forces to address this problem. Their comprehensive NVH regime includes the visualization of smartphone noise and vibration (transmission), the identification and replacement of the most problematic MLCCs (production), and the evaluation of the harshness of the smartphone noise (human perception). To visualize smartphone noise, a technique known as near-field acoustic holography is used to produce a sound map as shown in Fig. 2, in which the spatial distribution of sound pressure, acoustic intensity, or surface velocity can be overlaid on a snapshot of the smartphone. Such sound maps help smartphone designers form a detailed mental picture of what is going on acoustically and proceed to rectify the problem by identifying the groups of MLCCs most responsible for the vibration of the circuit board. Engineers can then take corrective action by replacing the (cheap) problematic MLCCs with (expensive) low-vibration MLCCs. Lastly, the outcome of the noise and vibration engineering is measured not only in terms of physical attributes such as sound pressure level, but also in terms of their psychological correlates such as loudness and overall psychoacoustic annoyance. This three-pronged strategy (addressing production, transmission, and human perception) has proven highly effective, and Samsung Electro-Mechanics currently offers the NVH service to a number of major smartphone vendors around the world.


Figure 2: Sound map of a smartphone surface


2aABa3 – Indris’ melodies are individually distinctive and genetically driven

Marco Gamba – marco.gamba@unito.it
Cristina Giacoma – cristina.giacoma@unito.it

University of Torino
Department of Life Sciences and Systems Biology
Via Accademia Albertina 13
10123 Torino, Italy

Popular version of paper 2aABa3 “Melody in my head, melody in my genes? Acoustic similarity, individuality and genetic relatedness in the indris of Eastern Madagascar”
Presented Tuesday morning, November 29, 2016
172nd ASA Meeting, Honolulu

Human hearing abilities are exceptional at identifying the voices of friends and relatives [1]. The potential for this identification lies in the acoustic structures of our words, which convey not only verbal information (the meaning of our words) but also non-verbal cues (such as the sex and identity of the speaker).

In animal communication, recognizing a member of the same species can also be important. Birds and mammals may adjust the signals that serve neighbor recognition, and discriminating between a known neighbor and a stranger can result in strikingly different responses in terms of territorial defense [2].

Indris (Indri indri) are the only lemurs that produce group songs and among the few primate species that communicate using articulated singing displays. The most distinctive portions of the indris’ song are called descending phrases, consisting of between two and five units or notes. We recorded 21 groups of indris in the Eastern rainforests of Madagascar from 2005 to 2015. In each recording, we identified individuals using natural markings. We noticed that group encounters were rare, and hypothesized that song might play a role in providing members of the same species with information about the sex and identity of an individual singer and the emitting group.


Figure 1. A female indri with offspring in the Maromizaha Forest, Madagascar. Maromizaha is a New Protected Area located in the Region Alaotra-Mangoro, east of Madagascar. It is managed by GERP (Primate Studies and Research Group). At least 13 species of lemurs have been observed in the area.

We found that we could effectively discriminate between the descending phrases of individual indris, showing that the phrases have the potential to advertise sex and individual identity. This strengthened the hypothesis that song may play a role in processes such as kinship and mate recognition. Finding that there was a degree of group specificity in the song also supports the idea that neighbor-stranger recognition is important in indris and that the song may function to announce territorial occupation and spacing.


Figure 2. Spectrograms of an indri song showing a typical sequence of different units. In the enlarged area, the pitch contour in red shows a typical “descending phrase” of 4 units. The indris also emit phrases of 2, 3 and more rarely 5 or 6 units.

Traditionally, primate songs are considered an example of a genetically determined display. Thus the following step in our research was to examine whether the structure of the phrases could relate to the genetic relatedness of the indris. We found a significant correlation between the genetic relatedness of the studied individuals and the acoustic similarity of their song phrases. This suggested that genetic relatedness may play a role in determining song similarity.
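The kind of relationship described above, between genetic relatedness and acoustic similarity, can be illustrated with a simple correlation coefficient. The statistical test actually used in the study is not described here, so this sketch substitutes a plain Pearson correlation as a stand-in, and the pairwise values below are invented purely for illustration; they are not data from the indris.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical values for six pairs of individuals (illustration only):
# a relatedness coefficient and an acoustic similarity score per pair.
relatedness = [0.5, 0.5, 0.25, 0.125, 0.0, 0.0]
acoustic_similarity = [0.82, 0.77, 0.64, 0.58, 0.41, 0.47]

r = pearson(relatedness, acoustic_similarity)
print(f"correlation between relatedness and song similarity: r = {r:.2f}")
```

Pairwise measures like these are not statistically independent, since each individual appears in several pairs, so a real analysis would need a test designed for distance matrices rather than a plain correlation.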

For the first time, we found evidence that the similarity of a primate vocal display changes within a population in a way that is strongly associated with kinship. When examining differences between the sexes, we found that male offspring produced phrases that were more similar to those of their fathers, while daughters did not show similarity with either of their parents.


Figure 3. A 3D plot of the dimensions (DF1, DF2, DF3) generated from a discriminant model that successfully assigned descending phrases of four units (DP4) to the emitter. Colours denote individuals. The descending phrases of two (DP2) and three units (DP3) also showed a correct classification rate significantly above chance.

The potential for kin detection may play a vital role in determining relationships within a population, regulating dispersal, and avoiding inbreeding. Singing displays may advertise kin to signal against potential mating, information that females, and to a lesser degree males, can use when forming a new group. Unfortunately, we still do not know whether indris can perceptually decode this information or how they use it in their everyday life. But work like this sets the basis for understanding primates’ mating and social systems and lays the foundation for better conservation methods.

  1. Belin, P. Voice processing in human and non-human primates. Philosophical Transactions of the Royal Society B: Biological Sciences, 2006. 361: p. 2091-2107.
  2. Randall, J. A. Discrimination of foot drumming signatures by kangaroo rats, Dipodomys spectabilis. Animal Behaviour, 1994. 47: p. 45-54.
  3. Gamba, M., Torti, V., Estienne, V., Randrianarison, R. M., Valente, D., Rovara, P., Giacoma, C. The Indris Have Got Rhythm! Timing and Pitch Variation of a Primate Song Examined between Sexes and Age Classes. Frontiers in Neuroscience, 2016. 10: p. 249.
  4. Torti, V., Gamba, M., Rabemananjara, Z. H., Giacoma, C. The songs of the indris (Mammalia: Primates: Indridae): contextual variation in the long-distance calls of a lemur. Italian Journal of Zoology, 2013. 80, 4.
  5. Barelli, C., Mundry, R., Heistermann, M., Hammerschmidt, K. Cues to androgen and quality in male gibbon songs. PLoS ONE, 2013. 8: e82748.