
Synthesis of Human-Like Laughter:
Toward Machine Synthesis of Human Speech

Shiva Sundaram (shiva.sundaram@usc.edu) and Shrikanth Narayanan (shri@sipi.usc.edu)
Speech Analysis and Interpretation Laboratory,
University of Southern California, Los Angeles, CA, USA

Popular version of paper 1aSC11
Presented Monday morning, November 15, 2004
148th ASA Meeting, San Diego, CA

1. Introduction
Speech is one of the basic forms of human communication. The verbal aspects of spoken communication rely on the specific words chosen to convey explicit intent and desires in an exchange. However, speech also carries significant implicit information, such as intonation, that reflects the context and the emotional state of the speaker. Consider, for example, the sequence of words "did you go to the post office." To use these words to pose a question, a speaker controls his or her intonation. Of course, this question can be relayed either plainly or with any number of emotions: a hint of harshness to express anger, or laughter to indicate amusement or even sarcasm. Even in spoken communication, then, linguistic cues (as conveyed by words) are accompanied by paralinguistic cues such as laughter to make up the rich communication fabric. For a machine voice to sound truly natural, it is therefore critical to capture the rich linguistic and paralinguistic details embedded in human speech. Enabling automated synthesis of expressive speech is a critical area in current speech communication research.

While the primary focus of past efforts in speech synthesis has been on improving intelligibility, recent work targets improving the emotional and expressive quality of synthetic speech. Natural expressive quality is essential for long exchanges of dialogue, and even for information-relaying monologues. Many components play a role in imparting expressive, emotional quality to speech, including variations in speech intonation and timing, the appropriate choice of words, and the use of other non-verbal cues. One of the key expressive qualities is happiness. Prior work by other researchers, and our own, has shown that it is one of the most challenging synthesis problems, and that one has to look beyond intonation variation. Laughter is a key attribute of this realm, and the focus of the present work.

This paper addresses the analysis and synthesis of the acoustics of laughter. Our main application is to aid the synthesis of "happy-sounding" speech. In most situations, happy speech may be better defined as speech that conveys positive emotions. Subjective tests reveal that it is difficult to discern happy emotions in speech purely on the basis of intonational differences. While in face-to-face interactions it is possible to have happy speech without laughter, other non-verbal cues such as facial expressions help the interlocutor understand that the underlying emotion is positive or happy. In speech-only situations, however, laughter is often used as the cue to express happiness.

Humans use different types of laughter to express different levels of gladness or elation. Furthermore, the same speaker may laugh differently in different situations. The synthesis technique developed here assumes a dynamical oscillator model for laughter. It provides parametric control of the laughter model and can thus synthesize a wide variety of happy expressions. It can also capture and generate speaker-specific traits in laughter.

2. Laughter: Its Components and Synthesis
Laughter is a highly complex physiological process of breathing and voicing. A wide variety of terminology is used to describe its various aspects, so a description relevant to this research follows. A short burst or a train of laughter (a laughter bout) comprises two main components: voiced laughter calls, utterances that excite the vocal cords and generate sound, and unvoiced sounds, breathing sounds generated as air passes through the larynx without vocal cord vibration. Usually the voiced and unvoiced parts alternate in a laughter event, but they can also co-occur. Figure 1 shows a typical laughter bout illustrating these parts; a sketch of this structure as a simple data type follows the figure.

Figure 1. A laughter bout recorded from a human speaker, showing alternating voiced calls and unvoiced (breath) segments, with the fitted spring-mass model overlaid.
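To make this terminology concrete, here is a minimal data-structure sketch in Python. The class and field names are our own illustrative choices, not definitions from the paper:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Segment:
        """One piece of a laughter bout."""
        voiced: bool      # True for a laughter call, False for a breath sound
        duration: float   # seconds
        intensity: float  # relative energy of the segment

    @dataclass
    class LaughterBout:
        """A short burst or train of laughter: alternating (or overlapping)
        voiced calls and unvoiced breathing sounds."""
        segments: List[Segment]

    # a toy three-call bout with breath gaps in between (values invented)
    bout = LaughterBout([
        Segment(True, 0.12, 1.0), Segment(False, 0.08, 0.2),
        Segment(True, 0.11, 0.8), Segment(False, 0.09, 0.2),
        Segment(True, 0.10, 0.5),
    ])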


A laughter event starts with a contextual or semantic impulse that puts the speaker in a laughing state. While laughing, the speaker produces continuous bursts of air exhalation and intake, each lasting a short period. This intake and exhalation can be seen as oscillatory behavior, observable in most laughter bouts. Figure 1, a plot of an actual laughter sample from a human, illustrates this observation.

In summary, synthesis of laughter involves two main tasks: synthesizing the actual sound of each voiced laughter call, and deciding the duration and energy of each call so as to synthesize the overall laughter bout.
We model this oscillatory behavior of alternating inhalation and exhalation with the equations that describe the harmonic motion of a mass attached to the end of a spring (illustrated in Figure 2). In this simple mass-spring system, the stiffness of the spring and the weight of the mass determine the frequency of oscillation, while the initial displacement and the damping factor determine how long the mass continues to oscillate. A short sketch of this oscillator follows Figure 2.

Figure 2. The damped mass-spring system used to model the alternating exhalation and inhalation of laughter.
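As a concrete illustration of the model, the following Python sketch evaluates the standard underdamped response of such a system; the parameter values are our own illustrative choices, not values from the paper.

    import numpy as np

    def oscillator_displacement(t, mass=1.0, stiffness=400.0, damping=2.0, x0=1.0):
        """Displacement of an underdamped mass-spring system released from x0.

        All default values are illustrative, not from the paper.
        Assumes the underdamped case (gamma < omega0).
        """
        omega0 = np.sqrt(stiffness / mass)       # natural frequency (rad/s)
        gamma = damping / (2.0 * mass)           # exponential decay rate
        omega_d = np.sqrt(omega0**2 - gamma**2)  # damped oscillation frequency
        # drops the small (gamma/omega_d)*sin term of the exact solution
        return x0 * np.exp(-gamma * t) * np.cos(omega_d * t)

    # two seconds of model time: each positive half-cycle will become one call
    t = np.linspace(0.0, 2.0, 2000)
    x = oscillator_displacement(t)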


Using these equations, we calculate the duration of each voiced call in the synthesized laughter from the duration of a positive displacement of the mass. During a negative displacement of the spring-mass system an inhalation takes place, so the duration of the inhalation can be calculated the same way. With the duration and intensity of each voiced call known, conventional signal-processing techniques in speech synthesis can be used to generate the overall laughter bout. The pitch variation throughout the bout can also be analyzed from pre-recorded data of a real human and introduced during synthesis.
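Continuing the oscillator sketch above, one way to read these durations off the model is to segment the trajectory at its zero crossings; the code below is our own illustration of that idea, not the paper's implementation.

    import numpy as np

    def bout_timing(t, x):
        """Split an oscillator trajectory into voiced calls and inhalations.

        Returns (calls, inhalations, intensities): two lists of
        (start_time, duration) pairs, plus one peak amplitude per call.
        """
        positive = x > 0
        # indices where the sign flips mark segment boundaries
        boundaries = np.flatnonzero(np.diff(positive.astype(np.int8))) + 1
        segments = np.split(np.arange(len(x)), boundaries)

        calls, inhalations, intensities = [], [], []
        for seg in segments:
            start, duration = t[seg[0]], t[seg[-1]] - t[seg[0]]
            if positive[seg[0]]:
                calls.append((start, duration))
                intensities.append(float(x[seg].max()))  # peak sets loudness
            else:
                inhalations.append((start, duration))
        return calls, inhalations, intensities

    # t and x come from the oscillator sketch after Figure 2
    calls, inhalations, intensities = bout_timing(t, x)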

Figure 1 shows the audio sample of a laughter bout recorded from a real human speaker. Overlaid on this waveform is the fitted model from the equations that govern the movement of the spring-mass system. During each positive displacement of the mass-spring system there is a voiced call, and at each negative displacement there is an inhalation (illustrated as the gaps in the plot). The amplitude fall-off of the model also follows the fall-off in amplitude of the voiced calls. On close examination, however, there is a mismatch between the model fit and the actual data, mainly due to the oversimplified spring-mass system used to illustrate the model. More complex models can be constructed (if necessary) to fit the audio sample better, with multiple damping factors, multiple forced oscillations, and variable stiffness and mass parameters. But this may not be necessary: such a model would only be right for that particular audio sample, and in reality no two laughter bouts are exactly the same.
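As a rough sketch of how such a fit might be obtained (assuming the recorded bout is available as a NumPy array; the rectify-and-smooth envelope and the least-squares fit are our simplifications, not the paper's procedure):

    import numpy as np
    from scipy.optimize import curve_fit
    from scipy.ndimage import uniform_filter1d

    def rectified_model(t, x0, gamma, omega_d):
        """Magnitude of the damped-oscillator response, used as a template."""
        return np.abs(x0 * np.exp(-gamma * t) * np.cos(omega_d * t))

    def fit_bout(audio, sample_rate):
        """Fit the oscillator parameters to a recorded laughter bout."""
        t = np.arange(len(audio)) / sample_rate
        # crude amplitude envelope: rectify, then smooth over ~20 ms
        envelope = uniform_filter1d(np.abs(audio), size=max(1, sample_rate // 50))
        # initial guesses: full amplitude, ~1 s decay, ~5 calls per second
        p0 = (envelope.max(), 1.0, 2.0 * np.pi * 5.0)
        params, _ = curve_fit(rectified_model, t, envelope, p0=p0, maxfev=10000)
        return params  # fitted (x0, gamma, omega_d)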

From a synthesis point of view, this simple model proves adequate and flexible. One can easily control the trailing off of the laughter bout, its duration, and the frequency of occurrence of the voiced calls; the bout can be made to last longer, or temporal variations can be introduced within it.
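For instance, reusing oscillator_displacement from the sketch above, two hypothetical settings (with invented parameter values) might look like:

    import numpy as np

    t = np.linspace(0.0, 3.0, 3000)

    # a long, slowly trailing laugh: light damping, moderate call rate
    x_long = oscillator_displacement(t, stiffness=250.0, damping=1.0)

    # a short chuckle: heavy damping kills the oscillation after a few calls
    x_short = oscillator_displacement(t, stiffness=400.0, damping=8.0)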

Speaker-specific traits can be introduced during synthesis by analyzing, from pre-recorded data, the pitch variation and the frequency of occurrence of the voiced calls in a speaker's laughter. Because voiced calls are short bursts, and sometimes vowel-like, one can synthesize a wide variety of laughter for a particular person with limited or even no laughter data: simply take a short voiced segment from any speech data, determine the duration and intensity of each voiced call from the spring-mass system, and generate a whole bout of laughter in any suitable style. It could be a long, complex bout with many voiced calls or a simple short laugh with one or two.
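A minimal sketch of this generation step, under the same illustrative assumptions as the earlier snippets: a short vowel-like segment is time-scaled to each call's duration (here by crude linear resampling rather than a pitch-preserving method), scaled to its intensity, and separated by silent inhalation gaps.

    import numpy as np

    def synthesize_bout(voiced_segment, calls, inhalations, intensities,
                        sample_rate):
        """Assemble a laughter bout from one short vowel-like segment.

        voiced_segment -- a brief voiced excerpt from any speech recording
        calls, inhalations, intensities -- timing and energy from the model
        """
        pieces = []
        for i, ((_, call_dur), peak) in enumerate(zip(calls, intensities)):
            n = max(1, int(call_dur * sample_rate))
            # crude duration change by linear resampling; a real system
            # would use a pitch-preserving method instead
            idx = np.linspace(0.0, len(voiced_segment) - 1, n)
            call = np.interp(idx, np.arange(len(voiced_segment)), voiced_segment)
            pieces.append(peak * call / (np.abs(call).max() + 1e-9))
            if i < len(inhalations):  # silent gap stands in for the breath
                pieces.append(np.zeros(int(inhalations[i][1] * sample_rate)))
        return np.concatenate(pieces)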

The main advantage of the proposed method is its parametric control over the generated laughter: it is possible to create a wide variety of laughter types with virtually no pre-recorded data. The model is also scalable; its complexity can be set at any level and changed in real time to suit the synthesis application. And it is flexible in the sense that any existing speech synthesis technique can use this model to synthesize laughter.


3. Conclusion and Future Work
It is important to point out that a comprehensive understanding of the effects of including laughter in synthesized speech is still needed. As noted earlier, laughter can express a wide variety of speaker attitudes, both positive and negative emotional states among them. An in-depth understanding can come from controlled subjective experiments, which require generating a variety of controlled laughter for experimental repeatability. The synthesis technique proposed here can be used for this purpose.

There is not always a distinct segment in speech where laughter is present to express gladness or a positive emotion. Often, speech and laughter are articulated together (known as "speech-laughs"). The focus of our ongoing research is to develop techniques to synthesize these so-called speech-laughs.

