ASA Lay Language Papers
162nd Acoustical Society of America Meeting


Hearing pitch despite its absence in "whispered" speech

Yukiko Sugiyama – yukiko_sugiyama@mac.com
Keio University
4-1-1 Hiyoshi, Kohoku-ku
Yokohama, 223-8521, Japan

Popular version of paper 5aSCb
Presented Friday morning, November 4, 2011
162nd ASA Meeting, San Diego, Calif.

Languages have characteristic prosodic (or melodic) properties of their own.  Because of this, even when we encounter a language we have never heard before, we can still form a general idea of what kind of language it is.  The word prosody of Tokyo Japanese (simply Japanese hereafter) is often described as pitch accent, but its exact nature is much debated because pitch accent shares properties with both of the two major prosodic types observed in languages across the world.  The present study investigates whether Japanese pitch accent is characterized by properties other than pitch.

Germanic languages such as English, German, and Dutch are typical examples of stress accent languages, in which stress accent is specified on particular syllables within words.  For example, when the word record has stress on the first syllable, it is a noun.  When it has stress on the second syllable, it is a verb.  Languages such as Mandarin Chinese, Thai, and Vietnamese have a different kind of word prosody known as tone.  A tone is a pitch contour, and pitch is a perceptual quality of sounds that we describe as “high” or “low.”  As the rate at which the vocal folds vibrate, known as the fundamental frequency of one’s voice, increases, we perceive the sound as higher in pitch, although the relationship is not strictly linear.  In tone languages, each syllable in a word is specified for its tone.  For example, the one-syllable word ma in Mandarin Chinese can mean four different things depending on its pitch contour.
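
The remark that perceived pitch does not rise linearly with fundamental frequency can be illustrated with the mel scale, one common psychoacoustic approximation.  The sketch below is not part of the study: the choice of the mel scale, the particular formula variant (2595 log10(1 + f/700)), and the function name hz_to_mel are assumptions made only for illustration.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels.

    Uses one widely cited variant of the mel formula; the choice of this
    variant is an illustrative assumption, not something the paper specifies.
    """
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


# The same 100 Hz increase corresponds to a smaller perceptual step
# at higher frequencies than at lower ones.
step_low = hz_to_mel(200.0) - hz_to_mel(100.0)     # 100 Hz -> 200 Hz
step_high = hz_to_mel(1100.0) - hz_to_mel(1000.0)  # 1000 Hz -> 1100 Hz
print(f"low step: {step_low:.1f} mel, high step: {step_high:.1f} mel")
```

Under this approximation, the step from 100 Hz to 200 Hz is perceptually larger than the step from 1000 Hz to 1100 Hz, even though both span 100 Hz.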

The word prosody of Japanese is an interesting case in that it has characteristics of both stress accent languages and tone languages.  As with stress accent in English, Japanese words have accent on specified syllables, and the word acts as a unit that characterizes prosodic patterns.  Although words can have no accent, unaccented words also have a specific prosodic pattern.  At the same time, Japanese is similar to tone languages in that it uses fundamental frequency (F0) to express accent information.  For example, the sentence kare wa tori ga ii can mean two different things depending on whether tori has pitch accent on the final syllable or has no accent.  The sentence means “he prefers to be the last person (to perform on the stage)” when tori has accent on the final syllable, whereas it means “he prefers chicken” when tori has no accent.  Figure 1 shows the spectrograms and the F0 movements of the utterances containing the accented word (Figure 1a) and the unaccented word (Figure 1b).  The blue lines overlaying the spectrograms show the movements of the F0.  For the accented word, the F0 is higher overall on the second syllable of tori than on the first, whereas for the unaccented word the F0 stays relatively flat.  In addition, the F0 falls drastically on the following syllable ga for the accented word, but it remains high for the unaccented word.

(a) The sentence containing the accented word
(b) The sentence containing the unaccented word

Figure 1. The F0 movements for the accented word (a) and the unaccented word (b).  The accented syllable is marked with an asterisk (*) in the figure.  The thin, more or less regular vertical striations in the spectrograms are indications of vocal fold vibrations.


As the figure illustrates, the F0 is regarded as the primary acoustic correlate of pitch accent in Japanese.  However, studies of other languages show that linguistic information is often encoded in more than one way, so the F0 is unlikely to be the only correlate.  For example, in English, stressed syllables are typically longer in duration, have greater F0 movement, and tend to have greater intensity.  Studies of tone languages likewise suggest that F0 may not be the only cue for tone.  Nevertheless, previous studies that measured vowel duration or intensity as a correlate of Japanese pitch accent have produced inconsistent results at best and have failed to identify secondary cues for pitch accent.  Therefore, the present study attempts to find correlates of pitch accent from the other end, i.e., perception.

In the experiment, a native speaker of Japanese produced 14 pairs of sentences, including the pair shown in Figure 1.  These utterances were then edited by replacing the F0 with random noise, creating artificial “whispered” speech.  Since whisper is normally spoken with no vocal fold vibrations, it has no pitch.  However, when people whisper, it is possible that they try to convey accent information in a way they do not in normal speech.  For this reason, the “whispered” speech was created artificially in this study.  Figure 2 shows the “whispered” versions of the utterances shown in Figure 1.  There is no blue line in the figure, indicating that the speech analysis program detected no F0.  The original utterances and their “whispered” versions were presented to 21 Japanese speakers.  Their task was to identify the words they heard by choosing between two written alternatives, which represented the accented word and its unaccented counterpart.  The results showed that accuracy for the original speech exceeded 95 percent, while accuracy for the “whispered” speech was roughly 65 percent.  Although the accuracy was much lower for the “whispered” speech, it was better than chance, which is 50 percent for a choice between two alternatives.  The listeners perceived pitch accent where no fundamental frequency cues were given.  This means that some acoustic properties other than the F0 conveyed the accent information.  Acoustic analyses are currently underway to identify which acoustic cues enabled the listeners to distinguish the accented and unaccented words in the “whispered” speech.
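
To see why an accuracy of roughly 65 percent can count as better than chance in a two-alternative task, one can compute an exact one-sided binomial tail probability.  The sketch below is not the analysis used in the study.  The trial count of 100 is a purely hypothetical value, since the number of trials per condition is not given here, and the function name binomial_tail is invented for illustration.  Pooling trials in this way also ignores the fact that responses from the same listener are not independent.

```python
from math import comb


def binomial_tail(n_trials: int, n_correct: int, p_chance: float = 0.5) -> float:
    """Return P(X >= n_correct) for X ~ Binomial(n_trials, p_chance)."""
    return sum(
        comb(n_trials, k) * p_chance**k * (1.0 - p_chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


# Hypothetical numbers: 100 trials, of which 65 are answered correctly.
# These are NOT the study's actual counts.
n_trials = 100
n_correct = 65
p_value = binomial_tail(n_trials, n_correct)
print(f"P(at least {n_correct}/{n_trials} correct by guessing) = {p_value:.4f}")
```

With these assumed numbers the tail probability falls below 0.01, so pure guessing would rarely produce 65 percent correct.  With far fewer trials the same percentage would be much less convincing.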

(a) The sentence containing the accented word
(b) The sentence containing the unaccented word

Figure 2. The “whispered” versions of the utterances shown in Figure 1.  The spectrograms are blurry overall and there is no blue line to indicate the F0.
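
Why an analysis program finds no F0 in noise-excited speech can be illustrated with a toy autocorrelation pitch detector.  This is a minimal sketch under stated assumptions, not the program used in the study: the sampling rate, the 75 to 400 Hz search range, the voicing threshold of 0.5, and the function name estimate_f0 are all hypothetical choices.  A periodic signal produces a strong autocorrelation peak at its period, whereas random noise does not.

```python
import math
import random


def estimate_f0(signal, sample_rate, f_min=75.0, f_max=400.0, threshold=0.5):
    """Return an F0 estimate in Hz, or None if no clear periodicity is found.

    Searches for the lag with the highest energy-normalized autocorrelation
    inside the lag range implied by [f_min, f_max].
    """
    energy = sum(x * x for x in signal)
    if energy == 0.0:
        return None
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_corr = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(
            signal[i] * signal[i + lag] for i in range(len(signal) - lag)
        ) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    # Treat weak peaks as "unvoiced": no F0 is reported.
    if best_lag is None or best_corr < threshold:
        return None
    return sample_rate / best_lag


sample_rate = 8000  # Hz, assumed for this toy example
# A 200 Hz pure tone stands in for voiced speech.
voiced = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(1600)]
# Gaussian white noise stands in for the noise-excited "whispered" signal.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]

f0_voiced = estimate_f0(voiced, sample_rate)
f0_noise = estimate_f0(noise, sample_rate)
print(f0_voiced, f0_noise)
```

In this toy setting the tone yields an estimate at its 200 Hz period, while the noise yields no estimate at all, which parallels the missing blue line in Figure 2.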
