Acoustical Society of America 
142nd Meeting Lay Language Papers



Synthesis fidelity and vowel identification

Peter F. Assmann - assmann@utdallas.edu
William F. Katz
School of Human Development
The University of Texas at Dallas
Richardson TX 75083
(972) 883 2435

Popular version of paper 2aSC17
Presented Tuesday morning, Dec. 4, 2001
142nd ASA Meeting, Fort Lauderdale, FL

Introduction
        Recent advances in speech synthesis have provided new techniques for manipulating the properties of speech. These make it possible, for example, to simulate highly realistic changes in voice quality, or to enhance the acoustic properties of speech that pose difficulties for second-language learners or hearing-impaired listeners. In this paper we report the results of experiments examining the importance of time-varying changes in the acoustic properties of vowels. Traditionally, vowels were often treated as static entities: acoustic measurements were taken from a single cross-section at the vowel midpoint, and in synthesis the spectral properties of vowels were held constant over time. However, recent studies have shown that time-varying properties of vowels make an important contribution to identification.

        Vowel quality is determined primarily by the frequencies of the vocal tract resonances, called formants, while the pitch of the voice is determined mainly by the fundamental frequency (F0), associated with vibration of the vocal folds. Synthesis studies have shown that American English vowels are less accurately identified when natural time-varying changes in the formant frequencies are eliminated (by "flattening" these changes over time). On the other hand, holding F0 constant over time produces a monotone voice pitch, but has little effect on vowel identification accuracy. Thus formant frequency movement is important for the identification of American English vowels, but F0 movement has little effect.
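
        As a rough illustration of what "flattening" means, the sketch below replaces a time-varying frequency track with a constant one. The track values, the function name flatten_track, and the choice of the mean as the constant value are all assumptions made for illustration; this is not the procedure used in the experiments.

```python
from statistics import mean


def flatten_track(track_hz):
    """Return a constant track with the same length as the input.

    Assumption: the constant is the mean of the original track. Other
    choices (for example, the value at the vowel midpoint) are possible.
    """
    constant = mean(track_hz)
    return [constant] * len(track_hz)


# Hypothetical second-formant (F2) track for a vowel such as "hayed",
# sampled at equal time steps. Values are illustrative, not measured.
f2_natural = [1900.0, 2000.0, 2100.0, 2200.0, 2300.0]
f2_flat = flatten_track(f2_natural)

print(f2_flat)
```

        The same operation could be applied to an F0 track to produce a monotone pitch while leaving the formant tracks untouched.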

        One limitation in the earlier experiments was that the synthesized versions were less accurately identified than natural vowels. To overcome this limitation we used a high-quality vocoder, STRAIGHT, developed by Hideki Kawahara, to re-examine the effects of spectral change and source properties in vowel identification. The results confirmed (1) that time-varying changes in the formants are important for the identification of American English vowels, and (2) that changes in fundamental frequency have little effect on vowel identification.

Stimuli

  • The stimuli were natural and synthesized vowels in /h_d/ context:
    • "heed", "hid", "hayed", "head", "had", "hud", "hawed", "hod", "hoed", "hood", "who'd", "herd".
  • 12 vowels from each of 3 talkers in each of 5 talker groups (adult males, adult females, and 3 groups of children aged 3, 5, and 7).
  • Listeners were 10 college students, native English speakers, with normal hearing.
  • Vowels were presented in randomized sequence over headphones.
  • Listeners responded by selecting one of 12 labeled response boxes on the computer screen.
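
        The size of the natural stimulus set and the randomized presentation order can be sketched as follows. This assumes one token of each word per talker; the synthesized versions are not enumerated. The function name build_trials and the seed parameter are hypothetical and are not taken from the study.

```python
import random
from itertools import product

WORDS = ["heed", "hid", "hayed", "head", "had", "hud",
         "hawed", "hod", "hoed", "hood", "who'd", "herd"]
TALKER_GROUPS = ["adult male", "adult female", "age 7", "age 5", "age 3"]
TALKERS_PER_GROUP = 3


def build_trials(seed=None):
    """List every (word, group, talker) token, then shuffle the order."""
    talkers = range(1, TALKERS_PER_GROUP + 1)
    trials = list(product(WORDS, TALKER_GROUPS, talkers))
    # A dedicated Random instance keeps the shuffle reproducible per seed.
    random.Random(seed).shuffle(trials)
    return trials


trials = build_trials(seed=0)
print(len(trials))  # 12 words x 5 groups x 3 talkers = 180
```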

Results

Result 1: Listeners identified the synthesized vowels as well as the natural vowels.

Key to talker groups:

  • M: Adult males
  • F: Adult females
  • 7: 7-year-old children
  • 5: 5-year-old children
  • 3: 3-year-old children

Result 2: Holding the fundamental frequency (F0) constant had little effect on identification accuracy.

Result 3: Holding the formant frequencies constant led to a 23% drop in mean accuracy.

Key to bar colors:

  • Blue bars: synthesized vowels with natural formants / natural F0
  • Red bars: synthesized vowels with natural formants / flat F0
  • Black bars: synthesized vowels with flat formants / natural F0
  • White bars: synthesized vowels with flat formants / flat F0

Discussion
        While there is no question that high-quality synthesis requires careful modeling of F0, our results show that time-varying changes in F0 do not help listeners to identify vowels. This finding is somewhat surprising, because several models of vowel identification had predicted that F0 movement should help to "sketch out" the locations of the formants when the F0 is high, as in children's voices. On the other hand, the results indicate that important dynamic information is provided by changes in the formants over time. The largest effects are heard for the vowels /e/ ("hayed") and /o/ ("hoed"), but several other vowels undergo clear deterioration when their formants are held constant, as the audio demonstration shows.

