Acoustical Society of America
159th Meeting Lay Language Papers





Data-based Speech Synthesis: Progress and Challenges

 

H. Timothy Bunnell - bunnell@asel-udel.edu

Center for Pediatric Auditory and Speech Sciences, Nemours Biomedical Research

1600 Rockland Rd.

Wilmington, DE 19803

 

Ann K. Syrdal - syrdal@research.att.com

AT&T Labs - Research

180 Park Ave.

Florham Park, NJ 07932

 

Popular version of paper 3pID1

Presented Wednesday afternoon, April 21, 2010

159th ASA Meeting, Baltimore, MD

 
 

Initial approaches to synthesizing speech by computer required extensive and detailed knowledge about speech articulation and acoustics, encoded as rules. The synthetic speech generated by these rule-based systems is typically described by listeners as sounding robotic rather than human.

 

Our talk will describe some recent developments using a newer, data-based approach: concatenative speech synthesis. With data-based techniques, a synthetic voice is derived from a digital database of recorded human speech, typically recorded from one talker. Short snippets of natural speech are then strung together into a sequence (concatenated) to create synthetic speech utterances.
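
To make the idea concrete, here is a small Python sketch of the core operation. The unit names and the stand-in waveforms are purely illustrative assumptions of ours, not data or code from any actual synthesizer; a real system stores digitized audio cut from recordings of the talker.

    import numpy as np

    SAMPLE_RATE = 16000  # samples per second for the illustrative waveforms

    def fake_snippet(freq_hz, dur_s=0.1):
        # Stand-in for a recorded snippet of one talker's speech; a real
        # system would retrieve audio excised from the recorded database.
        t = np.arange(int(SAMPLE_RATE * dur_s)) / SAMPLE_RATE
        return 0.1 * np.sin(2 * np.pi * freq_hz * t)

    # Hypothetical inventory mapping each basic speech sound to one recording.
    inventory = {
        "h": fake_snippet(200),
        "eh": fake_snippet(500),
        "l": fake_snippet(300),
        "ow": fake_snippet(400),
    }

    def synthesize(units):
        # Concatenation is literally joining the stored snippets end to end.
        return np.concatenate([inventory[u] for u in units])

    waveform = synthesize(["h", "eh", "l", "ow"])  # a crude "hello"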

 

Early concatenative synthesizers' speech databases were limited to one pre-selected instance of each of the possible sequential pairs (diphones) of basic speech sounds that could occur within or between words in a language. In English, a minimum inventory of about 1000 diphones was needed. These diphones were frequency-warped and shrunk or stretched in duration as needed for a given synthetic utterance. These modifications introduced acoustic distortion, so diphone concatenative synthesis also sounded quite robotic.
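
As a rough illustration, the sketch below shows two steps a diphone system performs: turning a phone sequence into the diphones to retrieve, and stretching a stored example to a new duration. The phone labels and the simple interpolation-based stretch are illustrative assumptions, not the signal processing of any particular system, but modifications of this kind are what introduced the distortion mentioned above.

    import numpy as np

    def to_diphones(phones):
        # ['sil','h','eh','l','ow','sil'] -> ['sil-h','h-eh','eh-l','l-ow','ow-sil']
        # Each diphone spans the transition between two neighbouring sounds.
        return [f"{a}-{b}" for a, b in zip(phones[:-1], phones[1:])]

    def stretch(waveform, factor):
        # Naive duration change by resampling the single stored example.
        # Duration and pitch modifications like this distort the signal,
        # which is part of why diphone synthesis sounded robotic.
        n_out = int(len(waveform) * factor)
        old_axis = np.linspace(0.0, 1.0, num=len(waveform))
        new_axis = np.linspace(0.0, 1.0, num=n_out)
        return np.interp(new_axis, old_axis, waveform)

    print(to_diphones(["sil", "h", "eh", "l", "ow", "sil"]))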

 

In the late 1990s, unit selection concatenative synthesis was introduced. Unit selection text-to-speech (TTS) systems operate on all the data from a large database, typically several hours of continuous natural speech. The constituent units of the speech database are located, annotated on several linguistic levels, and indexed for retrieval. The synthesis algorithm then amounts to a complex data search and retrieval process in which small units of recorded speech are selected to match the units required to produce the intended synthetic utterance. Since there are many instances of each speech unit to choose from, the selected units require little, if any, signal processing to change their frequency or duration. Consequently, unit selection synthetic speech resembles natural speech much more closely than speech from the earlier techniques. For example, AT&T Natural Voices TTS is a unit selection system: http://www2.research.att.com/~ttsweb/tts/demo.php
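
The search can be pictured as a shortest-path problem: for every required unit there are many recorded candidates, and the synthesizer picks the sequence that minimizes a combination of how well each candidate fits its slot (a target cost) and how smoothly neighbouring candidates join (a join cost). The Python sketch below is a simplified dynamic-programming version with toy cost functions of our own invention; it illustrates the general technique, not the algorithm of any particular product.

    def select_units(targets, candidates, target_cost, join_cost):
        # best[c] = (lowest total cost of any path ending in candidate c, that path)
        best = {c: (target_cost(targets[0], c), [c]) for c in candidates[targets[0]]}
        for t in targets[1:]:
            new_best = {}
            for c in candidates[t]:
                prev, (cost, path) = min(
                    best.items(), key=lambda kv: kv[1][0] + join_cost(kv[0], c)
                )
                new_best[c] = (cost + join_cost(prev, c) + target_cost(t, c), path + [c])
            best = new_best
        return min(best.values(), key=lambda v: v[0])[1]

    # Toy example: each candidate is (unit name, position in the recorded database);
    # the join cost prefers neighbours that were close together in the original speech.
    candidates = {
        "h": [("h", 3), ("h", 17)],
        "eh": [("eh", 5), ("eh", 9), ("eh", 21)],
        "l": [("l", 6), ("l", 30)],
    }
    chosen = select_units(
        ["h", "eh", "l"],
        candidates,
        target_cost=lambda t, c: 0.0,
        join_cost=lambda a, b: abs(a[1] - b[1]) * 0.01,
    )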

 

Already, some systems based on very large collections of recorded continuous speech from an individual can produce short sentences that are difficult to distinguish from natural speech. However, no present system is able to render paragraph-length texts in a manner that is wholly natural. In part, this is because the systems are doing a very superficial job of interpreting the meaning that the text is intended to convey. This is an area where much additional work is needed. This shortcoming is more or less apparent depending upon the nature of the material being read. Expressive material is more challenging for TTS than reading factual information.

 

Along with the enhanced naturalness and intelligibility of recent concatenative systems, the number of practical applications for synthetic speech has increased. These fall broadly into two categories: business applications, and assistive technology. Among business applications that have benefited from new synthesis technology are a variety of limited-domain applications (e.g., navigation) and unrestricted text readers such as the TTS voices now provided in ebook readers.

 

For several decades, people with severe speech disorders have used communication aids (now called Speech Generating Devices or SGDs) that employ synthetic speech. These aids allow the user to type or otherwise select words and sentences to be spoken, and for some people represent their primary mode of communication. Until recently, most SGDs provided only rule-based synthesis that sounded robotic, and certainly unlike any specific person. The physicist Stephen Hawking is a prominent example of a person who uses just such an SGD. In Professor Hawking's case, he has used a specific model for such a long time that he now identifies with its voice quality. However, for many other users of SGDs, the availability of highly natural-sounding and more intelligible concatenative voices has led to significant improvements in the quality of their communication.

 

A foremost concern of SGD users is that the synthetic speech be intelligible. For many years, the DECtalk systems were regarded as the most intelligible systems on the commercial market. However, recent unit concatenation systems are demonstrably more intelligible than the rule-based systems of the 1980s. For instance, in the Nemours Speech Research Laboratory, we compared five female TTS voices, including four concatenative voices and one DECtalk voice, for intelligibility. The most striking result of this experiment was the difference in intelligibility between the DECtalk voice and all of the unit concatenation systems: listeners found the DECtalk voice much harder to understand than any of the concatenative voices.

 

This does not mean that all concatenative TTS systems are necessarily more intelligible than rule-based systems. Simply put, concatenative TTS voices are only as good as the speech data that goes into them. This becomes an especially important factor in one emerging area of application for concatenative TTS systems: the creation of personal TTS voices for individual users. Software is now available that allows any individual with fluent speech to spend a few hours recording their speech on a home computer and then have those recordings converted into a concatenative TTS voice. This is an especially attractive capability for people who have been diagnosed with a neurodegenerative disease such as ALS (also called Lou Gehrig's disease). ALS patients typically lose the ability to speak as the disease progresses, but while they are still speaking fluently, they can bank their voice for later use in a concatenative TTS system. ALS patients can then have an SGD with synthetic speech that closely resembles their own voice. Recent press coverage of film critic Roger Ebert has revealed his enthusiasm over receiving an SGD built from recordings of his own voice.

 

A number of laboratories have developed applications around the concept of voice banking for people who are at risk for losing their voice. The most advanced of these projects at present is the ModelTalker project (http://www.ModelTalker.com), which has allowed close to 200 people to create personalized concatenative TTS voices from recordings made on a personal computer at home. The ModelTalker project aims to allow novice users to record speech data of adequate quality and quantity for generating a concatenative voice without the need for professional recording equipment and expert assistance.

 

While the ability to create acceptable personalized TTS voices has largely been realized with concatenative synthesis, considerable room for improvement remains. Ideally, SGD users should be able to make their voice sound happy, sad, or angry and express surprise or doubt not only with words, but with tone of voice. At present, the only good method for producing these effects in data-based synthesis is to expand the inventory of recorded speech to include utterances spoken in a happy, sad, angry, etc. voice. Unfortunately, greatly increasing the diversity and size of the speech inventory to be recorded is impractical or even impossible for many potential users. Indeed, one of the greatest barriers to the broad use of voice banking for ALS patients is simply the amount of speech they must record to produce an acceptable basic concatenative voice.

 

Ultimately, we expect that research will provide solutions to the problems of creating fully natural-sounding and expressive synthetic speech. Probably as part of those solutions, we will also learn how to capture the voice quality of an individual from a relatively small but representative sample of their fluent speech, whether the individual is an adult or a child. Moreover, it is possible that this will allow us to go one step further and generate realistic, natural-sounding voices for dysarthric individuals who presently cannot produce anything more than a few isolated vowel sounds. In fact, separate projects at Oregon Health & Science University and Northeastern University are already exploring how data-based TTS technology might be used to achieve this. There is much work ahead, but great promise for TTS in assistive technology.