Data-based Speech Synthesis: Progress and Challenges

H. Timothy Bunnell - bunnell@asel-udel.edu
Center for Pediatric Auditory and Speech Sciences, Nemours Biomedical Research
1600 Rockland Rd.
Wilmington, DE 19803

Ann K. Syrdal - syrdal@research.att.com
AT&T Labs - Research
180 Park Ave.
Florham Park, NJ 07932

Popular version of paper 3pID1
Presented Wednesday afternoon, April 21, 2010
159th ASA Meeting, Baltimore, MD
Initial
approaches to synthesizing speech by computer required extensive and detailed
knowledge about speech articulation and acoustics. The synthetic speech generated by these
systems is typically described by listeners as sounding robotic rather than
human.
Our talk
will describe some recent developments using a newer, data-based approach: concatenative speech synthesis. With data-based techniques, a synthetic voice
is derived from a digital database of human speech, typically recorded from a
single talker. Short snippets of
natural speech are then strung together into a sequence (concatenated) to create
synthetic speech utterances.
The speech databases of early concatenative synthesizers were limited to
one pre-selected instance of each of the possible sequential pairs (diphones) of basic speech sounds that could occur within or
between words in a language. In English,
a minimum inventory of about 1000 diphones was
needed. These diphones
would be frequency-warped and shrunk or stretched in duration as needed for a
given synthetic utterance. These
modifications introduced acoustic distortion, so diphone
concatenative synthesis also sounded quite robotic.
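To make the idea concrete, the Python sketch below shows roughly how a diphone synthesizer assembles an utterance: each adjacent pair of phones is looked up in a pre-recorded inventory, stretched or shrunk to a target duration, and the pieces are concatenated. The names, the toy inventory, and the crude linear resampling are our own illustrative assumptions; real diphone systems used pitch-synchronous signal processing (e.g., PSOLA) to adjust pitch and duration.

# Illustrative sketch only: a toy diphone concatenator.
import numpy as np

SAMPLE_RATE = 16000  # samples per second for the toy waveforms

def stretch(unit, target_samples):
    """Shrink or stretch a waveform to a target length by linear resampling."""
    positions = np.linspace(0, len(unit) - 1, target_samples)
    return np.interp(positions, np.arange(len(unit)), unit)

def synthesize(phones, diphone_db, durations_sec):
    """Look up one stored instance of each adjacent phone pair (diphone),
    adjust its duration, and concatenate the pieces."""
    pieces = []
    for pair, dur in zip(zip(phones, phones[1:]), durations_sec):
        unit = diphone_db[pair]                      # one pre-selected recording
        pieces.append(stretch(unit, int(dur * SAMPLE_RATE)))
    return np.concatenate(pieces)

# Toy usage: random noise stands in for recorded diphones.
db = {('h', 'eh'): np.random.randn(800),
      ('eh', 'l'): np.random.randn(900),
      ('l', 'ow'): np.random.randn(700)}
waveform = synthesize(['h', 'eh', 'l', 'ow'], db, [0.06, 0.05, 0.08])
print(len(waveform), "samples")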
In the
late 1990s, unit selection concatenative synthesis
was introduced. Unit selection
text-to-speech (TTS) systems operate on all the data from a large database,
typically several hours of continuous natural speech. The constituent units of the speech database
are located, annotated on several linguistic levels and indexed for
retrieval. The synthesis algorithm then
amounts to a complex data search and retrieval process in which small units of recorded
speech are selected to match the units required to produce the intended
synthetic utterance. Since there are
many instances of speech units to choose from, the units do not require much,
if any, signal processing to change their frequency or duration. Consequently, unit selection synthetic speech
resembles natural speech much more closely than the earlier techniques. For example, AT&T Natural Voices TTS is
a unit selection system: http://www2.research.att.com/~ttsweb/tts/demo.php
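As a rough illustration of the search and retrieval process described above, the Python sketch below picks one candidate unit per target position by minimizing a combination of a "target cost" (how well a unit matches what the utterance needs) and a "join cost" (how smoothly adjacent units fit together), using a dynamic-programming search. The features and cost functions here are placeholder assumptions, not those of any particular commercial system.

# Illustrative sketch only: unit selection as a cost-minimizing search.
from dataclasses import dataclass

@dataclass
class Unit:
    phone: str        # the speech sound this recorded snippet covers
    pitch: float      # toy acoustic features used by the cost functions
    duration: float

def target_cost(unit, spec):
    """How well a candidate unit matches the specification for this position."""
    return abs(unit.pitch - spec["pitch"]) + abs(unit.duration - spec["duration"])

def join_cost(prev, cur):
    """How smoothly two recorded units join at their boundary."""
    return abs(prev.pitch - cur.pitch)

def select_units(specs, candidates):
    """Dynamic-programming (Viterbi-style) search: choose one unit per
    position so that the total target + join cost is minimized."""
    paths = [(target_cost(u, specs[0]), [u]) for u in candidates[0]]
    for spec, cands in zip(specs[1:], candidates[1:]):
        new_paths = []
        for u in cands:
            cost, path = min(((c + join_cost(p[-1], u), p) for c, p in paths),
                             key=lambda t: t[0])
            new_paths.append((cost + target_cost(u, spec), path + [u]))
        paths = new_paths
    return min(paths, key=lambda t: t[0])[1]

# Toy usage: two positions, each with two recorded candidates to choose from.
specs = [{"pitch": 120.0, "duration": 0.08}, {"pitch": 118.0, "duration": 0.06}]
candidates = [[Unit("ae", 119.0, 0.07), Unit("ae", 150.0, 0.08)],
              [Unit("t", 121.0, 0.06), Unit("t", 95.0, 0.05)]]
print(select_units(specs, candidates))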
Already,
some systems based on very large collections of recorded continuous speech from
an individual can produce short sentences that are difficult to distinguish
from natural speech. However, no present
system is able to render paragraph-length texts in a manner that is wholly
natural. In part, this is because the systems are doing a very superficial job
of interpreting the meaning that the text is intended to convey. This is an
area where much additional work is needed.
This shortcoming is more or less apparent depending upon the nature of
the material being read. Expressive material is more challenging for TTS than
factual information.
Along
with the enhanced naturalness and intelligibility of recent concatenative
systems, the number of practical applications for synthetic speech has
increased. These fall broadly into two categories: business applications and
assistive technology. Among business applications that have benefited from new
synthesis technology are a variety of limited-domain applications (e.g., navigation) and unrestricted text readers
such as the TTS voices now provided in ebook readers.
For
several decades, people with severe speech disorders have used communication
aids (now called Speech Generating Devices or SGDs) that employ synthetic
speech. These aids allow the user to type or otherwise select words and
sentences to be spoken, and for some people represent their primary mode of
communication. Until recently, most SGDs
provided only rule-based synthesis that sounded robotic, and certainly unlike
any specific person. The physicist Stephen Hawking is a prominent example of a
person who uses just such an SGD. In Professor Hawking's
case, he has used a specific model for such a long time that he now identifies
with its voice quality. However, for many other users of SGDs, the availability
of highly natural-sounding and more intelligible concatenative
voices has led to significant improvements in the quality of their
communication.
A foremost concern of SGD users is that the
synthetic speech be intelligible. For many years, the DECtalk
systems were regarded as the most intelligible systems on the commercial
market. However, recent unit concatenation systems are demonstrably more
intelligible than the rule-based systems of the 1980s. For instance, in the
Nemours Speech Research Laboratory, we compared five female TTS voices,
including four concatenative voices and one DECtalk voice, for intelligibility. The most striking result
of this experiment was the difference in intelligibility between the DECtalk voice and all of the unit concatenation systems:
listeners found the DECtalk voice much harder to
understand than any of the concatenative voices.
This does not mean that all concatenative
TTS systems are necessarily more intelligible than rule-based systems. Simply
put, concatenative TTS voices are only as good as the
speech data that goes into them. This becomes an especially important factor in
one emerging area of application for concatenative
TTS systems: creation of personal TTS voices for individual users. Software is
now available that allows any individual with fluent speech to spend a few
hours recording their speech on a home computer system and then have those
recordings converted to a concatenative TTS voice.
This is an especially attractive capability for people who have been diagnosed with
a neurodegenerative disease such as ALS (also called Lou Gehrig's disease). ALS
patients typically lose the ability to speak as the disease progresses, but
while they are still speaking fluently, they can bank their voice for later
use in a concatenative TTS system. ALS patients can
then have an SGD with synthetic speech that closely resembles their own
voice. Recent press coverage of film
critic Roger Ebert has revealed his enthusiasm over receiving an SGD built
from recordings of his own voice.
A number of laboratories have developed applications
around the concept of voice banking for people who are at risk for losing their
voice. The most advanced of these projects at present is the ModelTalker project (http://www.ModelTalker.com), which
has allowed close to 200 people to create personalized concatenative
TTS voices from recordings made on a personal computer at home. The ModelTalker project aims to allow novice users to record
speech data of adequate quality and quantity for generating a concatenative voice without the need for professional
recording equipment and expert assistance.
While
the ability to create acceptable personalized TTS voices has largely been
realized with concatenative synthesis, considerable
room for improvement remains. Ideally, SGD users should be able to make their
voice sound happy, sad, or angry and express surprise or doubt not only with
words, but with tone of voice. At present, the only good method for producing
these effects in data-based synthesis is to expand the inventory of recorded
speech to include utterances spoken in happy, sad, angry, and other expressive voices.
Unfortunately, greatly increasing the diversity and size of the speech
inventory to be recorded is impractical or even impossible for many potential
users. Indeed, one of the greatest barriers to the broad use of voice banking
for ALS patients is simply the amount of speech they must record to produce an
acceptable basic concatenative voice.
Ultimately,
we expect that research will provide solutions to the problems of creating
fully natural-sounding and expressive synthetic speech. Probably as part of
those solutions, we will also learn how to capture the voice quality of an
individual from a relatively small but representative sample of their fluent
speech, whether the individual is an adult or a child. Moreover, it is possible
that this will allow us to go one step further and generate realistic
natural-sounding voices for dysarthric individuals
who presently cannot produce anything more than a few isolated vowel sounds. In
fact, separate projects at Oregon Health Sciences University and Northeastern
University are already exploring how some data-based TTS technology might be
used to achieve this. There is much work ahead, but great promise for TTS in
assistive technology.