Acoustical Society of America
ICA/ASA '98 Lay Language Papers


Future Directions in
Speech Information Processing

Sadaoki Furui - furui@cs.titech.ac.jp
Dept. Computer Science, Tokyo Institute of Technology
2-12-1, Ookayama, Meguro-ku, Tokyo, 152 Japan

Popular version of paper 1aPLa1
Presented Monday morning, June 22, 1998
ICA/ASA '98, Seattle, WA

Overview

This paper predicts future directions in speech processing technologies, including speech recognition, synthesis and coding. It describes the most important research problems, and tries to forecast where progress will be made in the near future and what applications will become commonplace as a result of the increased capabilities.

Speech recognition, synthesis, and coding systems are expected to play important roles in an advanced multi-media society with user-friendly human-machine interfaces. Speech recognition systems include not only those that recognize messages but also those that recognize the identity of the speaker. Services using these systems will include voice dialing, database access and management, guidance and transactions, automated reservations, various custom-made services, dictation and editing, electronic secretarial assistance, robots, automated interpreting (translating) telephony, security control, digital cellular communications, and aids for the handicapped (e.g., reading aids for the blind and speaking aids for the vocally handicapped).

The most promising application area for speech technology is telecommunications. The fields of communications, computing, and networking are now converging in the form of personal information/communication terminals. In the near future, personal communications services will become popular, and everybody will have their own portable telephone. Several technologies will play major roles in this communications revolution, but speech processing will be one of the key technologies. By using advancing speech synthesis/recognition technology, telephone sets will become useful personal terminals for communicating with computer systems. Speaker recognition techniques are expected to be widely used in the future as methods of verifying the claimed identity in telephone banking and shopping services, information retrieval services, remote access to computers, credit-card calls and so forth.

Recent Progress in Speech Recognition

Dictation of read newspaper speech, such as North American business newspapers including the Wall Street Journal, conversational speech recognition using the ATIS task, and, more recently, broadcast news dictation have been actively investigated. Common features of these systems are the use of cepstral parameters and their regression coefficients as speech features, triphone HMMs as acoustic models, vocabularies of several thousand to several tens of thousands of entries, and statistical language models such as bigrams and trigrams. Such methods have been applied not only to English but also to French, German, Italian, Spanish, Chinese and Japanese, and, although there are several language-specific characteristics, similar recognition results have been obtained. Recently, the "Switchboard" and "Call Home" tasks using natural conversational speech have been actively investigated. In spite of the remarkable recent progress, we are still far from our ultimate goal of understanding free conversational speech uttered by any speaker in any environment.
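
To illustrate the statistical language models mentioned above, the following minimal Python sketch estimates bigram probabilities from a toy corpus with add-one smoothing. The corpus, function name, and smoothing choice are illustrative assumptions, not a description of any particular system.

    from collections import defaultdict

    def train_bigram(sentences):
        """Estimate add-one-smoothed bigram probabilities P(w2 | w1)."""
        unigram = defaultdict(int)
        bigram = defaultdict(int)
        vocab = set()
        for words in sentences:
            tokens = ["<s>"] + words + ["</s>"]
            vocab.update(tokens)
            for w1, w2 in zip(tokens, tokens[1:]):
                unigram[w1] += 1
                bigram[(w1, w2)] += 1
        vocab_size = len(vocab)

        def prob(w1, w2):
            # Add-one (Laplace) smoothing gives unseen word pairs a small probability.
            return (bigram[(w1, w2)] + 1) / (unigram[w1] + vocab_size)

        return prob

    # Toy corpus; a real dictation system is trained on millions of newspaper sentences.
    corpus = [["stocks", "rose", "sharply"], ["stocks", "fell", "sharply"]]
    p = train_bigram(corpus)
    print(p("stocks", "rose"))   # probability of "rose" following "stocks"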

Robust Speech Recognition

Ultimate speech recognition systems should be capable of robust, speaker-independent or speaker-adaptive, continuous speech recognition. It is crucial to establish methods that are robust against voice variation due to individuality, the physical and psychological condition of the speaker, telephone sets, microphones, network characteristics, additive background noise, speaking styles, and so on. It is also important for the systems to impose few restrictions on tasks and vocabulary. To solve these problems, it is essential to develop automatic adaptation techniques.
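
One widely used technique for the channel robustness problem described above is cepstral mean normalization: a fixed transmission channel (telephone set, microphone, network) adds a roughly constant offset in the cepstral domain, so subtracting the long-term average removes much of it. The NumPy sketch below is a minimal illustration; the array shapes and the synthetic data are assumptions made for the example.

    import numpy as np

    def cepstral_mean_normalization(cepstra):
        """Subtract the per-utterance mean from each cepstral dimension.

        cepstra: array of shape (num_frames, num_coefficients).
        Removing the mean suppresses the stationary channel component.
        """
        return cepstra - cepstra.mean(axis=0, keepdims=True)

    # Random features stand in for real cepstra; 0.5 plays the role of a channel offset.
    frames = np.random.randn(200, 13) + 0.5
    normalized = cepstral_mean_normalization(frames)
    print(normalized.mean(axis=0))   # approximately zero in every dimension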

Extraction and normalization of (adaptation to) voice individuality is one of the most important issues. A small percentage of people occasionally cause systems to produce exceptionally low recognition rates. This is an example of the "sheep and goats" phenomenon. Speaker adaptation (normalization) methods can usually be classified into supervised (text-dependent) and unsupervised (text-independent) methods. Unsupervised, on-line, incremental adaptation is ideal, since the system works as if it were a speaker-independent system, and it performs increasingly better as it is used.
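
As a toy illustration of unsupervised, on-line, incremental adaptation, the sketch below shifts a model mean toward each new, unlabeled utterance as it arrives, using a MAP-style interpolation. It is a deliberately simplified stand-in for full adaptation methods such as MAP or MLLR; the class name, prior weight, and data are all illustrative assumptions.

    import numpy as np

    class IncrementalMeanAdapter:
        """Toy on-line adaptation: move a model mean toward incoming speaker data."""

        def __init__(self, speaker_independent_mean, prior_weight=50.0):
            self.mean = np.asarray(speaker_independent_mean, dtype=float)
            self.prior_weight = prior_weight   # how strongly to trust the initial model
            self.frames_seen = 0.0

        def adapt(self, utterance_frames):
            """MAP-style interpolation between the current mean and new frames."""
            frames = np.asarray(utterance_frames, dtype=float)
            n = len(frames)
            weight = self.prior_weight + self.frames_seen
            self.mean = (weight * self.mean + frames.sum(axis=0)) / (weight + n)
            self.frames_seen += n
            return self.mean

    # The speaker-independent mean drifts toward the new speaker as utterances arrive.
    adapter = IncrementalMeanAdapter(speaker_independent_mean=np.zeros(13))
    adapter.adapt(np.random.randn(100, 13) + 1.0)   # this speaker sits around +1.0
    print(adapter.mean[:3])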

Detection-Based Approach for Spontaneous Speech Recognition

One of the most important issues for speech recognition is how to create language models (rules) for spontaneous speech. When recognizing spontaneous speech in dialogs, it is necessary to deal with variations that are not encountered when recognizing speech that is read from texts. These variations include extraneous words, out-of-vocabulary words, ungrammatical sentences, disfluency, partial words, repairs, hesitations, and repetitions. It is crucial to develop robust and flexible parsing algorithms that match the characteristics of spontaneous speech. A paradigm shift from the present transcription-based approach to a detection-based approach will be important to solve such problems. How to extract contextual information, predict users' responses, and focus on key words are very important issues.
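
A very small illustration of the detection-based idea is keyword spotting: rather than forcing a complete transcription, the system accepts only those task keywords that are recognized with sufficient confidence and ignores fillers, repairs, and out-of-vocabulary material. The recognizer output, keyword list, and threshold below are hypothetical values used only to show the principle.

    # Hypothetical recognizer output: (word, confidence) pairs for one utterance.
    hypothesis = [("uh", 0.40), ("i", 0.85), ("want", 0.90),
                  ("a", 0.70), ("flight", 0.95), ("to", 0.80), ("seattle", 0.93)]

    TASK_KEYWORDS = {"flight", "seattle", "reservation"}
    CONFIDENCE_THRESHOLD = 0.8   # reject uncertain detections

    def detect_keywords(hyp, keywords, threshold):
        """Return the task keywords detected with confidence above the threshold."""
        return [word for word, conf in hyp if word in keywords and conf >= threshold]

    print(detect_keywords(hypothesis, TASK_KEYWORDS, CONFIDENCE_THRESHOLD))
    # -> ['flight', 'seattle']  (the filler "uh" and low-confidence words are ignored)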

Speaker Recognition

Recently, various topics of research interest in speaker recognition have led to new approaches and techniques. They include VQ- and ergodic-HMM-based text-independent recognition methods, a text-prompted recognition method, parameter/distance normalization techniques, model adaptation techniques, and methods of updating models as well as a priori thresholds in speaker verification. However, there are still many problems for which good solutions remain to be found. The open questions include: (a) How can human beings correctly recognize speakers? (b) What feature parameters are appropriate for speaker recognition? (c) How can we fully exploit the clearly evident encoding of identity in prosody and other suprasegmental features of speech? (d) Is the "sheep and goats" problem universal? (e) Can we ever reliably cluster speakers on the basis of similarity/dissimilarity? (f) How do we deal with long-term variability in people's voices?
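
The VQ-based text-independent methods listed above can be sketched in a few lines: each enrolled speaker is represented by a small codebook of feature vectors (obtained here with plain k-means), and a test utterance is attributed to the speaker whose codebook yields the lowest average quantization distortion. The synthetic features, codebook size, and iteration count below are illustrative assumptions, not tuned values.

    import numpy as np

    def train_codebook(frames, codebook_size=8, iterations=20, seed=0):
        """Build a speaker codebook with plain k-means over feature frames."""
        rng = np.random.default_rng(seed)
        codebook = frames[rng.choice(len(frames), codebook_size, replace=False)]
        for _ in range(iterations):
            # Assign every frame to its nearest codeword, then re-estimate codewords.
            dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            for k in range(codebook_size):
                if np.any(labels == k):
                    codebook[k] = frames[labels == k].mean(axis=0)
        return codebook

    def average_distortion(frames, codebook):
        """Mean distance from each frame to its nearest codeword."""
        dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
        return dists.min(axis=1).mean()

    # Toy enrollment data: two "speakers" with different feature distributions.
    speaker_a = np.random.randn(300, 12) + 1.0
    speaker_b = np.random.randn(300, 12) - 1.0
    codebooks = {"A": train_codebook(speaker_a), "B": train_codebook(speaker_b)}

    test = np.random.randn(100, 12) + 1.0     # unknown utterance, actually speaker A
    scores = {spk: average_distortion(test, cb) for spk, cb in codebooks.items()}
    print(min(scores, key=scores.get))        # -> 'A' (lowest distortion wins)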

Articulatory and Perceptual Constraints

Knowledge and technology from a wide range of areas, including the use of articulatory and perceptual constraints, will be necessary to develop speech technology. For example, when several phonemes or syllables are spoken in succession, as in ordinary sentences, the tongue, jaw, lips, etc. move asynchronously and in parallel, yet with coupled relationships. Current speech analysis techniques, however, represent speech as a simple time series of spectra. It will become necessary to analyze speech by decomposing it into several hidden factors based on speech production mechanisms. This approach seems essential for solving the coarticulation problem, one of the most important problems in both speech synthesis and recognition.

The human hearing system is far more robust than machine systems - more robust not only against the direct influence of additive noise but also against speech variations (that is, the indirect influence of noise), even if the noise is very inconsistent. Speech recognizers are therefore expected to become more robust when the front end uses models of human hearing. This can be done by imitating the physiological organs or by reproducing psychoacoustic characteristics.
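
One psychoacoustic characteristic that is straightforward to reproduce in a front end is the ear's nonlinear frequency resolution, commonly approximated by warping the frequency axis onto the mel scale, mel(f) = 2595 log10(1 + f/700). The sketch below lays out perceptually spaced filter band edges; the band count and frequency range are arbitrary example values.

    import numpy as np

    def hertz_to_mel(f_hz):
        """Standard mel-scale approximation of perceived pitch."""
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    def mel_to_hertz(m):
        """Inverse of the mel-scale approximation."""
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_band_edges(num_bands, low_hz=0.0, high_hz=4000.0):
        """Filter band edges equally spaced on the mel scale, hence denser at low
        frequencies, mimicking the ear's finer resolution below about 1 kHz."""
        mels = np.linspace(hertz_to_mel(low_hz), hertz_to_mel(high_hz), num_bands + 2)
        return mel_to_hertz(mels)

    print(np.round(mel_band_edges(num_bands=10)))
    # The edges bunch together at low frequencies and spread out toward 4 kHz.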

Research on the Human Brain

Although observation and modeling of the movements of the vocal system, along with physiological modeling of the auditory peripheral system, have recently made great progress, the mechanism of speech information processing in the human brain has hardly been investigated. Psychological experiments on human memory have clearly shown that speech plays a far more important and essential role than vision in human memory and thinking processes. Whereas models for separating acoustic sources have been researched in "auditory scene analysis", the mechanisms by which the meaning of speech is understood and by which speech is produced have not yet been clarified.

It will be necessary to clarify the process by which human beings understand and produce spoken language, in order to obtain hints for constructing language models for spoken language, which is very different from written language. It is necessary to be able to analyze context and accept ungrammatical sentences. It is about time to start active research on clarifying the mechanism of speech information processing in the human brain so that epoch-making technological progress can be made based on the human model.


Prof. Sadaoki Furui will be staying at the Westin Hotel (Tel: 206-728-1000, 800-228-3000, Fax: 206-728-2259) during the conference.