Shubha Kadambe - email@example.com
Information Sciences Laboratories, M/S: ML 69, HRL Laboratories
3011 Malibu Canyon Rd., Malibu, CA 90265, USA
Popular version of paper 5aSCa1
Presented Friday Morning, March 19, 1999
ASA/EAA/DAGA '99 Meeting, Berlin, Germany
Speech is a mode of communication unique to human beings. Speech scientists believe that early humans used facial and hand gestures to communicate with others. However, when they realized that this mode of communication disrupted their work, they learned the art of speaking, which kept their hands free while communicating. In the age of computers, we are again limited by having to use our hands to communicate with the machine. It has therefore long been a dream of human beings to talk to computers instead of using the keyboard, mouse, etc. We are close to 2001, and yet we are quite far from realizing "Hal" of 2001: A Space Odyssey!
Despite the efforts of many speech scientists over several decades, we are not in a position to build Hal today, or in 2001, because human speech has too many variabilities. The human speech production system consists of the lungs, trachea, glottis (two lip-like ligaments at the upper end of the trachea), velum, oral (mouth) cavity (which includes the tongue, jaws, and lips), and nasal cavity. One variability in human speech is due to differences in these physical organs from one person to another (referred to as inter-speaker variability). The second is due to small changes in these organs depending on the emotional or physical state (having a cold, being under stress, etc.) of the person (referred to as intra-speaker variability).
In order for us to use speech as a mode of communication with computers, we first need to understand how human speech communication works. Basically, the brain sends control signals to the speech production system, based on what needs to be conveyed; the speech production system generates a speech signal from these control signals; the signal is transmitted; the listener's ears receive it, represent the speech features in the form of patterns, and send these patterns to the listener's brain. That brain performs some kind of pattern matching and uses an understanding mechanism to decode the message that was conveyed; depending on the message, it then sends control signals to the listener's own speech production system. This process repeats until the end of the communication session.
From the above description it can be seen that speech communication is not a simple problem, even though we human beings appear to do it so effortlessly. To communicate with a computer using speech, we essentially need to replace one of the human beings with a computer. In the computer, the function of the human ear is performed by a signal processing algorithm that extracts the relevant features from the speech signal and forms them into patterns. The function of the human brain is performed by a speech recognition and understanding algorithm, which comprehends what the other person is conveying. The brain's control signals take the form of text, and the function of human speech production is performed by a text-to-speech synthesis algorithm.
Research has been conducted in all these areas. This paper focuses on the signal processing aspect of the communication system described above. The relevant speech features mainly consist of the formant frequencies and the pitch period. The formant frequencies relate to variations in the oral and nasal cavities (depending on the sound being produced). These two cavities can be modeled as a combination of acoustic tubes, each with its own resonant frequency; the formant frequencies correspond to these resonances. The pitch period corresponds to glottal closure and opening: the time interval between two glottal closures is defined as the pitch period. Each unit of speech sound, referred to as a phoneme (or phone), has a unique set of these features and can be characterized by them. Speech recognition algorithms exploit this when recognizing sounds.
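As a rough illustration of these two features, the sketch below models the vocal tract as a uniform quarter-wave tube to obtain formant-like resonances, and estimates the pitch period of a synthetic 100 Hz voicing signal with a standard autocorrelation method. The tube length, speed of sound, and the autocorrelation method are textbook illustrations, not the paper's own algorithm:

```python
import math

# Quarter-wave resonator model of the vocal tract: a uniform tube
# closed at the glottis and open at the lips resonates at
# F_k = (2k - 1) * c / (4L). (Illustrative assumption, not from the paper.)
def tube_formants(length_m=0.17, c=343.0, n=3):
    return [(2 * k - 1) * c / (4 * length_m) for k in range(1, n + 1)]

# Autocorrelation-based pitch-period estimate: the lag at which the
# signal best matches a shifted copy of itself approximates the time
# between glottal closures.
def pitch_period(signal, fs, min_f0=60.0, max_f0=400.0):
    n = len(signal)
    lo, hi = int(fs / max_f0), int(fs / min_f0)
    best_lag, best_r = lo, float("-inf")
    for lag in range(lo, min(hi, n - 1) + 1):
        r = sum(signal[i] * signal[i + lag] for i in range(n - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    return best_lag / fs  # seconds between glottal closures

fs = 8000
f0 = 100.0  # 100 Hz voicing -> 10 ms pitch period
x = [math.sin(2 * math.pi * f0 * t / fs) for t in range(800)]
print([round(f) for f in tube_formants()])  # -> [504, 1513, 2522]
print(pitch_period(x, fs))                  # -> 0.01
```

A 17 cm tube gives resonances near 500, 1500, and 2500 Hz, the classic formants of a neutral vowel, which is why the acoustic-tube model is a useful first approximation.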
Due to the variabilities mentioned earlier, these features need to be represented using statistical approaches. Statistics alone do not help all the time, however; there is also a need for signal processing algorithms that are robust to these variabilities. Several such algorithms have been developed. Recently, wavelet processing was introduced to the signal processing community. In simple terms, a wavelet is a window function that lets you look at different parts (small, medium, or large) of a signal. In other words, it helps in extracting both locally changing features and globally changing ones. In addition, since the shape of the wavelet can be changed, it can accommodate inter- and intra-speaker variabilities.
When the speech signal is transmitted (through air, a telephone line, or any other medium) it is corrupted by noise, so the signal processing algorithm used to extract features must be capable of removing noise as well. Since wavelets come in different sizes, they have different bandwidths, which helps remove noise that is spread across the signal. In this paper a special category of wavelets - adaptive wavelets - is used to take care of inter- and intra-speaker variabilities and noise. The relevant speech features are extracted in terms of wavelet features, which are used both to compress the speech signals and to recognize them. This paper describes these two applications. We have developed a good-quality variable bit-rate coder and a phoneme recognizer; the phoneme recognition rate improves when the wavelet features are added.
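How zeroing small wavelet coefficients can both remove noise and compress a signal can be sketched with the simplest wavelet of all, the Haar wavelet. This is only an illustration with made-up numbers; the paper's adaptive wavelets are more sophisticated:

```python
import math, random

def haar_forward(x):
    """One level of the Haar transform: pairwise averages and details."""
    a = [(x[2*i] + x[2*i+1]) / math.sqrt(2) for i in range(len(x) // 2)]
    d = [(x[2*i] - x[2*i+1]) / math.sqrt(2) for i in range(len(x) // 2)]
    return a, d

def haar_inverse(a, d):
    """Exact reconstruction from averages and details."""
    x = []
    for ai, di in zip(a, d):
        x.append((ai + di) / math.sqrt(2))
        x.append((ai - di) / math.sqrt(2))
    return x

def denoise(x, thresh):
    a, d = haar_forward(x)
    # Small detail coefficients carry mostly noise: zero them out.
    d = [0.0 if abs(di) < thresh else di for di in d]
    return haar_inverse(a, d)

rng = random.Random(0)                  # fixed seed for reproducibility
clean = [1.0] * 32 + [4.0] * 32         # piecewise-constant "signal"
noisy = [c + rng.gauss(0, 0.1) for c in clean]
out = denoise(noisy, 0.4)

def err(y):  # root-mean-square error against the clean signal
    return math.sqrt(sum((yi - ci) ** 2 for yi, ci in zip(y, clean)) / len(clean))

# The jump in the signal falls on a pair boundary, so the details here
# are almost pure noise; thresholding them reduces the error.
print(err(noisy) > err(out))  # True
```

The zeroed detail coefficients need not be stored or transmitted, which is also the basis of wavelet compression: the same thresholding step that removes noise shrinks the number of coefficients describing the signal.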
We hope that all these little steps that scientists take will help in building "Hal" in the near future, if not in 2001!