Michael H. Krane, email@example.com
Daniel Sinder, James Flanagan
CAIP Center, Rutgers University, Piscataway, NJ 08854-8088 USA
Popular version of papers 3aSCb1 and 3aSCb6
Presented Wednesday morning, March 17, 1999
ASA/EAA/DAGA '99 Meeting, Berlin, Germany
If you have used a telephone lately, you've surely encountered the now ubiquitous voice prompt menu. It is often an easy matter to determine whether the voice you are hearing is a recording of a human speaking or one produced entirely by a computer -- "synthesized," in the parlance of speech technology. One reason a synthetic voice is easy to spot is that it lacks the natural rhythm and pitch variations inherent in the expressiveness of ordinary language. Another reason is that the sounds themselves are but a rough copy of the sounds we produce when we speak. It turns out that knowledge of the same processes by which smokers blow smoke rings may actually be the missing link to improved synthetic speech.
For many years it has been thought that a better description of the physical process by which the sounds are produced would make synthetic speech sounds mimic real ones more faithfully. The problem lies in how these physical processes have been modeled. Airflow problems are notoriously complex; suitable approximations are nearly always sought in order to highlight the most relevant physics. The classical approach in speech science has been to reduce the problem to one of sound motion alone, in which still air in the vocal system and outside the mouth is disturbed only by the small compressions and expansions that comprise the sound field. While this approximation has allowed a great deal of progress in understanding how speech sounds are produced and how to mimic them with a computer, speech synthesis based on this model is easily recognizable as artificial. In reality, sound is not the only kind of air motion involved. The air in the vocal system is not static, but moves from the lungs out of the mouth, carrying the sound field along with it. In addition, turbulence is generated. While these two types of motion are not acoustic, they can have a profound impact on how sounds are produced and how they are transmitted to the ear of the listener.
A systematic effort to apply a specialization of the science of fluid mechanics, known as aeroacoustics, to speech is now in its infancy. Aeroacoustics was developed in the early 1950s to help reduce noise from jet aircraft engines then being developed for civilian use -- a problem of sound production and transmission by airflow. Aeroacousticians bring several tools and areas of expertise previously unavailable to speech science: a theoretical framework more general than the one speech science has used heretofore, and techniques, both computational and experimental, by which airflow and the sound it makes can be characterized.
The first effect of air motion on speech is that the air in which sound waves travel is itself moving, and in a non-uniform way. This motion of the acoustic medium carries the sound waves along with it. In speech, the effect is important only where the vocal tract has narrow constrictions, such as the one formed by nearly touching the tongue to the palate, as we do when we say "sssssss." The narrower the constriction, the greater the air velocity due to acoustic fluctuations. When there is no net flow of air, a parcel of air simply oscillates back and forth in the constriction. The tendency of the air to resist this back-and-forth motion increases as the diameter of the constriction shrinks and its length grows. If the air also has a net motion in one direction through the constriction, however, then the parcel may be blown out of the constriction before it has time to complete an oscillation. This has a profound impact on the resistance of the air to acoustic fluctuations, and hence on the frequencies at which the vocal system resonates. The net effect of bulk air motion on speech sounds is to lower these resonant frequencies.
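The resistance of a constriction to oscillatory flow can be made concrete with a standard lumped-element formula from acoustics (a sketch with illustrative numbers, not taken from the papers): modeling the constriction as a short uniform tube, its acoustic inertance is M = rho*l/A, which grows as the length l increases and the cross-sectional area A shrinks.

```python
# Acoustic inertance (mass-like opposition to oscillatory flow) of a
# narrow constriction modeled as a short uniform tube: M = rho * l / A.
import math

RHO_AIR = 1.2  # kg/m^3, approximate density of air at room conditions


def acoustic_inertance(length_m, diameter_m):
    """Inertance M = rho*l/A of a tube-shaped constriction, in kg/m^4."""
    area = math.pi * (diameter_m / 2.0) ** 2
    return RHO_AIR * length_m / area


# Illustrative dimensions (hypothetical, order-of-magnitude vocal-tract scale):
m_base = acoustic_inertance(0.02, 0.01)     # 2 cm long, 1 cm diameter
m_narrow = acoustic_inertance(0.02, 0.005)  # halve the diameter
m_long = acoustic_inertance(0.04, 0.01)     # double the length

# Halving the diameter quadruples the inertance; doubling the length doubles it.
```

This matches the statement in the text: the parcel of air resists back-and-forth motion more strongly as the constriction gets narrower and longer.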
A second effect of air motion on speech sound production and transmission is due to turbulent flow in the vocal system. One way to characterize turbulent flow is by a quantity known as "vorticity," a measure of the "spin" of a fluid particle. The smoke rings mentioned at the top of the article are a well-known example of this phenomenon. Vorticity, it turns out, plays a key role in the interaction between the sound field and the turbulent flow: accelerating fluid particles which possess vorticity can either produce or absorb sound, depending on the sound field in which the vortical particle is situated. Turbulent flow is generated at locations where the cross-sectional area of the vocal tract increases suddenly, and at these locations the turbulence takes energy from the sound field. For example, upwards of 99% of the kinetic energy of the airflow through the vocal folds is turned into vortical motion, so that only the remaining small fraction propagates into the vocal system as sound. As it travels up the vocal tract, however, the turbulence can give energy back to the sound field when it passes obstructions such as the epiglottis, the flap which closes to protect the lungs during swallowing. This return process is extremely inefficient, so the vorticity persists until friction dissipates it into heat. Some speech sounds, called fricatives ('s' as in "so," 'sh' as in "she"), are produced entirely by turbulent flow passing an obstacle such as the teeth. The same is true when we whisper.
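The notion of vorticity as the "spin" of a fluid particle can be illustrated numerically (a hypothetical sketch, not part of the papers): vorticity is the curl of the velocity field, and for a fluid in solid-body rotation at angular rate Omega it comes out uniform and equal to 2*Omega.

```python
# Vorticity is the curl of the velocity field. In two dimensions,
# omega_z = dvy/dx - dvx/dy. For solid-body rotation,
# v = (-Omega*y, Omega*x), the vorticity is 2*Omega everywhere.


def vorticity(vx, vy, x, y, h=1e-5):
    """Estimate omega_z = dvy/dx - dvx/dy by central differences."""
    dvy_dx = (vy(x + h, y) - vy(x - h, y)) / (2.0 * h)
    dvx_dy = (vx(x, y + h) - vx(x, y - h)) / (2.0 * h)
    return dvy_dx - dvx_dy


OMEGA = 3.0  # rotation rate in rad/s (illustrative value)
vx = lambda x, y: -OMEGA * y  # velocity field of solid-body rotation
vy = lambda x, y: OMEGA * x

w = vorticity(vx, vy, 0.5, -0.2)  # same value at any point: 2*OMEGA
```

A smoke ring is a more intricate case of the same quantity: a thin loop in which the vorticity is concentrated, which is why it holds together as it travels.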
To predict the sound produced or dissipated by vorticity, it is necessary to know the characteristics of the vorticity field. This represents quite a challenge; turbulent flow is one of the most difficult problems in physics. Computing a flow such as those in the vocal system might take tens of hours on a supercomputer, far too long to be of any use in a speech synthesizer. However, enough is known about the general characteristics of turbulent vorticity in pipes that simplified models can be developed to capture the essential behavior in an inexact but compact manner. Computations using these models can take on the order of a minute on a desktop computer, and have produced large improvements both in our understanding of the physics of speech sound generation and in the development of natural-sounding synthetic speech.
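As a rough illustration of what such a compact model might look like (a hypothetical sketch -- the threshold, scaling law, and constants below are illustrative assumptions, not the authors' model), one common style of simplification replaces the turbulence with a random noise source that switches on when a Reynolds number for the flow exceeds a critical value:

```python
# Toy turbulence-noise source for a constriction: emit random noise only
# when the jet's Reynolds number exceeds a critical value, with a level
# that grows with the excess. All constants are illustrative assumptions.
import random


def noise_source(flow_velocity, diameter, re_crit=1800.0, nu=1.5e-5):
    """One sample of a gated noise source driven by the flow state.

    flow_velocity : jet speed through the constriction, m/s
    diameter      : constriction diameter, m
    re_crit       : assumed critical Reynolds number for turbulence onset
    nu            : kinematic viscosity of air, m^2/s
    """
    re = flow_velocity * diameter / nu  # Reynolds number of the jet
    if re <= re_crit:
        return 0.0  # laminar flow: no turbulence noise
    gain = (re**2 - re_crit**2) * 1e-9  # illustrative amplitude scaling
    return gain * random.uniform(-1.0, 1.0)


# Slow flow through a 1 cm constriction is silent; fast flow makes noise.
quiet = noise_source(1.0, 0.01)   # Re ~ 670, below threshold -> 0.0
```

Each call costs a handful of arithmetic operations, which is how a model of this kind can run in about a minute on a desktop machine instead of tens of hours on a supercomputer.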