
Creating Virtual Spatial Audio
Via Scientific Computing
and Computer Vision

Ramani Duraiswami
Richard O. Duda, V. Ralph Algazi, Larry Davis, Nail Gumerov, Qing-Huo Liu, Shihab Shamma, Howard Elman, Rama Chellappa, Yiannis Aloimonos, S.T. Raveendra
Institute for Advanced Computer Studies, University of Maryland, College Park, MD, 20742

 Popular version of paper 4aPP11
 Presented Thursday Morning, 11:15 a.m., December 7, 2000
 ASA/NOISE-CON 2000 Meeting, Newport Beach, CA
 Work supported by NSF.

There are many scientific, commercial and entertainment applications for 3-D or spatial sound. An ideal virtual spatial audio system would produce the illusion of hearing sounds as if you were actually present in the room. The 3-D PC-soundcards and the home theater systems that are now available are able to place sounds far to the left, right, and even behind the listener. However, accurate and controllable placement of sounds in all three dimensions -- left and right, up and down, near and far -- is beyond the ability of current technology. All three dimensions must be controlled to produce the virtual audio needed for virtual reality. Our research is directed at making this possible.

Hearing scientists have shown that - in principle, at least - it is possible to make sounds appear to come from any desired location. Furthermore, this can be done using only two signals - the sound reaching each of the ears. By properly controlling the sounds sent through headphones to the left and the right ears, the experience of being in a 3-D sound space can be reproduced. By using clever digital signal processing techniques, audio engineers have shown that the same effects can be produced using only two loudspeakers.

The secret to creating these effects emerged from careful study of the cues that humans use to locate a sound source.  The most familiar cue is the so-called interaural time difference, the difference in the times at which the sound waves coming from a source reach our two ears.  However, the interaural time difference is by no means the only cue.  Although it accounts for much of our left/right perception, it does not explain our up/down or our near/far perception.
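To give a feel for the size of the interaural time difference, here is a minimal sketch based on Woodworth's classic rigid-sphere approximation. This model is brought in for illustration and is not taken from the work described here; the head radius and speed of sound are assumed typical values, and the function name is hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature
HEAD_RADIUS = 0.0875    # m, assumed typical adult head radius


def interaural_time_difference(azimuth_deg, head_radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Woodworth's spherical-head estimate of the ITD in seconds.

    azimuth_deg is the horizontal angle of a distant source, 0 straight
    ahead and 90 directly to one side (intended for -90 to 90 degrees).
    """
    theta = math.radians(azimuth_deg)
    return (head_radius / c) * (theta + math.sin(theta))


for az in (0, 30, 60, 90):
    itd_us = interaural_time_difference(az) * 1e6
    print(f"azimuth {az:3d} deg -> ITD {itd_us:6.1f} microseconds")
```

Even for a source directly to one side, the difference in this model is well under a millisecond. The formula contains no term for elevation or distance, which is consistent with the point above that the interaural time difference alone cannot explain up/down or near/far perception.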

It turns out that we use not only the sound traveling directly from the source to our ear canals, but also the sound that reaches us via other more indirect paths, after being scattered off our external ears, heads, and bodies, as well as walls, floors, and other surfaces in the surrounding environment.  It is this scattering process that endows the received waves with cues that the brain deciphers and processes to locate the source.

Fig. 1: Sound from a source that reaches our ears includes both the sound that arrives along a direct path and sound that is scattered by the environment and our external ears and body. This scattering process amplifies or attenuates different frequency components, producing cues that enable our brains to locate the source.


When a sound wave is scattered off an object, its behavior is governed by the ratio of the object size to the wavelength of the sound. When the object size is much larger than the wavelength, the sound bounces off like a ray of light hitting a mirror. However, when the object size and the wavelength of the sound wave are comparable, the scattered wave is much more complex. Furthermore, the various components of the sound that reach the ear via different paths interact with one another. For ordinary sounds that contain many different frequencies, the result is a change in the balance between different frequencies that we are subconsciously able to attribute to the location of the source.  The process by which the brain does this localization is a subject of intense research by neuropsychologists.
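The size-to-wavelength comparison can be made concrete with a few lines of arithmetic. The sketch below uses the relation wavelength = speed of sound / frequency; the object sizes are rough illustrative guesses, not measurements from this work, and the function names are hypothetical.

```python
SPEED_OF_SOUND = 343.0  # m/s in air, assumed


def wavelength_m(frequency_hz, c=SPEED_OF_SOUND):
    """Wavelength in meters of a sound wave at the given frequency."""
    return c / frequency_hz


def size_to_wavelength_ratio(size_m, frequency_hz, c=SPEED_OF_SOUND):
    """Ratio of object size to wavelength: >> 1 is mirror-like, ~1 is complex."""
    return size_m / wavelength_m(frequency_hz, c)


# Rough, illustrative sizes in meters (assumptions for this example only).
OBJECTS = {"outer ear": 0.06, "head": 0.18, "wall": 3.0}

for frequency_hz in (200, 2000, 10000):
    for name, size_m in OBJECTS.items():
        ratio = size_to_wavelength_ratio(size_m, frequency_hz)
        print(f"{frequency_hz:6d} Hz  {name:9s}  size/wavelength = {ratio:7.2f}")
```

At 10,000 Hz the wavelength is about 3.4 cm, which is comparable to the assumed size of the outer ear, so the ear scatters high-frequency sound in the complex way described above. At 200 Hz the wavelength is about 1.7 m, far larger than the ear.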

The function that encodes the relative amplification or attenuation of the sound at a particular frequency is called the "Head Related Transfer Function," and is often abbreviated as the HRTF.  The secret to rendering virtual audio accurately is to obtain the HRTF accurately.  However, there is a major complication. HRTFs are different for different people. Because we all have different sized and shaped ears, heads and bodies, we all have different HRTFs.  Just as we need our own customized eyeglasses to see properly, we need our own customized HRTFs to hear spatial sound properly.  Failure to account for individual differences leads to problems such as elevation errors and high rates of front/back confusion.
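In practice, an HRTF is often applied in the time domain, where its counterpart is a short filter called a head-related impulse response (HRIR), one per ear. The sketch below shows the idea with plain convolution. The impulse responses are made-up toy numbers, not measured data, and the function names are hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Filter one mono signal with a pair of HRIRs, giving left and right ear signals."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy HRIRs (invented for illustration): the source is to the listener's left,
# so the right-ear response arrives two samples later and is quieter.
hrir_left = [1.0, 0.3]
hrir_right = [0.0, 0.0, 0.5, 0.2]

click = [1.0, 0.0, 0.0]
left, right = render_binaural(click, hrir_left, hrir_right)
print("left ear: ", left)
print("right ear:", right)
```

A real system would use a measured or computed pair of filters for each source direction, and those filters would be specific to the individual listener for the reasons given above.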

Furthermore, for a true perception of a localized source, the cues must change with the motion of the listener. If they do not change properly, the listener can become confused, and can even experience the sound as coming from within his or her head.  Thus, the HRTF must not only be customized to the individual listener, but it must also change correctly when the listener moves.

Our research is directed at finding effective ways to solve these two key problems: (a) quick and accurate determination of individual HRTFs, and (b) quick and accurate ways to modify the HRTFs dynamically in accordance with a listener's movements, including changes that arise from changes in the listener's posture. In both cases, we want to take advantage of advances in computer vision research and technology to solve these problems.

In recent years, powerful computer vision methods have made it possible to use computers to determine many physical properties of objects from digital video. To determine individual HRTFs, we will use computer vision techniques to obtain accurate 3-D surface models of a person's torso, head, and ears.  We will then calculate the HRTF by using numerical methods to solve the basic equations of physics that govern the propagation of sound waves.  We expect that this approach will be much more rapid, accurate and convenient than the acoustic methods currently used to measure HRTFs.  In spite of the massive computation required, advances in high-performance computing now make such an approach possible. Additionally, part of our research will be devoted to developing even faster computational methods.
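For readers who want to see the mathematics, one common way to pose such a calculation is sketched below. This formulation is an assumption on our part for illustration; the text above does not spell out the equations. It treats the body surface S as acoustically rigid and works one frequency f at a time, with p the sound pressure, c the speed of sound, k the wavenumber, n the direction normal to the surface, p_ear the pressure at the ear with the listener present, and p_free the pressure at the position of the center of the head with the listener absent.

```latex
\nabla^2 p(\mathbf{x}) + k^2 p(\mathbf{x}) = 0,
\qquad k = \frac{2\pi f}{c},
\qquad
\left.\frac{\partial p}{\partial n}\right|_{S} = 0,
\qquad
H(f) = \frac{p_{\text{ear}}(f)}{p_{\text{free}}(f)}
```

The first equation is the Helmholtz equation for sound at a single frequency, the boundary condition says that sound does not penetrate the rigid surface, and the ratio H(f) is the HRTF for the source direction under consideration.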

To modify the HRTFs dynamically, we will use computer vision techniques to track people as they move, and to modify the HRTF accordingly.  In our initial work, we will neglect any possible effects of the limbs and torso, and will focus on the changes that stem solely from head translations and rotations.  However, we also will decompose the HRTFs into parts that separately account for the contributions of different parts of the body. This decomposition will provide an ability to account for the important effects of postural changes. 
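As a rough illustration of the head-tracking step, the sketch below computes where a fixed source lies relative to the listener's head after the head translates and rotates. It is a simplified 2-D example under assumed conventions (positions in meters on a floor plan, yaw measured counterclockwise, positive azimuth meaning the source is to the listener's left), and the function name is hypothetical.

```python
import math


def locate_source_relative_to_head(source_xy, head_xy, head_yaw_deg):
    """Return (azimuth_deg, distance_m) of a source as seen from the head.

    head_yaw_deg is the direction the listener faces, measured
    counterclockwise from the +x axis. The azimuth is wrapped into
    the range [-180, 180).
    """
    dx = source_xy[0] - head_xy[0]
    dy = source_xy[1] - head_xy[1]
    world_bearing = math.degrees(math.atan2(dy, dx))
    relative = (world_bearing - head_yaw_deg + 180.0) % 360.0 - 180.0
    return relative, math.hypot(dx, dy)


source = (1.0, 0.0)
for yaw in (0.0, 45.0, 90.0):
    azimuth, distance = locate_source_relative_to_head(source, (0.0, 0.0), yaw)
    print(f"head yaw {yaw:5.1f} deg -> source azimuth {azimuth:6.1f} deg, distance {distance:.2f} m")
```

Each time the tracker reports a new head position and orientation, the renderer would look up or compute the HRTF for the updated azimuth and distance, so that the source appears to stay fixed in the room while the listener moves.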

Fig. 2: Ears vary greatly in shape from person to person, resulting in very different sound scattering characteristics and, consequently, highly individual HRTFs.

We believe that this research will lead to effective methods for measuring individual HRTFs and modifying them dynamically, thereby providing both the static and the dynamic cues that will produce true 3-D virtual audio.  Such an accomplishment will be a major advance in the use of information technology in virtual audio.

 An overview of our long-term project, as well as preliminary results that compare the numerical techniques to be used with analytical and experimental results for scattering from simple shapes, will be presented at the conference.

