ASA Lay Language Papers
163rd Acoustical Society of America Meeting


Auditory Depth Control: Investigating associated physical parameters
that make a 3-D sound image project out of your TV



Sungyoung Kim -- sungyoung@beat.yamaha.co.jp
Hiraku Okumura -- hiraku@beat.yamaha.co.jp
Sound & IT Development Division, Yamaha Corporation
Hamamatsu, 430-8650 Japan

Makoto Otani -- otani@cs.shinshu-u.ac.jp
Faculty of Engineering, Shinshu University,
Nagano, 380-8553 Japan

Popular Version of Paper 1aHT5
Presented Monday morning, May 14, 2012
163rd ASA Meeting, Hong Kong

Recent advances in visual information technology have culminated in the mass production of three-dimensional (3-D) visual displays. While the underlying technology has been around for a long time, the aesthetic accomplishment of the recent movie "AVATAR" caught the public's attention and accelerated the delivery of 3-D hardware and software to consumers. The resulting images allow viewers to perceive the relative and absolute depths of visual objects.

For auditory objects, however, it has been difficult to convincingly control perceived depth, in particular between a screen (front speaker) and the listening position. Previously, the authors proposed a new method that controls and locates auditory images near a listener, providing listeners with a coherent sense of 3-D-rendered content. Specifically, we were able to create an auditory image that projects out of a TV just as a 3-D visual object does. The method utilized a prototype electrostatic loudspeaker located above the listening position (on a ceiling, for example). The loudspeaker is thin (1 mm), light (400 g/m2), and flexible (it can be folded, rolled, and printed on), and therefore it can be conveniently mounted on a wall or ceiling, as seen in Figure 1.


Figure 1. Photo of an electrostatic loudspeaker used to control auditory depth. Coupled with spectral modification, this thin and light electrostatic loudspeaker generates an auditory image near a listener's head when it is placed above the listening position.

Compared with a conventional loudspeaker, this loudspeaker made the perceived distance of a sound source shorter than its physical distance. In particular, when placed above the listener, it generated an image inside the listener's head. This phenomenon often occurs when we listen with a pair of headphones, and it is known as inside-the-head localization (IHL). With additional signal processing that eliminated the spectral cues signaling the loudspeaker's elevation, listeners reliably perceived an auditory image located around their heads. We used this near auditory image to control perceived depth between a screen and the listening position.

To better understand why the loudspeaker generated an auditory image near a listener, the authors investigated physical parameters that highlight the difference between a conventional (spherical-wave) and an electrostatic (planar-wave) loudspeaker. Specifically, we wanted to find physical parameters that changed with distance for the conventional loudspeaker yet remained constant for the electrostatic loudspeaker. Candidate parameters included the interaural level difference (ILD), the interaural time difference (ITD), the interaural phase difference (IPD), and the variance of group delays of head-related transfer functions (HRTFs).
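For readers curious how two of these binaural cues are computed, the sketch below estimates the ITD and ILD from a pair of head-related impulse responses. The impulse responses are synthetic toy signals (all delays and gains are illustrative, not measured data), and the estimators are standard textbook ones rather than the paper's exact analysis: cross-correlation for the ITD and a broadband energy ratio for the ILD.

```python
import numpy as np

fs = 48000  # sample rate (Hz)
n = 512

# Toy head-related impulse responses: the left-ear signal arrives
# 20 samples earlier and slightly louder than the right-ear signal.
# (Delays and gains are illustrative, not measured data.)
h_left = np.zeros(n)
h_left[50] = 1.0
h_right = np.zeros(n)
h_right[70] = 0.8

def itd_ild(h_left, h_right, fs):
    """Estimate ITD (ms) by cross-correlation and ILD (dB) from
    broadband energy. Positive ITD means the left-ear signal leads."""
    corr = np.correlate(h_right, h_left, mode="full")
    lag = np.argmax(corr) - (len(h_left) - 1)
    itd_ms = lag / fs * 1e3
    ild_db = 10 * np.log10(np.sum(h_left ** 2) / np.sum(h_right ** 2))
    return itd_ms, ild_db

itd, ild = itd_ild(h_left, h_right, fs)
print(f"ITD = {itd:.3f} ms, ILD = {ild:.2f} dB")
# -> ITD = 0.417 ms, ILD = 1.94 dB
```

A real analysis would of course use measured HRTF pairs and band-limited estimates, but the same two quantities are being compared.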

To extract these parameters, we measured HRTFs of the two loudspeakers using a spherical microphone (Schoeps KFM6) at three distances (0.5, 1, and 2 m) in an anechoic chamber. The electrostatic loudspeaker was 60 cm wide and 60 cm tall. The conventional loudspeaker was custom-made, incorporating a small driver unit (FOSTEX FF85K). Although it would have been ideal to position the loudspeakers along the vertical axis, they were placed along the horizontal axis because of the chamber's limited size.

Subsequently, we numerically simulated HRTFs using the boundary element method (BEM), which can account for reflecting objects in the acoustic field. We measured the surface geometry of a head-and-torso model using an optical 3-D digitizer and produced two computational head models, with and without shoulders. In addition, we simulated two sound-source positions (in front of and above) for the head model with shoulders. For each condition, two wave types, spherical and planar, were simulated, resulting in a total of six simulations. The two wave types were assumed to imitate the radiation patterns of the conventional and electrostatic loudspeakers, respectively. The distances of the spherical sources were identical to the measured distances, and the distance of the planar-wave source was set at 1.8 m.

The analysis of the measured HRTFs showed that the electrostatic loudspeaker produced a smaller difference between the group delays at the two ear positions across frequency than did the conventional loudspeaker. Beyond this binaural difference, the group delays at each ear individually also varied less for the electrostatic loudspeaker. Figure 2 shows that the variance increased monotonically with sound-source distance for the conventional loudspeaker, while it remained relatively small for the electrostatic loudspeaker.
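To illustrate how an echo inflates this variance, the sketch below computes the variance of group delay for two toy impulse responses: a pure delay, whose phase is linear and whose group delay is therefore flat, and the same delay plus a weaker, later echo standing in for a body reflection, whose phase ripples. All delays and gains are illustrative, and the finite-difference group-delay estimate is a standard approximation, not the paper's exact analysis.

```python
import numpy as np

fs = 48000  # sample rate (Hz)
n = 1024

def group_delay_variance(h, fs, fmin=500.0, fmax=16000.0):
    """Variance (ms^2) of the group delay of impulse response h over a
    frequency band. Group delay is -d(phase)/d(omega), approximated
    here by a finite difference of the unwrapped FFT phase."""
    H = np.fft.rfft(h)
    freqs = np.fft.rfftfreq(len(h), d=1.0 / fs)
    phase = np.unwrap(np.angle(H))
    gd = -np.diff(phase) / np.diff(2.0 * np.pi * freqs)  # seconds
    f_mid = 0.5 * (freqs[:-1] + freqs[1:])
    band = (f_mid >= fmin) & (f_mid <= fmax)
    return np.var(gd[band] * 1e3)

# A pure delay: linear phase, flat group delay, variance ~ 0.
direct = np.zeros(n)
direct[40] = 1.0

# The same delay plus a weaker echo 30 samples later, standing in
# for a shoulder reflection (delay and gain are illustrative).
with_echo = direct.copy()
with_echo[40 + 30] += 0.5

v_direct = group_delay_variance(direct, fs)
v_echo = group_delay_variance(with_echo, fs)
print(v_direct, v_echo)  # the echo makes the variance clearly non-zero
```

The pure delay yields an essentially zero variance, while the echo produces phase ripples and a clearly non-zero variance, mirroring the mechanism the measurements point to.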


Figure 2. Variance of the group delays of measured HRTFs. The variance increased with sound-source distance for the conventional loudspeaker, while it remained relatively small for the electrostatic loudspeaker.

Analysis of the numerical simulations further revealed that the increase in group-delay variance was related to reflection from the torso. As Figure 3 shows, when the simulation included only the head model, the distance dependency disappeared. When the torso model was included, however, the values for the spherical-wave simulation increased with source distance. Furthermore, the increase was larger when the sound source was above, possibly because of direct reflections from the shoulders. For the planar-wave simulation, in contrast, the values remained small, consistent with the measured results.
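A simple geometric sketch shows why a shoulder reflection makes this cue distance-dependent for a spherical wave but not for a planar one. For a point source above the head, the gap between the direct path to the ear and the shoulder-reflected path shrinks as the source moves away, approaching the fixed offset of a plane wave. All coordinates below are illustrative, not the geometry of the paper's head-and-torso model.

```python
import numpy as np

c = 343.0  # speed of sound (m/s)

# Illustrative 2-D geometry (metres): ear at the origin, a shoulder
# reflection point 0.2 m below and 0.15 m to the side of the ear,
# and the source directly above the ear.
ear = np.array([0.0, 0.0])
shoulder = np.array([0.15, -0.2])

def spherical_extra_delay(d):
    """Extra delay (ms) of the shoulder-reflected path over the
    direct path, for a point source at height d above the ear."""
    src = np.array([0.0, d])
    direct = np.linalg.norm(src - ear)
    reflected = np.linalg.norm(src - shoulder) + np.linalg.norm(shoulder - ear)
    return (reflected - direct) / c * 1e3

for d in (0.5, 1.0, 2.0):
    print(f"point source at {d} m: extra delay {spherical_extra_delay(d):.4f} ms")

# A plane wave arriving from above reaches the shoulder a fixed
# 0.2 / c after the ear, so its extra delay never changes with the
# source distance.
plane_extra = (0.2 + np.linalg.norm(shoulder - ear)) / c * 1e3
print(f"plane wave: extra delay {plane_extra:.4f} ms")
```

The spherical-wave delay difference drifts with source distance while the plane-wave value is fixed, which mirrors the distance-dependent variance seen only in the spherical-wave simulations.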

This implies that listeners may rely on the variance of group delays to estimate distance, especially when the sound source is above. Moreover, because the planar wave radiating from the electrostatic loudspeaker generated smaller group-delay variances, together with only small variation in binaural cues (the ITD in particular), it was possible to create an auditory image near the listener's head regardless of the physical distance.


Figure 3. Variances of group delays for three numerical simulations: (1) a head-only model, (2) a head-and-torso model with the loudspeaker in front of the model, and (3) a head-and-torso model with the loudspeaker above. The variances for spherical waves are plotted according to their distances. Planar waves (one distance, 1.8 m) are plotted as single lines for comparison with the spherical waves.

In this study, we investigated the physical parameters that differentiate a sound image radiating from an electrostatic loudspeaker from one radiating from a conventional loudspeaker. A planar wave from an electrostatic loudspeaker is less influenced by reflections from a listener's shoulders, resulting in less variance of group delay and ITD and making the listener perceive a very near auditory image. With additional engineering manipulation, it was possible to move an auditory image continuously from the position of a TV to the listener's position. Using this auditory depth control, the authors are currently investigating the multimodal synchrony of near visual and auditory images, focusing on whether integrating the perceived depths of the two modalities allows viewers to experience increased realism and a sense of immersive presence.