Fast and perceptually convincing simulation of room acoustics: Shoebox rooms with bells and whistles. 

Oliver Buttler

Torben Wendt

Steven van de Par

Stephan D. Ewert


Medical Physics and Acoustics and Cluster of Excellence Hearing4all,

University Oldenburg

Carl-von-Ossietzky-Straße 9-11

26129 Oldenburg, GERMANY


Popular version of paper 3aAA7,

“Perceptually plausible room acoustics simulation including diffuse reflections”

Presented Wednesday morning, May 9, 2018, 10:50-11:05 AM, Location: NICOLLET C

175th ASA Meeting, Minneapolis



Today’s audio technology allows us to create virtual environments in which the listener feels immersed in the scene. This technology is currently used in entertainment and computer games, but also in research, where the function of a hearing aid algorithm or the behavior of humans in complex and realistic situations is investigated. To create such immersive virtual environments, convincing computer sound is just as important as convincing computer graphics. We can easily experience the richness of the acoustic world when we close our eyes. Sound reaches us from all directions, so we can perceive a sound source from different directions or even around a corner, and we might even be able to hear whether we are in a concert hall or a bathroom, based on the acoustics alone.

To create immersive and convincing acoustics in virtual reality applications, computationally efficient methods are required. Over the last decades, the development towards today’s astonishing real-time computer graphics was strongly driven by the first-person computer game genre, whereas comparable techniques in computer sound received much less attention until recently. One reason might be that the physics of sound propagation and acoustics is at least as complicated as that of light propagation and illumination, and computing power has so far been spent mainly on computer graphics. Moreover, from early on, computer graphics focused on creating visually convincing results rather than on physics-based simulation, which allowed for tremendous simplifications of the computations. Methods for simulating acoustics, however, often focused on physics-based simulation to predict how a planned concert hall or classroom might sound. These methods disregarded perceptual limitations of our hearing system that might allow for significant simplifications of the acoustic simulations.

Our perceptually plausible room acoustics simulator [RAZR, 1] achieves a computationally efficient acoustics simulation through drastic simplifications with respect to physical accuracy while still producing a perceptually convincing result. To achieve this, RAZR approximates the geometry of real rooms by a simple, empty shoebox-shaped room and calculates the first sound reflections from the walls as if the walls were mirrors, creating image sources for a sound source in the room [2]. Later reflections, which we perceive as reverb, are treated in an even more simplified way: only the temporal decay of sound energy and the binaural distribution at our two ears are considered, using a so-called feedback delay network [FDN, 3].
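The mirror idea behind image sources [2] can be illustrated with a minimal Python sketch for first-order reflections only. This is not RAZR's implementation: the function names, the room dimensions, the positions, the speed of sound and the single frequency-independent wall reflection coefficient are all assumptions chosen for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def first_order_image_sources(room, source):
    """Mirror the source at each of the six walls of a shoebox room.

    room   -- (Lx, Ly, Lz) in metres; walls lie at 0 and L along each axis
    source -- (x, y, z) position of the sound source inside the room
    """
    images = []
    for axis in range(3):
        for wall in (0.0, room[axis]):
            image = list(source)
            # Reflecting a coordinate p at a wall located at w gives 2*w - p.
            image[axis] = 2.0 * wall - source[axis]
            images.append(tuple(image))
    return images


def arrival(point, receiver, gain=1.0):
    """Return (delay in seconds, amplitude) for sound travelling in a
    straight line from point to receiver, with 1/distance attenuation."""
    distance = math.dist(point, receiver)
    return distance / SPEED_OF_SOUND, gain / distance


# Hypothetical example values, not taken from the article.
room = (5.0, 4.0, 3.0)
source = (1.0, 1.0, 1.5)
receiver = (4.0, 3.0, 1.5)
reflection_coeff = 0.8  # assumed, identical for all walls

direct_delay, direct_amp = arrival(source, receiver)
print(f"direct sound: {direct_delay * 1000:.2f} ms, amplitude {direct_amp:.3f}")
for image in first_order_image_sources(room, source):
    delay, amp = arrival(image, receiver, gain=reflection_coeff)
    print(f"image at {image}: {delay * 1000:.2f} ms, amplitude {amp:.3f}")
```

Each image source behaves like an additional loudspeaker outside the room: its straight-line distance to the listener equals the length of the reflected path, which is why every reflection arrives later and weaker than the direct sound. Higher-order reflections would be obtained by mirroring the images again.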

Although we demonstrated that good perceptual agreement with real non-shoebox rooms is indeed achieved [1], the empty shoebox-room simplification might be too inaccurate for rooms that diverge strongly from this assumption, e.g., a staircase or a room with multiple interior objects. In such rooms, multiple reflections and scattering occur, which we simulate in a perceptually convincing manner by temporal smearing of the simulated reflections. A single parameter was introduced to quantify the deviation from an empty shoebox room and thus the amount of temporal smearing. We demonstrate that a perceptually convincing room acoustics simulation can be obtained for sounds like music and for impulses similar to a hand clap. Given its tremendous simplifications, we believe that RAZR is well suited for real-time acoustics simulation even on mobile devices, where virtual sounds could be embedded in augmented reality applications.
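One way to picture temporal smearing is to spread each sharp reflection in a simulated impulse response over a short burst of samples. The Python sketch below is purely illustrative and is not the method used in RAZR: the function name, the linear mapping from the deviation percentage to the burst length, the `max_smear_samples` limit and the random-sign burst are all assumptions.

```python
import random


def smear_reflections(impulse_response, deviation_percent,
                      max_smear_samples=64, seed=0):
    """Convolve an impulse response with a short unit-energy noise burst.

    deviation_percent -- assumed deviation from the empty shoebox (0 to 100);
                         0 returns the input unchanged.
    The linear mapping from percentage to burst length is hypothetical.
    """
    length = 1 + round(max_smear_samples * deviation_percent / 100.0)
    if length == 1:
        return list(impulse_response)

    rng = random.Random(seed)
    # Random-sign burst scaled so that its total energy equals one, which
    # keeps the energy of an isolated reflection unchanged.
    scale = length ** -0.5
    burst = [rng.choice((-1.0, 1.0)) * scale for _ in range(length)]

    out = [0.0] * (len(impulse_response) + length - 1)
    for n, value in enumerate(impulse_response):
        if value == 0.0:
            continue
        for k, b in enumerate(burst):
            out[n + k] += value * b
    return out


# A single sharp reflection at sample 2 (hypothetical input).
sharp = [0.0] * 10
sharp[2] = 1.0
smeared = smear_reflections(sharp, deviation_percent=20)
print(len(sharp), len(smeared))
```

With a deviation of 0% the reflection stays a single sharp click; with larger values it turns into a longer, softer burst, which is the kind of change that removes the crackling impression described for Sound 1 and Sound 2.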




Figure 1. (Fig1_Ewert_shoeboxes.jpg) Examples of the simplification of different real room geometries to shoeboxes in RAZR. The red boxes indicate the shoebox approximation. The green box in panel c) indicates a second, coupled volume attached to the lower main volume. While the rooms in panels a) and b) might be well approximated by the empty shoebox, the rooms in panels c) and d) show more severe deviations, which were accounted for by a single parameter estimating the deviation from the shoebox in percent and by applying the corresponding temporal smearing to the reflections.


Figure 2. (Fig2_Ewert_perception.jpg) Perceptually rated differences between real room recordings (A: large aula, C: corridor, S: seminar room) and simulated rooms with a hand-clap-like sound source (pulse). Different perceptual attributes are shown in the panels. The error bars indicate inter-subject standard deviations. Depending on the attribute, ordinate scales range from “less pronounced” to “more pronounced” or use semantically fitting descriptors. The different symbols show the amount of deviation from the empty shoebox assumption as a percentage. With a deviation of 20%, the critical attributes in the lower panel are rated near zero and thus show good correspondence with the real room. The remaining overall difference is mainly caused by differences in tone color, which can be easily addressed.


Figure 3. (Fig3_Ewert_vavelab.jpg, see first page) The virtual audio-visual environment lab at the University of Oldenburg features 86 loudspeakers and 8 subwoofers arranged in a full spherical setup to render 3-dimensional simulated sound fields. The foam wedges at the walls create an anechoic environment, so that the sound created by the loudspeakers is not affected by unwanted sound reflections at the walls.


Sound 1. (Sound_Ewert_Aula_shm_000.wav) Simulation of the large aula without the assumption of interior objects and multiple sound reflections on those objects. Although the sound is not distorted, an unnatural and crackling sound impression is obvious at the beginning.


Sound 2. (Sound_Ewert_Aula_shm_020.wav) Simulation of the large aula with the assumption of 20% of the empty space filled with objects. The sound is more natural and the crackling impression at the beginning is gone.




[1] T. Wendt, S. van de Par, and S. D. Ewert, “A computationally-efficient and perceptually-plausible algorithm for binaural room impulse response simulation,” Journal of the Audio Engineering Society, 62(11):748–766, 2014.

[2] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, 65(4):943–950, 1979.

[3] J.-M. Jot and A. Chaigne, “Digital delay networks for designing artificial reverberators,” in 90th Audio Engineering Society Convention, Audio Engineering Society, 1991.
