Evaluating Environmental Sounds from a Presence Perspective for Virtual Reality Applications
© Rolf Nordahl. 2010
Received: 15 January 2010
Accepted: 31 August 2010
Published: 11 October 2010
We propose a methodology to design and evaluate environmental sounds for virtual environments, combining physically modeled sound events with recorded soundscapes. Physical models provide feedback to users' actions, while soundscapes reproduce the characteristic soundmarks of an environment. In this particular case, physical models are used to simulate the act of walking in the botanical garden of the city of Prague, while soundscapes are used to reproduce the particular sound of the garden. The auditory feedback was combined with a photorealistic reproduction of the same garden. A between-subjects experiment with 126 participants was conducted, involving six experimental conditions that included both uni- and bimodal stimuli (auditory and visual). The auditory stimuli consisted of several combinations of auditory feedback, including static sound sources as well as self-induced interactive sounds simulated using physical models. Results show that subjects' motion in the environment is significantly enhanced when dynamic sound sources and the sound of egomotion are rendered in the environment.
1. Introduction
The simulation of environmental sounds for virtual reality (VR) applications has reached such a level of sophistication that most of the sonic phenomena that occur in the real world can be reproduced using physical principles or procedural algorithms. However, little research has so far been performed on how such sounds can enhance the sense of presence and immersion when inserted in a multimodal environment. Although sound is one of the fundamental modalities in the human perceptual system, it still offers a large area of exploration for researchers and practitioners of VR. While research has provided different results concerning multimodal interaction among the senses, several questions remain on how to exploit audiovisual phenomena to their full potential when building interactive VR experiences.
Following the growing computational capabilities of evolving technology, VR research has moved from a focus on a single modality (e.g., the visual modality) toward new ways to elevate the perceived feeling of being virtually present and to engineer new technologies that may offer a higher degree of immersion, here understood as presence considered as immersion.
Engineers have been interested in audio-visual interaction from the perspective of optimizing the perceived quality offered by technologies [4, 5]. Furthermore, studies have shown that adding audio can increase the perceived quality of lower-quality visual displays. Likewise, researchers in neuroscience and psychology have been interested in the multimodal perception of the auditory and visual senses. Studies have addressed issues such as how the senses interact, which influences they have on each other (predominance), and audio-visual phenomena such as the cocktail party effect and the ventriloquism effect.
The design of immersive virtual environments is a challenging task, and cross-modal stimulation is an important tool for achieving this goal. However, the visual modality is still dominant in VR technologies. A common approach when designing multimodal systems consists of adding other sensorial stimulations on top of the existing visual rendering. This approach presents several disadvantages and does not always exploit the full potential that greater consideration of auditory feedback can provide.
2. Auditory Presence in Virtual Environments
The term presence has been used in many different contexts, and it still needs clarification. The phenomenon has recently been elevated to a status where it is used as a qualitative metric for the evaluation of virtual reality systems. Most researchers involved in presence studies agree that presence can be defined as a feeling of "being there" [12, 13]. Presence can also be understood as a "perceptual illusion of non-mediation" or as a "suspension of disbelief" about being located in environments that are not real.
Lombard and Ditton outline different approaches to presence: it can be viewed as social richness, realism, transportation, and immersion. Sound has received relatively little attention in presence research, although the importance of auditory cues in enhancing the sense of presence has been outlined by several researchers [11, 14, 15]. Most of the research relating to sound and presence has examined the role of sound versus no sound and the importance of the spatial qualities of the auditory feedback.
Earlier experiments aimed to characterize the influence of sound quality, sound information, and sound localization on users' self-ratings of presence. The sounds used in that study were mainly binaurally recorded ecological sounds, such as footsteps, vehicles, and doors. Two factors in particular were found to correlate strongly and positively with sensed presence: sound information and sound localization.
This research implies two important considerations when designing sounds for virtual environments (VEs): sounds should be informative and enable listeners to imagine the original (or intended) scene naturally, and sound sources should be well localizable by listeners.
Another related line of research has been concerned with the design of the sound itself and its relation to presence [17, 18]. Taking the approach of ecological perception, it has been proposed that expectation and discrimination are two possibly presence-related factors: expectation is the extent to which a person expects to hear a specific sound in a particular place, and discrimination is the extent to which a sound helps to uniquely identify a particular place. The results of these studies suggested that, when a visual stimulus generated a certain type of expectation, sound stimuli meeting this expectation induced a higher sense of presence than sound stimuli that mismatched it. These findings are especially interesting for the design of computationally efficient VEs, since they suggest that only those sounds that people expect to hear in a certain environment need to be rendered.
In previous research, we described a system which provides interactive auditory feedback made of a combination of self-sounds and soundscape design. The goal was to advocate the use of interactive auditory feedback as a means to enhance the motion of subjects and their sense of presence in a photorealistic virtual environment. We focused both on ambient sounds, defined as sounds characteristic of a specific environment which the user cannot modify, and on interactive sounds of subjects' footsteps, which were synthesized in real time and controlled by the actions of users in the environment. The idea of rendering subjects' self-sound while walking on different surfaces is motivated by the fact that walking conveys enactive information which manifests itself predominantly through haptic and auditory cues. In this situation, we consider visual cues as playing an integrating role and as forming the context of the experiments. In this paper, we extend our research by providing an in-depth evaluation of the system and its ability to enhance the sense of presence and the motion of subjects in a virtual environment. We start by describing the context of this research, that is, the BENOGO project, whose goal was to design photorealistic virtual environments where subjects could feel present. We then describe the multimodal architecture and the experiments designed to assess the role of interactive auditory feedback in enhancing the motion of subjects in a virtual environment as well as their sense of presence.
3. The BENOGO Project
Among the different initiatives to investigate how technology can enhance sense of immersion in virtual environments, the BENOGO project (which stands for "being there without going") (http://www.benogo.dk), completed in 2005, had as its main focus the development of new synthetic image-rendering technologies (commonly referred to as Image-Based Rendering (IBR)) that allowed photorealistic 3D real-time simulations of real environments.
The project aimed at providing a high degree of immersion to subjects for perceptual inspection through artificially created scenarios based on real images. Throughout the project, the involved researchers wished to contribute to a multilevel theory of presence and embodied interaction, defined by three major concepts: immersion, involvement, and fidelity. At the same time, the project aimed at improving the IBR technology in those aspects that were found most significant in enhancing the feeling of presence. The BENOGO project was concerned with the reproduction of real scenery, which might even be taken from surroundings familiar to the subject using the technology. The idea behind this approach is that, in the future, people can be offered visits to sites without having to travel there physically.
The BENOGO project makes extensive use of IBR, that is, the photographic reproduction of real scenes. This technique depends on extensive collections of visual data and therefore makes considerable demands on data processing and storage capabilities. One drawback of reconstructing images with IBR is that no motion can be present in the environment when the pictures are captured. This implies that the reconstructed scenarios are static over time. Depth perception and direction vary according to the motion of the user, who is able to explore the environment through 360° inside the so-called region of exploration (REX). However, no events happen in the environment, which makes it rather uninteresting to explore.
A recurring problem of IBR technology for VEs has been that subjects in general showed very little movement of head and body. This is mostly because only visual stimuli were provided. Drawing on film studies and current practice, practitioners emphasize that auditory feedback such as the sound of footsteps gives characters weight and thereby invites the audience to interpret them as embodied.
We hypothesize that the movement rate can be significantly enhanced by introducing self-induced auditory feedback produced in real time by subjects while walking in the environment.
We start by describing the content of the multimodal simulation, and we then describe how the environment was evaluated.
4. Designing Environmental Sounds for Virtual Environments
The main goal of the auditory feedback was both to reproduce the soundscape of the botanical garden of Prague and to allow subjects to hear the sound of their own footsteps while walking in the environment. The implementation of the two situations is described in the following.
4.1. Simulating the Act of Walking
We are interested in combining sound synthesis based on physical models with soundscape design in order to simulate the act of walking on different surfaces and to place the resulting sounds in a context. Specifically, we developed real-time sound synthesis algorithms which simulate the act of walking on different surfaces. Such sounds were simulated using a synthesis technique called modal synthesis.
Every vibrating object can be considered as an exciter which interacts with a resonator. In our situation, the exciters are the subjects' shoes, and the resonators are the different walking surfaces. In modal synthesis, every mode (i.e., every resonance) of a complex object is identified and simulated using a resonator. The different resonances of the object are connected in parallel and excited by different contact models, which depend on the interaction between the shoes and the surfaces. Modal synthesis has been implemented to simulate the impact of a shoe with a hard surface.
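As an illustration of the parallel-resonator idea, the following minimal sketch renders one impact as a sum of exponentially decaying sinusoids, one per mode. It is not the Max/MSP implementation used in the experiments: the function name, the mode frequencies, decay rates, and gains are hypothetical values chosen only to make the example runnable, and the excitation is reduced to a single impulse scaled by the impact force.

```python
import math

def modal_impact(modes, force, sample_rate=44100, duration=0.25):
    """Render one impact as a bank of parallel decaying resonances.

    modes: list of (frequency_hz, decay_per_second, relative_gain) tuples.
    force: impact strength; it scales the amplitude of every mode.
    """
    n = int(sample_rate * duration)
    out = [0.0] * n
    for freq, decay, gain in modes:
        for i in range(n):
            t = i / sample_rate
            # Each mode is an exponentially damped sinusoid; modes are summed in parallel.
            out[i] += force * gain * math.exp(-decay * t) * math.sin(2 * math.pi * freq * t)
    return out

# Hypothetical mode data for a hard surface (assumed, not taken from the paper).
HARD_SURFACE_MODES = [(180.0, 40.0, 1.0), (420.0, 60.0, 0.5), (950.0, 90.0, 0.25)]
step = modal_impact(HARD_SURFACE_MODES, force=0.8)
```

In a real-time setting the same structure is usually realised with recursive two-pole filters rather than by evaluating each sinusoid sample by sample, but the parallel summation of modes is the same.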
In the case of stochastic surfaces, such as the impact of a shoe on gravel, we implemented physically informed sonic modeling (PhISM).
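The stochastic approach can be sketched roughly as follows: the energy of the step decays over time, random collisions inject that energy into short noise bursts, and the bursts are coloured by a resonant filter. This is a simplified illustration in the spirit of such particle models, not the synthesizer used in the study; the function name, the decay constants, the collision probability, and the resonance settings are all assumed values.

```python
import math
import random

def phism_gravel(force, sample_rate=44100, duration=0.25, seed=0):
    """Simplified stochastic footstep: random collisions exciting one resonator."""
    rng = random.Random(seed)
    n = int(sample_rate * duration)
    system_decay = 0.9995    # per-sample decay of the overall step energy (assumed)
    sound_decay = 0.95       # per-sample decay of each collision burst (assumed)
    collision_prob = 0.02    # chance of a particle collision per sample (assumed)
    freq, radius = 3000.0, 0.96   # resonance centre and pole radius (assumed)
    a1 = -2.0 * radius * math.cos(2 * math.pi * freq / sample_rate)
    a2 = radius * radius
    energy = force
    burst = 0.0
    y1 = y2 = 0.0
    out = []
    for _ in range(n):
        energy *= system_decay
        if rng.random() < collision_prob:
            burst += energy          # a collision adds the current system energy
        burst *= sound_decay
        x = burst * rng.uniform(-1.0, 1.0)   # noise scaled by the burst level
        y = x - a1 * y1 - a2 * y2            # two-pole resonant filter
        y2, y1 = y1, y
        out.append(y)
    return out

gravel_step = phism_gravel(force=0.8)
```

Because the collisions are random, every step sounds slightly different even at the same impact force, which is the property that makes this family of models attractive for aggregate surfaces.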
The footstep synthesizer was built by first analyzing recordings of footsteps on different surfaces, obtained from the Hollywood Edge Sound Effects library (http://www.hollywoodedge.com). For each recorded set of sounds, single steps were isolated and analyzed. The main goal of the analysis was to identify an average amplitude envelope for the different footsteps, to extract the main resonances, and to isolate the excitation.
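The envelope part of this analysis can be illustrated with a small sketch. The exact analysis procedure is not detailed in the text, so the following simply assumes that each isolated step is a list of samples, takes the peak of the rectified signal in consecutive windows, and averages the resulting envelopes across steps; both function names and the window approach are assumptions.

```python
def amplitude_envelope(samples, window):
    """Peak of the rectified signal in consecutive, non-overlapping windows."""
    return [max(abs(s) for s in samples[i:i + window])
            for i in range(0, len(samples), window)]

def average_envelope(steps, window):
    """Average the envelopes of several isolated steps, truncated to the shortest one."""
    envelopes = [amplitude_envelope(step, window) for step in steps]
    length = min(len(env) for env in envelopes)
    return [sum(env[i] for env in envelopes) / len(envelopes) for i in range(length)]

# Two tiny made-up "steps", used only to show the call.
example = average_envelope([[0.0, -1.0, 0.5, 0.25], [0.0, 0.5, 0.5, 0.0]], window=2)
```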
Despite its simplicity, the shoe controller was effective in enhancing the user's experience, as described later. While subjects navigated the environment, their sandals came into contact with the floor, thereby activating the pressure sensors. A microcontroller converted the corresponding pressure value into an input parameter that was read by the real-time sound synthesizer implemented in Max/MSP (http://www.cycling74.com). The sensors were wirelessly connected to the microcontroller, as shown in Figure 2, and the microcontroller was connected to a laptop PC.
The continuous pressure value was used to control the force of the impact of each foot on the floor and thereby to vary the temporal evolution of the synthesized sounds. Physically based synthesis enhanced the level of realism and variety compared to sampled sounds, since the footstep sounds depended on the impact force of the subjects and therefore varied dynamically. In the simulation of the botanical garden, we used two different surfaces: concrete and gravel. The concrete surface was used most of the time and corresponded to walking around the visitors' floor. The gravel surface was used when subjects stepped outside the visitors' floor.
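The mapping from sensor reading to synthesis parameters can be sketched as below. The actual sensor range and the geometry of the visitors' floor are not given in the text, so the 10-bit reading range, the threshold, the rectangular floor bounds, and both function names are assumptions made for illustration.

```python
def pressure_to_force(raw, raw_min=0, raw_max=1023, threshold=0.05):
    """Normalise a raw pressure reading (assumed 10-bit) to an impact force in 0..1.

    Readings below the threshold are treated as "no step" and return 0.0.
    """
    level = (raw - raw_min) / (raw_max - raw_min)
    level = min(1.0, max(0.0, level))
    return level if level >= threshold else 0.0

def surface_for_position(x, y, floor_bounds):
    """Pick the surface model: concrete on the visitors' floor, gravel outside it."""
    x_min, y_min, x_max, y_max = floor_bounds
    inside = x_min <= x <= x_max and y_min <= y <= y_max
    return "concrete" if inside else "gravel"

# Hypothetical floor of 2 x 2 units centred on the origin.
force = pressure_to_force(700)
surface = surface_for_position(0.3, -0.4, (-1.0, -1.0, 1.0, 1.0))
```

The returned force would then scale the excitation of whichever surface model is selected, so that a harder step produces a louder and longer sound.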
Both surfaces were rendered through an 8-channel surround sound system.
4.2. Simulating Soundscapes
In order to reproduce the characteristic soundmarks of a botanical garden, a dynamic soundscape was built. The soundscape was designed by creating an 8-channel soundtrack in which subjects could control the position of different sound sources.
In the laboratory shown in Figure 4, eight speakers were positioned in a parallelepipedal configuration. Current commercially available sound delivery methods are based on sound reproduction in the horizontal plane. We instead delivered sound over eight speakers in this configuration, thereby implementing full 3D capabilities. With this method, we were able to position both static sound elements and dynamic sound sources linked to the position of the subject. Moreover, we maintained a configuration similar to that of other virtual reality facilities such as CAVEs, where eight-channel surround is presently implemented, so that future experiments can be performed with higher-quality visual feedback. This is why 8-channel sound rendering was chosen over, for example, binaural rendering.
Three kinds of auditory feedback were implemented:
(1) "static" soundscape, reproduced at a maximum peak of 58 dB, measured C-weighted with slow response; this soundscape was delivered through the 8-channel system;
(2) dynamic soundscape with moving sound sources, developed using the vector base amplitude panning (VBAP) algorithm, reproduced at a maximum peak of 58 dB, measured C-weighted with slow response;
(3) auditory simulation of ego-motion, reproduced at 54 dB (this has been recognised as the proper output level).
The content of the soundscape in the first two conditions was the same. The soundscape contained typical environmental sounds present in a garden, such as birds singing and insects flying. It was designed by making a recording in the real botanical garden in Prague and reproducing similar content with sound effects from the Hollywood Edge Sound Effects library.
In the first and second conditions, the soundscape varied only in the way it was rendered. In the second condition, the position of the sound sources was dynamic and controlled by the motion of the user, who was wearing a head tracker as described below. In the third condition, the dynamic soundscape was augmented with an auditory simulation of ego-motion, in which subjects generated in real time the sound of their own footsteps while walking in the garden.
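To make the panning step more concrete, the following sketch computes VBAP gains for one triplet of speakers in 3D: the source direction is expressed as a non-negative combination of the three speaker direction vectors, and the gains are normalised to constant power. The speaker directions in the example are placeholders, not the positions used in the laboratory, and the selection of the active triplet among the eight speakers is left out.

```python
import math

def _det3(a, b, c):
    """Determinant of the 3x3 matrix whose rows are a, b, and c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))

def vbap_gains(source, speakers):
    """Gains g such that g[0]*l1 + g[1]*l2 + g[2]*l3 points toward the source."""
    l1, l2, l3 = speakers
    det = _det3(l1, l2, l3)
    if abs(det) < 1e-12:
        raise ValueError("speaker directions must not be coplanar")
    # Cramer's rule for the 3x3 linear system.
    gains = [
        _det3(source, l2, l3) / det,
        _det3(l1, source, l3) / det,
        _det3(l1, l2, source) / det,
    ]
    if min(gains) < -1e-9:
        raise ValueError("source lies outside this speaker triplet")
    gains = [max(0.0, g) for g in gains]
    norm = math.sqrt(sum(g * g for g in gains))
    if norm == 0.0:
        raise ValueError("source direction must be non-zero")
    return [g / norm for g in gains]   # constant-power normalisation

# Placeholder triplet along the coordinate axes; source halfway between the first two.
gains = vbap_gains((1.0, 1.0, 0.0), [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)])
```

For a moving source such as an insect, the gains would be recomputed whenever the source direction relative to the tracked head position changes, switching to a different triplet when a gain would become negative.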
5. A Multimodal Architecture
In order to combine the auditory and the visual feedback with the shoe controller, two computers were installed in the laboratory. One computer ran the visual feedback, and the other ran the auditory feedback together with the interactive shoes. A Polhemus tracker (IsoTrak II3), attached to the head-mounted display, was connected to the computer running the visual display and tracked the position and orientation of the user in 3D. The computer running the visual display was connected to the computer running the auditory display via a TCP socket. An RME Fireface 800 interface, connected to the sound computer, delivered sound to the eight channels; the wireless shoe controller was connected to the same computer. This controller, developed specifically for these experiments, detected the footsteps of the subjects and mapped them to the real-time sound synthesis engine. The different hardware components were connected together as shown in Figure 6.
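The message format exchanged between the two computers is not described in the text. As a sketch of how tracker data could be sent over a TCP connection, the example below packs a head pose into a fixed-size binary message; the field order, the float32 encoding, and the function names are assumptions.

```python
import struct

# Assumed layout: x, y, z, azimuth, elevation, roll as little-endian float32 values.
POSE_FORMAT = "<6f"
POSE_SIZE = struct.calcsize(POSE_FORMAT)

def encode_pose(position, orientation):
    """Pack a 3D position and a 3-angle orientation into one fixed-size message."""
    return struct.pack(POSE_FORMAT, *position, *orientation)

def decode_pose(payload):
    """Unpack a message produced by encode_pose into (position, orientation)."""
    values = struct.unpack(POSE_FORMAT, payload)
    return values[:3], values[3:]

message = encode_pose((1.0, 2.0, 0.5), (90.0, 0.0, -45.0))
position, orientation = decode_pose(message)
```

A fixed message size is convenient on a TCP stream because the receiver can simply read POSE_SIZE bytes at a time; the sender would pass each message to the socket's sendall method.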
The visual stimulus was provided by a standard PC running SUSE Linux 10. This computer ran the BENOGO software using the REX disc called Prague Botanical Garden.
The head-mounted display (HMD) used was a VRLogic V82. It features dual 1.3 diagonal active-matrix liquid crystal displays with a per-eye resolution of (640 × 3) × 480, that is, 921,600 color elements, equivalent to 307,200 triads. Furthermore, the HMD provides a field of view of 60° diagonal. The tracker used (Polhemus IsoTrak II3) provides a latency of 20 milliseconds with a refresh rate of 60 Hz.
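The display and tracker figures quoted above can be checked with a few lines of arithmetic. The variable names are chosen here for illustration; the only derived quantity not stated in the text is the interval between tracker updates, which follows directly from the 60 Hz refresh rate.

```python
# Per-eye display: 640 x 480 pixels, each made of three colour elements (one triad).
width_px, height_px, elements_per_triad = 640, 480, 3
color_elements = width_px * elements_per_triad * height_px
triads = color_elements // elements_per_triad

# Tracker: a 60 Hz refresh rate means one new sample roughly every 16.7 ms.
tracker_rate_hz = 60
update_interval_ms = 1000 / tracker_rate_hz
```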
The audio system was built on a standard PC running MS Windows XP SP 2. All sound was run through Max/MSP, and a Fireface 800 from RME (http://www.rmeaudio.com/english/firewire/) was used as the output module. Sound was delivered by eight Dynaudio BM5A speakers (http://www.dynaudioacoustics.com). Figure 5 shows a view of the surround sound lab where the experiments were run. In the center of the picture, the tracker's receiver is shown.
6. Evaluating the Architecture
Subjects were exposed to one of the following six conditions:
(1) Visual only: this condition had only unimodal (visual) input.
(2) Visual with footstep sounds: in this condition, the subjects had bimodal perceptual input (audio and visual), comparable to our earlier research.
(3) Visual with full sound: in this condition, subjects received full visual and audio input. The condition included static sound design and 3D sound (using the VBAP algorithm) as well as rendering of sounds from ego-motion (the subjects triggered sounds via their footsteps).
(4) Visual with fully sequenced sound: this condition was closely related to condition 3, but it was run in three stages. It started with bimodal perceptual input (audio and visual) with static sound design. After 20 seconds, the rendering of the sounds from ego-motion was introduced. After 40 seconds, the 3D sound started.
(5) Visual with sound + 3D sound: this condition introduced bimodal (audio and visual) stimuli in the form of static sound design and 3D sound (the VBAP algorithm using the sound of a mosquito as sound source). In this condition, no rendering of ego-motion was conducted.
(6) Visual with music: in this condition, the subjects were exposed to bimodal stimuli (audio and visual), with the sound being a piece of music. This served as a control condition, to ascertain that increases or decreases in motion were not caused by sound in general. It also allowed us to check whether the results recorded in the other conditions were valid, and thus to deduce how the specific sound design of the other experimental conditions affects the subjects.
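The six conditions, including the staged onsets of the fully sequenced condition, can be summarised as a small lookup table. The layer names and condition keys below are labels invented for this sketch; the onset times of 20 and 40 seconds are those stated above, and treating the onset instant itself as active is an assumption.

```python
# Onset time in seconds of each audio layer, per condition (keys are illustrative labels).
CONDITIONS = {
    "visual_only": {},
    "visual_w_foot": {"ego_motion": 0},
    "full": {"static": 0, "3d": 0, "ego_motion": 0},
    "full_seq": {"static": 0, "ego_motion": 20, "3d": 40},
    "sound_3d": {"static": 0, "3d": 0},
    "music": {"music": 0},
}

def active_layers(condition, elapsed_s):
    """Return the sorted names of the audio layers playing after elapsed_s seconds."""
    layers = CONDITIONS[condition]
    return sorted(name for name, onset in layers.items() if elapsed_s >= onset)

early = active_layers("full_seq", 10)
late = active_layers("full_seq", 45)
```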
Table: Six different conditions to which subjects were exposed during the experiments. The number in the second column refers to the auditory feedback previously described.
Table: Motion analysis for the different conditions considering only the 2D motion.
Table 4: Motion analysis for the different conditions including vertical movement.
As Table 4 shows, the results are consistent with the analysis that does not take vertical motion into account. The trends, seen from the conditions ranked according to mean values, indicate that the addition of auditory stimuli has a positive effect on motion. For both head movement and complete movement, the mean values rank the conditions similarly. A statistical analysis shows that in the conditions Full and Full seq, the average body motion is significantly higher than in the condition Visual only.
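The precise motion measure is not defined in this text. One plausible way to quantify motion from tracker data, with and without the vertical component, is the length of the path travelled by the tracked head, as sketched below; the choice of path length, the function name, and the convention that the third coordinate is vertical are assumptions.

```python
import math

def path_length(samples, include_vertical=True):
    """Total distance travelled along a sequence of (x, y, z) tracker samples.

    With include_vertical=False only the horizontal (x, y) displacement is counted.
    """
    total = 0.0
    for (x0, y0, z0), (x1, y1, z1) in zip(samples, samples[1:]):
        dz = (z1 - z0) if include_vertical else 0.0
        total += math.sqrt((x1 - x0) ** 2 + (y1 - y0) ** 2 + dz ** 2)
    return total

# Made-up track: a horizontal move followed by a purely vertical move.
track = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
motion_2d = path_length(track, include_vertical=False)
motion_3d = path_length(track)
```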
8. Measuring Presence
As a final analysis of the six experimental conditions, we investigated the qualitative measurements of the feeling of presence. In the tests for all conditions, we used all questions from the SVUP questionnaire. The SVUP examines four items, of which the most important in relation to our thesis is the feeling of presence. The questionnaire addresses it by asking the subjects four questions that all relate to the feeling of presence. The answers to these questions are then averaged for each subject, resulting in what is referred to as the presence index. The questions relate to the naturalness of interaction with the environment and to the sense of presence and involvement in the experience. All answers were given on a Likert scale from 1 to 7 (1 represents not at all and 7 represents very much).
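The presence index described above is simply the mean of a subject's four presence-related answers. A minimal sketch, with a hypothetical function name and made-up answers:

```python
def presence_index(answers):
    """Mean of the four presence-related answers, each on a 1 to 7 Likert scale."""
    if len(answers) != 4:
        raise ValueError("expected exactly four answers")
    if any(not 1 <= a <= 7 for a in answers):
        raise ValueError("answers must lie between 1 and 7")
    return sum(answers) / len(answers)

# Made-up answers from one subject.
index = presence_index([4, 5, 3, 4])
```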
Table: Average presence index for the six experimental conditions.
It is also interesting to note the answers to one question from the SVUP questionnaire, namely, how much subjects felt that the experience was influenced by their own motion, rated on a scale from 1 to 100. The condition Visual w. foot has the highest rating, and it differs significantly from the second highest ranked condition (Full seq). This suggests that the footstep synthesizer works as intended, since users realize that they are controlling the feedback. Moreover, it is reasonable to assume that, when no soundscape is present, users can focus more attention on the footstep sounds and therefore recognize the tight coupling between the act of walking and the footstep sounds in the environment.
An overall analysis of variance on the results shows that no significant differences were noticeable among the different conditions.
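For readers who want to reproduce this kind of test, a textbook one-way analysis of variance can be computed as sketched below. The text does not state which software or exact procedure was used, so this is only the standard F statistic; the function name and the sample data are made up. If all 126 subjects were included across the six conditions, the degrees of freedom would be 5 and 120.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Made-up presence indices for three small groups.
f_value, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

The resulting F value is then compared against the F distribution with the two degrees of freedom to obtain a p-value.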
One factor that may affect the overall results derived from the subjects' self-reports is that the experiments were conducted as a between-subjects exploratory study. The fact that each subject experienced only one condition was optimal in the sense that it minimized issues such as subjects becoming accustomed to the VE or finding it increasingly boring.
However, since the subjects had no other condition as a frame of reference, they reported their initial feeling without having anything to measure it against. This is a plausible explanation for the SVUP presence index results and suggests that a between-subjects design is not adequate for this particular presence index. In contrast, the quantitative data from the motion tracking show clear and significant results, and the between-subjects strategy is well suited to such experiments. Overall, mean and median values lie close to the center of the scale, with a small standard deviation, which means that users in general gave an average evaluation and that no condition was rated significantly higher on the Likert scale.
In this paper, we investigated the role of dynamic sounds in enhancing motion and presence in virtual reality. Results show that 3D sounds with moving sound sources and auditory rendering of ego-motion significantly enhance the quantity of motion of subjects visiting the VR environment.
It is very interesting to notice that it is not the individual auditory stimulus that affects the increase of motion of the subjects, but rather it is the combination of soundscapes, 3-dimensional sound, and auditory rendering of one's own motion that induces a higher degree of motion.
We also investigated whether the sense of presence increased when interactive sonic feedback was provided to the users. Results from the SVUP presence questionnaire do not show any statistically significant increase in presence.
We are currently extending these results to environments where the visual feedback is more dynamic and interactive, such as computer games and virtual environments reproduced using 3D graphics.
Permission to make digital or hard copies of all or part of this paper for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
- Stanney KM, et al.: Handbook of Virtual Environments: Design, Implementation, and Applications. Lawrence Erlbaum Associates; 2002.
- Stein BE, Meredith MA, Wolf S: The Merging of the Senses. The MIT Press, Cambridge, Mass, USA; 1993.
- Lombard M, Ditton T: At the heart of it all: the concept of presence. Journal of Computer-Mediated Communication 1997, 3(2).
- Dixon NF, Spitz L: The detection of auditory visual desynchrony. Perception 1980, 9(6):719-721. 10.1068/p090719
- Rihs S: The influence of audio on perceived picture quality and subjective audio-video delay tolerance. Proceedings of the MOSAIC Workshop Advanced Methods for the Evaluation of Television Picture Quality, 1995.
- Storms RL, Zyda MJ: Interactions in perceived quality of auditory-visual displays. Presence: Teleoperators and Virtual Environments 2000, 9(6):557-580. 10.1162/105474600300040385
- Kohlrausch A, van de Par S: Auditory-visual interaction: from fundamental research in cognitive psychology to (possible) applications. Proceedings of the IST/SPIE Conference on Human Vision and Electronic Imaging, 1999.
- Arons B: A review of the cocktail party effect. Journal of the American Voice I/O Society 1992, 12.
- Handel S: Perceptual Coherence: Hearing and Seeing. Oxford University Press, Oxford, UK; 2006.
- Durlach N, Mavor A: Virtual Reality: Scientific and Technological Challenges. National Academy Press, Washington, DC, USA; 1995.
- Slater M: A note on presence terminology. Presence Connect 2003, 3(3).
- Lessiter J, Freeman J, Keogh E, Davidoff J: A cross-media presence questionnaire: the ITC-sense of presence inventory. Presence 2001, 10(3):282-297. 10.1162/105474601300343612
- Witmer BG, Singer MJ: Measuring presence in virtual environments. A presence questionnaire. Presence: Teleoperators and Virtual Environments 1998, 7(3):225-240. 10.1162/105474698565686
- IJsselsteijn WA: Presence in Depth. Technische Universiteit Eindhoven, Eindhoven, The Netherlands; 2004.
- Gilkey RH, Weisenberger JM: The sense of presence for the suddenly deafened adult—implications for virtual environments. Presence: Teleoperators and Virtual Environments 1995, 4(4):357-363.
- Ozawa K, Chujo Y, Suzuki Y, Sone T: Psychological factors involved in auditory presence. Acoustical Science and Technology 2003, 24(1):42-44. 10.1250/ast.24.42
- Chueng P, Marsden P: Designing auditory spaces to support sense of place: the role of expectation. In Proceedings of the CSCW Workshop: The Role of Place in Shaping Virtual Community, 2002. Citeseer.
- Serafin S, Serafin G: Sound design to enhance presence in photorealistic virtual reality. In Proceedings of the International Conference on Auditory Display, 2004. Citeseer; 6-9.
- Nordahl R: Increasing the motion of users in photorealistic virtual environments by utilizing auditory rendering of the environment and ego-motion. Proceedings of Presence 2006, 57-62.
- Adrien JM: The missing link: modal synthesis. Representations of Musical Signals 1991, 269-298.
- Cook PR: Physically informed sonic modeling (PhISM): synthesis of percussive sounds. Computer Music Journal 1997, 21(3):38-49. 10.2307/3681012
- Cruz-Neira C, Sandin DJ, DeFanti TA, Kenyon RV, Hart JC: Cave. Audio visual experience automatic virtual environment. Communications of the ACM 1992, 35(6):65-72.
- Begault DR: 3-D Sound for Virtual Reality and Multimedia. AP Professional, Boston, Mass, USA; 1994.
- Nordahl R: Auditory rendering of self-induced motion in virtual reality. In M. Sc. project report. Department of Medialogy, Aalborg University Copenhagen; 2005.
- Mozart WA: Piano Quintet in E flat, K. 452, 1. Largo Allegro Moderato. Philips Digital Classics, 446 236-2, 1987.
- Västfjäll D, Larsson P, Kleiner M: Development and validation of the Swedish viewer-user presence questionnaire (SVUP). 2000.
- Maurer TJ, Pierce HR: A comparison of Likert scale and traditional measures of self-efficacy. Journal of Applied Psychology 1998, 83(2):324-329.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.