To design a useful multipurpose database for environmental sounds, it is necessary to have a fuller understanding of the nature of environmental sounds, what they represent for humans, the factors in environmental sound perception, and how that perception may be similar or different across listeners. This information will guide sound selection by database users: researchers, designers, and musicians. The ability to use experimentally obtained perceptual criteria in sound selection, in addition to a thorough description of the technical characteristics of the sounds, constitutes a unique feature of the present database. Although what Gaver termed "everyday listening" [5] is a frequent activity, the nature of the experience has been remarkably underscrutinized, both in common discourse and in the scientific literature, and alternative definitions exist [6, 7]. This is changing, as the numerous articles in this volume attest, but even our basic understanding of environmental sounds still has large lacunae.
Thus, unlike speech and music, there is no generally agreed-upon formal structure or taxonomy for environmental sounds. Instead, several prominent approaches to environmental sound classification have been advanced over the last several decades [5–7]. A major initial contribution to environmental sound research is the framework of Acoustic Ecology developed by Schafer [6], who proposed the notion of the soundscape as the totality of all sounds in the listener's dynamic environment. Truax [7] extended this framework in his Acoustic Communication model, in which speech, music, and soundscape (the latter comprising all other sounds in the environment) are treated as parts of the same acoustic communication continuum: sounds' acoustic variety increases from speech to soundscape, while their rule-governed perceptual structure, temporal density of information, and specificity of meaning all increase from soundscape to speech. Importantly, the Acoustic Communication approach also treats listening as an active process of interacting with one's environment and distinguishes among several levels of listening, such as listening-in-search (specific acoustic cues are actively sought in the sensory input), listening-in-readiness (the listener is ready to respond to specific acoustic cues if they appear but is not actively focusing attention on finding them), and background listening (the listener is not expecting significant information or otherwise actively processing background sounds). The theoretical constructs of the Acoustic Communication model are intuitive and appealing, and they have been practically useful in the design of functional and aesthetically stimulating acoustic environments [8]. However, because the model is directed mostly toward general aspects of the acoustic dynamics of listener/environment interactions (cultural, historical, industrial, and political factors and changes at the societal level), more specific perceptual models are still needed to investigate the perception of individual environmental sounds.
In his seminal piece, "What in the World Do We Hear?" [5], Gaver attempted to construct a descriptive framework based on what we listen for in everyday sounds. He examined previous efforts, such as libraries of sound effects on CD, which were largely grouped by the context in which a sound would appear, for example, "Household sounds" or "Industry sounds." While such groupings can be useful for people making movies or other entertainment, he found them not very useful for a general framework: "For instance, the categories are not mutually exclusive; it is easy to imagine hearing the same event (e.g., a telephone ringing) in an office and a kitchen. Nor do the category names constrain the kinds of sounds very much."
Instead, he looked to experimental results of his own and others [9–12], which suggested that in everyday listening we tend to focus on the sources of sounds rather than on their acoustic properties or context. He reasoned that in a hierarchical framework, "Superordinate categories based on types of events (as opposed to contexts) provide useful clues about the sorts of sounds that might be subordinate, while features and dimensions are a useful way of describing the differences among members of a particular category." Inspired by the ecological approach of Gibson [13], he drew a sharp distinction between "musical listening", which focuses on the attributes of the "sound itself", and "everyday listening", in which "the perceptual dimensions and attributes of concern correspond to those of the sound-producing event and its environment, not to those of the sound itself."
Based on the physics of sound-producing events and on listeners' descriptions of sounds, Gaver proposed a hierarchical description of basic "sonic events," such as impacts, aerodynamic events, and liquid sounds, which is partially diagrammed in Figure 1. From these basic-level events, more complex sound sources are formed, such as patterned sources (repetition of a basic event), complex sources (more than one sort of basic-level event), and hybrid sources (involving more than one basic sort of material).
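To make the shape of such a hierarchy concrete, the following minimal sketch represents basic sonic events and the composite source types as simple Python records; the class and field names are illustrative assumptions, not Gaver's own notation.

```python
# A minimal sketch of Gaver-style event categories; class and field names
# are illustrative assumptions, not Gaver's notation.
from dataclasses import dataclass
from typing import List

@dataclass
class BasicEvent:
    kind: str      # basic-level event: "impact", "aerodynamic", or "liquid"
    material: str  # e.g., "wood", "metal", "water"

@dataclass
class PatternedSource:
    event: BasicEvent  # repetition of a single basic event
    repetitions: int   # e.g., bouncing as a series of impacts

@dataclass
class ComplexSource:
    events: List[BasicEvent]  # more than one sort of basic-level event

@dataclass
class HybridSource:
    events: List[BasicEvent]  # events involving more than one basic material

# Example: a dripping faucet modeled as a patterned liquid event.
drip = PatternedSource(BasicEvent("liquid", "water"), repetitions=12)
```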
Gaver's taxonomy is well thought out, plausible, and fairly comprehensive, in that it includes a wide range of naturally occurring sounds. Naturally, some sounds are excluded: the author himself mentions electrical sounds, fire, and speech. In addition, since the verbal descriptions were culled from a limited sample of listener responses, one must be tentative in generalizing them to a wider range of sounds. Nevertheless, as a first pass it is a notable effort at providing an overall structure to the myriad of different environmental sounds.
Gaver provided very limited experimental evidence for this hierarchy. However, a number of experiments, both previous and subsequent, have supported or been consistent with this structuring [12, 14–18], although some modifications have been proposed, such as including vocalizations as a basic category (which Gaver himself considered). It was suggested in [16] that although determining the source of a sound is important, the goal of the auditory system is to enable an appropriate response to the source, which would also necessarily include extracting details of the source, such as its size and proximity, and contextual factors that would modulate such a response. Free categorization results for environmental sounds in [16] showed that the most common basis for grouping sounds was source properties, followed by common context, followed by simple acoustic features (such as Pitched or Rumbling) and emotional responses (e.g., Startling/Annoying and Alerting). Evidence was provided in [19] that auditory cognition is better described by the actions involved in a sound-emitting source, such as "dripping" or "bouncing", than by properties of their causal objects, such as "wood" or "hollow". A large, freely accessible database of newly recorded environmental sounds has been designed around these principles, containing numerous variations on basic auditory events (such as impacts or rolling); it is available at http://www.auditorylab.org/.
As a result, the atomic, basic-level entry for the present database will be the source of the sound. In keeping with the definition provided earlier, the source will be considered to be the objects involved in a sound-producing event, with enough description of the event to disambiguate the sound. For instance, if the source is described as a cat, it is necessary to include "mewing", "purring", or "hissing" to provide a more exact description. There are several potential ways to describe the source, from physical objects to perceptual and semantic categories. Although the present definition does not allow for complete specificity, it strikes a balance between specificity and brevity and allows for sufficient generalization that imprecise searches can still recover the essential entries, as the sketch below illustrates.
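The sketch below shows how such a source entry might be keyed; the field names and example filenames are assumptions for illustration, not the database's actual schema.

```python
# Hypothetical source-entry structure; field names are illustrative
# assumptions, not the actual schema of the present database.
from dataclasses import dataclass

@dataclass
class SourceEntry:
    source: str    # objects involved in the sound-producing event
    event: str     # disambiguating description of the event
    filename: str  # technical metadata would accompany each entry

entries = [
    SourceEntry("cat", "mewing", "cat_mew_01.wav"),
    SourceEntry("cat", "purring", "cat_purr_01.wav"),
]

# An imprecise search on the source alone still recovers both entries.
matches = [e for e in entries if e.source == "cat"]
```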
Of course, sounds are almost never presented in isolation but rather in an auditory scene, in which temporally overlapping linear mixtures of sounds enter the ear canal and are parsed by the listener. Many researchers have studied the regularities of sound sources that listeners can exploit to separate out sounds, such as common onsets and coherent frequency transitions, among several other aspects (see, e.g., [20]). The inverse process, integration of several disparate sources into a coherent "scene", has been much less studied, as has the effect of auditory scenes on the perception of individual sounds [21–23]. As a result, the database will also contain auditory scenes, which consist of numerous sound sources bound together by a common temporal and spatial context (i.e., recorded simultaneously). Some examples are a street scene in a large city, a market, a restaurant, or a forest. For scenes, the context is the atomic unit for the description.
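A scene entry can then invert the source-entry description, keying on the shared recording context and listing the constituent sources as attributes; again, the names below are illustrative assumptions rather than the database's actual schema.

```python
# Hypothetical scene-entry structure; names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SceneEntry:
    context: str  # the atomic unit for describing a scene
    sources: List[str] = field(default_factory=list)  # co-recorded sources
    filename: str = ""

street = SceneEntry(
    context="city street",
    sources=["traffic", "footsteps", "voices"],
    filename="street_scene_01.wav",
)
```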
Above these basic levels, multiple hierarchies can be constructed based on the needs and desires of the users; these hierarchies are detailed in the next section.