Environmental Sound Perception: Metadescription and Modeling Based on Independent Primary Studies
© Nicolas Misdariis et al. 2010
Received: 2 June 2009
Accepted: 1 February 2010
Published: 10 May 2010
The aim of the study is to transpose and extend to a set of environmental sounds the notion of sound descriptors usually used for musical sounds. Four separate primary studies dealing with interior car sounds, air-conditioning units, car horns, and closing car doors are considered collectively. The corpus formed by these initial stimuli is submitted to new experimental studies and analyses, both for revealing metacategories and for defining more precisely the limits of each of the resulting categories. In a second step, the new structure is modeled: common and specific dimensions within each category are derived from the initial results and new investigations of audio features are performed. Furthermore, an automatic classifier based on two audio descriptors and a multinomial logistic regression procedure is implemented and validated with the corpus.
The purpose of this study is to transpose and extend the timbre description principles from musical sounds to environmental sounds, which are by nature considered as nonmusical. More precisely, environmental sounds were first defined by Vanderveer as "any possible audible acoustic event which is caused by motions in the ordinary human environment. (⋯) Besides ( ) having real events as their sources (⋯), ( ) [they] are usually more "complex" than laboratory sinusoids, (⋯), ( ) [they] are meaningful, in the sense that they specify events in the environment. (⋯), ( ) the sounds to be considered are not part of a communication system, or communication sounds, they are taken in their literal rather than signal or symbolic interpretation."
Within the restricted framework given by the scope of the primary research upon which the present study is based (see Section 2), the final aim is also to automate indexing and classification of environmental sounds. This goal is actually essential for sound quality measurement, as well as for further sound-content-based searching and browsing methods that use perceptual models of environmental sounds and often require measurements based on perceptually relevant acoustical similarities. Indeed, in the sound-quality field, most studies use acoustical/psychoacoustic descriptors such as loudness or roughness in order to explain unpleasantness ratings, whereas several studies have shown that no "universal'' descriptors exist for all classes of everyday sounds.
The work detailed in this article starts from four primary industrial studies on sound attributes dealing with sounds produced by car engines (Susini et al. [2–4], McAdams et al. ), air-conditioning units (Susini et al. ), car horns (Lemaitre et al. [7, 8]), and closing car doors (Parizet et al. ). The aim of these studies was to apply the methodology developed to study the timbre of musical sounds to a specific category of environmental sounds. The standard methodology used in these studies was based on a multidimensional scaling technique (MDS) applied to dissimilarity ratings.
The MDS technique is a fruitful tool for studying perceptual relationships among sounds and for determining the underlying auditory attributes used by participants to rate the perceived similarities among sounds. The term auditory attribute is used to describe the perceived properties or qualities of the sounds. Well-known auditory attributes include loudness, pitch, duration, sharpness, and so forth. The MDS technique does not require a priori assumptions concerning the number of auditory attributes or their nature, unlike semantic differential methods that use ratings along specific dimensions, such as roughness, for example. The MDS technique represents the perceived similarities in a low-dimensional Euclidean space (referred to as the perceptual space), so that the distances among the stimuli reflect the perceived dissimilarities. Each dimension of the space (called a perceptual dimension) is assumed to correspond to a perceptual continuum that is common to the whole set of sounds. Thus the main hypothesis with the MDS technique is that the sounds under study can be compared on auditory attributes that are shared by all sounds in the corpus. In other words, this technique is appropriate for characterizing sounds that are comparable along continuous auditory attributes of a homogeneous corpus composed of sounds produced by the same type of source (musical instruments, car sounds, vacuum cleaner noises, etc.). Considering musical sounds, the most common timbre space found by several studies (among which Grey, Krumhansl, McAdams et al., and Marozeau et al.) consisted of three dimensions, which were then correlated with acoustic features in order to associate a measurable sound parameter with each perceptual dimension of timbre.
The assumption of this approach rests on the model suggested by McAdams , who postulates that the recognition of sound sources arises from a process of analysis, computation, and extraction of a certain number of auditory features related to the acoustic parameters of the signals. Then, in many of these musical timbre studies, the three dimensions were found to be significantly correlated with a spectral feature that most often represented auditory brightness (energy distribution along the frequency scale), a temporal feature that characterized attack, and a spectro-temporal feature corresponding to spectral variations over time. The MDS technique has been shown to be an efficient tool for revealing and describing the previously unknown auditory attributes underlying the timbre of musical sounds.
In the present context, environmental sound studies, experimental data, analyses, and acoustic parameters have been reviewed and compared from the four initial studies. An investigation of these combined data was conducted, and an attempt to model the resulting structures on the basis of the primary results was made using generalized toolboxes (essentially, "Ircamdescriptor" from Peeters and "Auditory Toolbox" from Slaney) in order to unify, and in some cases to improve, the description of the initial data. Here we will first introduce and describe all the studies taken into account in this review, their stimulus sets, the experiments performed, the resulting perceptual spaces, and the correlated acoustic features. Then, in order to contribute to environmental sound perception, we will present the organization of this global stimulus set in terms of the main environmental sound classes, propose both interclass and intraclass structure descriptions, and finally initiate an automatic classification modeling approach within the restricted scope of the present study, but on the basis of perceptually relevant data and results gathered during its experimental parts.
2. Primary Studies
Because the following step of the methodology needs a small number of sounds to be experimentally feasible, a preliminary step is sometimes used in order to reduce the original corpus to an acceptable number of stimuli (usually not more than 20 samples). Free-sorting tasks and cluster analyses (see Section 3.1 for further details) are used to attain this goal. A free-sorting task consists in asking participants to sort the sounds of the set into as many categories as they wish. Thus, they identify the main categories of sounds that are studied and allow for the selection of representative subsets of sounds by homogeneously sampling across the categories.
A dissimilarity rating experiment collects the perceived dissimilarities among the sounds, which are then used as proximity data. It consists in asking the participants to rate directly the dissimilarity between both sounds of each possible pair within the set of sounds. The evaluation is made on a continuous scale labelled "Very Similar" at the left end and "Very Dissimilar" at the right end. It has the great advantage that it does not impose any predefined rating criteria on the listener.
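As a rough illustration of the size of such an experiment, the number of ratings grows quadratically with the number of sounds. The sketch below assumes one trial per unordered pair (no repeated or order-reversed presentations, which the text does not specify) and uses the set sizes quoted in Section 2 for studies A1, B, C, and D; the names `rating_pairs` and `trials_per_study` are illustrative only.

```python
from itertools import combinations
from math import comb


def rating_pairs(sound_ids):
    """Return every unordered pair of distinct sounds (one trial per pair)."""
    return list(combinations(sound_ids, 2))


# Set sizes quoted in the text; one trial per unordered pair is an assumption.
set_sizes = {"A1": 16, "B": 19, "C": 22, "D": 12}
trials_per_study = {study: comb(n, 2) for study, n in set_sizes.items()}

for study, n_trials in trials_per_study.items():
    print(study, n_trials)
```

Under this assumption, a 20-sound set would already require 190 ratings per participant, which is consistent with the stated need to keep the corpus to about 20 stimuli.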
The proximity data are modeled with a multidimensional scaling (MDS) analysis that fits distances in a geometrical space to the dissimilarity data. The dimensions of this space represent the perceptual dimensions underlying the proximities. Different levels of complexity exist in the MDS approach depending on the model and associated algorithm (see Appendix A); in the present case, two particular MDS techniques were used in the studies: the INDSCAL (Individual Differences Scaling) and CLASCAL (Latent-Class Approach) models.
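The idea of fitting distances to dissimilarities can be made concrete with a badness-of-fit measure. The sketch below is not the INDSCAL or CLASCAL estimation used in the primary studies; it only computes a stress-type index (in the spirit of Kruskal's stress-1, using the raw dissimilarities directly) for a candidate configuration. The function names and the toy data are assumptions made for illustration.

```python
import math


def euclidean_distances(coords):
    """Pairwise Euclidean distances between points of a configuration."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def stress1(coords, dissim):
    """Stress-type misfit between model distances and rated dissimilarities."""
    d = euclidean_distances(coords)
    n = len(coords)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    num = sum((d[i][j] - dissim[i][j]) ** 2 for i, j in pairs)
    den = sum(d[i][j] ** 2 for i, j in pairs)
    return math.sqrt(num / den)


# Toy example: three sounds placed in a 2-dimensional space.
coords = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
perfect = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]   # dissimilarities equal to distances
noisy = [[0, 3, 4], [3, 0, 6], [4, 6, 0]]     # one pair rated as more dissimilar

print(stress1(coords, perfect), stress1(coords, noisy))
```

A value of zero means that the configuration reproduces the dissimilarities exactly; an MDS algorithm searches for the coordinates (and, depending on the model, weights or specificities) that make such a misfit as small as possible.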
The final step of a timbre study is to give a physical interpretation of the perceptual dimensions revealed by the MDS analysis. This is usually done by submitting the perceptual dimensions to linear regression analyses with relevant acoustic features. Some of them are psychoacoustic descriptors, that is, acoustic features that have been found to correspond to auditory sensations. Models that compute psychoacoustic descriptors are usually based on a model of the peripheral auditory system.
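A minimal sketch of this interpretation step is given below, assuming that each candidate acoustic feature is compared with the coordinates of one perceptual dimension through a Pearson correlation (the coefficient underlying a simple linear regression). The helper names `pearson_r` and `best_feature`, and the toy values, are hypothetical and do not come from the primary studies.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_feature(dimension_coords, features):
    """Return (name, r) of the feature with the largest absolute correlation.

    `features` maps a feature name to one value per sound, in the same
    order as `dimension_coords`.
    """
    scores = {name: pearson_r(dimension_coords, vals) for name, vals in features.items()}
    name = max(scores, key=lambda k: abs(scores[k]))
    return name, scores[name]


# Toy data: coordinates of four sounds on one dimension, two candidate features.
coords_dim1 = [1.0, 2.0, 3.0, 4.0]
candidates = {"feature_a": [2.0, 4.0, 6.0, 8.0], "feature_b": [1.0, 0.0, 1.0, 0.0]}
print(best_feature(coords_dim1, candidates))
```

With only a dozen or so sounds per space, a high correlation can arise by chance when many candidate features are screened, so the significance level should be examined together with the coefficient.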
The main goal of this study was to analyze the timbre of the sounds of car interiors in a given driving condition from the driver/passenger point of view.
The sounds were recorded in 16 different vehicles at two different engine modes. The engine modes defined two substudies: study A1 involved sounds produced when the engine was running in 3rd gear at 4000 RPM (revolutions per minute) and study A2 involved sounds produced when the engine was running in 5th gear at 3500 RPM. A preliminary experiment showed that loudness was the main auditory cue used by the participants to rate the dissimilarity. Thus, in order to let other auditory attributes emerge, loudness was equalized. Both stimulus sets were composed of 16 stereophonic sounds that were 4.1 seconds in duration. Their levels, after loudness equalization, varied between 69 and 80 dB SPL (Sound Pressure Level).
For each engine mode stimulus set, a dissimilarity rating experiment was conducted with 30 participants.
2.1.4. Analysis and Results for Study A1
A CLASCAL analysis (see Appendix A) of the data yielded a 1-latent class, 3-dimensional space with specificities. Figures 13(a) to 13(c) represent the projections of the space, and Table 9 reports the correlation coefficients of the acoustic features best fitting the perceptual dimensions. The first dimension is correlated [ , ] with a feature corresponding to the relative balance of the harmonic (motor) and noise (air turbulence) components. The second dimension is correlated [ , ] with a variation of the spectral centroid with the frequency dimension represented in ERB-rate (see Appendix B for more details). The third dimension is significantly correlated [ , ] with an acoustic feature quantifying the spectral decrease of the harmonic part of the sound.
2.1.5. Analysis and Results for Study A2
A CLASCAL analysis (see Appendix A) yielded a 1-latent class, 2-dimensional space with specificities. Figure 14 represents the perceptual space and Table 10 reports the correlation coefficients of the acoustic features best fitting the two perceptual dimensions. The first dimension is correlated [ , ] with an acoustic feature conveying the relative balance between two groups of partials of the harmonic part of the signal (see Appendix B for more details). The second dimension is correlated [ , ] with the spectral centroid computed on the C-weighted version of the signal (see Appendix B for more details).
2.2. Study B. Interior Air-Conditioning Units 
This study focused on the sound quality of interior air-conditioning units.
The initial set consisted of 43 sounds produced by units of different brands. A free-sorting experiment was first conducted to select a homogeneous subset of sounds representative of the existing range for this type of sounds. The results of this experiment also showed that three categories were made mainly by grouping together sounds with similar loudness levels. As in study A, in order to prevent loudness from dominating the ratings (possibly masking more subtle effects), the sounds were selected in the category corresponding to a medium loudness level (average level: 46.5 dB SPL, 2.2 dB standard deviation). An informal experiment was then performed with only 5 participants to get an initial estimate of the perceptual space structure. The outcome of the MDS analysis was that the space was not homogeneously sampled. Therefore, synthesized sounds were added and redundant sounds were removed in order to produce a more homogeneously distributed stimulus set. The synthesized sounds were created on the basis of features of the sounds in the stimulus set, using a geometric interpolation within the space. The resulting stimulus set consisted of 19 sounds: 15 recordings of air-conditioning units and 4 synthesized sounds. They were all 5.9 seconds in duration with levels varying between 44 and 52 dB SPL.
The dissimilarity rating experiment was conducted with 50 participants.
2.2.4. Analysis and Results
A CLASCAL analysis (see Appendix A) of the dissimilarity ratings yielded a 5-latent class, 3-dimensional space with specificities. Figures 15(a) to 15(c) represent the projections of the 3-dimensional space, and Table 11 presents the correlation coefficients of the features that are the best correlated with the perceptual dimensions. The first dimension is correlated [ , ] with a feature corresponding to the relative balance of the harmonic (motor) and noise (air turbulence) components. The second dimension is correlated with a frequency-weighted variation of the spectral centroid of the noise component [ , ]. The third dimension is correlated with loudness [ , ]. Indeed, even though the selected sounds are in the same range of loudness, they were not equalized in loudness.
This study concerned the timbre of car horns in order to define specifications for the design of new sounds. The initial stimulus set consisted of 43 recordings of current car horn sounds. These sounds can be either monophonic (one note) or polyphonic (two or three notes to make a chord) and are produced by two different mechanisms: a metal plate or a folded horn that acts as a resonator and is attached to the membrane of an electroacoustic driver. Both produce very specific timbres. A preliminary sorting experiment highlighted 9 main categories of sounds connected with these different mechanisms and properties.
A sample of 22 sounds was chosen among the 9 categories. Among these 22 sounds, 13 were monophonic and 9 were polyphonic, 10 were produced by "plate" resonators and 12 by "horn" resonators. All sounds lasted between 0.6 and 2.2 seconds. Their levels varied between 63 and 77 dB SPL.
A dissimilarity rating experiment using this set of sounds was conducted with 41 participants.
2.3.4. Analysis and Results
The dissimilarity ratings were submitted to a CLASCAL analysis (see Appendix A), resulting in a 6-latent class, 3-dimensional space with specificities. Figures 16(a) to 16(c) represent the projections of the 3-dimensional space, and Table 12 reports the best-correlated features (see , for more details on the acoustic features). The first dimension is correlated [ , ] with roughness. The second dimension is correlated [ , ] with a variation of the spectral centroid integrating a perceptual approach to compute this parameter (ERB scale, see Marozeau et al.). The third dimension is correlated [ , ] with an acoustic feature related to the fine structure of the spectral envelope.
2.4. Study D. Car Door Closing 
The main goal was to study the timbre of car door closing sounds in the context of evaluating their sound quality.
An initial set of 27 stereophonic recordings (16 sounds from different cars and 11 sounds from two cars with modified seals) was submitted to a sorting experiment with 31 participants in order to select a representative subset of 12 sounds. Among these 12 sounds, 4 were recorded from cars with modified seals. The durations of the sounds varied between 0.3 and 0.5 seconds, and their levels varied between 66 and 84 dB SPL.
A dissimilarity rating experiment was conducted with 40 participants.
2.4.4. Analysis and Results
The dissimilarity data were submitted to an INDSCAL analysis (see Appendix A). A 3-dimensional space was found. Figures 17(a) to 17(c) represent its projections, and Table 13 reports the correlation coefficients of the features best correlated with the perceptual dimensions. The first dimension is correlated with a feature corresponding to sharpness, as defined by Aures  [ , ], as well as to the spectral centroid [ , ]. The second dimension is correlated [ , ] with an indicator related to the temporal evolution of instantaneous loudness, according to Zwicker's model . No descriptor was found that correlated significantly with the third dimension.
2.5. Comparisons and Discussion
Table summarizing the methodological context and main results for studies A1, A2, B, C, and D (see corresponding sections above and Appendix B for further details).
A1 (car interior): 16 sounds, 3rd gear, 4000 rpm; 3 dim., specif., 1 lat. class.; dim.1: RAPmv-A; dim.2: CGg-ERB; dim.3: Dec
A2 (car interior): 14 sounds, 5th gear, 3500 rpm; 2 dim., specif., 1 lat. class.; dim.1: rad-2N/0.5N; dim.2: CGg-C
B (air-conditioning units): 19 sounds (4 synthesized); 3 dim., specif., 5 lat. class.; dim.3: N (Loudness)
C (car horns): 3 dim., specif., 6 lat. class.; dim.2: Spectral centroid; dim.3: Spectral deviation
D (car door closing): dim.1: Spectral centroid; dim.2: Cleanness indicator
(i) In every study, at least one dimension in the perceptual space resulting from the MDS analysis is found to be related to a spectral centroid feature, usually describing the "brightness" of a sound. This "brightness" attribute therefore seems unavoidable when comparing two sounds from any of these kinds of sources. However, this attribute takes different forms across the studies:
- It can be computed with a frequency weighting representing the variation in sensitivity of the human ear over the audible frequency range at different presentation levels (A-, B-, and C-weightings).
- It can introduce a much more sophisticated model of the hearing process (ERB filters).
- It is sometimes only computed on a particular part of the signal (noise part).
The subsidiary questions are: will all these "brightness" predictors be as efficient for all studies? If not, is there a particular calculation that fits all of the spaces equally well?
(ii) In 3 studies, relevant acoustic features appear to include separate calculations for the harmonic and noise parts of the signals. The signal processing needed for this separation is quite complex and often includes the setting of several initial algorithm parameters. Again, the question raised is as follows: will a common set of these parameters result in the same efficiency of separation for the correlation scores in the 3 studies?
(iii) For the 2 other studies, the correlation results exhibit specific relevant acoustic features. This fact supports the view that a universal low-dimensional perceptual space describing all sounds does not exist. It would also tend to agree with McAdams et al. and McAdams, who observed that when sounds are produced by classes of sources that are too different the dissimilarity judgments may be based on cognitive factors rather than on perceptual signal-related ones, which results in a strongly categorical description.
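To make the different forms of the "brightness" predictor more tangible, the following sketch computes a spectral centroid either on a linear frequency axis (in Hz) or on an ERB-rate axis. It is only an illustration: it does not reproduce the Ircamdescriptor or Auditory Toolbox implementations, the ERB-rate conversion uses the commonly cited Glasberg and Moore formula as an assumption, and the naive DFT is chosen to keep the example free of external dependencies.

```python
import cmath
import math


def magnitude_spectrum(signal, sample_rate):
    """Naive DFT: return (frequencies in Hz, magnitudes) for bins 0..N/2."""
    n = len(signal)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        freqs.append(k * sample_rate / n)
        mags.append(abs(acc))
    return freqs, mags


def hz_to_erb_rate(f_hz):
    """ERB-rate scale (Glasberg and Moore formula, assumed here)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def spectral_centroid(freqs, mags, scale=lambda f: f):
    """Magnitude-weighted mean of the (optionally rescaled) frequency axis."""
    total = sum(mags)  # a silent signal (total == 0) is not handled here
    return sum(scale(f) * m for f, m in zip(freqs, mags)) / total


# Toy signal: a 1000 Hz sine sampled at 8000 Hz (exactly on a DFT bin).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
freqs, mags = magnitude_spectrum(tone, sample_rate)

centroid_hz = spectral_centroid(freqs, mags)
centroid_erb = spectral_centroid(freqs, mags, scale=hz_to_erb_rate)
print(centroid_hz, centroid_erb)
```

A frequency weighting (A, B, or C) or a restriction to the noise part of the signal would be applied to `mags` before the centroid is taken, which is why the same nominal attribute can yield different numerical predictors across studies.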
3. Metaprocessing: Complementary Experimental Investigations
a categorical (discrete) level corresponding to the main sound-event categories, each of them being related to a distinct physical cause or source,
a continuous level that will associate each of these categories with a perceptual space with salient dimensions that can be either specific to the given category or shared with the others.
a free-sorting task to identify the main sound event categories composing the overall sound corpus, combined from the initial corpora presented in Section 2,
a forced-choice sorting task on more heterogeneous sounds (extracted from commercial sound libraries) in order to extend and determine more precisely the boundaries of these categories.
The results of these experiments will then be used in the last part of the study (see Section 4) in order to define new ways of modeling the structure on both discrete and continuous levels, as defined in our hypothesis, that describe the main sound-event categories and perceptual dimensions attached to each of those categories, respectively.
3.1. Experiment 1: Free-Sorting Task on the Initial Corpora
In order to identify the main perceptual categories among the sounds under consideration in this study, a free-sorting experiment with this complete stimulus set was conducted.
Twenty participants (8 women and 12 men) volunteered as listeners for this experiment and were paid for their participation. All reported having normal hearing.
The resulting unified stimulus set is a collection of 83 sounds distributed as follows: 16, 14, 19, 22, and 12 sounds from studies A1, A2, B, C, and D, respectively. See Sections 2.1.2, 2.2.2, 2.3.2, 2.4.2 for more details on the sounds. In order to prevent the listeners from sorting the sounds according to their loudnesses, a preliminary loudness-equalization experiment was conducted with 7 participants working at IRCAM, resulting in an 83-sound loudness-equalized corpus.
Testing took place in a double-walled IAC sound-isolation booth. The sounds were played over Sennheiser HD 520 II headphones through an RME Fireface 400 audio card plugged into a Macintosh Mac Pro (Mac OS X v10.4 Tiger) workstation. The test was run using a Graphical User Interface (GUI) developed by Vincent Rioux in Matlab (v7.0.4) including stimulus control, data recording, and sound play-back.
At the beginning of the procedure, the participants were given written instructions briefly presenting the context of the study and detailing the task to be performed. The task was to classify the 83 sounds of the corpus into as many categories as they wished according to their own criteria and, in a second step, to select the most representative sound—the prototype—for each of the classes (see Section 3.2 for the definition of a prototype by Rosch ). In the GUI, the sounds were represented as dots that could be either played (double-click) or moved (drag and drop) to the dedicated area of the screen in order to graphically compose the categories (see Figure 18 for an illustration of the interface).
The experimental data consist of individual incidence matrices (coding the set partitions of each subject) that are summed to form a cooccurrence matrix. The cooccurrence matrix represents how many subjects have placed each pair of sounds in the same category. This can also be interpreted as a proximity matrix (Kruskal and Wish ). In the present case, we derived a hierarchical tree representation from these data using an unweighted arithmetic average clustering (UPGMA) analysis algorithm (see P. Legendre and L. Legendre  for computational details). In such a representation, the distance between two sounds is represented by the height of the node that links them. Among the 91,881 triplets that can be formed out of 83 sounds, 94% follow the ultrametric inequality, which shows the adequacy of the tree representation for these data (see P. Legendre and L. Legendre ). The tree representation is shown in Figure 2. It can be clearly seen that 3 main categories constitute the unified corpus. Looking in detail at the items inside each of them, we can observe that these 3 categories correspond, respectively, to studies A and B (right part of Figure 2), study C (left part of Figure 2), and study D (middle part of Figure 2). Moreover, listening to these items led us to propose a semantic labeling for each of these 3 categories: "motor,'' (musical) "instrument-like,'' and "impact,'' respectively.
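The construction of the cooccurrence matrix and the ultrametric check can be sketched as follows. This is one plausible reading of the analysis, not the authors' code: the distance is assumed to be the number of participants minus the cooccurrence count, a triplet is counted as ultrametric when its two largest pairwise distances are equal, and the UPGMA clustering step itself is omitted.

```python
from itertools import combinations
from math import comb


def cooccurrence(partitions, n_items):
    """Sum of individual incidence matrices.

    `partitions` holds one list per participant, giving the category label
    that this participant assigned to each item.
    """
    co = [[0] * n_items for _ in range(n_items)]
    for labels in partitions:
        for i in range(n_items):
            for j in range(n_items):
                if labels[i] == labels[j]:
                    co[i][j] += 1
    return co


def ultrametric_fraction(dist):
    """Share of triplets whose two largest pairwise distances are equal."""
    n = len(dist)
    ok = total = 0
    for i, j, k in combinations(range(n), 3):
        _, second, largest = sorted((dist[i][j], dist[i][k], dist[j][k]))
        ok += second == largest
        total += 1
    return ok / total


# Toy data: two participants who sort four sounds identically.
partitions = [[0, 0, 1, 1], [0, 0, 1, 1]]
co = cooccurrence(partitions, 4)
dist = [[len(partitions) - c for c in row] for row in co]

n_triplets = comb(83, 3)  # number of triplets among 83 sounds
print(n_triplets, ultrametric_fraction(dist))
```

For 83 sounds, `comb(83, 3)` gives the 91,881 triplets quoted above; the reported 94% would correspond to the value returned by a check of this kind on the experimental data.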
After summing the matrix over the panel of listeners, a submatrix is extracted for each of the 3 final categories with the rows and columns indexed by the sounds constituting the category.
Each submatrix is averaged over its rows, and the highest score gives the index of the prototype.
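The two steps above can be sketched in a few lines. The function name `prototype_index` and the toy matrix are assumptions; ties are resolved by keeping the first candidate, a choice the text does not specify.

```python
def prototype_index(cooccurrence, members):
    """Return the category member with the highest mean cooccurrence score.

    `cooccurrence` is the matrix summed over listeners; `members` lists the
    indices of the sounds belonging to one category. The diagonal is the same
    for every sound, so including it does not change the ranking.
    """
    def mean_score(i):
        return sum(cooccurrence[i][j] for j in members) / len(members)

    return max(members, key=mean_score)


# Toy matrix for five listeners and three sounds of the same category.
co = [
    [5, 4, 1],
    [4, 5, 3],
    [1, 3, 5],
]
print(prototype_index(co, [0, 1, 2]))
```

In this toy case, sound 1 is the one most often grouped with the other members of its category, so it is returned as the prototype.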
"motor'' (first category): sounds from both car interior (studies A1 and A2) and air-conditioning unit (study B) corpora. These sounds have two discriminable components: a harmonic part with a quite low fundamental frequency produced by a "motor" and a noisy part produced by air turbulence;
"instrument-like'' (second category): sounds that correspond to the car horn corpus (study C), which are defined by one or several higher tones, closer to those produced by musical instruments than those generated by motors;
"impact'' (third category): sounds of the car door closing corpus (study D). Actually, one can easily discriminate these sounds from the others because of their temporal structure. This idea is consistent with the discrimination of percussive and sustained sounds among musical timbres. Indeed, impact sounds of the environment are quite close to musical percussive sounds in terms of sound production.
This categorization is consistent with the product sound classification proposed by Özcan and van Egmond, which is defined by 6 sound categories: air, alarm, cyclic, impact, liquid, and mechanical. Even though these product sounds were from a domestic context, Özcan and van Egmond found an impact category; they also found an alarm category that can correspond to the present instrument-like category with regard to basic similarities in pitch, harmonic structure, or stationary aspects of the sounds; and finally, the present motor category can be linked to both their air and mechanical categories.
3.2. Experiment 2: Forced-Choice Sorting Task on an Extended Corpus
On the basis of the previous results (3 main categories of environmental sounds within the scope of the unified corpus, with an associated prototype for each), a second experiment was conducted in order to generate a more heterogeneous corpus that would better represent the range of variation of each category. This was done by means of a forced-choice procedure, the main choices being the categories found in Experiment 1. These 3 categories were each identified by their respective prototypic sound extracted from Experiment 1, instead of being verbally defined as is usually done in this kind of procedure. The notion of prototype is based on psychological principles related to the way one organizes knowledge of the surrounding world. For Rosch, a prototype is the element of a group that is the most similar to all items inside the group and, at the same time, that is on average the most different from items of all the other groups. The notion of prototype used in the present study is directly derived from Rosch's concept.
Furthermore, the outcome of this second experiment will also provide perceptually validated data for the modeling part of the present study in order to implement an automatic classifier (see Section 4.2).
Twenty-one participants (8 women and 13 men) volunteered as listeners for this experiment and were paid for their participation. All reported having normal hearing.
for motor sounds: truck, aircraft, motorbike, helicopter, crane, vacuum cleaner, fridge, blender, electric shaver, lawn mower;
for instrument-like sounds: phone ringing, dishes squeak, door creak, alarm, bell;
for impact sounds: glass shock, various doors closing (fridge door, house door, etc.), computer keyboard, water drop, tennis ball.
Again, the sounds needed to be equalized in loudness so that the judgments would not be based on this auditory attribute. However, considering the high number of sounds, a preliminary experiment of loudness equalization would have been quite long. As a consequence, the sounds' loudnesses were equalized with regard to the value given by the loudness model of Zwicker and Fastl .
The final corpus was composed of 150 loudness-equalized sounds with an equal distribution of 50 sounds in each category.
The same technical equipment as in Experiment 1 was used. However, the study was run using a GUI specifically developed in the PsiExp v3.4 experimentation environment including stimulus control and data recording (Smith). The sounds were played with Cycling74's Max/MSP software (v4.6).
At the beginning of the experiment, the participants were given written instructions briefly presenting the context of the study and detailing the task to be performed. They were asked to classify the 150 sounds of the new corpus into 3 unnamed categories associated with their respective prototypical sounds by clicking on the corresponding button. A fourth button labeled "other" allowed participants to not choose any of the 3 main categories (see Figure 19 for an illustration of the interface). The specificity of the present paradigm was to make the categories explicit with the prototype sounds found in Experiment 1 (with the obvious exception of the class "other") instead of naming them directly. This implementation was chosen in order to avoid any ambiguity in the understanding of the arbitrary semantic attributes that did not result from verbalization analyses.
The high standard deviation of the number of rejected sounds ("other") might be related to differences in strategy among the participants who did not use the same selectivity threshold, or the same granularity, to group the sounds.
The high mean number of sounds combined with a relatively low standard deviation for the motors shows a consensus among the participants that proves the adequacy of the selection of sounds for this category.
In contrast, the relatively high standard deviations for the two other categories show some variability in the listeners' judgments, which is probably due to the quite large variety of chosen sounds for these categories. For the "instrument-like" category, the variability seems to be related to the difficulty of theoretically defining this type of sound, whereas for the impacts, it could be explained by the overly general character of this category.
Table: Experiment 2, distribution of the sounds in the experimental categories (columns: Prototype 1 (motor), Prototype 2 (instrument-like), Prototype 3 (impact)).
The partitioning of the data across the 3 categories shows a good consensus on a certain number of sounds for each class. With this result, we are then able to make a selection of sounds that are clearly associated with one of the 3 categories revealed in Experiment 1. This leads to the constitution of a perceptually validated sound corpus with regard to the motor, instrument-like, and impact categories, which is now large and representative enough to consider the conception and validation of a predictive tool for automatic classification of environmental sounds in these three categories.
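One way to operationalize "clearly associated with one of the 3 categories" is a consensus threshold on the participants' choices. The sketch below is purely illustrative: the selection criterion actually used is not stated in this section, so the strict-majority threshold, the function name `consensual_sounds`, and the toy responses are all assumptions.

```python
from collections import Counter


def consensual_sounds(choices, threshold=0.5):
    """Keep sounds whose most frequent category exceeds a consensus threshold.

    `choices` maps a sound name to the list of labels given by the
    participants. A sound is kept only if its most chosen label is not
    "other" and is selected by more than `threshold` of the participants
    (hypothetical criterion).
    """
    selected = {}
    for sound, labels in choices.items():
        label, count = Counter(labels).most_common(1)[0]
        if label != "other" and count / len(labels) > threshold:
            selected[sound] = label
    return selected


# Toy responses from 21 participants for three sounds.
choices = {
    "truck": ["motor"] * 18 + ["other"] * 3,
    "bell": ["instrument-like"] * 9 + ["impact"] * 6 + ["other"] * 6,
    "water drop": ["other"] * 15 + ["impact"] * 6,
}
validated = consensual_sounds(choices)
print(validated)
```

With these toy responses only the first sound passes the threshold; the other two are either split across categories or mostly rejected.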
a motor category including 49 sounds from 3 different corpora (A1, A2, B), each of them being described by a perceptual space and augmented with 50 perceptually validated new sounds, for a total of 99 items,
an instrument-like category including 22 sounds from corpus C described by a perceptual space and augmented with 27 perceptually validated new sounds, for a total of 49,
an impact category including 12 sounds from corpus D described by a perceptual space and augmented with 47 perceptually validated new sounds, for a total of 59.
In the next step, this corpus will serve as input for the implementation of the automatic classifier detailed in Section 4.2.
4. Metaprocessing: Modeling the Description Structure
This section was designed to confirm the second part of the starting hypothesis, which stipulates that both intercategory and intracategory properties exist, that is, dimensions shared by all the categories and specific dimensions related to their mutual discriminating differences. Furthermore, the knowledge of these discriminating features could facilitate the implementation of a predictive tool capable of automatically recognizing whether a new item belongs to one of the 3 metacategories. Note that every acoustic feature mentioned in this section is extracted either from the Ircamdescriptor toolbox (CUIDADO project, Peeters) or from the Auditory Toolbox (Slaney).
4.1. Continuous Level: Unifying the Perceptual Space Dimensions
In this first part, we investigate the shared and specific properties across the categories by considering the data coming from the corpora described by perceptual space dimensions (corpora A to D). The main idea is to unify these data by recomputing the acoustic features explaining the different perceptual dimensions in a more systematic manner, in order to point out regularities and singularities among the given spaces. The implementation of these acoustic features is detailed in Appendix D.
Note that some of the stimulus sets contain only monophonic sounds, whereas others contain only stereophonic sounds, and, although the acoustic features are calculated on both channels in the latter case, the salience of an indicator in one channel compared to the other depends on the recording context. For example, if a car interior sound has been recorded from the driver's seat, the most relevant channel for a given sound feature will probably not be the same as if it had been recorded from the passenger's seat. Accordingly, the features in the correlation tables can be either from the left or the right channel, or from the mean of both channels.
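The channel choice described above can be sketched as follows. This is only an illustration, assuming that a feature has already been computed per sound on each channel; the function names `pearson_r` and `best_channel` and the selection rule (largest absolute correlation) are assumptions for the example, not the procedure used in the primary studies.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def best_channel(left, right, coords):
    """Correlate one feature with one perceptual dimension, using the left
    channel, the right channel, or the mean of both (hypothetical helper).

    left, right: per-sound feature values for each channel.
    coords: per-sound coordinates along the perceptual dimension.
    Returns (variant_name, r) for the variant with the largest |r|.
    """
    variants = {
        "left": list(left),
        "right": list(right),
        "mean": [(l + r) / 2 for l, r in zip(left, right)],
    }
    scores = {name: pearson_r(values, coords) for name, values in variants.items()}
    name = max(scores, key=lambda key: abs(scores[key]))
    return name, scores[name]
```

For monophonic stimulus sets the three variants coincide, so the choice only matters for the stereophonic corpora.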
4.1.1. MDS Analyses Compatibility
The two models giving rise to the perceptual spaces that will be unified in this section are INDSCAL and CLASCAL (see Appendix A). As both models remove the rotational invariance of the obtained spaces, one could assume that both models would result in similar main perceptual dimensions (even if possible slight differences in the items' positions or the axes' orientations may be due to the precision of the model). However, the presence of specificities in the latter can modify the psychological meaning of the dimensions. Indeed, the fact that a part of the Euclidean distances is explained by those specificities leads to a modification of the proportion explained by the dimensions. Thus the dimensions obtained by both models will not necessarily be the same.
All the same, the only sound corpus for which the INDSCAL method was used (study D) corresponded to a different sound category than those of the other corpora (see Section 3.1). As a consequence, the fact that the dimensions were obtained differently from the other studies is not a problem. Indeed, this perceptual space will be studied separately from the others.
4.1.2. Motor Category
One of the main characteristics of this kind of sound is that it contains two different simultaneous parts. The first one corresponds to a harmonic pattern that can be easily modeled by a sum of sinusoids, and the second one corresponds to the noise resulting from the air turbulence. Perceptually, these two parts are highly discriminable. Consequently, unlike the other two categories, both parts need to be taken into account independently when estimating the acoustic features. This is the reason why harmonic separation methods were tested and used in order to describe both parts, as well as their mutual interaction.
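Once the two parts have been separated, their relative level can be expressed as a harmonic-to-noise ratio. The sketch below assumes that the harmonic and noise signals are already available as two sample sequences (the separation itself is not shown) and omits any frequency weighting; the function names are illustrative.

```python
import math

def mean_power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)

def harmonic_to_noise_ratio_db(harmonic, noise):
    """Level difference, in dB, between the harmonic part and the noise part.

    Assumes both parts were obtained beforehand by a harmonic/noise
    separation method; no A-weighting is applied in this sketch.
    """
    return 10.0 * math.log10(mean_power(harmonic) / mean_power(noise))
```

A positive value indicates that the harmonic part emerges from the noise, and a negative value indicates that the noise dominates.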
Dimension 1: Harmonic Emergence (HNR)
Dimension 2: Complex Brightness
- Study A1.
This dimension seems to be well correlated with the Perceptual Spectral Spread—PSS (see Appendix D) calculated with logarithmic scales for both magnitude (level) and frequency. The linear regression of this feature with the third dimension of the perceptual space of this study is shown in Figure 5.
- Study B.
Unlike study A, the sounds were not initially loudness-equalized in study B. Quite logically, the last dimension of this MDS analysis result is found to be significantly correlated with Loudness (see Appendix D). The linear regression between Loudness and the third dimension of the study B perceptual space is shown in Figure 6.
Loudness is a perceptually strong characteristic that can easily prevent slight variations of other features from emerging. Moreover, the fact that no third perceptual dimension was obtained for stimulus set A2 can be related to the predominance of the noisy part, which can mask some variations of other features. By contrast, when the sounds are loudness-equalized and when the harmonic part is not entirely masked by the noise, as in stimulus set A1, a third perceptual dimension (PSS, Perceptual Spectral Spread) seems to emerge and matches that of the perceptual space of (pseudo-)harmonic instrument-like sounds (see Section 4.1.3, Dimension 3). For these reasons, we were not able to unify this third dimension across the three corpora (A1, A2 and B).
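The linear regressions between an acoustic feature and a perceptual dimension mentioned in this section can be reproduced with an ordinary least-squares fit. The following is a minimal sketch under that assumption; `linear_regression` is an illustrative name, and the inputs are one feature value and one coordinate per sound.

```python
import math

def linear_regression(feature, coords):
    """Least-squares line coords = slope * feature + intercept.

    Returns (slope, intercept, r), where r is the Pearson correlation
    between the feature values and the perceptual coordinates.
    """
    n = len(feature)
    mx, my = sum(feature) / n, sum(coords) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(feature, coords))
    sxx = sum((x - mx) ** 2 for x in feature)
    syy = sum((y - my) ** 2 for y in coords)
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r
```

The squared correlation gives the proportion of variance of the perceptual dimension explained by the feature.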
4.1.3. Instrument-Like Category
Dimension 1: Roughness—Study C
Dimension 2: Perceptual Spectral Centroid (PSC): Simple Brightness—Study C
Dimension 3: Perceptual Spectral Spread (PSS)—Study C
4.1.4. Impact Category
Dimension 1: Perceptual Spectral Centroid (PSC): Simple Brightness—Study D
Dimension 2: Cleanness Indicator—Study D
Dimension 3: Sound Level—Study D
- (i) One feature, brightness, that is preponderant for the description of all sound categories (i.e., 1 dimension of the 5 perceptual spaces). This feature is actually a combination of different spectral envelope features: the perceptual spectral centroid of both harmonic and noise parts of the signal ( and ), or the perceptual spectral centroid of the whole signal (PSC), and the perceptual spectral spread (PSS). No unique combination has been found that describes this dimension uniformly, so the feature remains a generic notion of brightness and cannot yet be turned into a true metric for quantifying this dimension.
- (ii) One or two features, in each category, that are related to specificities of the corresponding sounds:
motor sound perception is largely characterized by the mixture of two highly discriminable parts, in terms of either energy or spectral content;
instrument-like sounds present timbre features that have been found previously for musical sounds (essentially, roughness);
an important part of the perceptual discriminability of impact sounds is related to a temporal behavior feature, describing the sounds' cleanness.
4.2. Categorical Level: Building an Automatic Classifier
Now that we have identified the intercategory particularities, we must address the development of a predictive tool able to automatically classify the sounds on the basis of a perceptually validated corpus. In other words, the aim here is to use the results presented in Section 4.1 as relevant cues in order to find a limited number of acoustic features that would be efficient for the implementation of an automatic perceptual classifier.
4.2.1. Specificities of the Categories
impact sounds differ from the other ones in their temporal structure: they are quite short because they are damped, while the other sounds are as long as desired because they are sustained;
instrument-like sounds differ from the other ones in their spectral structure: their spectrum energy is usually localized in the middle frequencies and their spread is quite low, because they are harmonic sounds whose degree of spectral envelope decrease is high. By contrast, the spectrum energy of the other sounds is localized in much lower frequencies, with a much higher spread and a lower degree of spectral envelope decrease.
Thus, it seems obvious that the cue that discriminates motor sounds from impact sounds, for instance, is very different from the one that discriminates motor sounds from instrument-like sounds. As a consequence, it is quite certain that a unique feature will not be enough to describe the categories, and it is more likely that we will have to use a pair of temporal and spectral features.
temporal features: Log-Attack-Time (LAT), Temporal Increase (TI), Temporal Decrease (TD), Temporal Centroid (TC), Effective Duration (ED), Energy Modulation Frequency (EMF), and Energy Modulation Amplitude (EMA);
spectral features: mean component of Spectral Centroid (SC), Spectral Spread (SSp), Spectral Skewness (SSk), Spectral Kurtosis (SK), Spectral Slope (SSl), Spectral Decrease (SD), Spectral RollOff (SR), and Spectral Variation (SV).
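To illustrate why a temporal feature separates damped impact sounds from sustained ones, the sketch below computes an effective duration from an energy envelope. It assumes a threshold-based definition (the time during which the envelope exceeds a fraction of its maximum); the 40% default and the function name are assumptions for the example and are not taken from this paper.

```python
def effective_duration(envelope, sample_rate, threshold=0.4):
    """Time, in seconds, during which the energy envelope exceeds
    threshold * max(envelope).

    envelope: non-negative envelope samples (assumed precomputed).
    sample_rate: number of envelope samples per second.
    threshold: fraction of the maximum (0.4 is an assumed default).
    """
    limit = threshold * max(envelope)
    above = sum(1 for value in envelope if value > limit)
    return above / sample_rate
```

With this definition, a rapidly decaying envelope yields a short duration, whereas a sustained envelope yields a duration close to the length of the sound.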
4.2.2. Classification Modeling Tool: The Multinomial Logistic Regression
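As a reminder of how a multinomial logistic regression turns acoustic features into category probabilities, the sketch below evaluates the model for one sound. It assumes the usual softmax form with one coefficient vector and one intercept per category (fixing one category's coefficients to zero gives the equivalent reference-category form); the coefficient values are to be estimated elsewhere, and the function name is illustrative.

```python
import math

def category_probabilities(features, coefficients, intercepts):
    """Multinomial logistic (softmax) probabilities for one sound.

    features: feature values, e.g. [spectral_feature, temporal_feature].
    coefficients: one list of weights per category.
    intercepts: one intercept per category.
    Returns one probability per category, summing to 1.
    """
    scores = [
        intercept + sum(w * x for w, x in zip(weights, features))
        for weights, intercept in zip(coefficients, intercepts)
    ]
    top = max(scores)  # subtracted for numerical stability
    exponentials = [math.exp(score - top) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]
```

The predicted category is the one with the highest probability.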
4.2.3. Model Selection
LR value for each spectral/temporal feature pair. Spectral features are in rows and temporal features are in columns.
4.2.4. Model Validation
reestimating the model on a randomly selected reduced part of the stimulus set, 70% of it for instance (144 sounds with respect to the distribution in the 3 categories),
calculating the estimated probabilities on the remaining 30% (63 sounds),
evaluating the error percentage. (We consider as an error the case of a sound for which the probability of belonging to its supposed category is smaller than one of the two other probabilities. This means that if the model has to choose the category to which the sound belongs, it will choose a wrong one.)
This procedure was performed 100 times with a different random selection of sounds each time. This method tests whether the effectiveness of the model prediction will hold when applied to other sounds than those used to estimate its coefficients.
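The validation loop described above can be sketched as follows. The model is passed in as two callables, `fit` and `predict_proba`, which are hypothetical placeholders for the estimation and evaluation of the logistic regression model; the stratified split and the rounding rule are assumptions chosen so that each category keeps roughly the same proportion in both subsets.

```python
import random

def stratified_split(labels, train_fraction, rng):
    """Split sound indices into learning and predicting subsets while
    keeping about the same proportion of each category."""
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_label.values():
        indices = indices[:]
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

def repeated_validation(features, labels, fit, predict_proba,
                        runs=100, train_fraction=0.7, seed=0):
    """Return the error percentage obtained in each run.

    fit(train_features, train_labels) -> model (hypothetical callable).
    predict_proba(model, feature) -> dict mapping category to probability.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(runs):
        train, test = stratified_split(labels, train_fraction, rng)
        model = fit([features[i] for i in train], [labels[i] for i in train])
        wrong = 0
        for i in test:
            probabilities = predict_proba(model, features[i])
            own = probabilities[labels[i]]
            # Error: another category is more probable than the supposed one.
            if any(p > own for c, p in probabilities.items() if c != labels[i]):
                wrong += 1
        errors.append(100.0 * wrong / len(test))
    return errors
```

With category sizes of 99, 49, and 59 sounds and a 70% learning fraction, this split yields 144 learning sounds and 63 predicting sounds, as in the procedure described above.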
Results of the predicting tool based on SSp/ED features, after 100 runs of a 70%-learning/30%-predicting loop on the 207-sound perceptually extended corpus (Experiment 2). Reported statistics: minimum recall number and percentage, maximum recall number and percentage, recall number standard deviation, mean recall number and percentage, and mean recall percentage interval.
The selected model, tested on the 207-sound stimulus set (the augmented corpus established in Experiment 2, Section 3.2), gives stable automatic classification results, with a mean prediction error of only around 4% using only 2 predicting acoustic features. This is a rather encouraging result, even though the tool is built on only 3 main sound categories of quite different kinds (motor, instrument-like, and impact). It could be extended to other categories in order to cover a larger scope of environmental sounds.
Other automatic classification methods exist that are much more complex and that use much more input information about the sounds. Given the results obtained with this relatively simple method, there is little to gain from exploring those algorithms further for the 3 categories considered here. With more than 3 categories, however, these methods may outperform the one presented here and could therefore be useful for efficient automatic classification.
From a larger point of view, other classification approaches also exist that are less time consuming with regard to the available data needed for performing them: they usually consist in defining sound classes, collecting training examples for each class, computing a large set of spectral and temporal features on sounds, and letting a machine learning method pick features that are efficient in discriminating the classes. The main difference between this approach and the one proposed in the present paper is that, in the former, the classes are arbitrarily defined (or at least are the result of a single expert's analysis), whereas in the present paper the classes are deduced from an experimental procedure, which is more time consuming but allows them to be considered as perceptually relevant. This is one of the original contributions of this study with regard to traditional methods based on a priori sound categories and powerful learning techniques (e.g., the ones used in Music Information Retrieval research (http://www.ismir.net/)).
a categorical level that considers the different sound categories corresponding to particular sound production mechanisms,
a continuous level that defines, within each of these categories, the perceptual space of the sounds representing the perceptual dissimilarity between two sounds of the same kind.
This description is associated with automatic processing of acoustic features. When considering a new sound of one of these kinds, this processing allows: (i) the identification of the sound category to which it belongs, with regard to the probabilities estimated by the logistic regression model and (ii) its correct placement along several perceptual dimensions.
- (i) One feature is preponderant for the description of all sound categories, namely brightness, which is usually based on spectral envelope features. This perceptual feature therefore appears to describe musical sounds as well as environmental sounds.
- (ii) One or two features, in each category, are related to a specificity of the corresponding sounds:
motor sound perception is largely characterized by the mixture of two highly discriminable parts (harmonic and noise), in terms of either energy or spectral content;
instrument-like sounds present timbre features, originally derived for the description of musical sounds;
an important part of the perceptual discriminability of impact sounds is related to a temporal behavior feature, describing the sounds' cleanness.
This modeling approach also includes the building of a predictive tool based on logistic regression able to classify automatically and rather efficiently (with only 4% mean error) this metastructure with regard to the 3 categories under consideration.
Note that, contrary to musical timbre, for which attack time is an important cue of the perceptual space, the studies revealed no temporal features corresponding to the first two categories. This may be mainly due to the quasistationary nature of these sounds. Nonetheless, a temporal parameter associated with a spectral one appeared to be fairly efficient in automatically discriminating impulsive environmental sounds (car door closing) from nonimpulsive ones.
However, according to Özcan and van Egmond, other major sound categories, such as liquid or cyclic sounds, exist and need a definition as well, and their main perceptual features must be investigated. Furthermore, they focused their study on domestic "product sounds," while we were more interested in industrial sounds. Considering environmental sounds in a more general sense may again reveal other categories that would also need to be taken into consideration when building an overall environmental sound description structure, in terms of either definition or automatic description.
From an application point of view, the relevant acoustic features obtained for the three categories of sounds will allow us to design perceptually relevant organization structures for large environmental sound collections and to propose retrieval systems using an intuitive query process by searching for sounds that are similar to a target sound in that kind of database. The search will be based on similarity metrics computed from the acoustic features and stored with the sounds in the database, as proposed by previous studies for musical sounds (Blum et al., Misdariis et al., Qi et al.). From a larger perspective, these results should also contribute to the elaboration of a functional Computer-Aided Sound Design framework, as they will help users to describe, associate, compare, share, and finally manipulate sounds that can be considered as prototypes or initial ideas of concepts that the designer has in mind and tries to materialize in the framework of a specific project.
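A query-by-similarity search of this kind can be sketched as a nearest-neighbor ranking in the feature space. The example below assumes that each sound is stored with a vector of comparable (for instance, normalized) acoustic features and that a Euclidean distance is an acceptable similarity metric; both points, as well as the function name, are assumptions for illustration.

```python
import math

def nearest_sounds(target, database, k=3):
    """Rank the sounds of a database by similarity to a target sound.

    target: feature vector of the query sound.
    database: dict mapping a sound name to its stored feature vector.
    Returns the names of the k closest sounds, closest first.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    ranked = sorted(database, key=lambda name: distance(database[name], target))
    return ranked[:k]
```

In practice, the distance could be weighted per dimension, in the spirit of the weighted Euclidean models described in Appendix A.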
A. Multidimensional Scaling (MDS) Analysis Principles
A.1. MDS Models
In this model, the space is rotationally invariant, which means that rotating its axes will not intrinsically change the space structure as long as they remain orthogonal.
In (A.3), $d_{ijt}$ is the distance between sound $i$ and sound $j$, $t$ is the index of the latent classes, $x_{ik}$ is the coordinate of sound $i$ along the $k$th dimension, $w_{kt}$ is the weight of dimension $k$ for class $t$, $K$ is the total number of dimensions, $v_t$ is the weight of the specificities for class $t$, and $s_i$ is the specificity of sound $i$.
The class structure is latent: there is no a priori assumption concerning the latent class to which a given subject belongs. The CLASCAL analysis yields a spatial representation of the stimuli on the dimensions, with the specificity of each stimulus, the probability that each subject belongs to each latent class, and the weights or saliences of each salient perceptual dimension for each class.
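The distance defined by the terms listed above can be sketched as follows, assuming the form commonly given for CLASCAL: a weighted squared Euclidean distance on the shared dimensions, plus the class-weighted sum of the two specificities, under a square root. The argument names are illustrative.

```python
import math

def clascal_distance(coords_i, coords_j, dim_weights,
                     specificity_i, specificity_j, specificity_weight):
    """Model distance between two sounds for one latent class.

    coords_i, coords_j: coordinates of the two sounds on the shared dimensions.
    dim_weights: weight of each dimension for the class.
    specificity_i, specificity_j: specificities of the two sounds.
    specificity_weight: weight of the specificities for the class.
    """
    shared = sum(
        w * (a - b) ** 2 for w, a, b in zip(dim_weights, coords_i, coords_j)
    )
    return math.sqrt(shared + specificity_weight * (specificity_i + specificity_j))
```

Setting the specificities to zero reduces this distance to the weighted Euclidean distance on the shared dimensions only.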
Moreover, in the INDSCAL and CLASCAL models, the presence of dimension weights that differ between subjects or classes of subjects removes the rotational invariance of the obtained spaces, because the dimensions are fixed by the presence of those weights. As a consequence, it is assumed in both models that the dimensions of the space are perceptually meaningful.
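The effect of the weights on rotational invariance can be checked numerically. The sketch below, with arbitrary illustrative points, weights, and rotation angle, compares distances before and after rotating a two-dimensional space: the unweighted Euclidean distance is preserved, whereas the weighted distance is not.

```python
import math

def rotate(point, angle):
    """Rotate a 2D point by the given angle (radians) around the origin."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def weighted_distance(p, q, weights=(1.0, 1.0)):
    """Weighted Euclidean distance; unit weights give the plain distance."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q)))

# Two arbitrary points and an arbitrary rotation of the space.
p, q = (1.0, 0.0), (0.0, 2.0)
angle = math.pi / 6
rp, rq = rotate(p, angle), rotate(q, angle)

# Unit weights: the distance is unchanged by the rotation.
unweighted_before = weighted_distance(p, q)
unweighted_after = weighted_distance(rp, rq)

# Unequal weights: the distance depends on the orientation of the axes.
weighted_before = weighted_distance(p, q, (2.0, 0.5))
weighted_after = weighted_distance(rp, rq, (2.0, 0.5))
```

Because unequal weights make the fit depend on the orientation of the axes, the estimated dimensions are fixed, which is why they can be interpreted directly.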
B. Complementary Data and Initial Results Related to the Four Primary Studies
RAPmv-A: A-weighted harmonic-to-noise ratio. Both harmonic and noise parts were separated using additive analysis/synthesis (see Rodet for more detail on the separation technique). The feature is the ratio of their levels expressed in dB(A).
Dec: Harmonic spectral decrease. This feature is related to the shape of the spectral envelope computed from the harmonic components of the signal. In the present case, this feature is computed on the bandwidth of the spectrum, but represents the relative decrease in the envelope of the harmonic spectrum only between 266 Hz and 2300 Hz.
rad - 2N/0.5N: 2N and 0.5N harmonic ratios, where N is deduced from the RPM value of engine rotation.
CGg-C: Spectral centroid, with linear frequency using C weighting.
NHR-A: Feature corresponding to the relative balance of the harmonic (motor) and noise (air turbulence) components. The best correlation is obtained with the A-weighted version of this parameter.
-B: B-weighted spectral centroid of the noise component. For this dimension, the emergence of a spectral pitch led us to consider the spectral centroid (SC). More precisely, we compute the SC of each of the two parts of the sound: the noise component ( ) and the harmonic component ( ). The best correlation with Dimension 2 is obtained for using B-weighting.
N: Loudness. Indeed, even though the selected sounds are in the same range of loudness, they were not equalized in loudness.
Roughness: feature modeled by the amplitude modulation rate of the temporal envelope (expressed in asper) and related to the sensation of auditory roughness.
Spectral centroid: feature describing the spectral distribution of the energy of the sound, computed from a frequency decomposition on the ERB scale (Marozeau et al.). It has been identified as corresponding to the sensation of "brightness."
Spectral deviation: feature related to the fine structure of the spectral envelope. It is computed based on the smoothness of the outputs of the filter-bank (Marozeau et al.).
Spectral centroid: feature describing the spectral distribution of the energy of the sound.
Sharpness: feature defined by Aures, similar to spectral centroid with perceptual modeling.
Cleanness indicator: indicator that is derived from the temporal loudness calculation according to Zwicker's model. The algorithm takes into account temporal integration and temporal masking. The proposed indicator is based on the temporal evolution of the curve.
C. Illustration of the Experimental Graphical User Interfaces Used in Experiments 1 and 2
D. Details of Acoustic Features Calculation
D.1. RMS Value
The estimation of the RMS (Root-Mean-Square) value of the signal is frame-based and is calculated every 60 ms with a Blackman window. The feature is the mean value over time.
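The frame-based estimation described above can be sketched as follows. The 60 ms hop and the Blackman window come from the text; the frame length (taken equal to the hop) and the normalization by the window energy are assumptions for the example, and the function names are illustrative.

```python
import math

def blackman(n):
    """Blackman window of n points (classical 0.42/0.5/0.08 coefficients)."""
    if n == 1:
        return [1.0]
    return [
        0.42
        - 0.5 * math.cos(2 * math.pi * k / (n - 1))
        + 0.08 * math.cos(4 * math.pi * k / (n - 1))
        for k in range(n)
    ]

def mean_frame_rms(signal, sample_rate, hop_s=0.060, frame_s=0.060):
    """Mean over time of the RMS value estimated on windowed frames.

    hop_s: time between two estimates (60 ms, as in the text).
    frame_s: frame length in seconds (assumed equal to the hop here).
    """
    hop = max(1, round(hop_s * sample_rate))
    size = max(1, round(frame_s * sample_rate))
    window = blackman(size)
    norm = sum(w * w for w in window)  # window energy, used for normalization
    values = []
    for start in range(0, len(signal) - size + 1, hop):
        frame = signal[start:start + size]
        energy = sum((w * x) ** 2 for w, x in zip(window, frame))
        values.append(math.sqrt(energy / norm))
    if not values:
        raise ValueError("signal is shorter than one frame")
    return sum(values) / len(values)
```

With this normalization, a signal of constant amplitude returns that amplitude, which makes the feature independent of the window shape for stationary sounds.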
D.2. Loudness
Loudness is the intensive attribute of human hearing. It thus describes the subjective aspect of the intensity of a signal by considering masking effects that occur over the whole spectrum and the filtering steps of the hearing path. The loudness model used is the ISO 532-B model from Zwicker and Fastl.
D.3. Harmonic Emergence Feature
D.4. Spectral Centroid
D.5. Spectral Spread
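For orientation, the sketch below computes a spectral centroid and a spectral spread as the first two moments of a magnitude spectrum. It assumes the generic moment-based definitions, with the spread taken as a standard deviation (some definitions use the variance instead), and does not reproduce the perceptual versions used in this study. The helper `hz_to_erb_rate`, based on the commonly used Glasberg and Moore approximation, is included as one possible way to express frequencies on a perceptual scale before computing the moments.

```python
import math

def spectral_centroid_and_spread(frequencies, magnitudes):
    """Centroid (weighted mean) and spread (weighted standard deviation)
    of a spectrum, using the magnitudes as weights.

    frequencies: center frequency of each bin or band (any consistent scale).
    magnitudes: non-negative magnitude (or energy) of each bin or band.
    """
    total = sum(magnitudes)
    centroid = sum(f * m for f, m in zip(frequencies, magnitudes)) / total
    variance = sum(
        m * (f - centroid) ** 2 for f, m in zip(frequencies, magnitudes)
    ) / total
    return centroid, math.sqrt(variance)

def hz_to_erb_rate(frequency_hz):
    """Convert a frequency in Hz to an ERB-rate value
    (Glasberg and Moore approximation, assumed here for illustration)."""
    return 21.4 * math.log10(4.37 * frequency_hz / 1000.0 + 1.0)
```

Applying `hz_to_erb_rate` to the band frequencies before calling `spectral_centroid_and_spread` gives centroid and spread values on an ERB-rate axis rather than in Hz.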
D.6. Complex Brightness
D.8. Cleanness Indicator
where FFT256 is the 256-point Fast Fourier Transform.
Some of these results come from the SampleOrchestrator project funded by the French Agence Nationale de la Recherche (ANR): http://www.ircam.fr/306.html?&tx_ircamprojects_pi1%5BshowUid%5D=36&tx_ircamprojects_pi1%5BpType%5D=p&cHash=9859699b3d.
- Vanderveer NJ: Ecological acoustics: human perception of environmental sounds, unpublished doctoral dissertation. Cornell University; 1979.
- Susini P, McAdams S, Winsberg S: Caractérisation perceptive des bruits de véhicules. Proceedings of 4ème Congrès français d'acoustique, 1997, Marseille, France
- Susini P, McAdams S, Winsberg S: Perceptual characterisation of vehicules noises. EEA Symposium: Psychoacoustic in Industry and Universities, 1997, Eindhoven, The Netherlands
- Susini P, McAdams S, Winsberg S: A multidimensional technique for sound quality assessment. Acustica 1999, 85(5):650-656.
- McAdams S, Susini P, Misdariis N, Winsberg S: Multidimensional characterisation of perceptual and preference judgements of vehicle and environmental noises. Proceedings of Euronoise Conference, 1998, Munich, Germany
- Susini P, McAdams S, Winsberg S, Perry I, Vieillard S, Rodet X: Characterizing the sound quality of air-conditioning noise. Applied Acoustics 2004, 65(8):763-790. 10.1016/j.apacoust.2004.02.003
- Lemaitre G, Susini P, Winsberg S, McAdams S, Letinturier B: The sound quality of car horns: a psychoacoustical study of timbre. Acta Acustica United with Acustica 2007, 93(3):457-468.
- Lemaitre G, Susini P, Winsberg S, McAdams S, Letinturier B: The sound quality of car horns: designing new representative sounds. Acta Acustica United with Acustica 2009, 95(2):356-372. 10.3813/AAA.918158
- Parizet E, Guyader E, Nosulenko V: Analysis of car door closing sound quality. Applied Acoustics 2008, 69(1):12-22. 10.1016/j.apacoust.2006.09.004
- Grey JM: Multidimensional perceptual scaling of musical timbres. The Journal of the Acoustical Society of America 1977, 61(5):1270-1277. 10.1121/1.381428
- Krumhansl CL: Why is musical timbre so hard to understand? In Structure and Perception of Electroacoustic Sound and Music. Edited by: Nielzen S, Olsson O. Elsevier, Amsterdam, The Netherlands; 1989:43-53. (Excerpta Medica 846)
- McAdams S, Winsberg S, Donnadieu S, de Soete G, Krimphoff J: Perceptual scaling of synthesized musical timbres: common dimensions, specificities, and latent subject classes. Psychological Research 1995, 58(3):177-192. 10.1007/BF00419633
- Marozeau J, de Cheveigne A, McAdams S, Winsberg S: The dependency of timbre on fundamental frequency. The Journal of the Acoustical Society of America 2003, 114(5):2946-2957. 10.1121/1.1618239
- McAdams S: Recognition of auditory sound sources and events. In Thinking in Sound: The Cognitive Psychology of Human Audition. Edited by: McAdams S, Bigand E. Oxford University Press, Oxford, UK; 1993.
- Peeters G: A large set of audio features for sound description (similarity and classification). CUIDADO project Ircam technical report 2004. http://www.ircam.fr/anasyn/peeters/ARTICLES/Peeters_2003_cuidadoaudiofeatures.pdf
- Slaney M: Auditory Toolbox, version 2. Tech. Rep. 1998-010, Interval Research Corporation; 1998. http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010/
- Aures W: Der sensorische Wohlklang als Funktion psychoakustischer Empfindungsgrößen. Acustica 1985, 58(5):282-290.
- Zwicker E: Procedure for calculating loudness of temporally variable sounds. The Journal of the Acoustical Society of America 1977, 62(3):675-682. 10.1121/1.381580
- McAdams S: Recognition of auditory sources and events. In Thinking in Sound: The Cognitive Psychology of Human Audition. Edited by: McAdams S, Bigand E. Oxford University Press, Oxford, UK; 1993:146-198.
- Rosch E: Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Associates; 1978:27-48.
- Kruskal JB, Wish M: Multidimensional Scaling, Sage University Paper Series on Quantitative Applications in the Social Sciences 07-011. Sage Publications, Beverly Hills, Calif, USA; 1978.
- Legendre P, Legendre L: Numerical Ecology. Development in Environmental Modelling. 2nd edition. Elsevier, Amsterdam, The Netherlands; 1998.
- Özcan E, van Egmond R: Memory for product sounds: the effect of sound and label type. Acta Psychologica 2007, 126(3):196-215. 10.1016/j.actpsy.2006.11.008
- Zwicker E, Fastl H: Psychoacoustics: Facts and Models. Springer, New York, NY, USA; 1990.
- Smith BK: PsiExp: an environment for psychoacoustic experimentation using the IRCAM musical workstation. In Society for Music Perception and Cognition Conference. University of Berkeley, Berkeley, Calif, USA; 1995.
- Ellis DPW: Sinewave and Sinusoid+Noise Analysis/Synthesis in Matlab. 2003, http://labrosa.ee.columbia.edu/matlab/sinemodel/
- Woodcock S: MATLAB econometrics toolbox. 2003, http://www.sfu.ca/~swoodcoc/software/software.html
- Blum T, Keislar D, Wheaton J, Wold E: Audio database with content-based retrieval. Annual Review of Physiology 1995, 61: 457-476.
- Misdariis N, Smith BK, Pressnitzer D, Susini P, McAdams S: Validation of a multidimensional distance model for perceptual dissimilarities among musical timbres. The Journal of the Acoustical Society of America 1998, 103(5):3005-3006.
- Qi H, Hartono P, Suzuki K, Hashimoto S: Sound database retrieved by sound. Acoustical Science and Technology 2002, 23(6):293-300. 10.1250/ast.23.293
- Carroll JD, Chang J-J: Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika 1970, 35(3):283-319. 10.1007/BF02310791
- Winsberg S, De Soete G: A latent class approach to fitting the weighted Euclidean model, clascal. Psychometrika 1993, 58(2):315-330. 10.1007/BF02294578
- Rodet X: The additive analysis-synthesis package. Ircam tutorial, http://recherche.ircam.fr/equipes/analyse-synthese/DOCUMENTATIONS/additive/index-e.html
- Patterson RD, Robinson K, Holdsworth J, McKeown D, Zhang C, Allerhand M: Complex sounds and auditory images. In Auditory Physiology and Perception. , Oxford, UK; 1995:429-446.
- Slaney M: An efficient implementation of the Patterson-Holdsworth auditory filter bank. Tech. Rep. 35, Apple Computer; 1993.
- Bogaards N, Roebel A, Rodet X: Sound analysis and processing with AudioSculpt 2. Proceedings of International Computer Music Conference (ICMC '04), 2004, Miami, Fla, USA
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.