- Research Article
- Open Access
An Ontological Framework for Retrieving Environmental Sounds Using Semantics and Acoustic Content
© Gordon Wichern et al. 2010
- Received: 1 March 2010
- Accepted: 19 October 2010
- Published: 21 October 2010
Organizing a database of user-contributed environmental sound recordings allows sound files to be linked not only by the semantic tags and labels applied to them, but also to other sounds with similar acoustic characteristics. Of paramount importance in navigating these databases are the problems of retrieving similar sounds using text- or sound-based queries, and automatically annotating unlabeled sounds. We propose an integrated system, which can be used for text-based retrieval of unlabeled audio, content-based query-by-example, and automatic annotation of unlabeled sound files. To this end, we introduce an ontological framework where sounds are connected to each other based on the similarity between acoustic features specifically adapted to environmental sounds, while semantic tags and sounds are connected through link weights that are optimized based on user-provided tags. Furthermore, tags are linked to each other through a measure of semantic similarity, which allows for efficient incorporation of out-of-vocabulary tags, that is, tags that do not yet exist in the database. Results on two freely available databases of environmental sounds contributed and labeled by nonexpert users demonstrate effective recall, precision, and average precision scores for both the text-based retrieval and annotation tasks.
- Semantic Similarity
- Retrieval Performance
- Semantic Concept
- Mean Average Precision
- Link Weight
With the advent of mobile computing, it is currently possible to record any sound event of interest using the microphone onboard a smartphone and immediately upload the audio clip to a central server. Once uploaded, an online community can rate, describe, and reuse the recording, appending social information to the acoustic content. This kind of user-contributed audio archive presents many advantages, including open access, low-cost entry points for aspiring contributors, and community filtering to remove inappropriate content. The challenge in using these archives is overcoming the "data deluge" that makes retrieving specific recordings from a large database difficult.
The content-based query-by-example (QBE) technique, where users query with sound recordings they consider acoustically similar to those they hope to retrieve, has achieved much success for both music [1] and environmental sounds [2]. Additionally, content-based QBE is inherently unsupervised, as no labels are required to rank sounds in terms of their similarity to the query (although relevancy labels are required for formal evaluation). Unfortunately, even if suitable recordings are available, they might still be insufficient to retrieve certain environmental sounds. For example, suppose a user wants to retrieve all of the "water" sounds from a given database. As sounds related to water are extremely diverse in terms of acoustic content (e.g., rain drops, a flushing toilet, the call of a waterfowl, etc.), QBE is inefficient when compared to the simple text-based query "water." Moreover, it is often the case that users do not have example recordings on hand, and in these cases text-based semantic queries are often more appropriate.
Assuming the sound files in the archive do not have textual metadata, a text-based retrieval system must relate sound files to text descriptions. Techniques that connect acoustic content to semantic concepts present an additional challenge, in that learning the parameters of the retrieval system becomes a supervised learning problem: each sound file in the training set must have semantic labels for parameter learning. Collecting these labels has become its own research problem, leading to the development of social games for collecting the metadata that describes music [3, 4].
Most previous systems for retrieving sound files using text queries use a supervised multicategory learning approach where a classifier is trained for each semantic concept in the vocabulary. For example, in [5] semantic words are connected to audio features through hierarchical clusters. Automatic record reviews of music are obtained in [6] by using acoustic content to train a one-versus-all discriminative classifier for each semantic concept in the vocabulary. An alternative generative approach that was successfully applied to the annotation and retrieval of music and sound effects [7] consists of learning a Gaussian mixture model (GMM) for each concept. In [8] support vector machine (SVM) classifiers are trained for semantic and onomatopoeia labels when each sound file is represented as a mixture of hidden acoustic topics. A large-scale comparison of discriminative and generative classification approaches for text-based retrieval of general audio on the Internet was presented in [9].
One drawback of the multiclass learning approach is its inability to handle semantic concepts that are not included in the training set without an additional training phase. By not explicitly leveraging the semantic similarity between concepts, the classifiers might miss important connections. For example, if the words "purr" and "meow" are never used as labels for the same sound, the retrieval system cannot model the information that these sounds may have been emitted from the same physical source (a cat), even though they are widely separated in the acoustic feature space. Furthermore, if none of these sounds contain the tag "kitty," a user who queries with this out-of-vocabulary tag might not receive any results, even though several cat/kitty sounds exist in the database.
In an attempt to overcome these drawbacks we use a taxonomic approach similar to that of [10, 11] where unlabeled sounds are annotated with the semantic concepts belonging to their nearest neighbor in an acoustic feature space, and WordNet [12, 13] is used to extend the semantics. We aim to enhance this approach by introducing an ontological framework where sounds are linked to each other through a measure of acoustic content similarity, semantic concepts (tags) are linked to each other through a similarity metric based on the WordNet ontology, and sounds are linked to tags based on descriptions from a user community.
We refer to this collection of linked concepts and sounds as a hybrid (content/semantic) network [14, 15] that possesses the ability to handle two query modalities. When queries are sound files the system can be used for automatic annotation or "autotagging", which describes a sound file based on its audio content and provides suggested tags for use as traditional metadata. When queries are concepts they can be used for text-based retrieval where a ranked list of unlabeled sounds that are most relevant to the query concept is returned. Moreover, queries or new sounds/concepts can be efficiently connected to the network, as long as they can be linked either perceptually if sound based, or lexically if word based.
In describing our approach, we begin with a formal definition of the related problems of automatic annotation and text-based retrieval of unlabeled audio, followed by the introduction of our ontological framework solution in Section 2. The proposed hybrid network architecture outputs a distribution over sounds given a concept query (text-based retrieval) or a distribution over concepts given a sound query (annotation). The output distribution is determined from the shortest path distance between the query and all output nodes (either sounds or concepts) of interest. The main challenge of the hybrid network architecture is computing the link weights. Section 3 describes an approach to determine the link weights connecting sounds to other sounds based on a measure of acoustic content similarity, while Section 4 details how link weights between semantic concepts are calculated using a WordNet similarity metric. It is these link weights and similarity metrics that allow queries or new sounds/concepts to be efficiently connected to the network. The third type of link weight in our network connects sounds to concepts. These weights are learned by attempting to match the output of the hybrid network to semantic descriptions provided by a user community, as outlined in Section 5.
We evaluate the performance of the hybrid network on a variety of information retrieval tasks for two environmental sound databases. The first database contains environmental sounds without postprocessing, where all sounds were independently described multiple times by a nonexpert user community. This allows for greater resolution in associating concepts to sounds as opposed to binary (yes/no) associations. This type of community information is what we hope to represent in the hybrid network, but collecting this data remains an arduous process and limits the size of the database.
In order to test our approach on a larger dataset, the second database consists of several thousand sound files from the Freesound project [16]. While this dataset is larger in terms of the numbers of sound files and semantic tags, it is not as rich in terms of semantic information, as tags are applied to sounds in a binary fashion by the user community. Given the noisy nature (recording/encoding quality, various levels of postproduction, inconsistent text labeling) of user-contributed environmental sounds, the results presented in Section 6 demonstrate that the hybrid network approach provides accurate retrieval performance. We also test performance using semantic tags that are not previously included in the network, that is, out-of-vocabulary tags are used as queries in text-based retrieval and as the automatic descriptions provided during annotation. Finally, conclusions and discussions of possible topics of future work are provided in Section 7.
That is, the score function should be highest for sounds relevant to the query.
Formally, we define the hybrid network as a graph consisting of a set of nodes or vertices (ovals and rectangles in Figure 1). Two nodes can be connected by an undirected link with an associated nonnegative weight (also known as length or cost). The smaller the weight, the stronger the connection between the two nodes. In Figures 1(a)–1(c), the presence of an edge between two nodes indicates that a link weight is defined for that pair, although the exact weight values are not shown, while the dashed edges connecting the query node to the rest of the network are added at query time. If the text or sound file query is already in the database, then the query node is connected to the node representing it in the network by a single link of weight zero (meaning equivalence).
where the index runs over the possible paths between the two nodes. Given a starting node, we can efficiently compute (9) for all other nodes using Dijkstra's algorithm [18], although for QBE (Figure 1(a)) the shortest path distance is simply the acoustic content similarity between the sound query and the template used to represent each database sound. We now describe how the link weights connecting sounds and words are determined.
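The shortest path computation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the node names, the link weights, and the helper `add_link` are all hypothetical, and the step that converts distances into an output distribution is deliberately left out because its formula is not reproduced in this text.

```python
import heapq

def add_link(graph, a, b, weight):
    """Add an undirected link with a nonnegative weight (hypothetical helper)."""
    graph.setdefault(a, {})[b] = weight
    graph.setdefault(b, {})[a] = weight

def shortest_path_distances(graph, source):
    """Dijkstra's algorithm: shortest path distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist

# Toy hybrid network; every weight below is invented for illustration.
graph = {}
add_link(graph, "sound:s1", "sound:s2", 0.5)    # sound-sound link
add_link(graph, "sound:s1", "tag:train", 1.0)   # sound-concept link
add_link(graph, "sound:s2", "tag:voice", 0.4)   # sound-concept link
add_link(graph, "tag:train", "tag:rail", 0.2)   # concept-concept link

# The query node is attached at query time; weight zero means the query
# is equivalent to a node that already exists in the network.
add_link(graph, "query", "tag:rail", 0.0)

distances = shortest_path_distances(graph, "query")
ranked_sounds = sorted(
    (node for node in distances if node.startswith("sound:")),
    key=distances.get,
)
```

Here `ranked_sounds` orders the sound nodes from nearest to farthest from the text query, which is the ranking a text-based retrieval task would return.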
As shown in Figures 1(a)–1(c), each sound file in the database is represented as a template, and the construction of these templates will be detailed in this section. Methods for ranking sound files based on the similarity of their acoustic content typically begin with the problem of acoustic feature extraction. We use a previously described six-dimensional feature set, where features are computed from either the windowed time series data or the short-time Fourier transform (STFT) magnitude spectrum of 40 ms Hamming-windowed frames hopped every 20 ms (i.e., 50% overlapping frames). This feature set consists of RMS level, Bark-weighted spectral centroid, spectral sparsity (a ratio of two norms calculated over the STFT magnitude spectrum), transient index (a norm of the difference of Mel frequency cepstral coefficients (MFCCs) between consecutive frames), harmonicity (a probabilistic measure of whether or not the STFT spectrum for a given frame exhibits a harmonic frequency structure), and temporal sparsity (a ratio of two norms calculated over all short-term RMS levels computed in a one-second interval).
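The framing scheme above (40 ms Hamming windows hopped every 20 ms) can be illustrated with the simplest of the six features, the RMS level. The sketch below is only an assumption about how such a trajectory might be computed: the function names are hypothetical, the window is applied before taking the RMS (the text does not say whether it should be), and a synthetic 8 kHz tone stands in for the 44.1 kHz field recordings.

```python
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_rms(samples, sample_rate, frame_ms=40, hop_ms=20):
    """RMS level of each Hamming-windowed frame (one value per frame)."""
    frame_len = int(round(sample_rate * frame_ms / 1000))
    hop_len = int(round(sample_rate * hop_ms / 1000))
    window = hamming(frame_len)
    levels = []
    # Only full frames are kept; a trailing partial frame is discarded.
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len]
        energy = sum((w * x) ** 2 for w, x in zip(window, frame))
        levels.append(math.sqrt(energy / frame_len))
    return levels

# One second of a 440 Hz tone at an assumed 8 kHz sampling rate.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
rms_trajectory = frame_rms(tone, sample_rate)
```

With these settings each frame holds 320 samples and consecutive frames share 160 of them, so the resulting list is the kind of per-frame feature trajectory from which a sound template would be built.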
In addition to its relatively low dimensionality, this feature set is tailored to environmental sounds while not being specifically adapted to a particular class of sounds (e.g., speech). Furthermore, we have found that these features possess intuitive meaning when searching for environmental sounds, for example, crumbling newspaper should have a high transient index and birdcalls should have high harmonicity. This intuition is not present with other feature sets, for example, it is not intuitively clear how the fourth MFCC coefficient can be used to index and retrieve environmental sounds.
where is the univariate Gaussian pdf with mean and standard deviation evaluated at .
where the two length terms are the lengths of the feature trajectories for the two sounds being compared. Although the semimetric in (12) does not satisfy the triangle inequality, it retains the properties of (a) symmetry, (b) nonnegativity, and (c) distinguishability, that is, the distance is zero if and only if its two arguments coincide.
One technique for determining concept-concept link weights is to assign a link of weight zero (meaning equivalence) to concepts with common stems, for example, run/running and laugh/laughter, while other concepts are not linked. To calculate a wider variety of concept-to-concept link weights, we use a similarity metric from the WordNet::Similarity library [20]. A comparison of five similarity metrics from the WordNet::Similarity library in terms of audio information retrieval has been studied previously. In that work the Jiang and Conrath metric [21] performed best in terms of average precision, but had part-of-speech incompatibility problems that did not allow concept-to-concept comparisons for adverbs and adjectives. Therefore, in this work we use the vector metric because it supports the comparison of adjectives and adverbs, which are commonly used to describe sounds. The vector metric computes the cooccurrence of two concepts within the collections of words used to describe other concepts (their glosses). For a full review of WordNet similarity, see [20, 22].
where is the joint probability between and , is the conditional probability of sound given concept , and is defined similarly. Our goal in determining the social link weights connecting sounds and concepts is that the hybrid network should perform both the annotation and text-based retrieval tasks in a manner consistent with the social information provided from the votes matrix. That is, the probability distribution output by the ontological framework using (7) with should be as close as possible to from (15) and the probability distribution output using (8) with should be as close as possible to from (16). The difference between probability distributions can be computed using the Kullback-Leibler (KL) divergence.
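As a rough illustration of the quantities in this paragraph, the sketch below turns a small votes matrix into a joint distribution, derives the two conditional distributions, and measures how far a candidate network output is from its target with the KL divergence. Several things here are assumptions rather than the authors' definitions: the votes are invented, the joint distribution is taken to be the votes matrix divided by its total, the direction of the divergence (target first) is a guess, and the function names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def sound_given_concept(joint, concept):
    """Conditional distribution over sounds for one concept (column-normalized)."""
    column = [row[concept] for row in joint]
    mass = sum(column)
    return [x / mass for x in column]

def concept_given_sound(joint, sound):
    """Conditional distribution over concepts for one sound (row-normalized)."""
    mass = sum(joint[sound])
    return [x / mass for x in joint[sound]]

# Invented votes matrix: rows are sounds, columns are concepts, and each
# entry counts how many users applied that concept to that sound.
votes = [
    [3, 0, 1],
    [1, 2, 0],
]
total = sum(sum(row) for row in votes)
joint = [[v / total for v in row] for row in votes]  # assumed normalization

target = sound_given_concept(joint, 0)   # what the community data says
network_output = [0.6, 0.4]              # hypothetical hybrid network output
divergence = kl_divergence(target, network_output)
```

Weight learning would then adjust the sound-concept link weights to drive divergences like this one toward zero for every concept (retrieval) and every sound (annotation).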
Empirically, we have found that setting the initial weight values to , leads to quick convergence. Furthermore, if resources are not available to use the KL weight learning technique, setting the sound-concept link weights to provides a simple and effective approximation of the optimized weight.
Presently, the votes matrix is obtained using only a simple tagging process. In the future we hope to augment the votes matrix with other types of community activity, such as discussions, rankings, or page navigation paths on a website. Furthermore, sound-to-concept link weights can be set as design parameters rather than learned from a "training set" of tags provided by users. For example, expert users can make sounds equivalent to certain concepts through the addition of zero-weight connections between specified sounds and concepts, thus, improving query results for nonexpert users.
In this section, the performance of the hybrid network on the annotation and text-based retrieval tasks will be evaluated. (QBE results were considered in our previous work and are not presented here.)
6.1. Experimental Setup
Two datasets are used in the evaluation process. The first dataset, which we will refer to as the Soundwalks dataset, contains 178 sound files uploaded by the authors to the Soundwalks.org website. The 178 sound files were recorded during seven separate field recording sessions, lasting anywhere from 10 to 30 minutes each and sampled at 44.1 kHz. Each session was recorded continuously and then hand-segmented by the authors into segments lasting between 2 and 60 s. The recordings took place at three light rail stops (75 segments), outside a stadium during a football game (60 segments), at a skatepark (16 segments), and at a college campus (27 segments). To obtain tags, study participants were directed to a website containing ten random sounds from the set and were asked to provide one or more single-word descriptive tags for each sound. With 90 responses, each sound was tagged an average of 4.62 times. We have used 88 of the most popular tags as our vocabulary.
Because the Soundwalks dataset contains 90 subject responses, a nonbinary votes matrix can be used to determine the sound-concept link weights described in Section 5. Obtaining this votes matrix requires large amounts of subject time, thus limiting its size. To test the hybrid network performance on a larger dataset, we use 2064 sound files and a 377-tag vocabulary from Freesound.org. In the Freesound dataset, tags are applied in a binary (yes/no) manner to each sound file by users of the website. The sound files were randomly selected from among all files on the site (whether encoded in a lossless or lossy format) that contained any of the 50 most used tags and were between 3 and 60 seconds in length. Additionally, each sound file contained between three and eight tags, and each of the 377 tags in the vocabulary was applied to at least five sound files.
Database partitioning procedure for each cross validation run. Rows: number of sound files in network (training) and out of network (testing); number of tags out of vocabulary.
In annotation, each sound in the testing set is used as a query to provide an output distribution over semantic concepts. For a given sound query, a subset of the tags in the vocabulary is relevant. Assuming the tags are ranked in order of decreasing probability for a given query, we truncate the list to its top-ranked tags and count the relevant tags among them: precision is this count divided by the length of the truncated list, and recall is this count divided by the total number of relevant tags for the query. Average precision is found by incrementing the truncation length and averaging the precision at all points in the ranked list where a relevant tag is located. Additionally, the area under the receiver operating characteristics curve (AROC) is found by integrating the ROC curve, which plots the true positive rate versus the false positive rate for the ranked list of output tags.
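The evaluation measures described above can be sketched for a single query as follows. The function names and the example tags are hypothetical, and the AROC is computed through its equivalent pairwise form (the fraction of relevant/irrelevant pairs that are ranked in the right order), which matches the integral of the ROC curve when the ranking has no ties.

```python
def precision_recall_at(ranked, relevant, n):
    """Precision and recall after truncating the ranked list to its top n items."""
    hits = sum(1 for item in ranked[:n] if item in relevant)
    return hits / n, hits / len(relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def area_under_roc(ranked, relevant):
    """Fraction of (relevant, irrelevant) pairs with the relevant item ranked higher."""
    relevant_seen, ordered_pairs, irrelevant_count = 0, 0, 0
    for item in ranked:
        if item in relevant:
            relevant_seen += 1
        else:
            irrelevant_count += 1
            ordered_pairs += relevant_seen  # relevant items ranked above this one
    if relevant_seen == 0 or irrelevant_count == 0:
        return float("nan")  # ROC curve is undefined without both classes
    return ordered_pairs / (relevant_seen * irrelevant_count)

# Invented annotation output for one sound query, best guess first.
ranked_tags = ["train", "voice", "bell", "horn", "crowd"]
relevant_tags = {"train", "bell"}

precision, recall = precision_recall_at(ranked_tags, relevant_tags, n=2)
ap = average_precision(ranked_tags, relevant_tags)
aroc = area_under_roc(ranked_tags, relevant_tags)
```

Averaging `ap` over all queries in the testing set gives the mean average precision, and the same functions apply to text-based retrieval when the ranked items are sounds instead of tags.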
Annotation performance using out-of-vocabulary semantic concepts. Rows: in vocabulary (upper bound); out of vocabulary (WordNet); out of vocabulary (baseline).
6.3. Text-Based Retrieval
Text-based retrieval performance using out-of-vocabulary semantic concepts. Rows: in vocabulary (upper bound); out of vocabulary (WordNet); out of vocabulary (baseline).
Top four results from Soundwalks data set for text-based retrieval with out of vocabulary query "rail". Parenthetical descriptions are not actual tags, but provided to give an idea of the acoustic content of the sound files.
rail train segment94.wav (train bell) segment165.wav (traffic/train horn)
rail voice segment136.wav (pa announcement) segment133.wav (pa announcement)
rail train segment40.wav (train brakes) segment30.wav (train bell/brakes)
rail train segmen40.wav (train brakes) segment147.wav (train horn)
6.4. In-Vocabulary Semantic Information
Performance of retrieval tasks with the Soundwalks dataset using WordNet connections between in-vocabulary semantic concepts.
Currently, a significant portion of freely available environmental sound recordings are user contributed and inherently noisy in terms of audio content and semantic descriptions. To aid in the navigation of these audio databases we show the utility of a system that can be used for text-based retrieval of unlabeled audio, content-based query-by-example, and automatic audio annotation. Specifically, an ontological framework connects sounds to each other based on a measure of perceptual similarity, tags are linked based on a measure of semantic similarity, while tags and sounds are connected by optimizing link weights given user preference data. An advantage of this approach is the ability of the system to flexibly extend when new sounds and/or tags are added to the database. Specifically, unlabeled sound files can be queried or annotated with out-of-vocabulary concepts, that is, tags that do not currently exist in the database.
One possible improvement to the hybrid network structure connecting semantics and sound might be achieved by exploring different link weight learning techniques. Currently, we use a "divide and conquer" approach where the three types of weights (sound-sound, concept-concept, sound-concept) are learned independently. This could lead to scaling issues, especially if the network is expanded to contain different node types. One possible approach to overcome these scaling issues could be to learn a dissimilarity function from ranking data [25]. For example, using the sound similarity, user preference, and WordNet similarity data to find only rankings between words and sounds of the form "A is more like B than C is D," we can learn a single dissimilarity function for the entire network that preserves this rank information.
Another enhancement would be to augment the hybrid network with a recursive clustering scheme such as those described in [26]. We have successfully tested this approach in previous work, where each cluster becomes a node in the hybrid network, and all sounds assigned to each cluster are connected to the appropriate cluster node by a link of weight zero. These cluster nodes are then linked to the nodes representing semantic tags. While this approach limits the number of sound-tag weights that need to be learned, the additional cluster nodes and links tend to cancel out these savings. Furthermore, when a new sound is added to the network, we still must compute its similarity to all sounds previously in the network (this is also true for new tags). For sounds, it might be possible to represent each sound file and sound cluster as a Gaussian distribution, and then use symmetric Kullback-Leibler divergence to calculate the link weights connecting new sounds added to the network to preexisting clusters. Unfortunately, this approach would not extend to the concept nodes in the hybrid network, as we currently know of no technique for representing a semantic tag as a Gaussian, even though the WordNet similarity metric could be used to cluster the tags. Perhaps a technique where a fixed number of sound/tag nodes are sampled to have link weights computed each time a new sound/tag is added to the network could help make the ontological framework more computationally efficient. A link weight pruning approach might also help reduce computational complexity.
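The Gaussian idea mentioned above can be made concrete with a small sketch. It assumes, purely for illustration, that each sound or cluster is summarized by a diagonal-covariance Gaussian over its features and that the symmetric divergence is the sum of the two directed KL divergences; the function names and the numbers are hypothetical, and the paragraph above proposes this only as a possibility.

```python
import math

def kl_gaussian_1d(mean_p, var_p, mean_q, var_q):
    """Closed-form KL(p || q) between two univariate Gaussians."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0)

def symmetric_kl_diagonal(mean_a, var_a, mean_b, var_b):
    """Symmetric KL divergence between two diagonal-covariance Gaussians.

    With a diagonal covariance the divergence is a sum over feature dimensions.
    """
    total = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        total += kl_gaussian_1d(ma, va, mb, vb) + kl_gaussian_1d(mb, vb, ma, va)
    return total

# Invented two-dimensional summaries of a new sound and an existing cluster.
new_sound_mean, new_sound_var = [0.0, 1.0], [1.0, 2.0]
cluster_mean, cluster_var = [1.0, 1.0], [1.0, 2.0]

link_weight = symmetric_kl_diagonal(new_sound_mean, new_sound_var, cluster_mean, cluster_var)
```

Because the result is symmetric, nonnegative, and zero only for identical Gaussians, it could serve directly as an undirected link weight between a new sound and each preexisting cluster node.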
Finally, using a domain-specific ontology such as the MX music ontology [27] might be better suited to audio information retrieval than a purely lexical database such as WordNet. For environmental sounds, the theory of soundscapes [28, 29] might be a convenient first step, as the retrieval system could be specially adapted to the different elements of a soundscape. For example, sounds such as traffic and rain could be connected to a keynote sublayer in the hybrid network, while sounds such as alarms and bells could be connected to the sound signal sublayer. Once the subjective classification of sound files into the different soundscape elements is obtained, adding this sublayer to the present ontological framework could be an important enhancement to the current system.
This material is based upon work supported by the National Science Foundation under Grants NSF IGERT DGE-05-04647 and NSF CISE Research Infrastructure 04-03428. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
- Casey MA, Veltkamp R, Goto M, Leman M, Rhodes C, Slaney M: Content-based music information retrieval: current directions and future challenges. Proceedings of the IEEE 2008, 96(4):668-696.
- Wichern G, Xue J, Thornburg H, Mechtley B, Spanias A: Segmentation, indexing, and retrieval for environmental and natural sounds. IEEE Transactions on Audio, Speech and Language Processing 2010, 18(3):688-707.
- Turnbull D, Liu R, Barrington L, Lanckriet G: A game-based approach for collecting semantic annotations of music. Proceedings of the International Symposium on Music Information Retrieval (ISMIR '07), 2007, Vienna, Austria.
- Mandel MI, Ellis DPW: A Web-based game for collecting music metadata. Journal of New Music Research 2008, 37(2):151-165. 10.1080/09298210802479300
- Slaney M: Semantic-audio retrieval. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '02), 2002, Orlando, Fla, USA 4: 4108-4111.
- Whitman B, Ellis D: Automatic record reviews. Proceedings of the International Symposium on Music Information Retrieval (ISMIR '04), 2004 470-477.
- Turnbull D, Barrington L, Torres D, Lanckriet G: Semantic annotation and retrieval of music and sound effects. IEEE Transactions on Audio, Speech and Language Processing 2008, 16(2):467-476.
- Kim S, Narayanan S, Sundaram S: Acoustic topic model for audio information retrieval. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009, New Paltz, NY, USA 37-40.
- Chechik G, Ie E, Rehn M, Bengio S, Lyon D: Large-scale content-based audio retrieval from text queries. Proceedings of the 1st International ACM Conference on Multimedia Information Retrieval (MM '08), August 2008, Vancouver, Canada 105-112.
- Cano P, Koppenberger M, Le Groux S, Ricard J, Herrera P, Wack N: Nearest-neighbor generic sound classification with a WordNet-based taxonomy. Proceedings of the 116th AES Convention, 2004, Berlin, Germany.
- Martinez E, Celma O, Sordo M, de Jong B, Serra X: Extending the folksonomies of freesound.org using content-based audio analysis. Proceedings of the Sound and Music Computing Conference, 2009, Porto, Portugal.
- WordNet, http://wordnet.princeton.edu/
- Fellbaum C: WordNet: An Electronic Lexical Database. Edited by: Fellbaum C. MIT Press, Cambridge, Mass, USA; 1998.
- Wichern G, Thornburg H, Spanias A: Unifying semantic and content-based approaches for retrieval of environmental sounds. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '09), 2009, New Paltz, NY, USA 13-16.
- Mechtley B, Wichern G, Thornburg H, Spanias AS: Combining semantic, social, and acoustic similarity for retrieval of environmental sounds. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), 2010.
- Freesound, http://www.freesound.org/
- Rijsbergen CJV: Information Retrieval. Butterworths, London, UK; 1979.
- Cormen TH, Leiserson CE, Rivest RL: Introduction to Algorithms. 2nd edition. MIT Press and McGraw-Hill, Cambridge, UK; 2001.
- Huang BH, Rabiner LR: A probabilistic distance measure for hidden Markov models. AT&T Technical Journal 1985, 64(2):1251-1270.
- Pedersen T, Patwardhan S, Michelizzi J: WordNet::Similarity - measuring the relatedness of concepts. In Proceedings of the 16th Innovative Applications of Artificial Intelligence Conference (IAAI '04), 2004, Cambridge, MA, USA. AAAI Press; 1024-1025.
- Jiang J, Conrath D: Semantic similarity based on corpus statistics and lexical taxonomy. Proceedings of the International Conference on Research in Computational Linguistics (ROCLING X '97), 1997, Taiwan 19-33.
- Budanitsky A, Hirst G: Semantic distance in WordNet: an experimental, application-oriented evaluation of five measures. Proceedings of the Workshop on WordNet and Other Lexical Resources, 2nd Meeting of the North American Chapter of the Association for Computational Linguistics, 2001, Pittsburgh, Pa, USA.
- Mandala R, Tokunaga T, Tanaka H: The use of WordNet in information retrieval. Proceedings of the Workshop on Usage of WordNet in Natural Language Processing Systems, 1998, Montreal, Canada 31-37.
- Cilibrasi RL, Vitányi PMB: The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering 2007, 19(3):370-383.
- Ouyang H, Gray A: Learning dissimilarities by ranking: from SDP to QP. Proceedings of the 25th International Conference on Machine Learning (ICML '08), July 2008, Helsinki, Finland 728-735.
- Xue J, Wichern G, Thornburg H, Spanias A: Fast query by example of environmental sounds via robust and efficient cluster-based indexing. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), April 2008, Las Vegas, Nev, USA 5-8.
- Ferrara A, Ludovico LA, Montanelli S, Castano S, Haus G: A semantic web ontology for context-based classification and retrieval of music resources. ACM Transactions on Multimedia Computing, Communications and Applications 2006, 2(3):177-198. 10.1145/1152149.1152151
- Schafer R: The Soundscape: Our Sonic Environment and the Tuning of the World. Destiny Books, Rochester, Vt, USA; 1994.
- Truax B: Acoustic Communication. Ablex Publishing, Norwood, NJ, USA; 1984.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.