Semantic Labeling of Nonspeech Audio Clips
© Xiaojuan Ma et al. 2010
Received: 30 June 2009
Accepted: 7 January 2010
Published: 1 March 2010
Human communication about entities and events is primarily linguistic in nature. While visual representations of information have been shown to be highly effective as well, relatively little is known about the communicative power of auditory nonlinguistic representations. We created a collection of short nonlinguistic auditory clips encoding familiar human activities, objects, animals, natural phenomena, machinery, and social scenes. We presented these sounds to a broad spectrum of anonymous human workers using Amazon Mechanical Turk and collected verbal sound labels. We analyzed the human labels in terms of their lexical and semantic properties to ascertain that the audio clips do evoke the information suggested by their pre-defined captions. We then measured the agreement with the semantically compatible labels for each sound clip. Finally, we examined which kinds of entities and events, when captured by nonlinguistic acoustic clips, appear to be well-suited to elicit information for communication, and which ones are less discriminable. Our work is set against the broader goal of creating resources that facilitate communication for people with some types of language loss. Furthermore, our data should prove useful for future research in machine analysis/synthesis of audio, such as computational auditory scene analysis, and annotating/querying large collections of sound effects.
Natural language is a highly complex yet efficient means of communication with great expressive power, and it is the primary mode of human communication and information exchange. However, for people with language disabilities, speakers of minority languages in a setting where another language dominates, and learners of foreign languages, the linguistic channel of communication may be less effective. To compensate, nonverbal representations of the concepts that people communicate about have been explored and evaluated as a means to support linguistic representations. These include animations and videos, and especially still pictures. However, very little research has been done on the use of nonspeech audio to convey and express concepts in Augmentative and Alternative Communication (AAC).
Some research indicates that nonspeech audio perception may be impaired together with speech perception for people with specific pathological profiles, because the two processes may share certain channels and brain regions. But there is evidence that in many other cases, people who suffer language impairments (as after a stroke) still retain the ability to recognize environmental sounds [5, 6]. This work suggests that for both language-impaired populations and healthy speakers whose comprehension is compromised for other reasons, nonspeech audio (environmental sounds) has the potential to convey concepts and assist language comprehension.
Compared to static images, audio perception may require a greater processing workload, as sound clips have temporal extension. However, the additional time required to finish listening to a sound clip is comparable to that needed to finish viewing an animation or video. In fact, research has shown that in some cases sound can actually enhance visual perception, suggesting that adding nonspeech audio material may promote people's comprehension of visual languages. Moreover, some concepts, such as "thunder," are inherently auditory in nature and can be better described by a sound than by a picture.
Previous research examined how environmental sounds are perceived in the human brain. However, not much work has been devoted to the question of which semantic concepts are associated with nonspeech audio and how, as most auditory scene analysis and classification research has focused on automatic machine learning algorithms (e.g., [10, 11]). This question has motivated the work reported here, which concerns people's evocation of concepts by specific sounds. The majority of currently available nonspeech audio databases, such as the BBC Sound Effects Library that we used in our experiment, include sound labels provided by the recording engineers; the labels are therefore not based on discrimination and identification. The Freesound Project asked volunteers to label submitted sounds, but relatively few high-quality labels were contributed. Marcell et al. ran several studies gathering human labels and classifications for 120 sounds produced by animals, people, musical instruments, tools, signals, and liquids. While that work dovetails well with our own research, it covered a much smaller sample of common concepts and focused on naming sound sources rather than on semantic descriptions. Other studies concentrated on a collection of human gestural contact sounds (scraping, hammering, etc.), but primarily looked at human-ranked similarity and categorization of these sounds rather than at linguistic descriptions of the sounds.
In this paper, we describe a pilot study and a large-scale experiment devoted to collecting human semantic labels for over 300 nonspeech sounds, specifically designed to convey a set of 184 familiar concepts referring to entities and events that can be linguistically expressed by different parts of speech. We examined the effectiveness with which the sound clips evoke concepts and the extent to which the labels we collected agree with the a priori labels. Three different attempts to categorize sounds into semantically coherent classes in terms of their auditory expressiveness allowed some conclusions, but failed for most of the hypothesized categories. Possible reasons for the failure to identify and associate sounds with target concepts are discussed.
This section describes the procedure of constructing a semantic network of concepts enhanced by nonspeech audio, including (1) a core vocabulary and (2) "soundnails" associated with each concept.
2.1. Vocabulary Selection
The broader goal of our research is to use nonspeech audio to improve language comprehension and acquisition for people facing language disabilities or language barriers, so as to facilitate daily communication, language learning, and language rehabilitation. We designed a "core vocabulary" that includes words needed to discuss common topics in daily communication, covering the major parts of speech. The initial vocabulary came from Lingraphica, a commercial communication device developed by the Lingraphicare Company for people with language impairments. It contains 1376 words (after stemming and eliminating symbols). We compared this initial vocabulary with the collection of words generated from the captions of the BBC Sound Effects Library, which comprises 1368 words after stemming and removal of nonlinguistic symbols. For the vocabulary overlapping between Lingraphica and the BBC library (after the removal of function words like articles and prepositions), words were divided according to their parts of speech. For those that can be assigned to multiple parts of speech (like "walk" and "thunder"), only the more frequent sense (based on WordNet) was kept. The final word inventory included 211 nouns, 68 verbs, 27 adjectives, and 16 adverbs (http://soundnet.cs.princeton.edu/OMLA/study/HearMe_Mturk/quality_control/ViewCount.php?fn=all_noun_211.txt (fn=all_verb_68.txt, all_adj_27.txt, and all_adv_16.txt for different parts of speech)).
2.2. Creating the Soundnails
We chose the BBC Sound Effects Library (Original Series) as our principal source of nonspeech audio representations because its vocabulary is large enough to overlap with the core of our initial vocabulary across a variety of auditory events and scenes, with high-quality, cleanly recorded sound clips. The BBC Sound Effects Library contains 40 CDs of industry-standard, high-quality sound clips recorded and labelled by BBC's top engineers. The sounds range from more general scenes such as interior and exterior environments, household, and natural environments to more specific categories such as cars, hospitals, birds, and weather. All the sounds in the library are labelled in great detail, for example, "Gale Force Wind And Rain On Yacht (Recorded In Cabin)" and "Car, Rolls Royce Silver Sprite, Interior, Electrical Window, Open and Close."
Despite its size, this collection did not include auditory representations for all the words in our vocabulary. We therefore looked into other resources as well, including the Freesound Project and the FindSounds website. The Freesound Project is a collaborative database where volunteers submit and label sounds they record. The FindSounds website is a search engine for online audio. Compared to the BBC Sound Effects Library, audio clips from these two resources show greater variance and are less reliable in both quality and labels.
2.2.1. Concept-Audio Association
Audioability four-point rating scale:
- 0: cannot make sound or be used to produce sound, and cannot be represented by sound.
- 1: can make sound or be used to produce sound, but cannot be represented by sound.
- 2: can make sound or be used to produce sound, and may be able to be represented by sound, meaning the sound could be ambiguous.
- 3: can make sound or be used to produce sound, and can be represented by sound, meaning the sound is distinctive.
Of the 322 words, 184 received a rating of 2 or higher, meaning they were considered "audioable." Two additional SoundLab members joined the discussion and finalized the sound scripts for these 184 target words (one script per word). These scripts formed the guidelines for selecting and assigning sound clips based on their original labels. A target word could be assigned to more than one sound.
2.2.2. Soundnail Creation
The majority of the BBC sound effects are dozens of seconds long, and many even last several minutes; this is also the case for the clips obtained from Freesound and FindSounds. These long clips carry richer and more complex information than can be conveyed by a single concept. For the applications we have in mind, such clips are not suitable. Another problem is the size of the clips. They are high-resolution stereo files, which makes them difficult to store and load. For practicality and quality control, we edited all files to uniform length and down-sampled all selected clips to 16 kHz, 16 bit mono, a sample rate at which people can still recognize the sound scene well. Our 16 kHz sample-rate decision was based on the fact that many games (especially mobile/handheld) use 11.025 or 22.05 kHz sample rates, and the speech recognition community has historically used 16 kHz for recognizers. It was critical to keep file sizes small for web transmission to our test subjects (see below), and we could not guarantee that they would have a proper MPEG or other audio decompressor installed and working on their computers. Our committee concluded that 16 kHz, 16 bit audio was of acceptable quality, and this was verified in a pilot study. All sound clips were randomly chunked into five-second fragments: it was desirable to keep files short and of the same length to balance the experimental conditions, but long enough to still embed sufficient information.
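The downsample-and-chunk step can be sketched as follows. This is a minimal sketch, not the authors' pipeline: linear interpolation stands in for whatever band-limited resampler was actually used, fragments are taken sequentially rather than randomly, and all function names and parameters are our own.

```python
import numpy as np

def resample(signal: np.ndarray, orig_rate: int, target_rate: int) -> np.ndarray:
    """Resample a mono signal by linear interpolation (a simple stand-in
    for the band-limited resampler a production pipeline would use)."""
    duration = len(signal) / orig_rate
    n_out = int(duration * target_rate)
    old_t = np.linspace(0.0, duration, num=len(signal), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(new_t, old_t, signal)

def chunk(signal: np.ndarray, rate: int, seconds: float = 5.0) -> list:
    """Split a clip into fixed-length fragments, dropping any short tail."""
    step = int(rate * seconds)
    return [signal[i:i + step] for i in range(0, len(signal) - step + 1, step)]

# A 12-second 44.1 kHz clip becomes two 5-second 16 kHz soundnail candidates.
clip = np.sin(2 * np.pi * 440 * np.arange(44100 * 12) / 44100)
mono16k = resample(clip, 44100, 16000)
fragments = chunk(mono16k, 16000, 5.0)
```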
Audio features used in soundnail creation:
- RMS Energy: mean and standard deviation of the signal's RMS energy.
- Spectral Centroid: the average frequency of the signal weighted by magnitude.
- Spectral Flux: how much the frequency content varies over time.
- 50% and 80% Spectral Rolloff: how much of the frequencies are concentrated below a given threshold (50% and 80%).
- Mel-Frequency Cepstral Coefficients: amplitudes of spectra specified by a set of filters.
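Assuming the standard definitions of these descriptors (the paper does not give exact formulas), a short sketch of how a few of them might be computed from one audio frame:

```python
import numpy as np

def spectral_features(frame: np.ndarray, rate: int) -> dict:
    """Compute RMS energy, spectral centroid, and 50%/80% rolloff for one
    frame. Illustrative textbook formulas only, not the authors' code."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / rate)
    total = mag.sum() or 1.0
    centroid = float((freqs * mag).sum() / total)   # magnitude-weighted mean frequency
    cum = np.cumsum(mag) / total
    rolloff50 = float(freqs[np.searchsorted(cum, 0.50)])  # 50% of magnitude below this
    rolloff80 = float(freqs[np.searchsorted(cum, 0.80)])
    rms = float(np.sqrt(np.mean(frame ** 2)))
    return {"rms": rms, "centroid": centroid,
            "rolloff50": rolloff50, "rolloff80": rolloff80}

# Sanity check: for a pure 1 kHz tone the centroid should sit near 1000 Hz.
rate = 16000
t = np.arange(rate) / rate
feats = spectral_features(np.sin(2 * np.pi * 1000 * t), rate)
```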
A total of 327 five-second soundnails were generated and assigned. All soundnails were normalized to the same power, except in specific cases requiring lower or higher volume, such as sounds from faraway sources.
first, whether the soundnails actually convey the intended concepts;
second, if not, what concept people agree on instead;
third, for disagreements among the labellers, what causes the ambiguity, and how this can help in selecting better auditory representations.
Labeling 327 sounds is an intensive task. Furthermore, to account for individual differences and generalize meaningful semantic labels, a large number of human participants is required, which makes it impractical and expensive to carry out such a study in a controlled lab environment. We therefore decided on an alternative: conducting an online survey on the platform provided by Amazon Mechanical Turk (AMT).
3.1. Tasks and Interface
What is the source of the sound? (What object(s)/living being(s) is/are involved?)
Where are you likely to hear the sound?
How is the sound made? (What action(s) is/are involved in creating the sound?)
3.2. Controlled Pilot Study with HCI Students
Although the main study was conducted online, we carried out a pilot study in advance to test and refine the study interface (e.g., autoplay of the sound and phrasing of the questions) and to generate ground-truth human labels for quality control of the online study (see the next section for details).
Twenty-two Princeton undergraduate students from the Human Computer Interface (HCI) Technology class participated in the pilot study. Five to eight labels were produced for each soundnail, and the time to label each soundnail was automatically logged as well. A poststudy questionnaire was given to gather feedback on the design and interface of the experiment.
3.3. Online Sound Labelling Study on Amazon Mechanical Turk
Amazon Mechanical Turk (AMT) is a web platform operated by Amazon on which requesters can post web-based tasks that workers all over the world can take part in, requiring only an Amazon account. AMT provides services including account management, task management, participant control, and payment transactions.
In our sound labelling study, soundnails were shuffled and randomly grouped into 32 assignments of 10 to 11 sounds each, known as Human Intelligence Tasks (HITs) on AMT. The size of a HIT was based on the response times logged in the pilot study, so as to avoid an overly long or tiring task. We required at least 100 people to label each HIT, and no one person could label the same HIT twice. On average, the completion time per HIT was 14.64 minutes. The completion of the experiment took 97 days. Individual completion time per sound was also logged.
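The grouping itself is straightforward; here is a sketch under the assumption of a shuffle followed by a round-robin deal (the paper does not specify the exact mechanism, and the seed is ours):

```python
import random

def group_into_hits(sound_ids, n_hits=32, seed=0):
    """Shuffle soundnails and deal them into n_hits assignments of
    near-equal size (10-11 sounds each when given 327 soundnails)."""
    ids = list(sound_ids)
    random.Random(seed).shuffle(ids)
    hits = [[] for _ in range(n_hits)]
    for i, sid in enumerate(ids):       # round-robin deal keeps sizes balanced
        hits[i % n_hits].append(sid)
    return hits

hits = group_into_hits(range(327))
```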
Examples of country and participant counts for the AMT study: countries included the United States (49 states represented), among others.
3.3.2. Quality Control
Before subjects could proceed to the actual study, there was a login page with an auditory captcha of a person reading a sequence of letters and numbers. Subjects were required to enter what had been said correctly in order to access the experiment page (Figure 2(c)). This step ensured that people could hear the sound properly and were listening carefully, and prevented automated "robots" from hacking into the system.
At the beginning of each HIT, an instruction clip was played, demonstrating what kind of sound would be played and how to answer the three questions. Participants were asked to enter mandatory words at specified places as practice. This step further checked the sound system and guarded against automatically generated, and thus invalid, responses; it also helped participants familiarize themselves with the interface and gave an idea of the desired level of descriptive detail.
Once labels were submitted, our system compared the new results with the ground-truth data from the pilot study to ensure that people were actually paying attention to the study and that meaningful labels were assigned. Finally, a manual review determined whether to accept or reject the work.
4. Data Processing and Generation of Sense Sets
After the AMT online sound labelling study was completed, each soundnail had been labelled by at least 100 (up to 174) participants. All labels were in sentence format. To facilitate analysis and evaluation, the semantic human label data were processed as follows.
Each sentence was broken down into a bag of words. Function words that do not carry much information, such as "the" and "and," were filtered out. The raw data contained inflected words, which we stemmed (reduced to their base forms) with the help of WordNet and the Natural Language Toolkit. Each unstemmed word was first looked up in WordNet, an online lexical database, to see if it has a meaning independent of its base form; if this was not the case, it was stemmed. For example, "woods" meaning "forest" was not reduced to "wood," since it has its own meaning, while "pens" was transformed back to "pen." Following these steps, each sound was associated with a set of validated words.
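A minimal sketch of this normalization pipeline, with a tiny hand-rolled lexicon standing in for WordNet and NLTK; the stopword list and lexicon below are illustrative only.

```python
# Minimal sketch of the label-normalization step. Words listed in LEXICON
# are kept as-is (they carry a meaning of their own, like "woods");
# otherwise a plural "-s" is stripped when that yields a known base form.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "is", "are", "in", "on"}
LEXICON = {"woods", "pen", "baby", "rain", "forest", "sound"}

def normalize(sentence: str) -> list:
    out = []
    for word in sentence.lower().split():
        word = word.strip(".,!?")
        if word in STOPWORDS:
            continue                      # drop function words
        if word in LEXICON:
            out.append(word)              # has its own entry, keep unstemmed
        elif word.endswith("s") and word[:-1] in LEXICON:
            out.append(word[:-1])         # "pens" -> "pen"
        else:
            out.append(word)
    return out

words = normalize("The pens are in the woods.")
```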
Top 10 and bottom 10 sounds in average word count: Ball, Table Tennis Ball; Reverse, Truck Backup; Zoo, Bird Dog and People; School, Classroom Bell; Cat, Persian Meowing; Weight, Off the Scale; Beer, Bottle Open; Farm, Hen House; Farm, Cattle in Shed; Move, Concrete Block.
Within the "bag of words" for a given sound, different words were often used to denote the same or a very similar concept. It therefore seemed meaningful to group those words together into a "sense set" (or concept group) when considering what concepts the sound evokes. For convenience, in the following sections a sense set will be referred to as a "label," as distinguished from a "word." Unless otherwise specified, all the calculations and evaluations described below are based on labels rather than words.
Synonym sets, in which words have the same meaning. For example, "baby," "infant," and "newborn" are grouped into the "baby" sense set, labelled with the most frequently used word, "baby."
Similar senses expressed by words from different parts of speech. For example, "rain (noun)," "raining (verb)," and "rainy (adj.)" are grouped into the "rain" sense set.
Hyponyms and hypernyms (sub- and superordinates). This varied case by case. For example, for the sound "ball," the words "basketball," "tennis ball," and "ping pong ball" were all put in the "ball" sense set, while for the sound "basketball," the word "basketball" had its own sense set.
A weight was calculated for each member word in a sense set based on its actual word count. In this process, misspelled words were corrected and taken into account.
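The grouping-and-weighting step can be sketched as follows; the synonym table here is a hypothetical stand-in for the WordNet-based grouping described above.

```python
from collections import Counter

# Hypothetical synonym table; the real grouping used WordNet synsets,
# cross-part-of-speech merges, and per-sound hyponym decisions.
SENSE_OF = {"infant": "baby", "newborn": "baby",
            "raining": "rain", "rainy": "rain"}

def build_sense_sets(words):
    """Map each word to its sense set and weight members by word count."""
    counts = Counter(words)
    sets = {}
    for word, n in counts.items():
        label = SENSE_OF.get(word, word)
        sets.setdefault(label, Counter())[word] += n
    # weight of a member word = its share of the sense set's total count
    return {label: {w: c / sum(members.values())
                    for w, c in members.items()}
            for label, members in sets.items()}

weights = build_sense_sets(["baby", "baby", "infant", "rain", "rainy"])
```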
5. Evaluation Metrics
Since word counts depend on the number of participants who labelled a sound and thus vary across sounds, a relative score, referred to as the "Sense Score," was calculated for each sense set per sound. It is the average number of times, across all labellers, that a sense set (label) was generated for a sound. Thus, the sense score indicates how strongly participants agree on a label.
Mean score and standard deviation: mean and stdev of the sense scores.
Steepness: this measure shows how quickly the sense scores drop across labels. Usually, the flatter the sense score distribution, the less clearly the sound is associated with a single concept.
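These metrics follow directly from the definitions above; note that the steepness formula below (the drop from the first to the second score) is one plausible reading, not necessarily the paper's exact measure.

```python
from statistics import mean, stdev

def sense_scores(label_counts: dict, n_labellers: int) -> dict:
    """Sense score = average number of times a label was produced per
    labeller, making scores comparable across sounds."""
    return {label: count / n_labellers for label, count in label_counts.items()}

def steepness(scores: dict) -> float:
    """One simple steepness proxy (an assumption, not the paper's formula):
    the drop from the top score to the second score."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]

# 100 labellers: "cat" was produced 90 times, "meow" 30, "animal" 10.
scores = sense_scores({"cat": 90, "meow": 30, "animal": 10}, n_labellers=100)
m, s = mean(scores.values()), stdev(scores.values())
```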
6. Primary Analysis
6.1. Audio Expressiveness
Top ten and bottom ten sounds in top sense score: Spring, Door Spring Vibrate; Stop, Hose Pipe; Cold, Teeth Chatter; Cat, Persian Meowing; Bucket, Throw Can into Bucket; Window, Window Slide Open; Cry, Baby Girl Cry; Gym, Intensive Workout Breathing; Telephone, Ring Pick Up; Bike, Wheel Turning; Horn, Car Horn; Ball, Croquet Hit; Farm, Hen House; Dryer, Hairdryer Stop; Young, Baby Talk; Umbrella, Opening Umbrella.
6.2. Effectiveness in Illustrating Target Concepts
Examples of how well sounds convey target concepts: Cat, Persian Meowing; Farm, Cattle in Shed; Day, Rooster Clock Crickets; Floor, Walk in Classroom; Television, Change Channel; Slice, Cut Bread; Umbrella, Open Umbrella; Bike, Wheel Turn.
(1) For those sounds whose target word shows the highest agreement, the results confirm that they successfully convey the target concept. There are about ninety sounds in this category. These soundnails are effective and can likely be utilized to assist language comprehension and communication.
(2) For those sounds whose most agreed-upon label (different from the target word) matches the sound description (given in the sound file name), it can be said that the sound (scene) is distinctive and conveys a concept, though a different one from what was desired. About 150 sounds are in this category. Two possible reasons can be cited for this result: (a) the desired concept requires an extra linkage to the sound scene; (b) the participants focused on different objects or aspects related to the sound.
(3) Sounds for which participants agreed strongly on labels different from the sound description are suggestive of a concept, though not the a priori one. About 52 sounds fall into this category.
(4) For the sounds on which participants in general did not agree (low top scores), we conclude that they lack the necessary characteristics for people to identify and associate them with specific concepts. About 35 sounds fall into this category.
Of course, cases (2)–(4) may simply indicate problems with the scripting and sound selection. Further analysis of why people came up with labels different from the desired ones can guide the future refinement of our network of concept-nonlinguistic audio connections.
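The four outcome cases described above can be expressed as a small decision procedure; the 0.5 agreement threshold and the function signature below are illustrative assumptions, not values from the paper.

```python
def outcome(target: str, description_words: set, top_label: str,
            top_score: float, threshold: float = 0.5) -> int:
    """Assign a soundnail to one of the four outcome cases.
    The 0.5 agreement threshold is an illustrative assumption."""
    if top_score < threshold:
        return 4                      # low agreement: sound is ambiguous
    if top_label == target:
        return 1                      # conveys the target concept
    if top_label in description_words:
        return 2                      # matches the engineer's description
    return 3                          # distinct, but a different concept

case = outcome("cat", {"persian", "meowing", "cat"}, "cat", 0.9)
```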
6.3. Audio Categorization
(1) Source: the source of the sound (Table 7).
(2) Event: complexity in terms of the number of interacting participants (Table 8).
(3) Scene: the location where the sounds are likely to take place (Table 9).

Table 7: Descriptions of sounds divided by category.
Vocal sounds made by humans, such as coughing and laughing.
Actions performed by humans, such as walking on snow and knocking on a door.
Complex events that involve humans, such as a football game.
Sounds made by animals such as birds and crickets.
Sounds generated by natural phenomena such as wind and waves (excluding sounds made by animals).
Sounds resembling sounds occurring in nature, such as people blowing air or splashing water.
Sounds made by contact between two objects, such as a ball hitting a bat.
Sounds made by objects, such as a rustling plastic bag.
Sounds related to vehicles (cars, boats, planes) as well as their parts.
Sounds made by mechanical tools, such as scissors and handsaw.
Sounds made by a machine or electric device, such as a drill.
Electronic devices such as television and radio.
All kinds of alarms and sirens.
Bells such as doorbells and church bells.
All synthesized sounds.

Table 8: Descriptions of sounds divided by event.
Sounds initiated and completed by a single source can be divided into finer groups: SingleNature and SingleArtifact.
Single source sounds made by living beings or natural phenomena.
Single source sounds made by bells, machines, and artifacts.
Sounds of human manipulating one object, such as rustling a bag.
Sounds of two objects interacting, such as pen scratching paper.
Complex sound scenes or sounds with multiple entities involved.

Table 9: Descriptions of sounds divided by scene.
Sounds evoking a nonspecific outdoor location (e.g., wind).
Sounds evoking a nonspecific indoor location (e.g., walking on a floor).
Bathroom sounds such as flushing.
Kitchen sounds such as washing dishes.
School sounds, such as those suggestive of a classroom.
Office sounds such as printing.
Workshop/factory sounds, such as hammering.
Transportation-related sounds such as car sounds and stations.
Sports-related sounds such as a basketball game and jogging.
Commercial transaction-related sounds such as a cash register.
Nature sounds such as birds singing.
Location-independent sounds, such as coughing.
6.4. Audioability and Parts of Speech
Comparison of the numbers of different parts of speech among the target words, the most agreed-upon labels, and all agreed-upon words.
Pairwise comparison between the parts of speech of the target words and those of the most agreed-upon labels.
7.1. Sources of Discrepancies in Audio Interpretation
There are a number of reasons why people may interpret sounds differently from one another and from the a priori labels. Here we discuss cases (3) and (4) mentioned in Section 6.2.
Concepts that are ambiguous from an audio-characteristics perspective do not seem to have a unique sound associated with them, or at least not a sound distinctive enough at a finer level. For example, a "desk" does not have a characteristic sound of its own, because artifacts do not generate sounds by themselves unless they are used by a person; similarly, it seems difficult to distinguish the sound of an iron bell from that of a steel bell, which suggests that fine-grained differences among category members are not audible.
The participants' familiarity with a sound could be an important factor affecting their perception. For example, many people mistook the lion's roar for a bear sound or even a cow call. Life experience is a related factor: comparing the AMT labels to the pilot study labels, we found that the young students in the pilot study made many more mistakes in identifying an old-style phone dialling sound.
A conceptual-linguistic perspective suggests that many abstract concepts are difficult to evoke with sounds. For example, we tried to represent the concept "day" (meaning a complete 24-hour cycle) by combining a rooster crowing, a clock ticking, and crickets chirping into one sequence. While most participants were able to identify one or more concepts in the sequence, none of them generated the label "day." Similarly, the sound for "winter" was in most cases labelled "Christmas." This suggests that for very abstract concepts, people tend to associate the sound with more specific events.
We tried to represent abstract concepts like "up" and "down" by changing the pitch of the sounds, similar to earcons. However, nearly all participants labelled these as synthetic sounds for games or alarms. Attempts to illustrate "left" and "right" failed in similar ways (we could not determine whether users had proper stereo sound systems). This appears to support previous findings that, compared to actual environmental sounds, earcons require more learning.
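For illustration, a rising or falling earcon of the kind described can be synthesized as a linear pitch sweep. All parameters below (frequency range, duration) are made up; the paper does not specify how its pitch-shifted sounds were produced.

```python
import math

def pitch_sweep(direction: str, seconds: float = 1.0, rate: int = 16000,
                f_lo: float = 300.0, f_hi: float = 900.0) -> list:
    """Generate a rising ("up") or falling ("down") linear pitch sweep,
    in the spirit of the earcons described above (parameters are made up)."""
    n = int(seconds * rate)
    samples, phase = [], 0.0
    for i in range(n):
        frac = i / n
        if direction == "down":
            frac = 1.0 - frac
        freq = f_lo + (f_hi - f_lo) * frac     # instantaneous frequency
        phase += 2 * math.pi * freq / rate     # integrate phase over time
        samples.append(math.sin(phase))
    return samples

up = pitch_sweep("up")
```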
7.2. Better Audio Categorization in Terms of Expressiveness
Our three criteria (sources, locations, and events) for accounting for the sounds' audioability are not sufficient to explain the results. More relevant factors that impact the distinctiveness of nonspeech audio should be hypothesized and investigated. For example, material (glass versus metal versus stone) might be a strong indicator.
A better categorization of sounds based on their expressiveness will provide guidance for designing improved nonspeech audio representations of concepts.
8. Summary and Conclusions
In this paper, we describe an experiment collecting a large number of human-generated semantic labels for a collection of nonspeech audio clips. The ultimate goal is to create effective auditory representations for commonly used concepts to assist language comprehension, acquisition, and communication. The audio clips are intended to evoke concepts, rebuilding or enhancing the missing links between words and actual concepts for people with language disabilities or barriers, in the context of Augmentative and Alternative Communication, language rehabilitation, and reading comprehension.
In the experiment, which was conducted online via the Amazon Mechanical Turk platform, 327 "soundnails" associated with 184 words from different parts of speech were labelled by over 100 participants each, addressing the source(s), location(s), and event(s) involved in the audio content. The soundnails had a maximum length of five seconds and were extracted from special sound effect collections using audio processing and machine learning schemes. Labels were normalized (stemmed) and regrouped into semantic units (sense sets). A score based on word counts and the number of labellers was calculated per sense set per sound. Several evaluation metrics were proposed to further assess how well a sound can convey a concept.
Results showed that about a third of the soundnails evoked the a priori concepts. For another half of the sounds, the auditory contents were correctly identified, though participants agreed on labels (sense sets) that differed from the target concepts. These sounds were verified and can be directly applied to our nonspeech-audio-enhanced semantic vocabulary network. The remaining sounds were either too similar to other auditory events or too ambiguous to generate agreement among human labellers. Different possible factors affecting the expressiveness and descriptiveness of a sound were discussed, ranging from auditory complexity and characteristics to linguistic features and human-related factors.
Three categorizations of sounds, based on sources, locations, and events, respectively, were proposed in order to explore the factors bearing on the distinctiveness of sounds and their effectiveness in conveying specific concepts. However, only a few categories were strongly indicative of the expressiveness of the sounds. Future work will include analyses based on different criteria, such as the material make-up of objects involved in the sounds.
The authors would like to thank the Princeton Sound Lab and the Human Computer Interaction Group for assistance with the experimental design. They are grateful to the Kimberly and Frank H. Moss '71 Research Innovation Fund of the Princeton School of Engineering and Applied Science, and the Microsoft Intelligent Systems for Assistive Cognition Grants for their support.
- Ng S, Bradac J: Power in Language: Verbal Communication and Social Influence. Sage, Beverly Hills, Calif, USA; 1993.
- Ma X, Cook P: How well do visual verbs work in daily communication for young and old adults? In Proceedings of the 27th International Conference on Human Factors in Computing Systems (CHI '09), 2009, Boston, Mass, USA. ACM Press, New York, NY, USA.
- Danielsson H, Jonsson B: Pictures as language. Proceedings of the International Conference on Language and Visualization, 2001, Stockholm, Sweden.
- Saygin AP, Dick F, Wilson SW, Dronkers NF, Bates E: Neural resources for processing language and environmental sounds: evidence from aphasia. Brain 2003, 126(4):928-945. 10.1093/brain/awg082
- Clarke S, Bellmann A, De Ribaupierre F, Assal G: Non-verbal auditory recognition in normal subjects and brain-damaged patients: evidence for parallel processing. Neuropsychologia 1996, 34(6):587-603. 10.1016/0028-3932(95)00142-5
- Dick F, Bussiere J, Saygin A: The effects of linguistic mediation on the identification of environmental sounds. Center for Research in Language Newsletter 2002, 14(3):3-9.
- Yost W, Popper A, Fay R: Auditory Perception of Sound Sources. Springer, London, UK; 2007.
- Watanabe K, Shimojo S: When sound affects vision: effects of auditory grouping on visual motion perception. Psychological Science 2001, 12(2):109-116. 10.1111/1467-9280.00319
- Schnider A, Benson DF, Alexander DN, Schnider-Klaus A: Non-verbal environmental sound recognition after unilateral hemispheric stroke. Brain 1994, 117(2):281-287. 10.1093/brain/117.2.281
- Ellis D: Prediction-driven computational auditory scene analysis, Ph.D. thesis. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Mass, USA; 1996.
- Harb H, Chen L: A general audio classifier based on human perception motivated model. Multimedia Tools and Applications 2007, 34(3):375-395. 10.1007/s11042-007-0108-9
- Freesound Project, 2008, http://www.freesound.org
- Marcell MM, Borella D, Greene M, Kerr E, Rogers S: Confrontation naming of environmental sounds. Journal of Clinical and Experimental Neuropsychology 2000, 22(6):830-864. 10.1076/jcen.22.6.830.949
- Lakatos S, Scavone G, Cook P: Obtaining perceptual spaces for large numbers of complex sounds: sensory, cognitive, and decisional constraints. In Proceedings of the 16th Annual Meeting of the International Psychophysics Society, 2000. Edited by: Bonnet C. 245-250.
- Lingraphica, 2005, http://www.lingraphicare.com
- BBC Sound Effects Library, 2007, http://www.sound-ideas.com/bbc.html
- Fellbaum C: WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Mass, USA; 1998.
- FindSounds, 2008, http://www.findsounds.com
- Tzanetakis G, Cook P: Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 2002, 10(5):293-302. 10.1109/TSA.2002.800560
- Amazon Mechanical Turk, 2007, https://www.mturk.com/mturk/welcome
- Natural Language Toolkit (NLTK), 2007, http://www.nltk.org
- Brewster SA, Wright P, Edwards A: An evaluation of earcons for use in auditory human-computer interfaces. In Proceedings of the Conference on Human Factors in Computing Systems (INTERCHI '93), 1993. ACM Press; 222-227.
- Blattner M, Sumikawa D, Greenberg R: Earcons and icons: their structure and common design principles. Human-Computer Interaction 1989, 4(1):11-44. 10.1207/s15327051hci0401_1
- Rocchesso D, Fontana F: The Sounding Object. Mondo Estremo; 2003.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.