- Open Access
Music-aided affective interaction between human and service robot
© Park et al.; licensee Springer. 2012
- Received: 2 April 2011
- Accepted: 19 January 2012
- Published: 19 January 2012
This study proposes a music-aided framework for affective interaction of service robots with humans. The framework consists of three systems for perception, memory, and expression, respectively, modeled on the mechanisms of the human brain. We propose a novel approach to identifying human emotions in the perception system. Conventional approaches use speech and facial expressions as the representative bimodal indicators for emotion recognition; our approach additionally uses the mood of music as a supplementary indicator to determine emotions more accurately. For multimodal emotion recognition, we propose an effective decision criterion that uses records of bimodal recognition results relevant to the musical mood. The memory and expression systems also utilize musical data to provide natural and affective reactions to human emotions. To evaluate our approach, we simulated the proposed human-robot interaction with a service robot, iRobiQ. Our perception system exhibited superior performance over the conventional approach, and most human participants reported favorable reactions toward the music-aided affective interaction.
Keywords: Facial expression, Facial image, Emotion recognition, Perception system, Facial expression recognition
Service robots operate autonomously to provide useful services for humans. Unlike industrial robots, service robots interact with a large number of users in a variety of places, from hospitals to homes. As design and implementation breakthroughs in the field of service robotics follow one another rapidly, people are beginning to take a great interest in these robots. An immense variety of service robots are being developed to perform human tasks such as educating children and assisting elderly people. In order to coexist in humans' daily life and offer services in accordance with a user's intention, service robots should be able to affectively interact and communicate with humans.
Affective interaction provides robots with human-like capabilities for comprehending the emotional states of users and interacting with them accordingly. For example, if a robot detects a negative user emotion, it might encourage or console the user by playing digital music or synthesized speech and by performing controlled movements. Accordingly, the primary task for affective interaction is to provide the robot with the capacity to automatically recognize emotional states from human emotional information and produce affective reactions relevant to user emotions.
Human emotional information can be obtained from various indicators: speech, facial expressions, gestures, pulse rate, and so forth. Although many researchers have tried to create an exact definition of emotions, the general conclusion that has been drawn is that emotions are difficult to define and understand [1, 2]. Because of this uncertainty in defining emotions, identifying human emotional states via a single indicator is not an easy task, even for humans. For this reason, researchers began to investigate multimodal information processing, which uses two or more indicators simultaneously to identify emotional states.
In the conventional approaches, speech and facial expression have successfully been combined for multimodality, since they both directly convey human emotions [4, 5]. Nevertheless, these indicators have several disadvantages for service robots. First, users need to remain in front of the robot while expressing emotions through either a microphone or a camera. Once a user moves out of sight, the robot may fail to monitor the emotional states. Second, the great variability in the characteristics of speech or facial expression with which humans express their emotions can degrade recognition accuracy. In general, different humans rarely express their emotional states in the same way. Thus, people who express emotions with unusual characteristics may fail to achieve satisfactory performance on standard emotion recognition systems.
To overcome these disadvantages of the conventional approaches, this study proposes a music-aided affective interaction technique. Music is oftentimes referred to as a language of emotion. People commonly enjoy listening to music whose mood accords with their emotions. In previous studies, researchers confirmed that music greatly influences the affective and cognitive states of users [8–10]. For this reason, we utilize the mood of the music that a user is listening to as a supplementary indicator for affective interaction. Although the musical mood conveys human emotional information in an indirect manner, the variability of emotional states that humans experience while listening to music is relatively low compared with that of speech or facial expression. Furthermore, the music-based approach is less constrained by the distance between the user and the robot.
The remainder of this article is organized as follows. Section 2 reviews previous studies that are relevant to this study. Section 3 proposes a framework for affective interaction between humans and robots. Section 4 provides specific procedures of music-aided affective interaction. Section 5 explains the experimental setup and results. Finally, Section 6 presents our conclusions.
An increasing awareness of the importance of emotions has led researchers to attempt to integrate affective computing into a variety of products such as electronic games, toys, and software agents. Many researchers in robotics have also been exploring affective interaction between humans and robots in order to accomplish the intended goal of human-robot interaction.
For example, a sociable robot, 'Kismet', understands human intention through facial expressions and engages in infant-like interactions with human caregivers. 'AIBO', an entertainment robot, behaves like a friendly and life-like dog that responds to either the touch or sound of humans. A conversational robot called 'Mel' introduced a new paradigm of service robots that leads human-robot interaction by demonstrating practical knowledge. A cat robot was designed to simulate emotional behavior arising from physical interactions between a human and a cat. Tosa and Nakatsu [16, 17] have concentrated on the technology of speech emotion recognition to develop speech-based robot interaction. Their early studies, 'MUSE' and 'MIC', were capable of recognizing human emotions from speech and expressing emotional states through computer graphics on a screen. They have consistently advanced their research directions and developed further applications.
In efforts to satisfy the requirements for affective interaction, researchers have explored and advanced various types of software functions. Accordingly, it is necessary to integrate those functions and efficiently manage systematic operations according to human intentions. The best approach for this is to organize a control architecture or a framework for affective interaction between a human and a robot.
The primary function of the perception system is to obtain human emotional information from the outside world through useful indicators such as facial expression and speech. The memory system records the emotional memories of users and corresponding information in order to utilize them during the interaction with humans. Finally, the expression system executes the behavior accordingly and expresses emotions of the robot.
In the conventional approaches to achieve affective interaction, both speech and facial expression have mostly been used as representative indicators to obtain human emotional information. Those indicators, however, have several disadvantages when operated in robots, as addressed in Section 1. In addition, most of the conventional approaches convey the robot's emotional states in monotonous ways, using a limited number of figures or synthesized speech. Thus, users easily predict the robot's reactions and can lose interest in affective interaction with the robot. To overcome these drawbacks, we adopt music information in the framework of affective interaction.
Music is an ideal cue for identifying the internal emotions of humans and also has strong influences on the change of human emotion. Hence, we strongly believe that music will enable robots to more naturally and emotionally interact with humans. For the music-aided affective interaction, the mood of the music is recognized in the perception system and is utilized in the determination of the user's emotional state. Furthermore, our expression system produces affective reactions to the user emotions in more natural ways by playing music that the robot recommends or songs that the user previously listened to while exhibiting that emotion. The music-aided affective reaction is directly supported by the memory system. This system stores information on the music the user listens to with a particular emotional state. This section describes further specific features of each system in the framework of music-aided affective interaction.
4.1. Perception system
4.1.1. Musical mood recognition
One of the essential advantages of music-based emotion recognition is that monitoring of human emotion can be accomplished in the background without the user's attention. Users do not need to remain in front of the robot, since the musical sound can be loud enough to be analyzed in the perception system. For this reason, the module of the musical mood recognition is operated independently from the other modules in the perception system. Even though the musical mood provides a conjectured user emotion, the recognition result sufficiently enables the robot to naturally proceed with affective and friendly interaction with the user as long as the user plays music. For instance, if a user is listening to sad music, the robot can express concern, using a display or sound.
Compared to other tasks for musical information retrieval, such as genre identification, research on musical mood recognition is still at an early stage. General approaches have concentrated on acoustic features representing the musical mood and on criteria for the classification of moods [19–21]. A recent study focused on a context-based approach that uses contextual information such as websites, tags, and lyrics. In this study, we attempt to identify the musical mood without consideration of contextual information, so as to extend the range of music to instrumental music such as film soundtracks. Thus, we follow the general procedure of non-linguistic information retrieval from speech or sound [23, 24].
The mood recognition module is activated when the perception system detects musical signals. Audio signals transmitted through a microphone of a robot can be either musical signals or human voice signals. Thus, the audio signals need to be classified into music and voice, since the system is programmed to process voice signals in the speech emotion recognition module. For the classification of audio signals, we employ the standard method of voice activity detection based on the zero crossing rate (ZCR) and energy. When the audio signals indicate relatively high values in both ZCR and energy, the signals are regarded as musical signals. Otherwise, the signals are categorized as voice signals and submitted to the speech processing module.
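The ZCR/energy decision above can be sketched as follows; the framing parameters and thresholds are illustrative assumptions, not values from the paper:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=200):
    """Split a 1-D signal into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def zcr(frames):
    """Zero crossing rate per frame (fraction of sign changes)."""
    signs = np.sign(frames)
    signs[signs == 0] = 1
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def energy(frames):
    """Short-term energy per frame."""
    return np.mean(frames ** 2, axis=1)

def classify_audio(x, zcr_thresh=0.1, energy_thresh=0.01):
    """Label a signal 'music' when both mean ZCR and mean energy are
    high; otherwise hand it to the speech processing path."""
    frames = frame_signal(x)
    if zcr(frames).mean() > zcr_thresh and energy(frames).mean() > energy_thresh:
        return "music"
    return "voice"

# Synthetic check: a broadband, sustained signal (music-like) versus a
# quiet, low-frequency signal (voice-like in this toy setting).
sr = 16000
t = np.arange(sr) / sr
music_like = sum(np.sin(2 * np.pi * f * t) for f in (440, 1320, 3960)) / 3
voice_like = 0.05 * np.sin(2 * np.pi * 120 * t)
print(classify_audio(music_like), classify_audio(voice_like))
```

In a deployed system the thresholds would be tuned on recordings from the robot's actual microphone.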
The first step of musical mood recognition is to extract acoustic features representing the musical mood. Several studies have reported that Mel-frequency cepstral coefficients (MFCC) provide reliable performance in musical mood recognition, as this feature reflects the nonlinear frequency sensitivity of the human auditory system [19, 20]. Linear prediction coefficients (LPC) are also known to be a useful feature that describes musical characteristics well. These two features are commonly used as short-term acoustic features, whose non-linguistic characteristics are effectively modeled with probability density functions such as a Gaussian distribution [26, 27]. For this reason, we use these features as primary features. After extracting these features from each frame of 10-40 ms in the music stream, their first and second derivatives are added to the feature set of the corresponding frame in order to capture temporal characteristics between consecutive frames.
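The derivative-augmented feature set can be illustrated with a minimal numpy sketch; in practice the static coefficients would come from an MFCC/LPC front end, and the two-point difference used here stands in for whatever delta scheme the system actually employs:

```python
import numpy as np

def add_derivatives(features):
    """Append first and second temporal derivatives (deltas) to a
    (num_frames, num_coeffs) feature matrix, tripling the dimension."""
    d1 = np.gradient(features, axis=0)  # first derivative over frames
    d2 = np.gradient(d1, axis=0)        # second derivative
    return np.hstack([features, d1, d2])

# Toy example: 100 frames of 13 static coefficients (e.g., MFCC)
# become 100 frames of 39-dimensional vectors.
static = np.random.randn(100, 13)
extended = add_derivatives(static)
print(extended.shape)
```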
The extracted feature sequence is then evaluated against the trained mood models by computing the log-likelihood of each mood:

R_music(i) = log p(X | λ_i), i = 1, ..., M,   (1)

where X refers to a vector sequence of acoustic features extracted from the music stream, and GMM λ_i (i = 1, ..., M, if there are M musical moods) indicates the i-th mood model. The M log-likelihood results are then submitted to the emotion decision process.
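A minimal sketch of the per-mood likelihood scoring follows, using a single diagonal Gaussian per mood in place of a full GMM to keep the code short; the models and feature data are synthetic stand-ins:

```python
import numpy as np

def diag_gauss_loglik(X, mean, var):
    """Total log-likelihood of frame vectors X under a diagonal Gaussian.
    A real system would sum over mixture components (a GMM)."""
    const = -0.5 * np.sum(np.log(2 * np.pi * var))
    per_frame = const - 0.5 * np.sum((X - mean) ** 2 / var, axis=1)
    return per_frame.sum()

moods = ["neutral", "happy", "angry", "sad"]
rng = np.random.default_rng(0)
dim = 8
# Hypothetical trained mood models: (mean, variance) per mood.
models = {m: (rng.normal(size=dim), np.ones(dim)) for m in moods}

# Features drawn near the 'happy' model's mean should score highest there.
X = models["happy"][0] + 0.1 * rng.normal(size=(50, dim))
scores = {m: diag_gauss_loglik(X, *models[m]) for m in moods}
print(max(scores, key=scores.get))
```

The dictionary of scores corresponds to the M log-likelihood results passed on to the emotion decision.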
4.1.2. Bimodal emotion recognition from facial expression and speech
Facial expression and speech are the representative indicators that directly convey human emotional information. Because those indicators provide emotional information that is supplementary and/or complementary to each other, they have successfully been combined in terms of bimodal indicators. The bimodal emotion recognition approach integrates the recognition results, respectively, obtained from face and speech.
In facial expression recognition, accurate detection of the face has an important influence on recognition performance. A bottom-up, feature-based approach is widely used for robust face detection. This approach searches an image for a set of facial features indicating color and shape, and then groups them into face candidates based on the geometric relationships among the facial features. Finally, a candidate region is confirmed as a face by locating the eyes within the eye region of the candidate. The detected facial image is submitted to the module for facial expression recognition.
The first step of facial expression recognition is to normalize the captured image. Two kinds of features are then extracted on the basis of Ekman's facial expression features . The first feature is a facial image consisting of three facial regions: the lips, eyebrows, and forehead. By applying histogram equalization and the threshold of the standard distribution of the brightness of the normalized facial image, each of the facial regions is extracted from the entire image. The second feature is an edge image of those three facial regions. The edges around the regions are extracted by using histogram equalization.
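The brightness-based region extraction can be sketched with plain numpy histogram equalization and a mean-minus-k-sigma threshold; the synthetic image and the value of k are illustrative assumptions:

```python
import numpy as np

def equalize_hist(img):
    """Histogram equalization for an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[img]

def threshold_regions(img, k=1.0):
    """Binary mask of pixels darker than mean - k*std after equalization,
    a simple stand-in for extracting dark facial regions (lips, eyebrows)
    by a brightness threshold."""
    eq = equalize_hist(img)
    return eq < (eq.mean() - k * eq.std())

# Synthetic 'face': bright background with one dark band (an eyebrow, say).
img = np.full((64, 64), 200, dtype=np.uint8)
img[20:24, 10:50] = 30
mask = threshold_regions(img)
print(mask[22, 30], mask[5, 5])
```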
Next, the facial features are trained with a specific classifier in order to determine explicitly distinctive boundaries between emotions. The boundary is used as a criterion to decide the emotional state for a given facial image. Various techniques already in use for conventional pattern classification problems are likewise used for such emotion classifiers. Among them, neural network (NN)-based approaches have widely been adopted for facial emotion recognition and have provided reliable performance [29–31]. A recent study on NN-based emotion recognition reported the efficiency of the back-propagation (BP) algorithm proposed by Rumelhart and McClelland in 1986. In this study, we follow a previously introduced training procedure that uses an advanced BP algorithm called error BP.
Once audio signals transmitted through a robot microphone are determined to be human voice signals, the speech emotion recognition module is activated. In the first step, several acoustic features representing emotional characteristics are estimated from the voice signals. Two types of acoustic features are extracted: a phonetic feature and a prosodic feature. MFCC and LPC pertaining to musical mood recognition are also employed for speech emotion recognition in terms of phonetic features, while spectral energy and pitch are used as prosodic features. As in musical mood recognition, the first and second derivatives of all features are added to the feature set.
Next, the acoustic features are recognized through a pattern classifier. Even though various classifiers such as HMM and SVM have been applied to speech emotion recognition tasks, we employ the neural network-based classifier used in the facial expression recognition module in order to efficiently handle the fusion process in which the recognition results of the two indicators are integrated. We organize a sub-neural network for each emotion. Each sub-network has basically the same architecture: input nodes corresponding to the dimension of the acoustic features, hidden nodes, and an output node. The number of hidden nodes varies according to the distinctness of the respective emotions. When there are M emotions, acoustic features extracted from the voice signals are simultaneously fed into the M sub-networks, and thus an M-dimensional vector is obtained as the recognition result. The configuration of the neural network is similar to a previously adopted one, but we adjust the internal learning weights of each sub-network and the normalization algorithm in consideration of the characteristics of the acoustic features.
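The per-emotion sub-network arrangement can be sketched as follows; the weights are random stand-ins for trained parameters, and the hidden size is fixed here even though the paper notes it varies per emotion:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """One sub-network: input -> sigmoid hidden layer -> single output node."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))
    return float(W2 @ h + b2)

rng = np.random.default_rng(1)
emotions = ["neutral", "happy", "angry", "sad"]
dim, hidden = 12, 6  # illustrative feature and hidden dimensions

# Hypothetical trained weights, one sub-network per emotion.
nets = {e: (rng.normal(size=(hidden, dim)), rng.normal(size=hidden),
            rng.normal(size=hidden), rng.normal()) for e in emotions}

x = rng.normal(size=dim)  # one acoustic feature vector
result = np.array([forward(x, *nets[e]) for e in emotions])
print(result.shape)  # M-dimensional recognition vector, M = 4
```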
The recognition results of the two indicators are integrated through a fusion process:

Rbimodal,t(e) = Wf [Rface,t(e) + Rface,t-1(e)] + Ws [Rspeech,t(e) + Rspeech,t-1(e)],   (2)

where Wf and Ws are the weights for the respective indicators.
The weights are appropriately determined by reference to the recognition results for each indicator.
In general, the performance of standard emotion recognition systems depends substantially on the user's characteristics in expressing emotional states. Thus, such systems occasionally exhibit a common error: spurious rapid transitions of the recognized emotional state. To address this problem, we consider the general tendency that human emotional states rarely change quickly back and forth. Hence, the proposed fusion process in (2) uses the two recognition results obtained just before the current time t in order to reflect the emotional state demonstrated during the previous time.
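One plausible reading of this temporally smoothed fusion is sketched below; the mixing weights and the exact way the previous results enter the sum are assumptions for illustration, not the paper's values:

```python
import numpy as np

def fuse(face_t, speech_t, face_prev, speech_prev, wf=0.5, ws=0.5, alpha=0.3):
    """Bimodal fusion that mixes the current face/speech score vectors
    with those of the previous time step, damping abrupt emotion flips."""
    current = wf * face_t + ws * speech_t
    previous = wf * face_prev + ws * speech_prev
    return current + alpha * previous

# Emotion order: [neutral, happy, angry, sad].
# 'happy' was strong at t-1; a noisy spike toward 'angry' at t is damped.
face_prev   = np.array([0.1, 0.8, 0.05, 0.05])
speech_prev = np.array([0.1, 0.7, 0.1, 0.1])
face_t      = np.array([0.1, 0.4, 0.45, 0.05])
speech_t    = np.array([0.1, 0.45, 0.4, 0.05])
fused = fuse(face_t, speech_t, face_prev, speech_prev)
print(int(np.argmax(fused)))  # index 1 = 'happy'
```

Without the previous-step term, the current scores alone would leave 'happy' and 'angry' nearly tied.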
4.1.3. Emotion decision
The final procedure in the perception system is to determine an emotion on the basis of the bimodal fusion vector calculated in (2) and the mood recognition result estimated in (1). These two results are on different scales but have the same dimension, corresponding to the number of emotions and moods. Let us denote Rbimodal(e) and Rmusic(e) as the value of the e-th emotion in the fusion vector and that of the e-th mood in the mood recognition result, respectively.
The final emotion E* is determined as

E* = argmax_e [Wb Rbimodal(e) + Wm Rmusic(e) + Wr Rrecord(e)],   (4)

where Wb, Wm, and Wr refer to the weights or scaling factors for the corresponding results, and Rrecord(e) denotes the previously recorded bimodal result for the e-th emotion under the detected musical mood. This decision criterion is available only if either the bimodal indicators or the musical indicator is activated. When only musical signals are detected, Rbimodal(e) is automatically set to zero. If musical signals are not detected, the music-based results, Rmusic(e) and Rrecord(e), are set to zero.
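The weighted decision rule, including the zeroing of inactive sources, can be sketched as follows; the weight values and the record term's name are illustrative assumptions:

```python
import numpy as np

def decide(r_bimodal, r_music, r_record, wb=1.0, wm=0.5, wr=0.3):
    """Weighted decision over emotions. Any inactive source is passed in
    as a zero vector, matching the fallback behavior described above."""
    score = wb * r_bimodal + wm * r_music + wr * r_record
    return int(np.argmax(score))

# Music-only case: bimodal and record results are zeroed out, so the
# mood log-likelihoods alone drive the decision.
zeros = np.zeros(4)
r_music = np.array([-120.0, -80.0, -150.0, -140.0])  # 'happy' (index 1) best
print(decide(zeros, r_music, zeros))  # 1
```

Because the bimodal scores and the mood log-likelihoods live on different scales, the weights also act as scaling factors, as noted in the text.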
4.2. Memory system
The memory system consecutively records several specific types of emotions such as happy, sad, or angry from among the emotions that the perception system detects from three kinds of indicators. The system creates emotion records including the emotion type and time. Such emotion records can naturally be utilized for affective interaction with the user. For example, the robot can express concern the day after the user has been angered by something. When a negative user emotion is sustained for a long time, the memory system may attempt to change the user's negative feeling, forcing the expression system to control the degree of expression.
In addition to emotional information, the memory system records the information of music detected by the perception system. The system obtains and accumulates musical information such as the genre, title, and musician of the detected music, supported by an online music information retrieval system. The accumulated musical information is used to organize a music library directly oriented to the user, which provides explicit information of the user's favorite genres and musicians as well as the musical mood.
Although music is non-verbal information, the music library enables the robot to have more advanced and intelligent interaction skills. On the basis of this library, the robot may offer several songs befitting the user's emotion or recommend other songs similar to the music that the user is listening to. While a recommended song is played, the perception system monitors the user's response through the bimodal indicators.
The feedback, either negative or positive, on the song is then recorded in the memory system to be utilized in future interactions. The music library is continuously updated whenever the user plays a new song or provides feedback through emotional expression.
History of user emotions for each type of musical mood
4.3. Expression system
The expression system is an intermediate interface between the memory system and robot hardware devices. This system executes behavior operations and/or emotion expressions in order to react to the user emotions. Both operations basically depend upon robot hardware devices, since every service robot has different hardware capacity to process expression operations.
The second type of expression operation utilizes acoustic properties. The expression system can naturally produce emotional reactions through synthesized speech or music. Whenever the expression system determines either the content of the synthesized speech or the music to play, the historical records of user emotion and music information provided by the memory system are utilized. If the perception system detects a certain type of emotion from a user, the expression system can recommend several songs that the user has previously listened to while experiencing that emotion. Since the memory device of a robot can store a large amount of music data, users can hardly predict which song will be played. Thus, the music-aided expression system provides a more interesting and natural way of interaction between users and robots.
To evaluate the proposed framework for music-aided affective interaction, we implemented it on a service robot, iRobiQ, and simulated the human-robot interaction. We attempted to evaluate the efficiency of the two systems that play the most important roles in the framework: the perception and expression systems. We first introduce the technical specifications of iRobiQ, used as the robot platform in our research; experimental results are subsequently presented.
5.1. iRobiQ: a home service robot
This robot has its own hardware system for facial expression, and five types of facial expressions can be displayed: shy, disappointed, neutral, happy, and surprised. In addition, iRobiQ has two eyes designed on a segmented LCD displaying eye expressions. It also has LED dot matrices on its cheeks and mouth with which various emotions are expressed.
On the LCD screen located on the robot's chest, a variety of graphical images can be displayed. For this study, we implemented several graphical face images and used them to represent the robot's emotions more directly while interacting with a user. Compared to existing mechanical face robots, which require very complex motor-driven mechanisms and artificial skin, this kind of facial expression can deliver robot emotions in a more intimate manner.
5.2. Evaluation of the perception system
For the evaluation of the proposed perception system, we first conducted three kinds of emotion-recognition experiments independently: facial expression recognition, speech emotion recognition, and musical mood recognition. We then investigated the performance improvement in bimodal emotion recognition based on the proposed fusion process. Finally, music-aided multimodal emotion recognition was evaluated.
5.2.1. Experimental setups
To fairly verify each recognition module and to simulate bimodal and multimodal emotion recognition, we used four kinds of emotions or musical moods in each experiment: neutral, happy, angry, and sad. We chose 'angry' and 'sad' as the most representative negative emotions, whereas 'neutral' and 'happy' were chosen as non-negative emotions.
A typical difficulty in a standard multimodal emotion recognition task is data collection. In general, people of different countries have their own characteristic ways of expressing emotions facially and vocally. Thus, there are few standard multimodal databases collected from people around the world. Instead, most research studies depend on facial images and speech data obtained from nationals of a single country [36, 37]. We prepared training and evaluation data from ten Korean participants (five men and five women) who were asked to express emotions by making an emotional face and speaking short phrases of ordinary dialogue. Each participant generated five facial images and five utterances for each emotion, varying the contents of the dialogue. Consequently, 200 facial images and 200 speech recordings were collected. All data were recorded in a quiet environment without any background noise.
All experiments were conducted by k-fold cross validation to fairly assess the recognition performance for respective persons. In k-fold cross validation, the original sample is partitioned into k subsamples and each subsample is retained in turn as evaluation data for testing while the remaining k-1 subsamples are used as training data. The cross validation is thus repeated k times, with each of the k subsamples used exactly once as validation data. Hence, we repeated the evaluations ten times in accordance with a tenfold cross validation.
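The tenfold protocol can be sketched with a plain Python k-fold splitter:

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross validation:
    each fold serves exactly once as the evaluation set while the
    remaining k-1 folds form the training data."""
    fold = n // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold if i < k - 1 else n))
        excluded = set(test)
        train = [j for j in range(n) if j not in excluded]
        yield train, test

# 200 samples, tenfold: every sample appears in exactly one test fold.
seen = []
for train, test in kfold_indices(200, 10):
    assert len(train) + len(test) == 200
    seen.extend(test)
print(sorted(seen) == list(range(200)))  # True
```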
Table 2. Musical mood categories in which similar types of moods were combined and used for clip selection from AMG
In the human listening test, 30 participants (native speakers of Korean) listened to music clips chosen randomly from the website. The participants then classified each clip into one of the four types of musical moods listed in Table 2. Because all the participants were native Koreans, they could concentrate on the musical mood, ignoring the English lyrics of the clips.
5.2.2. Experimental results of unimodal and bimodal emotion recognition
Table: Performance (%) of the neural network-based facial expression recognition module (neural networks #1 and #2)
Table: Performance (%) of the neural network-based speech emotion recognition module (separate networks for men and women)
These two kinds of experimental results indicate that the two indicators (face and speech) do not operate in the same way. For example, the facial expression module best categorized the happy expression images, whereas the speech module determined sad speech with the best accuracy. Meanwhile, the angry emotion was better detected in speech than in facial expression. Such results emphasize the general necessity of bimodal emotion recognition.
To simulate bimodal recognition experiments, we asked each participant to make an emotional face and simultaneously to emotionally vocalize several given sentences for respective emotions. An emotion was then determined for each trial in real time based on the bimodal fusion process. We investigated the efficiency of the proposed fusion process described in (2), which considers the general tendency that human emotional states rarely change quickly back and forth. For this evaluation, emotions that the participants were requested to express were given sequentially without rapid changes. For the purpose of comparison, we also investigated the results on a simple fusion process that uses the sum of two unimodal results without consideration of the previous emotion.
Table: Performance (%) of bimodal emotion recognition with a simple fusion process and with the proposed fusion process
Confusion matrix of the musical mood recognition
5.2.3. Experimental results of music-aided multimodal emotion recognition
In order to investigate whether the results of musical mood recognition can complement the emotion results from the bimodal indicators, two types of multimodal experiments were conducted. In the first experiment, we virtually simulated multimodal recognition in iRobiQ, directly utilizing the evaluation data prepared for the unimodal experiments. We assumed that the evaluation data, that is, the audio-visual data and music clips, were entered into the affective system of the robot. Each recognition module in the perception system computed the results from the data independently, and emotions were determined by the decision criterion described in Section 4.1.3. In that section, we introduced the use of previously recorded bimodal recognition results: when the three recognition modules in the perception system are activated at the same time, the bimodal recognition results are recorded in relation to the musical mood, and the records are later utilized in the emotion-decision process along with the results of bimodal and musical mood recognition. We attempted to verify the efficiency of this approach, but could not perform the evaluation naturally, since the evaluation data were obtained from participants who were asked to imitate emotional expressions and the music clips were selected without consideration of personal musical preference. Thus, in the first experiment, the previous record of bimodal results was set to zero so as to be ignored.
The second experiment was conducted in a more natural way, supported by human participants. Each participant was asked to carry out the same behavior as in the bimodal recognition experiments, looking at iRobiQ and making an emotional face and speech for the respective emotions. At the same time, we played several songs categorized into a mood similar to the corresponding emotion, at a slight distance from the robot. The robot then received three kinds of emotional data and computed the recognition results via the respective recognition modules in the perception system. It should be noted that if both musical and human voice signals enter a single microphone, the two types of signals act as noise for each other, deteriorating the recognition accuracy. The ideal solution for this problem is to operate a process of blind source separation that divides the audio signals into music and voice. However, the correctness of the separation task would naturally affect the performance of the two audio recognition modules. For this reason, this study does not consider the problem caused by a single microphone, in order to concentrate on the performance evaluation of the respective recognition modules and their multimodality. Thus, iRobiQ received two different types of audio signals from two different microphones: an ear microphone worn by the participant for the voice input, and a general microphone equipped on the robot for the music input. The musical signals that the robot-equipped microphone received while the ear microphone was activated were regarded as music-mixed voice signals and were excluded from the musical mood recognition task.
In this experiment, we attempted to use the previous bimodal results in the emotion decision. Once the results of bimodal and musical mood recognition were computed in respective modules, the bimodal recognition results were used to update the average value of the bimodal results on the determined musical mood. The average value was then utilized in the emotion decision process along with the bimodal and musical mood recognition results, on the basis of (4).
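The per-mood record of bimodal results can be sketched as a running average; the paper does not specify the exact update rule, so this is an assumed implementation:

```python
class BimodalRecord:
    """Keeps a running average of bimodal recognition result vectors per
    musical mood, later usable as the record term in the emotion decision."""
    def __init__(self, num_emotions):
        self.sums = {}    # mood -> elementwise sum of bimodal vectors
        self.counts = {}  # mood -> number of updates
        self.n = num_emotions

    def update(self, mood, bimodal):
        s = self.sums.setdefault(mood, [0.0] * self.n)
        for e in range(self.n):
            s[e] += bimodal[e]
        self.counts[mood] = self.counts.get(mood, 0) + 1

    def average(self, mood):
        if mood not in self.counts:
            return [0.0] * self.n  # no record yet: contributes nothing
        c = self.counts[mood]
        return [v / c for v in self.sums[mood]]

# Two bimodal results observed while 'sad' music was playing.
rec = BimodalRecord(4)
rec.update("sad", [0.0, 0.0, 1.0, 3.0])
rec.update("sad", [0.0, 0.0, 1.0, 5.0])
print(rec.average("sad"))  # [0.0, 0.0, 1.0, 4.0]
```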
From these experiments, we conclude that musical mood information can effectively be utilized as a supplementary and complementary indicator in standard emotion recognition tasks based on speech and facial expression. Nevertheless, we need to further consider that different users might enjoy different types of musical moods while in a certain emotional state. In the human listening test conducted for the verification of the mood categorization, at least 70% of the participants assigned an identical mood to each clip. This result indicates that humans tend to feel similar emotions while listening to the same music. Even when people feel different emotions prior to listening to music, they tend to converge toward a common emotion induced by the particular mood of the music. This observation is closely associated with the general knowledge addressed in Section 1 that music greatly influences the affective states of humans. Consequently, musical mood recognition has strong potential to improve the reliability of affective interaction between humans and robots.
5.3. Evaluation of the expression system
In the proposed affective interaction, the expression system provides a natural interface that mediates between humans and service robots. As addressed in Section 4.3, the system enables service robots to react appropriately to the user's emotion through visual and acoustic expressions. In particular, the proposed expression system produces an affective reaction in a more natural way by playing music that the robot recommends, or songs that the user listened to in the same emotional state in the past.
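The song-selection behavior described above can be sketched as a small memory structure. This is an illustrative sketch, not the paper's implementation: the class name, the per-emotion history, and the fallback to a robot-recommended playlist are assumptions made for the example.

```python
from collections import defaultdict
import random

class MusicMemory:
    """Records which songs the user listened to in each emotional state,
    so the expression system can later play a matching song."""

    def __init__(self, default_playlist):
        # emotion -> songs the user actually played while in that state
        self.history = defaultdict(list)
        # emotion -> robot-recommended songs, used when no history exists
        self.default = default_playlist

    def record(self, emotion, song):
        """Store a song heard while the user was in the given emotional state."""
        self.history[emotion].append(song)

    def suggest(self, emotion):
        """Prefer the user's own history; fall back to the robot's recommendation."""
        songs = self.history.get(emotion) or self.default.get(emotion, [])
        return random.choice(songs) if songs else None
```

The design choice here reflects the text: music the user personally associated with an emotional state is favored over a generic recommendation, making the robot's reaction feel more personal.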
This study proposed an efficient framework for affective interaction between humans and robots. The framework comprises three systems: perception, memory, and expression. In each system, musical moods are utilized as important information. The perception system recognizes the mood of the music that the user listens to and uses it, along with facial expression and speech, to determine the user's emotional state. The memory system records musical moods corresponding to the user's emotional states and submits this information to the expression system, which enables the robot to produce a more natural reaction by playing music relevant to the user's emotion. In emotion recognition experiments conducted with a service robot, iRobiQ, the music-aided multimodal approach demonstrated superior performance over unimodal and bimodal approaches. Moreover, human participants reported favorable reactions toward the music-aided interaction with the robot.
In future work, we will evaluate our approach on a larger amount of emotional data. In addition, we will investigate the optimal combination of emotional features and classifiers, including SVM and HMM, within the proposed approach.
This study was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (MEST) (2011-0013776 and 2010-0025642).
- Richins M: Measuring emotions in the consumption experience. J Consum Res 1997, 24: 127-146. 10.1086/209499
- Cowie R, Cowie E, Tsapatsoulis N, Votsis G, Kollias S, Fellenz W, Taylor J: Emotion recognition in human-computer interaction. IEEE Signal Process Mag 2001, 18: 32-80. 10.1109/79.911197
- Nwe T, Foo S, Silva L: Speech emotion recognition using hidden Markov models. Speech Commun 2003, 41(4):603-623. 10.1016/S0167-6393(03)00099-2
- Paleari M, Huet B, Chellali R: Towards multimodal emotion recognition: a new approach. In Proc Conf Image Video Retrieval. Xi'an, China; 2010:174-181.
- De Silva LC, Ng PC: Bimodal emotion recognition. In Proc of Fourth IEEE Int Conf Automatic Face Gesture Recog. Grenoble, France; 2000:332-335.
- Ignacio LM, Carlos OR, Joaquin GR, Daniel R: Speaker dependent emotion recognition using prosodic supervectors. In Proc of Interspeech. Brighton, UK; 2009:1971-1974.
- Pratt CC: Music as the Language of Emotion: a Lecture delivered in the Whittall Pavilion of the Library of Congress. US Govt. Print. Off., Washington; 1952.
- Scherer KR, Zentner MR, Schacht A: Emotional states generated by music: an exploratory study of music experts. Musicae Scientiae 2002, 2001-2002: 149-171.
- Makiko A, Toshie N, Satoshi K, Chika N, Tomotsugu K: Psychological research on emotions in strong experiences with music. Human Interface 2003, 2003: 477-480.
- Gabrielsson A: Some reflections on links between music psychology and music education. Res Higher Music Educ 2002, 2002(2):77-86.
- Bartneck C, Okada M: Robotic user interfaces. In Proc Human Comp Conf. Aizu-Wakamatsu, Japan; 2001:130-140.
- Breazeal C, Scassellati B: A context-dependent attention system for a social robot. In Proc Sixteenth Int Joint Conf Art Intel. Stockholm, Sweden; 1999:1146-1151.
- Arkin RC, Fujita M, Takagi T, Hasegawa R: An ethological and emotional basis for human-robot interaction. Robot Autonomous Syst 2003, 42: 191-201. 10.1016/S0921-8890(02)00375-5
- Sidner CL, Lee C, Kidds CD, Lesh N, Rich C: Explorations in engagement for humans and robots. Artif Intell 2005, 166: 140-164. 10.1016/j.artint.2005.03.005
- Shibata T, Tashima T, Tanie K: Emergence of emotional behavior through physical interaction between human and artificial emotional creatures. In Proc Int Conf Robotics Automation. San Francisco, USA; 2000:2868-2873.
- Tosa N, Nakatsu R: Life-like communication agent-emotion sensing character "MIC" & feeling session character "MUSE". In Proc Int Conf Multi Comp Syst. Hiroshima, Japan; 1996:12-19.
- Nakatsu R, Nicholson J, Tosa N: Emotion recognition and its application to computer agents with spontaneous interactive capabilities. In Proc IEEE Int Workshop Multi Signal Process. Copenhagen, Denmark; 1999:439-444.
- Ledoux J: The Emotional Brain: The Mysterious Underpinning of Emotional Life. Simon & Schuster, New York; 1996.
- Schmidt EM, Turnbull D, Kim YE: Feature selection for content-based, time-varying musical emotion regression. In Proc ACM SIGMM Int Conf Multimedia Info Retrieval. Philadelphia, USA; 2010:267-274.
- Schmidt EM, Kim YE: Prediction of time-varying musical mood distributions from audio. In Proc Int Soc Music Inform Retrieval Conf. Utrecht, Netherlands; 2010:465-470.
- Yang YH, Lin YC, Su YF, Chen HH: A regression approach to music emotion recognition. IEEE Trans Audio Speech Lang Process 2008, 16(2):448-457.
- Kim YE, Schmidt EM, Migneco R, Morton BG, Richardson P, Scott J, Speck JA, Turnbull D: Music emotion recognition: a state of the art review. In Proc Int Soc Music Inform Retrieval Conf. Utrecht, Netherlands; 2010:255-266.
- Ahrendt P: Music genre classification systems--a computational approach. Ph.D. dissertation, Technical University of Denmark; 2006.
- Park JS, Kim JH, Oh YH: Feature vector classification based speech emotion recognition for service robots. IEEE Trans Consum Electron 2009, 55(3):1590-1596.
- Yang X, Tan B, Ding J, Zhang J, Gong J: Comparative study on voice activity detection algorithm. In Proc Int Conf Elect Control Eng. Wuhan, China; 2010:599-602.
- Kwon O, Chan K, Hao J, Lee T: Emotion recognition by speech signals. In Proc Eurospeech. Geneva, Switzerland; 2003:125-128.
- Huang R, Ma C: Toward a speaker-independent real time affect detection system. In Proc Int Conf Pattern Recog. Hong Kong, China; 2006:1204-1207.
- Ekman P, Friesen WV: Facial Action Coding System: Investigator's Guide. Consulting Psychologists Press, Palo Alto; 1978.
- Giripunje S, Bajaj P, Abraham A: Emotion recognition system using connectionist models. In Proc Int Conf Cog Neural Syst. Boston, USA; 2009:1-2.
- Franco L, Treves A: A neural network facial expression recognition system using unsupervised local processing. In Proc Int Symposium Image Signal Process Anal. Pula, Croatia; 2001:628-632.
- Rowley HA, Baluja S, Kanade T: Neural network-based face detection. IEEE Trans Pattern Anal Mach Intell 1998, 20(2):23-38.
- Zhu X: Emotion recognition of EMG based on BP neural network. In Proc Int Symposium Network Network Security. Jinggangshan, China; 2010:227-229.
- Rumelhart DE, McClelland JL: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge; 1986.
- Han J, Lee S, Hyun E, Kang B, Shin K: The birth story of robot, IROBIQ for children's tolerance. In 18th IEEE Int Symposium Robot Human Inter Comm. Toyama, Japan; 2009:318.
- Lee HG, Baeg MH, Lee DW, Lee TG, Park HS: Development of an android for emotional communication between human and machine: EveR-2. In Proc Int Symposium Adv Robotics Machine Intell. Beijing, China; 2006:41-47.
- Ververidis D, Kotropoulos C: Emotional speech recognition: resources, features, and methods. Speech Commun 2006, 48(9):1162-1181. 10.1016/j.specom.2006.04.003
- Cowie E, Campbell N, Cowie R, Roach P: Emotional speech: towards a new generation of databases. Speech Commun 2003, 40(1):33-60. 10.1016/S0167-6393(02)00070-5
- Vanroose P: Blind source separation of speech and background music for improved speech recognition. In Proc of the 24th Symposium on Information Theory. Yokohama, Japan; 2003:103-108.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.