Recognizing emotion from Turkish speech using acoustic features
© Oflazoglu and Yildirim; licensee Springer. 2013
Received: 14 May 2013
Accepted: 14 October 2013
Published: 5 December 2013
Affective computing, especially from speech, is one of the key steps toward building more natural and effective human-machine interaction. In recent years, several emotional speech corpora in different languages have been collected; however, Turkish is not among the languages that have been investigated in the context of emotion recognition. To this end, a new Turkish emotional speech database, comprising 5,100 utterances extracted from 55 Turkish movies, was constructed. Each utterance in the database is labeled with an emotion category (happy, surprised, sad, angry, fearful, neutral, and other) and with coordinates in a three-dimensional emotion space (valence, activation, and dominance). We performed classification of four basic emotion classes (neutral, sad, happy, and angry) and estimation of the emotion primitives using acoustic features. The importance of acoustic features in estimating the emotion primitive values and in classifying emotions into categories was also investigated. An unweighted average recall of 45.5% was obtained for the classification. For emotion dimension estimation, we obtained promising results for the activation and dominance dimensions. For valence, however, the correlation between the averaged ratings of the evaluators and the estimates was low. Cross-corpus training and testing also showed good results for the activation and dominance dimensions.
Recognizing the emotional state of the interlocutor and changing the way of communicating accordingly play a crucial role in the success of human-computer interaction. However, many technical challenges need to be resolved before a real-time emotion recognizer can be integrated into human-computer interfaces. These challenges include, as in any pattern recognition problem, data acquisition and annotation, feature extraction and selection of the most salient features, and building a robust classifier. In this paper, we address each of these problems in the context of emotion recognition from Turkish speech and perform a cross-corpus evaluation.
The lack of data is a major challenge in emotion recognition. Even though great efforts have been made to collect emotional speech data in recent years, there is still a need for emotional speech recordings to cope with the problem of data sparseness. One way to obtain emotional speech data is to have human subjects read a set of pre-determined, emotionally neutral sentences in specified emotional states. The Berlin database of emotional speech [1], the Danish Emotional Speech database [2], the LDC Emotional Prosody Speech and Transcripts [3], and the Geneva Multimodal Emotion Portrayals (GEMEP) corpus [4] are examples of studio-recorded emotional speech databases.
Even though studio-recorded (acted) databases provide more balanced data in terms of the number of utterances per emotion, the emotions are less natural and realistic than those we encounter in real life. One way to overcome this problem is to create environments that induce the desired emotions in the subjects. Sensitive Artificial Listener (SAL) [5], Airplane Behaviour Corpus (ABC) [6], Speech Under Simulated and Actual Stress (SUSAS) [7], TUM Audiovisual Interest Corpus (AVIC) [6], Interactive Emotional Dyadic Motion Capture (IEMOCAP) [8], the SEMAINE database [9], and the FAU Aibo emotion corpus [10] are examples of such databases. For example, the FAU Aibo emotion corpus consists of 9 h of spontaneous German speech from 51 children interacting with Sony's pet robot Aibo. A Wizard-of-Oz technique was used for data collection, and the speech data was then annotated at the word level with 11 emotion categories by five annotators [10]. Audio-visual recordings obtained from TV shows and movies are also used for data acquisition, e.g., the Vera-Am-Mittag (VAM) database [11], the Situation Analysis in Fictional and Emotional corpus (SAFE) [12], and the Belfast Naturalistic Database [13]. For example, the VAM corpus consists of audio-visual recordings taken from the German TV talk show Vera am Mittag. The corpus contains 946 spontaneous utterances from 47 participants of the show. The SAFE corpus [12] contains 7 h of audio-visual data extracted from English fiction movies and was constructed mainly for fear-type emotion recognition. In this paper, we utilized Turkish movies and TV shows to obtain speech data, since emotional speech extracted from movies is more realistic than studio-recorded emotions expressed by actors reading pre-defined sentences.
An important requirement of most data-driven systems is the availability of annotated data. The goal of annotation is to assign a label to data. For the emotion recognition task, annotation is needed to determine the true emotion expressed in the collected speech data. Largely motivated by psychological studies, two approaches have been employed in emotion recognition research for emotion annotation. The classical approach is to use a set of emotion words (categories) to describe emotion-related states. Even though there are ongoing debates concerning how many emotion categories exist, the emotion categories (fear, anger, happiness, disgust, sadness, and surprise) defined by Ekman [14] are commonly used in most studies on automatic emotion recognition. However, the main disadvantage of the categorical approach is that it fails to represent the wide range of real-life emotions. The second approach is to use a continuous multidimensional space model to describe emotions. In this approach, an emotion is defined as a point in a multidimensional space rather than as one of a small number of emotion categories. The dimensions in this approach are called emotion primitives. The most commonly used dimensions are valence, activation, and dominance. Valence represents the negative-to-positive axis, activation the calm-to-excited axis, and dominance the weak-to-strong axis of the 3D space. The most common databases, such as the FAU Aibo emotion corpus, the Situation Analysis in Fictional and Emotional corpus (SAFE), the Airplane Behaviour Corpus (ABC), and the TUM Audiovisual Interest Corpus (AVIC), were annotated with the categorical approach. Only a few databases exist where emotions are represented by emotion primitives; Sensitive Artificial Listener (SAL) [5] and Vera-Am-Mittag (VAM) [11] are labeled with the dimensional approach. To our knowledge, among the common databases, only a few, such as IEMOCAP [8] and the Belfast Naturalistic Database [15], include both categorical and dimensional labeling.
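To make the two labeling schemes concrete, the sketch below shows how a single utterance can carry both a categorical label and coordinates in the valence-activation-dominance space; the field names and value ranges are illustrative assumptions, not the actual schema of any of the databases above.

```python
from dataclasses import dataclass

@dataclass
class UtteranceLabel:
    """One utterance annotated under both schemes (illustrative names only)."""
    category: str       # e.g., "happy", "angry", "neutral", ...
    valence: float      # negative ... positive
    activation: float   # calm ... excited
    dominance: float    # weak ... strong

# An angry utterance: negative valence, high activation, high dominance.
label = UtteranceLabel(category="angry", valence=-0.7, activation=0.8, dominance=0.6)
```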
Many previous efforts have addressed emotion recognition by employing pattern recognition techniques using segmental and/or supra-segmental information obtained from speech [6, 16–26]. Acoustic parameters of the speech signal have been used extensively to capture the emotional coloring present in speech. Acoustic features are obtained from low-level descriptors (LLDs) such as pitch, energy, duration, Mel-frequency cepstral coefficients (MFCCs), and voice quality parameters by applying functionals (mean, median, percentiles, etc.). A comprehensive list of LLDs and functionals is given in [20]. Linguistic information can also be used for emotion recognition, especially when the speech data is spontaneous [16, 18, 22, 25, 28–31]. In this study, we considered only acoustic features and used the same feature set as the INTERSPEECH 2010 Paralinguistic Challenge [32].
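As a minimal illustration of the LLD-plus-functionals scheme, the following sketch collapses a variable-length per-frame contour (here a toy F0 track) into a fixed-length vector of statistics; the functional names follow those listed in Section 3, but the implementation is a simplified stand-in for a real extractor.

```python
import numpy as np

def functionals(lld: np.ndarray) -> dict[str, float]:
    """Collapse a variable-length LLD contour (e.g., per-frame F0)
    into a fixed-length set of statistics."""
    return {
        "amean": float(np.mean(lld)),
        "stddev": float(np.std(lld)),
        "quartile1": float(np.percentile(lld, 25)),
        "quartile2": float(np.percentile(lld, 50)),
        "quartile3": float(np.percentile(lld, 75)),
        "percentile1.0": float(np.percentile(lld, 1)),
        "percentile99.0": float(np.percentile(lld, 99)),
    }

f0 = np.array([180.0, 185.5, 190.2, 201.7, 176.3])  # toy F0 track in Hz
print(functionals(f0))
```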
In this paper, we also performed cross-corpus evaluations, where the system is trained on one corpus and tested on another. Only a few studies provide such cross-corpus results [33, 34]. In [34], cross-corpus evaluation results for six well-known emotional speech databases were provided. In this work, we provide cross-corpus results using the VAM database [11].
This paper is organized as follows. Section 2 describes the Turkish emotional speech database. Section 3 explains the feature extraction and selection procedures. The experimental setup and results are given in Section 4. Section 5 concludes the paper.
2 Turkish emotional speech database
In recent years, several corpora in different languages have been collected; however, Turkish is not among the languages that have been investigated in the context of emotion recognition. In an attempt to create a TURkish Emotional Speech database^a (TURES), we recently extracted and annotated a large amount of speech data from 55 Turkish movies [35].
Distribution of utterances over speakers (the table gives the number of speakers falling in each utterance-count band: 50 to 99, 25 to 49, 10 to 24, and 2 to 9 utterances per speaker)
2.2 Annotation of emotional content
Annotation is needed to determine the true emotion expressed in the speech data. In this study, we employed both the categorical and the dimensional approach for emotion annotation. In the categorical approach, a set of emotion words is used to describe emotion-related states. In the dimensional approach, on the other hand, an emotion is defined as a point in a multidimensional space rather than as one of a small number of emotion categories.
The emotion in each utterance was evaluated in a listener test by a large number of annotators (27 university students), each working independently of the others. The annotators were asked to listen to the entire set of speech recordings (in randomly permuted order) and to assign an emotion label (both categorical and dimensional) to each utterance. The annotators took only the audio information into consideration.
2.2.1 Categorical annotation
The level of agreement among the annotators was measured with Fleiss' kappa statistic [36]:

\[ \kappa = \frac{P_a - P_c}{1 - P_c} \]

where Pa is the proportion of times that the n evaluators agree, and Pc is the proportion of times we would expect the n evaluators to agree by chance. The details of how Pa and Pc can be calculated are given in [36]. If there is no agreement among the evaluators, κ=0, and κ=1 when there is full agreement. The kappa score computed for the agreement level of the emotion categories between the 27 annotators is 0.32. A score between 0.2 and 0.4 may be considered moderate inter-evaluator agreement.
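A compact implementation of this statistic, assuming the annotations are arranged as an items-by-categories count matrix (an assumption about data layout, not the paper's actual tooling), could look as follows.

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) count matrix.

    ratings[i, j] = number of annotators who assigned item i to category j;
    every row must sum to the same number of annotators n.
    """
    N, _ = ratings.shape
    n = ratings.sum(axis=1)[0]            # annotators per item
    p_j = ratings.sum(axis=0) / (N * n)   # overall category proportions
    P_i = (np.square(ratings).sum(axis=1) - n) / (n * (n - 1))  # per-item agreement
    P_a, P_c = P_i.mean(), np.square(p_j).sum()
    return (P_a - P_c) / (1 - P_c)

# Toy usage: 4 utterances, 3 annotators, 3 emotion categories.
counts = np.array([[3, 0, 0],
                   [2, 1, 0],
                   [0, 2, 1],
                   [1, 1, 1]])
print(round(fleiss_kappa(counts), 3))
```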
2.2.2 Annotation in 3D space
Statistics from the distribution of the correlation coefficients between the annotator’s ratings and the average ratings
Comparison of emotion class centroids, mean, and standard deviations (stdv) in the 3D emotion space
3 Acoustic features
In this study, we used the same feature set as the INTERSPEECH 2010 Paralinguistic Challenge [32]: a set of 1,582 acoustic features based on several acoustic low-level descriptors (LLDs) and statistics (functionals). We extracted these features using the openSMILE toolkit [27]. The LLDs include fundamental frequency (F0), loudness, voicing probability, MFCCs 0 to 14, logarithmic power of Mel-frequency bands 0 to 7 (logMelFreqBand), line spectral pair frequencies 0 to 7 computed from 8 LPC coefficients (lspFreq), and the voice quality measures shimmer and jitter. Delta coefficients of each of these LLDs are also included.
Overview of low-level descriptors and functionals. The functionals applied to the LLD contours include: amean, stddev, skewness, kurtosis; linregc1, linregc2, linregerrA, linregerrQ; quartile1, quartile2, quartile3, iqr1-2, iqr2-3, iqr1-3; percentile1.0, percentile99.0, pctlrange0-1. The LLDs are those listed above, including the voice quality measures shimmer and jitter.
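Feature extraction of this kind is typically scripted around openSMILE's SMILExtract command-line tool. The sketch below is one plausible way to batch-process a folder of utterances; the directory names and the IS10_paraling.conf configuration file are assumptions based on common openSMILE distributions, not details taken from the paper.

```python
import subprocess
from pathlib import Path

# Hypothetical paths; IS10_paraling.conf ships with openSMILE and implements
# the INTERSPEECH 2010 Paralinguistic Challenge feature set.
SMILEXTRACT = "SMILExtract"
CONFIG = "config/IS10_paraling.conf"

def extract_features(wav_path: Path, arff_path: Path) -> None:
    """Run openSMILE on one utterance and append its feature vector to an ARFF file."""
    subprocess.run(
        [SMILEXTRACT, "-C", CONFIG, "-I", str(wav_path), "-O", str(arff_path)],
        check=True,
    )

for wav in sorted(Path("tures_wav").glob("*.wav")):
    extract_features(wav, Path("tures_is10.arff"))
```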
3.1 Feature selection
For feature selection, we used correlation-based feature selection (CFS) [38], which scores a feature subset S by the heuristic merit

\[ \mathrm{Merit}_S = \frac{k \, \overline{r_{cf}}}{\sqrt{k + k(k-1)\, \overline{r_{ff}}}} \]

where k is the number of features in the subfeature space S, r̄cf is the average class-feature correlation, and r̄ff is the average feature-feature inter-correlation. CFS calculates rcf and rff using symmetrical uncertainty, a normalized variant of information gain.
Since an exhaustive search through all possible feature subsets is not feasible, sub-optimal but faster search strategies such as hill climbing, genetic search, best-first, and random search are usually chosen. In this work, we used the best-first search method.
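The merit function itself is simple to compute. The toy sketch below, with made-up correlation values, shows how redundancy among features lowers the merit of a subset even when its class-feature correlation is unchanged.

```python
import numpy as np

def cfs_merit(k: int, r_cf: float, r_ff: float) -> float:
    """Heuristic merit of a k-feature subset with average class-feature
    correlation r_cf and average feature-feature correlation r_ff."""
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

# A subset only pays off if its features stay relevant (high r_cf)
# without becoming redundant (high r_ff):
print(cfs_merit(10, 0.4, 0.2))   # ~0.76
print(cfs_merit(10, 0.4, 0.6))   # ~0.50  redundancy lowers the merit
```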
4 Emotion classification
In this paper, we focused on the four major emotion classes (neutral, sad, happy, and angry) and did not include the surprised, fearful, and other classes in the classification experiments. We evaluated the performances of a support vector machine with radial basis function kernel (SVM-RBF), implemented in the LIBSVM library [39], and of Bayesian networks (BayesNet), provided by the Weka pattern recognition tool [40, 41]. The performance of an SVM depends strongly on the parameters used. To optimize the SVM performance, we used a grid search with fivefold cross-validation to select the penalty parameter C for mislabeled examples and the Gaussian kernel parameter γ. We also linearly scaled each attribute to the range [0,1]. The scaling parameters for each attribute were calculated from the training data of each fold, and the same scaling factors were applied to the corresponding training and testing data.
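A minimal sketch of this setup is given below, using scikit-learn's SVC (which wraps LIBSVM) rather than the paper's exact toolchain; the data arrays are random placeholders and the parameter grid is a generic choice, not the grid used in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = np.random.rand(200, 50), np.random.randint(0, 4, 200)  # placeholder data

pipe = Pipeline([
    ("scale", MinMaxScaler()),     # [0,1] scaling fit on training data only
    ("svm", SVC(kernel="rbf")),
])
grid = GridSearchCV(
    pipe,
    {"svm__C": 2.0 ** np.arange(-5, 16, 2),
     "svm__gamma": 2.0 ** np.arange(-15, 4, 2)},
    cv=5, scoring="recall_macro",  # macro recall = unweighted average recall
)
grid.fit(X, y)
print(grid.best_params_)
```

Placing the scaler inside the pipeline ensures that, as described above, the [0,1] scaling parameters are estimated on the training portion of each fold only.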
The performance of the classifiers was evaluated by tenfold cross-validation. To ensure speaker independence, no instance from a test speaker is allowed in the training set of any fold. For each experiment, CFS feature selection is performed on the training set of each fold. The results are presented in terms of the confusion matrix, weighted average (WA) recall, and unweighted average (UA) recall. WA recall is defined as the ratio of the number of correctly classified instances to the total number of instances in the database. Since the classes in the databases are unbalanced, we also report UA recall, the average of the per-class accuracies, which is more informative than WA recall when the class distribution is highly skewed.
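Both recall measures can be read directly off a confusion matrix; the small helper below, with a toy skewed two-class example, makes the difference between them visible.

```python
import numpy as np

def wa_ua_recall(conf: np.ndarray) -> tuple[float, float]:
    """conf[i, j] = number of class-i instances classified as class j."""
    per_class = np.diag(conf) / conf.sum(axis=1)   # recall of each class
    wa = np.diag(conf).sum() / conf.sum()          # overall accuracy
    ua = per_class.mean()                          # unweighted average recall
    return wa, ua

# Skewed toy example: the majority class dominates WA but not UA.
conf = np.array([[90, 10],
                 [ 8,  2]])
print(wa_ua_recall(conf))   # (~0.84, 0.55)
```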
4.1 Categorical classification results
Performances for categorical emotion classification
4.2 Emotion primitives estimation
Emotion primitive estimation was formulated as a regression problem and solved using support vector regression (SVR) [42]. The estimation performance was evaluated with the mean absolute error

\[ \bar{e} = \frac{1}{N}\sum_{i=1}^{N} |x_i - y_i| \]

and the correlation coefficient between the reference ratings and the estimates

\[ r = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}} \]

where N is the number of speech data in the database, x is the true value, and y is the predicted value.
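A sketch of this regression setup with scikit-learn's SVR and the two metrics above follows; the feature matrix and primitive ratings are random placeholders, and the hyperparameters are library defaults rather than the paper's tuned values.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X = np.random.rand(300, 50)          # placeholder acoustic features
y = np.random.uniform(-1, 1, 300)    # placeholder primitive ratings, e.g., activation

model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
pred = cross_val_predict(model, X, y, cv=10)

mae = np.abs(y - pred).mean()        # mean absolute error
r = np.corrcoef(y, pred)[0, 1]       # correlation with the reference ratings
print(f"MAE={mae:.3f}  r={r:.3f}")
```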
For the cross-corpus experiments, we employed the VAM database [11]. The VAM corpus consists of audio-visual recordings taken from the German TV talk show Vera am Mittag. The corpus contains 946 spontaneous utterances from 47 participants of the show, and each utterance was labeled on a discrete five-point scale along each of the three emotion dimensions (valence, activation, and dominance) by 6 to 17 labelers.
4.2.1 Estimation results
The estimation performances for TURES and VAM databases
Research shows that language and culture play an important role in how vocal emotions are perceived [43]. Recently, a few studies have presented results on cross-corpus evaluations, i.e., training on one corpus and testing on a different one [33, 34]. However, most of this work employed either different databases of the same language or of Germanic languages. Turkish, in contrast, is an agglutinative language, i.e., new words can be formed from existing words using a rich set of affixes. In this study, we performed cross-corpus experiments between the Turkish emotional speech database and the VAM corpus. The cross-corpus results are given in Table 6. It can be seen from the table that cross-corpus training and testing works well for the activation and dominance dimensions. For example, when TURES was chosen for training and VAM for testing, correlation coefficients of 0.743 and 0.717, with mean absolute errors of 0.189 and 0.178, were obtained for activation and dominance, respectively. For valence, as in the intra-corpus experiments, the cross-corpus results were not promising. This result indicates that acoustic information alone is not enough to discriminate emotions in the valence dimension and is consistent with previous research [19]. Other sources of information, such as linguistic information, are needed to obtain better discrimination in the valence dimension [31].
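Protocol-wise, the cross-corpus experiment amounts to fitting the regressor on one corpus and scoring it on the other. The sketch below shows this under the simplest assumption, namely that feature normalization is estimated on the training corpus only; the arrays are placeholders standing in for the TURES and VAM feature sets.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Placeholder arrays; in the real experiment these hold the IS2010 features
# and the dimensional ratings of the two corpora.
X_tures, y_tures = np.random.rand(400, 50), np.random.uniform(-1, 1, 400)
X_vam, y_vam = np.random.rand(200, 50), np.random.uniform(-1, 1, 200)

# Train on TURES, test on VAM; the scaler is fit on the training corpus.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf")).fit(X_tures, y_tures)
pred = model.predict(X_vam)
print("MAE:", np.abs(y_vam - pred).mean(),
      "r:", np.corrcoef(y_vam, pred)[0, 1])
```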
4.2.2 Emotion classification from the emotion primitives
Emotion classification results from three-dimensional emotion primitives
5 Conclusions
In this work, we carried out a study on emotion recognition from Turkish speech using acoustic features. In recent years, several corpora in different languages have been collected; however, Turkish had not been investigated in the context of emotion recognition. In this paper, we presented the Turkish Emotional Speech Database and reported baseline results. Categorical representations and dimensional descriptions are the two common approaches to defining the emotion present in speech. In the categorical approach, a fixed set of words is used to describe an emotional state, whereas in the dimensional approach, emotion is defined as a point in a multidimensional space. The three most commonly used dimensions are valence, activation, and dominance, which represent the main properties of emotional states. In this work, both categorical classification and emotion primitive estimation were performed. An unweighted average recall of 45.5% was obtained for the classification. For emotion dimension estimation, the regression results in terms of correlation coefficient are promising for activation and dominance, with 0.739 and 0.743, respectively. For valence, however, the correlation between the averaged ratings of the evaluators (the reference values) and the SVR estimates was low (only 0.288). We also performed cross-corpus evaluations, and the results were again promising for the activation and dominance dimensions, while valence remained difficult; this indicates that acoustic information alone is not enough to discriminate emotions in the valence dimension. Future work includes the use of linguistic information to improve the classification and regression results, especially for valence.
^a The Turkish emotional speech database is available to the research community through the website http://www.turesdatabase.com.
This work was supported by the Turkish Scientific and Technical Research Council (TUBITAK) under project no. 109E243.
1. Burkhardt F, Paeschke A, Rolfes M, Sendlmeier W, Weiss B: A database of German emotional speech. Paper presented at Interspeech, 9th European conference on speech communication and technology, Lisbon, Portugal, 4–8 Sept 2005
2. Engberg IS, Hansen AV: Documentation of the Danish Emotional Speech Database. Aalborg: Aalborg University; 1996
3. Liberman M, Davis K, Grossman M, Martey N, Bell J: Emotional Prosody Speech and Transcripts. Philadelphia: Linguistic Data Consortium; 2002
4. Banziger T, Mortillaro M, Scherer K: Introducing the Geneva multimodal expression corpus for experimental research on emotion perception. Emotion 2012, 12:1161-1179
5. Douglas-Cowie E, Cowie R, Sneddon I, Cox C, Lowry O, Mcrorie M, Claude Martin J, Devillers L, Abrilian S, Batliner A, Amir N, Karpouzis K: The HUMAINE Database: addressing the collection and annotation of naturalistic and induced emotional data. In Affective Computing and Intelligent Interaction: Lecture Notes in Computer Science. Edited by: Paiva ACR, Prada R, Picard RW. Berlin: Springer; 2007:488-500
6. Schuller B, Vlasenko B, Eyben F, Rigoll G, Wendemuth A: Acoustic emotion recognition: a benchmark comparison of performances. In IEEE Workshop on Automatic Speech Recognition and Understanding. Merano, Italy: IEEE; 13–17 Dec 2009
7. Hansen JHL, Bou-Ghazale S: Getting started with SUSAS: a speech under simulated and actual stress database. Paper presented at the fifth European conference on speech communication and technology, EUROSPEECH 1997, Rhodes, Greece, 22–25 Sept 1997
8. Busso C, Bulut M, Lee CC, Kazemzadeh A, Mower E, Kim S, Chang J, Lee S, Narayanan S: IEMOCAP: interactive emotional dyadic motion capture database. J. Lang. Resour. Eval. 2008, 42(4):335-359. 10.1007/s10579-008-9076-6
9. McKeown G, Valstar M, Cowie R, Pantic M: The SEMAINE corpus of emotionally coloured character interactions. In IEEE ICME. Suntec City, Singapore; 19–23 Jul 2010
10. Steidl S: Automatic Classification of Emotion Related User States in Spontaneous Children's Speech. Germany: University of Erlangen-Nuremberg; 2009
11. Grimm M, Kroschel K, Narayanan S: The Vera am Mittag German audio-visual emotional speech database. In IEEE International conference on multimedia and expo (ICME). Hannover, Germany: IEEE; 23–26 Jun 2008
12. Clavel C, Vasilescu I, Devillers L, Ehrette T, Richard G: The SAFE Corpus: fear-type emotions detection for surveillance applications. In LREC. Genoa, Italy; 24–26 May 2006
13. Douglas-Cowie E, Campbell N, Cowie R, Roach P: Emotional speech: towards a new generation of databases. Speech Commun. Spec. Issue Speech and Emotion 2003, 40:33-60
14. Ekman P: Basic emotions. In Handbook of Cognition and Emotion. Edited by: Dalgleish L, Power M. New York: Wiley; 1999:409-589
15. Douglas-Cowie E, Cowie R, Schroder M: A new emotion database: considerations, sources and scope. In ISCA Workshop on speech and emotion. Newcastle, UK; 5–7 Sept 2000
16. Ang J, Dhillon R, Krupski A, Shriberg E, Stolcke A: Prosody-based automatic detection of annoyance and frustration in human-computer dialog. In ICSLP 2002. Denver, Colorado: ISCA; 16–20 Sept 2002
17. Nwe TL, Foo SW, De Silva L: Speech emotion recognition using hidden Markov models. Speech Commun. 2003, 41(4):603-623. 10.1016/S0167-6393(03)00099-2
18. Lee CM, Narayanan S: Towards detecting emotions in spoken dialogs. IEEE Trans. Speech Audio Process. 2005, 13(2):293-303
19. Grimm M, Kroschel K, Mower E, Narayanan S: Primitives-based evaluation and estimation of emotions in speech. Speech Commun. 2007, 49:787-800. 10.1016/j.specom.2007.01.010
20. Schuller B, Batliner A, Seppi D, Steidl S, Vogt T, Wagner J, Devillers L, Vidrascu L, Amir N, Kessous L, Aharonson V: The relevance of feature type for the automatic classification of emotional user states: low level descriptors and functionals. In eighth conference on InterSpeech. Antwerp, Belgium: ISCA; 27–31 Aug 2007
21. Clavel C, Vasilescu I, Devillers L, Richard G, Ehrette T: Fear-type emotion recognition for future audio-based surveillance systems. Speech Commun. 2008, 50(6):487-503. 10.1016/j.specom.2008.03.012
22. Yildirim S, Narayanan S, Potamianos A: Detecting emotional state of a child in a conversational computer game. Comput. Speech Lang. 2011, 25:29-44. 10.1016/j.csl.2009.12.004
23. Albornoz EM, Milone DH, Rufiner HL: Spoken emotion recognition using hierarchical classifiers. Comput. Speech Lang. 2011, 25(3):556-570. 10.1016/j.csl.2010.10.001
24. Lee CC, Mower E, Busso C, Lee S, Narayanan S: Emotion recognition using a hierarchical binary decision tree approach. Speech Commun. 2011, 53(9-10):1162-1171. [Special issue: Sensing Emotion and Affect - Facing Realism in Speech Processing] 10.1016/j.specom.2011.06.004
25. Polzehl T, Schmitt A, Metze F, Wagner M: Anger recognition in speech using acoustic and linguistic cues. Speech Commun. 2011, 53(9-10):1198-1209. 10.1016/j.specom.2011.05.002
26. Batliner A, Steidl S, Schuller B, Seppi D, Vogt T, Wagner J, Devillers L, Vidrascu L, Aharonson V, Kessous L, Amir N: Whodunnit - searching for the most important feature types signalling emotion-related user states in speech. Comput. Speech Lang. 2011, 25:4-28. 10.1016/j.csl.2009.12.003
27. Eyben F, Wöllmer M, Schuller B: openSMILE: the Munich versatile and fast open-source audio feature extractor. In international conference on multimedia. Firenze, Italy: ACM; 25–29 Oct 2010
28. Arunachalam S, Gould D, Andersen E, Byrd D, Narayanan S: Politeness and frustration language in child-machine interactions. In InterSpeech. Aalborg, Denmark; 3–7 Sept 2001
29. Batliner A, Steidl S, Schuller B, Seppi D, Laskowski K, Vogt T, Devillers L, Vidrascu L, Amir N, Kessous L, Aharonson V: Combining efforts for improving automatic classification of emotional user states. In fifth Slovenian and first international language technologies conference. Ljubljana, Slovenia: IS-LTC'06; 9–10 Oct 2006
30. Schuller B, Batliner A, Steidl S, Seppi D: Emotion recognition from speech: putting ASR in the loop. In IEEE international conference on acoustics, speech, and signal processing. Taipei, Taiwan: IEEE; 19–24 Apr 2009
31. Schuller B: Recognizing affect from linguistic information in 3D continuous space. IEEE Trans. Affect. Comput. 2012, 2(4):192-205
32. Schuller B, Steidl S, Batliner A, Burkhardt F, Devillers L, Muller C, Narayanan S: The INTERSPEECH 2010 paralinguistic challenge. In InterSpeech. Makuhari, Japan; 26–30 Sept 2010
33. Shami M, Verhelst W: Automatic classification of expressiveness in speech: a multi-corpus study. In Speaker Classification II, LNCS. Edited by: Müller C. Berlin: Springer; 2007:43-56
34. Schuller B, Vlasenko B, Eyben F, Wollmer M, Stuhlsatz A, Wendemuth A, Rigoll G: Cross-corpus acoustic emotion recognition: variances and strategies. IEEE Trans. Affect. Comput. 2010, 1(2):119-131
35. Oflazoglu C, Yildirim S: Turkish emotional speech database. In IEEE 19th conference on signal processing and communications applications. Antalya, Turkey: IEEE; 20–22 Apr 2011
36. Fleiss J: Measuring nominal scale agreement among many raters. Psychol. Bull. 1971, 76(5):378-382
37. Bradley M, Lang PJ: Measuring emotion: the self-assessment manikin and the semantic differential. J. Behav. Ther. Exp. Psychiatry 1994, 25:49-59. 10.1016/0005-7916(94)90063-9
38. Hall M: Correlation-based feature selection for machine learning. PhD thesis, University of Waikato, New Zealand; 1999
39. Chang CC, Lin CJ: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2:1-27
40. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. SIGKDD Explor. Newsl. 2009, 11:10-18. 10.1145/1656274.1656278
41. Bouckaert R: Bayesian Network Classifiers in Weka for Version 3-5-7, Technical Report. Hamilton, NZ: Waikato University; 2008
42. Smola AJ, Schölkopf B: A tutorial on support vector regression. Stat. Comput. 2004, 14(3):199-222
43. Scherer KR, Banse R, Wallbott H: Emotion inferences from vocal expression correlate across languages and cultures. J. Cross Cult. Psychol. 2001, 32:76-92. 10.1177/0022022101032001009
44. Grimm M, Kroschel K, Narayanan S: Support vector regression for automatic recognition of spontaneous emotions in speech. In IEEE international conference on acoustics, speech and signal processing. Honolulu, HI; 15–20 Apr 2007
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.