Feature compensation based on the normalization of vocal tract length for the improvement of emotion-affected speech recognition

The performance of speech recognition systems trained with neutral utterances degrades significantly when these systems are tested with emotional speech. Since anyone can speak emotionally in real-world environments, the emotional state of the speaker must be taken into account in the automatic speech recognition system. Little work has been done in the field of emotion-affected speech recognition; so far, most research has focused on the classification of speech emotions. In this paper, the vocal tract length normalization method is employed to enhance the robustness of the emotion-affected speech recognition system. For this purpose, two structures of the speech recognition system are used, based on hybrids of the hidden Markov model with the Gaussian mixture model and with the deep neural network. Frequency warping is applied in the filterbank and/or discrete-cosine transform domain(s) of the feature extraction process of the automatic speech recognition system. The warping is conducted so as to normalize the emotional feature components and bring them close to their corresponding neutral feature components. The performance of the proposed system is evaluated in neutrally trained/emotionally tested conditions for different acoustical features and emotional states (i.e., Anger, Disgust, Fear, Happy, and Sad). The constructed emotion-affected speech recognition system is based on the Kaldi automatic speech recognition toolkit, with the Persian emotional speech database and the crowd-sourced emotional multi-modal actors dataset as the input corpora. The experimental simulations reveal that, in general, the warped emotional features yield better performance of the emotion-affected speech recognition system than their unwarped counterparts. Also, the speech recognizer using the deep neural network-hidden Markov model outperforms the system employing the hybrid with the Gaussian mixture model.


Introduction
Speech is the natural medium of communication for humans. In recent years, improvements in speech technology have led to a considerable enhancement in human-computer interaction. The applications of such technology are numerous, including speech and speaker recognition, interactive voice response (IVR), dictation systems, and voice-based command and control for robots, etc.
Despite all the recent advances in speech processing systems, often these systems struggle with issues caused by speech variabilities. Such variabilities in speech can occur due to speaker-dependent characteristics (e.g., the shape of the vocal tract, age, gender, health, and emotional states), environmental noise, channel variability, speaking rate (e.g., changes in timing and realization of phonemes), speaking style (e.g., read speech vs. spontaneous speech), and accent variabilities (e.g., regional accents or non-native accents) [1].
Over the last decades, automatic speech recognition (ASR) systems have progressed significantly. The function of these systems is to recognize the sequence of words uttered by a speaker. Speech recognizers could be used in many applications for more convenient human-machine interaction, including mobile phones, smart home devices, intelligent vehicles, medical devices, and educational tools.
It is known that speech variabilities such as emotions can affect speech recognition performance considerably. Most of the research in this area has focused on the recognition of speech emotions, whereas little work has been done in the area of emotion-affected speech recognition (EASR).
Generally, in real-life applications, there is an incompatibility between training and testing conditions. The current approaches for the reduction of the mismatch between the speech sets of neutral training and emotional testing of the EASR system can be categorized into three main classes.
In the first class of approaches, called model adaptation, the acoustic models are re-trained. The adaptation techniques in this group include maximum a posteriori (MAP) and maximum likelihood linear regression (MLLR) [2]. Vlasenko et al. [3] employed MLLR and MAP and applied them to the German emotional database (EMO-DB) [4] to improve the recognition performance. In an attempt to use a fast adaptation method, Pan et al. [5] used the MLLR technique to construct emotion-dependent acoustic models (AMs) by employing a small portion of the Chinese emotional database. Here, Gaussian mixture model (GMM)-based emotion recognition is first performed to improve the performance of the speech recognition by selecting an appropriate emotion-matched model. Another study was accomplished by Schuller et al. [6] in the framework of model adaptation. They employed adaptation methods in a hybrid ASR which is constructed by a combination of an artificial neural network (ANN) and a hidden Markov model (HMM) to form the ANN-HMM ASR structure. Compared to a static adaptation strategy, it is observed that maximum improvement in recognition is obtained by a dynamic adaptation method. To remedy the influence of emotion in recognition, Ijima et al. [7] incorporated paralinguistic information into the HMM process, which resulted in style estimation and adaptation using the multiple regression HMM (MRHMM) technique.
In the second group of methods, some knowledge clues are added to the language model (LM) of the ASR system to reduce the mismatch between the neutral training and emotional testing conditions. This strategy was taken by Athanaselis et al. [8] who explained how an emotion-oriented LM can be constructed from the existing British national corpus (BNC). In their method, an increased representation of emotional utterances is obtained by, first, identifying emotional words in BNC using an emotional lexicon. Then, sentences containing these words are recombined with BNC to construct a corpus with a raised proportion of emotional material. The corpus is then used to design emotionally enriched LM to improve recognition performance with emotional utterances.
The third class of approaches to compensate for the mismatch of acoustic characteristics between neutral utterances and emotionally affected speech materials involves those that study acoustic and prosodic features, intending to provide robust features to overcome the performance degradation in EASR systems. In [9], the performance of the HMM-based emotional speech recognizer is improved by using 12 cepstral coefficients, the logarithm of energy, their first- and second-order delta parameters, and additional features, such as the pitch frequency, its slope, and per-speaker-syllable-z-normalization (PSSZN). The study in [10] examines the changes in formant frequencies and pitch frequency due to emotional utterances. The HMM-based speech recognizer uses one log-energy coefficient and cepstral coefficients plus their delta and delta-delta parameters as its typical feature vector. In this work, an emotion recognition process is first conducted to find more appropriate parameters to be included in the feature vector. The results show that adding supplementary features such as pitch and formant frequencies to the feature vector is useful in improving emotional speech recognition. In another study, Sun et al. [11] proposed a new feature, called F-ratio scale frequency cepstral coefficients (FFCCs), that employed Fisher's F-ratio to analyze the importance of frequency bands for enhancing the mel-frequency cepstral coefficient (MFCC)/perceptual linear prediction (PLP) filterbank design under emotional conditions. The simulation results show that employing the optimized features increases the recognition performance of EASR as compared to the conventional features of MFCC or PLP in the sense of sentence error rate (SER).
The performance of a speech recognition system degrades when there is a mismatch between the set of speakers used to train the system and the set of speakers encountered at recognition time. This mismatch arises from the anatomical differences among speakers, as reflected in their vocal tract structures. The result is that a system trained on specific speakers will perform poorly in the presence of other speakers. Vocal tract length normalization (VTLN) is one of the approaches to reduce the mismatch between training data and recognition data in an ASR system. Lee et al. [12] performed pioneering work in utilizing the VTLN technique for diminishing the performance reduction in an ASR system caused by the variation of vocal tract length among different speakers. The procedure of speaker normalization is based on linearly warping the frequency axis of the mel-filterbank in the process of extracting mel-frequency cepstrum features. To this aim, first, the warping factor is estimated efficiently in a model-based maximum likelihood framework. Then, the normalization process is conducted by scaling the frequency axis of the speech signal with the calculated warping factor. The recognition results show the effectiveness of this procedure for telephone-based connected digit databases. In another approach to implement the frequency warping for VTLN, Panchapagesan et al. [13] proposed a novel transformation matrix to perform warping in the discrete-cosine transform (DCT) calculation stage of the MFCC feature for speaker normalization in the ASR system. Compared with other linear transformation approaches, employing the proposed transformation matrix had a lower computational load without modifying the standard MFCC feature extraction procedure. To demonstrate the effectiveness of the new linear transformation method for VTLN, the DARPA resource management (RM1) database was used [14].
Conceptually, when a person speaks emotionally, the anatomical features of his/her vocal tract change compared to those of a person speaking neutrally. This implies that compensating for the emotion-related variabilities in speech by the VTLN technique could increase the speech recognizer performance in emotional conditions. To improve the recognition rate of emotional speech, Sheikhan et al. [15] neutralized the MFCC features by applying the VTLN technique for the emotional states of Anger and Happy. The frequency warping of MFCCs is accomplished after finding the most emotion-affected frequency range. Finally, the neutralized MFCCs are employed in an HMM-based speech recognizer trained with neutral speech utterances. The simulation results demonstrate that applying the frequency warping to both the mel-filterbank and DCT modules yields better recognition performance than applying the warping to either module alone.
The previous studies have focused on applying VTLN as a normalization tool to MFCCs as the most popular acoustic feature in the speech recognition framework. The strategy taken in the present work is the same as in [15] in that VTLN is used to normalize the acoustical features extracted from an emotional utterance. However, our work differs from [15] in some aspects. Here, the robustness of different features including MFCCs is investigated in various emotional states with/without employing the cepstral mean normalization (CMN) in the EASR system. Next, a study is conducted to find an optimal frequency range in which warping is performed in the VTLN method. Also, the technique of VTLN is applied to acoustical features other than MFCCs to develop more robust features which can be used in improving the performance of the EASR system. Another aspect of the present work concerns the use of the deep neural network (DNN) in the structure of the speech recognizer. Due to the high performance of DNNs in acoustic modeling over the classical GMMs, the VTLN method is also employed with the state-of-the-art DNN-HMM speech recognizer.
The paper is organized as follows. In Section 2, the proposed EASR system and the technique of VTLN are presented, which describe the concept of warping or normalizing speech features in detail. Section 3 provides the experiments and recognition results for speech materials from two known databases with neutral and different emotional states. The simulation results presented in this section include examining the effect of applying CMN in the feature extraction process, investigating the influence of using different ranges of frequency warping, and evaluating the performance of various frequency warping methods for the GMM-HMM/DNN-HMM EASR system. The concluding remarks are given in Section 4.

Emotion-affected speech recognition system
The overall structure of the proposed EASR system is depicted in Fig. 1. Here, different emotional utterances serve as input to the system and are converted into a sequence of acoustic features by the feature extraction unit. However, the recognition rate of an ASR system trained with neutral speech degrades when features of emotional speech are fed into the system. This calls for a procedure for the normalization of acoustic features before they are used by the speech recognizer. To this aim, the technique of VTLN is adopted in feature extraction to alleviate the effects of emotion in the speech recognition process. The VTLN approach can be performed by frequency warping in the filterbank, in the DCT unit, or in both. After the process of feature normalization, the features are given to a speech recognizer as the back-end processing stage. In this work, the Kaldi speech recognizer [16] trained with neutral utterances of the Persian and English datasets [17,18] is employed as the baseline ASR system.

Feature extraction
Feature extraction aims to find a set of feature vectors that capture as much of the essential information in the input speech signal as possible. An ideal feature vector for emotional speech recognition should maximize the discriminability of speech classes (e.g., phonemes) while remaining unaffected by speaker-specific characteristics such as the shape and length of the vocal tract.
It has been shown that M-MFCC and ExpoLog have performed better than MFCC in speech recognition under stress conditions [20]. The extraction procedure of these features is similar to MFCC, but it differs from that of MFCC in the frequency scaling of the filterbank.
GFCC was introduced as a robust feature for speech recognition in a noisy environment [21]. The process of GFCC feature extraction is based on the gammatone filterbank, which is derived from psychophysical observations of the auditory periphery.
PNCC is one of the acoustic features which provides notable results for the recognition of speech in noisy and reverberant environments [24]. The extraction of the PNCC feature is inspired by human auditory processing.
In this paper, MFCC, M-MFCC, ExpoLog, GFCC, and PNCC are employed as auditory features in the EASR system.
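Since the text notes that M-MFCC and ExpoLog differ from MFCC only in the frequency scaling of the filterbank, a minimal sketch of a filterbank with a pluggable frequency scale may help. It is an illustration under stated assumptions, not the feature extractor used in the paper: the function names, the triangular filter shape, the 23-filter/512-point FFT setting in the usage note, and the flooring constant are all hypothetical choices, and only the standard mel scale is provided as a default.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Standard mel scale: 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def triangular_filterbank(n_filters, n_fft, sample_rate,
                          to_scale=hz_to_mel, from_scale=mel_to_hz):
    """Triangular filters with edges equally spaced on the chosen scale.

    Passing a different (to_scale, from_scale) pair changes only the
    frequency scaling. Returns an array of shape (n_filters, n_fft // 2 + 1).
    """
    f_max = sample_rate / 2.0
    # n_filters + 2 edge points, uniformly spaced on the chosen scale.
    edges_hz = from_scale(np.linspace(to_scale(0.0), to_scale(f_max), n_filters + 2))
    bin_hz = np.linspace(0.0, f_max, n_fft // 2 + 1)
    fbank = np.zeros((n_filters, bin_hz.size))
    for i in range(n_filters):
        lo, center, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_hz - lo) / (center - lo)
        falling = (hi - bin_hz) / (hi - center)
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fbank

def cepstral_features(power_spectrum, fbank, n_ceps=13):
    """Log filterbank energies followed by a unitary type-2 DCT."""
    log_energies = np.log(power_spectrum @ fbank.T + 1e-10)  # (frames, n_filters)
    n_filters = fbank.shape[0]
    m = np.arange(n_filters)
    k = np.arange(n_ceps)[:, None]
    dct = np.sqrt(2.0 / n_filters) * np.cos(np.pi * k * (2 * m + 1) / (2 * n_filters))
    dct[0] /= np.sqrt(2.0)  # k = 0 row uses sqrt(1 / n_filters)
    return log_energies @ dct.T  # (frames, n_ceps)
```

For example, `triangular_filterbank(23, 512, 16000)` yields a 23 x 257 matrix that can be passed to `cepstral_features` together with a matrix of per-frame power spectra.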

Feature normalization
As pointed out earlier, the mismatch between training and recognition phases causes performance degradation in ASR systems [25]. One source of this mismatch is that different speakers have vocal tracts with different anatomical features. Previous studies have shown that the acoustic and articulatory characteristics of speech are affected by the emotional content of the speech. There is evidence that, when a typical speaker speaks emotionally, the positions of the tongue tip, jaw, and lips change, and this, in turn, modifies acoustic features such as formant frequencies [26,27]. This implies that the vocal tract length variation can be considered as a function of the emotional state of a person [28]. Consequently, during the recognition of emotional speech, techniques are needed to decrease the influence of vocal tract length variations that arise from the emotional state of the speaker. Among different approaches that can be considered, the VTLN method is employed in this work as a way to remedy the mismatch problem in speech recognition applications.

Methods of VTLN
The VTLN technique views the main difference between two speakers as a change in the spectral content of acoustical features due to the differences in vocal tract length between speakers [29]. The idea of VTLN in speech recognition can be extended to the emotional speaking task, where the difference between emotional and neutral speech is associated with the variation of the frequency axis, originating from the vocal tract length differences of emotional and neutral speaking styles.
To cope with the mismatch between neutral training and emotional testing of an ASR system, the VTLN technique provides a warping function by which the frequency axis of the emotional speech spectrum is transformed to the frequency axis of the neutral speech spectrum. The normalization procedure can be performed by linear or nonlinear frequency warping functions, such as piecewise linear or exponential functions [12]. These functions operate based on a warping parameter, which compresses or expands the speech spectra as follows [29]:

$$f' = \mathrm{Warp}_{\alpha}(f) \tag{1}$$

Here, $f'$ is the result of the frequency warping operation applied to the frequency axis $f$ of the emotional speech spectrum, using α as the warping factor.
Most auditory-based acoustic features employ some sort of frequency decomposition (e.g., using filterbanks) and decorrelation of spectral features (e.g., using DCT processing) in their computations. This means that the warping of frequencies can be applied in the filterbank and/or DCT processing stage(s) of the acoustic feature extraction. Figure 2 represents the general block diagram of employing frequency warping in the filterbank and/or DCT domain(s) to compute the corresponding warped features of MFCC [19], M-MFCC [20], ExpoLog [20], GFCC [21], and PNCC [24] in one or both of the domains. For comparison purposes, the dashed boxes represent the optional cases of no-frequency warping which are used to generate the conventional features in their unwarped form. Intermediate operations performed for each feature extraction method are also illustrated. The warping strategies are discussed in detail below.

Frequency warping in the filterbank domain
In this section, the procedure of applying frequency warping to normalize filterbank-based acoustic features is discussed. Generally, frequency warping in the mel-filterbank is a well-known technique that was utilized in speech recognition tasks for speaker normalization [12]. Sheikhan et al. [15] also used this approach for the normalization of the MFCC feature for the emotional states of Anger and Happy in EASR. This strategy is also adopted in the present work for the normalization of other filterbank-based features for different emotional states (see Fig. 2).
Based on this strategy, frequency warping is applied to the frequencies of a typical filterbank to change the positions of frequency components. In this work, the distinction between the vocal tract lengths of emotional and neutral speech is modeled by a linear frequency warping function. The warping is performed by a piecewise linear function to preserve the bandwidth of the original signal. Motivated by the approach introduced in [12], the following warping function is proposed to perform the frequency warping in the filterbank stage of extracting acoustic features:

$$f_{\text{warped}}(n) = \begin{cases} f(n), & f(n) \le f_{2l} \\ \alpha\,\bigl(f(n) - f_{2l}\bigr) + f_{2l}, & f_{2l} \le f(n) \le f_{2h} \\ \dfrac{f_{3h} - f_{2l} - \alpha\,(f_{2h} - f_{2l})}{f_{3h} - f_{2h}}\,\bigl(f(n) - f_{3h}\bigr) + f_{3h}, & f_{2h} \le f(n) \le f_{3h} \\ f(n), & f(n) \ge f_{3h} \end{cases} \tag{2}$$

In this equation, $f(n)$ are the frequency bins of the $n$th frame, $f_{\text{warped}}(n)$ are the corresponding warped frequencies, and the parameter α is the warping factor that controls the amount of warping. Here, formant frequencies are considered to determine the warping intervals, where $f_{2l}$ and $f_{2h}$ represent, respectively, the lowest and highest values of the second formants, and $f_{3h}$ denotes the highest value of the third formants. These values are obtained as the average values among all second and third formants extracted from all sentences of a particular emotional state. The warping factor α for a specific emotional state is computed as the ratio of the average value of the second formants obtained from neutral utterances to that obtained from emotional utterances. The frequency warping is performed in the range $(f_{2l}, f_{2h})$, and the linear transformation in the $(f_{2h}, f_{3h})$ gap is utilized to compensate for the spectral changes caused by the frequency warping and to return the warping factor to 1. As an example, Fig. 3 shows the distribution of formant frequencies obtained from all utterances of the male speaker in the Persian ESD database [17] for the emotional state of Disgust along with the values for the warping intervals and warping factor. Figure 4 illustrates the piecewise linear warping function obtained for a sample utterance of Disgust in the database using Eq. (2). Here, the horizontal axis represents the unwarped (i.e., emotional) frequencies whereas the vertical axis indicates the warped (i.e., neutral) frequencies.
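The piecewise linear warping described above can be sketched as follows. This is a minimal illustration that assumes the warp is continuous: identity below $f_{2l}$, slope α on $(f_{2l}, f_{2h})$, a linear segment on $(f_{2h}, f_{3h})$ that rejoins the identity at $f_{3h}$, and identity above. The function names and the numeric values in the usage note are hypothetical and are not taken from Fig. 3.

```python
import numpy as np

def warping_factor(f2_neutral_hz, f2_emotional_hz):
    """alpha = mean neutral F2 / mean emotional F2, as described in the text."""
    return float(np.mean(f2_neutral_hz) / np.mean(f2_emotional_hz))

def piecewise_linear_warp(f_hz, alpha, f2l, f2h, f3h):
    """Continuous piecewise-linear warp of frequencies given in Hz.

    Assumes f2l < f2h < f3h. Frequencies outside (f2l, f3h) are unchanged,
    so the overall bandwidth is preserved.
    """
    f = np.asarray(f_hz, dtype=float)
    # Image of f2h under the slope-alpha segment.
    warped_f2h = f2l + alpha * (f2h - f2l)
    # Slope of the compensating segment that returns to the identity at f3h.
    slope_back = (f3h - warped_f2h) / (f3h - f2h)
    return np.where(
        f <= f2l, f,
        np.where(
            f <= f2h, f2l + alpha * (f - f2l),
            np.where(f <= f3h, f3h + slope_back * (f - f3h), f),
        ),
    )
```

With the hypothetical values `alpha=1.1, f2l=1000, f2h=2000, f3h=3000`, a bin at 1500 Hz maps to 1550 Hz, while bins at 500 Hz and 3500 Hz are left unchanged.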

Frequency warping in the DCT domain
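Here, the procedure of applying frequency warping is examined in the DCT domain to normalize acoustic features. The frequency warping in DCT was employed in speech recognition tasks for speaker normalization [13]. The same approach was utilized by Sheikhan et al. [15] for the normalization of MFCCs extracted from the emotional utterances of Anger and Happy in EASR. The approach is also adopted in the present work for the normalization of other features in different emotional states. Referring to Fig. 2, after the processing performed in the units of "Filterbank" and "Intermediate Operations," the DCT operation is applied to the input signal $L$ to compute the cepstral coefficients as:

$$c = C\,L \tag{3}$$

where $C$ is the DCT matrix with the components given as:

$$C_{km} = \alpha_k \cos\!\left(\frac{\pi (2m-1)\,k}{2M}\right), \quad 0 \le k \le N-1,\ 1 \le m \le M \tag{4}$$

Here, $M$ represents the number of filters in the filterbank, $N$ is the number of cepstral coefficients, and $\alpha_k$ is a factor calculated as:

$$\alpha_k = \begin{cases} \sqrt{1/M}, & k = 0 \\ \sqrt{2/M}, & k = 1, \ldots, N-1 \end{cases} \tag{5}$$

In the following, the linear frequency warping in the DCT domain is described for those features that have the DCT calculation in their extraction process [13].

Step 1: The signal $L$ is retrieved from the cepstral coefficients using the inverse DCT (IDCT) operator:

$$L = C^{-1} c \tag{6}$$

Here, we consider the unitary type-2 DCT matrix for which $C^{-1} = C^{T}$. With this assumption, $L$ can be written in the expanded form as:

$$L(m) = \sum_{k=0}^{N-1} c(k)\,\alpha_k \cos\!\left(\frac{\pi (2m-1)\,k}{2M}\right), \quad m = 1, \ldots, M \tag{7}$$

where $c(k)$ ($k = 0, 1, \ldots, N-1$) are the cepstral coefficients.

Step 2: Considering $\psi(u)$ as the warping function of the continuous variable $u$, the warped discrete output is obtained by:

$$\hat{L}(m) = L(\psi(u))\big|_{u=m} = \sum_{k=0}^{N-1} c(k)\,\alpha_k \cos\!\left(\frac{\pi \bigl(2\psi(m)-1\bigr)\,k}{2M}\right) \tag{8}$$

The warping function $\psi(u)$ is computed as:

$$\psi(u) = \frac{1}{2} + M\,\theta_p\!\left(\frac{2u-1}{2M}\right) \tag{9}$$

where $\theta_p(\lambda)$ is the normalized frequency warping function given as:

$$\theta_p(\lambda) = \begin{cases} p\,\lambda, & 0 \le \lambda \le \lambda_0 \\ p\,\lambda_0 + \dfrac{1 - p\,\lambda_0}{1 - \lambda_0}\,(\lambda - \lambda_0), & \lambda_0 < \lambda \le 1 \end{cases} \tag{10}$$

Here, $\lambda$ represents the normalized frequency, $\lambda_0$ is the normalized reference frequency specifying the range $(0, \lambda_0)$ in which frequency warping is performed, and $p$ is the warping factor that controls the amount of warping.

By rewriting Eq. (8) in vector form, we obtain:

$$\hat{L} = \hat{C}\,c \tag{11}$$

where $\hat{C}$ represents the warped IDCT matrix given as:

$$\hat{C}_{mk} = \alpha_k \cos\!\left(\frac{\pi \bigl(2\psi(m)-1\bigr)\,k}{2M}\right), \quad 1 \le m \le M,\ 0 \le k \le N-1 \tag{12}$$

By rearranging Eq. (9), the warped IDCT matrix can be written in terms of the normalized frequency warping function $\theta_p(\lambda)$:

$$\hat{C}_{mk} = \alpha_k \cos\!\left(\pi k\,\theta_p\!\left(\frac{2m-1}{2M}\right)\right) \tag{13}$$

Step 3: Finally, by putting the warped discrete output $\hat{L}$ in Eq. (3), the warped cepstral coefficients $\hat{c}$ are computed as:

$$\hat{c} = C\,\hat{L} = C\,\hat{C}\,c = T\,c \tag{14}$$

where the matrix $T$ ($T = C\,\hat{C}$) is a linear transformation that transforms the initial cepstral coefficients into the warped coefficients.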
In the present work, the above approach is applied to acoustical features to obtain the DCT-warped MFCC, M-MFCC, ExpoLog, GFCC, and PNCC. An example of the warping function employed in the DCT unit is depicted in Fig. 5.
Notably, the warping factors p used in the DCT warping for different emotions are calculated in the same manner as α obtained for the filterbank warping (refer to Eq. (2)).
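The three steps of the DCT-domain warping can be sketched in NumPy as below. This is an illustration rather than the authors' implementation: it assumes the unitary type-2 DCT with rows indexed by $k = 0, \ldots, N-1$ and columns by $m = 1, \ldots, M$, and a piecewise-linear $\theta_p(\lambda)$ that equals $p\lambda$ on $[0, \lambda_0]$ and then runs linearly to $\theta_p(1) = 1$, following the approach of [13]. All function names are hypothetical.

```python
import numpy as np

def dct_matrix(n_ceps, n_filters):
    """Unitary type-2 DCT matrix C of shape (n_ceps, n_filters)."""
    k = np.arange(n_ceps)[:, None]
    m = np.arange(1, n_filters + 1)[None, :]
    c = np.sqrt(2.0 / n_filters) * np.cos(np.pi * k * (2 * m - 1) / (2 * n_filters))
    c[0] = np.sqrt(1.0 / n_filters)  # the k = 0 row uses the factor sqrt(1 / M)
    return c

def theta(lam, p, lam0):
    """Piecewise-linear normalized warping: slope p up to lam0, then linear to (1, 1)."""
    lam = np.asarray(lam, dtype=float)
    upper = p * lam0 + (1.0 - p * lam0) / (1.0 - lam0) * (lam - lam0)
    return np.where(lam <= lam0, p * lam, upper)

def warped_idct_matrix(n_ceps, n_filters, p, lam0):
    """Warped IDCT matrix of shape (n_filters, n_ceps)."""
    m = np.arange(1, n_filters + 1)[:, None]
    k = np.arange(n_ceps)[None, :]
    alpha_k = np.full(n_ceps, np.sqrt(2.0 / n_filters))
    alpha_k[0] = np.sqrt(1.0 / n_filters)
    lam = (2 * m - 1) / (2.0 * n_filters)  # normalized filter positions in (0, 1)
    return alpha_k[None, :] * np.cos(np.pi * k * theta(lam, p, lam0))

def dct_warp_transform(n_ceps, n_filters, p, lam0):
    """Linear transform T = C @ C_hat mapping cepstra c to warped cepstra T @ c."""
    return dct_matrix(n_ceps, n_filters) @ warped_idct_matrix(n_ceps, n_filters, p, lam0)
```

A quick sanity check on this sketch is that `dct_warp_transform(13, 23, p=1.0, lam0=0.4)` is the 13 x 13 identity up to rounding, since no warping is applied when $p = 1$. Because $T$ acts directly on cepstral vectors, it can be applied to existing features without recomputing the filterbank outputs.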

Applying VTLN to acoustic features
In this section, based on the model for the extraction of various acoustical features, different VTLN warping methods are employed in the filterbank and/or DCT domain(s) to obtain warped (i.e., normalized) features which are finally fed into the Kaldi ASR system. To this aim, the filterbank warping is implemented by employing the warping function given in Eq. (2), whereas, in the DCT warping procedure, the steps given in the section "Frequency warping in the DCT domain" are adopted. The combined frequency warping is obtained by concatenating the respective frequency warping operations in both filterbank and DCT domains.
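The concatenation of the two warping operations can be pictured with the short sketch below. It assumes that a filterbank matrix built on warped frequencies, a DCT matrix, and a DCT-domain transform matrix are already available; the function and argument names are hypothetical, the flooring constant is an arbitrary choice, and feature-specific intermediate operations (which differ among MFCC, GFCC, PNCC, and the other features) are reduced here to a plain logarithm.

```python
import numpy as np

def combined_warped_cepstra(power_spectrum, warped_fbank, dct, transform):
    """Filterbank warping followed by DCT-domain warping.

    power_spectrum: (frames, bins) per-frame power spectra
    warped_fbank:   (n_filters, bins) filterbank built on warped frequencies
    dct:            (n_ceps, n_filters) DCT matrix C
    transform:      (n_ceps, n_ceps) DCT-domain warping matrix T
    """
    # Stage 1: the warped filterbank yields normalized filterbank energies.
    log_energies = np.log(power_spectrum @ warped_fbank.T + 1e-10)
    # Stage 2: the ordinary DCT gives cepstra, which T then warps.
    cepstra = log_energies @ dct.T
    return cepstra @ transform.T
```

Passing an identity matrix as `transform` reduces this to filterbank-only warping, and passing an unwarped filterbank reduces it to DCT-only warping, which makes the three configurations straightforward to compare.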

Experimental setup
To examine the effectiveness of the frequency warping for MFCC, M-MFCC, ExpoLog, GFCC, and PNCC in the speech recognition system, the performances of these features and their corresponding warped features are evaluated in the Kaldi baseline ASR system for different emotional states.
To this aim, the Persian ESD [17] and CREMA-D [18] datasets are used to train and test the GMM-HMM/DNN-HMM Kaldi speech recognizer.
The baseline system is trained using MFCC, M-MFCC, ExpoLog, GFCC, and PNCC extracted from the neutral utterances of the databases. The extracted features all have 13 dimensions, except for GFCC, which is 23-dimensional. The delta and delta-delta features are also calculated and appended to the extracted features to construct the complete acoustic feature vector. The training procedure of the Kaldi baseline consists of constructing appropriate lexicons, generating language models, and training the acoustic models of the corresponding databases. First, the lexicons of the Persian ESD and CREMA-D are generated. Then, the corresponding language models are produced according to the constructed lexicons. In the training of the acoustic models in the GMM-HMM-based system, 3-state monophone HMMs are used to model all phonemes in the datasets (30 in Persian ESD, 38 in CREMA-D), including silences and pauses. In contrast, the training of the acoustic models in the DNN-HMM-based system requires triphone HMMs. In this paper, the training of the DNN-HMM EASR system is performed based on Karel Vesely's method [30] in the Kaldi toolkit.

The performance of the proposed EASR system is assessed in three experiments. In the first experiment, the effectiveness of CMN [12] is inspected without employing the VTLN method in the EASR system. Here, the GMM-HMM-based system is first trained with/without employing the CMN technique in extracting features from neutral utterances. Then, speech recognition is performed based on the features extracted from both the emotional and neutral utterances of the corpora.
In the second experiment, the impact of employing different values of the normalized reference frequency, $\lambda_0$, is studied on the performance of the GMM-HMM Kaldi for different acoustic features. The optimal $\lambda_0$ is then chosen for the later frequency warping experiments.
In the last experiment, the advantage of using warped emotional speech features in the GMM-HMM/DNN-HMM speech recognition system is explored. Here, the simulations are conducted with different structures of the Kaldi speech recognizer. First, both the Persian ESD and CREMA-D datasets are used to train and test the GMM-HMM Kaldi system. Then, CREMA-D, with its sufficient number of utterances and speakers, is employed to evaluate the recognition performance of the warped features with the state-of-the-art DNN-HMM Kaldi recognizer. Considering the benefits of employing the CMN technique in the EASR system as observed in the first experiment, the CMN procedure is applied to all features to compensate for speaker variability in the Kaldi system. Here, the performances of warped features in the filterbank and/or DCT domain(s) are compared with those of unwarped features in the Kaldi baseline.
The evaluation experiments of the proposed EASR system are conducted for five emotional states, including Anger, Disgust, Fear, Happy, and Sad.
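The delta and delta-delta augmentation mentioned in the experimental setup can be sketched as follows. The paper does not state the regression window or the edge handling, so the window of 2 frames and the edge-replication padding used here are assumptions, and the function names are hypothetical. The sketch uses the common regression formula $d_t = \sum_{n=1}^{W} n\,(c_{t+n} - c_{t-n}) / (2 \sum_{n=1}^{W} n^2)$.

```python
import numpy as np

def delta(features, window=2):
    """Regression-based delta over +/- `window` frames.

    features: (frames, dims). Edges are padded by repeating the first and
    last frames (an assumed convention, not taken from the paper).
    """
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[0]
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out = np.zeros_like(features)
    for n in range(1, window + 1):
        ahead = padded[window + n : window + n + n_frames]   # frames t + n
        behind = padded[window - n : window - n + n_frames]  # frames t - n
        out += n * (ahead - behind)
    return out / denom

def add_deltas(features, window=2):
    """Append delta and delta-delta coefficients: (frames, d) -> (frames, 3 * d)."""
    d1 = delta(features, window)
    d2 = delta(d1, window)
    return np.hstack([np.asarray(features, dtype=float), d1, d2])
```

Applied to 13-dimensional static features, `add_deltas` returns 39-dimensional vectors; for the 23-dimensional GFCC, the result has 69 dimensions.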

Databases
The experimental evaluations are carried out using the Persian emotional speech database (Persian ESD) [17] and the crowd-sourced emotional multi-modal actors dataset (CREMA-D) [18]. Table 1 briefly summarizes the specifications of the Persian ESD and CREMA-D databases.
The Persian ESD is a script-fixed dataset that encompasses comprehensive emotional speech of standard Persian language containing a validated set of 90 sentences. These sentences were uttered in different emotional states (i.e., Anger, Disgust, Fear, Happy, and Sad) and neutral mode by two native Persian speakers (one male and one female). The recording of the database was accomplished in a professional recording studio in Berlin under the supervision of acoustic experts. As shown in Table 1, the Persian ESD comprises 472 speech utterances, each with a duration of 5 s on average, which are classified into five aforementioned basic emotional groups. The database was articulated in three situations: (1) congruent: emotional lexical content spoken in a congruent emotional voice (76 sentences by two speakers), (2) incongruent: neutral sentences spoken in an emotional voice (70 sentences by two speakers), and (3) baseline: all emotional and neutral sentences spoken in neutral voice (90 sentences by two speakers). In general, sentences with different emotions do not have the same lexical content. The validity of the database was evaluated by a group of 34 native speakers in a perception test. Utterances having a recognition rate of 71.4% or better were regarded as valid descriptions of the target emotions. The recordings are available at a sampling rate of 44.1 kHz and mono channel.
The CREMA-D is an audio-visual script-fixed dataset for the study of multi-modal expression and perception of basic acted emotions. As shown in Table 1, the dataset consists of a collection of 7442 original clips of 91 actors (48 males, 43 females, age 20-74) of various races and ethnicities with facial and vocal emotional expressions in sentences. The actors uttered 12 sentences with various emotional states (Anger, Disgust, Fear, Happy, Neutral, and Sad) and different emotion levels (Low, Medium, High, and Unspecified). Sentences with different emotions have the same lexical content. Using perceptual ratings from crowdsourcing, the database was submitted for validation to 2443 raters, who evaluated the categorical emotion labels and real-valued intensity values for the perceived emotion. Participants assessed the dataset in audio-only, visual-only, and audio-visual modalities with recognition rates of 40.9%, 58.2%, and 63.6%, respectively. In this paper, audio-only data with a sampling rate of 16 kHz are used.

Evaluation criterion
The performance of an ASR system for a particular task is often measured by comparing the hypothesized and test transcriptions. In this context, the word error rate (WER), expressed as a percentage and being the most widely used metric, is used to evaluate the recognition performance of the proposed EASR system. After the alignment of the two word sequences (i.e., hypothesis and test), WER is calculated as the ratio of the number of errors to the total number of words in the test utterances.
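The WER computation described above can be sketched with a standard edit-distance alignment, in which the errors are counted as substitutions, deletions, and insertions. This is a generic illustration with a hypothetical function name and whitespace tokenization; it is not the scoring script of the Kaldi toolkit.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: minimum word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

For the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", the alignment contains one substitution and one deletion, giving a WER of 2/6, or about 33.3%. Because insertions are counted as errors, WER can exceed 100%.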

Results and discussions

Effect of CMN on the recognition performance
In the first experiment, the effect of employing CMN in the EASR system is explored. For this purpose, the baseline ASR system trained with neutral utterances is tested with emotional speech inputs (i.e., unmatched scenario) with/without applying CMN to the extracted features. For comparisons, the performance of the baseline system is also evaluated in the case of neutrally trained/neutrally tested inputs (i.e., matched scenario). For the simulations of this scenario, the neutral utterances are split into two sets; 80% for training and the remaining 20% for testing. The experimental results are shown in Figs. 6 and 7 for the Persian ESD [17] and CREMA-D [18] datasets, respectively.
The outcomes of this experiment for both databases clearly reveal that using the CMN method in the unmatched scenario yields, on average, superior recognizer performance in terms of WER. This implies that CMN decreases the destructive effects of emotional speech. The evaluation results in the matched scenario show no considerable effect of CMN on the recognizer performance. Also, the recognition results in Figs. 6 and 7 show significantly different recognition performances for the emotional and neutral input utterances. In the neutrally trained/neutrally tested experiment, the WER values are small for the baseline system trained with different features. However, when the neutrally trained system is tested with emotional utterances, the WERs, in general, increase sharply. This indicates that emotion-affected speech represents a significant mismatch condition for ASR systems trained with neutral utterances. Furthermore, by comparing the average WER scores among all emotional states obtained for both databases, one observes that PNCC and GFCC have the best and worst performance, respectively, which identifies PNCC as the most robust feature in the EASR system.
Due to the benefits of employing CMN in decreasing WER, in the following experiments, we use the CMN technique in the construction of the features.
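For illustration, CMN is commonly realized as a per-utterance subtraction of the mean cepstral vector. The sketch below assumes this per-utterance form and a feature matrix of shape (frames, coefficients); the function name is hypothetical, and the exact CMN variant used in the experiments (e.g., per-utterance versus per-speaker statistics) is not specified in this section.

```python
import numpy as np


def cepstral_mean_normalization(features: np.ndarray) -> np.ndarray:
    """Subtract the per-coefficient mean computed over all frames of one
    utterance. `features` has shape (num_frames, num_coefficients).
    Assumption: per-utterance statistics, no variance normalization."""
    features = np.asarray(features, dtype=float)
    if features.ndim != 2 or features.shape[0] == 0:
        raise ValueError("expected a non-empty (frames, coefficients) matrix")
    utterance_mean = features.mean(axis=0, keepdims=True)
    return features - utterance_mean
```

After this step each cepstral coefficient has zero mean over the utterance, which removes constant offsets in the cepstral domain while leaving the frame-to-frame dynamics unchanged.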

Investigation of λ0 values in frequency warping
Here, the impact of the normalized reference frequency, λ0, on the recognition performance of different acoustic features is examined for warping in the DCT domain and in the combined filterbank and DCT domain. The results of this analysis for three values of λ0 are presented in Tables 2 and 3 for the Persian ESD and CREMA-D datasets, respectively. Among all features investigated, PNCC has the highest performance in the GMM-HMM EASR system due to its robustness against different emotional states. Also, changing the value of λ0 has no appreciable impact on the recognition results of this feature. Comparing the average WER values in Tables 2 and 3 shows that the best performance is achieved with λ0 = 0.4 for all acoustic features, except for ExpoLog in "DCT Warping" and for ExpoLog and GFCC in "Filterbank & DCT Warping". This value of λ0 is used in the following experiments to specify the range of frequency warping.
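To make the role of the reference frequency concrete, the sketch below implements a piecewise-linear warping of the normalized frequency axis, a form widely used in VTLN. It is an assumption for illustration only: the warping function actually used in the proposed system is defined earlier in the paper and is not reproduced in this section, and the names `piecewise_linear_warp` and `alpha` (the warping factor) are hypothetical.

```python
import numpy as np


def piecewise_linear_warp(lam, alpha: float, lam0: float) -> np.ndarray:
    """Warp normalized frequencies `lam` in [0, 1].

    Below the reference frequency `lam0` the axis is scaled by `alpha`;
    above it, a second linear segment brings the curve back to 1 at
    lam = 1, so the warped axis still spans [0, 1].
    """
    lam = np.asarray(lam, dtype=float)
    if not 0.0 < lam0 < 1.0:
        raise ValueError("lam0 must lie strictly between 0 and 1")
    if not 0.0 < alpha * lam0 < 1.0:
        raise ValueError("alpha * lam0 must lie strictly between 0 and 1")
    upper_slope = (1.0 - alpha * lam0) / (1.0 - lam0)
    lower = alpha * lam
    upper = alpha * lam0 + upper_slope * (lam - lam0)
    return np.where(lam <= lam0, lower, upper)
```

Under this assumed form, λ0 marks the boundary of the band that is scaled by the warping factor: with `lam0 = 0.4`, only the lowest 40% of the normalized axis is scaled directly, and the remaining band is compressed or stretched to keep the end point fixed.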

Frequency warping in the EASR system
The last experiment evaluates the efficiency of frequency warping in the filterbank and/or DCT domain(s) for the neutrally trained/emotionally tested GMM-HMM/DNN-HMM EASR system. Tables 4 and 5 present the performance scores of the GMM-HMM EASR system for the warped emotional features of MFCC, M-MFCC, ExpoLog, GFCC, and PNCC in the filterbank and/or DCT domain(s), as compared with those of the unwarped features, for the Persian ESD [17] and CREMA-D [18] datasets, respectively. Comparing the results of both tables shows, in general, that the WER values for the CREMA-D database are lower than the corresponding values for the Persian ESD dataset. This observation can be explained by the fact that the number of neutral utterances used to train the GMM-HMM EASR system is much higher in CREMA-D than in Persian ESD (1087 utterances vs. 180 utterances).
The evaluations presented in the tables can be interpreted from two perspectives: that of the applied warping methods, and that of the acoustic features used in the construction of the EASR system. Regarding the warping methods, the results of both tables show that all variants of frequency warping improve the recognition performance in terms of WER. In the case of PNCC, the WER values are close to each other for all warping methods, showing no advantage of any warping technique over the others. Also, comparing the average values of WER shows, in general, that (except for ExpoLog) DCT warping is more effective than both filterbank warping and the combined filterbank and DCT warping. Given the success of DCT warping in decreasing the destructive effect of emotion in the EASR system, this suggests that no further improvement is obtained by adding filterbank warping to the DCT normalization process.
The results given in the tables can also be interpreted in terms of the performance of the different acoustical features used in the implementation of the EASR system. Comparing the average WER scores obtained for the various warped features over all emotional states indicates that PNCC attains the lowest WER, whereas GFCC yields the highest. Accordingly, among the different acoustical features, the warped PNCC can be employed as a robust feature in the EASR system. This is consistent with the results obtained for PNCC in the first experiment concerning the benefits of applying CMN to the features. The high performance of PNCC is associated with the processing stages used in its implementation, including medium-time processing, a power-law nonlinearity, a noise suppression algorithm based on asymmetric filtering, and a module that accomplishes temporal masking [24]. In particular, the medium-time processing of PNCC uses longer analysis windows, which have been shown to provide better performance for noise modeling and/or environmental normalization. PNCC has been used successfully in emotion recognition [31] and emotional speech recognition [32] tasks.
To assess the significance of the warping methods as compared with "No Warping" for each feature and emotional state, the t test [33] is employed as a statistical analysis tool. The symbol * in Tables 4 and 5 indicates the significant cases (i.e., p value < 0.05). According to the results of the statistical analysis, statistically significant WER differences are observed for most of the warping methods applied to the corresponding emotional states, with the exception of Disgust.

DNN-HMM EASR system
Speech recognition systems employ HMMs to deal with the temporal variations of speech. Generally, such systems use GMMs to determine how well each state of an HMM fits a frame or a window of frames of coefficients representing the acoustic input. A feed-forward neural network is an alternative way to estimate the fit. This neural network takes several frames of coefficients and generates posterior probabilities over HMM states. Research on speech recognizers shows that the use of DNN in acoustic modeling outperforms the traditional GMM on a variety of databases [34,35]. This is partly due to the accurate estimation of the state-specific probabilities and the better discrimination of the class boundaries, which result in higher state-level classification performance of HMMs. In this section, the performance of frequency warping is examined with the state-of-the-art DNN-HMM Kaldi speech recognizer using CREMA-D as the database.

Table 2 The effect of modifying the normalized reference frequency, λ0, on the recognition performance of the proposed GMM-HMM EASR system (in terms of WER (%)) for Persian ESD. The values of WER are obtained by applying different warping methods to various acoustic features extracted from different emotional utterances

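The way a hybrid DNN-HMM acoustic model consumes "several frames of coefficients" and produces state scores can be sketched as below. This is a generic illustration, not the Kaldi recipe used in the experiments: the function names, the context width, and the edge padding are assumptions, and the division of posteriors by state priors is the standard hybrid-system convention rather than a detail stated in this section.

```python
import numpy as np


def splice_frames(features: np.ndarray, context: int) -> np.ndarray:
    """Stack each frame with `context` frames on either side.

    Input shape: (num_frames, dim). Output shape:
    (num_frames, (2 * context + 1) * dim). The first and last frames are
    repeated at the utterance edges (an assumed padding choice)."""
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    windows = [padded[k:k + num_frames] for k in range(2 * context + 1)]
    return np.hstack(windows)


def scaled_log_likelihoods(log_posteriors: np.ndarray,
                           log_priors: np.ndarray) -> np.ndarray:
    """Convert network outputs log p(state | frames) into the scaled
    log-likelihoods used in HMM decoding by subtracting log p(state)."""
    return np.asarray(log_posteriors) - np.asarray(log_priors)
```

The spliced matrix is what the network sees at its input layer; the second function shows why the network's posteriors can replace GMM likelihoods inside the HMM, since by Bayes' rule the posterior divided by the state prior is proportional to the likelihood of the acoustic input given the state.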
The results of the emotional speech recognition for the warped features of MFCC, M-MFCC, ExpoLog, GFCC, and PNCC in the filterbank and/or DCT domain(s) are given in Table 6, as compared with those of the unwarped features, for the CREMA-D dataset. The results show, in general, that all variants of the frequency warping methods increase the recognition performance of the EASR system. A comparison of the average recognition performance for the various warped features over all emotional states reveals that PNCC attains the lowest WER, whereas GFCC yields the highest. Hence, among the various acoustical features, the warped PNCC can be considered a robust feature in the EASR system. These findings are consistent with the results obtained in the GMM-HMM EASR experiments.
As in the GMM-HMM-based system, the t test [33] is used in this experiment to assess the significance of the various warping methods in comparison with "No Warping" for each feature and emotional state. The significant cases of the test (i.e., p value < 0.05) are marked with the symbol * in Table 6. Comparing the cases of "No Warping" for the different features and emotional states in Tables 5 and 6 shows that the DNN-HMM EASR system outperforms the GMM-HMM-based system in terms of the WER values. This is expected since, as explained above, DNN performs better than the traditional GMM in acoustic modeling for speech recognition systems. Also, comparing the two tables shows that the number of significant cases in Table 6 is lower than that in Table 5. This can again be explained by the better performance of the DNN-HMM Kaldi system relative to the GMM-HMM Kaldi system, which leaves less room for the warping methods to reduce the mismatch between the neutral training and emotional testing conditions. However, in contrast with the GMM-HMM Kaldi system, the DNN-HMM Kaldi speech recognizer requires a larger database and more computational time and complexity in the training/testing phases.

Table 3 The effect of modifying the normalized reference frequency, λ0, on the recognition performance of the proposed GMM-HMM EASR system (in terms of WER (%)) for CREMA-D.

Comparisons between different frequency warping methods
As a further evaluation process, a comparison has been performed between the results obtained by the proposed GMM-HMM/DNN-HMM EASR system with Persian ESD and CREMA-D as databases and those obtained by Sheikhan et al. [15]. In this context, it is noteworthy that the experiments conducted by Sheikhan et al. [15] were limited only to MFCC as the feature and Anger and Happy as the emotions. In contrast, our simulations consider more emotional states and acoustic features. Table 7 gives a summary of the performance comparisons between different warping methods for the specified features and emotions, where the symbol ">" is interpreted as "better" and " " means "not better."

Conclusion
In this paper, the improvement of the ASR system for emotional input utterances is investigated, where the mismatch between training and recognition conditions results in a significant reduction in the performance of the system. The main objective of the proposed EASR system is to mitigate the effects of emotional speech and to enhance the efficiency of the recognition system. For this purpose, the VTLN method is employed in the feature extraction stage to decrease the effects of emotion in the recognition process. This goal is achieved by applying the frequency warping in the filterbank and/or DCT domain(s). Accordingly, it is expected that the performance of the warped emotional features approaches that of the corresponding neutral features. The proposed system incorporates the Kaldi ASR as the back end which is trained with the different acoustical features (i.e., MFCC, M-MFCC, ExpoLog, GFCC, and PNCC) extracted from neutral utterances. The EASR system trained with neutral utterances is tested with emotional speech inputs in emotional states of Anger, Disgust, Fear, Happy, and Sad. The Persian emotional speech dataset (Persian ESD) and crowd-sourced emotional multimodal actors dataset (CREMA-D) are used for the simulations.
In the experiments, first, the effectiveness of the CMN method is investigated in the recognition performance of the emotional utterances for the neutrally trained/emotionally tested GMM-HMM ASR system. The results of this experiment show that employing this technique improves the recognition scores. Then, the influence of different values of the normalized reference frequency, λ0, on the performance of the GMM-HMM-based system is inspected. The results of this experiment lead to the selection of an optimal λ0 for the later experiments. To evaluate the performance of the proposed EASR system, the last experiment explores the advantage of using warped features in the GMM-HMM/DNN-HMM speech recognition system. It is observed, in general, that all variants of the frequency warping methods improve the recognition performance of both EASR systems in terms of WER. Also, the experimental results show that the DNN-HMM EASR system achieves higher performance than the GMM-HMM-based system in reducing the mismatch between the neutral training and emotional testing conditions. The higher performance of the DNN-HMM-based system is due to the use of DNN for acoustic modeling in the structure of Kaldi ASR. A comparison of the different warped features in both the GMM-HMM and DNN-HMM EASR systems confirms that the best WER score is attained for PNCC, whereas the worst score is obtained for GFCC. The high performance of PNCC can be attributed to the different processing stages in the PNCC extraction method, which make this feature robust against various emotional states.

Table 5 The recognition performance of the proposed GMM-HMM EASR system (in terms of WER (%)) for CREMA-D. The values of WER are obtained by applying different warping methods to various acoustic features extracted from different emotional utterances. The symbol * shows statistically significant cases (i.e., p value < 0.05).

The focus of this research is on the normalization of segmental or vocal tract-specific features. However, the speech signal consists of both segmental and supra-segmental (i.e., prosodic) information. It is known that prosodic features such as pitch and intonation can also be influenced by the emotional states of a speaker. As future work, new compensation methods can be devised to normalize such prosodic features together with the vocal tract-related features before feeding them to an ASR system. Furthermore, since emotional speech is generally produced in a real environment, this work can also be extended to operate in scenarios such as reverberant and noisy conditions.

Table 6 The performance of the proposed DNN-HMM EASR system (in terms of WER (%)) for CREMA-D by applying different warping methods to various acoustic features and emotional states. The average WER values are given in the last column. The symbol * shows statistically significant cases (i.e., p value < 0.05)

Table 7 The performance comparisons between different frequency warping methods used in the proposed GMM-HMM/DNN-HMM EASR system for the Persian ESD and CREMA-D datasets and the system of Sheikhan et al. [15] for various acoustic features and emotional states