Emotion-affected speech recognition system
The overall structure of the proposed EASR system is depicted in Fig. 1. Emotional utterances serve as input to the system and are converted into a sequence of acoustic features by the feature extraction unit. However, the recognition rate of an ASR system trained on neutral speech degrades when features of emotional speech are fed into it. This calls for normalizing the acoustic features before they are used by the speech recognizer. To this end, the VTLN technique is adopted in feature extraction to alleviate the effects of emotion on the speech recognition process. VTLN can be performed by frequency warping in the filterbank, in the DCT unit, or in both. After feature normalization, the features are passed to a speech recognizer as the back-end processing stage. In this work, the Kaldi speech recognizer [16], trained on neutral utterances of the Persian and English datasets [17, 18], is employed as the baseline ASR system.
Feature extraction
Feature extraction aims to find a set of feature vectors capable of capturing as much of the essential information in the input speech signal as possible. An ideal feature vector for emotional speech recognition should maximize the discriminability of speech classes (e.g., phonemes) while remaining unaffected by speaker-specific characteristics such as the shape and length of the vocal tract.
Little research has been conducted on the robustness and suitability of different acoustic features in the emotional speech recognition framework. However, studies in the field of automatic speech recognition reveal that the most notable acoustic features are MFCC [19], the modified mel-scale cepstral coefficient (M-MFCC) [20], the exponential logarithmic scale (ExpoLog) [20], the gammatone filterbank cepstral coefficient (GFCC) [21], the linear prediction cepstral coefficient (LPCC) [22], relative spectral perceptual linear prediction (RASTA-PLP) [23], and the power normalized cepstral coefficient (PNCC) [24].
M-MFCC and ExpoLog have been shown to perform better than MFCC for speech recognition under stress conditions [20]. Their extraction procedure is similar to that of MFCC but differs in the frequency scaling of the filterbank.
GFCC was introduced as a robust feature for speech recognition in a noisy environment [21]. The process of GFCC feature extraction is based on the gammatone filterbank, which is derived from psychophysical observations of the auditory periphery.
PNCC is an acoustic feature that provides notable results for the recognition of speech in noisy and reverberant environments [24]. Its extraction is inspired by human auditory processing.
In this paper, MFCC, M-MFCC, ExpoLog, GFCC, and PNCC are employed as auditory features in the EASR system.
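For reference, the following is a minimal sketch of a conventional (unwarped) MFCC front-end using librosa; the file name, frame and hop sizes, and filter/coefficient counts are illustrative assumptions rather than the settings used in this work, and the other features (M-MFCC, ExpoLog, GFCC, PNCC) would replace the mel filterbank and compression stages accordingly.

```python
import librosa

# Minimal sketch of a conventional (unwarped) MFCC front-end.
# File name and analysis settings are illustrative assumptions.
y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical input utterance
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,          # number of cepstral coefficients
    n_fft=400,          # 25 ms frames at 16 kHz
    hop_length=160,     # 10 ms frame shift
    n_mels=23,          # number of mel filters
)
print(mfcc.shape)       # (13, number_of_frames)
```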
Feature normalization
As pointed out earlier, the mismatch between training and recognition phases causes performance degradation in ASR systems [25]. One source of this mismatch is that different speakers have vocal tracts with different anatomical features. Previous studies have shown that the acoustic and articulatory characteristics of speech are affected by its emotional content. There is evidence that, when a typical speaker speaks emotionally, the positions of the tongue tip, jaw, and lips change, which in turn modifies acoustic features such as formant frequencies [26, 27]. This implies that vocal tract length variation can be considered a function of the emotional state of a person [28]. Consequently, recognizing emotional speech requires techniques that decrease the influence of vocal tract length variations arising from the speaker's emotional state. Among the different approaches that could be considered, the VTLN method is employed in this work to remedy the mismatch problem in speech recognition applications.
Methods of VTLN
The VTLN technique views the main difference between two speakers as a change in the spectral content of acoustic features caused by their different vocal tract lengths [29]. The idea of VTLN in speech recognition can be extended to the emotional speaking task, where the difference between emotional and neutral speech is associated with a variation of the frequency axis originating from the vocal tract length differences between emotional and neutral speaking styles.
To cope with the mismatch between an ASR system trained on neutral speech and tested on emotional speech, the VTLN technique provides a warping function by which the frequency axis of the emotional speech spectrum is transformed onto the frequency axis of the neutral speech spectrum. The normalization can be performed by linear or nonlinear frequency warping functions, such as piecewise linear or exponential functions [12]. These functions operate based on a warping parameter that compresses or expands the speech spectra as follows [29]:
$${S}_{\mathrm{neutral}}(f)={S}_{\mathrm{emotional}}\left({f}^{\prime}\left(\alpha, f\right)\right)$$
(1)
Here, f′ is the frequency warping operation applied to the frequency axis of the emotional speech spectrum using α as the warping factor.
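To make Eq. (1) concrete, the sketch below evaluates an emotional magnitude spectrum at warped frequencies by interpolation; a simple linear warp f′ = αf stands in for whichever warping function is chosen, and the spectrum values are synthetic.

```python
import numpy as np

def warp_spectrum(spectrum, freqs, alpha):
    """Evaluate the emotional spectrum at warped frequencies, as in Eq. (1).

    A simple linear warp f' = alpha * f is used here as a stand-in for
    the piecewise linear function of Eq. (2)."""
    warped_freqs = np.clip(alpha * freqs, freqs[0], freqs[-1])
    # S_neutral(f) = S_emotional(f'(alpha, f)), via linear interpolation
    return np.interp(warped_freqs, freqs, spectrum)

# Synthetic example: 257-bin magnitude spectrum on a 0-8 kHz axis
freqs = np.linspace(0, 8000, 257)
spectrum = np.abs(np.random.randn(257))
neutralized = warp_spectrum(spectrum, freqs, alpha=0.95)
```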
Most auditory-based acoustic features employ some form of frequency decomposition (e.g., using filterbanks) and decorrelation of spectral features (e.g., using DCT processing) in their computation. This means that frequency warping can be applied in the filterbank and/or DCT processing stage(s) of acoustic feature extraction. Figure 2 shows the general block diagram for applying frequency warping in the filterbank and/or DCT domain to compute the warped counterparts of MFCC [19], M-MFCC [20], ExpoLog [20], GFCC [21], and PNCC [24]. For comparison purposes, the dashed boxes represent the optional cases of no frequency warping, which are used to generate the conventional features in their unwarped form. Intermediate operations performed for each feature extraction method are also illustrated. The warping strategies are discussed in detail below.
Frequency warping in the filterbank domain
In this section, the procedure of applying frequency warping to normalize filterbank-based acoustic features is discussed. Frequency warping in the mel filterbank is a well-known technique that has been utilized in speech recognition tasks for speaker normalization [12]. Sheikhan et al. [15] also used this approach to normalize the MFCC feature for the emotional states of Anger and Happy in EASR. This strategy is adopted in the present work for the normalization of other filterbank-based features in different emotional states (see Fig. 2).
Based on this strategy, frequency warping is applied to the frequencies of a typical filterbank to change the positions of the frequency components. In this work, the distinction between the vocal tract lengths of emotional and neutral speech is modeled by a linear frequency warping function. The warping is performed by a piecewise linear function to preserve the bandwidth of the original signal. Motivated by the approach introduced in [12], the following warping function is proposed for the filterbank stage of acoustic feature extraction:
$${f}_{\mathrm{warped}}(n)=\begin{cases} f(n), & f\le {f}_{2l}\\ \alpha \left(f(n)-{f}_{2l}\right)+{f}_{2l}, & {f}_{2l}\le f\le {f}_{2h}\\ \dfrac{\left({f}_{3h}-{f}_{2l}\right)-\alpha \left({f}_{2h}-{f}_{2l}\right)}{{f}_{3h}-{f}_{2h}}\left(f(n)-{f}_{3h}\right)+{f}_{3h}, & {f}_{2h}\le f\le {f}_{3h}\\ f(n), & f\ge {f}_{3h}.\end{cases}$$
(2)
In this equation, f(n) denotes the frequency bins of the nth frame, fwarped(n) the corresponding warped frequencies, and α the warping factor that controls the amount of warping. Formant frequencies are used to determine the warping intervals: f2l and f2h represent the lowest and highest values of the second formant, respectively, and f3h denotes the highest value of the third formant. These values are obtained as averages over all second and third formants extracted from all sentences of a particular emotional state. The warping factor α for a specific emotional state is computed as the ratio of the average second formant of neutral utterances to that of emotional utterances. Frequency warping is performed in the range (f2l, f2h), and the linear transformation over the (f2h, f3h) gap compensates for the spectral changes caused by the frequency warping and returns the warping factor to 1. As an example, Fig. 3 shows the distribution of formant frequencies obtained from all utterances of the male speaker in the Persian ESD database [17] for the emotional state of Disgust, along with the resulting warping intervals and warping factor. Figure 4 illustrates the piecewise linear warping function obtained with Eq. (2) for a sample Disgust utterance from the database; the horizontal axis represents the unwarped (i.e., emotional) frequencies and the vertical axis the warped (i.e., neutral) frequencies.
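A minimal NumPy sketch of the piecewise linear warping of Eq. (2) is given below; the formant-derived interval bounds and warping factor are illustrative values in the spirit of Fig. 3, not values taken from the databases.

```python
import numpy as np

def piecewise_warp(f, alpha, f2l, f2h, f3h):
    """Piecewise linear frequency warping of Eq. (2).

    f        : array of frequency bins (Hz) of a frame
    alpha    : warping factor (neutral-to-emotional second-formant ratio)
    f2l, f2h : lowest/highest average second-formant frequencies
    f3h      : highest average third-formant frequency
    """
    f = np.asarray(f, dtype=float)
    warped = f.copy()                                   # f <= f2l and f >= f3h: unchanged
    mid = (f >= f2l) & (f <= f2h)
    warped[mid] = alpha * (f[mid] - f2l) + f2l          # warped band
    comp = (f > f2h) & (f < f3h)
    slope = ((f3h - f2l) - alpha * (f2h - f2l)) / (f3h - f2h)
    warped[comp] = slope * (f[comp] - f3h) + f3h        # compensation segment back to identity
    return warped

# Illustrative values only (see Fig. 3 for database-derived ones)
freqs = np.linspace(0, 8000, 257)
warped = piecewise_warp(freqs, alpha=0.92, f2l=1200.0, f2h=2200.0, f3h=3100.0)
```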
Frequency warping in the DCT domain
Here, the procedure of applying frequency warping in the DCT domain to normalize acoustic features is examined. Frequency warping in the DCT domain has been employed in speech recognition tasks for speaker normalization [13]. The same approach was utilized by Sheikhan et al. [15] to normalize MFCCs extracted from emotional utterances of Anger and Happy in EASR, and it is also adopted in the present work for the normalization of other features in different emotional states.
Referring to Fig. 2, after the processing performed in the units of “Filterbank” and “Intermediate Operations,” the DCT operation is applied to the input signal L to compute the cepstral coefficients as:
$$\mathbf{c}=\mathbf{C}.\mathbf{L},$$
(3)
where C is the DCT matrix with the components given as:
$${C}_{km}={\left[{\alpha}_k\cos\left(\frac{\pi \left(2m-1\right)k}{2M}\right)\right]}_{\substack{0\le k\le N-1\\ 1\le m\le M}}$$
(4)
Here, M represents the number of filters in the filterbank, N is the number of cepstral coefficients, and αk is a factor calculated as:
$${\alpha}_k=\begin{cases}\sqrt{\dfrac{1}{M}}, & k=0\\[1ex] \sqrt{\dfrac{2}{M}}, & k=1,2,\dots,N-1.\end{cases}$$
(5)
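The following sketch builds the DCT matrix of Eqs. (4)-(5) and applies it to a frame of filterbank outputs as in Eq. (3); the sizes M and N are illustrative choices, not the configuration used in this work.

```python
import numpy as np

def dct_matrix(N, M):
    """Unitary type-2 DCT matrix C of Eqs. (4)-(5), shape (N, M)."""
    k = np.arange(N)[:, None]            # rows: 0 <= k <= N-1
    m = np.arange(1, M + 1)[None, :]     # columns: 1 <= m <= M
    alpha = np.where(k == 0, np.sqrt(1.0 / M), np.sqrt(2.0 / M))
    return alpha * np.cos(np.pi * (2 * m - 1) * k / (2 * M))

M, N = 23, 13                            # illustrative: 23 filters, 13 cepstra
C = dct_matrix(N, M)
L = np.random.randn(M)                   # stand-in for a frame of filterbank outputs
c = C @ L                                # Eq. (3): c = C . L
```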
In the following, the linear frequency warping in the DCT domain is described for those features that have the DCT calculation in their extraction process [13].
Step 1: The signal L is retrieved from the cepstral coefficients using the inverse DCT (IDCT) operator:
$$\mathbf{L}={\mathbf{C}}^{-1}.\mathbf{c}.$$
(6)
Here, we consider the unitary type-2 DCT matrix, for which \(\mathbf{C}^{-1}={\mathbf{C}}^{\mathrm{T}}\). With this assumption, L can be written in expanded form as:
$$\mathbf{L}(m)=\sum \limits_{k=0}^{N-1}c(k)\,{\alpha}_k\cos \left(\frac{\pi \left(2m-1\right)k}{2M}\right),\quad m=1,2,\dots,M$$
(7)
where c(k) (k = 0, 1, …, N − 1) are the cepstral coefficients.
Step 2: Considering ψ(u) as the warping function of the continuous variable u, the warped discrete output is obtained by:
$$\hat{\mathbf{L}}(m)=\mathbf{L}\left(\psi (u)\right)\Big|_{u=m}=\sum \limits_{k=0}^{N-1}c(k)\,{\alpha}_k\cos \left(\frac{\pi \left(2\psi (m)-1\right)k}{2M}\right),\quad m=1,2,\dots,M.$$
(8)
The warping function ψ(u) is computed as:
$$\psi (u)=\frac{1}{2}+M.{\theta}_p\left(\frac{u-1/2}{M}\right),$$
(9)
where θp(λ) is the normalized frequency warping function given as:
$${\theta}_p\left(\lambda \right)=\begin{cases} p\lambda, & 0\le \lambda \le {\lambda}_0\\ p{\lambda}_0+\left(\dfrac{1-p{\lambda}_0}{1-{\lambda}_0}\right)\left(\lambda -{\lambda}_0\right), & {\lambda}_0\le \lambda \le 1.\end{cases}$$
(10)
Here, λ represents the normalized frequency, λ0 is the normalized reference frequency specifying the range (0, λ0) in which frequency warping is performed, and p is the warping factor that controls the amount of warping.
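A sketch of the normalized warping function θp of Eq. (10) and the corresponding ψ(u) of Eq. (9) follows; the values of λ0, p, and M are illustrative assumptions.

```python
import numpy as np

def theta_p(lam, p, lam0):
    """Normalized frequency warping function of Eq. (10)."""
    lam = np.asarray(lam, dtype=float)
    lower = p * lam                                                   # 0 <= lambda <= lambda_0
    upper = p * lam0 + (1 - p * lam0) / (1 - lam0) * (lam - lam0)     # lambda_0 <= lambda <= 1
    return np.where(lam <= lam0, lower, upper)

def psi(u, M, p, lam0):
    """Warping function psi(u) of Eq. (9) used in the DCT-domain warping."""
    return 0.5 + M * theta_p((u - 0.5) / M, p, lam0)

# Illustrative parameters: 23 filters, mild warping up to lambda_0 = 0.7
M = 23
u = np.arange(1, M + 1)
print(psi(u, M, p=0.95, lam0=0.7))
```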
By rewriting Eq. (8) in vector form, we obtain:
$$\hat{\mathbf{L}}=\tilde{\mathbf{C}}.\mathbf{c},$$
(11)
where \(\tilde{\mathbf{C}}\) represents the warped IDCT matrix, given as:
$${\tilde{C}}_{m,k}={\left[{\alpha}_k\cos \left(\frac{\pi \left(2\psi (m)-1\right)k}{2M}\right)\right]}_{\substack{1\le m\le M\\ 0\le k\le N-1}}$$
(12)
By rearranging Eq. (9), the warped IDCT matrix can be written in terms of the normalized frequency warping function θp(λ):
$$\frac{2\psi (u)-1}{2M}={\theta}_p\left(\frac{2u-1}{2M}\right),$$
(13)
$${\tilde{C}}_{m,k}={\left[{\alpha}_k\cos \left(\pi k{\theta}_p\left(\frac{2m-1}{2M}\right)\right)\right]}_{\substack{1\le m\le M\\ 0\le k\le N-1}}$$
(14)
Step 3: Finally, by substituting the warped discrete output \(\hat{\mathbf{L}}\) into Eq. (3), the warped cepstral coefficients \(\hat{\mathbf{c}}\) are computed as:
$$\hat{\mathbf{c}}=\mathbf{C}.\hat{\mathbf{L}}=\left(\mathbf{C}.\tilde{\mathbf{C}}\right)\mathbf{c}=\mathbf{T}.\mathbf{c},$$
(15)
where the matrix \(\mathbf{T}=\mathbf{C}.\tilde{\mathbf{C}}\) is a linear transformation that maps the initial cepstral coefficients to the warped coefficients.
In the present work, the above approach is applied to acoustical features to obtain the DCT-warped MFCC, M-MFCC, ExpoLog, GFCC, and PNCC. An example of the warping function employed in the DCT unit is depicted in Fig. 5.
Notably, the warping factors p used in the DCT warping for different emotions are calculated in the same manner as α obtained for the filterbank warping (refer to Eq. (2)).
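Putting the pieces together, the sketch below forms the warped IDCT matrix C̃ of Eq. (14), the transformation T = C·C̃ of Eq. (15), and applies it to a cepstral vector. The helper definitions repeat the earlier sketches so the block is self-contained, and the sizes and warping parameters are illustrative assumptions.

```python
import numpy as np

def dct_matrix(N, M):
    """Unitary type-2 DCT matrix C, Eqs. (4)-(5), shape (N, M)."""
    k = np.arange(N)[:, None]
    m = np.arange(1, M + 1)[None, :]
    alpha = np.where(k == 0, np.sqrt(1.0 / M), np.sqrt(2.0 / M))
    return alpha * np.cos(np.pi * (2 * m - 1) * k / (2 * M))

def theta_p(lam, p, lam0):
    """Normalized frequency warping function, Eq. (10)."""
    lam = np.asarray(lam, dtype=float)
    return np.where(lam <= lam0, p * lam,
                    p * lam0 + (1 - p * lam0) / (1 - lam0) * (lam - lam0))

def warped_idct_matrix(N, M, p, lam0):
    """Warped IDCT matrix C~ of Eq. (14), shape (M, N)."""
    m = np.arange(1, M + 1)[:, None]                 # rows: 1 <= m <= M
    k = np.arange(N)[None, :]                        # columns: 0 <= k <= N-1
    alpha = np.where(k == 0, np.sqrt(1.0 / M), np.sqrt(2.0 / M))
    return alpha * np.cos(np.pi * k * theta_p((2 * m - 1) / (2 * M), p, lam0))

# Illustrative sizes and warping parameters
M, N = 23, 13
C = dct_matrix(N, M)                                  # (N, M)
C_tilde = warped_idct_matrix(N, M, p=0.95, lam0=0.7)  # (M, N)
T = C @ C_tilde                                       # Eq. (15): cepstral transformation matrix
c = np.random.randn(N)                                # stand-in cepstral coefficients
c_warped = T @ c                                      # warped cepstral coefficients
```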
Applying VTLN to acoustic features
In this section, based on the extraction model of the various acoustic features, the different VTLN warping methods are employed in the filterbank and/or DCT domain(s) to obtain warped (i.e., normalized) features, which are finally fed into the Kaldi ASR system. To this end, the filterbank warping is implemented using the warping function given in Eq. (2), whereas the DCT warping follows the steps given in the "Frequency warping in the DCT domain" section. The combined frequency warping is obtained by cascading the respective frequency warping operations in the filterbank and DCT domains.