
Table 5. Recognition performance of the proposed GMM-HMM EASR system, in terms of WER (%), on CREMA-D. The WER values are obtained by applying different warping methods to various acoustic features extracted from utterances in different emotional states. An asterisk (*) marks statistically significant cases (i.e., p-value < 0.05).

From: Feature compensation based on the normalization of vocal tract length for the improvement of emotion-affected speech recognition

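Emotional states: Anger, Disgust, Fear, Happy, Sad. All values are WER (%).

| Feature type | Warping type | Anger | Disgust | Fear | Happy | Sad | Average WER |
|---|---|---|---|---|---|---|---|
| MFCC | No Warping | 8.11 | 5.54 | 11.20 | 6.39 | 6.50 | 7.55 |
| MFCC | Filterbank Warping | 7.80 | 5.09 | 9.69* | 5.40* | 6.06 | 6.81 |
| MFCC | DCT Warping | 7.78* | 5.47 | 9.20* | 4.92* | 5.71* | 6.62 |
| MFCC | Filterbank & DCT Warping | 7.78* | 5.44 | 9.25* | 4.67* | 5.65* | 6.56 |
| M-MFCC | No Warping | 8.11 | 6.53 | 10.33 | 6.42 | 6.97 | 7.67 |
| M-MFCC | Filterbank Warping | 7.38* | 6.26 | 9.32* | 6.06 | 6.77 | 7.16 |
| M-MFCC | DCT Warping | 7.28 | 6.35 | 9.50* | 5.79 | 6.09* | 7.00 |
| M-MFCC | Filterbank & DCT Warping | 7.42* | 6.16 | 9.32* | 5.90 | 6.50 | 7.06 |
| ExpoLog | No Warping | 8.25 | 7.25 | 12.32 | 7.88 | 9.63 | 9.07 |
| ExpoLog | Filterbank Warping | 7.17* | 6.63 | 11.22* | 6.49* | 8.75 | 8.05 |
| ExpoLog | DCT Warping | 7.47 | 6.87 | 11.48* | 6.41* | 8.73 | 8.19 |
| ExpoLog | Filterbank & DCT Warping | 6.97* | 7.00 | 10.56* | 6.11* | 8.52* | 7.83 |
| GFCC | No Warping | 55.73 | 23.07 | 44.39 | 39.48 | 27.97 | 38.13 |
| GFCC | Filterbank Warping | 54.95 | 22.33 | 43.28* | 38.36* | 26.89* | 37.16 |
| GFCC | DCT Warping | 55.33 | 22.68 | 43.67* | 35.85* | 27.97* | 37.10 |
| GFCC | Filterbank & DCT Warping | 55.99 | 23.02 | 44.31 | 36.82* | 29.32 | 37.89 |
| PNCC | No Warping | 4.49 | 1.96 | 6.25 | 2.17 | 2.26 | 3.43 |
| PNCC | Filterbank Warping | 3.97* | 1.83 | 5.96 | 1.71* | 1.99* | 3.09 |
| PNCC | DCT Warping | 4.52 | 2.03 | 6.75 | 3.08 | 2.37 | 3.75 |
| PNCC | Filterbank & DCT Warping | 4.44 | 1.98 | 6.21 | 2.56 | 2.09 | 3.46 |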
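
The Average WER column appears to be the unweighted mean of the five per-emotion WERs; the reported values are consistent with that reading, although the table itself does not state how the average was formed. The short Python sketch below checks this for the MFCC rows and derives the relative WER reduction of each warping method against the No Warping baseline. The helper names `average_wer` and `relative_reduction` are illustrative, and the relative reduction is a derived quantity, not one reported in the table.

```python
# WER (%) per emotion in the order Anger, Disgust, Fear, Happy, Sad,
# copied from the MFCC rows of Table 5 (significance markers dropped).
mfcc = {
    "No Warping": [8.11, 5.54, 11.20, 6.39, 6.50],
    "Filterbank Warping": [7.80, 5.09, 9.69, 5.40, 6.06],
    "DCT Warping": [7.78, 5.47, 9.20, 4.92, 5.71],
    "Filterbank & DCT Warping": [7.78, 5.44, 9.25, 4.67, 5.65],
}


def average_wer(per_emotion):
    """Unweighted mean over emotions (assumed definition of 'Average WER')."""
    return sum(per_emotion) / len(per_emotion)


def relative_reduction(baseline, warped):
    """Relative WER reduction in percent; positive means the warped system is better."""
    return 100.0 * (baseline - warped) / baseline


baseline = average_wer(mfcc["No Warping"])
for name, wers in mfcc.items():
    avg = average_wer(wers)
    print(f"{name}: average WER = {avg:.2f}, "
          f"relative reduction = {relative_reduction(baseline, avg):.1f}%")
```

The same two helpers can be applied to the other feature types by swapping in their rows. Note that an unweighted mean treats every emotion equally regardless of how many test utterances each one has.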