Table 4 Performance comparisons to SER models trained using traditional audio features; best performances are in bold

From: End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network

 

| Model (input) | RECOLA CCC_A | RECOLA CCC_V | RECOLA CCC Mean | IEMOCAP WA | IEMOCAP UA |
|---|---|---|---|---|---|
| Sahu et al. [47] (GeMAPS) | - | - | - | 56.85% | - |
| Jiang et al. [48] (GeMAPS) | - | - | - | - | 41% |
| Jiang et al. [48] (MFCCs) | - | - | - | - | 35% |
| Valstar et al. [44] (GeMAPS) | 0.683 | 0.375 | 0.529 | - | - |
| Ringeval et al. [3] (low-level descriptors) | 0.757 | 0.26 | 0.509 | - | - |
| CNN-LSTM Tairakis et al. [11] (raw samples) | 0.681 | 0.5 | 0.591 | 58.6% | 52.6% |
| CNN-LSTM Tairakis et al. [11] (log mel-spectrogram) | 0.705 | 0.473 | 0.589 | 64.7% | 56.6% |
| Proposed DiCCOSER-CS (raw samples) | 0.746 | 0.506 | 0.626 | 64.1% | 53.6% |
| Proposed DiCCOSER-CS (log mel-spectrogram) | 0.751 | 0.498 | 0.6245 | 65.8% | 56.7% |
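
The table does not define its metrics, so the following is only a minimal sketch of how such numbers are commonly computed, under stated assumptions: CCC is taken to be Lin's concordance correlation coefficient, CCC Mean the arithmetic mean of CCC_A (arousal) and CCC_V (valence), WA the overall accuracy, and UA the mean per-class recall. The function names `ccc`, `weighted_accuracy`, and `unweighted_accuracy` are hypothetical and not taken from the article. The final loop checks that the reported CCC Mean values agree with the averaging assumption up to rounding.

```python
import numpy as np


def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient (assumed definition of CCC)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    # Population covariance and variances (ddof=0) on both sides.
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2.0 * cov / (y_true.var() + y_pred.var() + (mean_t - mean_p) ** 2)


def weighted_accuracy(labels, preds):
    """WA, assumed here to be overall accuracy across all utterances."""
    labels, preds = np.asarray(labels), np.asarray(preds)
    return float(np.mean(labels == preds))


def unweighted_accuracy(labels, preds):
    """UA, assumed here to be the mean of per-class recalls."""
    labels, preds = np.asarray(labels), np.asarray(preds)
    recalls = [np.mean(preds[labels == c] == c) for c in np.unique(labels)]
    return float(np.mean(recalls))


# (CCC_A, CCC_V, reported CCC Mean) copied from the table rows above.
reported = {
    "Valstar et al. [44] (GeMAPS)": (0.683, 0.375, 0.529),
    "Ringeval et al. [3]": (0.757, 0.26, 0.509),
    "CNN-LSTM [11] (raw samples)": (0.681, 0.5, 0.591),
    "CNN-LSTM [11] (log mel-spectrogram)": (0.705, 0.473, 0.589),
    "DiCCOSER-CS (raw samples)": (0.746, 0.506, 0.626),
    "DiCCOSER-CS (log mel-spectrogram)": (0.751, 0.498, 0.6245),
}

# True for a row when the reported mean matches (CCC_A + CCC_V) / 2
# within three-decimal rounding.
mean_is_consistent = {
    name: abs((a + v) / 2.0 - m) < 1e-3 for name, (a, v, m) in reported.items()
}

for name, ok in mean_is_consistent.items():
    print(f"{name}: CCC Mean consistent with averaging = {ok}")
```

Under these assumptions, WA and UA diverge when classes are imbalanced: a classifier that always predicts the majority class can score a high WA while its UA stays low, which is one reason both columns are usually reported for IEMOCAP.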