
Table 4 Performance comparisons to SER models trained using traditional audio features; best performances are in bold

From: End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network

| Model (input features) | RECOLA CCC_A | RECOLA CCC_V | RECOLA CCC Mean | IEMOCAP WA | IEMOCAP UA |
|---|---|---|---|---|---|
| Sahu et al. [47] (GeMAPS) | - | - | - | 56.85% | - |
| Jiang et al. [48] (GeMAPS) | - | - | - | - | 41% |
| Jiang et al. [48] (MFCCs) | - | - | - | - | 35% |
| Valstar et al. [44] (GeMAPS) | 0.683 | 0.375 | 0.529 | - | - |
| Ringeval et al. [3] (low-level descriptors) | 0.757 | 0.26 | 0.509 | - | - |
| CNN-LSTM Tairakis et al. [11] (raw samples) | 0.681 | 0.5 | 0.591 | 58.6% | 52.6% |
| CNN-LSTM Tairakis et al. [11] (log mel-spectrogram) | 0.705 | 0.473 | 0.589 | 64.7% | 56.6% |
| Proposed DiCCOSER-CS (raw samples) | 0.746 | 0.506 | 0.626 | 64.1% | 53.6% |
| Proposed DiCCOSER-CS (log mel-spectrogram) | 0.751 | 0.498 | 0.6245 | 65.8% | 56.7% |