EURASIP Journal on Audio, Speech, and Music Processing

Table 4 Performance comparisons to SER models trained using traditional audio features; best performances are in bold

From: End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network

	RECOLA			IEMOCAP
	CCC_A	CCC_V	CCC Mean	WA	UA
Sahu et al. [47] (GeMAPS)	-	-	-	56.85%	-
Jiang et al. [48] (GeMAPS)	-	-	-	-	41%
Jiang et al. [48] (MFCCs)	-	-	-	-	35%
Valstar et al. [44] (GeMAPS)	0.683	0.375	0.529	-	-
Ringeval et al. [3] (low-level descriptors)	0.757	0.26	0.509	-	-
CNN-LSTM Tairakis et al. [11] (raw samples)	0.681	0.5	0.591	58.6%	52.6%
CNN-LSTM Tairakis et al. [11] (log mel-spectrogram)	0.705	0.473	0.589	64.7%	56.6%
Proposed DiCCOSER-CS (raw samples)	0.746	0.506	0.626	64.1%	53.6%
Proposed DiCCOSER-CS (log mel-spectrogram)	0.751	0.498	0.6245	65.8%	56.7%
