AUC optimization for deep learning-based voice activity detection

EURASIP Journal on Audio, Speech, and Music Processing

Table 2 AUC results of the compared VADs, each built on a feedforward neural network with the STFT acoustic feature, on the English Noisy-CHiME-4 test dataset. For brevity, each VAD is denoted by the name of its training objective

Noise type	SNR	MCE	MMSE	MaxAUC_sigm	MaxAUC_hinge
Babble	−10 dB	0.5319	0.5381	0.5631	0.5561
	−5 dB	0.6006	0.6097	0.6450	0.6359
	0 dB	0.7092	0.7109	0.7431	0.7363
	5 dB	0.8036	0.8046	0.8226	0.8187
	10 dB	0.8652	0.8673	0.8762	0.8726
	15 dB	0.9028	0.9021	0.9071	0.9044
	20 dB	0.9208	0.9191	0.9214	0.9204
Factory	−10 dB	0.6321	0.6303	0.6399	0.6400
	−5 dB	0.7275	0.7260	0.7314	0.7341
	0 dB	0.8078	0.8072	0.8071	0.8114
	5 dB	0.8616	0.8611	0.8587	0.8628
	10 dB	0.8967	0.8955	0.8936	0.8968
	15 dB	0.9162	0.9139	0.9132	0.9151
	20 dB	0.9263	0.9235	0.9236	0.9247
Volvo	−10 dB	0.8910	0.8793	0.9002	0.8968
	−5 dB	0.9109	0.9042	0.9136	0.9132
	0 dB	0.9217	0.9177	0.9214	0.9218
	5 dB	0.9276	0.9242	0.9260	0.9260
	10 dB	0.9311	0.9275	0.9285	0.9280
	15 dB	0.9329	0.9292	0.9299	0.9292
	20 dB	0.9338	0.9302	0.9306	0.9301
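
For readers who want to relate the AUC values in the table to frame-level VAD outputs, the following is a minimal sketch of one standard way to compute AUC: the fraction of (speech, non-speech) frame pairs in which the speech frame receives the higher score, with ties counted as one half. It is an illustration only, not the evaluation code used to produce the table; the function name `auc_pairwise` and the toy scores are hypothetical.

```python
import numpy as np


def auc_pairwise(scores, labels):
    """AUC as the fraction of correctly ranked (speech, non-speech) pairs.

    scores: per-frame soft outputs of a VAD (higher means more speech-like).
    labels: per-frame ground truth, 1 for speech and 0 for non-speech.
    Ties between a speech and a non-speech score count as 0.5.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]    # scores of speech frames
    neg = scores[~labels]   # scores of non-speech frames
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one frame of each class")
    # All pairwise differences; this is O(P * N) memory, fine for a sketch.
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


# Toy example (hypothetical scores): three of the four pairs are ranked correctly.
toy_auc = auc_pairwise([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
print(toy_auc)  # 0.75
```

The pairwise form is quadratic in the number of frames, so for full test sets a rank-based or ROC-curve-based routine would normally be preferred; the pairwise version is shown here because it makes the ranking interpretation of AUC explicit.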