Exploiting spectro-temporal locality in deep learning based acoustic event detection

EURASIP Journal on Audio, Speech, and Music Processing

Table 1 Experimental setup parameters

Deep neural networks settings
FFT resolutions (multi-resolution)	10 ms (129 bins), 20 ms (257 bins), 30 ms (257 bins), 40 ms (513 bins), 50 ms (513 bins), 60 ms (513 bins)
Patch lengths	10, 20, and 30 frames
Convolutional neural networks settings
Filter shapes (CNN)	5 × 5, 7 × 7, 9 × 9 (bins × frames)
Number of filters (CNN)	10, 20, and 40 filters
Pooling (CNN)	1 × 1 (no pooling)
	2 × 1 (frequency pooling)
	1 × 2 (time pooling)
	2 × 2 (both axes)
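
The bin counts in Table 1 are consistent with a one-sided spectrum computed from a window that is zero-padded to the next power of two. The sketch below reproduces them under an assumed 16 kHz sampling rate, which the table does not state, and counts the CNN configurations that would result if every combination of filter shape, filter count, and pooling were tried. The names `fft_bins`, `SAMPLE_RATE_HZ`, and `cnn_grid` are illustrative and not taken from the paper, and the table does not say whether the full grid was actually evaluated.

```python
from itertools import product

# Assumption: 16 kHz audio. The table does not state the sampling rate;
# this value is chosen because it reproduces the listed bin counts.
SAMPLE_RATE_HZ = 16_000


def fft_bins(window_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Bins of a one-sided spectrum for a window padded to the next power of two."""
    window_samples = window_ms * sample_rate_hz // 1000
    # Smallest power of two that is >= window_samples.
    n_fft = 1 << (window_samples - 1).bit_length()
    return n_fft // 2 + 1


# Window lengths from the "FFT resolutions" row of Table 1.
window_lengths_ms = [10, 20, 30, 40, 50, 60]
bins_per_resolution = {ms: fft_bins(ms) for ms in window_lengths_ms}

# CNN settings from Table 1; shapes are (bins, frames).
filter_shapes = [(5, 5), (7, 7), (9, 9)]
filter_counts = [10, 20, 40]
pooling_shapes = [(1, 1), (2, 1), (1, 2), (2, 2)]

# Hypothetical exhaustive grid over the three CNN settings.
cnn_grid = list(product(filter_shapes, filter_counts, pooling_shapes))

if __name__ == "__main__":
    for ms, bins in bins_per_resolution.items():
        print(f"{ms} ms -> {bins} bins")
    print(f"{len(cnn_grid)} CNN configurations")
```

Under these assumptions, the 20 ms and 30 ms windows share a 512-point FFT and the 40 ms to 60 ms windows share a 1024-point FFT, which would explain the repeated bin counts in the table.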