From: Exploiting spectro-temporal locality in deep learning based acoustic event detection
Deep neural networks settings | |
---|---|
FFT resolutions (multi-resolution) | 10 ms (129 bins), 20 ms (257 bins), 30 ms (257 bins), 40 ms (513 bins), 50 ms (513 bins), 60 ms (513 bins) |
Patch lengths | 10, 20, and 30 frames |
Convolutional neural networks settings | |
Filter shapes (CNN) | 5×5, 7×7, 9×9 (bins × frames) |
Number of filters (CNN) | 10, 20, and 40 filters |
Pooling (CNN) | 1×1 (no pooling), 2×1 (frequency pooling), 1×2 (time pooling), 2×2 (both axes) |
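
The settings in the table can be read as a search grid. The sketch below writes them out as plain Python lists and checks the listed bin counts. It rests on two assumptions that the table does not state: a 16 kHz sampling rate, and an FFT size rounded up to the next power of two of the window length. Under those assumptions the computed one-sided spectrum sizes match the bin counts in the table. All names (`SAMPLE_RATE_HZ`, `fft_bins`, `cnn_grid`) are illustrative and not taken from the article.

```python
from itertools import product

# Assumption: the sampling rate is not given in the table; 16 kHz
# reproduces the listed bin counts.
SAMPLE_RATE_HZ = 16_000

# Values copied from the table.
WINDOW_MS = [10, 20, 30, 40, 50, 60]
PATCH_FRAMES = [10, 20, 30]
FILTER_SHAPES = [(5, 5), (7, 7), (9, 9)]    # (bins, frames)
NUM_FILTERS = [10, 20, 40]
POOLING = [(1, 1), (2, 1), (1, 2), (2, 2)]  # (frequency, time)


def fft_bins(window_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """One-sided spectrum size for a window zero-padded to the next power of two.

    Assumption: the power-of-two rounding is inferred, not stated in the table.
    """
    window_samples = sample_rate_hz * window_ms // 1000
    n_fft = 1 << (window_samples - 1).bit_length()  # next power of two
    return n_fft // 2 + 1


# Bin count per window length, e.g. 10 ms -> 160 samples -> 256-point FFT.
bins_per_resolution = {ms: fft_bins(ms) for ms in WINDOW_MS}

# Every combination of filter shape, filter count, and pooling scheme.
cnn_grid = list(product(FILTER_SHAPES, NUM_FILTERS, POOLING))

if __name__ == "__main__":
    for ms, bins in bins_per_resolution.items():
        print(f"{ms} ms window: {bins} bins")
    print(f"{len(cnn_grid)} CNN configurations")
```

Enumerating the CNN rows this way gives 3 filter shapes × 3 filter counts × 4 pooling schemes. Whether the article evaluates every combination, or crosses them with the patch lengths and FFT resolutions, is not something the table alone settles.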