Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset

EURASIP Journal on Audio, Speech, and Music Processing

Table 10 Top 10 negative distractor events for speech (event labels associated with false-negative decisions of the network for the “speech” class)

Event	Event ID	Ratio	\(d^{-}_{sp}\)
Whispering	/m/02rtxlg	24/30	0.301
Male singing	/t/dd00003	22/52	0.216
Musical instrument	/m/04szw	66/293	0.193
Female singing	/t/dd00004	19/50	0.191
Singing	/m/015lz1	17/45	0.179
Violin, fiddle	/m/07y_7	13/23	0.179
Music	/m/04rlf	810/5636	0.143
Disco	/m/026z9	10/23	0.137
Bass guitar	/m/018vs	10/23	0.137
Guitar	/m/0342h	34/204	0.134

The \(d^{-}_{sp}\) score (Eq. 12) is used to rank the events. The Ratio column gives the number of speech false negatives in which the distractor event label is present (numerator) over the number of speech segments that contain the distractor event (denominator).
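To make the Ratio column easier to read, the following minimal sketch converts each ratio into a per-event miss rate (false negatives divided by speech segments containing the event). The names `ROWS`, `miss_rate`, and `rates` are illustrative assumptions, not part of the paper, and this miss rate is not the \(d^{-}_{sp}\) score: Eq. 12 is not reproduced in this section, so the reported scores are only copied from the table for comparison.

```python
# Rows copied from Table 10: (event, ratio string, reported d score).
ROWS = [
    ("Whispering", "24/30", 0.301),
    ("Male singing", "22/52", 0.216),
    ("Musical instrument", "66/293", 0.193),
    ("Female singing", "19/50", 0.191),
    ("Singing", "17/45", 0.179),
    ("Violin, fiddle", "13/23", 0.179),
    ("Music", "810/5636", 0.143),
    ("Disco", "10/23", 0.137),
    ("Bass guitar", "10/23", 0.137),
    ("Guitar", "34/204", 0.134),
]


def miss_rate(ratio: str) -> float:
    """Return false negatives / speech segments for a 'num/den' string."""
    false_negatives, segments = (int(part) for part in ratio.split("/"))
    return false_negatives / segments


# Hypothetical helper mapping: event name -> miss rate.
rates = {event: miss_rate(ratio) for event, ratio, _ in ROWS}

# Reported scores, in table order, for a side-by-side view.
d_scores = [d for _, _, d in ROWS]

for event, ratio, d in ROWS:
    print(f"{event:<20} {ratio:>9}  miss rate {rates[event]:.3f}  d {d:.3f}")
```

Sorting by this raw miss rate would order the events differently from the table (for example, "Violin, fiddle" at 13/23 would move up to second place), which suggests that the score in Eq. 12 depends on more than this ratio alone.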