Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset

EURASIP Journal on Audio, Speech, and Music Processing

Table 4 Speech event detection results with different network architectures

Model	L	N	p	Train cost	Train acc. (%)	Validation cost	Validation acc. (%)	Test cost	Test acc. (%)
FConn	6	512	6.23	0.489	77.03	0.510	76.45	0.518	75.58
CNN 3×3	7	128	6.04	0.322	86.86	0.383	83.65	0.387	83.72
CNN 7×7	6	64	6.17	0.362	85.02	0.380	84.07	0.390	83.21
LSTM	1	64	4.70	0.547	73.69	0.544	73.51	0.547	73.41
C1-LSTM	3	256	6.40	0.406	82.56	0.436	80.96	0.437	80.80
C2-LSTM	6	256	6.59	0.377	84.30	0.375	84.34	0.382	83.99

The Model column refers to the network architecture; L and N are the number of hidden layers and the number of nodes in each layer, respectively (the detailed function of these parameters in each structure can be found in Section 3.3). p is a base-10 logarithmic measure of the number of parameters. The value of the cost (loss) function and the classification accuracy are included for the training, validation, and test subsets. The best model in terms of validation cost (C2-LSTM) is highlighted in italics.
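
As a reading aid, the sketch below converts p back into an approximate parameter count and picks the model with the lowest validation cost. It assumes that "base-10 logarithmic measure" means p = log10(number of parameters); because p is reported to two decimals, the recovered counts are only approximate. The dictionary and function names are illustrative and do not come from the paper.

```python
# Values copied from Table 4: model -> (p, validation cost).
TABLE_4 = {
    "FConn": (6.23, 0.510),
    "CNN 3x3": (6.04, 0.383),
    "CNN 7x7": (6.17, 0.380),
    "LSTM": (4.70, 0.544),
    "C1-LSTM": (6.40, 0.436),
    "C2-LSTM": (6.59, 0.375),
}


def approx_params(p: float) -> int:
    """Invert p = log10(parameter count); assumed definition of p."""
    return round(10 ** p)


def best_by_validation_cost(table: dict) -> str:
    """Return the model name with the lowest validation cost."""
    return min(table, key=lambda name: table[name][1])


if __name__ == "__main__":
    for name, (p, val_cost) in TABLE_4.items():
        print(f"{name}: ~{approx_params(p):,} parameters, "
              f"validation cost {val_cost}")
    print("Best by validation cost:", best_by_validation_cost(TABLE_4))
```

Under this reading, the LSTM (p = 4.70) has roughly fifty thousand parameters, while the other models fall between about one and four million.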