Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset

EURASIP Journal on Audio, Speech, and Music Processing

Table 5 Music event detection results with different network architectures

Model	L	N	p	Train cost	Train acc. (%)	Validation cost	Validation acc. (%)	Test cost	Test acc. (%)
FConn	4	2048	7.15	0.518	74.73	0.552	72.50	0.554	72.74
CNN3x3	7	256	6.60	0.362	85.28	0.386	84.14	0.396	83.51
CNN7x7	6	128	6.69	0.355	85.46	0.379	84.19	0.379	84.20
LSTM	3	32	4.57	0.559	72.39	0.553	72.98	0.554	72.65
C1-LSTM	3	256	6.40	0.431	81.08	0.466	79.48	0.460	79.75
C2-LSTM	6	128	6.00	0.333	86.61	0.383	84.34	0.380	84.49

The Model column gives the network architecture; L and N are the number of hidden layers and the number of nodes in each layer (the detailed function of these parameters in each structure can be found in Section 3.3). p is the base-10 logarithm of the number of parameters. The value of the cost (loss) function and the classification accuracy are reported for the training, validation, and test subsets. The best model in terms of validation cost (CNN7x7 in this table) is highlighted in italics.
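As a reading aid, the following minimal sketch shows how the p column can be converted back to an approximate parameter count and how the best model by validation cost can be picked out. The values are copied from the table above; the names `ROWS`, `approx_param_count`, and `best_by_validation_cost` are hypothetical helpers introduced for illustration and are not part of the original work.

```python
# Rows copied from Table 5: (model, p, validation cost, test accuracy in %).
# The tuple layout is an assumption made for this illustration only.
ROWS = [
    ("FConn", 7.15, 0.552, 72.74),
    ("CNN3x3", 6.60, 0.386, 83.51),
    ("CNN7x7", 6.69, 0.379, 84.20),
    ("LSTM", 4.57, 0.553, 72.65),
    ("C1-LSTM", 6.40, 0.466, 79.75),
    ("C2-LSTM", 6.00, 0.383, 84.49),
]


def approx_param_count(p: float) -> float:
    """Invert the base-10 logarithm: p = log10(#parameters)."""
    return 10.0 ** p


def best_by_validation_cost(rows):
    """Return the row with the lowest validation cost (third field)."""
    return min(rows, key=lambda row: row[2])


if __name__ == "__main__":
    for model, p, val_cost, test_acc in ROWS:
        # Counts are approximate because p is rounded to two decimals.
        print(f"{model:8s} ~{approx_param_count(p):,.0f} parameters, "
              f"validation cost {val_cost:.3f}, test accuracy {test_acc:.2f}%")
    print("Lowest validation cost:", best_by_validation_cost(ROWS)[0])
```

Because p is rounded to two decimals, the recovered counts are only order-of-magnitude estimates: for example, p = 7.15 corresponds to roughly 14 million parameters and p = 4.57 to roughly 37 thousand.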