Table 6 Simultaneous speech-music event detection results with different network architectures

From: Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset

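| Model | L | N | p | Train cost | Train acc. (%) | Validation cost | Validation acc. (%) | Test cost | Test acc. (%) |
|---|---|---|---|---|---|---|---|---|---|
| FConn | 6 | 256 | 5.77 | 0.977 | 58.93 | 1.038 | 56.19 | 1.043 | 55.80 |
| CNN3x3 | 6 | 256 | 6.68 | 0.726 | 71.10 | 0.740 | 70.39 | 0.746 | 70.37 |
| C1-LSTM | 4 | 256 | 6.53 | 0.788 | 67.58 | 0.877 | 64.82 | 0.886 | 64.04 |
| *C2-LSTM* | *6* | *256* | *6.59* | *0.651* | *74.43* | *0.726* | *71.48* | *0.733* | *70.98* |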

  1. The Model column gives the network architecture; L and N are the number of hidden layers and the number of nodes in each layer (the detailed role of these parameters in each architecture is described in Section 3.3). p is the base-10 logarithm of the number of parameters. The value of the cost (loss) function and the classification accuracy are reported for the training, validation, and test subsets. The best model in terms of validation cost (C2-LSTM, 0.726) is highlighted in italics.
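
As a reading aid, the short Python sketch below shows how the table's columns can be interpreted. It converts p back to an approximate parameter count (since p is a base-10 logarithm, the count is about 10^p) and selects the model with the lowest validation cost. The row values are transcribed from Table 6. The names `ROWS`, `approx_param_count`, and `best_by_validation_cost` are illustrative assumptions and do not come from the original article.

```python
# Rows transcribed from Table 6: (model, L, N, p, train, validation, test).
# Each subset entry is a (cost, accuracy %) pair.
ROWS = [
    ("FConn",   6, 256, 5.77, (0.977, 58.93), (1.038, 56.19), (1.043, 55.80)),
    ("CNN3x3",  6, 256, 6.68, (0.726, 71.10), (0.740, 70.39), (0.746, 70.37)),
    ("C1-LSTM", 4, 256, 6.53, (0.788, 67.58), (0.877, 64.82), (0.886, 64.04)),
    ("C2-LSTM", 6, 256, 6.59, (0.651, 74.43), (0.726, 71.48), (0.733, 70.98)),
]


def approx_param_count(p):
    """Invert the base-10 logarithm: p = log10(#parameters)."""
    # Approximate only, because p is rounded to two decimals in the table.
    return round(10 ** p)


def best_by_validation_cost(rows):
    """Return the row with the lowest validation cost (index 5, field 0)."""
    return min(rows, key=lambda row: row[5][0])


if __name__ == "__main__":
    for model, _, _, p, _, _, _ in ROWS:
        print(f"{model}: about {approx_param_count(p):,} parameters")
    print("Lowest validation cost:", best_by_validation_cost(ROWS)[0])
```

Because p is rounded to two decimals, the recovered counts are only order-of-magnitude estimates (for example, p = 5.77 corresponds to roughly 0.59 million parameters).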