Table 5 Music event detection results with different network architectures

From: Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset

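| Model | L | N | p | Train cost | Train acc. (%) | Validation cost | Validation acc. (%) | Test cost | Test acc. (%) |
|---|---|---|---|---|---|---|---|---|---|
| FConn | 4 | 2048 | 7.15 | 0.518 | 74.73 | 0.552 | 72.50 | 0.554 | 72.74 |
| CNN3x3 | 7 | 256 | 6.60 | 0.362 | 85.28 | 0.386 | 84.14 | 0.396 | 83.51 |
| *CNN7x7* | *6* | *128* | *6.69* | *0.355* | *85.46* | *0.379* | *84.19* | *0.379* | *84.20* |
| LSTM | 3 | 32 | 4.57 | 0.559 | 72.39 | 0.553 | 72.98 | 0.554 | 72.65 |
| C1-LSTM | 3 | 256 | 6.40 | 0.431 | 81.08 | 0.466 | 79.48 | 0.460 | 79.75 |
| C2-LSTM | 6 | 128 | 6.00 | 0.333 | 86.61 | 0.383 | 84.34 | 0.380 | 84.49 |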

  1. The Model column gives the network architecture; L and N are the number of hidden layers and the number of nodes in each layer (the detailed role of these parameters in each structure is described in Section 3.3). p is the base-10 logarithm of the number of parameters. The value of the cost (loss) function and the classification accuracy are reported for the training, validation and test subsets. The best model in terms of validation cost (CNN7x7, 0.379) is highlighted in italics.
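
Because p is a base-10 logarithm, each value can be converted back to an approximate parameter count as 10^p. The following is a minimal sketch of that conversion, using the p and validation-cost values copied from the table. The helper names `params_from_p` and `p_from_params` are illustrative and do not come from the paper, and the recovered counts are only as precise as the two decimals reported for p.

```python
import math

# (model, p, validation cost) copied from Table 5.
rows = [
    ("FConn", 7.15, 0.552),
    ("CNN3x3", 6.60, 0.386),
    ("CNN7x7", 6.69, 0.379),
    ("LSTM", 4.57, 0.553),
    ("C1-LSTM", 6.40, 0.466),
    ("C2-LSTM", 6.00, 0.383),
]


def params_from_p(p: float) -> float:
    """Invert p = log10(number of parameters). Hypothetical helper."""
    return 10.0 ** p


def p_from_params(n_params: float) -> float:
    """Compute p from a parameter count. Hypothetical helper."""
    return math.log10(n_params)


# Print the approximate parameter count implied by each p value.
for model, p, val_cost in rows:
    print(f"{model:8s} p={p:.2f} -> ~{params_from_p(p):,.0f} parameters")

# Select the row with the lowest validation cost, the criterion named in the footnote.
best = min(rows, key=lambda r: r[2])
print("lowest validation cost:", best[0])
```

Read this way, the FConn model (p = 7.15) has roughly 14 million parameters, while the LSTM (p = 4.57) has on the order of 37 thousand, a difference of more than two orders of magnitude at nearly the same test accuracy.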