Table 4 Speech event detection results with different network architectures

From: Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset

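| Model | L | N | p | Train cost | Train acc. (%) | Validation cost | Validation acc. (%) | Test cost | Test acc. (%) |
|---|---|---|---|---|---|---|---|---|---|
| FConn | 6 | 512 | 6.23 | 0.489 | 77.03 | 0.510 | 76.45 | 0.518 | 75.58 |
| CNN 3×3 | 7 | 128 | 6.04 | 0.322 | 86.86 | 0.383 | 83.65 | 0.387 | 83.72 |
| CNN 7×7 | 6 | 64 | 6.17 | 0.362 | 85.02 | 0.380 | 84.07 | 0.390 | 83.21 |
| LSTM | 1 | 64 | 4.70 | 0.547 | 73.69 | 0.544 | 73.51 | 0.547 | 73.41 |
| C1-LSTM | 3 | 256 | 6.40 | 0.406 | 82.56 | 0.436 | 80.96 | 0.437 | 80.80 |
| *C2-LSTM* | *6* | *256* | *6.59* | *0.377* | *84.30* | *0.375* | *84.34* | *0.382* | *83.99* |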

  1. The Model column gives the network architecture; L and N are the number of hidden layers and the number of nodes in each layer (the detailed role of these parameters in each structure is described in Section 3.3). p is the base-10 logarithm of the number of parameters. The value of the cost (loss) function and the classification accuracy are reported for the training, validation, and test subsets. The best model in terms of validation cost is highlighted in italics.
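
Since p is a base-10 logarithm, the approximate parameter count of each model can be recovered as 10^p. The following is a minimal sketch of that conversion, assuming only the p values listed in the table; the function names and the `P_VALUES` dictionary are hypothetical helpers introduced for illustration and are not taken from the paper.

```python
import math

# p values copied from the table above (base-10 log of the parameter count).
P_VALUES = {
    "FConn": 6.23,
    "CNN 3x3": 6.04,
    "CNN 7x7": 6.17,
    "LSTM": 4.70,
    "C1-LSTM": 6.40,
    "C2-LSTM": 6.59,
}


def log10_params(n_params: int) -> float:
    """Return p, the base-10 logarithm of a parameter count."""
    return math.log10(n_params)


def params_from_p(p: float) -> float:
    """Invert p back to an approximate parameter count (10 ** p)."""
    return 10.0 ** p


def size_ratio(p_a: float, p_b: float) -> float:
    """Return how many times more parameters model A has than model B."""
    return 10.0 ** (p_a - p_b)


if __name__ == "__main__":
    for model, p in P_VALUES.items():
        print(f"{model:8s} p={p:.2f}  ~{params_from_p(p):,.0f} parameters")
```

Because p is rounded to two decimals in the table, the recovered counts are only approximate. For example, p = 6.23 for FConn corresponds to roughly 1.7 million parameters, and a difference of 1 in p means a tenfold difference in model size.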