
Table 4 Architecture blocks for four networks inspired by previous work. Music segmentation: CNN-GrillSchlüter: adaptation after [62]. Instrument recognition: CNN-AlexNet: adaptation after [63]; CNN-Han: adaptation after [64, 65]; CNN-VGG16: adaptation after [66]. Keras layers: C(i,j): Conv2D(i,j); P(m,n): MaxPooling2D(m,n); d: Dropout(0.25); D: Dropout(0.5); F: Flatten; G: GlobalMaxPooling2D; \(\mathrm {R_r}\): Dense(activation='relu'); \(\mathrm {R_s}\): Dense(activation='sigmoid'). The numbers of output neurons are provided in square brackets

From: AAM: a dataset of Artificial Audio Multitracks for diverse music information retrieval tasks

CNN-GrillSchlüter

C(8,6)[16] P(3,6)

C(6,3)[32]

F D \(\mathrm {R_s}\)[128] D \(\mathrm {R_s}\)[1]

CNN-AlexNet

C(11,11)[32] P(1,2) d

C(5,5)[64] P(1,2) d

C(3,3)[128] C(3,3)[128] C(3,3)[128] G d

\(\mathrm {R_r}\)[1024] D \(\mathrm {R_s}\)[9]

CNN-Han

C(3,3)[32] C(3,3)[32] P(1,2) d

C(3,3)[64] C(3,3)[64] P(1,2) d

C(3,3)[128] C(3,3)[128] P(2,2) d

C(3,3)[256] C(3,3)[256] G

\(\mathrm {R_r}\)[1024] D \(\mathrm {R_s}\)[9]

CNN-VGG16

C(3,3)[32] C(3,3)[32] P(1,2) d

C(3,3)[64] C(3,3)[64] P(1,2) d

C(3,3)[128] C(3,3)[128] C(3,3)[128] P(2,4) d

C(3,3)[256] C(3,3)[256] C(3,3)[256] G d

\(\mathrm {R_r}\)[1024] D \(\mathrm {R_s}\)[9]
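
The shorthand in the table can be expanded mechanically into Keras layer calls. The following is a minimal sketch, not code from the article: the function name `expand`, the ASCII tokens `Rr` and `Rs` (standing in for \(\mathrm {R_r}\) and \(\mathrm {R_s}\)), and the output format are hypothetical. It assumes that the pair in parentheses after C is the kernel size and that the bracketed number is the filter count (or the unit count for Dense layers); the caption does not state this explicitly. The pooling token is written with an uppercase P, as in the caption legend. The output is plain strings, so the sketch runs without Keras installed.

```python
import re

# Argument-free tokens and their Keras equivalents, taken from the caption legend.
SIMPLE = {
    "d": "Dropout(0.25)",
    "D": "Dropout(0.5)",
    "F": "Flatten()",
    "G": "GlobalMaxPooling2D()",
}

# One alternative per token family: convolution, pooling, dense, argument-free.
TOKEN = re.compile(
    r"C\((\d+),(\d+)\)\[(\d+)\]"   # C(i,j)[n]
    r"|P\((\d+),(\d+)\)"           # P(m,n)
    r"|(Rr|Rs)\[(\d+)\]"           # Rr[n] or Rs[n]
    r"|([dDFG])"                   # d, D, F, G
)


def expand(row):
    """Expand one table row, e.g. 'C(8,6)[16] P(3,6)', into Keras-style call strings."""
    layers = []
    for tok in row.split():
        m = TOKEN.fullmatch(tok)
        if m is None:
            raise ValueError(f"unrecognised token: {tok!r}")
        ci, cj, filters, pm, pn, dense, units, simple = m.groups()
        if filters is not None:
            # Assumption: (i,j) is the kernel size, [n] the number of filters.
            layers.append(f"Conv2D({filters}, ({ci}, {cj}))")
        elif pm is not None:
            layers.append(f"MaxPooling2D(({pm}, {pn}))")
        elif dense is not None:
            activation = "relu" if dense == "Rr" else "sigmoid"
            layers.append(f"Dense({units}, activation='{activation}')")
        else:
            layers.append(SIMPLE[simple])
    return layers


# The three rows of CNN-GrillSchlüter, transcribed from the table.
GRILL_SCHLUETER_ROWS = ["C(8,6)[16] P(3,6)", "C(6,3)[32]", "F D Rs[128] D Rs[1]"]
grill_schlueter_layers = [
    layer for row in GRILL_SCHLUETER_ROWS for layer in expand(row)
]

for layer in grill_schlueter_layers:
    print(layer)
```

Details the table does not specify, such as padding, strides, the activation of the convolutional layers, and the input shape, are deliberately left out of the generated calls and would have to be taken from the article text.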