
Table 3 Configuration of the different model structures. The symbol C denotes the number of convolutional kernels, and the symbol - - - indicates that the model structure does not include that part.

From: Neural network-based non-intrusive speech quality assessment using attention pooling function

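| Model | BLSTM | CNN | CNN-LSTM |
| --- | --- | --- | --- |
| Input layer | Log mel spectrogram (bs × 800 frames × 64 mel bins) | Log mel spectrogram (bs × 800 frames × 64 mel bins) | Log mel spectrogram (bs × 800 frames × 64 mel bins) |
| Convolution layer | - - - | \(\left\{ \begin{array}{l} \left( \begin{array}{l} 3 \times 3 @ \mathrm{C} \\ \text{BN, ReLU} \end{array} \right) \times 2 \\ \quad \text{avg. pooling} \end{array} \right\} \times 4\) | \(\left\{ \begin{array}{l} \left( \begin{array}{l} 3 \times 3 @ \mathrm{C} \\ \text{BN, ReLU} \end{array} \right) \times 2 \\ \quad \text{avg. pooling} \end{array} \right\} \times 4\) |
| Recurrent layer | BLSTM-32 | - - - | BLSTM-32 |
| FC layer | FC-1, ReLU (frame-level score) | FC-1, ReLU (frame-level score) | FC-1, ReLU (frame-level score) |
| Output layer | Max pooling, average pooling, attention pooling, linear softmax (utterance-level score) | Max pooling, average pooling, attention pooling, linear softmax (utterance-level score) | Max pooling, average pooling, attention pooling, linear softmax (utterance-level score) |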
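
The output layer reduces the frame-level scores to a single utterance-level score with one of four pooling functions. The following is a minimal sketch of how such pooling functions are commonly defined in the multiple-instance-learning literature; the paper's exact formulations may differ. The function names are hypothetical, and the attention weights, which a trained model would produce with a learned layer, are simply passed in as an argument here.

```python
from typing import Sequence


def max_pooling(scores: Sequence[float]) -> float:
    """Utterance score is the largest frame-level score."""
    return max(scores)


def average_pooling(scores: Sequence[float]) -> float:
    """Utterance score is the unweighted mean of the frame-level scores."""
    return sum(scores) / len(scores)


def linear_softmax_pooling(scores: Sequence[float]) -> float:
    """Each frame is weighted by its own score: sum(x^2) / sum(x).

    Assumes non-negative scores (e.g. after a ReLU), as is common for this
    definition; returns 0.0 when all scores are zero.
    """
    total = sum(scores)
    if total == 0:
        return 0.0
    return sum(s * s for s in scores) / total


def attention_pooling(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of the scores: sum(w * x) / sum(w).

    `weights` stands in for the output of a learned attention layer
    (an assumption of this sketch, not taken from the table).
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("attention weights must not sum to zero")
    return sum(w * s for w, s in zip(weights, scores)) / total


# Hypothetical frame-level scores for a four-frame utterance.
frame_scores = [1.0, 2.0, 3.0, 4.0]
uniform_weights = [1.0, 1.0, 1.0, 1.0]
print(max_pooling(frame_scores))
print(average_pooling(frame_scores))
print(linear_softmax_pooling(frame_scores))
print(attention_pooling(frame_scores, uniform_weights))
```

With uniform weights, attention pooling reduces to average pooling; with all weight on a single frame, it returns that frame's score, which is how it can interpolate between the averaging and max-like behaviours.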