
Table 3 Configuration of the different model structures. The symbol C denotes the number of convolutional kernels, and the symbol - - - indicates that the model structure does not include this part.

From: Neural network-based non-intrusive speech quality assessment using attention pooling function

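| Model | BLSTM | CNN | CNN-LSTM |
| --- | --- | --- | --- |
| Input layer | Log mel spectrogram (bs × 800 frames × 64 mel bins) | Log mel spectrogram (bs × 800 frames × 64 mel bins) | Log mel spectrogram (bs × 800 frames × 64 mel bins) |
| Convolution layer | - - - | \(\left\{ \begin{array}{l} \left( \begin{array}{l} 3 \times 3 @ \mathrm{C} \\ \text{BN, ReLU} \end{array} \right) \times 2 \\ \quad \text{avg. pooling} \end{array} \right\} \times 4\) | \(\left\{ \begin{array}{l} \left( \begin{array}{l} 3 \times 3 @ \mathrm{C} \\ \text{BN, ReLU} \end{array} \right) \times 2 \\ \quad \text{avg. pooling} \end{array} \right\} \times 4\) |
| Recurrent layer | BLSTM-32 | - - - | BLSTM-32 |
| FC layer | FC-1, ReLU (frame-level score) | FC-1, ReLU (frame-level score) | FC-1, ReLU (frame-level score) |
| Output layer | Max pooling, average pooling, attention pooling, linear softmax (utterance-level score) | Max pooling, average pooling, attention pooling, linear softmax (utterance-level score) | Max pooling, average pooling, attention pooling, linear softmax (utterance-level score) |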
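
The output layer reduces the frame-level scores to a single utterance-level score using one of four pooling functions, which the table names but does not define. The following is a minimal sketch in plain Python under commonly used definitions, not necessarily the exact formulation of the paper. The function names are hypothetical, the linear softmax form (sum of squared scores divided by the sum of scores) is an assumed definition, and the attention logits are assumed to come from a separate learned layer that is not shown here.

```python
import math


def max_pool(scores):
    """Utterance score = largest frame-level score."""
    return max(scores)


def avg_pool(scores):
    """Utterance score = mean of the frame-level scores."""
    return sum(scores) / len(scores)


def linear_softmax_pool(scores):
    """Assumed definition: each frame is weighted by its own score,
    i.e. sum(s^2) / sum(s). Scores are non-negative after ReLU."""
    total = sum(scores)
    if total == 0:
        return 0.0  # all-zero scores: avoid division by zero
    return sum(s * s for s in scores) / total


def attention_pool(scores, logits):
    """Assumed definition: softmax-normalised attention weights
    (from hypothetical learned logits) applied to the frame scores."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in logits]
    z = sum(exps)
    return sum((e / z) * s for e, s in zip(exps, scores))


# Toy frame-level scores for a three-frame utterance.
frame_scores = [1.0, 2.0, 3.0]
uniform_logits = [0.0, 0.0, 0.0]  # equal attention on every frame

print(max_pool(frame_scores))
print(avg_pool(frame_scores))
print(linear_softmax_pool(frame_scores))
print(attention_pool(frame_scores, uniform_logits))
```

With uniform attention logits, attention pooling reduces to average pooling. Linear softmax falls between average and max pooling because higher-scoring frames receive larger weights.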