Neural network-based non-intrusive speech quality assessment using attention pooling function

EURASIP Journal on Audio, Speech, and Music Processing

Table 3 Configuration of the different model structures. The symbol C denotes the number of convolutional kernels, and the symbol - - - indicates that the model does not include that layer.

Model	BLSTM	CNN	CNN-LSTM
Input layer	Log mel spectrogram (bs × 800 frames × 64 mel bins)
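Convolution layer	- - -	\(\left\{ \begin{array}{l} \left( \begin{array}{l} 3 \times 3 @ \mathrm{C} \\ \text{BN, ReLU} \end{array} \right) \times 2 \\ \quad \text{avg. pooling} \end{array} \right\} \times 4\)	\(\left\{ \begin{array}{l} \left( \begin{array}{l} 3 \times 3 @ \mathrm{C} \\ \text{BN, ReLU} \end{array} \right) \times 2 \\ \quad \text{avg. pooling} \end{array} \right\} \times 4\)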
Recurrent layer	BLSTM-32	- - -	BLSTM-32
FC layer	FC-1, ReLU (frame-level score)
Output layer	Max pooling, average pooling, attention pooling, or linear softmax (utterance-level score)
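
The output layer in Table 3 reduces the frame-level scores to a single utterance-level score with one of four pooling functions. The following is a minimal sketch of those four reductions in plain Python, not the authors' implementation. The function names are hypothetical, and two details are assumptions: the attention weights are taken to be a softmax over per-frame logits (here passed in as an argument, whereas in a trained model they would come from a learned layer), and linear softmax is taken in its usual form, the sum of squared scores divided by the sum of scores.

```python
import math
from typing import Sequence


def max_pooling(scores: Sequence[float]) -> float:
    """Utterance score is the largest frame-level score."""
    return max(scores)


def average_pooling(scores: Sequence[float]) -> float:
    """Utterance score is the unweighted mean of the frame-level scores."""
    return sum(scores) / len(scores)


def linear_softmax_pooling(scores: Sequence[float]) -> float:
    """Each frame is weighted by its own score: sum(y_i^2) / sum(y_i).

    Frame scores are assumed non-negative (they follow a ReLU in Table 3).
    Returning 0.0 for an all-zero input is a choice made for this sketch.
    """
    total = sum(scores)
    if total == 0:
        return 0.0
    return sum(s * s for s in scores) / total


def attention_pooling(scores: Sequence[float],
                      attention_logits: Sequence[float]) -> float:
    """Weighted mean of frame scores with softmax-normalized weights.

    Assumption: weights are a softmax over per-frame logits. The logits
    are hypothetical inputs standing in for a learned attention layer.
    """
    if len(scores) != len(attention_logits):
        raise ValueError("scores and attention_logits must have equal length")
    peak = max(attention_logits)  # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in attention_logits]
    norm = sum(exps)
    return sum((e / norm) * s for e, s in zip(exps, scores))


if __name__ == "__main__":
    frame_scores = [1.0, 2.0, 3.0, 4.0]  # toy frame-level scores
    print(max_pooling(frame_scores))             # 4.0
    print(average_pooling(frame_scores))         # 2.5
    print(linear_softmax_pooling(frame_scores))  # 3.0
    # Equal logits give uniform weights, so this matches average pooling.
    print(attention_pooling(frame_scores, [0.0, 0.0, 0.0, 0.0]))
```

With uniform attention weights the attention pooling reduces to average pooling, and as one logit dominates it approaches the score of that single frame, so it can interpolate between the average-like and max-like behaviours.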