From: Neural network-based non-intrusive speech quality assessment using attention pooling function
Model | BLSTM | CNN | CNN-LSTM |
---|---|---|---|
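Input layer | Log mel spectrogram (bs × 800 frames × 64 mel bins) | Log mel spectrogram (bs × 800 frames × 64 mel bins) | Log mel spectrogram (bs × 800 frames × 64 mel bins) |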
Convolution layer | - - - | \(\left\{ \begin{array}{l} \left( \begin{array}{l} 3 \times 3\,@\,C \\ \text{BN, ReLU} \end{array} \right) \times 2 \\ \text{avg. pooling} \end{array} \right\} \times 4\) | \(\left\{ \begin{array}{l} \left( \begin{array}{l} 3 \times 3\,@\,C \\ \text{BN, ReLU} \end{array} \right) \times 2 \\ \text{avg. pooling} \end{array} \right\} \times 4\) |
Recurrent layer | BLSTM-32 | - - - | BLSTM-32 |
FC layer | FC-1, ReLU (frame-level score) | FC-1, ReLU (frame-level score) | FC-1, ReLU (frame-level score) |
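Output layer | Max pooling, average pooling, attention pooling, or linear softmax (utterance-level score) | Max pooling, average pooling, attention pooling, or linear softmax (utterance-level score) | Max pooling, average pooling, attention pooling, or linear softmax (utterance-level score) |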
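
The output layer in the table reduces the per-frame scores from the FC layer to a single utterance-level score. The following is a minimal sketch of the four pooling options in plain Python. The formulas follow common usage in the multiple-instance-learning literature and are an assumption here, not necessarily the exact formulation used in the paper. The function names, the toy frame scores, and the attention weights are hypothetical; in a trained model the attention weights would come from a learned branch of the network.

```python
from typing import Sequence


def max_pool(scores: Sequence[float]) -> float:
    """Utterance score is the highest frame-level score."""
    return max(scores)


def average_pool(scores: Sequence[float]) -> float:
    """Utterance score is the unweighted mean of the frame-level scores."""
    return sum(scores) / len(scores)


def linear_softmax_pool(scores: Sequence[float]) -> float:
    """Each frame is weighted by its own score: sum(y^2) / sum(y)."""
    total = sum(scores)
    if total == 0:
        # All-zero scores can occur after a ReLU; avoid dividing by zero.
        return 0.0
    return sum(s * s for s in scores) / total


def attention_pool(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean with separately supplied non-negative weights (assumed form)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("attention weights must not sum to zero")
    return sum(s * w for s, w in zip(scores, weights)) / total


# Toy example: four frame-level scores and hypothetical attention weights
# that ignore the first two frames.
frame_scores = [1.0, 2.0, 3.0, 4.0]
attention_weights = [0.0, 0.0, 1.0, 1.0]

print(max_pool(frame_scores))                           # 4.0
print(average_pool(frame_scores))                       # 2.5
print(linear_softmax_pool(frame_scores))                # 3.0
print(attention_pool(frame_scores, attention_weights))  # 3.5
```

Under these definitions, linear softmax is the special case of attention pooling in which the weights equal the frame scores themselves, whereas attention pooling lets the weights be predicted independently of the scores.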