# Table 3 Configuration of different model structures. The symbol *C* indicates the number of convolutional kernels and the symbol - - - indicates that the model structure does not have this part

From: Neural network-based non-intrusive speech quality assessment using attention pooling function

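Model | BLSTM | CNN | CNN-LSTM |
---|---|---|---|
Input layer | Log mel spectrogram (bs × 800 frames × 64 mel bins) | Log mel spectrogram (bs × 800 frames × 64 mel bins) | Log mel spectrogram (bs × 800 frames × 64 mel bins) |
Convolution layer | - - - | \(\left\{ \begin{array}{l} \left( \begin{array}{l} 3 \times 3 @ \mathrm{C} \\ \text{BN, ReLU} \end{array} \right) \times 2 \\ \quad \text{avg. pooling} \end{array} \right\} \times 4\) | \(\left\{ \begin{array}{l} \left( \begin{array}{l} 3 \times 3 @ \mathrm{C} \\ \text{BN, ReLU} \end{array} \right) \times 2 \\ \quad \text{avg. pooling} \end{array} \right\} \times 4\) |
Recurrent layer | BLSTM-32 | - - - | BLSTM-32 |
FC layer | FC-1, ReLU (frame-level score) | FC-1, ReLU (frame-level score) | FC-1, ReLU (frame-level score) |
Output layer | Max pooling, average pooling, attention pooling, linear softmax (utterance-level score) | Max pooling, average pooling, attention pooling, linear softmax (utterance-level score) | Max pooling, average pooling, attention pooling, linear softmax (utterance-level score) |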
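
The output layer in the table reduces the frame-level scores to a single utterance-level score using one of four pooling functions. The following is a minimal sketch of how these could look in plain Python. It assumes the commonly used definitions of these pooling functions, which may differ in detail from the formulation in the paper. The function names and the `attention_logits` argument are hypothetical; in a trained model the attention weights would come from a learned layer rather than being passed in by hand.

```python
import math


def max_pooling(scores):
    """Utterance-level score = the largest frame-level score."""
    return max(scores)


def average_pooling(scores):
    """Utterance-level score = the mean of the frame-level scores."""
    return sum(scores) / len(scores)


def linear_softmax_pooling(scores):
    """Each frame is weighted by its own score: sum(s^2) / sum(s).

    Assumes non-negative scores (e.g. after a ReLU), as in the FC layer row.
    """
    total = sum(scores)
    if total == 0:
        return 0.0
    return sum(s * s for s in scores) / total


def attention_pooling(scores, attention_logits):
    """Weighted mean of the scores with softmax-normalised attention weights.

    `attention_logits` is a hypothetical stand-in for the output of a
    learned attention layer (one value per frame).
    """
    peak = max(attention_logits)  # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in attention_logits]
    norm = sum(exps)
    return sum((e / norm) * s for e, s in zip(exps, scores))


frame_scores = [1.0, 2.0, 3.0]
print(max_pooling(frame_scores))
print(average_pooling(frame_scores))
print(linear_softmax_pooling(frame_scores))
print(attention_pooling(frame_scores, [0.0, 0.0, 0.0]))
```

With equal attention logits, attention pooling reduces to average pooling. Larger logits on some frames shift the utterance-level score toward those frames.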