Neural network-based non-intrusive speech quality assessment using attention pooling function

Recently, non-intrusive speech quality assessment has attracted considerable attention because it does not require the original reference signal. At the same time, neural networks have been applied to speech quality assessment and have achieved good performance. To improve the performance of non-intrusive speech quality assessment, this paper proposes a neural network-based assessment method using an attention pooling function. The proposed systems are based on convolutional neural network (CNN), bidirectional long short-term memory (BLSTM), and CNN-LSTM structures. Comparing four types of pooling functions both theoretically and experimentally, we find that the attention pooling function performs best. Experiments are conducted on a dataset containing various degraded speech signals with corresponding subjective quality scores. The results show that the proposed CNN-LSTM model using the attention pooling function achieves a state-of-the-art correlation coefficient (R) of 0.967 and root-mean-square error (RMSE) of 0.269, outperforming the ITU-T P.563 standard and the autoencoder-support vector regression method.


Introduction
Speech quality assessment has become an important part of speech systems. It can be used to evaluate the output quality of speech enhancement [1], speech synthesis [2], and other speech systems. Therefore, it is necessary to develop an effective, reliable, and flexible speech quality assessment method.
At present, the main challenge facing the speech quality assessment task is how to improve the prediction accuracy of non-intrusive methods so that they approach or even surpass intrusive methods in objective evaluation. So far, P.563 [3] is the only published ITU-T standard for no-reference speech quality evaluation. It was proposed relatively early, and its accuracy falls far short of that of intrusive methods. With the rapid development of deep learning technology, many researchers have applied deep neural networks to speech quality assessment [4-7], which has greatly improved the accuracy of non-intrusive methods. However, none of them paid attention to the pooling function before the output of the neural network in the speech quality assessment task.
*Correspondence: wangjing@bit.edu.cn, School of Information and Electronics, Beijing Institute of Technology, Beijing, China
In this paper, we propose a neural network-based non-intrusive speech quality assessment method using an attention pooling function [8]. We analyze four existing pooling functions on the speech quality assessment task and conduct experiments on convolutional neural network (CNN), bidirectional long short-term memory (BLSTM), and CNN-LSTM structures. The experimental results verify that the CNN-LSTM structure using the attention pooling function performs best on this task. As far as we know, this is the first analysis of the pooling function for non-intrusive speech quality assessment.
The rest of this paper is organized as follows. Section 2 introduces the related works about speech quality assessment. Section 3 introduces the related neural networks. Section 4 presents the pooling function used in speech quality assessment system. Section 5 introduces the contrast methods. Section 6 introduces the experiment setups. Section 7 shows the evaluation experiments to assess the system performance. Finally, Section 8 states the conclusions.

Related works
Speech quality assessment methods comprise subjective tests and objective tests. Subjective tests, based on listeners' impressions of the speech they hear, generally use the mean opinion score (MOS) described in ITU-T P.800 [9] to measure speech quality. This approach is accurate but time-consuming and labor-intensive. Objective evaluation methods can be divided into two categories (intrusive and non-intrusive) according to the presence or absence of reference signals. An intrusive method requires not only the speech signal to be evaluated but also the original undamaged clean signal, as in ITU-T P.862 [10] and P.863 [11]. A non-intrusive method does not require a clean signal and directly provides a quality score based on the signal to be evaluated. The ITU-T P.563 [3] standard algorithm is one such non-intrusive method and is widely used in the evaluation of narrowband speech. Although the non-intrusive method is not as accurate as the intrusive method, it has developed rapidly in recent years because of its simple implementation.
Many non-intrusive algorithms for evaluating speech quality have been proposed. ANIQUE [12] is based on the functional roles of human auditory systems and the characteristics of human articulation systems. Tiago H. Falk et al. [13] used Gaussian mixture models (GMMs) to model the behavior of clean speech and compared features extracted from degraded speech signals to the artificial reference model signals. D.-S. Kim et al. [14] proposed a perceptually motivated algorithm based on a temporal envelope representation of speech to assess speech quality. Meet H. Soni [15] used the ideal ratio mask (IRM) for non-intrusive quality assessment of noise-suppressed speech. Wang [16] applied an autoencoder to extract bottleneck features of speech signals and mapped the features to the predicted MOS using support vector regression (SVR) [17].
Recently, deep learning methods have been well applied in the field of speech quality assessment due to their nonlinear fitting performance. Haemin et al. [4] proposed a deep neural network (DNN) based non-intrusive speech quality estimation method in real-time voice communication systems. Hakami and Kleijn [5] used augmented feature set and the neural network to improve the prediction accuracy of the single-ended quality assessment approach. Quality-Net [6], based on bidirectional long short term memory (BLSTM), combined the frame-level scores to the final estimated utterance-level quality score using average pooling method. Lo et al. [7] adopted the convolutional and recurrent neural network models to build a mean opinion score predictor. Gabriel and Sebastian [2] proposed a TTS naturalness prediction model which achieved promising results on unseen datasets.

Related neural networks
Non-intrusive speech quality assessment can be regarded as a weakly labeled regression task, because only utterance-level speech quality labels are provided. The non-intrusive speech quality assessment system based on a neural network is shown in Fig. 1. Many neural network-based methods, such as convolutional neural networks (CNNs) and long short-term memory (LSTM), have been used to predict speech quality scores. In this section, we introduce the related neural networks.

CNNs
CNNs were first proposed for image classification [18]. Compared with a traditional back-propagation neural network, CNNs use local connectivity and weight sharing to retain important parameters and remove a large number of redundant parameters, which leads to better learning results. Because of their outstanding ability to characterize shallow features, CNNs have been introduced to speech-related tasks such as speech recognition [19] and speech quality assessment [7,20]. A conventional CNN consists of convolution layers, pooling layers, and fully connected (FC) layers. When CNNs are used to process audio signals, the input data is often a two-dimensional or three-dimensional array. Before features are sent to the convolutional layer, they need to be normalized in the time or frequency dimension. Each convolution layer is composed of multiple convolution kernels, and each element of a convolution kernel corresponds to a weight coefficient and a bias value. Convolution layers apply the convolution operation to the input and pass the output to the next layer. The outputs of a convolutional layer are called feature maps. There are three main parameters of the convolution layer, namely the size of the convolution kernels, the step size (stride), and the padding. These three parameters determine the size of the feature maps after the convolution operation. ReLU activation [21] is usually used to increase the nonlinearity of models. Recently, batch normalization [22] has been adopted in CNN architectures after convolutional layers to stabilize training. Pooling layers can effectively reduce the size of feature maps, thereby reducing the number of parameters in networks. A time-distributed fully connected layer is applied to the output of the last convolutional layer to predict the quality score of each frame along the time axis. Finally, the predicted scores are aggregated over the time axis to obtain the utterance-wise score.
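As a small illustration of how kernel size, stride, and padding determine the feature-map size, the sketch below applies the standard output-size formula along one axis. The function name `conv_output_size` and the example values are illustrative assumptions, not settings taken from the paper.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length along one axis: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# A 3-wide kernel with stride 1 and padding 1 keeps the axis length unchanged.
same = conv_output_size(64, kernel=3, stride=1, padding=1)
# Without padding the axis shrinks by kernel - 1.
valid = conv_output_size(64, kernel=3, stride=1, padding=0)
# A stride of 2 roughly halves the axis.
strided = conv_output_size(64, kernel=3, stride=2, padding=1)
print(same, valid, strided)
```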

LSTM
Speech is a continuous signal along the time axis that changes according to context, so it is not appropriate to consider it only at a single moment in time. CNNs cannot capture long-term dependencies in a speech utterance, while recurrent neural networks (RNNs) [23] can store history information in their hidden states and thus capture long-term dependencies in sequential data. Therefore, RNNs are better suited than CNNs to modeling the temporal information of speech signals. A problem with the traditional RNN is that it cannot distinguish whether the information from previous moments is useful. In other words, all information from previous moments is passed on, which may cause the gradient to vanish or explode during training. Long short-term memory (LSTM) [24] is a variant of the RNN. It filters information from previous moments using forget gates, input gates, and output gates, which makes it possible to overcome the problem of long-term dependencies. Bidirectional long short-term memory (BLSTM) considers past and future information at the same time, that is, the output is determined by both the preceding and the following inputs. BLSTM is applied in our speech quality assessment systems.
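For reference, the gating described above is commonly written as follows. This is the standard textbook LSTM formulation, with symbols ($W$, $U$, $b$, $\sigma$, $\odot$) introduced here for illustration; the paper does not state which variant it uses.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

In a BLSTM, one such recurrence runs forward in time and another runs backward, and the two hidden states at each frame are typically concatenated.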

CNN-LSTM
CNN-LSTM is a neural network structure that combines CNN and LSTM and it has been recently used for speech quality assessment [2,7,25]. In this structure, CNNs extract deep features of speech and the CNN feature vectors are then used as input for LSTM network that models time dependencies, which means that CNN-LSTM has the advantages of both CNN and LSTM.

Pooling functions
As shown in Fig. 1, in neural network-based speech quality assessment systems, the role of the pooling function is to aggregate frame-level quality scores into an utterance-level quality score. Therefore, the choice of pooling function has a great influence on the final results. However, most researchers have used only the average pooling function [6,7] and did not make further attempts. Pooling functions have been extensively studied and applied in the weakly labeled sound event detection task [8,26-28], where the linear softmax [27] and attention [8] pooling functions achieve strong performance. In this section, we introduce and analyze the max, average, linear softmax, and attention pooling functions in the speech quality assessment system.

Definition of the pooling functions
Let $y_i \in [0, 5]$ be the predicted frame-level quality score at the $i$th frame and $y \in [0, 5]$ be the aggregated utterance-level score. We list the definitions of the four pooling functions to be compared in Table 1.
Table 1 Definition and gradient of four pooling functions
The max pooling function takes the maximum of all frame-level scores $y_i$ as the utterance-level score $y$, which means that only the frame with the largest score has an impact on the final utterance-level score.
The average pooling function [26] takes the average of all frame-level quality scores $y_i$ to get the utterance-level score $y$, which means it assigns an equal weight to all frames.
The linear softmax pooling function computes $y$ as a weighted average of the $y_i$, where the weights are equal to the frame-level scores $y_i$ themselves. In this way, larger $y_i$ receive larger weights. Compared to average pooling, the utterance-level score is mainly determined by frames with larger frame-level scores, and the effect of frames with smaller scores is reduced.
Finally, the attention pooling function is also a weighted average. Unlike linear softmax, the weights $w_i$ for each frame are learnable and are modeled by a dedicated layer in the neural network. The utterance-level score $y$ is then computed using the general weighted average formula of the $y_i$. The attention pooling function appears to be the one most favored by researchers in the sound event detection task because of its flexibility [29,30].
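Written out from the verbal descriptions above, the four pooling functions take the following form, where $n$ denotes the number of frames. This is a reconstruction from the text, not a copy of the original Table 1, so the exact notation there may differ.

```latex
\begin{aligned}
\text{Max:} \quad & y = \max_i \, y_i \\
\text{Average:} \quad & y = \frac{1}{n} \sum_{i} y_i \\
\text{Linear softmax:} \quad & y = \frac{\sum_i y_i^2}{\sum_i y_i} \\
\text{Attention:} \quad & y = \frac{\sum_i w_i \, y_i}{\sum_i w_i}
\end{aligned}
```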

Analysis of the pooling functions
As stated before, we only have utterance-level speech quality labels. When the overall quality of a speech utterance is good, listeners give it a high score. But when even part of a speech utterance is bad, listeners give it a lower score. This means that a speech utterance with a high score should have high scores for every frame, whereas a speech utterance with a low score must contain bad frames but may also contain good frames. Based on this idea, we analyze the gradient of the loss function w.r.t. the frame-level quality scores $y_i$. In the case of attention pooling, we also analyze the gradient w.r.t. the weights $w_i$.
Let $t \in [0, 5]$ be the utterance-level ground truth. The loss function we used is the mean squared error (MSE):
$$L = (y - t)^2$$
The gradient of the loss function w.r.t. the utterance-level quality score is:
$$\frac{\partial L}{\partial y} = 2(y - t)$$
It does not depend on the choice of the pooling function. It is negative when the utterance-level predicted score is smaller than the utterance label ($y < t$) and positive when the utterance-level predicted score is larger than the utterance label ($y > t$).
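According to the chain rule, we can get the gradient of the loss function w.r.t. the frame-level scores $y_i$ and the frame-level weights $w_i$ respectively:
$$\frac{\partial L}{\partial y_i} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial y_i}, \qquad \frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial w_i}$$
Each gradient is the product of two terms, which we analyze separately. The second term, $\partial y/\partial y_i$ (and $\partial y/\partial w_i$), is calculated for the four pooling functions in Table 1.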
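The second terms can be derived directly from the verbal definitions of the pooling functions. The expressions below are a reconstruction under that assumption (with $n$ the number of frames), not a copy of Table 1; they agree with the sign conditions used in the analysis that follows.

```latex
\begin{aligned}
\text{Max:} \quad & \frac{\partial y}{\partial y_i} =
  \begin{cases} 1, & y_i = \max_j y_j \\ 0, & \text{otherwise} \end{cases} \\
\text{Average:} \quad & \frac{\partial y}{\partial y_i} = \frac{1}{n} \\
\text{Linear softmax:} \quad & \frac{\partial y}{\partial y_i} = \frac{2 y_i - y}{\sum_j y_j} \\
\text{Attention:} \quad & \frac{\partial y}{\partial y_i} = \frac{w_i}{\sum_j w_j},
  \qquad \frac{\partial y}{\partial w_i} = \frac{y_i - y}{\sum_j w_j}
\end{aligned}
```

For linear softmax the derivative changes sign at $y_i = y/2$, and for attention pooling $\partial y/\partial w_i$ changes sign at $y_i = y$.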
With the max pooling function, $\partial y/\partial y_i$ equals 1 for the frame with the largest score and 0 elsewhere. As a result, only this frame receives a non-zero gradient during back-propagation. Since we want to evaluate the utterance-level quality as a whole, it seems unreasonable that only the parameters related to the frame with the largest score are updated.
With the average pooling function, $\partial y/\partial y_i$ is always positive and equals $1/n$ for each frame-level score $y_i$, which means the gradient is distributed evenly across all frames. When the utterance predicted score is smaller than the utterance label ($y < t$), the gradient $\partial L/\partial y_i$ is negative, and this boosts the scores $y_i$ of all frames. This is in line with our requirement that good speech utterances have good quality in every frame. When the utterance predicted score is larger than the utterance label ($y > t$), the gradient $\partial L/\partial y_i$ is positive and the scores of all frames are suppressed, which is not what we expect. It may cause the scores of many good frames to be lowered incorrectly.
With the linear softmax pooling function, $\partial y/\partial y_i$ is positive when $y_i > y/2$ and negative when $y_i < y/2$. When the utterance predicted score is smaller than the utterance label ($y < t$), the gradient $\partial L/\partial y_i$ is negative when $y_i > y/2$ and positive when $y_i < y/2$. As a result, larger $y_i$ are boosted, while smaller $y_i$ are suppressed. Making good frames better and bad frames worse is not the desired behavior. When the utterance predicted score is larger than the utterance label ($y > t$), the gradient is positive when $y_i > y/2$ and negative when $y_i < y/2$. As a result, larger $y_i$ are suppressed, while smaller $y_i$ are boosted. This is different from what we expect for low-score speech.
With the attention pooling function, the second term $\partial y/\partial y_i$ is always positive because $w_i$ is always larger than zero. Therefore, the attention pooling function boosts all frames when $y < t$ and suppresses all frames when $y > t$. Unlike with the average pooling function, the strength of the boosting or suppression depends on the learned weight. Because the weights $w_i$ are learned, we should also consider the gradient of the loss function w.r.t. the weights, $\partial L/\partial w_i$. The second term $\partial y/\partial w_i$ is positive where $y_i > y$ and negative where $y_i < y$. When the utterance predicted score is smaller than the utterance label ($y < t$), the gradient $\partial L/\partial w_i$ is negative when $y_i > y$ and positive when $y_i < y$. This causes the weight $w_i$ to rise where the frame-level score $y_i$ is large and to drop where $y_i$ is small, which means frames with larger scores $y_i$ get larger weights $w_i$. This helps the weighted average $y$ to rise faster. When the utterance predicted score is larger than the utterance label ($y > t$), the weight $w_i$ rises where the frame-level score $y_i$ is small and drops where $y_i$ is large, which means that larger weights concentrate on frames with smaller scores.
While the scores of all frames are being lowered, this shift of the weights makes the scores of bad frames drop faster, while the scores of good frames may be spared an excessive drop. This agrees with what we expect for low-score speech.
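The sign behavior discussed in this analysis can be checked numerically. The sketch below is a minimal pure-Python illustration that implements the four pooling functions as described in the text and estimates $\partial y/\partial y_i$ and $\partial y/\partial w_i$ by finite differences. The frame scores and weights are hypothetical values chosen for the example, and the helper names are not from the paper.

```python
def max_pool(y):
    return max(y)

def average_pool(y):
    return sum(y) / len(y)

def linear_softmax_pool(y):
    # Weighted average whose weights are the scores themselves.
    return sum(v * v for v in y) / sum(y)

def attention_pool(y, w):
    # General weighted average with separate (learned) weights.
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at list x."""
    grads = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grads.append((f(hi) - f(lo)) / (2 * eps))
    return grads

scores = [1.0, 2.0, 4.0]      # hypothetical frame-level scores
weights = [0.2, 0.3, 0.5]     # hypothetical positive attention weights

g_max = numeric_grad(max_pool, scores)             # non-zero only at the maximum
g_avg = numeric_grad(average_pool, scores)         # 1/n everywhere
g_lin = numeric_grad(linear_softmax_pool, scores)  # sign flips at y/2
g_att_y = numeric_grad(lambda y: attention_pool(y, weights), scores)
g_att_w = numeric_grad(lambda w: attention_pool(scores, w), weights)
print(g_max, g_avg, g_lin, g_att_y, g_att_w)
```

Here the linear softmax output is 3.0, so only the frame scored 1.0 (below 1.5) gets a negative derivative, while the attention output is 2.8, so only the weight of the frame scored 4.0 gets a positive derivative.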

ITU-T P.563
ITU-T P.563 [3] is a non-intrusive speech quality evaluation standard proposed by ITU-T. P.563 includes three main modules: a simulation module, a speech reconstruction module, and an estimation module. The simulation module extracts feature parameters using the principles of speech and auditory perception. The speech reconstruction module uses parameters extracted from the distorted speech to reconstruct a quasi-pure speech signal. The estimation module determines the type of distortion and gives evaluation scores according to the gap between the input speech and the generated quasi-pure speech. If the input is a severely disturbed speech signal, the difference between the input and the reconstructed signal will be large and the quality score will be low. Conversely, if the input is a clean speech signal, the quality score will be high.

Autoencoder-SVR
Autoencoder-SVR, proposed in [16], uses an autoencoder to extract bottleneck features of speech signals and then maps the features to the predicted MOS using SVR. The method trains the autoencoder and the SVR in turn. First, the autoencoder is trained on training speech signals represented by log-power spectra features. Then, the parameters of the autoencoder are fixed. Next, bottleneck features extracted from the trained autoencoder and the corresponding MOS values are used to train the SVR mapping model. Autoencoder-SVR is not an end-to-end trained model, since the autoencoder and the SVR are trained separately. Therefore, its final performance depends on both the autoencoder and the SVR, and its training process is more complicated and difficult than that of end-to-end methods.

Database
We evaluate the proposed method on a narrowband MOS-labeled database including both clean and degraded speech signals, which come from subjective Chinese listening tests designed by Beijing Institute of Technology.
All speech signals are processed from data in the NTT-AT Chinese corpus. The database consists of 1248 speech pairs with subjective MOS ranging from 1 to 5. All speech utterances are sampled at 8 kHz with 16-bit resolution and are 8 s long. Six professional listeners scored each sentence in a professional acoustics laboratory. The final MOS of each speech utterance is the average of the six listeners' scores. Over the whole corpus, the average variance of the scores for each speech utterance is 0.7. The database contains many processing conditions, including different standard codecs, acoustic background noise, and the modulated noise reference unit (MNRU) at various levels. Table 2 shows the details of the conditions. The number of speech utterances in each condition with each background noise is 24. Without considering the background noise, approximately 90% of the speech files (1100 samples) under each distortion condition were randomly selected as the training set, while the remaining data (148 samples) were used for testing. Shan and Wang previously conducted experiments on this database in [16] and [31].

Feature
We use the log mel spectrogram as the input feature, following previous work on deep learning-based speech quality assessment [2]. A short-time Fourier transform (STFT) with a Hanning window of 256 samples and a hop size of 80 samples is applied to extract the spectrogram. We apply 64 mel filter banks to the spectrogram to obtain the log mel spectrogram. The mel filter banks have a lower cut-off frequency of 50 Hz to remove low-frequency noise. We use the torchlibrosa [32] package to build the log mel spectrogram extraction.
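To make the framing parameters concrete, the arithmetic below converts the window and hop sizes to milliseconds and counts the STFT frames of one 8 s utterance sampled at 8 kHz. The frame count depends on the padding convention, which the text does not state, so both common conventions are shown as assumptions.

```python
SAMPLE_RATE = 8000   # Hz, as stated for the database
DURATION_S = 8       # seconds per utterance
WINDOW = 256         # STFT window length in samples
HOP = 80             # hop size in samples

n_samples = SAMPLE_RATE * DURATION_S
window_ms = 1000 * WINDOW / SAMPLE_RATE
hop_ms = 1000 * HOP / SAMPLE_RATE

# Assumption A: no padding, only complete windows are kept.
frames_unpadded = 1 + (n_samples - WINDOW) // HOP
# Assumption B: the signal is padded by WINDOW // 2 on both sides
# (centre padding), a common default in STFT implementations.
frames_centered = 1 + n_samples // HOP

print(window_ms, hop_ms, frames_unpadded, frames_centered)
```

Under either assumption, each utterance yields roughly 800 frames of 64 mel bins, with a 32 ms window advanced in 10 ms steps.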

Data augmentation
We use SpecAugment [33] as our data augmentation method to prevent the systems from overfitting. SpecAugment is a simple data augmentation method that is applied to the feature inputs of a neural network. The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. In our speech quality assessment systems, SpecAugment is applied to the log mel spectrogram of a speech utterance using frequency masking and time masking. Frequency masking is applied so that $f$ consecutive mel frequency bins $[f_0, f_0 + f)$ are masked, where $f$ is chosen from a uniform distribution from 0 to a frequency mask parameter $f'$, and $f_0$ is chosen from $[0, F - f)$, where $F$ is the number of mel frequency bins [33]. More than one frequency mask can be applied to each log mel spectrogram. Frequency masking can improve the robustness of our systems to frequency distortion of speech utterances [33]. Time masking is applied in the time domain and is analogous to frequency masking.
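The masking step can be sketched in a few lines of pure Python. This is a toy illustration, not the implementation used in the paper: the helper `apply_mask`, the mask value of 0.0, the toy spectrogram size, and the mask parameters are all assumptions made for the example.

```python
import random

def apply_mask(spec, axis, max_width, rng):
    """Mask one random block of consecutive indices along `axis`.

    spec is a list of time frames, each a list of mel-bin values.
    axis=0 masks time steps, axis=1 masks mel frequency bins.
    Returns a masked copy and the interval (start, width).
    """
    size = len(spec) if axis == 0 else len(spec[0])
    width = rng.randint(0, min(max_width, size - 1))  # mask width
    start = rng.randrange(0, size - width)            # start in [0, size - width)
    masked = [row[:] for row in spec]                 # leave the input untouched
    if axis == 0:
        for t in range(start, start + width):
            masked[t] = [0.0] * len(masked[t])
    else:
        for row in masked:
            for k in range(start, start + width):
                row[k] = 0.0
    return masked, (start, width)

rng = random.Random(0)
spec = [[1.0] * 8 for _ in range(10)]  # toy input: 10 frames x 8 mel bins
freq_masked, (f0, f) = apply_mask(spec, axis=1, max_width=3, rng=rng)
time_masked, (t0, t) = apply_mask(freq_masked, axis=0, max_width=4, rng=rng)
```

Applying the function several times gives multiple masks per spectrogram, as the text allows.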

Model
The detailed configuration of the different structures in our system, namely the CNN, BLSTM, and CNN-LSTM structures, is shown in Table 3. These structures have been shown to perform well on speech quality assessment [7]. In the CNN and CNN-LSTM structures, the convolutional part includes 4 convolutional blocks. Each convolutional block consists of 2 convolutional layers with kernel sizes of 3×3. Batch normalization and the ReLU function are applied after each convolutional layer. The four convolutional blocks have 8, 16, 32, and 64 kernels, respectively. In Table 3, the symbol C following @ represents the number of kernels. A 2×2 average pooling is applied after each of the first three convolutional blocks. A 1×8 average pooling is applied after the last convolutional block to average out the frequency axis. In the BLSTM and CNN-LSTM structures, a BLSTM with 32 hidden states is used in the recurrent part. Then, in all three model structures, a time-distributed fully connected layer with the ReLU function is applied to predict the quality score of each time frame. To obtain the utterance-level prediction for supervised learning, aggregation functions including max, average, linear softmax, and attention pooling are applied along the time frames. For the attention pooling function, a separate fully connected layer with softmax activation is used to generate the weights.
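The effect of the pooling layers on the feature-map size can be traced with simple arithmetic. The sketch below assumes that the 3×3 convolutions are padded so that they preserve the time and frequency sizes, that pooling rounds down, and that an utterance has 801 input frames. These are illustrative assumptions, since the text does not state the padding or the exact frame count.

```python
def trace_shapes(n_frames, n_mels=64):
    """Follow (time, frequency) sizes through the four convolutional blocks."""
    t, f = n_frames, n_mels
    shapes = [(t, f)]
    for _ in range(3):        # blocks 1-3: 2x2 average pooling halves both axes
        t, f = t // 2, f // 2
        shapes.append((t, f))
    f = f // 8                # block 4: 1x8 average pooling over frequency only
    shapes.append((t, f))
    return shapes

shapes = trace_shapes(801)    # 801 frames is an assumed input length
print(shapes)
```

With 64 mel bins, three halvings leave 8 frequency bins, which the final 1×8 pooling reduces to one, so each remaining time step is described by the 64 channels of the last block.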

Training
To reduce the influence of chance on the experiments, for each model structure we trained ten models using 10-fold cross-validation and obtained 10 corresponding results on the test set. We took the average of the 10 results as the final result for each model structure.
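One way to form such folds is sketched below. It assumes that the 1100 training utterances are shuffled once and split into ten equal folds, each used once as the validation set; the text does not describe the exact splitting procedure, and `k_fold_indices` is a hypothetical helper.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k interleaved folds
    for i in range(k):
        val = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val

splits = list(k_fold_indices(1100, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each of the ten models is then evaluated on the same held-out test set, and the ten test results are averaged.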
During model training, we use the Adam [34] optimizer with an initial learning rate of 0.001. The learning rate is multiplied by 0.1 if the validation loss does not decrease within 5 epochs, and training stops if the validation loss does not decrease within 20 epochs. The maximum number of training epochs is 80. The mini-batch size is 32. The networks are trained using the PyTorch toolkit.
Liu et al. EURASIP Journal on Audio, Speech, and Music Processing (2021) 2021:20
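The learning-rate decay and early-stopping rules of the training setup can be simulated on a sequence of validation losses. The sketch below is a hand-rolled interpretation, not the training code of the paper: it assumes that "within 5 epochs" means five consecutive epochs without a new best validation loss, and the loss values are synthetic.

```python
def simulate_schedule(val_losses, lr=1e-3, factor=0.1,
                      lr_patience=5, stop_patience=20, max_epochs=80):
    """Return (epochs_run, final_lr, lr_used_per_epoch)."""
    best = float("inf")
    since_best = 0        # epochs since the best validation loss
    since_lr_cut = 0      # non-improving epochs since the last decay
    lrs = []
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        lrs.append(lr)
        if loss < best:
            best, since_best, since_lr_cut = loss, 0, 0
        else:
            since_best += 1
            since_lr_cut += 1
        if since_lr_cut >= lr_patience:       # decay the learning rate
            lr *= factor
            since_lr_cut = 0
        if since_best >= stop_patience:       # early stopping
            return epoch, lr, lrs
    return len(lrs), lr, lrs

# Synthetic losses: ten improving epochs, then a plateau.
plateau = [1.0 - 0.05 * e for e in range(10)] + [0.6] * 70
epochs_run, final_lr, lrs = simulate_schedule(plateau)
# Synthetic losses that keep improving: training runs to the epoch limit.
full_run, full_lr, _ = simulate_schedule([1.0 / (e + 1) for e in range(100)])
print(epochs_run, full_run)
```

In the plateau example the learning rate is cut after epochs 15, 20, 25, and 30, and training stops at epoch 30, twenty epochs after the last improvement.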

Evaluation metrics
To evaluate the performance of the systems, we use the correlation coefficient (R) and the root-mean-square error (RMSE) between the predicted score $\hat{S}_k$ and the subjective score $S_k$ of each speech utterance $k$. The correlation coefficient is defined as:
$$R = \frac{\sum_{k=1}^{N} (\hat{S}_k - \bar{\hat{S}})(S_k - \bar{S})}{\sqrt{\sum_{k=1}^{N} (\hat{S}_k - \bar{\hat{S}})^2 \sum_{k=1}^{N} (S_k - \bar{S})^2}}$$
where $\bar{\hat{S}}$ is the average of $\hat{S}_k$, $\bar{S}$ is the average of $S_k$, and $N$ is the number of MOS-labeled utterances in the test set. The RMSE of the MOS is defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} (\hat{S}_k - S_k)^2}$$
A larger R and a smaller RMSE indicate better performance.
Table 4 shows the R and RMSE results for the different model structures. Comparing the results of the twelve models, we find that attention pooling performs better than the other three pooling functions regardless of the model structure. This shows that attention pooling is robust across different model structures. The CNN-LSTM structure is slightly better overall than the CNN and BLSTM structures because it learns well in both the time domain and the frequency domain. The highest R of 0.967 and the lowest RMSE of 0.269 are achieved by the CNN-LSTM model using the attention pooling function.
Figure 2 shows scatter plots of predicted MOS versus subjective MOS for the test speech signals obtained from the twelve models. The red diagonal line is the ideal case in which the objective MOS equals the subjective MOS. The blue dots represent the individual test samples. Observing how closely the data points align with the diagonal line, we can see that the results of the models using the attention pooling function lie closer to the diagonal than those of the models using the max, average, and linear softmax pooling functions.
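The two metrics can be computed directly from their standard definitions (Pearson correlation and root-mean-square error). The sketch below uses hypothetical scores, not data from the paper, and shows why both metrics are reported together.

```python
import math

def pearson_r(subjective, predicted):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(subjective)
    mean_s = sum(subjective) / n
    mean_p = sum(predicted) / n
    cov = sum((s - mean_s) * (p - mean_p) for s, p in zip(subjective, predicted))
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_s * var_p)

def rmse(subjective, predicted):
    """Root-mean-square error between two equal-length score lists."""
    n = len(subjective)
    return math.sqrt(sum((s - p) ** 2 for s, p in zip(subjective, predicted)) / n)

mos = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical subjective scores
pred = [1.5, 2.5, 3.5, 4.5, 5.5]   # hypothetical predictions with a constant offset

r_value = pearson_r(mos, pred)
rmse_value = rmse(mos, pred)
print(r_value, rmse_value)
```

A constant offset leaves R at its maximum while RMSE still registers the error, so the two metrics capture different aspects of prediction quality.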

Comparison of different methods
The results of the different methods on the test set are shown in Table 5. On the one hand, the proposed method performs much better than P.563, which means that our neural network-based non-intrusive assessment method is a significant improvement over traditional signal processing methods. On the other hand, the proposed method outperforms the autoencoder-SVR method [16], with a 1.4% relative increase in R and a 12.7% relative reduction in RMSE. This shows that our method also has advantages over this machine learning-based method.

Conclusion
In this paper, we propose a neural network-based non-intrusive speech quality assessment method using an attention pooling function. We conduct experiments to compare four pooling functions, among which attention pooling proves to be the best. The experimental results show that the proposed method significantly improves on the ITU-T P.563 standard and the autoencoder-SVR method. Specifically, the CNN-LSTM model using the attention pooling function achieves the highest R of 0.967 and the lowest RMSE of 0.269. In the future, we will continue our research on non-intrusive speech quality assessment methods, considering the effects of different conditions and languages.