A speech enhancement algorithm based on a non-negative hidden Markov model and Kullback-Leibler divergence

In this paper, we propose a supervised single-channel speech enhancement method that combines Kullback-Leibler (KL) divergence-based non-negative matrix factorization (NMF) and a hidden Markov model (NMF-HMM). With the integration of the HMM, the temporal dynamics information of speech signals can be taken into account. This method includes a training stage and an enhancement stage. In the training stage, the sum of the Poisson distribution, leading to the KL divergence measure, is used as the observation model for each state of the HMM. This ensures that a computationally efficient multiplicative update can be used for the parameter update of this model. In the online enhancement stage, a novel minimum mean square error estimator is proposed for the NMF-HMM. This estimator can be implemented using parallel computing, reducing the time complexity. Moreover, compared to the traditional NMF-based speech enhancement methods, the experimental results show that our proposed algorithm improved the short-time objective intelligibility and perceptual evaluation of speech quality by 5% and 0.18, respectively.


Introduction
Single-channel speech enhancement technology is widely used in daily life, for example in speech coding, teleconferencing, hearing aids, mobile communication, and robust automatic speech recognition (ASR) [1,2]. In general, the purpose of speech enhancement is to remove background noise from an audio source while preserving the clean speech, thereby improving the quality and intelligibility of noisy speech [3]. Currently, single-channel speech enhancement is an active topic of research.
During the past decades, many different monaural speech enhancement approaches have been proposed [2,4]. In an environment with additive noise, the simplest approach to speech enhancement is the spectral subtraction algorithm [5], which subtracts the estimated noise spectrum from the observed signal to acquire the desired clean speech. Other unsupervised methods, such as the signal subspace algorithm [6][7][8][9], Wiener filtering [10], the minimum mean square error (MMSE) spectral amplitude estimator [11], and the log-MMSE spectral amplitude estimator [12], are effective strategies for speech enhancement when the noise is stationary. These methods have low computational complexity and have been widely applied in various areas. However, these approaches cannot always achieve satisfactory performance for non-stationary noise and usually introduce musical noise because they do not make the best use of the prior information of the speech and noise [13]. Moreover, most unsupervised methods rely on assumed statistical properties of the speech and noise signals, and these assumptions are difficult to meet in actual noisy scenarios [14]. Therefore, supervised speech enhancement approaches have been developed. For instance, Kavalekalam et al. [15] proposed a codebook-based Kalman filter speech enhancement method, which showed a significant improvement in speech intelligibility in a listening test. In addition, Srinivasan et al. [16] proposed a codebook-driven speech enhancement algorithm for non-stationary noise. In this work, the auto-regressive (AR) spectral shape codebooks of speech and noise were pre-trained. In the enhancement stage, the codebooks were used to build a Wiener filter to conduct speech enhancement. Inspired by this research, many other codebook-based speech enhancement approaches have been developed [17,18].
Furthermore, an autoregressive hidden Markov model (ARHMM) [19,20] has also been shown to be an effective supervised speech enhancement method because it considers the temporal information of the speech signal.

In recent years, advances in deep learning techniques [21,22], specifically deep neural networks (DNNs), have significantly promoted the development of speech enhancement [23]. These methods usually rely on fewer assumptions [3,14,23] about the relationship between the noise and clean speech, so they have great potential to achieve better speech enhancement performance. Xu et al. [3,14] applied a feedforward multilayer perceptron (MLP) to map noisy log-power spectrum (LPS) features to the LPS features of clean speech; the enhanced speech could then be obtained directly by waveform reconstruction. Compared to the MMSE estimator [12], this method achieved better performance in various noisy environments. Wang et al. [24,25] also utilized an MLP to estimate the ideal ratio mask (IRM) and ideal binary mask (IBM) for speech enhancement and also achieved satisfactory performance. Motivated by this work, researchers have used different DNN structures to conduct speech enhancement, such as fully convolutional neural networks (FCN) [26], deep recurrent neural networks (DRNN) [27,28], and generative adversarial networks (GANs) [29,30]. These methods can help ASR systems achieve higher recognition accuracy in noisy environments. However, generalization remains a problem that needs to be considered for these DNN-based algorithms [31,32].
A non-negative matrix factorization (NMF)-based speech enhancement algorithm [33][34][35] can also be viewed as a kind of supervised speech enhancement method. NMF-based methods usually include a training and enhancement stage. In [36], a mask-based NMF speech enhancement method was proposed. In the training stage, the basis matrix of clean speech and noise was trained. In the enhancement stage, the activation matrix could be acquired by combining the trained basis matrix and noisy signal. The mask was then estimated to conduct the speech enhancement. Additionally, an NMF-based denoising scheme was described in [37,38], which added a heuristic term to the cost function, so the NMF coefficients could be adjusted according to the long-term levels of the signals. A parametric NMF method for speech enhancement was proposed in [17]. This method applied the AR coefficient and codebook to build the basis matrix. This strategy effectively improved the speech intelligibility. Moreover, some DNN-based NMF methods represent an effective strategy for conducting speech enhancement [39,40]. In general, the basis matrix could be acquired using the traditional NMF method, and the activation matrix could be estimated by applying a DNN, which improved the accuracy of the estimated activation matrix. Thus, it could achieve a higher perceptual evaluation of speech quality (PESQ) [41] and short-time objective intelligibility (STOI) [42] scores than traditional NMF-based speech enhancement methods. The combination of DNN and NMF could also help the ASR system achieve a lower word error rate (WER) in noisy environments. In [43], a DNN-NMF-based method achieved excellent performance in the Computational Hearing in Multisource Environments (CHiME)-3 challenge. To capture temporal information, some HMM-based NMF speech enhancement methods have been proposed. Mohammadiha et al. [44] proposed a supervised and unsupervised NMF speech enhancement method. 
In [44], an HMM was used for modeling the temporal change of different noise types. In [45], a non-negative factorial HMM was used to model sound mixtures and showed superior performance in source separation tasks. In [46], an HMM-DNN NMF speech enhancement algorithm was proposed, which applied a clustering method to acquire the HMM-based basis matrix and used the Viterbi algorithm to obtain the ideal state label for the DNN training. In the enhancement stage, the DNN was used to find the corresponding state to conduct speech enhancement.
In this paper, we propose a novel NMF-HMM speech enhancement method based on the Kullback-Leibler (KL) divergence, expanding on our preliminary work [47]. Our preliminary work briefly verified the effectiveness of an NMF-HMM for speech enhancement [47,48], but it did not consider the effect of the model parameters, which is very important for optimizing the algorithm performance. Additionally, its performance in various noisy environments was not investigated. In this paper, we expand our preliminary research on these two aspects. Compared to other HMM-based methods [44,45,49], our method uses the HMM to capture the temporal dynamics of both the speech and noise signals. Moreover, we use the sum of the Poisson distribution as the state-conditioned likelihood for the HMM, rather than the general Gaussian mixture model (GMM), because the sum of the Poisson distribution leads to the KL divergence measure. KL divergence is a mainstream measure in NMF, and its parameter update rule is identical to the multiplicative update rule. This ensures that the parameter update is computationally efficient during the training stage. In the enhancement stage, in contrast with previous works [44,45], we propose a novel NMF-HMM-based MMSE estimator to perform the online enhancement. A major benefit of the proposed algorithm is that the activation matrices can be updated by parallel computing in the online stage, which effectively reduces the computational time. In this paper, we also give a more detailed derivation of the preliminary NMF-HMM-based algorithm [47]. Moreover, the proposed method is compared with other state-of-the-art speech enhancement algorithms, which further indicates the advantages of the proposed algorithm.
The rest of this paper is organized as follows. First, we briefly review the general NMF-based speech enhancement method with KL divergence in Section 2. The proposed HMM-based signal model is introduced in Section 3. The offline parameter learning, the proposed MMSE estimator, and the online speech enhancement process are detailed in Section 4. The experimental comparison and analysis of results are presented in Section 5, and we draw conclusions in Section 6.

NMF-based speech enhancement method with KL divergence
In this section, we briefly review NMF-based speech enhancement with KL divergence. Under the additive noise assumption, the noisy signal model can be expressed as:
$$y(t) = s(t) + m(t), \tag{1}$$
where $y(t)$, $s(t)$, and $m(t)$ denote the noisy signal, clean speech, and noise, respectively, and $t$ is the time index. With (1), the short-time Fourier transform (STFT) of $y(t)$ can be written as:
$$Y(f,n) = S(f,n) + M(f,n), \tag{2}$$
where $Y(f,n)$, $S(f,n)$, and $M(f,n)$ denote the frequency spectra of $y(t)$, $s(t)$, and $m(t)$, respectively. Here, $f \in [1, F]$ and $n \in [1, N]$ denote the frequency bin and time frame indices, respectively. Collecting the $F$ frequency bins and $N$ time frames, we define the magnitude spectrum matrices $Y_N$, $S_N$, and $M_N$, where $Y_N = [y_1, \cdots, y_n, \cdots, y_N]$ and $y_n = [|Y(1,n)|, \cdots, |Y(f,n)|, \cdots, |Y(F,n)|]^T$. The vectors $s_n$ and $m_n$ are defined similarly to $y_n$, and $S_N$ and $M_N$ are defined similarly to $Y_N$. We assume that $Y_N = S_N + M_N$. The classical NMF-based speech enhancement has two stages: training and enhancement. In the training stage, the clean speech basis matrix $\bar{W}$ and noise basis matrix $\ddot{W}$ are trained using clean speech and noise databases, respectively. Many cost functions have been proposed for NMF, such as KL divergence [34], Itakura-Saito (IS) divergence [50], β divergence, and Euclidean distance [51]. In this paper, we focus on the KL divergence measure. There are two reasons for this choice. First, compared with other cost functions, the best speech enhancement performance can be achieved using the KL divergence-based NMF with the magnitude spectrum [52]. Second, the efficient multiplicative update (MU) rule of the KL divergence-based NMF can also be derived statistically using the expectation maximization (EM) algorithm [53]. For two matrices $B$ and $\hat{B}$, the KL divergence measure is defined as:
$$D_{KL}(B \,\|\, \hat{B}) = \sum_{i,j} \left( b_{i,j} \log \frac{b_{i,j}}{\hat{b}_{i,j}} - b_{i,j} + \hat{b}_{i,j} \right), \tag{3}$$
where $b_{i,j}$ and $\hat{b}_{i,j}$ denote the elements from the $i$th row and $j$th column of the matrices $B$ and $\hat{B}$, respectively.
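As a numerical illustration of the KL divergence measure between two non-negative matrices, the following minimal sketch assumes the generalized (unnormalized) form commonly used for NMF. The function name `generalized_kl` and the small constant `eps` are illustrative assumptions and are not part of the original method.

```python
import numpy as np

def generalized_kl(B, B_hat, eps=1e-12):
    """Generalized KL divergence D(B || B_hat), summed over all entries.

    eps is an illustrative safeguard against log(0) and division by zero.
    """
    B = np.asarray(B, dtype=float)
    B_hat = np.asarray(B_hat, dtype=float)
    return float(np.sum(B * np.log((B + eps) / (B_hat + eps)) - B + B_hat))

# The divergence is zero when both matrices coincide and positive otherwise.
X = np.array([[1.0, 2.0]])
d_same = generalized_kl(X, X)
d_diff = generalized_kl(X, np.array([[2.0, 1.0]]))
```

For the pair above, the linear terms cancel and the value reduces to log 2, which is a quick sanity check on the sign conventions.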
Using speech basis matrix training as an example, the cost function of the KL divergence-based NMF for training $\bar{W}$ can be written as:
$$(\bar{W}, \bar{H}) = \arg\min_{W, H} D_{KL}(S_N \,\|\, W H). \tag{4}$$
Noise basis matrix training is similar to speech basis matrix training. In [34], it is derived that $\bar{W}$ and $\bar{H}$ can be obtained iteratively using the following multiplicative update rules:
$$\bar{W} \leftarrow \bar{W} \odot \frac{\frac{S_N}{\bar{W}\bar{H}} \bar{H}^T}{\mathbf{1} \bar{H}^T}, \tag{5}$$
$$\bar{H} \leftarrow \bar{H} \odot \frac{\bar{W}^T \frac{S_N}{\bar{W}\bar{H}}}{\bar{W}^T \mathbf{1}}, \tag{6}$$
where $\odot$ and all divisions are element-wise multiplication and division operations, respectively, and $\mathbf{1}$ is a matrix of ones with the same size as $S_N$. In the enhancement stage, the noisy speech basis matrix $W$ can be constructed by concatenating the speech and noise basis matrices, $W = [\bar{W}, \ddot{W}]$. The activation matrix $H$ of the noisy speech can be estimated iteratively by replacing $S_N$, $\bar{W}$, and $\bar{H}$ in (6) with $Y_N$, $W$, and $H$, respectively. The enhanced signal can be obtained using various algorithms [36,37,44,45]. One popular approach is to apply a Wiener-type spectral gain to each noisy frame:
$$\hat{s}_n = y_n \odot g_n^{NMF}, \qquad g_n^{NMF} = \frac{\bar{W} \bar{h}_n}{\bar{W} \bar{h}_n + \ddot{W} \ddot{h}_n}, \tag{7}$$
$$[\bar{h}_n^T, \ddot{h}_n^T]^T = \arg\min_{h \geq 0} D_{KL}(y_n \,\|\, W h), \tag{8}$$
where (8) can be solved iteratively using (6). Apart from the gradient descent derivation of the MU update rules (5) and (6) presented in [34], it is further shown in [53] that the MU update rules can be derived from a statistical perspective. More specifically, the KL divergence-based NMF can be motivated from the following hierarchical statistical model:
$$S_N = \sum_{k=1}^{K} C(k), \tag{9}$$
$$c_{f,n}(k) \sim \mathcal{PO}\big(c_{f,n}(k);\, \bar{W}_{f,k} \bar{H}_{k,n}\big), \tag{10}$$
where $\mathcal{PO}(x; \lambda) = \lambda^{x} e^{-\lambda} / \Gamma(x+1)$ denotes the Poisson distribution, $\Gamma(x+1) = x!$ denotes the gamma function for positive integer $x$, $K$ denotes the number of basis vectors, $C(k)$ is the latent matrix, and $c_{f,n}(k)$ denotes the element of $C(k)$ in the $f$th row and $n$th column. Note that $c_{f,n}(k)$ is assumed to have a Poisson distribution, which can only be used for discrete variables. However, in practice, this hierarchical statistical model is not limited to discrete variables because the gamma function for continuous variables can be used to replace the factorial calculation [53]. It has been shown in [53] that the iterative update of the parameters $\bar{H}$ and $\bar{W}$ using the EM algorithm is identical to the multiplicative update rules shown in (5) and (6).
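The multiplicative update training described above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard KL-divergence multiplicative updates, not the authors' implementation; the function names, random initialization, iteration count, and `eps` safeguard are assumptions made for the example.

```python
import numpy as np

def generalized_kl(S, V, eps=1e-12):
    """Generalized KL divergence D(S || V) summed over all entries."""
    return float(np.sum(S * np.log((S + eps) / (V + eps)) - S + V))

def nmf_kl(S, K, n_iter=100, seed=0, eps=1e-12):
    """Factorize a non-negative matrix S (F x N) as W (F x K) times H (K x N)."""
    rng = np.random.default_rng(seed)
    F, N = S.shape
    W = rng.random((F, K)) + eps   # random non-negative initialization (assumed)
    H = rng.random((K, N)) + eps
    ones = np.ones_like(S)
    for _ in range(n_iter):
        # Basis update: W <- W * ((S / WH) H^T) / (1 H^T)
        W *= ((S / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
        # Activation update: H <- H * (W^T (S / WH)) / (W^T 1)
        H *= (W.T @ (S / (W @ H + eps))) / (W.T @ ones + eps)
    return W, H

# Toy magnitude "spectrogram": 20 frequency bins, 30 frames.
S = np.random.default_rng(1).random((20, 30)) + 0.1
W1, H1 = nmf_kl(S, K=4, n_iter=1)
W50, H50 = nmf_kl(S, K=4, n_iter=50)
```

Because the updates only multiply by non-negative ratios, non-negativity of `W` and `H` is preserved automatically, and the divergence after 50 iterations is lower than after a single iteration from the same initialization.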
One of the advantages of the classical NMF-based method for speech enhancement is that the computationally efficient MU rules can be applied. However, the temporal dynamics of speech and noise are not taken into account. To incorporate the temporal dynamics of audio signals, an HMM is used in [45] for source separation. However, its parameter update rules are computationally complex, and the method in [45] can only perform offline enhancement. In this paper, we propose an NMF-based speech enhancement algorithm that uses the HMM to take the temporal aspects of both the speech and noise into account. The proposed approach achieves efficient parameter updates. Moreover, an online MMSE estimator for speech enhancement is derived. Other methods have also considered temporal dynamics for speech enhancement, such as simply stacking multiple frames into a vector [14,54], using the DRNN [28], and non-negative matrix deconvolution [55], but they suffer from high computational complexity, and their large model size leads to high storage complexity. The proposed method achieves a higher PESQ score than the referenced DNN-based method for unseen noise and also has lower complexity.

HMM-based signal models with the KL divergence
In this section, we present the details of the proposed signal models, including the speech and noise signal models and the noisy signal model.

Speech and noise signal models
In this work, the same signal model is used for both the clean speech and noise signals, so we derive the equations using only the clean speech signal. Additionally, we use the overbar ($\bar{\cdot}$) and double dots ($\ddot{\cdot}$) to represent the clean speech and noise, respectively. To consider the temporal dynamic information of the speech and noise, we use the HMM. Following the conditional independence property of the standard HMM [56], the likelihood function can be expressed as follows:
$$p(S_N; \Theta) = \sum_{x_N} \prod_{n=1}^{N} p(s_n | x_n)\, p(x_n | x_{n-1}), \tag{11}$$
where $x_N = [x_1, \cdots, x_n, \cdots, x_N]^T$ is a collection of states, $x_n \in \{1, 2, \cdots, J\}$ denotes the state at the $n$th frame, and $J$ denotes the total number of states. The function $p(x_n | x_{n-1})$ denotes the state transition probability from state $x_{n-1}$ to $x_n$, with $p(x_1 | x_0)$ being the initial state probability. $p(s_n | x_n)$ is the state-conditioned likelihood function, and $\Theta$ is a collection of modeling parameters. Next, we describe the state transition probability and the state-conditioned likelihood function, respectively, for the proposed signal model. The state transition probability $p(x_n | x_{n-1})$: Following the standard HMM, we use a first-order Markov chain to model the state transition, that is:
$$p(x_n | x_{n-1}) = \prod_{i=1}^{J} \prod_{j=1}^{J} A_{i,j}^{\,l(x_{n-1} = i,\; x_n = j)}, \quad n > 1, \tag{12}$$
$$p(x_1 | x_0) = \prod_{j=1}^{J} \pi_j^{\,l(x_1 = j)}, \tag{13}$$
where $l(\cdot)$ denotes an indicator function, which is one when the logic expression in the parentheses is true and zero otherwise. In addition, $A_{i,j}$ and $\pi_j$ denote the transition probability from state $i$ to state $j$ and the initial probability for the first frame's state $x_1$ being state $j$, respectively. Collecting all the initial and transition probabilities, we can write them into matrix forms, $\pi = [\pi_1, \cdots, \pi_j, \cdots, \pi_J]^T$ and $A$ with $A_{i,j}$ being the element at the $i$th row and $j$th column. Therefore, the modeling parameters of the HMM can be expressed as $\Theta_{hmm} = \{A, \pi, J\}$. The modeling parameters $A$ and $\pi$ with a predefined $J$ can be trained through the EM algorithm shown in the next section.
In the experiments, we investigate the impact of the total number of states $J$. The state-conditioned likelihood function: Next, we present the proposed state-conditioned likelihood function. Motivated by the good speech enhancement performance, the computationally efficient MU rule, and the equivalence between the gradient descent derivation and the EM algorithm for the KL divergence-based NMF, we propose to use the statistical model in (9) and (10) to build the state-conditioned likelihood function, that is:
$$s_n = \sum_{k=1}^{K} c_n(k), \tag{14}$$
$$c_{f,n}(k) \sim \mathcal{PO}\big(c_{f,n}(k);\, \bar{W}^{x_n}_{f,k} \bar{H}^{x_n}_{k,n}\big), \tag{15}$$
where $K$ is the number of basis vectors, $c_n(k) = [c_{1,n}(k), \cdots, c_{F,n}(k)]^T$ contains the hidden variables, and $\bar{W}^{x_n}_{f,k}$ and $\bar{H}^{x_n}_{k,n}$ correspond to the elements of the basis and activation matrices, respectively. By writing $c_n = [c_n(1)^T, c_n(2)^T, \cdots, c_n(K)^T]^T$ and integrating out $c_n$, the state-conditioned likelihood function can be written as:
$$p(s_n | x_n) = \prod_{f=1}^{F} \mathcal{PO}\Big(s_{f,n};\, \sum_{k=1}^{K} \bar{W}^{x_n}_{f,k} \bar{H}^{x_n}_{k,n}\Big), \tag{16}$$
where we use the superposition property of the Poisson random variable [53].
In the experiments, we investigate the impact of the number of basis vectors $K$ and the number of states $J$. It will also be shown that a multiplicative update rule can be derived for updating the basis and activation matrices of the proposed state-conditioned likelihood function.
To summarize, five types of parameters in the parameter set $\Theta = \Theta_{hmm} \cup \Theta_{like}$ can be identified. They are the transition matrix $A$, the initial state probabilities in $\pi$, the basis matrices of the different states $\{\bar{W}^j\}$, the activation matrices of the different states $\{\bar{H}^j\}$, and the modeling parameters $K$ and $J$.
In this paper, the modeling parameters K and J are predefined, the activation matrices {H j } are estimated by online speech enhancement, and the other three types of parameters are obtained using offline learning.

Noisy speech model
Based on the proposed clean speech and noise signal models (1) and (2), the noisy speech model can be defined. We assume that there are a total of $\ddot{J}$ hidden states for the noise, and the hidden state of the noise is $\ddot{x}_n$ ($\ddot{x}_n \in \{1, 2, \cdots, \ddot{J}\}$). The notations $\ddot{\pi}$ and $\ddot{A}$ correspond to the initial state probability and transition probability matrix of the noise. Thus, there are a total of $J \times \ddot{J}$ hidden states for the noisy speech. Each composite state consists of a pair of states of clean speech $x_n$ and noise $\ddot{x}_n$. Thus, if we list the state space for a noisy signal, we have $(x_n = 1, \ddot{x}_n = 1), (x_n = 1, \ddot{x}_n = 2), \cdots, (x_n = 1, \ddot{x}_n = \ddot{J}); (x_n = 2, \ddot{x}_n = 1), (x_n = 2, \ddot{x}_n = 2), \cdots, (x_n = 2, \ddot{x}_n = \ddot{J}); \cdots; (x_n = J, \ddot{x}_n = 1), (x_n = J, \ddot{x}_n = 2), \cdots, (x_n = J, \ddot{x}_n = \ddot{J})$. Moreover, the initial state and transition probability matrices of the noisy speech can be expressed as $\pi \otimes \ddot{\pi}$ and $A \otimes \ddot{A}$, where $\otimes$ denotes the Kronecker product. Finally, the state-conditioned likelihood function of the noisy speech can be written as follows:
$$p(y_n | x_n, \ddot{x}_n) = \prod_{f=1}^{F} \mathcal{PO}\Big(y_{f,n};\, \sum_{k=1}^{K} \bar{W}^{x_n}_{f,k} \bar{H}^{x_n}_{k,n} + \sum_{k=1}^{\ddot{K}} \ddot{W}^{\ddot{x}_n}_{f,k} \ddot{H}^{\ddot{x}_n}_{k,n}\Big), \tag{17}$$
where $\ddot{K}$, $\{\ddot{W}^{\ddot{x}_n}_{f,k}\}$, and $\{\ddot{H}^{\ddot{x}_n}_{k,n}\}$ represent the number of basis vectors, the elements of the basis matrices, and the elements of the activation matrices for the noise, respectively. We can write $\{\ddot{W}^{\ddot{x}_n}_{f,k}\}$ and $\{\ddot{H}^{\ddot{x}_n}_{k,n}\}$ into matrix forms as $\{\ddot{W}^j\}$ and $\{\ddot{H}^j\}$, respectively. Note that we also used the superposition property of Poisson random variables to obtain (17).
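The construction of the composite noisy-speech state space via the Kronecker product can be illustrated with a small sketch. The transition matrices and initial probabilities below are made-up toy values (2 speech states, 3 noise states), and `composite_index` is a hypothetical helper that maps a (speech, noise) state pair to its position in the listing order given above, using zero-based indices.

```python
import numpy as np

# Toy speech HMM with J = 2 states (illustrative values only).
A_speech = np.array([[0.9, 0.1],
                     [0.2, 0.8]])
pi_speech = np.array([0.6, 0.4])

# Toy noise HMM with 3 states (illustrative values only).
A_noise = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4]])
pi_noise = np.array([0.5, 0.3, 0.2])

# Composite model: 2 x 3 = 6 states, speech state varies slowest.
A_noisy = np.kron(A_speech, A_noise)
pi_noisy = np.kron(pi_speech, pi_noise)

def composite_index(x, x_noise, n_noise_states):
    """Zero-based index of the composite state (x, x_noise)."""
    return x * n_noise_states + x_noise

# Transition (speech 0 -> 1, noise 1 -> 2) factorizes into the two chains.
src = composite_index(0, 1, 3)
dst = composite_index(1, 2, 3)
prob = A_noisy[src, dst]
```

Since the speech and noise chains are independent, every entry of `A_noisy` is a product of one speech transition and one noise transition, and each row still sums to one.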

Offline NMF-HMM-based parameter learning
In the offline training stage, the objective is to find the parameter set that maximizes the likelihood function (11). In general, the EM algorithm [56] can be used to address this problem. Because we use the same model for the speech and noise, here, we use the clean speech as an example to illustrate the offline parameter learning process. First, we define the complete data set $(S_N, x_N, C_N)$, where $C_N = [c_1, c_2, \cdots, c_N]$. Thus, using the conditional independence property, the complete data likelihood function can be written as:
$$p(S_N, x_N, C_N; \Theta) = \prod_{n=1}^{N} p(s_n | c_n)\, p(c_n | x_n)\, p(x_n | x_{n-1}). \tag{18}$$
Next, we show how the parameter set can be obtained iteratively using the EM algorithm. Moreover, we propose an acceleration strategy to lower the computational and memory complexities. The traditional MU update algorithm for the KL divergence-based NMF can be seen as a special case of the proposed algorithm. Expectation step: We first calculate the posterior state probability and the joint posterior probability, which can be written as:
$$q(x_n) = p(x_n | S_N; \Theta^{i-1}), \tag{19}$$
$$q(x_{n-1}, x_n) = p(x_{n-1}, x_n | S_N; \Theta^{i-1}), \tag{20}$$
where $i$ is the iteration number. The calculation of (19) and (20) can be performed using the forward-backward algorithm [56]. Apart from this, we also need to evaluate the posterior expectation $E_{c_n | S_N, x_n; \Theta^{i-1}}(c_n)$, which will be used in the maximization step. By using the Bayes rule and the conditional independence property of the proposed model, we have:
$$q(c_n | x_n) = p(c_n | S_N, x_n; \Theta^{i-1}) = \frac{p(s_n | c_n)\, p(c_n | x_n)}{p(s_n | x_n)}. \tag{21}$$
Combining (14) and (15) and following the derivation in [53], we have:
$$q(c_n | x_n) = \prod_{f=1}^{F} \mathcal{M}\big(c_{f,n}(1), \cdots, c_{f,n}(K);\; s_{f,n},\, \rho_{f,n}(1), \cdots, \rho_{f,n}(K)\big), \tag{22}$$
where $\mathcal{M}(\cdot)$ denotes the multinomial distribution and
$$\rho_{f,n}(k) = \frac{\bar{W}^{x_n}_{f,k} \bar{H}^{x_n}_{k,n}}{\sum_{k'=1}^{K} \bar{W}^{x_n}_{f,k'} \bar{H}^{x_n}_{k',n}}. \tag{23}$$
Using the properties of the multinomial distribution, the mean can be written as:
$$E_{q(c_n | x_n)}\big(c_{f,n}(k)\big) = s_{f,n}\, \frac{\bar{W}^{x_n}_{f,k} \bar{H}^{x_n}_{k,n}}{\sum_{k'=1}^{K} \bar{W}^{x_n}_{f,k'} \bar{H}^{x_n}_{k',n}}. \tag{24}$$
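The posterior state probabilities used in the expectation step can be computed with a scaled forward-backward recursion. The sketch below is a generic textbook version, not the authors' code: it assumes the per-frame state likelihoods have already been evaluated and stored in `lik[n, j]`, and the function name and variable names are illustrative.

```python
import numpy as np

def forward_backward(lik, A, pi):
    """Scaled forward-backward algorithm for a discrete-state HMM.

    lik[n, j] : likelihood of frame n under state j (assumed precomputed)
    A[i, j]   : transition probability from state i to state j
    pi[j]     : initial state probability
    Returns gamma[n, j] = q(x_n = j) and xi[n-1, i, j] = q(x_{n-1} = i, x_n = j).
    """
    N, J = lik.shape
    alpha = np.zeros((N, J))
    beta = np.ones((N, J))
    c = np.zeros(N)                      # per-frame scaling factors

    alpha[0] = pi * lik[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for n in range(1, N):                # forward pass
        alpha[n] = (alpha[n - 1] @ A) * lik[n]
        c[n] = alpha[n].sum()
        alpha[n] /= c[n]

    for n in range(N - 2, -1, -1):       # backward pass
        beta[n] = (A @ (lik[n + 1] * beta[n + 1])) / c[n + 1]

    gamma = alpha * beta
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (lik[1:] * beta[1:])[:, None, :]) / c[1:, None, None]
    return gamma, xi

# Toy example: 6 frames, 3 states, random positive likelihoods.
lik = np.random.default_rng(0).random((6, 3)) + 0.1
A = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.5, 0.3, 0.2])
gamma, xi = forward_backward(lik, A, pi)
```

The per-frame scaling keeps the recursions numerically stable for long utterances; marginalizing `xi` over either state index recovers the corresponding rows of `gamma`.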
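Maximization step: In this step, our objective is to find the parameters that maximize the expectation of the logarithm of the complete data likelihood, that is,
$$\Theta^{i} = \arg\max_{\Theta}\; E_{x_N, C_N | S_N; \Theta^{i-1}}\big(\log p(S_N, x_N, C_N; \Theta)\big). \tag{25}$$
The estimators for $A$ and $\pi$ are the same as in the traditional HMM [56]. For completeness, the results are shown below:
$$A_{o,j} = \frac{\sum_{n=2}^{N} q(x_{n-1} = o, x_n = j)}{\sum_{n=2}^{N} q(x_{n-1} = o)}, \tag{26}$$
$$\pi_j = q(x_1 = j), \tag{27}$$
where $1 \leq o, j \leq J$. The estimated basis and activation matrices can be derived by setting the derivatives of (25) to zero, and we can obtain:
$$\bar{W}^{j}_{f,k} = \frac{\sum_{n=1}^{N} q(x_n = j)\, E_{q(c_n | x_n = j)}\big(c_{f,n}(k)\big)}{\sum_{n=1}^{N} q(x_n = j)\, \bar{H}^{j}_{k,n}}, \tag{28}$$
$$\bar{H}^{j}_{k,n} = \frac{\sum_{f=1}^{F} E_{q(c_n | x_n = j)}\big(c_{f,n}(k)\big)}{\sum_{f=1}^{F} \bar{W}^{j}_{f,k}}. \tag{29}$$
Acceleration strategy: Although we can directly use the above EM algorithm to update the parameter set, saving the conditional expectation of $c_{f,n}(k)$ in (24) requires a great deal of memory. Like [53], we substitute (24) into (28) and (29) and can obtain:
$$\bar{W}^{j}_{f,k} \leftarrow \bar{W}^{j}_{f,k}\, \frac{\sum_{n=1}^{N} q(x_n = j)\, \dfrac{s_{f,n}}{\sum_{k'} \bar{W}^{j}_{f,k'} \bar{H}^{j}_{k',n}}\, \bar{H}^{j}_{k,n}}{\sum_{n=1}^{N} q(x_n = j)\, \bar{H}^{j}_{k,n}}, \tag{30}$$
$$\bar{H}^{j}_{k,n} \leftarrow \bar{H}^{j}_{k,n}\, \frac{\sum_{f=1}^{F} \bar{W}^{j}_{f,k}\, \dfrac{s_{f,n}}{\sum_{k'} \bar{W}^{j}_{f,k'} \bar{H}^{j}_{k',n}}}{\sum_{f=1}^{F} \bar{W}^{j}_{f,k}}. \tag{31}$$
We can further write (30) and (31) in matrix forms:
$$\bar{W}^{j} \leftarrow \bar{W}^{j} \odot \frac{\frac{S_N}{\bar{W}^{j} \bar{H}^{j}}\, \Lambda(j)\, (\bar{H}^{j})^T}{\mathbf{1}\, \Lambda(j)\, (\bar{H}^{j})^T}, \tag{32}$$
$$\bar{H}^{j} \leftarrow \bar{H}^{j} \odot \frac{(\bar{W}^{j})^T \frac{S_N}{\bar{W}^{j} \bar{H}^{j}}}{(\bar{W}^{j})^T \mathbf{1}}, \tag{33}$$
where $\Lambda(j) = \mathrm{diag}(q(x_1 = j), q(x_2 = j), \cdots, q(x_N = j))$. By using the proposed acceleration strategy, computing and saving the conditional expectation of $c_{f,n}(k)$ in (24) is not required. Moreover, multiplicative update rules for the basis and activation matrices are obtained, leading to fast computation. In the proposed algorithm, there is more than one pair of basis and activation matrices to be estimated. Using the acceleration strategy, the basis and activation matrices of the different states can be estimated simultaneously rather than one by one, which reduces the time complexity. Comparing the update rules of the proposed method (32), (33) with those of the traditional NMF-based method (5), (6), the difference is that the basis update rule (32) of the proposed method takes the posterior state information $\Lambda(j)$ into account. In fact, if the number of states is set to one (i.e., $J = 1$), the proposed training method is identical to the traditional KL divergence-based NMF approach. Thus, the traditional NMF can be seen as a special case of the proposed algorithm. The entire flow of the offline parameter learning is shown in Algorithm 1.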
Note that, for stability reasons, each column of W j is normalized to have a unit norm during training.
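One multiplicative update step for a single HMM state can be sketched as follows. This is an illustrative reading of the training rules, assuming that the basis update weights every frame by its posterior state probability q(x_n = j) while the activation update keeps the standard KL form; the function name `weighted_mu_step` and the `eps` safeguard are hypothetical, and the column normalization of the basis matrix is omitted because the norm used is not specified here.

```python
import numpy as np

def weighted_mu_step(S, W, H, q, eps=1e-12):
    """One posterior-weighted MU step for the basis/activation pair of one state.

    S : (F, N) magnitude spectrogram of the training data
    W : (F, K) basis matrix of state j
    H : (K, N) activation matrix of state j
    q : (N,)   posterior probabilities q(x_n = j), one per frame
    """
    R = S / (W @ H + eps)                       # element-wise ratio S / (W H)
    # Multiplying columns by q is the same as right-multiplying by diag(q).
    W_new = W * (((R * q) @ H.T) / ((H @ q)[None, :] + eps))
    R_new = S / (W_new @ H + eps)
    H_new = H * ((W_new.T @ R_new) / (W_new.sum(axis=0)[:, None] + eps))
    return W_new, H_new

rng = np.random.default_rng(0)
S = rng.random((8, 12)) + 0.1
W = rng.random((8, 3)) + 0.1
H = rng.random((3, 12)) + 0.1

# With all posteriors equal to one, the basis update is the standard KL-NMF rule.
W_all, H_all = weighted_mu_step(S, W, H, np.ones(12))
W_ref = W * ((S / (W @ H)) @ H.T) / (np.ones_like(S) @ H.T)

# Frames with zero posterior do not influence the basis update.
q_half = np.array([1.0] * 6 + [0.0] * 6)
S_mod = S.copy()
S_mod[:, 6:] *= 5.0
W_a, _ = weighted_mu_step(S, W, H, q_half)
W_b, _ = weighted_mu_step(S_mod, W, H, q_half)
```

Because each state's update only needs its own `W`, `H`, and posterior vector `q`, the updates for different states are independent and can be run in parallel or batched.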

Online speech enhancement using the MMSE estimator

MMSE estimator for the NMF-HMM
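In this section, we provide a detailed derivation of the proposed MMSE-based online speech enhancement algorithm for the proposed NMF-HMM model. Our objective is to obtain the MMSE estimate of the desired clean speech signal from the noisy observations $Y_n = [y_1, \cdots, y_n]$:
$$\hat{s}_n = E_{s_n | Y_n}(s_n) = \int s_n\, p(s_n | Y_n)\, ds_n. \tag{34}$$
In (34), the posterior probability $p(s_n | Y_n)$ can be derived as:
$$p(s_n | Y_n) = \frac{\sum_{x_n, \ddot{x}_n} p(s_n, y_n | x_n, \ddot{x}_n)\, p(x_n, \ddot{x}_n | Y_{n-1})}{\sum_{x_n, \ddot{x}_n} p(y_n | x_n, \ddot{x}_n)\, p(x_n, \ddot{x}_n | Y_{n-1})}, \tag{35}$$
where we use the conditional independence property of the HMM. The term $p(x_n, \ddot{x}_n | Y_{n-1})$ in (35) can be expressed as:
$$p(x_n, \ddot{x}_n | Y_{n-1}) = \sum_{x_{n-1}, \ddot{x}_{n-1}} p(x_n, \ddot{x}_n | x_{n-1}, \ddot{x}_{n-1})\, p(x_{n-1}, \ddot{x}_{n-1} | Y_{n-1}), \tag{36}$$
where the first term after the summation is the state transition probability for a noisy signal, and the second term is the forward probability that can be acquired using the well-known forward algorithm [56]. By applying the Bayes rule, the term $p(s_n, y_n | x_n, \ddot{x}_n)$ in (35) can be further written as:
$$p(s_n, y_n | x_n, \ddot{x}_n) = p(s_n | y_n, x_n, \ddot{x}_n)\, p(y_n | x_n, \ddot{x}_n). \tag{37}$$
Substituting (37) into (35), the posterior probability can be re-written as:
$$p(s_n | Y_n) = \sum_{x_n, \ddot{x}_n} \omega_{x_n, \ddot{x}_n}\, p(s_n | y_n, x_n, \ddot{x}_n), \tag{38}$$
where the weight $0 \leq \omega_{x_n, \ddot{x}_n} \leq 1$ is defined as:
$$\omega_{x_n, \ddot{x}_n} = \frac{p(y_n | x_n, \ddot{x}_n)\, p(x_n, \ddot{x}_n | Y_{n-1})}{\sum_{x_n, \ddot{x}_n} p(y_n | x_n, \ddot{x}_n)\, p(x_n, \ddot{x}_n | Y_{n-1})}. \tag{39}$$
Thus, by combining (34) and (38), the proposed HMM-based MMSE estimator can be expressed as:
$$\hat{s}_n = \sum_{x_n, \ddot{x}_n} \omega_{x_n, \ddot{x}_n} \int s_n\, p(s_n | y_n, x_n, \ddot{x}_n)\, ds_n. \tag{40}$$
Instead of obtaining the posterior probability density function (PDF) $p(s_n | y_n, x_n, \ddot{x}_n)$ directly, we derive the formula for the joint posterior PDF of the clean speech and noise first, that is:
$$p(s_n, m_n | y_n, x_n, \ddot{x}_n) = \frac{p(y_n | s_n, m_n)\, p(s_n, m_n | x_n, \ddot{x}_n)}{p(y_n | x_n, \ddot{x}_n)} = \frac{p(y_n | s_n, m_n)\, p(s_n | x_n)\, p(m_n | \ddot{x}_n)}{p(y_n | x_n, \ddot{x}_n)}. \tag{41}$$
By using (1), we can express the likelihood function $p(y_n | s_n, m_n)$ as $p(y_n | s_n, m_n) = \delta(y_n - s_n - m_n)$, where $\delta(\cdot)$ denotes the Dirac delta function, which is defined by $\delta(0) = +\infty$ and $\delta(x) = 0$ when $x \neq 0$. Furthermore, the prior probabilities $p(s_n | x_n)$ and $p(m_n | \ddot{x}_n)$ can be evaluated by using (16). Following the derivation in [53], we can verify that the joint posterior PDF can be expressed in terms of the multinomial distribution as:
$$p(s_n, m_n | y_n, x_n, \ddot{x}_n) = \prod_{f=1}^{F} \mathcal{M}\big(s_{f,n}, m_{f,n};\; y_{f,n},\, p_{f,n}(x_n, \ddot{x}_n),\, q_{f,n}(x_n, \ddot{x}_n)\big), \tag{42}$$
where $p_{f,n}(x_n, \ddot{x}_n)$ and $q_{f,n}(x_n, \ddot{x}_n)$ are defined as:
$$p_{f,n}(x_n, \ddot{x}_n) = \frac{\sum_{k=1}^{K} \bar{W}^{x_n}_{f,k} \bar{H}^{x_n}_{k,n}}{\sum_{k=1}^{K} \bar{W}^{x_n}_{f,k} \bar{H}^{x_n}_{k,n} + \sum_{k=1}^{\ddot{K}} \ddot{W}^{\ddot{x}_n}_{f,k} \ddot{H}^{\ddot{x}_n}_{k,n}}, \tag{43}$$
and $q_{f,n}(x_n, \ddot{x}_n) = 1 - p_{f,n}(x_n, \ddot{x}_n)$. Therefore, the integral term in (40) can be expressed as:
$$\int s_n\, p(s_n | y_n, x_n, \ddot{x}_n)\, ds_n = \iint s_n\, p(s_n, m_n | y_n, x_n, \ddot{x}_n)\, dm_n\, ds_n = y_n \odot p_n(x_n, \ddot{x}_n), \tag{44}$$
where $p_n(x_n, \ddot{x}_n) = [p_{1,n}(x_n, \ddot{x}_n), \cdots, p_{F,n}(x_n, \ddot{x}_n)]^T$, and we used the marginal mean property of the multinomial distribution. Combining (40) and (44), the MMSE estimator can be expressed as:
$$\hat{s}_n = y_n \odot g_n, \tag{45}$$
$$g_n = \sum_{x_n, \ddot{x}_n} \omega_{x_n, \ddot{x}_n}\, p_n(x_n, \ddot{x}_n), \tag{46}$$
where $g_n$ can be viewed as the spectral gain vector for the proposed model. Comparing the proposed gain vector $g_n$ with the traditional NMF-based gain vector [36], we find that the proposed gain vector is a weighted sum of the gains of the individual composite states, each of which has the same Wiener filtering form as the traditional NMF gain (7).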

Online estimation of activation matrices
After obtaining the trained basis matrices $\bar{W}^{x_n}$ and $\ddot{W}^{\ddot{x}_n}$ for both the clean speech and noise in the training stage, we need to obtain online estimates of the activation parameters $\bar{H}^{x_n}_{k,n}$ and $\ddot{H}^{\ddot{x}_n}_{k,n}$ to acquire the gain in (45) and (46). The activations are estimated by maximizing the logarithm of the state-conditioned likelihood function (17), which is equivalent to:
$$\{\bar{h}_n(x_n, \ddot{x}_n),\, \ddot{h}_n(x_n, \ddot{x}_n)\} = \arg\min_{\bar{h}, \ddot{h} \geq 0} D_{KL}\big(y_n \,\|\, \bar{W}^{x_n} \bar{h} + \ddot{W}^{\ddot{x}_n} \ddot{h}\big), \tag{48}$$
where the clean speech and noise activation vectors for the state $(x_n, \ddot{x}_n)$ are defined as $\bar{h}_n(x_n, \ddot{x}_n) = [\bar{H}^{x_n}_{1,n}, \bar{H}^{x_n}_{2,n}, \cdots, \bar{H}^{x_n}_{K,n}]^T$ and $\ddot{h}_n(x_n, \ddot{x}_n) = [\ddot{H}^{\ddot{x}_n}_{1,n}, \ddot{H}^{\ddot{x}_n}_{2,n}, \cdots, \ddot{H}^{\ddot{x}_n}_{\ddot{K},n}]^T$. The activations in (48) can be obtained iteratively by using the multiplicative update rule in Eq. (6). Note that parallel computing can be used to reduce the time complexity when obtaining the activations for the different states. It can be readily shown that when $J = \ddot{J} = 1$, the gain vectors for the proposed algorithm (46) and the standard NMF (7) are identical, that is, $g_n = g_n^{NMF}$. The entire flow of the proposed MMSE-based online speech enhancement algorithm is illustrated by Algorithm 2.
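The gain computation for one frame can be sketched as a weighted sum of per-state Wiener-type gains. The sketch below assumes the per-state activations and the state weights have already been estimated; the function name `mmse_gain`, the array layouts, and `eps` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def mmse_gain(W_s, h_s, W_m, h_m, omega, eps=1e-12):
    """Spectral gain for one frame as a weighted sum of per-state gains.

    W_s   : (J, F, K)    speech basis matrices, one per speech state
    h_s   : (J, Jn, K)   speech activations for each composite state
    W_m   : (Jn, F, Kn)  noise basis matrices, one per noise state
    h_m   : (J, Jn, Kn)  noise activations for each composite state
    omega : (J, Jn)      composite-state weights, summing to one
    """
    speech = np.einsum('jfk,jik->jif', W_s, h_s)   # speech model per state pair
    noise = np.einsum('ifk,jik->jif', W_m, h_m)    # noise model per state pair
    p = speech / (speech + noise + eps)            # per-state Wiener-type gain
    return np.einsum('ji,jif->f', omega, p)        # weighted sum over state pairs

# Single-state case: the gain reduces to the plain NMF Wiener-type gain.
rng = np.random.default_rng(0)
F, K, Kn = 5, 3, 2
W_s = rng.random((1, F, K))
h_s = rng.random((1, 1, K))
W_m = rng.random((1, F, Kn))
h_m = rng.random((1, 1, Kn))
g = mmse_gain(W_s, h_s, W_m, h_m, np.ones((1, 1)))

y = rng.random(F)          # toy noisy magnitude frame
s_hat = y * g              # enhanced magnitude frame
```

Since every per-state gain lies between zero and one and the weights sum to one, the combined gain also lies between zero and one, so the enhanced magnitude never exceeds the noisy magnitude.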

Experimental results and discussion
In this section, we report on the investigation and evaluation of the proposed algorithm using various experiments. First, we investigated the effect of different parameter settings for the proposed model, that is, the number of states and basis vectors of clean speech and noise, respectively. Second, we compared the proposed NMF-HMM with other state-of-the-art speech enhancement methods to demonstrate the effectiveness of the proposed algorithm. In this work, the PESQ score [41], ranging from −0.5 to 4.5, was used to quantify the enhanced speech quality. The version of the PESQ model used was the International Telecommunication Union (ITU) standard P.862 [57]. The implementation code was provided by [2]. The STOI score [42], ranging from 0 to 1, was used to measure speech intelligibility.

Experimental data preparation
In this study, the proposed algorithm was evaluated using the Texas Instruments/Massachusetts Institute of Technology (TIMIT) database [58], 100 environmental noises [59], office noise, and the NoiseX-92 database [60]. During the training stage, all 4620 utterances from the TIMIT training database were used to train the proposed NMF-HMM model for clean speech. For the experiments in Section 5.2, the Babble, F16, Factory, and White noises from the NoiseX-92 database were used to train the NMF-HMM model for the noise. To build the test database for Section 5.2, 200 utterances were randomly chosen from the 1680 utterances of the TIMIT test set. Four types of noise were then added at four different SNR levels (−5, 0, 5, and 10 dB). The noise types of the testing set were the same as those of the training set, but there was no overlap between the signals in the two sets. In total, 200 × 4 × 4 = 3200 utterances were used for the evaluation. For the more extensive experiments in Section 5.3, the Babble and F16 noises from the NoiseX-92 database and 90 environmental noises (N1-N90 in [59]) were used to train the NMF-HMM model for the noise dictionary. In the test stage, 200 utterances were randomly chosen from the 1680 utterances of the TIMIT test set to build three test databases. The first test database included 10 unseen environmental noises from [59] (N91-N100). The second included unseen office noise, and the third test database was built from 25 seen environmental noises in [59] (N18-N43). In all three test databases, the noise was added at four different SNR levels (−5, 0, 5, and 10 dB). All the algorithms were evaluated using the same test dataset. In all experiments, the sound signals were down-sampled to 16 kHz. The frame length was set to 1024 samples (64 ms) with a frame shift of 512 samples (32 ms). The size of the STFT was 1024 points with a Hanning window.
Furthermore, the maximum number of iterations was set to 30 in the training stage and 15 in the online speech enhancement stage for the proposed NMF-HMM algorithm.
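The analysis settings above (16 kHz sampling, 1024-sample frames, 512-sample shift, 1024-point STFT, Hanning window) can be illustrated with a minimal NumPy sketch that turns a waveform into a magnitude spectrogram. The function name `magnitude_stft` is hypothetical, no padding of the signal edges is applied, and the random test signal is only a stand-in for real audio.

```python
import numpy as np

def magnitude_stft(x, frame_len=1024, hop=512):
    """Magnitude spectrogram with a Hanning window (no edge padding).

    Returns an array of shape (frame_len // 2 + 1, n_frames).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Slice the signal into overlapping windowed frames, one per column.
    frames = np.stack(
        [x[i * hop: i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    return np.abs(np.fft.rfft(frames, n=frame_len, axis=0))

fs = 16000                                        # sampling rate in Hz
x = np.random.default_rng(0).standard_normal(fs)  # one second of toy signal
Y = magnitude_stft(x)

frame_ms = 1000 * 1024 / fs                       # frame length in milliseconds
hop_ms = 1000 * 512 / fs                          # frame shift in milliseconds
```

With these settings each column of `Y` holds 513 non-negative frequency bins, which is the size of the magnitude vector that the NMF-HMM model factorizes for every frame.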

Analyses of the number of states and basis vectors
As explained in Sections 3 and 4, four parameters must be predefined in the proposed NMF-HMM-based speech enhancement algorithm: the numbers of states (J and J̈) and the numbers of basis vectors (K and K̈) for the clean speech and noise, respectively. In this section, we investigate the effects of these parameters on the proposed method and choose suitable values for the later experiments.

HMM states analysis
First, before the state analysis, we show that using temporal dynamics effectively helps NMF obtain better speech enhancement (SE) performance. To verify this point, we use the traditional NMF-based speech enhancement (T-NMF) [36] as the reference method. T-NMF is a special case of NMF-HMM with J = 1 and J̈ = 1. T-NMF does not include temporal dynamics information, and its transition matrix A is non-informative. For a fair comparison, we keep the total number of clean speech basis vectors (K × J) the same for the NMF-HMM and the T-NMF method [36]. For the T-NMF, the number of clean speech basis vectors K is varied as 25, 125, 250, 500, and 1000. For the NMF-HMM, K is fixed to 25 and J is varied as 1, 5, 10, 20, and 40. The number of noise basis vectors for both the proposed NMF-HMM and T-NMF is fixed to 70, and the number of noise states for the NMF-HMM is fixed to 1. In this experiment, we use the average STOI and PESQ scores of the 3200 utterances as the performance metrics. The experimental results are shown in Fig. 1. As can be seen, the T-NMF achieves its best performance when K = 25, and its performance degrades as the number of basis vectors increases due to overfitting. By contrast, the NMF-HMM achieves higher PESQ and STOI scores as the total number of clean speech basis vectors increases, because it takes the temporal dynamics into account through the HMM. This indicates that temporal dynamics can improve the SE performance of NMF.

States and basis vector analysis for clean speech
Next, we investigated the effect of the number of clean speech states J and basis vectors K on the proposed model. The number of noise states was set to 1 (i.e., J = 1 ), and the number of noise basis vectors was fixed to K = 70 . The number of clean speech states was chosen as 1, 5, 10, 20, and 40, and the number of clean speech basis vectors as 5, 10, 25, and 50. The enhancement performance was evaluated using the PESQ and STOI scores. Tables 1 and 2 show the average STOI and PESQ scores at different SNRs. It can be seen that, when the number of basis vectors K is fixed, the PESQ and STOI scores increase with the number of clean speech states J , which indicates the benefit of using temporal dynamics in the NMF model. Additionally, when the number of clean speech states J is fixed, NMF-HMM achieves the best speech enhancement performance at K = 25 ; a larger K leads to worse performance due to overfitting. Therefore, based on these experimental results, we chose J = 40 and K = 25 for the following experiments.

States and basis vector analysis for noise
In this part, we evaluated the effect of the number of noise states J and noise basis vectors K on the proposed model. Here, based on the previous experimental results, the numbers of clean speech states and basis vectors were set to 40 and 25 ( J = 40 , K = 25 ), respectively. The number of noise states was chosen as 1, 2, 5, and 10, and the number of noise basis vectors as 10, 20, 40, and 70. Tables 3 and 4 show the average STOI and PESQ scores at different SNRs. The PESQ and STOI scores tend to increase with the number of noise states J when the number of noise basis vectors K is fixed. Moreover, when J is fixed, K = 70 achieves the highest PESQ score, but its STOI score is slightly lower than that of K = 40 . Based on these results, we selected 40 clean speech states, 10 noise states, 25 clean speech basis vectors, and 40 noise basis vectors for the rest of the experiments: the model has fewer parameters with K = 40 than with K = 70 , the STOI score is higher with K = 40 , and the PESQ difference between K = 40 and K = 70 is small.
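To give a feel for the model-size argument, the following sketch counts the parameters of a noise model under a simplified accounting. It assumes that each state owns one basis matrix of size (frequency bins × basis vectors) and that the HMM adds a transition matrix and an initial-state vector; any further parameters of the actual model are ignored. The helper name `count_params` and the value of 257 frequency bins are hypothetical and do not come from the paper.

```python
def count_params(n_states, n_basis, n_freq_bins):
    """Simplified parameter count for one NMF-HMM source model (assumption)."""
    basis = n_states * n_freq_bins * n_basis  # one basis matrix per state
    transition = n_states * n_states          # HMM transition matrix
    initial = n_states                        # initial state probabilities
    return basis + transition + initial


N_FREQ_BINS = 257  # hypothetical value, e.g. a 512-point FFT

noise_k40 = count_params(n_states=10, n_basis=40, n_freq_bins=N_FREQ_BINS)
noise_k70 = count_params(n_states=10, n_basis=70, n_freq_bins=N_FREQ_BINS)

print(f"noise model, 10 states, 40 basis vectors: {noise_k40} parameters")
print(f"noise model, 10 states, 70 basis vectors: {noise_k70} parameters")
print(f"size ratio (40 vs. 70 basis vectors): {noise_k40 / noise_k70:.2f}")
```

Under this accounting the basis matrices dominate the count, so the model size scales almost linearly with the number of basis vectors per state.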

Overall evaluation
In this section, we compare the proposed NMF-HMM speech enhancement method with state-of-the-art speech enhancement methods. As reference methods, we chose the optimally modified log-spectral amplitude (OM-LSA) method [61] with the improved minima controlled recursive averaging (IMCRA) noise estimator [62]; the variable span linear filters method [7] (SLF-NMF), which uses the parametric NMF [17] to estimate the statistics; temporal-NMF [49]; convolutive NMF (CNMF) [55,63]; a DNN [64]; and the log-MMSE algorithm [65]. For SLF-NMF, the maximum SNR filter was applied, and the number of eigenvectors was set to one. The variable span linear filters reference code can be found in [7]. The codebook sizes for clean speech and noise were set to 64 and 8, respectively. The other SLF-NMF parameter settings were the same as for NMF-HMM. For temporal-NMF, all parameter settings were the same as in [49], which ensured that temporal-NMF could achieve its best speech enhancement performance. For CNMF, the settings were similar to those of the CNMF in [40]. For the DNN, we used the DNS baseline [64], which is one of the state-of-the-art speech enhancement algorithms. OM-LSA and log-MMSE are state-of-the-art unsupervised speech enhancement methods, while SLF-NMF and temporal-NMF are state-of-the-art NMF-based speech enhancement methods. Like our method, temporal-NMF also considers temporal information.
The performance of NMF-HMM, the DNN, temporal-NMF, CNMF, SLF-NMF, log-MMSE, and OM-LSA was evaluated on the test set. Figure 2 shows the average PESQ scores, with 95% confidence intervals, of these algorithms for 25 types of seen noise. As can be seen, SLF-NMF had the worst performance among these algorithms. Temporal-NMF and CNMF achieved higher scores than SLF-NMF, which indicates the benefit of temporal information for speech enhancement. Moreover, with the exception of the DNS baseline, the proposed NMF-HMM outperformed the other enhancement algorithms in all SNR scenarios. Furthermore, in low SNR scenarios (e.g., −5 to 5 dB), the average PESQ score improvement of the proposed NMF-HMM over the other algorithms was larger than 0.5. Figures 3 and 4 show the PESQ results under unseen noise environments; there, NMF-HMM achieved a higher PESQ score than all reference methods except the DNS baseline at all four SNRs.
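For readers who want to reproduce this kind of summary, the sketch below computes a mean score with a 95% confidence interval from per-utterance scores. The text does not state how the intervals were obtained, so the normal approximation (mean ± 1.96 standard errors) is an assumption here, and both the helper name `mean_ci95` and the demo scores are hypothetical.

```python
import math
import statistics


def mean_ci95(scores):
    """Return (mean, lower, upper) using a normal-approximation 95% interval."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate an interval")
    mean = statistics.fmean(scores)
    # Standard error of the mean, based on the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical per-utterance PESQ scores, for illustration only.
demo_scores = [2.1, 2.4, 2.3, 2.6, 2.2, 2.5]
mean, lower, upper = mean_ci95(demo_scores)
print(f"mean PESQ = {mean:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

For a small number of utterances, a Student's t quantile would be more appropriate than the constant 1.96. With thousands of test utterances the two are practically identical.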
The STOI scores, with 95% confidence intervals, of the various algorithms are provided in Table 5. As can be seen, temporal-NMF, CNMF, and NMF-HMM had higher STOI scores than SLF-NMF on the three test datasets, which illustrates the benefit of considering speech temporal information. NMF-HMM achieved a higher STOI score than the reference NMF-based methods (temporal-NMF, CNMF, and SLF-NMF) for both seen and unseen noise, while the DNS baseline achieved a better STOI score than NMF-HMM. Overall, the proposed method achieved the best speech enhancement performance among the non-DNN-based algorithms, and the DNS baseline achieved the highest scores overall. In future work, a DNN-based strategy could be combined with the proposed algorithm to improve the accuracy of the basis vector estimation, which may further improve its speech enhancement performance.

Conclusion
In this work, we proposed and analyzed an NMF-HMM-based speech enhancement algorithm that applies the sum of the Poisson distribution, leading to the KL divergence measure, as the observation model for each state of the HMM. The computationally efficient multiplicative update rule is used to update the parameters during the training stage. Moreover, through the HMM, the method captures the temporal dynamics of speech signals. Furthermore, we detailed the derivation of the proposed NMF-HMM-based MMSE estimator for online speech enhancement. The estimator can be computed in parallel, which effectively reduces the time complexity of the online speech enhancement stage. Through experiments, suitable numbers of states and basis vectors for the proposed NMF-HMM were found. Our experimental results also indicated that the proposed algorithm could outperform state-of-the-art NMF-based and unsupervised speech enhancement methods. In future work, a DNN-based strategy can be considered to improve the accuracy of the basis vector estimation, which may further improve the speech enhancement performance of our algorithm.