An MMSE graph spectral magnitude estimator for speech signals residing on an undirected multiple graph

Abstract

The paper uses the K-graphs learning method to construct weighted, connected, undirected multiple graphs that reveal the intrinsic inter-frame and intra-frame relationships of speech samples. To benefit from the properties of the learned multiple graphs and enhance interpretability, we study the spectral properties of speech samples in the joint vertex-frequency domain using the new graph weight matrix. Moreover, we derive the minimum mean-square error (MMSE) graph spectral magnitude estimator for speech signals residing on undirected multiple graphs and use it to improve speech enhancement performance. Numerical simulation results show that the proposed method outperforms the existing methods in graph signal processing (GSP) and the baseline speech enhancement methods in discrete signal processing (DSP) in terms of PESQ, LLR, output SNR, and STOI. These results also demonstrate the validity of the learned multiple graphs.

1 Introduction

Graph signal processing (GSP) [1, 2] explores the relationships among discrete signals residing on vertices via graph models [3]. It has been developing a set of theories, built on traditional discrete signal processing, to investigate, analyze, and process data defined over arbitrary topologies [4, 5]. The scope of research has expanded from fundamental GSP concepts [6,7,8] to practical applications including graph signal denoising and restoration [9,10,11], learning graphs from observed data [12,13,14], image processing [15, 16], and graph clustering [17, 18].

GSP is efficient in characterizing time series by using graph theories [19]. To be specific, in [20], dynamic visibility graphs (DVG) were constructed to describe time series, which studied the DVG dependence of different time series. In [21], the recurrence matrix of the time series was defined as the adjacency matrix of an associated complex graph to link different points in the case where the evolution of the considered states is similar. In [22, 23], a visibility graph based on the Gaussian kernel function was defined for electroencephalogram (EEG) signals, which provided a topology to capture sudden fluctuations happening in EEG during seizure activity.

Traditional discrete signal processing (DSP)-based speech enhancement algorithms use short-time spectrum estimation to suppress noises. Specifically, in [24], the authors proposed the traditional Wiener filtering for speech enhancement by leveraging the frequency-domain characteristics of noises and speech signals. In [25], the authors focused on the major importance of the short-time spectral amplitude (STSA) of the speech signal and proposed a minimum mean-square error (MMSE) STSA estimator for speech enhancement by modeling speech and noise spectral components as statistically independent Gaussian random variables. In [26], the authors proposed the optimal modified minimum mean-square error log-spectral (OMLSA) for robust speech enhancement by minimizing the mean-square error of the log-spectra as a weighted geometric mean of the hypothetical gains associated with the speech presence uncertainty. In addition, in [27], the authors proposed a statistical speech enhancement model using acoustic environment classification supported by a Gaussian mixture model. In [28], the authors proposed a joint-constrained dictionary learning method to solve the “cross projection” problem of signals in the joint dictionary for single-channel speech enhancement.

Unlike the traditional DSP-based speech enhancement methods, GSP-based speech enhancement methods first establish graphs for speech signals before performing enhancement. By establishing the graph adjacency matrix with different edges and weights, speech signals can then be flexibly mapped into different graph frequency domains with different graph Fourier bases. It is worth noting that finite (periodic) time series have been constructed as signals indexed by a directed cycle graph [1, 2, 6]. Speech signals are a special kind of time series. Directly applying this graph topology for finite time series to unstructured speech signals captures only the time shifts and succession of sampled speech signals and fails to capture the potential relationships among speech samples.

Our previous work [29, 30] has made much progress toward inferring a suitable graph representation of speech signals. To be specific, in [29], we first established a single undirected graph topology for unstructured speech signals, which successfully mapped time-domain speech signals into the vertex domain and viewed them as speech graph signals. In [30], we proposed a single digraph by using algebraic signal processing (ASP) [31] theories and then built graph Wiener filters in the graph Fourier domain for speech enhancement. However, the static graph topology designed in our previous work [29, 30] cannot capture the potential relationships between different speech frames. In [32], we proposed to learn a directed multilayer graph model for speech signals by using graph learning, which reveals both the intrinsic relationships between frames and those among speech samples within a frame. However, that work learned a single complex, large-volume graph model for the entire speech signal, which does not reveal the dynamically changing characteristics of speech.

Against this backdrop, in this paper, we propose a K-graphs learning method to learn multiple undirected graphs for framed noisy speech graph signals to better match the dynamic nature of speech. Specifically, the framed noisy speech graph signals are partitioned into a set of clusters. For each cluster, an undirected graph is learned to reveal the potential relationships among noisy speech frames in the cluster. In this way, multiple graphs of small size, rather than a single large-volume graph, are learned, which reveal the inter-frame relationships of the total speech signals in a more dynamic way. Additionally, as the size of each cluster is much smaller than that of the whole speech signal, the K-graphs learning method leads to multiple graphs of small volume with a much lower learning complexity.

On the basis of the learned multiple undirected graphs, we propose the gain function representation of the MMSE graph spectral magnitude estimator by extending the classical MMSE-STSA estimator to enhance noisy speech signals. The contributions of the paper are summarized as follows.

i) We propose novel undirected multiple graphs obtained by the K-graphs learning method, which reveal potential relationships among noisy speech frames in real time. On this basis, we construct a joint graph weight matrix and define the related graph Fourier basis.

ii) Based on the constructed graph Fourier basis, we investigate the gain function representation of the MMSE graph spectral magnitude estimator for speech graph signals (SGSs). We propose an MMSE graph spectral magnitude estimator by extending the classical MMSE-STSA method in DSP to perform speech enhancement.

iii) Our numerical results show that the proposed method outperforms the benchmarks in terms of the output signal-to-noise ratio (SNR), perceptual evaluation of speech quality (PESQ), log-likelihood ratio (LLR), and short-time objective intelligibility (STOI). The numerical results also demonstrate the validity of the learned multiple graphs for speech signals. We use the classical MMSE-STSA [25] and OMLSA [26] methods in DSP and existing graph Wiener filtering methods [33] as benchmarks.

The remainder of the paper is organized as follows. Section 2 introduces the related work of GSP. Section 3 investigates the K-graphs learning method for speech signals. The details of the MMSE graph spectral magnitude estimator method are given in Section 4. Section 5 provides our experimental results. Section 6 concludes the paper.

Notation: The superscript \(^T\) represents the transpose. \({\textrm{trace}( \cdot )}\), \({\left\| \cdot \right\| _F}\), and \({\left\| \cdot \right\| _1}\) represent the trace, Frobenius norm, and \(\ell_1\)-norm, respectively, and \(\textbf{1}\) denotes the all-ones matrix. \(\times\) represents the Cartesian product. \(\textbf{I}\) is the identity matrix. \(E( \cdot )\) and \(p( \cdot )\) represent the expectation operator and the probability density function, respectively.

2 Related work

2.1 Basics of GSP

Let \(G = \left( {\mathcal {V},\mathcal {E},\textbf{A}} \right)\) represent a weighted graph, where \(\mathcal {V}\) denotes the collection of vertices \({v_1}, \cdots ,{v_N}\), the set of edges \(\mathcal {E}\) satisfies \(\left( {m,n} \right) \in \mathcal {E}\) if and only if the vertex \({v_m}\) is connected to the vertex \({v_n}\) [34], and \(\textbf{A}\) denotes the graph weight matrix. The element \({\textrm{A}}_{mn}\) of \(\textbf{A}\) represents the weight of the edge between vertices \({v_m}\) and \({v_n}\), which intuitively and numerically shows the dependence or similarity between signals on vertices [35, 36]. This paper focuses on a connected undirected graph with finite vertices. We can define the variation operator by the combinatorial graph Laplacian matrix \(\textbf{L}= \textbf{D}-\textbf{A}\), where \(\textbf{D}\) is a diagonal matrix with elements \({d_{mm}} = \sum \nolimits _{n = 1}^N {{{\textrm{A}}_{mn}}}\;(m = 1,\cdots,N)\) [37].

Following [38, 39], we have the smoothness of graph signals \(\textbf{X}\) given by

$$\begin{aligned} (\textbf{LX})({v_m})=\sum \nolimits _{\left\{ {{v_n}} \right\} \in \mathcal {V}}{{{\textrm{A}}_{m,n}}(x(m) - x(n))}. \end{aligned}$$
(1)

A small smoothness value means that signals indexed by adjacent vertices have similar values, that is, \(\textbf{X}\) is smooth [40]. From (1), we have

$$\begin{aligned} {\textbf{X}^T}\textbf{LX}= & {} \sum \limits _{\left\{ {{v_m}}\right\} \in \mathcal {V}}{x(m)}\sum \limits _{\left\{ {{v_n}}\right\} \in \mathcal {V}}{{{\textrm{A}}_{m,n}}(x(m)- x(n))} \nonumber \\= & {} \sum \limits _{\left\{ {{v_m}}\right\} \in \mathcal {V}}{\sum \limits _{\left\{ {{v_n}}\right\} \in \mathcal {V}}{{{\textrm{A}}_{m,n}}({x^2}(m)-x(m)x(n))}} \\= & {} \frac{1}{2}\sum \limits _{\left\{ {{v_m}}\right\} \in \mathcal {V}}{\sum \limits _{\left\{ {{v_n}}\right\} \in \mathcal {V}}{{{\textrm{A}}_{m,n}}{{(x(m)-x(n))}^2}}}.\nonumber \end{aligned}$$
(2)
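The identity in (2) is easy to verify numerically. The following sketch uses a toy 4-vertex undirected graph (an illustrative example, not from the paper) and compares the Laplacian quadratic form with the pairwise weighted sum:

```python
import numpy as np

# A small symmetric weight matrix A for a 4-vertex undirected graph
A = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 0.0, 2.0],
              [0.5, 0.0, 0.0, 1.0],
              [0.0, 2.0, 1.0, 0.0]])
D = np.diag(A.sum(axis=1))          # degree matrix
L = D - A                           # combinatorial graph Laplacian L = D - A
x = np.array([0.2, 0.1, 0.4, 0.3])  # a graph signal

quad_form = x @ L @ x               # left-hand side of (2)
pairwise = 0.5 * sum(A[m, n] * (x[m] - x[n]) ** 2
                     for m in range(4) for n in range(4))
assert np.isclose(quad_form, pairwise)  # the identity in (2) holds
```

A small value of this quadratic form indicates that signal values on adjacent vertices are similar, i.e., the signal is smooth on the graph.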

2.2 Speech signals on graphs

After the framing operation, speech signals can be expressed as a matrix \({\textbf{s}}\in {\mathbb {R}^{M \times {N_s}}}\), where each row represents one frame of speech, M represents the total number of noisy speech frames, and \(N_s\) denotes the length of a speech frame. Let us view each discretized speech sample as a vertex. Assuming that the relationship between adjacent samples is symmetric, speech samples can be constructed as speech graph signals (SGSs) residing on a connected undirected graph. The one-to-one mapping between the \(i_{th}\) noisy speech sample \(f_i\) in a frame and the signal value of the \(i_{th}\) vertex \({v_i}\) is given by

$$\begin{aligned} {\textbf{s}_G}: \mathbb {R} \rightarrow {\mathcal {V}_s},{f_i} \rightarrow {v_i}, \end{aligned}$$
(3)

where \({\mathcal {V}_s}\) is the set of vertices with cardinality \(|{\mathcal {V}_s}|=N_s\). In this way, the noisy speech signals in each frame are mapped into the graph domain.

2.3 The joint graph Fourier transform

Let us denote the time-vertex graph signal as \({\textbf{X}_E}{=}\left[ \textbf{x}_{1},\textbf{x}_{2},...,\textbf{x}_{\textrm{T}}\right] \in \mathbb {R}^{N \times \textrm{T}}\), where \(\textbf{x}_1\), \(\textbf{x}_2\), \(\ldots\), \(\textbf{x}_{\textrm{T}}\) represent graph signals of length N sampled at \(1,2,\ldots,\textrm{T}\) successive regular intervals [41]. To investigate the spectral properties of \(\textbf{X}_E\), following [42, 43], the joint time-vertex graph Fourier transform (JFT) is defined as

$$\begin{aligned} \widehat{\textbf{X}_{E}}= \textrm{JFT}\{ \textbf{X}_{E}\} = \mathbf {\Psi }_{G}\textbf{X}_{E}\mathbf {\Psi }_{T}, \end{aligned}$$
(4)

where \(\mathbf {\Psi }_T\in \mathbb {C}^{\textrm{T}\times \textrm{T}}\) is the normalized discrete Fourier transform matrix with \(\Psi _{T}\left( {t,k} \right) = e^{-j\frac{2\pi (t-1)(k-1)}{\textrm{T}}} / \sqrt{\textrm{T}}\), and \(\mathbf {\Psi }_G \in {\mathbb {R}^{N\times N}}\) is the eigenvector matrix obtained by the eigendecomposition of \(\textbf{X}_E\)’s graph weight matrix. More specifically, \(\mathbf {\Psi }_T\) analyzes the time-frequency oscillations of \(\textbf{X}_E\) along the time domain, while \(\mathbf {\Psi }_G\) captures the graph-frequency characteristics of \(\textbf{X}_E\) along the graph edges. Moreover, the corresponding inverse transform (IJFT) is defined as

$$\begin{aligned} \textrm{IJFT}\{ {\widehat{{\textbf{X}_E}}}\} = {\mathbf {\Psi }_G^{-1}}{\widehat{{\textbf{X}_E}}}\mathbf {\Psi }_{T}^{T}. \end{aligned}$$
(5)

The definitions of JFT and IJFT allow us to take into account the variation of the graph and the temporal aspects of time-vertex graph signals.
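As a numerical sketch of the JFT/IJFT pair (a toy random graph, not the paper's data), the forward and inverse transforms can be checked for an exact round trip; note that the conjugate transpose of \(\Psi_T\) is used for the inversion, an assumption needed for exact reconstruction with a complex unitary DFT matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 8

# Illustrative symmetric graph weight matrix; Psi_G is its eigenvector basis
A = rng.random((N, N))
A = (A + A.T) / 2.0
np.fill_diagonal(A, 0.0)
_, Psi_G = np.linalg.eigh(A)              # orthogonal, so Psi_G^{-1} = Psi_G^T

# Normalized (unitary) DFT matrix Psi_T
t, k = np.meshgrid(np.arange(T), np.arange(T), indexing="ij")
Psi_T = np.exp(-2j * np.pi * t * k / T) / np.sqrt(T)

X = rng.random((N, T))                    # time-vertex graph signal X_E
X_hat = Psi_G @ X @ Psi_T                 # JFT as in Eq. (4)
X_rec = Psi_G.T @ X_hat @ Psi_T.conj().T  # inverse transform (IJFT)
assert np.allclose(X_rec, X)
```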

3 The K-graphs learning method for speech graph signals

This section uses the K-graphs learning method to infer a multiple-graphs model for SGSs without prior graph topologies. Note that the inter-frame relationship is not considered in traditional speech enhancement systems. Unlike noise, speech samples are strongly correlated both between and within frames. Learning a single global graph for the whole speech signal, however, would be computationally complex. Inspired by K-means and K-graphs learning [44], we therefore partition the SGSs into clusters and use the K-graphs learning method to capture the potential properties of speech samples both inter-frame and intra-frame.

Noisy speech signals are mapped into the graph domain by using Eq. (3) and are constructed as SGSs. We use the \(M_s\) speech frames in a cluster to learn a graph that captures the relationships among these frames. The SGSs in the kth cluster, \(\textbf{s}_G^k\) \(\left( {1 \le k \le K}, K= M/M_s \right)\), reside on the kth undirected multiple graph \({G_k} = ({\mathcal {V}_k},{\varvec{\mathcal {L}}_k})\), where \({\mathcal {V}_k}\) denotes the vertex set, \({\varvec{\mathcal {L}}_k}\) is the graph Laplacian matrix of \({G_k}\), and M is the total number of speech frames. Let us now investigate the graph Laplacian matrix \({\varvec{\mathcal {L}}_k}\) of \({G_k}\) in order to reveal the intrinsic relationships among speech frames in real time.

Following the K-graphs learning framework in [44], we formulate the multiple graphs learning problem of noisy speech graph signals as

$$\begin{aligned} & \min\limits_{\substack{\boldsymbol{\mathcal {L}}_{1}, \cdots \boldsymbol{\mathcal{L}}_{K} \\ \mathbf{s}_{G}^{1}, \cdots \mathbf{s}_{G}^{K} \in \mathbb{R}^{M_{s} \times N_{s}}}} \sum\limits_{k = 1}^{K} {\sum\limits_{\mathbf{s}_{G}^{k} \in \mathbf{s}_{G}^{K}} \textrm{trace}(\mathbf{s}_{G}^{k}{\boldsymbol{\mathcal{L}}_k}(\mathbf{s}_{G}^{k})^{T}) + \sum\limits_{k = 1}^{K} {\beta \|{\mathbf{1}} + {\boldsymbol{\mathcal{L}}_k} \|_F^2} + \sum\limits_{k = 1}^{K} \alpha \| {\boldsymbol{\mathcal{L}}_k}\|_1}, \nonumber\\ & \begin{array}{rl} s.t. \quad\textrm{diag}({\boldsymbol{\mathcal{L}}_k}) & \ge 0 ,\\ {}\textrm{trace}({\boldsymbol{\mathcal{L}}_k}) & = M_s. \\ {}{\boldsymbol{\mathcal{L}}_k} & = ({\boldsymbol{\mathcal{L}}_k})^T, \textrm{1} \le k \le K, \end{array} \end{aligned}$$
(6)

where \(\textrm{trace}(\textbf{s}_{G}^{k}\varvec{\mathcal {L}}_{k}(\textbf{s}_{G}^{k})^{T})\) describes the smoothness of the SGSs supported on \(G_k\), the Frobenius norm of \(\textbf{1}+\varvec{\mathcal {L}}_{k}\) controls the distribution of the edge weights and the sparsity, and the third term is a regularization function [45]. The first and third constraints ensure the non-negativity and symmetry of \(\varvec{\mathcal {L}}_{k}\), while the second constraint prevents trivial solutions and controls the volume of the corresponding multiple graph \(G_k\); \(M_s\) controls the volume of the intra-graph. \(\alpha\) and \(\beta\) are non-negative regularization parameters. As the objective function in (6) is convex, we solve it with the CVX toolbox [46] in the experimental section.
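To make the objective in (6) concrete, the following sketch evaluates its three terms for one cluster and a candidate Laplacian. This is not the authors' CVX implementation; the \(\alpha\), \(\beta\) values are illustrative assumptions, and the smoothness term is written in the orientation that is dimensionally consistent for an \(M_s \times M_s\) inter-frame Laplacian:

```python
import numpy as np

def kgraph_objective(s_k, L_k, alpha=0.1, beta=0.1):
    """Evaluate the three terms of (6) for one cluster.

    s_k : (M_s, N_s) noisy speech graph signals of cluster k
    L_k : (M_s, M_s) candidate inter-frame graph Laplacian
    """
    smooth = np.trace(s_k.T @ L_k @ s_k)                          # smoothness term
    frob = beta * np.linalg.norm(np.ones_like(L_k) + L_k, "fro") ** 2
    sparsity = alpha * np.abs(L_k).sum()                          # l1 regularization
    return smooth + frob + sparsity

# Toy check on a path graph over M_s = 4 frames
M_s, N_s = 4, 6
W = np.zeros((M_s, M_s))
j = np.arange(M_s - 1)
W[j, j + 1] = 1.0
W[j + 1, j] = 1.0
L_k = np.diag(W.sum(axis=1)) - W
L_k *= M_s / np.trace(L_k)                 # enforce the constraint trace(L_k) = M_s
s_k = np.random.default_rng(3).random((M_s, N_s))
obj = kgraph_objective(s_k, L_k)
assert np.isfinite(obj) and obj > 0.0      # quadratic form of a PSD Laplacian is >= 0
```

An actual solver (e.g., CVX, as the paper uses) would minimize this objective over \(\varvec{\mathcal {L}}_k\) subject to the stated constraints.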

We infer an intra-graph topology to investigate the internal relationships between speech samples within a frame. We denote \(\textbf{s}_g^i\) as the \(i_{th}\) row of \(\textbf{s}_G^k\), which is indexed by an intra-graph \(O_i\). Here we focus on the strong causality between adjacent speech samples within a frame. Upon denoting the graph weight matrix of \(O_i\) by \(\textbf{W}_i \in {{\mathrm {\mathbb {R}}}^{{N_s} \times {N_s}}}\), we set \({{w}_i}(m,n) = 1\) if there exists a strong causality between the vertex \(v_m\) and its adjacent vertex \(v_n\), and \({{w}_i}(m,n) = 0\) otherwise. That is,

$$\begin{aligned} {\textbf{W}_i} = {\left[ {\begin{array}{*{10}{c}} 0&{}1&{}0&{} \cdots &{} \cdots &{}0\\ 1&{}0&{}1&{} \cdots &{} \cdots &{}0\\ 0&{}1&{}0&{}1&{} \cdots &{}0\\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \ddots &{}0\\ 0 &{} \cdots &{} \cdots &{}1&{}0&{}1\\ 0&{} \cdots &{} \cdots &{} \cdots &{}1&{}0 \end{array}} \right] }. \end{aligned}$$
(7)

Then, we have \({O_i} = ({\mathcal {V}_i},{\textbf{W}_i})\), where \(\mathcal {V}_i\) represents the vertex set with cardinality \(\left| {{\mathcal {V}_i}} \right| ={N_s}\). Hence, \(M_s\) speech frames \({\textbf {f}}\) can be constructed as the speech graph signal \(\textbf{s}_G\) indexed by the multiple graphs \(G_{S}=(\mathcal {V}^{*},\varvec{\mathcal {L}}^{*})\), as shown in Fig. 1.

$$\begin{aligned} p: \mathbf {{f}} \rightarrow \mathbf{{s}_G} \in {{\mathrm {\mathbb {R}}}^{{M_s} \times {N_s}}}, \quad \text {indexed by } {G_S}={G_k}\times {O_i},\ 1 \le k \le K,\ 1 \le i \le M_s. \end{aligned}$$
(8)
Fig. 1

The visualization of the multiple graphs \(G_s\)

By applying the Cartesian product of the inter-graph Laplacian matrix \(\varvec{\mathcal {L}}_{k}\) and the intra-graph weight matrix \(\textbf{W}_i\), \(\varvec{\mathcal {L}}^{*}\) is constructed as

$$\begin{aligned} \varvec{\mathcal {L}}^{*} = \varvec{\mathcal {L}}_{k}\times \textbf{W}_{i}, 1 \le k \le K,1 \le i \le M_s, \end{aligned}$$
(9)

and the vertex set \(\mathcal {V}^{*}\) is given as

$$\begin{aligned} \mathcal {V}^* = {\mathcal {V}}_{k}\times \mathcal {V}_{i}. \end{aligned}$$
(10)
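The construction in (7), (9), and (10) can be sketched numerically. One standard matrix realization of a graph Cartesian product is the Kronecker sum of the two factor matrices; this particular realization is an assumption about how \(\varvec{\mathcal {L}}^{*} = \varvec{\mathcal {L}}_{k}\times \textbf{W}_{i}\) is formed, and the inter-frame Laplacian below is an illustrative path graph over frames:

```python
import numpy as np

M_s, N_s = 4, 6  # frames per cluster, samples per frame

# Intra-graph weight matrix W_i of Eq. (7): a path graph over adjacent samples
W_i = np.zeros((N_s, N_s))
idx = np.arange(N_s - 1)
W_i[idx, idx + 1] = 1.0
W_i[idx + 1, idx] = 1.0

# Illustrative inter-frame Laplacian L_k (here from a path graph over frames)
B = np.zeros((M_s, M_s))
j = np.arange(M_s - 1)
B[j, j + 1] = 1.0
B[j + 1, j] = 1.0
L_k = np.diag(B.sum(axis=1)) - B

# Kronecker-sum realization of the graph Cartesian product G_k x O_i
L_star = np.kron(L_k, np.eye(N_s)) + np.kron(np.eye(M_s), W_i)
assert L_star.shape == (M_s * N_s, M_s * N_s)   # |V*| = M_s * N_s, matching (10)
assert np.allclose(L_star, L_star.T)            # symmetric factors give a symmetric product
```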

4 The MMSE graph spectral magnitude estimator method

This section proposes a minimum mean-square error (MMSE) graph spectral magnitude estimator based on the learned multiple graphs above. Specifically, we first define the corresponding joint graph Fourier transform (JGFT) and the inverse JGFT (IJGFT). Then we investigate the representation of the MMSE graph spectral magnitude estimator in GSP by extending the classical MMSE short-time spectral amplitude (STSA) estimator.

4.1 The joint graph Fourier transform for SGSs

Differing from the joint graph Fourier transform definition in Section 2, we apply the singular value decomposition (SVD) to \(\textbf{W}_i\), which gives

$$\begin{aligned} \textbf{W}_{i}= \textbf{F}_{w} \mathbf {\Lambda }^{w} \textbf{D}_{w}^{T}, \end{aligned}$$
(11)

where \(\textbf{F}_{w}\) and \(\textbf{D}_{w}\) are the left and right unitary matrices of \(\textbf{W}_{i}\), respectively, and \(\mathbf {\Lambda }^{w} = \textrm{diag}(\lambda _{1}^{w},\lambda _{2}^{w},...,\lambda _{N_{s}}^{w})\) is the corresponding diagonal matrix, whose elements represent the graph frequencies along the intra-graph edges. Similarly, we have

$$\begin{aligned} \varvec{\mathcal {L}}_{k}= \textbf{F}_{\mathcal {L}} \mathbf {\Lambda }^{\mathcal {L}} \textbf{D}_{\mathcal {L}}^{T}, \end{aligned}$$
(12)

where \(\textbf{F}_{\mathcal {L}}\) and \(\textbf{D}_{\mathcal {L}}\) represent the left and right unitary matrices of \(\varvec{\mathcal {L}}_{k}\), respectively, and the element \(\lambda _{j}^{\mathcal {L}}\) \(\left( 1 \le j \le M_{s} \right)\) of \(\mathbf {\Lambda }^{\mathcal {L}}\) represents the graph frequency along the inter-graph edges. The joint graph Fourier transform (JGFT) of \(\textbf{s}_{G}^{k}\) can be defined as

$$\begin{aligned} \textbf{S}_{\mathcal {F}}^{k}= \textrm{JGFT}\{\textbf{s}_{G}^{k}\} = (\textbf{F}_{w})^{- 1}\textbf{s}_{G}^{k}(\textbf{F}_{\mathcal {L}})^{- 1}, \end{aligned}$$
(13)

where \(\textbf{S}_{\mathcal {F}}^{k}\) is the graph Fourier version of \(\textbf{s}_{G}^{k}\). Moreover, the inverse transform (IJGFT) of \(\textbf{S}_{\mathcal {F}}^{k}\) is defined as

$$\begin{aligned} \textbf{s}_{G}^{k}= \textrm{IJGFT} \{\textbf{S}_{\mathcal {F}}^{k}\} = \textbf{F}_{w}\textbf{S}_{\mathcal {F}}^{k}\textbf{F}_{\mathcal {L}}. \end{aligned}$$
(14)

It should be noted that the defined JGFT maps speech signals into the graph frequency domain, where the resulting graph magnitude spectra of speech signals are real-valued.
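The JGFT/IJGFT pair in (11)-(14) can be checked for an exact round trip. The sketch below uses the path-graph \(\textbf{W}_i\) of (7) and an illustrative symmetric inter-frame Laplacian; the \((N_s, M_s)\) orientation of the signal matrix is chosen here so the products in (13)-(14) are dimensionally consistent:

```python
import numpy as np

rng = np.random.default_rng(1)
M_s, N_s = 4, 6

# Path-graph intra weight matrix W_i and an illustrative inter Laplacian L_k
W_i = np.zeros((N_s, N_s))
i = np.arange(N_s - 1)
W_i[i, i + 1] = 1.0
W_i[i + 1, i] = 1.0
B = rng.random((M_s, M_s))
B = B + B.T
L_k = np.diag(B.sum(axis=1)) - B

F_w, _, _ = np.linalg.svd(W_i)   # left unitary factor of W_i, Eq. (11)
F_L, _, _ = np.linalg.svd(L_k)   # left unitary factor of L_k, Eq. (12)

s_k = rng.random((N_s, M_s))     # a cluster of SGSs
S_F = np.linalg.inv(F_w) @ s_k @ np.linalg.inv(F_L)  # JGFT, Eq. (13)
s_rec = F_w @ S_F @ F_L                              # IJGFT, Eq. (14)
assert np.allclose(s_rec, s_k)   # exact reconstruction
```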

4.2 The MMSE graph spectral magnitude estimator for SGSs

Let us now investigate the MMSE graph spectral magnitude estimator. We denote \(\textbf{s}_G^k = {\textbf{x}_G} + {\textbf{n}_G}\), where \({\textbf{x}_G}\) is the clean SGS and \({\textbf{n}_G}\) is the additive graph noise signal, which is independent of \({\textbf{x}_G}\). By performing the JGFT defined in (13), we have

$$\begin{aligned} \textbf{S}_\mathcal {F}^k = {\textbf{X}_\mathcal {F}} + {\textbf{N}_\mathcal {F}} \in {{\mathbb {R}}} \end{aligned}$$
(15)

where \(\textbf{S}_{\mathcal {F}}^k\), \({\textbf{X}_\mathcal {F}}\), and \({\textbf{N}_\mathcal {F}}\) are the JGFT coefficients of \(\textbf{s}_G^k\), \({\textbf{x}_G}\), and \({\textbf{n}_G}\), respectively. Upon denoting the \(i_{th}\) row of \(\textbf{S}_\mathcal {F}^k\), \({\textbf{X}_\mathcal {F}}\), and \({\textbf{N}_\mathcal {F}}\) by \(\textbf{Y}^i\), \(\textbf{X}^i\), and \(\textbf{N}^i\), respectively, a graph speech sample on a vertex can be denoted as \(\textrm{Y}_j^i = \textrm{X}_j^i + \textrm{N}_j^i\), where \(i=1,2,3,...,K\) and \(j=0,1,2,...,{N_s}-1\).

Let us denote \(R_j^i = \left| {\textrm{Y}_j^i} \right|\) and \(\textrm{Z}_j^i = \left| {\textrm{X}_j^i} \right|\). Based on the work in [46], the MMSE estimator of the graph spectral magnitude \({\textrm{Z}_j^i}\) can be obtained as

$$\begin{aligned} \widehat{Z_j^i}= E(Z_j^i|Y_j^i)= \frac{{\int _0^\infty {Z_j^i\,p(Y_j^i|Z_j^i)p(Z_j^i)\,dZ_j^i} }}{{\int _0^\infty {p(Y_j^i|Z_j^i)p(Z_j^i)\,dZ_j^i}}}. \end{aligned}$$
(16)

Under the Gaussian statistical model for the spectral components, we have

$$\begin{aligned} p(Y_j^i|Z_j^i)= \frac{1}{{\pi ({\lambda _{\textrm{n}}})_j^i}}\textrm{exp} \left\{ {-\frac{1}{{({\lambda _{\textrm{n}}})_j^i}}|Y_j^i - Z_j^i{|^2}} \right\} , \end{aligned}$$
(17)
$$\begin{aligned} p\left( {Z_j^i} \right) = \frac{{Z_j^i}}{{\pi \left( {{\lambda _{\textrm{x}}}} \right) _j^i}}\textrm{exp} \left\{ {-\frac{{{{\left( {Z_j^i} \right) }^2}}}{{\left( {{\lambda _{\textrm{x}}}} \right) _j^i}}} \right\} , \end{aligned}$$
(18)

where \(\;\left( {{\lambda _{\textrm{x}}}} \right) _j^i = E\left[ {{{\left( {X_j^i}\right) }^2}}\right]\) and \(\;\left( {{\lambda _{\textrm{n}}}} \right) _j^i = E\left[ {{{\left( {N_j^i} \right) }^2}}\right]\) are the \(j_{th}\) SGS and the graph noise variance for \({X}_j^i\) and \({N}_j^i\) respectively. By substituting (17) and (18) into \(\int _0^\infty {Z_j^ip(Y_j^i|Z_j^i)p(Z_j^i)dZ_j^i}\) and \(\int _0^\infty {p(Y_j^i|Z_j^i)p(Z_j^i)dZ_j^i}\), we have

$$\begin{aligned} \xi= & {} \int _0^\infty {Z_j^ip(Y_j^i|Z_j^i)p(Z_j^i)dZ_j^i} \nonumber \\= & {} \int _0^\infty {\frac{1}{{\pi ({\lambda _{\textrm{n}}})_j^i}}\textrm{exp} \left\{ {-\frac{1}{{({\lambda _{\textrm{n}}})_j^i}}|Y_j^i - Z_j^i{|^2}} \right\} \frac{{{{(Z_j^i)}^2}}}{{\pi \left( {{\lambda _{\textrm{x}}}} \right) _j^i}}\textrm{exp} \left\{ {-\frac{{{{\left( {Z_j^i}\right) }^2}}}{{\left( {{\lambda _{\textrm{x}}}} \right) _j^i}}} \right\} dZ_j^i}, \end{aligned}$$
(19)

and

$$\begin{aligned} \zeta= & {} \int _0^\infty {p(Y_j^i|Z_j^i)p(Z_j^i)dZ{{_j^i}}} \nonumber \\= & {} \int _0^\infty {\frac{1}{{\pi ({\lambda _{\textrm{n}}})_j^i}}\textrm{exp} \left\{ {- \frac{1}{{({\lambda _{\textrm{n}}})_j^i}}|Y_j^i - Z_j^i{|^2}} \right\} \frac{{Z_j^i}}{{\pi \left( {{\lambda _{\textrm{x}}}} \right) _j^i}}\textrm{exp} \left\{ { - \frac{{{{\left( {Z_j^i} \right) }^2}}}{{\left( {{\lambda _{\textrm{x}}}} \right) _j^i}}} \right\} dZ_j^i} . \end{aligned}$$
(20)

Then, we can rewrite (16) equivalently as

$$\begin{aligned} \widehat{Z_{j}^{i}} = \frac{\xi }{\zeta } = \frac{\int _{0}^{\infty }\textrm{exp} \left\{ -\frac{1}{(\lambda _{\textrm{n}})_{j}^{i}}|Y_{j}^{i}- Z_{j}^{i}|^{2}\right\} \textrm{exp} \left\{ - \frac{\left( Z_{j}^{i} \right) ^{2}}{(\lambda _{x})_{j}^{i}}\right\} (Z_{j}^{i})^{2}dZ_{j}^{i}}{\int _{0}^{\infty }\textrm{exp}\left\{ -\frac{1}{(\lambda _{\textrm{n}})_{j}^{i}}|Y_{j}^{i} - Z_{j}^{i}|^{2} \right\} \textrm{exp} \left\{ -\frac{\left( Z_{j}^{i}\right) ^{2}}{(\lambda _{\textrm{x}})_{j}^{i}}\right\} Z_{j}^{i}dZ_{j}^{i}}. \end{aligned}$$
(21)

Completing the square in the exponent (the constant factor independent of \(Z_j^i\) cancels in the ratio), we arrive at

$$\begin{aligned} \widehat{Z_j^i}= \frac{\xi }{\zeta }= \frac{{\int _0^\infty {\textrm{exp} \left\{ {-\frac{{({\lambda _{\textrm{n}}})_j^i + ({\lambda _x})_j^i}}{{({\lambda _{\textrm{n}}})_j^i({\lambda _x})_j^i}}{{\Big (Z_j^i - \frac{{Y_j^i({\lambda _x})_j^i}}{{({\lambda _{\textrm{n}}})_j^i + ({\lambda _x})_j^i}}\Big )}^2}} \right\} {{(Z_j^i)}^2}dZ_j^i} }}{{\int _0^\infty {\textrm{exp} \left\{ { - \frac{{({\lambda _{\textrm{n}}})_j^i + ({\lambda _x})_j^i}}{{({\lambda _{\textrm{n}}})_j^i({\lambda _x})_j^i}}{{\Big (Z_j^i - \frac{{Y_j^i({\lambda _x})_j^i}}{{({\lambda _{\textrm{n}}})_j^i + ({\lambda _x})_j^i}}\Big )}^2}} \right\} Z_j^i\,dZ_j^i} }}. \end{aligned}$$
(22)

Upon denoting \({\sigma ^2} = \frac{{({\lambda _{\textrm{n}}})_j^i+({\lambda _x})_j^i}}{{({\lambda _{\textrm{n}}})_j^i({\lambda _x})_j^i}}\) and \(\tau = \frac{{Y_j^i({\lambda _x})_j^i}}{{({\lambda _{\textrm{n}}})_j^i + ({\lambda _x})_j^i}}\), we can rewrite (22) equivalently as

$$\begin{aligned} \widehat{Z_j^i}= \frac{\xi }{\zeta }= \frac{{\int _0^\infty {\textrm{exp} \left\{ {-{\sigma ^2}{{(Z_j^i- \tau )}^2}}\right\} {{(Z_j^i)}^2}dZ_j^i}}}{{\int _0^\infty {\textrm{exp} \left\{ {-{\sigma ^2}{{(Z_j^i- \tau )}^2}} \right\} (Z_j^i)dZ_j^i}}}. \end{aligned}$$
(23)

Let us now analyze \(\xi\). Upon denoting \(y = Z_j^i\sigma - \sigma \tau\), we have

$$\begin{aligned} \xi= & {} \int _0^\infty {\textrm{exp} \left\{ {-{\sigma ^2}{{(Z_j^i- \tau )}^2}} \right\} {{(Z_j^i)}^2}dZ_j^i} \nonumber \\= & {} \frac{1}{{{\sigma ^3}}}\int _0^{\infty } \textrm{exp} \left\{ { - {{(\sigma Z_j^i - \sigma \tau )}^2}}\right\} {{(\sigma Z_j^i)}^2}\,d(\sigma Z_j^i) \\= & {} \frac{1}{{{\sigma ^3}}}\int _{-\sigma \tau }^{\infty } {{{(y + \sigma \tau )}^2}\textrm{exp} \left\{ {-{y^2}} \right\} dy}. \nonumber \end{aligned}$$
(24)

Let us introduce the Gauss error function \(erf(x) = \frac{2}{{\sqrt{\pi }}}\int _0^x {{e^{ - {\eta ^2}}}} d\eta\). Expressing (24) in terms of erf(x) (see Appendix 1) gives

$$\begin{aligned} \xi = \frac{\tau }{{2{\sigma ^2}}}\textrm{exp} \left\{ {-{\sigma ^2}{\tau ^2}} \right\} + \frac{1}{{{\sigma ^3}}}\left(\frac{{\sqrt{\pi }}}{4} + \frac{{\sqrt{\pi }}}{2}{\sigma ^2}{\tau ^2}\right)+\frac{{1 + 2{\sigma ^2}{\tau ^2}}}{{2{\sigma ^3}}}\frac{{\sqrt{\pi }}}{2}erf(\sigma \tau ). \end{aligned}$$
(25)

Similarly, we have

$$\begin{aligned} \zeta = \frac{1}{{2{\sigma ^2}}}\textrm{exp} \left\{ {-{\sigma ^2}{\tau ^2}} \right\} + \frac{{\sqrt{\pi }}}{2}\frac{\tau }{\sigma }+ \frac{{\sqrt{\pi }}}{2}\frac{\tau }{\sigma }erf(\sigma \tau ). \end{aligned}$$
(26)

By combining (23), (25), and (26) (see Appendix 2), we arrive at

$$\begin{aligned} \widehat{Z_j^i} = \frac{{\frac{\tau }{{2{\sigma ^2}}}\textrm{exp} \left\{ { - {\sigma ^2}{\tau ^2}} \right\} + \frac{1}{{{\sigma ^3}}}(\frac{{\sqrt{\pi }}}{4} + \frac{{\sqrt{\pi }}}{2}{\sigma ^2}{\tau ^2}) + \frac{{1 + 2{\sigma ^2}{\tau ^2}}}{{2{\sigma ^3}}}\frac{{\sqrt{\pi }}}{2}erf(\sigma \tau )}}{{\frac{1}{{2{\sigma ^2}}}\textrm{exp} \left\{ { - {\sigma ^2}{\tau ^2}} \right\} + \frac{{\sqrt{\pi }}}{2}\frac{\tau }{\sigma } + \frac{{\sqrt{\pi }}}{2}\frac{\tau }{\sigma }erf(\sigma \tau )}}R_j^i. \end{aligned}$$
(27)
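As a sanity check on the closed-form expressions above, the ratio of integrals in (23) can also be evaluated by direct numerical quadrature. This is a sketch: the truncated integration range and grid size are implementation choices, not from the paper, and \(\sigma\), \(\tau\) follow the definitions above (23):

```python
import numpy as np

def mmse_graph_magnitude_numeric(tau, sigma, n=20000):
    """Evaluate the ratio of integrals in Eq. (23) on a truncated grid."""
    z_max = abs(tau) + 10.0 / sigma        # the Gaussian weight is negligible beyond this
    z = np.linspace(0.0, z_max, n)
    dz = z[1] - z[0]
    w = np.exp(-sigma**2 * (z - tau) ** 2)
    num = np.sum(z**2 * w) * dz            # numerator of (23)
    den = np.sum(z * w) * dz               # denominator of (23)
    return num / den

# As sigma grows, the weight concentrates at tau, so the estimate tends to tau
est = mmse_graph_magnitude_numeric(tau=1.0, sigma=50.0)
assert abs(est - 1.0) < 1e-2
```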

Let us introduce the prior signal-to-noise ratio \(\varphi _j^i{\text { = }}\frac{{({\lambda _{\text {x}}})_j^i}}{{({\lambda _{\text {n}}})_j^i}}\) and the posterior signal-to-noise ratio \(\delta _j^i{\text { = }}\frac{{(R_j^i)^2}}{{({\lambda _{\text {n}}})_j^i}}\). Since \({\sigma ^2} = \frac{{({\lambda _{\text {n}}})_j^i + ({\lambda _x})_j^i}}{{({\lambda _{\text {n}}})_j^i({\lambda _x})_j^i}}\) and \(\tau = \frac{{Y_j^i({\lambda _x})_j^i}}{{({\lambda _{\text {n}}})_j^i + ({\lambda _x})_j^i}}\), we have

$$\begin{aligned} \gamma _j^i = \sqrt{\frac{{({\lambda _{\text {n}}})_j^i + ({\lambda _x})_j^i}}{{({\lambda _{\text {n}}})_j^i({\lambda _x})_j^i}}} = \sqrt{\frac{{1 + \varphi _j^i}}{{\varphi _j^i ({\lambda _{\text {n}}})_j^i}}}, \quad \vartheta _j^i = \frac{{Y_j^i({\lambda _x})_j^i}}{{({\lambda _{\text {n}}})_j^i + ({\lambda _x})_j^i}}= \frac{{Y_j^i\varphi _j^i}}{{1 + \varphi _j^i}}. \end{aligned}$$
(28)

By combining (23), (27) and (28), we can obtain the gain function of the MMSE graph spectral magnitude estimator given as

$$\begin{aligned} G_{\text {GMMSE}}(\gamma _{j}^{i},\vartheta _{j}^{i})= & {} \frac{{\widehat{Z_j^i}}}{{R_j^i}} \nonumber \\= & {} \frac{{\frac{{\vartheta _j^i}}{{2{{\left( {\gamma _j^i} \right) }^2}}}\textrm{exp} \left\{ {-{{(\gamma _j^i\vartheta _j^i)}^2}} \right\} + \frac{1}{{{{\left( {\gamma _j^i} \right) }^3}}}(\frac{{\sqrt{\pi }}}{4} + \frac{{\sqrt{\pi }}}{2}{{(\gamma _j^i\vartheta _j^i)}^2}) + \frac{{1 + 2{{(\gamma _j^i\vartheta _j^i)}^2}}}{{2{{\left( {\gamma _j^i} \right) }^3}}}\frac{{\sqrt{\pi }}}{2}erf(\gamma _j^i\vartheta _j^i)}}{{\frac{1}{{2{{\left( {\gamma _j^i} \right) }^2}}}\textrm{exp} \left\{ { - {{(\gamma _j^i\vartheta _j^i)}^2}} \right\} + \frac{{\sqrt{\pi }}}{2}\frac{{\vartheta _j^i}}{{\gamma _j^i}} + \frac{{\sqrt{\pi }}}{2}\frac{{\vartheta _j^i}}{{\gamma _j^i}}erf(\gamma _j^i\vartheta _j^i)}} \nonumber \\ \approx & {} \frac{{\frac{{C\vartheta _j^i}}{{2{{\left( {\gamma _j^i} \right) }^2}}} + \frac{1}{{{{\left( {\gamma _j^i} \right) }^3}}}(\frac{{\sqrt{\pi }}}{4} + \frac{{\sqrt{\pi }}}{2}{{(\gamma _j^i\vartheta _j^i)}^2}) + \frac{{1 + 2{{(\gamma _j^i\vartheta _j^i)}^2}}}{{2{{\left( {\gamma _j^i} \right) }^3}}}\frac{{\sqrt{\pi }}}{2}erf(\gamma _j^i\vartheta _j^i)}}{{\frac{C}{{2{{\left( {\gamma _j^i} \right) }^2}}} + \frac{{\sqrt{\pi }}}{2}\frac{{\vartheta _j^i}}{{\gamma _j^i}} + \frac{{\sqrt{\pi }}}{2}\frac{{\vartheta _j^i}}{{\gamma _j^i}}erf(\gamma _j^i\vartheta _j^i)}}, \end{aligned}$$
(29)

where C is a constant. Following the classical decision-directed approach in [25], \(\varphi _j^i\) is updated as

$$\begin{aligned} \varphi _j^i = (1-{\upsilon ^2})\varphi _{j - 1}^i +{\upsilon ^2}\max ((\delta _j^i - 1),0), \end{aligned}$$
(30)

where \(\upsilon \in (0,1)\) is the gain parameter. For convenience, we denote the proposed K-graphs-learning-based MMSE graph spectral magnitude estimator as GMMSE-KGL in the following sections.
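One step of the decision-directed update in (30) can be sketched as follows; the value \(\upsilon = 0.98\) is an illustrative assumption in the spirit of the classical MMSE-STSA literature, not a value stated in the paper:

```python
def decision_directed_update(prior_prev, post, upsilon=0.98):
    """One step of Eq. (30).

    prior_prev : prior SNR estimate at the previous index
    post       : current posterior SNR
    upsilon    : gain parameter in (0, 1), illustrative value
    """
    u2 = upsilon ** 2
    return (1.0 - u2) * prior_prev + u2 * max(post - 1.0, 0.0)

# With post <= 1, only the smoothed previous estimate survives:
# (1 - 0.98**2) * 1.0 = 0.0396
assert abs(decision_directed_update(1.0, 0.5) - 0.0396) < 1e-9
```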

5 Numerical results and discussions

In this section, we present the output signal-to-noise ratio (SNR), perceptual evaluation of speech quality (PESQ) [47], log-likelihood ratio (LLR) [48], and short-time objective intelligibility (STOI) [49] measure results of the proposed MMSE graph spectral magnitude estimator. The traditional MMSE-STSA method in [25], the optimal modified minimum mean-square error log-spectral method (OMLSA) in [26], the improved graph Wiener filtering method (GWF-SGS) in [30], the vertex-frequency graph Wiener filtering (VFGWF) in [32], the graph Wiener filtering method for directed cyclic time series (GWF-DCGS), and the graph Wiener filtering for arbitrary graph signals (GWF-AGS) in [33] are used as the benchmarks. The output LLR is defined as

$$\begin{aligned} \textrm{LLR} = \log \left( \overrightarrow{d_{p}} \textbf{R}_{\textbf{c}}\overrightarrow{d_{p}}^{T} / \overrightarrow{d_{c}} \textbf{R}_{\textbf{c}}\overrightarrow{d_{c}}^{T} \right) , \end{aligned}$$
(31)

where \({\overrightarrow{{d_p}}}\) and \({\overrightarrow{{d_c}}}\) represent the linear prediction coefficient (LPC) vectors of the enhanced speech frame and the original speech frame, respectively, and \(\mathbf {R_c}\) represents the autocorrelation matrix of the original speech signals [50]. In our numerical simulations, the noisy speech signals are generated by mixing clean speech signals from the TIMIT database [51] with noise signals at input signal-to-noise ratios (SNRs) from −15 to 5 dB. Two hundred sentences from 20 speakers (10 female and 10 male) are used as the clean speech signals. White noise, Gaussian color noise, and Babble noise from the NOISEX-92 library [52] are used as the noise signals. The sampling frequency is 16 kHz. The speech signals are framed by a Hamming window with a length of 256 points and an overlap of \(50\%\).
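As an illustration, Eq. (31) can be computed per frame as follows. This is a numpy sketch assuming the standard autocorrelation-method LPC; the function names and the order of 10 are our choices, not from the paper:

```python
import numpy as np

def lpc_vector(frame, order=10):
    """Autocorrelation-method LPC; returns d = [1, -a_1, ..., -a_p]."""
    n = len(frame)
    r = np.correlate(frame, frame, mode='full')[n - 1:n + order]  # lags 0..order
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])  # Yule-Walker equations
    return np.concatenate(([1.0], -a))

def llr(clean_frame, enhanced_frame, order=10):
    """LLR of Eq. (31): log(d_p R_c d_p^T / d_c R_c d_c^T)."""
    d_c = lpc_vector(clean_frame, order)
    d_p = lpc_vector(enhanced_frame, order)
    n = len(clean_frame)
    r = np.correlate(clean_frame, clean_frame, mode='full')[n - 1:n + order]
    R_c = np.array([[r[abs(i - j)] for j in range(order + 1)]
                    for i in range(order + 1)])
    return float(np.log((d_p @ R_c @ d_p) / (d_c @ R_c @ d_c)))
```

Since \(\overrightarrow{d_c}\) minimizes the quadratic form over the clean-frame autocorrelation matrix, the ratio inside the logarithm is at least 1, so the per-frame LLR is non-negative and equals 0 when the two frames share the same LPC envelope.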

Figure 2 shows the traditional spectrogram obtained by the discrete Fourier transform (DFT) and the graph spectrogram obtained by the proposed graph Fourier basis based on the graph weight matrix \(\textbf{W}_i\) of the inter-graph. Observe from Fig. 2 that the graph spectrogram is mainly distributed in the high graph frequency regions, while the traditional spectrogram is mainly distributed in the low-frequency regions. The reason is that, by Theorem 1 on graph frequency ordering in [53], the smallest eigenvalue of \(\textbf{W}_i\) corresponds to the lowest graph frequency and its largest eigenvalue to the highest, which differs from the ordering of traditional frequencies. Hence, although the graph spectrogram visually resembles the conventional spectrogram, its frequency content is interpreted quite differently from that of the traditional spectrum. In addition, the proposed graph Fourier basis maps speech signals into a real-valued graph frequency domain by applying the eigenvector matrix of \(\textbf{W}_i\).
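The graph spectrogram construction described above can be sketched as follows. This is a minimal numpy sketch; the paper obtains \(\textbf{W}_i\) by K-graphs learning, whereas here `W` stands for any symmetric weight matrix of an undirected graph, and the function names are ours:

```python
import numpy as np

def graph_fourier_basis(W):
    """Eigendecomposition of a symmetric graph weight matrix W.

    For an undirected graph, W is symmetric, so the spectrum is real and
    the eigenvector matrix V is orthonormal; the ascending eigenvalue
    order gives a graph frequency ordering (cf. [53]).
    """
    lam, V = np.linalg.eigh(W)  # eigh returns eigenvalues in ascending order
    return lam, V

def graph_spectrogram(frames, W):
    """Columns of `frames` are speech frames; returns |GFT| of each frame."""
    _, V = graph_fourier_basis(W)
    return np.abs(V.T @ frames)
```

Because V is orthonormal, `V @ (V.T @ x)` recovers `x`, i.e., the inverse graph Fourier transform is simply multiplication by V.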

Fig. 2

Clean speech signals and their spectrograms. a The clean speech signal. b The traditional spectrogram. c The graph spectrogram

Figure 3 shows the output SNR of the proposed GMMSE-KGL method in the case of white noise versus the frame number \(M_s\), where \(N_s=256\). Observe from Fig. 3 that, to achieve a high output SNR, \(M_s\) should be neither too small nor too large. When \(M_s\) is small, the relationships among non-adjacent frames cannot be well described by the resulting small multiple graphs. In contrast, when \(M_s\) is large, the boundaries between sub-multiple graphs might show a similar tendency, which degrades the output SNR, because larger multiple graphs lose some details of the speech samples and the graph spectral magnitude can then no longer be estimated well. Hence, a suitable range is \(M_s \in [20,30]\) in the case of \(N_s=256\), and we use \(M_s=30\) in the numerical simulations below.

Fig. 3

The output SNR versus the frame number \(M_s\) of the proposed GMMSE-KGL method

Figure 4 shows the output SNR results of the proposed GMMSE-KGL method under white noise for different vertex numbers \(N_s\) of the intra-graph topology \(O_i\), where \(M_s=30\). We can see from Fig. 4 that the performance of the proposed GMMSE-KGL method decreases as \(N_s\) increases. The reason is that the intra-graph topology becomes larger and more complex as \(N_s\) increases, so the gain function of the proposed GMMSE-KGL method cannot be accurately estimated from the SGS's graph power spectrum on \(O_i\). The proposed GMMSE-KGL method with \(N_s=64\) would therefore obtain better performance. Nevertheless, for a fair comparison, we use frames of length 256 and build our intra-graph with \(N_s=256\), that is, \(O_i = (\mathcal {V}_{i},{\textbf {W}}_{i}^{256 \times 256})\).

Fig. 4

The output SNR versus the sample number \(N_s\) of the proposed GMMSE-KGL method

Table 1 shows the PESQ results of the K-graphs learning (KGL), the graph k-shift operator (GKSO) [30], and the graph learning (GL) [32] approaches, each followed by the GMMSE method, in the case of white noise. For writing convenience, we denote these methods as GMMSE-KGL, GMMSE-GKSO, and GMMSE-GL, respectively. We can observe from Table 1 that the PESQ of the proposed GMMSE-KGL outperforms that of GMMSE-GKSO and GMMSE-GL, which illustrates the effectiveness of the K-graphs learning part in speech enhancement. Moreover, the PESQ of the proposed GMMSE-KGL is 0.5 higher than that of GMMSE-GKSO and 0.2 higher than that of GMMSE-GL when the input SNR is larger than −5 dB.

Table 1 The PESQ of the different graph models followed by the GMMSE method

Table 2 shows the PESQ results of the proposed GMMSE and the graph Wiener filtering (GWF) methods, each combined with the K-graphs learning method, in the case of white noise. For writing convenience, we denote these two methods as GMMSE-KGL and GWF-KGL. From Table 2, we can see that the PESQ of the proposed GMMSE-KGL is 0.2 higher than that of GWF-KGL, which demonstrates the effectiveness of the GMMSE part in speech enhancement. Tables 1 and 2 together illustrate that the K-graphs learning part and the GMMSE-based enhancement part contribute almost equally to the performance improvement.

Table 2 The PESQ of different graph speech enhancement methods followed by the K-graphs learning method

Table 3 shows the PESQ results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods for the case of white noise. Observe from Table 3 that, though the proposed GMMSE-KGL method yields a slightly lower PESQ value than the VFGWF method at −15 dB input SNR, it outperforms all the benchmarks in terms of PESQ when the input SNR is above −10 dB, which shows the advantage of the proposed GMMSE-KGL method.

Table 3 The PESQ of different methods for white noise

Table 4 shows the LLR results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods for the case of white noise. We can see from Table 4 that the proposed GMMSE-KGL method performs better than the GWF-DCGS, GWF-AGS, VFGWF, and MMSE-STSA methods in terms of LLR. However, when the input SNR is between −10 and −5 dB, the proposed GMMSE-KGL method performs worse than the OMLSA method, because OMLSA estimates the noise very well by applying the minima controlled recursive averaging (MCRA) method and can thus estimate the log-spectral amplitude of the clean speech signals accurately.

Table 4 The LLR of different methods for white noise

Table 5 shows the output SNR results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods for the case of white noise. From Table 5, we can see that the output SNR of the proposed GMMSE-KGL method is higher than that of the GWF-DCGS and GWF-AGS methods, regardless of the input SNR. It is also higher than that of the GWF-SGS, VFGWF, and OMLSA methods when the input SNR is lower than \(-5\) dB. However, when the input SNR is higher than \(-5\) dB, the proposed GMMSE-KGL method yields a lower output SNR than the MMSE-STSA, OMLSA, GWF-SGS, and VFGWF methods. The reason is that the proposed GMMSE-KGL method may misestimate the graph spectra of the clean speech signals and therefore cannot estimate the graph spectral magnitude of the clean speech accurately.

Table 5 The output SNR of different methods for white noise

Table 6 shows the STOI results of the proposed GMMSE-KGL method and the benchmarks in the case of white noise. Observe from Table 6 that the proposed GMMSE-KGL method outperforms the OMLSA, MMSE-STSA, GWF-AGS, and GWF-DCGS methods overall. However, it performs worse than the GWF-SGS and VFGWF methods in terms of STOI. The reason is that, when the dynamic graphs are learned by the K-graphs learning method, the graph spectrum of some noise is mistaken for clean speech details, which prevents accurate estimation of the graph spectrum of the clean speech signals.

Table 6 The STOI of different methods for white noise

Table 7 reports the PESQ results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods in a comparative study on suppressing Gaussian color noise. Observe from Table 7 that the proposed GMMSE-KGL method outperforms all the benchmarks in terms of PESQ overall. However, when the input SNR is −15 dB, the PESQ of the proposed GMMSE-KGL method is lower than that of the OMLSA and GWF-SGS methods.

Table 7 The PESQ of different methods for Gaussian color noise

Table 8 shows the LLR results in a comparative study on suppressing Gaussian color noise. Observe from Table 8 that the LLR results of the proposed GMMSE-KGL method are slightly higher than those of the traditional MMSE-STSA and OMLSA methods from 0 to 5 dB and lower than those of the GWF-DCGS, GWF-AGS, GWF-SGS, and VFGWF methods when the input SNR is no less than 0 dB. This illustrates that the proposed GMMSE-KGL method provides a slightly better spectral envelope than the traditional MMSE-STSA and OMLSA methods, although it cannot estimate the graph spectrum as well as the existing GSP-based methods.

Table 8 The LLR of different methods for Gaussian color noise

Table 9 shows the output SNR results in a comparative study on suppressing Gaussian color noise. From Table 9, we observe that, in terms of output SNR, the proposed GMMSE-KGL method outperforms the GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, and OMLSA methods at negative input SNRs. At positive input SNRs, the relationships among noise frames cannot be captured by the K-graphs learning method, so the noise power spectra cannot be estimated in real time, and the performance of the proposed GMMSE-KGL method falls below that of the GWF-SGS and VFGWF methods. Additionally, compared to the MMSE-STSA method, the proposed GMMSE-KGL method yields only a small improvement in output SNR, because it removes more details of the clean speech signals.

Table 9 The output SNR of different methods for Gaussian color noise

Table 10 shows the STOI results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods in the case of Gaussian color noise. Observe from Table 10 that the STOI results of the proposed GMMSE-KGL method are higher than those of the MMSE-STSA, OMLSA, GWF-DCGS, and GWF-AGS methods when the input SNR is between −15 and 5 dB. Note that the proposed GMMSE-KGL method mistakes the graph spectrum of some Gaussian color noise for that of the clean speech signals, resulting in lower STOI results than those of the GWF-SGS and VFGWF methods.

Table 10 The STOI of different methods for Gaussian color noise

Table 11 shows the PESQ results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods for Babble noise. Observe from Table 11 that the proposed GMMSE-KGL method outperforms all the benchmarks in terms of PESQ with input SNR ranging from −15 to 5 dB.

Table 11 The PESQ of different methods for Babble noise

Table 12 shows the LLR results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods for Babble noise. Observe from Table 12 that the proposed GMMSE-KGL method performs better than the traditional MMSE-STSA and OMLSA methods when the input SNR is between −10 and 5 dB. However, it does not outperform the GWF-SGS, VFGWF, GWF-DCGS, and GWF-AGS methods over the range from −15 to 5 dB, because it cannot estimate the graph spectrum of this nonstationary noise as well as those methods.

Table 12 The LLR of different methods for Babble noise

Table 13 shows the output SNR results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods for Babble noise. Observe from Table 13 that the proposed GMMSE-KGL method outperforms the GWF-SGS, VFGWF, and MMSE-STSA methods when the input SNR is lower than −5 dB. Though the GWF-AGS and GWF-DCGS methods can be better than GMMSE-KGL at negative input SNRs, the proposed GMMSE-KGL method performs better than both when the input SNRs are positive.

Table 13 The output SNR of different methods for Babble noise

Table 14 shows the STOI results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods for Babble noise. It can be observed from Table 14 that, when the input SNR is less than 0 dB, the STOI results of the proposed GMMSE-KGL method are higher than those of the MMSE-STSA, OMLSA, and GWF-SGS methods and lower than those of the VFGWF, GWF-AGS, and GWF-DCGS methods. The situation is reversed when the input SNR is not less than 0 dB. This implies that the proposed GMMSE-KGL method may fail to estimate the graph spectrum of Babble noise, and some speech details may even be regarded as Babble noise and removed at higher input SNRs.

Table 14 The STOI of different methods for Babble noise

Table 15 gives the computational complexity of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods. Here we discuss the computational complexity caused by the graph/discrete Fourier transform and its inverse. To be specific, let M, \(M_s\), \(N_s\), and L represent the total number of noisy speech frames, the number of noisy speech frames in a cluster, the length of a noisy speech frame, and the tap number of the graph Wiener filter, respectively. Based on the computational complexity discussion in [30, 32] and matrix multiplication theory, we can observe from Table 15 that the computational complexity of the proposed GMMSE-KGL method is higher than that of the traditional methods, because the proposed method does not admit a fast graph Fourier transform, while all the traditional methods use the FFT. Meanwhile, the proposed GMMSE-KGL method, based on the Fourier basis of the sparse matrix \(\varvec{\mathcal {L}}^{*}\), has a lower computational complexity than both the GWF-DCGS and GWF-AGS methods. Moreover, since \(M_s\) is smaller than M, the computational complexity of the proposed GMMSE-KGL method is lower than that of the VFGWF method.

Table 15 The computational complexity of different methods

6 Conclusions

This paper used the K-graphs learning method to learn multiple graphs for speech signals, which capture, in real time, the intrinsic relationships among frames and between speech samples within a frame. On this basis, we developed a representation of the MMSE graph spectral magnitude estimator and evaluated the proposed GMMSE-KGL method on speech enhancement at different input SNRs. The experimental results showed that the proposed GMMSE-KGL method outperformed the graph Wiener filtering methods in GSP in terms of PESQ and was comparable to some well-performing traditional baseline methods in DSP in terms of LLR, STOI, and output SNR.

Availability of data and materials

Not applicable.

References

  1. A. Ortega, P. Frossard, J. Kovačević, J.M.F. Moura, P. Vandergheynst, Graph signal processing: Overview, challenges, and applications. Proc. IEEE 106(5), 808–828 (2018). https://doi.org/10.1109/JPROC.2018.2820126


  2. D.I. Shuman, S.K. Narang, P. Frossard, A. Ortega, P. Vandergheynst, The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 30(3), 83–98 (2013). https://doi.org/10.1109/MSP.2012.2235192


  3. J. Jiang, D.B. Tay, Design of non-subsampled graph filter banks via lifting schemes. IEEE Signal Process. Lett. 27, 441–445 (2020).


  4. B. Girault, A. Ortega, S.S. Narayanan, Graph vertex sampling with arbitrary graph signal Hilbert spaces, in IEEE Int. Conf. Acoust., Speech, Signal Processing, Spain, 5670–5674 (2020).

  5. Y. Tanaka, Y.C. Eldar, A. Ortega, G. Cheung, Sampling signals on graphs: From theory to applications. IEEE Signal Process. Mag. 37(6), 14–30 (2020). https://doi.org/10.1109/MSP.2020.3016908


  6. A. Sandryhaila, J.M.F. Moura, Discrete signal processing on graphs. IEEE Trans. Signal Process. 61(7), 1644–1656 (2013). https://doi.org/10.1109/TSP.2013.2238935

  7. J. Domingos, J.M.F. Moura, Graph fourier transform: A stable approximation. IEEE Trans. Signal Process. 68, 4422–4437 (2020). https://doi.org/10.1109/TSP.2020.3009645


  8. J. Shi, J.M.F. Moura, Graph signal processing: Modulation, convolution, and sampling (2019). https://arxiv.org/abs/1912.06762

  9. S. Chen, A. Sandryhaila, J.M.F. Moura, J. Kovacevic, Signal denoising on graphs via graph filtering, in IEEE Global Conference on Signal and Information Processing (GlobalSIP), 872–876 (2014). https://doi.org/10.1109/GlobalSIP.2014.7032244

  10. M. Onuki, S. Ono, M. Yamagishi, Y. Tanaka, Graph signal denoising via trilateral filter on graph spectral domain. IEEE Trans. Signal Inf. Process. Over Netw. 2(2), 137–148 (2016). https://doi.org/10.1109/TSIPN.2016.2532464


  11. S. Ono, I. Yamada, I. Kumazawa, Total generalized variation for graph signals, in IEEE Int. Conf. Acoust., Speech, Signal Processing, 5456–5460 (2015). https://doi.org/10.1109/ICASSP.2015.7179014

  12. V. Kalofolias, How to learn a graph from smooth signals, in Proc. Int. Conf. Artificial Intelligence and Statistics (AISTATS) 51, 920–929 (2016). http://proceedings.mlr.press/v51/kalofolias16.html

  13. K. Yamada, Y. Tanaka, A. Ortega, Time-varying graph learning based on sparseness of temporal variation, in IEEE Int. Conf. Acoust., Speech, Signal Processing, 5411–5415 (2019). https://doi.org/10.1109/ICASSP.2019.8682762

  14. K. Yamada, Y. Tanaka, A. Ortega, Time-varying graph learning with constraints on graph temporal variation. CoRR, abs/2001.03346 (2020). https://arxiv.org/abs/2001.03346

  15. G. Cheung, E. Magli, Y. Tanaka, M.K. Ng, Graph spectral image processing. Proc. IEEE 106(5), 907–930 (2018)

  16. H. Sadreazami, A. Asif, A. Mohammadi, A late adaptive graph-based edge-aware filtering with iterative weight updating process, in IEEE Int. Midwest Symp. Circuits and Systems (MWSCAS), 1581–1584 (2017). https://doi.org/10.1109/MWSCAS.2017.8053239

  17. R.I. Kondor, J. Lafferty, Diffusion kernels on graphs and other discrete structures, in International Conference on Machine Learning, 315–322 (2002).

  18. A.J. Smola, R. Kondor, Kernels and regularization on graphs, in Learning Theory and Kernel Machines. Lecture Notes in Computer Science 2777, 144–158 (2003). https://doi.org/10.1007/978-3-540-45167-9_12


  19. L. Lacasa, B. Luque, F. Ballesteros, J. Luque, J.C. Nuño, From time series to complex networks: The visibility graph. Proc. Natl. Acad. Sci. USA 105(13), 4972–4975 (2008).

  20. I.V. Bezsudnov, S.V. Gavrilov, et al., From time series to complex networks: The dynamical visibility graph. Phys. A Stat. Mech. Appl. 414, 1–13 (2012). https://arxiv.org/abs/1208.6365v1

  21. R.V. Donner, Y. Zou, J.F. Donges, et al., Recurrence networks – a novel paradigm for nonlinear time series analysis. New J. Phys. 12(3), 033025 (2010).

  22. P. Mathur, V.K. Chakka, Graph signal processing of EEG signals for detection of epilepsy, in 7th International Conference on Signal Processing and Integrated Networks, 839–843 (2020).

  23. S.S. Roy, S. Chatterjee, et al., Detection of focal EEG signals employing weighted visibility graph, in International Conference on Computer, Electrical & Communication Engineering, India, 1–5 (2020).

  24. P. Scalart, J. Filho, Speech enhancement based on a priori signal to noise estimation, IEEE Int. Conf. Acoust., Speech, Signal Processing, USA, 1996, 629–632 (1996).

  25. Y. Ephraim, D. Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Sig. Process. 32(6), 1109–1121 (1984).


  26. I. Cohen, B. Berdugo, Speech enhancement for non-stationary noise environments. Signal Process. 81(11), 2403–2418 (2001)


  27. J.H. Choi, J.H. Chang, On using acoustic environment classification for statistical model-based speech enhancement. Speech Comm. 54(3), 477–490 (2012)


  28. L. Sun, Y. Bu, P. Li, Z. Wu, Single-channel speech enhancement based on joint constrained dictionary learning. EURASIP J. Audio Speech Music. Process. 2021(1), 29 (2021). https://doi.org/10.1186/s13636-021-00218-3


  29. T. Wang, H. Guo, et al., Speech signal processing on graphs: Graph topology, graph frequency analysis and denoising. Chin. J. Electron. 29(5), 926–936 (2020)

  30. T. Wang, H. Guo, X. Yan, Z. Yang, Speech signal processing on graphs: The graph frequency analysis and an improved graph wiener filtering method. Speech Commun. 127, 82–91 (2021). https://doi.org/10.1016/j.specom.2020.12.010


  31. M. Puschel, J.M. Moura, Algebraic signal processing theory: Foundation and 1-d time. IEEE Trans. Signal Process. 56(8–1), 3572–3585 (2008)


  32. T. Wang, H. Guo, Q. Zhang, Z. Yang, A new multilayer graph model for speech signals with graph learning. Digit. Signal Process. 122, 103360 (2022). https://doi.org/10.1016/j.dsp.2021.103360


  33. A. Gavili, X. Zhang, On the shift operator, graph frequency, and optimal filtering in graph signal processing. IEEE Trans. Signal Process. 65(23), 6303–6318 (2017). https://doi.org/10.1109/TSP.2017.2752689


  34. G. Yang, L. Yang, C. Huang, An orthogonal partition selection strategy for the sampling of graph signals with successive local aggregations. Signal Process. 188, 108211 (2021). https://doi.org/10.1016/j.sigpro.2021.108211


  35. J. Miettinen, S.A. Vorobyov, E. Ollila, Modelling and studying the effect of graph errors in graph signal processing. Signal Process. 189, 108-256 (2021). https://doi.org/10.1016/j.sigpro.2021.108256


  36. H. Sevi, G. Rilling, P. Borgnat, Modeling signals over directed graphs through filtering, IEEE Global Conference on Signal and Information Processing, USA, 2018, 718–722 (2018). https://doi.org/10.1109/GlobalSIP.2018.8646534

  37. F. Wang, Y. Wang, G. Cheung, A-optimal sampling and robust reconstruction for graph signals via truncated neumann series. IEEE Signal Process. Lett. 25(5), 680–684 (2018). https://doi.org/10.1109/LSP.2018.2818062


  38. B. Pasdeloup, V. Gripon, G. Mercier, D. Pastor, M.G. Rabbat, Characterization and inference of graph diffusion processes from observations of stationary signals. IEEE Trans. Signal Inf. Process. Netw. 4(3), 481–496 (2018). https://doi.org/10.1109/TSIPN.2017.2742940


  39. X. Dong, D. Thanou, P. Frossard, P. Vandergheynst, Learning laplacian matrix in smooth graph signal representations. IEEE Trans. Sig. Process. 64(23), 6160–6173 (2016).


  40. Y. Yankelevsky, M. Elad, Finding GEMS: multi-scale dictionaries for high-dimensional graph signals. IEEE Trans. Signal Process. 67(7), 1889–1901 (2019). https://doi.org/10.1109/TSP.2019.2899822


  41. F. Grassi, A. Loukas, N. Perraudin, B. Ricaud, A time-vertex signal processing framework: Scalable processing and meaningful representations for time-series on graphs. IEEE Trans. Signal Process. 66(3), 817–829 (2018). https://doi.org/10.1109/TSP.2017.2775589


  42. A. Loukas, D. Foucard, Frequency analysis of temporal graph signals, CoRR abs/1602.04434 (2016). http://arxiv.org/abs/1602.04434

  43. J. Yu, X. Xie, H. Feng, B. Hu, On critical sampling of time-vertex graph signals, IEEE Global Conference on Signal and Information Processing, Canada, 1–5 (2019). https://doi.org/10.1109/GlobalSIP45357.2019.8969108

  44. H. Araghi, M. Sabbaqi, M. Babaie-Zadeh, K-graphs: An algorithm for graph signal clustering and multiple graph learning. IEEE Signal Process. Lett. 26(10), 1486–1490 (2019). https://doi.org/10.1109/LSP.2019.2936665


  45. X. Dong, D. Thanou, M.G. Rabbat, P. Frossard, Learning graphs from data: A signal representation perspective. IEEE Signal Process. Mag. 36(3), 44–63 (2019). https://doi.org/10.1109/MSP.2018.2887284


  46. M. Grant, S. Boyd, CVX: Matlab software for disciplined convex programming, CVX Research, Inc., Austin (2012–2019). http://cvxr.com

  47. ITU-T Recommendation P.862, Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs (2001)

  48. S.R. Quackenbush, T.P. Barnwell, M.A. Clements, Objective Measures of Speech Quality (Prentice Hall Advanced Reference Series, Englewood Cliffs, 1988), ISBN: 0-13-629056-6

  49. C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, A short-time objective intelligibility measure for time-frequency weighted noisy speech, IEEE Int. Conf. Acoust., Speech, Signal Processing, USA, 2010, 4214–4217(2010). https://doi.org/10.1109/ICASSP.2010.5495701

  50. Y. Hu, P.C. Loizou, Evaluation of objective quality measures for speech enhancement. IEEE Trans. Speech Audio Process. 16(1), 229–238 (2008). https://doi.org/10.1109/TASL.2007.911054


  51. J.S. Garofolo, et al., DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Tech. Rep. (1993)

  52. A. Varga, H.J.M. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(3), 247–251 (1993).


  53. A. Sandryhaila, J.M.F. Moura, Discrete signal processing on graphs: Frequency analysis. IEEE Trans. Signal Process. 62(12), 3042–3054 (2014). https://doi.org/10.1109/TSP.2014.2321121


  54. I.S. Gradshteyn, I.M. Ryzhik, Table of Integrals, Series, and Products (Academic, New York, 1980)


Acknowledgements

Not applicable.

Funding

This work is supported by the National Natural Science Foundation of China (No.62071242, No.61901227), the Graduate Innovation Program of Jiangsu Province (No.KYCX19-0897), and the China Scholarship Council.

Author information

Authors and Affiliations

Authors

Contributions

(1) We proposed a novel global graph topology by using the K-graphs learning method, which consists of multiple graphs describing the potential connections among inter-frames and intra-frames. (2) We proposed the representation of the minimum mean-square error (MMSE) graph spectral magnitude estimator for SGSs in the vertex-frequency domain by extending the classical MMSE-STSA estimator in DSP to speech enhancement. (3) The proposed MMSE graph spectral magnitude estimator outperforms the benchmarks in terms of PESQ, LLR, output SNR, and STOI; the classical MMSE-STSA and OMLSA methods in DSP and the existing graph Wiener filtering methods are used as benchmarks. The authors read and approved the final manuscript.

Authors’ information

Tingting Wang is pursuing her doctorate in signal and information processing at the Nanjing University of Posts and Telecommunications, Nanjing, China. Her current research interests include graph signal processing and classical speech signal processing.

Haiyan Guo is currently an Associate Professor with NJUPT, Nanjing, China. Her research interests include speech signal processing and B5G/6G wireless transmission.

Zirui Ge is pursuing his doctoral degree in signal and information processing from the Nanjing University of Posts and Telecommunications, Nanjing, China.

Qiquan Zhang’s research interests are digital speech and audio signal processing, speech enhancement algorithms, and microphone array signal processing.

Zhen Yang’s research interests include various aspects of signal processing and communication, such as communication systems and networks, cognitive radio, spectrum sensing, speech and audio processing, compressive sensing, and wireless communication. He has published more than 200 papers in academic journals and conferences. Prof. Yang served as Vice Chairman of the Chinese Institute of Communications, Chairman of the Jiangsu Institute of Communications from 2010 to 2015, and Chair of the APCC (Asia-Pacific Conference on Communications) Steering Committee from 2013 to 2014. He is currently a Fellow of the Chinese Institute of Communications and Vice Director of the Editorial Board of the Journal on Communications. He is also a member of the editorial boards of several other journals, such as Chinese Journal of Electronics.

Corresponding authors

Correspondence to Tingting Wang or Zhen Yang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendixes


1.1 Appendix 1

In this Appendix, we derive the numerator \(\xi\) in (23):

$$\begin{aligned} \xi = \int _{0}^{\infty } \textrm{exp} \left\{ - \sigma ^{2}(Z_{j}^{i} - \tau )^{2} \right\} (Z_{j}^{i})^{2}dZ_{j}^{i} = \frac{1}{\sigma ^{3}}\int _{0}^{\infty } \textrm{exp} \left\{ - (Z_{j}^{i}\sigma - \sigma \tau )^{2} \right\} (Z_{j}^{i}\sigma )^{2}\,d(Z_{j}^{i}\sigma ). \end{aligned}$$
(32)

Upon denoting \(y = Z_j^i\sigma - \sigma \tau\), we have

$$\begin{aligned} \xi = \frac{1}{{{\sigma ^3}}}\int _{ - \sigma \tau }^0 {\left( {{y^2}{\text { + }}2y\sigma \tau + {{\left( {\sigma \tau } \right) }^2}} \right) \textrm{exp} \left\{ { - {y^2}} \right\} dy} + \frac{1}{{{\sigma ^3}}}\int _0^\infty {\left( {{y^2}{\text { + }}2y\sigma \tau + {{\left( {\sigma \tau } \right) }^2}} \right) \textrm{exp} \left\{ { - {y^2}} \right\} dy}. \end{aligned}$$
(33)

Using the integral of the product of an exponential and an arbitrary power [[54], eq. 3.381.4]

$$\begin{aligned} \int _0^\infty {{x^{v - 1}}\textrm{exp} \left\{ { - \mu x} \right\} dx} = \mu ^{-v}\Gamma (v), \end{aligned}$$
(34)

we obtain

$$\begin{aligned} \xi= & {} \frac{1}{\sigma ^3}\left( \int _{ - \sigma \tau }^0 y^2\exp \left\{ { - y^2} \right\} dy + \int _{ - \sigma \tau }^0 2y\sigma \tau \exp \left\{ { - y^2} \right\} dy + \int _{ - \sigma \tau }^0 (\sigma \tau )^2\exp \left\{ { - y^2} \right\} dy \right) \nonumber \\+ & {} \frac{1}{\sigma ^3}\left( \int _0^\infty y^2\exp \left\{ { - y^2} \right\} dy + \int _0^\infty 2y\sigma \tau \exp \left\{ { - y^2} \right\} dy + \int _0^\infty (\sigma \tau )^2\exp \left\{ { - y^2} \right\} dy \right) \nonumber \\= & {} \frac{1}{\sigma ^3}\left( { - \frac{1}{2}\int _{ - \sigma \tau }^0 y\,d\exp \left\{ { - y^2} \right\} - \sigma \tau \int _{ - \sigma \tau }^0 d\exp \left\{ { - y^2} \right\} + (\sigma \tau )^2\int _{ - \sigma \tau }^0 \exp \left\{ { - y^2} \right\} dy } \right) \nonumber \\+ & {} \left( \frac{1}{2\sigma ^3}\Gamma \left( \frac{3}{2} \right) + \frac{\tau }{\sigma ^2}\Gamma (1) + \frac{\tau ^2}{\sigma }\frac{1}{2}\Gamma \left( \frac{1}{2} \right) \right) \nonumber \\= & {} \frac{1}{2\sigma ^3}\left( \left. { - y\exp \left\{ { - y^2} \right\} } \right| _{ - \sigma \tau }^0 + \int _{ - \sigma \tau }^0 \exp \left\{ { - y^2} \right\} dy \right) - \left. \frac{\tau }{\sigma ^2}\exp \left\{ { - y^2} \right\} \right| _{ - \sigma \tau }^0 + \frac{\tau ^2}{\sigma }\int _{ - \sigma \tau }^0 \exp \left\{ { - y^2} \right\} dy \nonumber \\+ & {} \left( \frac{\sqrt{\pi }}{4\sigma ^3} + \frac{\tau }{\sigma ^2} + \frac{\tau ^2\sqrt{\pi }}{2\sigma } \right) \nonumber \\= & {} \frac{ - \tau }{2\sigma ^2}\exp \left\{ { - \sigma ^2\tau ^2} \right\} + \frac{\tau }{\sigma ^2}\left( \exp \left\{ { - \sigma ^2\tau ^2} \right\} - 1 \right) + \frac{1 + 2\sigma ^2\tau ^2}{2\sigma ^3}\int _{ - \sigma \tau }^0 \exp \left\{ { - y^2} \right\} dy \nonumber \\+ & {} \left( \frac{\sqrt{\pi }}{4\sigma ^3} + \frac{\tau }{\sigma ^2} + \frac{\tau ^2\sqrt{\pi }}{2\sigma } \right) . \end{aligned}$$
(35)

Using the representation of the Gauss error function \(\operatorname{erf}(x) = \frac{2}{\sqrt{\pi }}\int _0^x {e^{ - \eta ^2}} d\eta\) [[54], 3.321.1] and substituting \(t = -y\), we obtain

$$\begin{aligned} \xi= & {} \frac{\tau }{2\sigma ^2}\exp \left\{ { - \sigma ^2\tau ^2} \right\} + \frac{1}{\sigma ^3}\left( \frac{\sqrt{\pi }}{4} + \frac{\sqrt{\pi }}{2}\sigma ^2\tau ^2 \right) + \frac{1 + 2\sigma ^2\tau ^2}{2\sigma ^3}\int _{ - \sigma \tau }^0 \exp \left\{ { - y^2} \right\} dy \nonumber \\= & {} \frac{\tau }{2\sigma ^2}\exp \left\{ { - \sigma ^2\tau ^2} \right\} + \frac{1}{\sigma ^3}\left( \frac{\sqrt{\pi }}{4} + \frac{\sqrt{\pi }}{2}\sigma ^2\tau ^2 \right) + \frac{1 + 2\sigma ^2\tau ^2}{2\sigma ^3} \cdot \frac{\sqrt{\pi }}{2} \cdot \frac{2}{\sqrt{\pi }}\int _0^{\sigma \tau } \exp \left\{ { - t^2} \right\} dt \\= & {} \frac{\tau }{2\sigma ^2}\exp \left\{ { - \sigma ^2\tau ^2} \right\} + \frac{1}{\sigma ^3}\left( \frac{\sqrt{\pi }}{4} + \frac{\sqrt{\pi }}{2}\sigma ^2\tau ^2 \right) \left( 1 + \operatorname{erf}(\sigma \tau ) \right) .\nonumber \end{aligned}$$
(36)
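The closed form in (36) is easy to verify numerically. The sketch below (not from the paper; the function names and the test values \(\sigma = 1.3\), \(\tau = 0.7\) are illustrative choices of ours) compares it against a direct trapezoidal evaluation of the integral in (33):

```python
import math

def xi_closed(sigma, tau):
    # Closed form (36): tau/(2 sigma^2) exp(-sigma^2 tau^2)
    # + (sqrt(pi)/4 + sqrt(pi)/2 * sigma^2 tau^2)/sigma^3 * (1 + erf(sigma*tau))
    a = sigma * tau
    sp = math.sqrt(math.pi)
    return (tau / (2 * sigma ** 2)) * math.exp(-a * a) + \
        (sp / 4 + sp / 2 * a * a) / sigma ** 3 * (1 + math.erf(a))

def xi_numeric(sigma, tau, upper=20.0, n=200_000):
    # Trapezoidal evaluation of the integral in (33) over [-sigma*tau, upper];
    # the Gaussian factor makes the integrand negligible beyond y = upper.
    lo = -sigma * tau
    h = (upper - lo) / n
    s = 0.0
    for k in range(n + 1):
        y = lo + k * h
        f = (y * y + 2 * y * sigma * tau + (sigma * tau) ** 2) * math.exp(-y * y)
        s += f / 2 if k in (0, n) else f
    return s * h / sigma ** 3

print(abs(xi_closed(1.3, 0.7) - xi_numeric(1.3, 0.7)) < 1e-6)  # True
```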

1.2 Appendix 2

In this Appendix, we derive the gain function (28) of the MMSE graph spectral magnitude estimator. The denominator \(\zeta\) in (22) is given as

$$\begin{aligned} \zeta = \int _0^\infty {\exp \left\{ { - \sigma ^2(Z_j^i - \tau )^2} \right\} Z_j^i\,dZ_j^i}. \end{aligned}$$
(37)

Upon denoting \(y = Z_j^i\sigma - \sigma \tau\), we have

$$\begin{aligned} \zeta= & {} \frac{1}{\sigma ^2}\int _{ - \sigma \tau }^0 \left( y + \sigma \tau \right) \exp \left\{ { - y^2} \right\} dy + \frac{1}{\sigma ^2}\int _0^\infty \left( y + \sigma \tau \right) \exp \left\{ { - y^2} \right\} dy \nonumber \\= & {} \frac{1}{\sigma ^2}\int _{ - \sigma \tau }^0 y\exp \left\{ { - y^2} \right\} dy + \frac{\tau }{\sigma }\int _{ - \sigma \tau }^0 \exp \left\{ { - y^2} \right\} dy + \frac{1}{\sigma ^2}\int _0^\infty y\exp \left\{ { - y^2} \right\} dy \nonumber \\+ & {} \frac{\tau }{\sigma }\int _0^\infty \exp \left\{ { - y^2} \right\} dy \nonumber \\= & {} \frac{ - 1}{2\sigma ^2}\int _{ - \sigma \tau }^0 d\exp \left\{ { - y^2} \right\} + \frac{\tau }{\sigma }\int _{ - \sigma \tau }^0 \exp \left\{ { - y^2} \right\} dy + \frac{1}{\sigma ^2}\frac{1}{2}\Gamma (1) + \frac{\tau }{\sigma }\frac{1}{2}\Gamma \left( \frac{1}{2} \right) \nonumber \\= & {} \frac{ - 1}{2\sigma ^2}\left. \exp \left\{ { - y^2} \right\} \right| _{ - \sigma \tau }^0 + \frac{\tau }{\sigma }\int _{ - \sigma \tau }^0 \exp \left\{ { - y^2} \right\} dy + \frac{1}{2\sigma ^2} + \frac{\tau \sqrt{\pi }}{2\sigma } \nonumber \\= & {} \frac{1}{2\sigma ^2}\exp \left\{ { - \sigma ^2\tau ^2} \right\} + \frac{\tau \sqrt{\pi }}{2\sigma } + \frac{\tau }{\sigma }\int _{ - \sigma \tau }^0 \exp \left\{ { - y^2} \right\} dy. \end{aligned}$$
(38)

Using the representation of the Gauss error function \(\operatorname{erf}(x)=\frac{2}{\sqrt{\pi }}\int _0^x {e^{-\eta ^2}}d\eta\) [[54], 3.321.1] and substituting \(t = -y\), we have

$$\begin{aligned} \zeta= & {} \frac{1}{{2{\sigma ^2}}}\textrm{exp} \left\{ {-{\sigma ^2}{\tau ^2}}\right\} +\frac{{\sqrt{\pi }}}{2}\frac{\tau }{\sigma }+ \frac{\tau }{\sigma }\int _0^{\sigma \tau }{\textrm{exp} \left\{ {-{t^2}} \right\} dt}\nonumber \\= & {} \frac{1}{{2{\sigma ^2}}}\textrm{exp} \left\{ {-{\sigma ^2}{\tau ^2}}\right\} +\frac{{\sqrt{\pi }}}{2}\frac{\tau }{\sigma }+\frac{\tau }{\sigma }\frac{{\sqrt{\pi }}}{2} \cdot \frac{2}{{\sqrt{\pi }}}\int _0^{\sigma \tau } {\textrm{exp} \left\{ { -{t^2}} \right\} dt}\\= & {} \frac{1}{{2{\sigma ^2}}}\textrm{exp} \left\{ {-{\sigma ^2}{\tau ^2}} \right\} +\frac{{\sqrt{\pi }}}{2}\frac{\tau }{\sigma }+\frac{\tau }{\sigma }\frac{{\sqrt{\pi }}}{2}erf\left( {\sigma \tau }\right) . \nonumber \end{aligned}$$
(39)
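As with \(\xi\), the closed form (39) can be checked numerically against the defining integral (37); the sketch below uses our own function names and illustrative values \(\sigma = 1.3\), \(\tau = 0.7\):

```python
import math

def zeta_closed(sigma, tau):
    # Closed form (39): exp(-sigma^2 tau^2)/(2 sigma^2)
    # + sqrt(pi) tau/(2 sigma) * (1 + erf(sigma*tau))
    a = sigma * tau
    return math.exp(-a * a) / (2 * sigma ** 2) + \
        math.sqrt(math.pi) * tau / (2 * sigma) * (1 + math.erf(a))

def zeta_numeric(sigma, tau, upper=20.0, n=200_000):
    # Trapezoidal evaluation of (37): int_0^inf Z exp(-sigma^2 (Z - tau)^2) dZ,
    # truncated at Z = upper where the integrand is negligible.
    h = upper / n
    s = 0.0
    for k in range(n + 1):
        z = k * h
        f = z * math.exp(-sigma ** 2 * (z - tau) ** 2)
        s += f / 2 if k in (0, n) else f
    return s * h

print(abs(zeta_closed(1.3, 0.7) - zeta_numeric(1.3, 0.7)) < 1e-6)  # True
```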

Substituting (36) and (39) into (23), we obtain

$$\begin{aligned} \widehat{Z_j^i} = \frac{\xi }{\zeta }=\frac{\frac{\tau }{2\sigma ^2}\exp \left\{ {-\sigma ^2\tau ^2} \right\} + \frac{1}{\sigma ^3}\left( \frac{\sqrt{\pi }}{4} + \frac{\sqrt{\pi }}{2}\sigma ^2\tau ^2 \right) + \frac{1 + 2\sigma ^2\tau ^2}{2\sigma ^3}\frac{\sqrt{\pi }}{2}\operatorname{erf}(\sigma \tau )}{\frac{1}{2\sigma ^2}\exp \left\{ {-\sigma ^2\tau ^2}\right\} + \frac{\sqrt{\pi }}{2}\frac{\tau }{\sigma } + \frac{\sqrt{\pi }}{2}\frac{\tau }{\sigma }\operatorname{erf}(\sigma \tau )}R_j^i. \end{aligned}$$
(40)

Hence, the gain function of the MMSE graph spectral magnitude estimator can be described as

$$\begin{aligned} G_{\text {GMMSE}}(\gamma _{j}^{i},\vartheta _{j}^{i})= & {} \frac{\widehat{Z_{j}^{i}}}{R_{j}^{i}} = \frac{\frac{\vartheta _j^i}{2(\gamma _j^i)^2}\exp \left\{ { - (\gamma _j^i\vartheta _j^i)^2} \right\} + \frac{1}{(\gamma _j^i)^3}\left( \frac{\sqrt{\pi }}{4} + \frac{\sqrt{\pi }}{2}(\gamma _j^i\vartheta _j^i)^2 \right) + \frac{1 + 2(\gamma _j^i\vartheta _j^i)^2}{2(\gamma _j^i)^3}\frac{\sqrt{\pi }}{2}\operatorname{erf}(\gamma _j^i\vartheta _j^i)}{\frac{1}{2(\gamma _j^i)^2}\exp \left\{ { - (\gamma _j^i\vartheta _j^i)^2} \right\} + \frac{\sqrt{\pi }}{2}\frac{\vartheta _j^i}{\gamma _j^i} + \frac{\sqrt{\pi }}{2}\frac{\vartheta _j^i}{\gamma _j^i}\operatorname{erf}(\gamma _j^i\vartheta _j^i)} \nonumber \\\approx & {} \frac{\frac{C\vartheta _j^i}{2(\gamma _j^i)^2} + \frac{1}{(\gamma _j^i)^3}\left( \frac{\sqrt{\pi }}{4} + \frac{\sqrt{\pi }}{2}(\gamma _j^i\vartheta _j^i)^2 \right) + \frac{1 + 2(\gamma _j^i\vartheta _j^i)^2}{2(\gamma _j^i)^3}\frac{\sqrt{\pi }}{2}\operatorname{erf}(\gamma _j^i\vartheta _j^i)}{\frac{C}{2(\gamma _j^i)^2} + \frac{\sqrt{\pi }}{2}\frac{\vartheta _j^i}{\gamma _j^i} + \frac{\sqrt{\pi }}{2}\frac{\vartheta _j^i}{\gamma _j^i}\operatorname{erf}(\gamma _j^i\vartheta _j^i)}. \end{aligned}$$
(41)
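Since the gain only re-parameterizes the ratio \(\xi /\zeta\) with \(\sigma \rightarrow \gamma _j^i\) and \(\tau \rightarrow \vartheta _j^i\), it can be sketched directly in exact form, i.e., before the \(\exp \{ -(\gamma _j^i\vartheta _j^i)^2 \} \approx C\) approximation (this is our own illustration, not the authors' code):

```python
import math

def g_gmmse(gamma, theta):
    # Exact gain of the MMSE graph spectral magnitude estimator, i.e.
    # xi/zeta from the closed forms with sigma -> gamma, tau -> theta,
    # before the exp{-(gamma*theta)^2} ~ C approximation.
    a = gamma * theta
    e = math.exp(-a * a)
    sp = math.sqrt(math.pi)
    num = theta / (2 * gamma ** 2) * e + \
        (sp / 4 + sp / 2 * a * a) / gamma ** 3 * (1 + math.erf(a))
    den = e / (2 * gamma ** 2) + sp / 2 * (theta / gamma) * (1 + math.erf(a))
    return num / den

# For large gamma*theta the exponential vanishes and erf -> 1, so the
# gain approaches theta; this can be checked numerically:
print(g_gmmse(10.0, 0.8))
```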

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Cite this article

Wang, T., Guo, H., Ge, Z. et al. An MMSE graph spectral magnitude estimator for speech signals residing on an undirected multiple graph. J AUDIO SPEECH MUSIC PROC. 2023, 7 (2023). https://doi.org/10.1186/s13636-023-00272-z
