- Methodology
- Open Access
An MMSE graph spectral magnitude estimator for speech signals residing on an undirected multiple graph
EURASIP Journal on Audio, Speech, and Music Processing volume 2023, Article number: 7 (2023)
Abstract
This paper uses the K-graphs learning method to construct weighted, connected, undirected multiple graphs, aiming to reveal the intrinsic inter-frame and intra-frame relationships of speech samples. To benefit from the learned multiple graphs’ property and enhance interpretability, we study the spectral property of speech samples in the joint vertex-frequency domain by using the new graph weight matrix. Moreover, we propose the representation of a minimum mean-square error (MMSE) graph spectral magnitude estimator for speech signals residing on undirected multiple graphs, and we use this estimator to improve speech enhancement performance. The numerical simulation results show that the proposed method outperforms the existing methods in graph signal processing (GSP) and the baseline methods for speech enhancement in discrete signal processing (DSP) in terms of PESQ, LLR, output SNR, and STOI results. These results also demonstrate the validity of the learned multiple graphs.
1 Introduction
Graph signal processing (GSP) [1, 2] explores the relationships among discrete signals residing on vertices via graph models [3]. It has been developing a set of theories based on traditional discrete signal processing to investigate, analyze, and process data defined over arbitrary topologies [4, 5]. The scope of research has shifted from fundamental GSP concepts [6,7,8] to practical applications including graph signal denoising and restoration [9,10,11], learning graphs from observed data [12,13,14], image processing [15, 16], and graph clustering [17, 18].
GSP is efficient in characterizing time series by using graph theories [19]. To be specific, in [20], dynamic visibility graphs (DVG) were constructed to describe time series, which studied the DVG dependence of different time series. In [21], the recurrence matrix of the time series was defined as the adjacency matrix of an associated complex graph to link different points in the case where the evolution of the considered states is similar. In [22, 23], a visibility graph based on the Gaussian kernel function was defined for electroencephalogram (EEG) signals, which provided a topology to capture sudden fluctuations happening in EEG during seizure activity.
Traditional discrete signal processing (DSP)-based speech enhancement algorithms use short-time spectrum estimation to suppress noises. Specifically, in [24], the authors proposed the traditional Wiener filtering for speech enhancement by leveraging the frequency-domain characteristics of noises and speech signals. In [25], the authors focused on the major importance of the short-time spectral amplitude (STSA) of the speech signal and proposed a minimum mean-square error (MMSE) STSA estimator for speech enhancement by modeling speech and noise spectral components as statistically independent Gaussian random variables. In [26], the authors proposed the optimal modified minimum mean-square error log-spectral (OMLSA) for robust speech enhancement by minimizing the mean-square error of the log-spectra as a weighted geometric mean of the hypothetical gains associated with the speech presence uncertainty. In addition, in [27], the authors proposed a statistical speech enhancement model using acoustic environment classification supported by a Gaussian mixture model. In [28], the authors proposed a joint-constrained dictionary learning method to solve the “cross projection” problem of signals in the joint dictionary for single-channel speech enhancement.
Unlike the traditional DSP-based speech enhancement methods, GSP-based speech enhancement methods first establish graphs for speech signals before performing enhancement. By establishing the graph adjacency matrix with different edges and weights, speech signals can then be flexibly mapped into different graph frequency domains with different graph Fourier bases. It is worth noting that finite (periodic) time series have been constructed as signals indexed by a directed cycle graph [1, 2, 6]. Speech signals are special time series. The current graph topology of finite time series is directly applied for unstructured speech signals, which explores sampled speech signals’ time shifts and succession and fails to capture the potential relationship among speech samples.
Our previous work [29, 30] has made progress in inferring a suitable graph representation of speech signals. To be specific, in [29], we first established a single undirected graph topology for unstructured speech signals, which successfully mapped time-domain speech signals into the vertex domain and viewed them as speech graph signals. In [30], we proposed a single digraph by using algebraic signal processing (ASP) [31] theories and then built graph Wiener filters in the graph Fourier domain for speech enhancement. However, the static graph topology designed for speech signals in our previous work [29, 30] cannot capture the potential relationships between different speech frames. In [32], we proposed to learn a directed multilayer graph model for speech signals by using graph learning, which reveals both the intrinsic relationships between frames and those among speech samples within a frame. However, it aimed to learn a complex, large-volume graph model for the entire speech signal, which does not reveal the dynamic characteristics of speech signals.
Against this backdrop, in this paper, we propose a K-graphs learning method to learn multiple undirected graphs for framed noisy speech graph signals to better match the dynamic nature of speech. Specifically, the framed noisy speech graph signals are partitioned into a set of clusters. For each cluster, an undirected graph is learned to reveal the potential relationships among noisy speech frames in the cluster. In this way, multiple graphs of a small size, rather than a single large-volume graph, are learned, which reveal the inter-frame relationships of the entire speech signal in a more dynamic way. Additionally, as the size of each cluster is much smaller than that of the whole speech signal, the K-graphs learning method leads to multiple graphs of small volume with a much lower learning complexity.
On the basis of the learned multiple undirected graphs, we propose the gain function representation for the MMSE graph spectral magnitude estimator by extending the classical MMSE-STSA to enhance noisy speech signals. The contributions of the paper are summarized as follows.
i) We propose the novel undirected multiple graphs by using the K-graphs learning method, which reveals potential relationships among noisy speech frames in real-time. On this basis, we construct a joint graph weight matrix and define the related graph Fourier basis.
ii) Based on the constructed graph Fourier basis, we investigate the gain function representation of MMSE graph magnitude spectral estimator for speech graph signals (SGSs). We propose an MMSE graph spectral magnitude estimator by extending the classical MMSE-STSA method in DSP to perform speech enhancement.
iii) Our numerical results show that the proposed method outperforms the benchmarks in terms of the output signal-to-noise ratio (SNR), perceptual evaluation of speech quality (PESQ), log-likelihood ratio (LLR), and short-time objective intelligibility (STOI). The numerical results also demonstrate the validity of the learned multiple graphs for speech signals. We use the classical MMSE-STSA [25] and OMLSA [26] methods in DSP and existing graph Wiener filtering methods [33] as benchmarks.
The remainder of the paper is organized as follows. Section 2 introduces the related work of GSP. Section 3 investigates the K-graphs learning method for speech signals. The details of the MMSE graph spectral magnitude estimator method are given in Section 4. Section 5 provides our experimental results. Section 6 concludes the paper.
Notation: The superscript \(^T\) represents the transpose. \({\textrm{trace}( \cdot )}\), \({\left\| \cdot \right\| _F}\) and \({\left\| \cdot \right\| _1}\) represent the trace, the Frobenius norm, and the 1-norm, respectively, and \(\textbf{1}\) denotes the all-ones matrix. \(\times\) represents the Cartesian product. \(\textbf{I}\) is the identity matrix. \(E( \cdot )\) and \(p( \cdot )\) represent the expectation operator and the probability density function, respectively.
2 Related work
2.1 Basics of GSP
Let \(G = \left( {\mathcal {V},\mathcal {E},\textbf{A}} \right)\) represent a weighted graph, where \(\mathcal {V}\) denotes the collection of vertices \({v_1}, \cdots ,{v_N}\), the set of edges \(\mathcal {E}\) satisfies \(\left( {m,n} \right) \in \mathcal {E}\) if and only if the vertex \({v_m}\) is connected to the vertex \({v_n}\) [34], and \(\textbf{A}\) denotes the graph weight matrix. The element in \(\textbf{A}\) represents the weight of the edge between vertices \({v_m}\) and \({v_n}\), which intuitively and numerically shows the appropriate dependence or similarity between signals on vertices [35, 36]. This paper focuses on a connected undirected graph with finite vertices. We can define the variation operator by the combinatorial graph Laplacian matrix \(\textbf{L}= \textbf{D}-\textbf{A}\), where \(\textbf{D}\) is a diagonal matrix with elements \({d_{mm}} = \sum \nolimits _{n = 1}^N {{{\textrm{A}}_{mn}}},(m = 1,\cdots N)\) [37].
Following [38, 39], we have the smoothness of graph signals \(\textbf{X}\) given by
A small smoothness value means that signals indexed by adjacent vertices have similar values, that is, \(\textbf{X}\) is smooth [40]. From (1), we have
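The smoothness measure discussed in this subsection is commonly written in the GSP literature as \(\textrm{trace}(\textbf{X}^T\textbf{L}\textbf{X})\), which equals half the weighted sum of squared differences between signal values on connected vertices. The sketch below checks that identity numerically. It assumes this standard form rather than reproducing the paper's own equations, and the function names and the random test graph are purely illustrative.

```python
import numpy as np

def laplacian(A):
    """Combinatorial Laplacian L = D - A for a symmetric weight matrix A."""
    return np.diag(A.sum(axis=1)) - A

def smoothness_trace(X, A):
    """Smoothness as trace(X^T L X); each column of X is one graph signal."""
    return float(np.trace(X.T @ laplacian(A) @ X))

def smoothness_pairwise(X, A):
    """Same quantity as 1/2 * sum_{m,n} A_mn * ||x_m - x_n||^2 over vertex pairs."""
    diff = X[:, None, :] - X[None, :, :]   # shape (N, N, T)
    return float(0.5 * np.sum(A[:, :, None] * diff ** 2))

# Illustrative random undirected graph with N vertices and T signals.
rng = np.random.default_rng(0)
N, T = 6, 3
A = np.triu(rng.random((N, N)), 1)
A = A + A.T                               # symmetric, zero diagonal
X = rng.standard_normal((N, T))

s_trace = smoothness_trace(X, A)
s_pair = smoothness_pairwise(X, A)
s_const = smoothness_trace(np.ones((N, T)), A)   # constant signal: perfectly smooth
print(s_trace, s_pair, s_const)
```

A constant signal gives a smoothness value of zero up to rounding, which matches the interpretation that small values indicate similar signal values on adjacent vertices.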
2.2 Speech signals on graphs
After the framing operation, speech signals can be expressed as a matrix \({\textbf{s}}\in {\mathbb {R}^{M \times {N_s}}}\), where each row represents one frame of the speech signal, M represents the total number of noisy speech frames, and \(N_s\) denotes the length of a speech frame. Let us view each discretized speech sample as a vertex. Assuming that the relationship between adjacent samples is symmetrical, speech samples can be constructed as speech graph signals (SGSs) residing on a connected undirected graph. The one-to-one mapping between the \(i\)th noisy speech sample \(f_i\) in a frame and the signal value of the \(i\)th vertex \({v_i}\) is given by
where \({\mathcal {V}_s}\) is the set of vertices with cardinality \(|{\mathcal {V}_s}|=N_s\). In this way, the noisy speech signals in each frame are mapped into the graph domain.
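The framing step that produces the matrix \(\textbf{s}\in \mathbb {R}^{M \times N_s}\) can be sketched as follows. This is a minimal illustration, assuming the Hamming window of 256 points with 50% overlap reported in the experimental setup; the function name `frame_signal`, the synthetic test tone, and the choice to drop trailing samples that do not fill a frame are assumptions made here.

```python
import numpy as np

def frame_signal(x, frame_len=256, overlap=0.5):
    """Split a 1-D signal into windowed frames; returns an (M, N_s) matrix."""
    hop = int(frame_len * (1.0 - overlap))
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(x) - frame_len) // hop     # trailing samples are dropped
    window = np.hamming(frame_len)
    # Row m holds samples m*hop ... m*hop + frame_len - 1.
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    return x[idx] * window

# Illustrative input: one second of a 200 Hz tone sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)

s = frame_signal(x)      # each row is one frame, each column index one vertex
M, N_s = s.shape
print(M, N_s)
```

Each row of `s` then plays the role of one frame of graph signal values, with column index \(i\) corresponding to vertex \(v_i\).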
2.3 The joint graph Fourier transform
Let us denote the time-vertex graph signal as \({\textbf{X}_E}{=}\left[ \textbf{x}_{1},\textbf{x}_{2},...,\textbf{x}_{\textrm{T}}\right] \in \mathbb {R}^{N \times \textrm{T}}\), where \(\textbf{x}_1\), \(\textbf{x}_2\), \(\ldots\), \(\textbf{x}_{\textrm{T}}\) represent graph signals of length N sampled at \(1,2,...,\textrm{T}\) successive regular intervals [41]. To investigate the spectral properties of \(\textbf{X}_E\), following [42, 43], the joint time-vertex graph Fourier transform (JFT) is defined as
where \(\mathbf {\Psi }_T\in \mathbb {C}^{\textrm{T}\times \textrm{T}}\) is constructed as the normalized discrete Fourier transform matrix, with \(\Psi _{T}\left( {t,k} \right) = e^{- j\frac{2\pi (k-1)t}{\textrm{T}}} / \sqrt{\textrm{T}}\), and \(\mathbf {\Psi }_G \in {\mathbb {R}^{N\times N}}\) is the eigenvector matrix obtained by the eigendecomposition of \(\textbf{X}_E\)’s graph weight matrix. More specifically, \(\mathbf {\Psi }_T\) is applied to analyze the oscillations of \(\textbf{X}_E\) along the time domain, while \(\mathbf {\Psi }_G\) allows us to obtain the graph-frequency characteristics of \(\textbf{X}_E\) along the graph edges. Moreover, the corresponding inverse JFT (IJFT) is defined as
The definitions of JFT and IJFT allow us to take into account the variation of the graph and the temporal aspects of time-vertex graph signals.
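To make the two-sided structure of the JFT concrete, the sketch below applies a graph Fourier basis along the vertex dimension and a normalized DFT matrix along the time dimension, then inverts both. The exact placement of transposes and conjugates is an assumption here and may differ from the convention used in [42, 43]; the helper names and the random graph are illustrative only.

```python
import numpy as np

def dft_matrix(T):
    """Normalized (unitary) DFT matrix of size T x T."""
    t = np.arange(T)
    return np.exp(-2j * np.pi * np.outer(t, t) / T) / np.sqrt(T)

def graph_basis(A):
    """Eigenvector matrix of a symmetric graph weight matrix A."""
    _, U = np.linalg.eigh(A)
    return U

def jft(X, U_G, U_T):
    """Joint transform: graph basis on the rows (vertices), DFT on the columns (time)."""
    return U_G.conj().T @ X @ U_T

def ijft(X_hat, U_G, U_T):
    """Inverse joint transform; valid because both bases are unitary."""
    return U_G @ X_hat @ U_T.conj().T

# Illustrative time-vertex signal: N vertices observed at T time instants.
rng = np.random.default_rng(1)
N, T = 5, 8
A = np.triu(rng.random((N, N)), 1)
A = A + A.T
X_E = rng.standard_normal((N, T))

U_G, U_T = graph_basis(A), dft_matrix(T)
X_hat = jft(X_E, U_G, U_T)
X_rec = ijft(X_hat, U_G, U_T)
print(np.max(np.abs(X_rec - X_E)))
```

Because both bases are unitary, the transform preserves the Frobenius norm of the signal, so energy can be compared directly between the vertex-time and joint spectral domains.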
3 The K-graphs learning method for speech graph signals
This section uses the K-graphs learning method to infer a multiple-graphs model for SGSs without prior graph topologies. It is noted that the inter-frame relationship is not considered in traditional speech enhancement systems. Unlike noise, speech samples are strongly correlated both between and within frames. Learning a single global graph for the entire speech signal would be complex. Inspired by K-means and K-graphs learning [44], we therefore partition the SGSs into clusters and use the K-graphs learning method to capture the potential inter-frame and intra-frame properties of speech samples.
Noisy speech signals are mapped into the graph domain by using Eq. (3) and are constructed as SGSs. We employ \(M_s\) speech frames in a cluster to learn a graph for capturing the relationships between \(M_s\) speech frames. The SGSs in the kth cluster, \(\textbf{s}_G^k\) \(\left( {1 \le k \le K}, K= M/M_s \right)\), reside on the kth undirected multiple graph \({G_k} = ({\mathcal {V}_k},{\varvec{\mathcal {L}}_k})\), where \({\mathcal {V}_k}\) indicates the vertex set, \({\varvec{\mathcal {L}}_k}\) is the graph Laplacian matrix of \({G_k}\), and M is the total number of speech frames. Let us now investigate the graph Laplacian matrix \({\varvec{\mathcal {L}}_k}\) of \({G_k}\), for the sake of revealing the intrinsic relationships among speech frames in real-time.
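The partition of the \(M\) frames into \(K = M/M_s\) clusters of \(M_s\) frames each can be illustrated with a short sketch. It assumes that clusters are formed from consecutive frames and that any leftover frames (when \(M\) is not a multiple of \(M_s\)) are discarded; neither detail is stated explicitly in the text, and the function name is hypothetical.

```python
import numpy as np

def partition_frames(s, M_s):
    """Split an (M, N_s) frame matrix into K = M // M_s consecutive clusters.

    Assumption: clusters are consecutive blocks of M_s frames and leftover
    frames at the end are dropped.
    """
    M = s.shape[0]
    K = M // M_s
    return [s[k * M_s:(k + 1) * M_s] for k in range(K)]

# Illustrative frame matrix: M = 90 frames of length N_s = 256.
rng = np.random.default_rng(2)
s = rng.standard_normal((90, 256))

clusters = partition_frames(s, M_s=30)
K = len(clusters)
print(K, clusters[0].shape)
```

Each element of `clusters` corresponds to one \(\textbf{s}_G^k\), for which a separate inter-frame graph with \(M_s\) vertices would then be learned.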
Following the K-graphs learning framework in [44], we formulate the multiple graphs learning problem of noisy speech graph signals as
where \(\textbf{s}_{G}^{k}\varvec{\mathcal {L}}_{k}(\textbf{s}_{G}^{k})^{T}\) describes the smoothness of SGSs supported on \(G_k\), the Frobenius norm of \(\textbf{1}+\varvec{\mathcal {L}}_{k}\) is used to control the distribution of the edge weights and the sparsity, and the third term is the regularization function and ensures its positive value [45]. The first and second constraints are used to ensure the symmetry and non-negativity of \(\varvec{\mathcal {L}}_{k}\). The third constraint is added to prevent trivial solutions and control the volume of the corresponding multiple graphs \(G_k\). \(M_s\) controls the volume of the intra-graph. \(\alpha\) and \(\beta\) are non-negative regularization parameters. As the objective function (6) is convex, we can solve it with the CVX toolbox [46] in the experimental section.
We infer an intra-graph topology to investigate the internal relationships between speech samples within a frame. We denote \(\textbf{s}_g^i\) as the \(i_{th}\) row of \(\textbf{s}_G^k\), which is indexed by an intra-graph \(O_i\). Here we focus on studying the strong causality between adjacent speech samples within a frame. Upon denoting the graph weight matrix of \(O_i\) by \(\textbf{W}_i \in {{\mathrm {\mathbb {R}}}^{{N_s} \times {N_s}}}\), we set \({{w}_i}(m,n) = 1\) if there exists a strong causality between the vertex \(v_m\) and its adjacent vertex \(v_n\), and otherwise \({{w}_i}(m,n) = 0\). That is,
Then, we have \({O_i} = ({\mathcal {V}_i},{\textbf{W}_i})\), where \(\mathcal {V}_i\) represents the vertex set with cardinality \(\left| {{\mathcal {V}_i}} \right| ={N_s}\). Hence, \(M_s\) speech frames \({\textbf {f}}\) can be constructed as the speech graph signal \(\textbf{s}_G\) indexed by the multiple graphs \(G_{S}=(\mathcal {V},\varvec{\mathcal {L}}^{*})\) as shown in Fig. 1.
By applying the Cartesian product of the inter-graph Laplacian matrix \(\varvec{\mathcal {L}}_{k}\) and the intra-graph weight matrix \(\textbf{W}_i\), \(\varvec{\mathcal {L}}^{*}\) is constructed as
and the vertex set \(\mathcal {V}^{*}\) is given as
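As a hedged illustration of how a Cartesian product combines an inter-frame matrix with an intra-frame matrix, the sketch below uses the standard Kronecker-sum form of the graph Cartesian product. Whether the paper's \(\varvec{\mathcal {L}}^{*}\) takes exactly this form is an assumption; the path-graph examples and the function name are illustrative only.

```python
import numpy as np

def cartesian_product_matrix(L_inter, W_intra):
    """Kronecker-sum form of the Cartesian product of two graph matrices.

    Assumption: the joint matrix is L_inter (x) I + I (x) W_intra, the usual
    construction for Cartesian product graphs.
    """
    M_s, N_s = L_inter.shape[0], W_intra.shape[0]
    return np.kron(L_inter, np.eye(N_s)) + np.kron(np.eye(M_s), W_intra)

def path_adjacency(n):
    """Adjacency matrix of an undirected path graph with n vertices."""
    A = np.zeros((n, n))
    i = np.arange(n - 1)
    A[i, i + 1] = 1.0
    A[i + 1, i] = 1.0
    return A

# Illustrative small graphs: 3 frames (inter-graph) and 4 samples (intra-graph).
A_inter = path_adjacency(3)
L_inter = np.diag(A_inter.sum(axis=1)) - A_inter
W_intra = path_adjacency(4)

L_star = cartesian_product_matrix(L_inter, W_intra)

# Spectrum of the product: all pairwise sums of the factor eigenvalues.
lam = np.linalg.eigvalsh(L_inter)
mu = np.linalg.eigvalsh(W_intra)
pair_sums = np.sort(np.add.outer(lam, mu).ravel())
print(L_star.shape, np.allclose(pair_sums, np.linalg.eigvalsh(L_star)))
```

Under this construction the joint vertex set has \(M_s N_s\) elements, and the eigenvectors of the joint matrix are Kronecker products of the factor eigenvectors, which is why a joint transform can be applied as separate multiplications along frames and along samples.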
4 The MMSE graph spectral magnitude estimator method
This section proposes a minimum mean-square error (MMSE) graph spectral magnitude estimator based on the learned multiple graphs above. Specifically, we first define the corresponding joint graph Fourier transform (JGFT) and the inverse JGFT (IJGFT). Then we investigate the representation of the MMSE graph spectral magnitude estimator in GSP by extending the classical MMSE short-time spectral amplitude (STSA) estimator.
4.1 The joint graph Fourier transform for SGSs
Unlike the joint graph Fourier transform defined in Section 2, we apply the singular value decomposition (SVD) to \(\textbf{W}_i\), which gives
where \(\textbf{F}_{w}\) and \(\textbf{D}_{w}\) are the left unitary matrix and the right unitary matrix of \(\textbf{W}_{i}\), respectively, and \(\mathbf {\Lambda }^{w} = \textrm{diag}(\lambda _{1}^{w},\lambda _{2}^{w},...,\lambda _{N_{s}}^{w})\) is the corresponding diagonal matrix and its element represents the graph frequency along the intra-graph edge. Similarly, we have
where \(\textbf{F}_{\mathcal {L}}\) and \(\textbf{D}_{\mathcal {L}}\) respectively represent the left unitary matrix and the right unitary matrix of \(\varvec{\mathcal {L}}_{k}\) and the element \(\lambda _{j}^{\mathcal {L}}\) \(\left( 1 \le j \le M_{s} \right)\) of \(\mathbf {\Lambda }^{\mathcal {L}}\) represents the graph frequency along the inter-graph edges. The joint graph Fourier transform (JGFT) for \(\textbf{s}_{G}^{k}\) can be defined as
where \(\textbf{S}_{\mathcal {F}}^{k}\) is the graph Fourier version of \(\textbf{s}_{G}^{k}\). Moreover, the inverse JGFT (IJGFT) of \(\textbf{S}_{\mathcal {F}}^{k}\) is defined as
It should be noted that by using the defined JGFT, we can get the graph magnitude spectra of speech signals belonging to the real field by mapping speech signals into the graph frequency domain.
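The following sketch illustrates an SVD-based joint transform that yields real-valued coefficients. Several details are assumptions made for illustration: the intra-graph weight matrix is taken to be a path graph linking adjacent samples, the inter-graph Laplacian is a random stand-in for a learned one, and the transform uses the left singular vectors on both sides. The paper's exact JGFT definition may arrange the factors differently.

```python
import numpy as np

def path_weight_matrix(N_s):
    """Assumed intra-graph weights: w(m, n) = 1 for adjacent samples, else 0."""
    W = np.zeros((N_s, N_s))
    i = np.arange(N_s - 1)
    W[i, i + 1] = 1.0
    W[i + 1, i] = 1.0
    return W

def svd_basis(M):
    """Return left singular vectors, singular values, right singular vectors."""
    F, sing, D_t = np.linalg.svd(M)
    return F, sing, D_t.T

def jgft(s, F_L, F_w):
    """Assumed joint transform: inter-graph basis on frames, intra-graph basis on samples."""
    return F_L.T @ s @ F_w

def ijgft(S, F_L, F_w):
    """Inverse of the assumed transform; relies on F_L and F_w being orthogonal."""
    return F_L @ S @ F_w.T

# Illustrative sizes: M_s = 4 frames of N_s = 8 samples.
rng = np.random.default_rng(3)
M_s, N_s = 4, 8
A = np.triu(rng.random((M_s, M_s)), 1)
A = A + A.T
L_k = np.diag(A.sum(axis=1)) - A        # stand-in for a learned Laplacian
W_i = path_weight_matrix(N_s)

F_w, lam_w, D_w = svd_basis(W_i)
F_L, lam_L, D_L = svd_basis(L_k)

s_G = rng.standard_normal((M_s, N_s))
S_F = jgft(s_G, F_L, F_w)
s_rec = ijgft(S_F, F_L, F_w)
print(np.isrealobj(S_F), np.max(np.abs(s_rec - s_G)))
```

Since both matrices are real, their singular vectors are real and the resulting graph spectrum is real-valued, in contrast to the complex spectrum produced by the DFT.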
4.2 The MMSE graph spectral magnitude estimator for SGSs
Let us now investigate the MMSE graph spectral magnitude estimator. We denote \(\textbf{s}_G^k = {\textbf{x}_G} + {\textbf{n}_G}\) where \({\textbf{x}_G}\) is clean SGSs, \({\textbf{n}_G}\) is the additive graph noise signal which is independent of \({\textbf{x}_G}\). By performing the defined JGFT in (13), we have
where \(\textbf{S}_{\mathcal {F}}^k\), \({\textbf{X}_\mathcal {F}}\) and \({\textbf{N}_\mathcal {F}}\) are the JGFT coefficients of \(\textbf{s}_G^k\), \({\textbf{x}_G}\) and \({\textbf{n}_G}\), respectively. Upon denoting the \(i\)th row of \(\textbf{S}_\mathcal {F}^k\), \({\textbf{X}_\mathcal {F}}\), and \({\textbf{N}_\mathcal {F}}\) by \(\textbf{Y}^i\), \(\textbf{X}^i\), and \(\textbf{N}^i\), respectively, a graph speech sample on a vertex can be denoted as \(\textrm{Y}_j^i = \textrm{X}_j^i + \textrm{N}_j^i\) where \(i=1,2,3,...,K\), \(j=0,1,2,...,{N_s}-1\).
Let us denote \(R_j^i = \left| {\textrm{Y}_j^i} \right|\) and \(\textrm{Z}_j^i = \left| {\textrm{X}_j^i} \right|\). Based on the work in [46], the MMSE estimator for the graph magnitude spectrum \({\textrm{X}_j^i}\) can be obtained as
In case of the Gaussian statistical model for spectral components, we have
where \(\;\left( {{\lambda _{\textrm{x}}}} \right) _j^i = E\left[ {{{\left( {X_j^i}\right) }^2}}\right]\) and \(\;\left( {{\lambda _{\textrm{n}}}} \right) _j^i = E\left[ {{{\left( {N_j^i} \right) }^2}}\right]\) are the variances of the \(j\)th SGS spectral component \({X}_j^i\) and of the graph noise component \({N}_j^i\), respectively. By substituting (17) and (18) into \(\int _0^\infty {Z_j^ip(Y_j^i|Z_j^i)p(Z_j^i)dZ_j^i}\) and \(\int _0^\infty {p(Y_j^i|Z_j^i)p(Z_j^i)dZ_j^i}\), we have
and
Then, we can rewrite (16) equivalently as
And, we arrive at
Upon denoting \({\sigma ^2} = \frac{{({\lambda _{\textrm{n}}})_j^i+({\lambda _x})_j^i}}{{({\lambda _{\textrm{n}}})_j^i({\lambda _x})_j^i}}\) and \(\tau = \frac{{Y_j^i({\lambda _x})_j^i}}{{({\lambda _{\textrm{n}}})_j^i + ({\lambda _x})_j^i}}\), we can rewrite (21) equivalently as
Let us now analyze \(\xi\). Upon denoting \(y = Z_j^i\sigma - \sigma \tau\), we have
Let us introduce the Gauss error function \(erf(x) = \frac{2}{{\sqrt{\pi }}}\int _0^x {{e^{ - {\eta ^2}}}} d\eta\). Substituting erf(x) into (24) (see Appendix 1) gives
Similarly, we have
By combining (21), (25) and (26) (see Appendix 2), we arrive at
Let us introduce the notation of the prior signal-to-noise ratio \(\varphi _j^i{\text { = }}\frac{{({\lambda _{\text {x}}})_j^i}}{{({\lambda _{\text {n}}})_j^i}}\) and the posterior signal-to-noise ratio \(\delta _j^i{\text { = }}\frac{{({\lambda _{\text {n}}})_j^i}}{{({\lambda _{\text {x}}})_j^i}}\). Due to \({\sigma ^2} = \frac{{({\lambda _{\text {n}}})_j^i + ({\lambda _x})_j^i}}{{({\lambda _{\text {n}}})_j^i({\lambda _x})_j^i}}\) and \(\tau = \frac{{Y_j^i({\lambda _x})_j^i}}{{({\lambda _{\text {n}}})_j^i + ({\lambda _x})_j^i}}\), we have
By combining (23), (27) and (28), we can obtain the gain function of the MMSE graph spectral magnitude estimator given as
where C is a constant. Following the classical decision directed approach in [25], we have \(\varphi _j^i\) as
where \(\upsilon \in (0,1)\) is the gain parameter. For convenience, in the following sections we refer to the proposed MMSE graph spectral magnitude estimator based on K-graphs learning as GMMSE-KGL.
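The role of \(\sigma^2\), \(\tau\), and the error function in the derivation above can be checked numerically. The sketch below is not the paper's gain function; it assumes a simplified real-valued model in which \(Y\) given \(Z\) is Gaussian with mean \(Z\) and variance \(\lambda_{\textrm{n}}\), and \(Z \ge 0\) has a zero-mean Gaussian-shaped prior with variance parameter \(\lambda_{\textrm{x}}\). Under these assumptions, completing the square gives a posterior proportional to \(e^{-\sigma^2 (Z-\tau)^2/2}\) on \(Z \ge 0\), whose mean has a closed form in terms of erf. All function names are illustrative.

```python
import math
import numpy as np

def sigma2_tau(Y, lam_x, lam_n):
    """sigma^2 and tau as defined in the text."""
    sigma2 = (lam_n + lam_x) / (lam_n * lam_x)
    tau = Y * lam_x / (lam_n + lam_x)
    return sigma2, tau

def posterior_mean_closed_form(Y, lam_x, lam_n):
    """Mean of a Gaussian with mean tau and precision sigma^2, truncated to Z >= 0."""
    sigma2, tau = sigma2_tau(Y, lam_x, lam_n)
    sigma = math.sqrt(sigma2)
    a = sigma * tau
    pdf = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
    return tau + pdf / (sigma * cdf)

def posterior_mean_numeric(Y, lam_x, lam_n, z_max=50.0, n=200001):
    """Ratio of the two integrals over Z >= 0, evaluated with the trapezoid rule."""
    z = np.linspace(0.0, z_max, n)
    w = np.exp(-(Y - z) ** 2 / (2.0 * lam_n) - z ** 2 / (2.0 * lam_x))
    zw = z * w
    num = np.sum(zw[1:] + zw[:-1])   # the common step size cancels in the ratio
    den = np.sum(w[1:] + w[:-1])
    return float(num / den)

# Illustrative values for one spectral component.
Y, lam_x, lam_n = 1.2, 2.0, 0.5
closed = posterior_mean_closed_form(Y, lam_x, lam_n)
numeric = posterior_mean_numeric(Y, lam_x, lam_n)
print(closed, numeric)
```

Dividing such a posterior mean by the observed magnitude \(R_j^i\) gives a multiplicative gain, which is the general shape of an MMSE magnitude estimator; the constants in the paper's final expression depend on its exact density definitions.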
5 Numerical results and discussions
In this section, we present the output signal-to-noise ratio (SNR), perceptual evaluation of speech quality (PESQ) [47], log-likelihood ratio (LLR) [48], and short-time objective intelligibility (STOI) [49] measure results of the proposed MMSE graph spectral magnitude estimator. The traditional MMSE-STSA method in [25], the optimal modified minimum mean-square error log-spectral method (OMLSA) in [26], the improved graph Wiener filtering method (GWF-SGS) in [30], the vertex-frequency graph Wiener filtering (VFGWF) in [32], the graph Wiener filtering method for directed cyclic time series (GWF-DCGS), and the graph Wiener filtering for arbitrary graph signals (GWF-AGS) in [33] are used as the benchmarks. The output LLR is defined as
where \({\overrightarrow{{d_p}}}\) and \({\overrightarrow{{d_c}}}\) represent the LPC vectors of the enhanced speech frame and the original speech frame, respectively, and \(\mathbf {R_c}\) represents the autocorrelation matrix of the original speech signals [50]. In our numerical simulations, the noisy speech signals are generated by mixing clean speech signals from the TIMIT database [51] with noise signals at input signal-to-noise ratios (SNRs) from −15 to 5 dB. Two hundred sentences from 20 speakers (10 females and 10 males) are used as the clean speech signals. White noise, Gaussian color noise, and Babble noise from the NOISEX-92 library [52] are used as noise signals. The sampling frequency is 16 kHz. The speech signals are framed by a Hamming window with a length of 256 points and an overlap of \(50\%\).
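For readers who want to compute the LLR measure, the sketch below uses the commonly cited form \(\log \left( \overrightarrow{d_p}\,\mathbf {R_c}\,\overrightarrow{d_p}^{T} / \overrightarrow{d_c}\,\mathbf {R_c}\,\overrightarrow{d_c}^{T} \right)\), which is assumed here to match the definition above. The LPC order of 10, the autocorrelation method for LPC analysis, and the synthetic test frames are assumptions made for illustration.

```python
import numpy as np

def autocorr(frame, order):
    """Autocorrelation lags 0..order of one frame."""
    n = len(frame)
    return np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])

def toeplitz_matrix(r):
    """Symmetric Toeplitz matrix whose first row is r."""
    idx = np.abs(np.arange(len(r))[:, None] - np.arange(len(r))[None, :])
    return r[idx]

def lpc_vector(frame, order=10):
    """LPC vector [1, -a_1, ..., -a_p] from the autocorrelation normal equations."""
    r = autocorr(frame, order)
    coeffs = np.linalg.solve(toeplitz_matrix(r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -coeffs))

def llr(clean_frame, enhanced_frame, order=10):
    """Log-likelihood ratio between an enhanced frame and the clean frame."""
    d_c = lpc_vector(clean_frame, order)
    d_p = lpc_vector(enhanced_frame, order)
    R_c = toeplitz_matrix(autocorr(clean_frame, order))
    return float(np.log((d_p @ R_c @ d_p) / (d_c @ R_c @ d_c)))

# Illustrative frames: a noisy tone as "clean" speech and a degraded copy.
rng = np.random.default_rng(4)
n = 256
t = np.arange(n) / 16000.0
clean = np.sin(2 * np.pi * 200 * t) + 0.05 * rng.standard_normal(n)
degraded = clean + 0.5 * rng.standard_normal(n)

llr_same = llr(clean, clean)
llr_degraded = llr(clean, degraded)
print(llr_same, llr_degraded)
```

Because the clean LPC vector minimizes the quadratic form in the denominator, the measure is zero for identical frames and non-negative otherwise, so lower values indicate a closer match of the spectral envelope.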
Figure 2 shows the traditional spectrogram obtained by the discrete Fourier transform (DFT) and the graph spectrogram obtained by the proposed graph Fourier basis based on the graph weight matrix \(\textbf{W}_i\) of the intra-graph, respectively. Observe from Fig. 2 that the graph spectrogram is mainly distributed in the high graph frequency regions, while the traditional spectrogram is mainly distributed in the low-frequency regions. The reason is that, according to Theorem 1 on graph frequency ordering in [53], the smallest eigenvalue of \(\textbf{W}_i\) represents the lowest frequency, and its largest eigenvalue represents the highest frequency. This ordering differs from that of the traditional frequencies. Hence, although the graph spectrogram resembles the conventional spectrogram in form, the two representations are fundamentally different. In addition, the proposed graph Fourier basis can map speech signals into the real graph frequency field by applying the eigenvector matrix of \(\textbf{W}_i\).
Figure 3 shows the output SNR of the proposed GMMSE-KGL method in the case of white noise versus the frame number \(M_s\) where \(N_s=256\). Observe from Fig. 3 that, to achieve a high output SNR, \(M_s\) should be neither too small nor too large. When \(M_s\) takes a small value, the relationships among non-adjacent frames cannot be well described by the resulting small graphs. In contrast, when \(M_s\) takes a large value, the boundaries between sub-multiple graphs might have a similar tendency, which would degrade the output SNR, because the larger graphs lose some details of speech samples and the graph spectral magnitude of speech samples can no longer be estimated well. Hence, the range of \(M_s\) can be \(M_s \in [20,30]\) in the case of \(N_s=256\), and we use \(M_s=30\) in our numerical simulations below.
Figure 4 shows the output SNR results of the proposed GMMSE-KGL method under white noise with different vertex numbers \(N_s\) of the intra-graph topology \(O_i\) where \(M_s=30\). We can see from Fig. 4 that the performance of the proposed GMMSE-KGL method decreases as \(N_s\) increases. The reason for this is that the intra-graph topology becomes larger and more complex as \(N_s\) increases, so that the gain function of the proposed GMMSE-KGL method is not accurately estimated from the SGS’s graph power spectrum on \(O_i\). Moreover, the proposed GMMSE-KGL method achieves better performance in the case of \(N_s=64\). For fairness of comparison, however, we use frames of length 256 and build our intra-graph with \(N_s=256\), that is, \(O_i = (\mathcal {V}_{i},{\textbf {W}}_{i}^{256 \times 256})\).
Table 1 shows the PESQ results of the K-graphs learning (KGL) followed by the GMMSE method, the graph k-shift operator (GKSO) [30] followed by the GMMSE method, and the graph learning (GL) [32] followed by the GMMSE method in the case of white noise. For convenience, we denote the methods above as GMMSE-KGL, GMMSE-GKSO, and GMMSE-GL, respectively. We can observe from Table 1 that the proposed GMMSE-KGL outperforms GMMSE-GKSO and GMMSE-GL in terms of PESQ, which illustrates the effectiveness of the K-graphs learning part in speech enhancement. Moreover, the PESQ of the proposed GMMSE-KGL is 0.5 higher than that of GMMSE-GKSO and 0.2 higher than that of GMMSE-GL when the input SNR is larger than −5 dB.
Table 2 shows the PESQ results of the proposed GMMSE and the graph Wiener filtering (GWF) method, each combined with the K-graphs learning method, in the case of white noise. For convenience, we denote the two methods above as GMMSE-KGL and GWF-KGL. From Table 2, we can see that the PESQ of the proposed GMMSE-KGL is 0.2 higher than that of GWF-KGL. The PESQ results of the proposed GMMSE-KGL demonstrate the effectiveness of the GMMSE part in speech enhancement. Tables 1 and 2 illustrate that the proposed K-graphs learning part and the GMMSE-based enhancement part contribute almost equally to the performance improvement.
Table 3 shows the PESQ results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods for the case of white noise. Observe from Table 3 that though the proposed GMMSE-KGL method leads to a slightly lower PESQ value as compared to the VFGWF method in the case of −15 dB input SNR, the proposed GMMSE-KGL method outperforms all the benchmarks when the input SNR is more than −10dB in terms of PESQ, which shows the advantage of the proposed GMMSE-KGL method.
Table 4 shows the LLR results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods for the case of white noise. We can see from Table 4 that the proposed GMMSE-KGL method performs better than the GWF-DCGS, GWF-AGS, VFGWF, and MMSE-STSA methods in terms of LLR results. However, when the input SNR is between −10 and −5 dB, the proposed GMMSE-KGL method performs worse than the OMLSA method. This is because OMLSA can estimate the noise very well by applying the MCRA method when estimating the log-spectral amplitude of clean speech signals.
Table 5 shows the output SNR results of the proposed GMMSE-KGL method, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods for the case of white noise. From Table 5, we can see that the output SNR of the proposed GMMSE-KGL method is higher than that of the GWF-DCGS and the GWF-AGS methods, regardless of the input SNR. The output SNR of the proposed GMMSE-KGL method is also higher than that of the GWF-SGS, VFGWF, and OMLSA methods for the cases where the input SNR is lower than \(-5\) dB. However, when the input SNR is higher than \(-5\) dB, the proposed GMMSE-KGL method leads to a lower output SNR as compared to the MMSE-STSA, OMLSA, GWF-SGS, and VFGWF methods. The reason for this is that the proposed GMMSE-KGL method may misestimate the graph spectra of clean speech signals, so it cannot accurately estimate the graph magnitude of clean graph speech signals.
Table 6 shows the STOI results of the proposed GMMSE-KGL method and the benchmarks in the case of white noise. Observe from Table 6 that the proposed GMMSE-KGL method outperforms the OMLSA, MMSE-STSA, GWF-AGS, and GWF-DCGS methods overall. However, the proposed GMMSE-KGL method performs worse than the GWF-SGS and VFGWF methods in terms of STOI. The reason for this is that when we learn the dynamic graphs using the K-graphs learning method, the graph spectrum of some noise is regarded as that of clean speech details, leading to an inaccurate estimate of the graph spectrum of clean speech signals.
Table 7 reports the PESQ results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, VFGWF, MMSE-STSA, and OMLSA methods in a comparative study on suppressing Gaussian color noise. Observe from Table 7 that the proposed GMMSE-KGL method outperforms all the benchmarks in terms of PESQ overall. However, when the input SNR is −15 dB, the PESQ of the proposed GMMSE-KGL method is lower than that of the OMLSA and GWF-SGS methods.
Table 8 shows the LLR results in a comparative study on suppressing Gaussian color noise. Observe from Table 8 that the LLR results of the proposed GMMSE-KGL method are slightly higher than the traditional MMSE-STSA and OMLSA methods from 0 to 5 dB and are lower than those of the GWF-DCGS, GWF-AGS, GWF-SGS, and VFGWF methods when the input SNR is no less than 0 dB. This illustrates that the proposed GMMSE-KGL method provides a slightly better spectral envelope than the traditional MMSE-STSA and OMLSA methods. However, the proposed GMMSE-KGL method cannot estimate the graph spectrum reasonably as compared to the existing GSP-based methods.
Table 9 shows the output SNR results in a comparative study on suppressing Gaussian color noise. From Table 9, we observe that in terms of output SNR, the proposed GMMSE-KGL method outperforms the GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, and OMLSA methods in cases of negative input SNRs. For positive input SNRs, the relationship among noise frames cannot be captured by the K-graphs learning method, so the noise power spectra cannot be estimated in real-time, and the performance of the proposed GMMSE-KGL method is lower than that of the GWF-SGS and VFGWF methods. Additionally, compared to the MMSE-STSA method, the proposed GMMSE-KGL method leads to a much smaller improvement in the output SNR results. The reason for this is that the proposed GMMSE-KGL method removes more details of clean speech signals and therefore yields only a slight improvement in terms of the output SNR.
Table 10 shows the STOI results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods in the case of Gaussian color noise. Observe from Table 10 that the STOI results of the proposed GMMSE-KGL method are higher than those of the MMSE-STSA, OMLSA, GWF-DCGS, and GWF-AGS methods when the input SNR is between −15 and 5 dB. Note that the proposed GMMSE-KGL method estimates the graph spectrum of some Gaussian color noise as that of clean speech signals, resulting in lower STOI results as compared to the GWF-SGS and VFGWF methods.
Table 11 shows the PESQ results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods for Babble noise. Observe from Table 11 that the proposed GMMSE-KGL method outperforms all the benchmarks in terms of PESQ with input SNR ranging from −15 to 5 dB.
Table 12 shows the LLR results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods for Babble noise. Observe from Table 12 that the proposed GMMSE-KGL method performs better than the traditional MMSE-STSA and OMLSA methods when the input SNR is between −10 and 5 dB. However, the proposed GMMSE-KGL method cannot outperform the GWF-SGS, VFGWF, GWF-DCGS, and GWF-AGS methods when the input SNR is between −15 and 5 dB. The reason is that the proposed GMMSE-KGL method may not estimate the graph spectrum of the nonstationary noise as accurately as the GWF-SGS, VFGWF, GWF-DCGS, and GWF-AGS methods.
Table 13 shows the output SNR results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods for Babble noise. Observe from Table 13 that the proposed GMMSE-KGL method outperforms the GWF-SGS, VFGWF, and MMSE-STSA methods when the input SNR is lower than −5 dB. Although the GWF-AGS and GWF-DCGS methods can be better than the GMMSE-KGL method at negative input SNRs, the proposed GMMSE-KGL method performs better than the GWF-AGS and GWF-DCGS methods when the input SNR is positive.
Table 14 shows the STOI results of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods for Babble noise. It can be observed from Table 14 that when the input SNR is less than 0 dB, the STOI results of the proposed GMMSE-KGL method are higher than those of the MMSE-STSA, OMLSA, and GWF-SGS methods and lower than those of the VFGWF, GWF-AGS, and GWF-DCGS methods. This situation is reversed when the input SNR is not less than 0 dB. This implies that the proposed GMMSE-KGL method may not accurately estimate the graph spectrum of Babble noise, and some speech details may even be regarded as Babble noise and removed at higher input SNRs.
Table 15 gives the computational complexity of the proposed GMMSE-KGL, GWF-DCGS, GWF-AGS, GWF-SGS, VFGWF, MMSE-STSA, and OMLSA methods. Here we discuss only the computational complexity caused by the graph/discrete Fourier transform and its inverse. To be specific, let \(M\), \(M_s\), \(N_s\), and \(L\) represent the total number of noisy speech frames, the number of noisy speech frames in a cluster, the length of a noisy speech frame, and the tap number of the graph Wiener filter, respectively. Based on the computational complexity discussion in [30, 32] and the cost of matrix multiplication, we can observe from Table 15 that the computational complexity of the proposed GMMSE-KGL method is higher than that of the traditional methods. The reason is that the proposed GMMSE-KGL method does not apply a fast graph Fourier transform, while all the traditional methods use the FFT. Meanwhile, the proposed GMMSE-KGL method, which is based on the Fourier basis of the sparse matrix \(\varvec{\mathcal {L}}^{*}\), has a lower computational complexity than both the GWF-DCGS and GWF-AGS methods. Note that \(M_s\) is smaller than \(M\), so the computational complexity of the proposed GMMSE-KGL method is lower than that of the VFGWF method.
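The gap between a graph Fourier transform without a fast algorithm and the FFT can be illustrated with a small sketch. It assumes the usual definition of the graph Fourier basis as the eigenvectors of a symmetric graph Laplacian. The path graph, the frame length `n = 256`, and the helper `path_graph_laplacian` are illustrative stand-ins, not the learned matrix \(\varvec{\mathcal {L}}^{*}\) or the settings of this paper, and the multiplication counts are order-of-magnitude figures only.

```python
import numpy as np

def path_graph_laplacian(n):
    """Combinatorial Laplacian L = D - W of an unweighted path graph on n vertices."""
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = 1.0
    W[idx + 1, idx] = 1.0
    return np.diag(W.sum(axis=1)) - W

n = 256  # hypothetical frame length
L = path_graph_laplacian(n)

# Eigendecomposition of the symmetric Laplacian: the columns of U form the
# graph Fourier basis, and eigvals play the role of graph frequencies.
eigvals, U = np.linalg.eigh(L)

rng = np.random.default_rng(0)
x = rng.standard_normal(n)

x_hat = U.T @ x    # forward GFT: dense matrix-vector product, about n**2 multiplications
x_rec = U @ x_hat  # inverse GFT: another dense product of the same cost

gft_mults = n ** 2
fft_mults = int(n * np.log2(n))  # rough count for a radix-2 FFT of the same length
print(f"GFT ~ {gft_mults} multiplications per frame, FFT ~ {fft_mults}")
```

For this frame length the dense transform needs roughly 32 times more multiplications per frame than the FFT, which is consistent with the observation that graph-based methods without a fast transform are costlier than the traditional ones.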
6 Conclusions
This paper used the K-graphs learning method to learn multiple graphs for speech signals, which can reveal, in real time, the intrinsic relationships among frames and among the speech samples within a frame. On this basis, we developed a representation of the MMSE graph spectral magnitude estimator and evaluated the performance of the proposed GMMSE-KGL method on speech enhancement at different input SNRs. The experimental results showed that the proposed GMMSE-KGL method outperformed the graph Wiener filtering methods in GSP in terms of PESQ and was comparable to some of the well-performing traditional baseline methods in DSP in terms of LLR, STOI, and output SNR.
Availability of data and materials
Not applicable.
References
A. Ortega, P. Frossard, J. Kovačević, J.M.F. Moura, P. Vandergheynst, Graph signal processing: Overview, challenges, and applications. Proc. IEEE 106(5), 808–828 (2018). https://doi.org/10.1109/JPROC.2018.2820126
D.I. Shuman, S.K. Narang, P. Frossard, A. Ortega, P. Vandergheynst, The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 30(3), 83–98 (2013). https://doi.org/10.1109/MSP.2012.2235192
J. Jiang, D.B. Tay, Q. Sun, et al., Design of non-subsampled graph filter banks via lifting schemes. IEEE Signal Process. Lett. 27, 441–445 (2020).
B. Girault, A. Ortega, S.S. Narayanan, Graph vertex sampling with arbitrary graph signal Hilbert spaces, IEEE Int. Conf. Acoust., Speech, Signal Processing, Spain, 2020, 5670–5674 (2020).
Y. Tanaka, Y.C. Eldar, A. Ortega, G. Cheung, Sampling signals on graphs: From theory to applications. IEEE Signal Process. Mag. 37(6), 14–30 (2020). https://doi.org/10.1109/MSP.2020.3016908
A. Sandryhaila, J.M.F. Moura, Discrete signal processing on graphs. IEEE Trans. Signal Process. 61(3), 1644–1656 (2013). https://doi.org/10.1109/TSP.2013.2238935
J. Domingos, J.M.F. Moura, Graph Fourier transform: A stable approximation. IEEE Trans. Signal Process. 68, 4422–4437 (2020). https://doi.org/10.1109/TSP.2020.3009645
J. Shi, J.M.F. Moura, Graph signal processing: Modulation, convolution, and sampling (2019). https://arxiv.org/abs/1912.06762
S. Chen, A. Sandryhaila, J.M.F. Moura, J. Kovacevic, Signal denoising on graphs via graph filtering, 872–876 (2014). https://doi.org/10.1109/GlobalSIP.2014.7032244
M. Onuki, S. Ono, M. Yamagishi, Y. Tanaka, Graph signal denoising via trilateral filter on graph spectral domain. IEEE Trans. Signal Inf. Process. Over Netw. 2(2), 137–148 (2016). https://doi.org/10.1109/TSIPN.2016.2532464
S. Ono, I. Yamada, I. Kumazawa, Total generalized variation for graph signals , 5456–5460 (2015). https://doi.org/10.1109/ICASSP.2015.7179014
V. Kalofolias, How to learn a graph from smooth signals 51, 920–929 (2016). http://proceedings.mlr.press/v51/kalofolias16.html
K. Yamada, Y. Tanaka, A. Ortega, Time-varying graph learning based on sparseness of temporal variation, 5411–5415 (2019). https://doi.org/10.1109/ICASSP.2019.8682762
K. Yamada, Y. Tanaka, A. Ortega, Time-varying graph learning with constraints on graph temporal variation. CoRR, abs/2001.03346 (2020). https://arxiv.org/abs/2001.03346
G. Cheung, E. Magli, Y. Tanaka, M.K. Ng, Graph spectral image processing. Proc. IEEE 106(5), 907–930 (2018). https://ediss.sub.uni-hamburg.de/handle/ediss/9268
H. Sadreazami, A. Asif, A. Mohammadi, A late adaptive graph-based edge-aware filtering with iterative weight updating process, 1581–1584 (2017). https://doi.org/10.1109/MWSCAS.2017.8053239
R.I. Kondor, J. Lafferty, Diffusion kernels on graphs and other discrete structures, International Conference on Machine Learning, 315–322 (2002).
A.J. Smola, R. Kondor, Kernels and regularization on graphs. 2777, 144–158 (2003). https://doi.org/10.1007/978-3-540-45167-9_12
L. Lacasa, B. Luque, F. Ballesteros, et al., From time series to complex networks: The visibility graph. Proc. Natl. Acad. Sci. 105(13), 4972–4975 (2008).
I.V. Bezsudnov, S.V. Gavrilov, et al., From time series to complex networks: The dynamical visibility graph. Phys. A Stat. Mech. Appl. 414, 1–13 (2012). https://arxiv.org/abs/1208.6365v1
R.V. Donner, Y. Zou, J.F. Donges, et al., Recurrence networks-a novel paradigm for nonlinear time series analysis, New Journal of Physics, 12(3), 129–132 (2010).
P. Mathur, V.K. Chakka, Graph signal processing of EEG signals for detection of epilepsy, 7th International Conference on Signal Processing and Information Networks, 839–843 (2020).
S.S. Roy, S. Chatterjee, et al., Detection of focal EEG signals employing weighted visibility graph, International Conference on Computer, Electrical & Communication Engineering, India, 2020, pp. 1–5 (2020).
P. Scalart, J. Filho, Speech enhancement based on a priori signal to noise estimation, IEEE Int. Conf. Acoust., Speech, Signal Processing, USA, 1996, 629–632 (1996).
Y. Ephraim, D. Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Sig. Process. 32(6), 1109–1121 (1984).
I. Cohen, B. Berdugo, Speech enhancement for non-stationary noise environments. Signal Process. 81(11), 2403–2418 (2001)
J.H. Choi, J.H. Chang, On using acoustic environment classification for statistical model-based speech enhancement. Speech Comm. 54(3), 477–490 (2012)
L. Sun, Y. Bu, P. Li, Z. Wu, Single-channel speech enhancement based on joint constrained dictionary learning. EURASIP J. Audio Speech Music. Process. 2021(1), 29 (2021). https://doi.org/10.1186/s13636-021-00218-3
T. Wang, H. Guo, B. Lyu, Z. Yang, Speech signal processing on graphs: Graph topology, graph frequency analysis and denoising. Chin. J. Elect. 29(5), 926–936 (2020)
T. Wang, H. Guo, X. Yan, Z. Yang, Speech signal processing on graphs: The graph frequency analysis and an improved graph wiener filtering method. Speech Commun. 127, 82–91 (2021). https://doi.org/10.1016/j.specom.2020.12.010
M. Puschel, J.M. Moura, Algebraic signal processing theory: Foundation and 1-d time. IEEE Trans. Signal Process. 56(8–1), 3572–3585 (2008)
T. Wang, H. Guo, Q. Zhang, Z. Yang, A new multilayer graph model for speech signals with graph learning. Digit. Signal Process. 122, 103360 (2022). https://doi.org/10.1016/j.dsp.2021.103360
A. Gavili, X. Zhang, On the shift operator, graph frequency, and optimal filtering in graph signal processing. IEEE Trans. Signal Process. 65(23), 6303–6318 (2017). https://doi.org/10.1109/TSP.2017.2752689
G. Yang, L. Yang, C. Huang, An orthogonal partition selection strategy for the sampling of graph signals with successive local aggregations. Signal Process. 188, 108211 (2021). https://doi.org/10.1016/j.sigpro.2021.108211
J. Miettinen, S.A. Vorobyov, E. Ollila, Modelling and studying the effect of graph errors in graph signal processing. Signal Process. 189, 108256 (2021). https://doi.org/10.1016/j.sigpro.2021.108256
H. Sevi, G. Rilling, P. Borgnat, Modeling signals over directed graphs through filtering, IEEE Global Conference on Signal and Information Processing, USA, 2018, 718–722 (2018). https://doi.org/10.1109/GlobalSIP.2018.8646534
F. Wang, Y. Wang, G. Cheung, A-optimal sampling and robust reconstruction for graph signals via truncated neumann series. IEEE Signal Process. Lett. 25(5), 680–684 (2018). https://doi.org/10.1109/LSP.2018.2818062
B. Pasdeloup, V. Gripon, G. Mercier, D. Pastor, M.G. Rabbat, Characterization and inference of graph diffusion processes from observations of stationary signals. IEEE Trans. Signal Inf. Process. Netw. 4(3), 481–496 (2018). https://doi.org/10.1109/TSIPN.2017.2742940
X. Dong, D. Thanou, P. Frossard, P. Vandergheynst, Learning laplacian matrix in smooth graph signal representations. IEEE Trans. Sig. Process. 64(23), 6160–6173 (2016).
Y. Yankelevsky, M. Elad, Finding GEMS: multi-scale dictionaries for high-dimensional graph signals. IEEE Trans. Signal Process. 67(7), 1889–1901 (2019). https://doi.org/10.1109/TSP.2019.2899822
F. Grassi, A. Loukas, N. Perraudin, B. Ricaud, A time-vertex signal processing framework: Scalable processing and meaningful representations for time-series on graphs. IEEE Trans. Signal Process. 66(3), 817–829 (2018). https://doi.org/10.1109/TSP.2017.2775589
A. Loukas, D. Foucard, Frequency analysis of temporal graph signals, CoRR abs/1602.04434 (2016). http://arxiv.org/abs/1602.04434
J. Yu, X. Xie, H. Feng, B. Hu, On critical sampling of time-vertex graph signals, IEEE Global Conference on Signal and Information Processing, Canada, 1–5 (2019). https://doi.org/10.1109/GlobalSIP45357.2019.8969108
H. Araghi, M. Sabbaqi, M. Babaie-Zadeh, K-graphs: An algorithm for graph signal clustering and multiple graph learning. IEEE Signal Process. Lett. 26(10), 1486–1490 (2019). https://doi.org/10.1109/LSP.2019.2936665
X. Dong, D. Thanou, M.G. Rabbat, P. Frossard, Learning graphs from data: A signal representation perspective. IEEE Signal Process. Mag. 36(3), 44–63 (2019). https://doi.org/10.1109/MSP.2018.2887284
M. Grant, S. Boyd, CVX: Matlab software for disciplined convex programming, 2012–2019, CVX Research, Inc., Austin. http://cvxr.com
I.T. Recommendation, Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs. Rec. ITU-T P (2001)
S.R. Quackenbush, T.P. Barnwell, M.A. Clements, Objective measures of speech quality (Prentice Hall Advanced Reference Series, Englewood Cliffs, 1986), ISBN: 0-13-629056-6
C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, A short-time objective intelligibility measure for time-frequency weighted noisy speech, IEEE Int. Conf. Acoust., Speech, Signal Processing, USA, 2010, 4214–4217(2010). https://doi.org/10.1109/ICASSP.2010.5495701
Y. Hu, P.C. Loizou, Evaluation of objective quality measures for speech enhancement. IEEE Trans. Speech Audio Process. 16(1), 229–238 (2008). https://doi.org/10.1109/TASL.2007.911054
S.F. Boll, DARPA TIMIT acoustic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Tech. Rep. (1993)
A. Varga, H.J. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Comm. 12(3), 247–251 (1993).
A. Sandryhaila, J.M.F. Moura, Discrete signal processing on graphs: Frequency analysis. IEEE Trans. Signal Process. 62(12), 3042–3054 (2014). https://doi.org/10.1109/TSP.2014.2321121
I.S. Gradshteyn, I.M. Ryzhik, Table of integrals, series, and products (Academic, New York, 1980).
Acknowledgements
Not applicable.
Funding
This work is supported by the National Natural Science Foundation of China (No.62071242, No.61901227), the Graduate Innovation Program of Jiangsu Province (No.KYCX19-0897), and the China Scholarship Council.
Author information
Contributions
(1) We proposed a novel global graph topology by using the K-graphs learning method, which consists of multiple graphs that describe the potential inter-frame and intra-frame connections. (2) We proposed the graph representation of the minimum mean-square error (MMSE) graph spectral magnitude estimator for SGSs in the vertex-frequency domain by extending the classical MMSE-STSA estimator in DSP to perform speech enhancement. (3) The MMSE graph spectral magnitude estimator method outperforms the benchmarks in terms of PESQ, LLR, output SNR, and STOI. The classical MMSE-STSA and OMLSA methods in DSP and the existing graph Wiener filtering methods are used as benchmarks. The authors read and approved the final manuscript.
Authors’ information
Tingting Wang is pursuing her doctorate in signal and information processing at the Nanjing University of Posts and Telecommunications, Nanjing, China. Her current research interests include graph signal processing and classical speech signal processing.
Haiyan Guo is currently an Associate Professor with NJUPT, Nanjing, China. Her research interests include speech signal processing and B5G/6G wireless transmission.
Zirui Ge is pursuing his doctoral degree in signal and information processing at the Nanjing University of Posts and Telecommunications, Nanjing, China.
Qiquan Zhang’s research interests are digital speech and audio signal processing, speech enhancement algorithms, and microphone array signal processing.
Zhen Yang’s research interests include various aspects of signal processing and communication, such as communication systems and networks, cognitive radio, spectrum sensing, speech and audio processing, compressive sensing, and wireless communication. He has published more than 200 papers in academic journals and conferences. Prof. Yang served as Vice Chairman of the Chinese Institute of Communications and Chairman of the Jiangsu Institute of Communications from 2010 to 2015, and as Chair of the APCC (Asian Pacific Communication Conference) Steering Committee from 2013 to 2014. He is currently a Fellow of the Chinese Institute of Communications and Vice Director of the Editorial Board of the Journal on Communications. He is also a Member of the Editorial Board of several other journals, including the Chinese Journal of Electronics.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendixes
1.1 Appendix 1
In this Appendix, we derive the numerator \(\xi\) in (23).
Upon denoting \(y = Z_j^i\sigma - \sigma \tau\), we have
Using the integral representation for combinations of exponentials and arbitrary powers [54, eq. 3.381.4],
we obtain
Using the representation of the Gauss error function \(\operatorname{erf}(x) = \frac{2}{\sqrt{\pi }}\int _0^x e^{-\eta ^2}\, d\eta\) [54, eq. 3.321.1] and denoting \(t = -y\), we obtain
1.2 Appendix 2
In this Appendix, we derive the gain function (28) of the MMSE graph spectral magnitude estimator. The denominator \(\zeta\) in (22) is given as
Upon denoting \(y = Z_j^i\sigma - \sigma \tau\), we have
Using the representation of the Gauss error function \(\operatorname{erf}(x) = \frac{2}{\sqrt{\pi }}\int _0^x e^{-\eta ^2}\, d\eta\) [54, eq. 3.321.1] and denoting \(t = -y\), we have
Substituting (A.5) and (B.3) into (23), we obtain
Hence, the gain function of the MMSE graph spectral magnitude estimator can be described as
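As a side check on the error-function representation used in both appendixes (this is not the gain function itself), the following minimal sketch compares a direct numerical evaluation of \(\frac{2}{\sqrt{\pi }}\int _0^x e^{-\eta ^2}\, d\eta\) with Python's built-in `math.erf`. It also confirms the odd symmetry \(\operatorname{erf}(-x) = -\operatorname{erf}(x)\), which is what a sign change of variable such as \(t = -y\) relies on. The helper name `erf_quadrature`, the step count, and the test points are illustrative assumptions.

```python
import math

def erf_quadrature(x, steps=20000):
    """Approximate erf(x) = 2/sqrt(pi) * int_0^x exp(-eta^2) d(eta)
    with the composite trapezoidal rule (works for negative x as well)."""
    h = x / steps
    # End-point terms of the trapezoidal rule: f(0) = 1 and f(x) = exp(-x^2).
    total = 0.5 * (1.0 + math.exp(-x * x))
    for k in range(1, steps):
        eta = k * h
        total += math.exp(-eta * eta)
    return 2.0 / math.sqrt(math.pi) * h * total

for x in (0.5, 1.0, -1.5):
    approx = erf_quadrature(x)
    exact = math.erf(x)
    print(f"x={x:+.1f}  quadrature={approx:.8f}  math.erf={exact:.8f}")
```

The two columns should agree to many decimal places, and the value at a negative argument is the negative of the value at the corresponding positive argument.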
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, T., Guo, H., Ge, Z. et al. An MMSE graph spectral magnitude estimator for speech signals residing on an undirected multiple graph. J AUDIO SPEECH MUSIC PROC. 2023, 7 (2023). https://doi.org/10.1186/s13636-023-00272-z
DOI: https://doi.org/10.1186/s13636-023-00272-z
Keywords
- Graph signal processing
- K-graphs learning
- Graph representation
- MMSE
- Speech enhancement