An MMSE graph spectral magnitude estimator for speech signals residing on an undirected multiple graph

Wang, Tingting; Guo, Haiyan; Ge, Zirui; Zhang, Qiquan; Yang, Zhen

doi:10.1186/s13636-023-00272-z

Methodology
Open access
Published: 03 February 2023


EURASIP Journal on Audio, Speech, and Music Processing volume 2023, Article number: 7 (2023)


Abstract

This paper uses a K-graphs learning method to construct weighted, connected, undirected multiple graphs that reveal the intrinsic inter-frame and intra-frame relationships among speech samples. To exploit the properties of the learned graphs and improve interpretability, we study the spectral properties of speech samples in the joint vertex-frequency domain using a new graph weight matrix. We then derive a minimum mean-square error (MMSE) graph spectral magnitude estimator for speech signals residing on undirected multiple graphs and use it to improve speech enhancement performance. Numerical simulations show that the proposed method outperforms existing graph signal processing (GSP) methods and baseline discrete signal processing (DSP) methods for speech enhancement in terms of PESQ, LLR, output SNR, and STOI. These results also demonstrate the validity of the learned multiple graphs.

1 Introduction

Graph signal processing (GSP) [1, 2] explores the relationships among discrete signals residing on vertices via graph models [3]. Building on traditional discrete signal processing, it has developed a set of theories to investigate, analyze, and process data defined over arbitrary topologies [4, 5]. The scope of research has expanded from fundamental GSP concepts [6,7,8] to practical applications including graph signal denoising and restoration [9,10,11], learning graphs from observed data [12,13,14], image processing [15, 16], and graph clustering [17, 18].

GSP is efficient in characterizing time series by using graph theories [19]. To be specific, in [20], dynamic visibility graphs (DVG) were constructed to describe time series, which studied the DVG dependence of different time series. In [21], the recurrence matrix of the time series was defined as the adjacency matrix of an associated complex graph to link different points in the case where the evolution of the considered states is similar. In [22, 23], a visibility graph based on the Gaussian kernel function was defined for electroencephalogram (EEG) signals, which provided a topology to capture sudden fluctuations happening in EEG during seizure activity.

Traditional discrete signal processing (DSP)-based speech enhancement algorithms use short-time spectrum estimation to suppress noise. Specifically, in [24], the authors proposed the traditional Wiener filtering for speech enhancement by leveraging the frequency-domain characteristics of noise and speech signals. In [25], the authors focused on the major importance of the short-time spectral amplitude (STSA) of the speech signal and proposed a minimum mean-square error (MMSE) STSA estimator for speech enhancement by modeling speech and noise spectral components as statistically independent Gaussian random variables. In [26], the authors proposed the optimally modified log-spectral amplitude (OMLSA) estimator for robust speech enhancement, which minimizes the mean-square error of the log-spectra and obtains the gain as a weighted geometric mean of the hypothetical gains associated with speech presence uncertainty. In addition, in [27], the authors proposed a statistical speech enhancement model using acoustic environment classification supported by a Gaussian mixture model. In [28], the authors proposed a joint-constrained dictionary learning method to solve the “cross projection” problem of signals in the joint dictionary for single-channel speech enhancement.

Unlike the traditional DSP-based speech enhancement methods, GSP-based speech enhancement methods first establish graphs for speech signals before performing enhancement. By establishing the graph adjacency matrix with different edges and weights, speech signals can then be flexibly mapped into different graph frequency domains with different graph Fourier bases. It is worth noting that finite (periodic) time series have been constructed as signals indexed by a directed cycle graph [1, 2, 6]. Speech signals are special time series. When this finite time series topology is applied directly to unstructured speech signals, it only describes the time shifts and succession of the sampled speech and fails to capture the potential relationships among speech samples.

Our previous work [29, 30] has made progress in inferring a suitable graph representation of speech signals. To be specific, in [29], we first established a single undirected graph topology for unstructured speech signals, which successfully mapped time-domain speech signals into the vertex domain and viewed them as speech graph signals. In [30], we proposed a single digraph by using algebraic signal processing (ASP) [31] theories and then built graph Wiener filters in the graph Fourier domain for speech enhancement. However, the static graph topology designed for speech signals in our previous work [29, 30] cannot capture the potential relationships between different speech frames. In [32], we proposed to learn a directed multilayer graph model for speech signals by using graph learning, which reveals both the intrinsic inter-frame relationships and those among speech samples within a frame. However, it aimed to learn a complex, large-volume graph model for the whole speech signal, which does not reveal the dynamically changing characteristics of speech.

Against this backdrop, in this paper, we propose a K-graphs learning method to learn multiple undirected graphs for framed noisy speech graph signals to better match the dynamic nature of speech. Specifically, the framed noisy speech graph signals are partitioned into a set of clusters. For each cluster, an undirected graph is learned to reveal the potential relationships among noisy speech frames in the cluster. In this way, multiple small graphs rather than one large-volume graph are learned, which reveal the inter-frame relationships of the whole speech signal in a more dynamic way. Additionally, as the size of each cluster is much smaller than that of the whole speech signal, the K-graphs learning method leads to multiple graphs of small volume with a much lower learning complexity.

On the basis of the learned multiple undirected graphs, we propose the gain function representation for the MMSE graph spectral magnitude estimator by extending the classical MMSE-STSA to enhance noisy speech signals. The contributions of the paper are summarized as follows.

i) We propose the novel undirected multiple graphs by using the K-graphs learning method, which reveals potential relationships among noisy speech frames in real-time. On this basis, we construct a joint graph weight matrix and define the related graph Fourier basis.

ii) Based on the constructed graph Fourier basis, we investigate the gain function representation of MMSE graph magnitude spectral estimator for speech graph signals (SGSs). We propose an MMSE graph spectral magnitude estimator by extending the classical MMSE-STSA method in DSP to perform speech enhancement.

iii) Our numerical results show that the proposed method outperforms the benchmarks in terms of the output signal-to-noise ratio (SNR), perceptual evaluation of speech quality (PESQ), log-likelihood ratio (LLR), and short-time objective intelligibility (STOI). The numerical results also demonstrate the validity of the learned multiple graphs for speech signals. We use the classical MMSE-STSA [25] and OMLSA [26] methods in DSP and existing graph Wiener filtering methods [33] as benchmarks.

The remainder of the paper is organized as follows. Section 2 introduces the related work of GSP. Section 3 investigates the K-graphs learning method for speech signals. The details of the MMSE graph spectral magnitude estimator method are given in Section 4. Section 5 provides our experimental results. Section 6 concludes the paper.

Notation: The superscript $^T$ represents the transpose. ${\textrm{trace}( \cdot )}$, ${\left\| \cdot \right\| _F}$ and ${\left\| \cdot \right\| _1}$ represent the trace, F-norm, 1-norm, respectively, and $\textbf{1}$ denotes the all-ones matrix. $\times$ represents the Cartesian product. $\textbf{I}$ is the identity matrix. $E( \cdot )$ and $p( \cdot )$ represent the expectation operator and the probability density function, respectively.

2 Related work

2.1 Basics of GSP

Let $G = \left( {\mathcal {V},\mathcal {E},\textbf{A}} \right)$ represent a weighted graph, where $\mathcal {V}$ denotes the collection of vertices ${v_1}, \cdots ,{v_N}$, the set of edges $\mathcal {E}$ satisfies $\left( {m,n} \right) \in \mathcal {E}$ if and only if the vertex ${v_m}$ is connected to the vertex ${v_n}$ [34], and $\textbf{A}$ denotes the graph weight matrix. The element in $\textbf{A}$ represents the weight of the edge between vertices ${v_m}$ and ${v_n}$, which intuitively and numerically shows the appropriate dependence or similarity between signals on vertices [35, 36]. This paper focuses on a connected undirected graph with finite vertices. We can define the variation operator by the combinatorial graph Laplacian matrix $\textbf{L}= \textbf{D}-\textbf{A}$, where $\textbf{D}$ is a diagonal matrix with elements ${d_{mm}} = \sum \nolimits _{n = 1}^N {{{\textrm{A}}_{mn}}},(m = 1,\cdots N)$ [37].

Following [38, 39], we have the smoothness of graph signals $\textbf{X}$ given by

$$\begin{aligned} (\textbf{LX})({v_m})=\sum \nolimits _{\left\{ {{v_n}} \right\} \in \mathcal {V}}{{{\textrm{A}}_{m,n}}(x(m) - x(n))}. \end{aligned}$$

(1)

A small smoothness value means that signals indexed by adjacent vertices have similar values, that is, $\textbf{X}$ is smooth [40]. From (1), we have

$$\begin{aligned} {\textbf{X}^T}\textbf{LX}= & {} \sum \limits _{\left\{ {{v_m}}\right\} \in \mathcal {V}}{x(m)}\sum \limits _{\left\{ {{v_n}}\right\} \in \mathcal {V}}{{{\textrm{A}}_{m,n}}(x(m)- x(n))} \nonumber \\= & {} \sum \limits _{\left\{ {{v_m}}\right\} \in \mathcal {V}}{\sum \limits _{\left\{ {{v_n}}\right\} \in \mathcal {V}}{{{\textrm{A}}_{m,n}}({x^2}(m)-x(m)x(n))}} \\= & {} \frac{1}{2}\sum \limits _{\left\{ {{v_m}}\right\} \in \mathcal {V}}{\sum \limits _{\left\{ {{v_n}}\right\} \in \mathcal {V}}{{{\textrm{A}}_{m,n}}{{(x(m)-x(n))}^2}}}.\nonumber \end{aligned}$$

(2)
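The identity in (2) can be checked numerically on a small example. The following is a minimal sketch in plain Python; the 4-vertex symmetric weight matrix `A` and the signal `x` are made-up illustration values, not data from the paper.

```python
# Toy check of Eq. (2): x^T L x == 0.5 * sum_{m,n} A[m][n] * (x[m] - x[n])**2
# The weight matrix A (symmetric, zero diagonal) and the signal x are
# hypothetical values chosen only for illustration.

A = [
    [0.0, 1.0, 0.0, 0.5],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.5, 0.0, 1.0, 0.0],
]
x = [1.0, -2.0, 0.5, 3.0]
N = len(x)

# Combinatorial Laplacian L = D - A, with D the diagonal degree matrix.
degree = [sum(row) for row in A]
L = [[(degree[m] if m == n else 0.0) - A[m][n] for n in range(N)] for m in range(N)]

# Left-hand side: the quadratic form x^T L x.
quadratic_form = sum(x[m] * L[m][n] * x[n] for m in range(N) for n in range(N))

# Right-hand side: half the weighted sum of squared differences over all pairs.
pairwise_sum = 0.5 * sum(
    A[m][n] * (x[m] - x[n]) ** 2 for m in range(N) for n in range(N)
)

print(quadratic_form, pairwise_sum)
```

Both quantities coincide because `A` is symmetric; a smoother signal (neighbouring vertices with similar values) would give a smaller value.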

2.2 Speech signals on graphs

After the framing operation, speech signals can be expressed as a matrix ${\textbf{s}}\in {\mathbb {R}^{M \times {N_s}}}$, where each row represents one frame of the speech signal, M is the total number of noisy speech frames, and $N_s$ denotes the length of a speech frame. Let us view each discretized speech sample as a vertex. Assuming that the relationship between adjacent samples is symmetric, speech samples can be constructed as speech graph signals (SGSs) residing on a connected undirected graph. The one-to-one mapping between the $i$th noisy speech sample $f_i$ in a frame and the signal value of the $i$th vertex ${v_i}$ is given by

$$\begin{aligned} {\textbf{s}_G}: \mathbb {R} \rightarrow {\mathcal {V}_s},{f_i} \rightarrow {v_i}, \end{aligned}$$

(3)

where ${\mathcal {V}_s}$ is the vertex set with cardinality $|{\mathcal {V}_s}|=N_s$. In this way, the noisy speech signals in each frame are mapped into the graph domain.

2.3 The joint graph Fourier transform

Let us denote the time-vertex graph signal as ${\textbf{X}_E}{=}\left[ \textbf{x}_{1},\textbf{x}_{2},...,\textbf{x}_{\textrm{T}}\right] \in \mathbb {R}^{N \times \textrm{T}}$, where $\textbf{x}_1$, $\textbf{x}_2$, $\ldots$, $\textbf{x}_{\textrm{T}}$ represent graph signals of length N sampled at $1,2,...,\textrm{T}$ successive regular intervals [41]. To investigate the spectral properties of $\textbf{X}_E$, following [42, 43], the joint time-vertex graph Fourier transform (JFT) is defined as

$$\begin{aligned} \widehat{\textbf{X}_{E}}= \textrm{JFT}\{ \textbf{X}_{E}\} = \mathbf {\Psi }_{G}\textbf{X}_{E}\mathbf {\Psi }_{T}, \end{aligned}$$

(4)

where $\mathbf {\Psi }_T\in \mathbb {C}^{\textrm{T}\times \textrm{T}}$ is the normalized discrete Fourier transform matrix with entries $\Psi _{T}\left( {t,k} \right) = e^{- j\frac{2\pi (t-1)(k-1)}{\textrm{T}}} / \sqrt{\textrm{T}}$, and $\mathbf {\Psi }_G \in {\mathbb {R}^{N\times N}}$ is the eigenvector matrix obtained by the eigendecomposition of $\textbf{X}_E$’s graph weight matrix. More specifically, $\mathbf {\Psi }_T$ is applied to analyze the time-frequency oscillations of $\textbf{X}_E$ along the time domain, while $\mathbf {\Psi }_G$ allows us to obtain the graph-frequency characteristics of $\textbf{X}_E$ along the graph edges. Moreover, the corresponding inverse JFT (IJFT) is defined as

$$\begin{aligned} \textrm{IJFT}\{ {\widehat{{\textbf{X}_E}}}\} = {\mathbf {\Psi }_G^{-1}}{\widehat{{\textbf{X}_E}}}\mathbf {\Psi }_{T}^{H}, \end{aligned}$$

(5)

where the superscript $^H$ denotes the conjugate transpose, which inverts the unitary matrix $\mathbf {\Psi }_T$. The definitions of JFT and IJFT allow us to take into account both the variation over the graph and the temporal aspects of time-vertex graph signals.
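As a minimal sketch of the transform pair, the snippet below applies the JFT of (4) to a toy time-vertex signal and inverts it. It assumes a 2-vertex graph with a single unit-weight edge (so the graph basis is known in closed form) and T = 4 time steps; the helper names `matmul`, `conj_transpose` and `dft_matrix` are hypothetical. The inverse multiplies by the conjugate transpose of the normalized DFT matrix, which is its inverse because that matrix is unitary.

```python
import cmath
import math

def matmul(P, Q):
    """Plain-Python matrix product for small nested-list matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

def conj_transpose(P):
    """Conjugate transpose of a nested-list matrix."""
    return [[complex(P[i][j]).conjugate() for i in range(len(P))]
            for j in range(len(P[0]))]

def dft_matrix(T):
    """Normalized DFT matrix; 0-based (t, k) here play the role of (t-1)(k-1)."""
    return [[cmath.exp(-2j * math.pi * t * k / T) / math.sqrt(T) for k in range(T)]
            for t in range(T)]

# Graph basis: orthonormal eigenvectors of the 2-vertex, one-edge weight matrix.
r = 1.0 / math.sqrt(2.0)
psi_g = [[r, r], [r, -r]]      # symmetric and orthogonal, so its inverse is itself
psi_t = dft_matrix(4)

# Toy time-vertex signal: N = 2 vertices (rows), T = 4 time steps (columns).
X = [[1.0, 2.0, 0.0, -1.0],
     [0.5, 1.5, 1.0, -2.0]]

X_hat = matmul(matmul(psi_g, X), psi_t)                      # forward JFT
X_rec = matmul(matmul(psi_g, X_hat), conj_transpose(psi_t))  # inverse JFT

roundtrip_error = max(abs(X_rec[i][j] - X[i][j]) for i in range(2) for j in range(4))
print(roundtrip_error)
```

The coefficient `X_hat[0][0]` combines the lowest graph frequency with the zero temporal frequency, so it is proportional to the sum of all samples.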

3 The K-graphs learning method for speech graph signals

This section uses the K-graphs learning method to infer a multiple-graph model for SGSs without prior graph topologies. Traditional speech enhancement systems do not consider inter-frame relationships. Unlike noise, however, speech samples are strongly correlated both between and within frames. Learning a single global graph for the whole speech signal would be complex. Therefore, inspired by K-means and K-graphs learning [44], we partition the SGSs into clusters and use the K-graphs learning method to capture both the inter-frame and intra-frame properties of speech samples.

Noisy speech signals are mapped into the graph domain by using Eq. (3) and are constructed as SGSs. We employ $M_s$ speech frames in a cluster to learn a graph for capturing the relationships between these $M_s$ speech frames. The SGSs in the kth cluster, $\textbf{s}_G^k$ $\left( {1 \le k \le K}, K= M/M_s \right)$, reside on the kth undirected graph ${G_k} = ({\mathcal {V}_k},{\varvec{\mathcal {L}}_k})$, where ${\mathcal {V}_k}$ indicates the vertex set, ${\varvec{\mathcal {L}}_k}$ is the graph Laplacian matrix of ${G_k}$, and M is the total number of speech frames. Let us now investigate the graph Laplacian matrix ${\varvec{\mathcal {L}}_k}$ of ${G_k}$ in order to reveal the intrinsic relationships among speech frames in real time.

Following the K-graphs learning framework in [44], we formulate the multiple graphs learning problem of noisy speech graph signals as

$$\begin{aligned} & \min\limits_{\substack{\boldsymbol{\mathcal {L}}_{1}, \cdots \boldsymbol{\mathcal{L}}_{K} \\ \mathbf{s}_{G}^{1}, \cdots \mathbf{s}_{G}^{K} \in \mathbb{R}^{M_{s} \times N_{s}}}} \sum\limits_{k = 1}^{K} {\sum\limits_{\mathbf{s}_{G}^{k} \in \mathbf{s}_{G}^{K}} \textrm{trace}(\mathbf{s}_{G}^{k}{\boldsymbol{\mathcal{L}}_k}(\mathbf{s}_{G}^{k})^{T}) + \sum\limits_{k = 1}^{K} {\beta \|{\mathbf{1}} + {\boldsymbol{\mathcal{L}}_k} \|_F^2} + \sum\limits_{k = 1}^{K} \alpha \| {\boldsymbol{\mathcal{L}}_k}\|_1}, \nonumber\\ & \begin{array}{rl} s.t. \quad\textrm{diag}({\boldsymbol{\mathcal{L}}_k}) & \ge 0 ,\\ {}\textrm{trace}({\boldsymbol{\mathcal{L}}_k}) & = M_s. \\ {}{\boldsymbol{\mathcal{L}}_k} & = ({\boldsymbol{\mathcal{L}}_k})^T, \textrm{1} \le k \le K, \end{array} \end{aligned}$$

(6)

where $\textbf{s}_{G}^{k}\varvec{\mathcal {L}}_{k}(\textbf{s}_{G}^{k})^{T}$ describes the smoothness of SGSs supported on $G_k$, the Frobenius norm of $\textbf{1}+\varvec{\mathcal {L}}_{k}$ is used to control the distribution of the edge weights and the sparsity, and the third term is the regularization function and ensures its positive value [45]. The first and third constraints ensure the non-negativity of the diagonal and the symmetry of $\varvec{\mathcal {L}}_{k}$, respectively. The second (trace) constraint is added to prevent trivial solutions and to control the volume of the corresponding graph $G_k$. $M_s$ controls the volume of the intra-graph. $\alpha$ and $\beta$ are non-negative regularization parameters. As the objective function (6) is convex, we can solve it with the CVX toolbox [46] in the experimental section.
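The three constraints in (6) are simple to verify for any candidate matrix. The sketch below is only a feasibility check, not the solver used in the paper; the function names and the path-graph Laplacian used as a test candidate are assumptions made for illustration, and the candidate is taken to be an $M_s \times M_s$ matrix.

```python
def is_feasible_laplacian(L, M_s, tol=1e-9):
    """Check the constraints of (6): diag(L) >= 0, trace(L) = M_s, L = L^T."""
    n = len(L)
    symmetric = all(abs(L[i][j] - L[j][i]) <= tol for i in range(n) for j in range(n))
    nonneg_diag = all(L[i][i] >= -tol for i in range(n))
    trace_ok = abs(sum(L[i][i] for i in range(n)) - M_s) <= tol
    return symmetric and nonneg_diag and trace_ok

def path_laplacian(n):
    """Combinatorial Laplacian of an unweighted path graph on n vertices."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        L[i][i] += 1.0
        L[i + 1][i + 1] += 1.0
        L[i][i + 1] -= 1.0
        L[i + 1][i] -= 1.0
    return L

M_s = 4                                   # hypothetical cluster size
L_path = path_laplacian(M_s)              # trace is 2 * (M_s - 1) = 6, so infeasible
scale = M_s / sum(L_path[i][i] for i in range(M_s))
L_scaled = [[scale * v for v in row] for row in L_path]   # rescaled to trace M_s

print(is_feasible_laplacian(L_path, M_s), is_feasible_laplacian(L_scaled, M_s))
```

Rescaling shows why the trace constraint fixes the overall volume of the graph: without it, shrinking all weights toward zero would trivially reduce the smoothness term.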

We infer an intra-graph topology to investigate the internal relationships between speech samples within a frame. We denote $\textbf{s}_g^i$ as the $i_{th}$ row of $\textbf{s}_G^k$, which is indexed by an intra-graph $O_i$. Here we focus on studying the strong causality between adjacent speech samples within a frame. Upon denoting the graph weight matrix of $O_i$ by $\textbf{W}_i \in {{\mathrm {\mathbb {R}}}^{{N_s} \times {N_s}}}$, we set ${{w}_i}(m,n) = 1$ if there exists a strong causality between the vertex $v_m$ and its adjacent vertex $v_n$, and otherwise ${{w}_i}(m,n) = 0$. That is,

$$\begin{aligned} {\textbf{W}_i} = {\left[ {\begin{array}{*{10}{c}} 0&{}1&{}0&{} \cdots &{} \cdots &{}0\\ 1&{}0&{}1&{} \cdots &{} \cdots &{}0\\ 0&{}1&{}0&{}1&{} \cdots &{}0\\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \ddots &{}0\\ 0 &{} \cdots &{} \cdots &{}1&{}0&{}1\\ 0&{} \cdots &{} \cdots &{} \cdots &{}1&{}0 \end{array}} \right] }. \end{aligned}$$

(7)

Then, we have ${O_i} = ({\mathcal {V}_i},{\textbf{W}_i})$, where $\mathcal {V}_i$ represents the vertex set with cardinality $\left| {{\mathcal {V}_i}} \right| ={N_s}$. Hence, $M_s$ speech frames ${\textbf {f}}$ can be constructed as the speech graph signal $\textbf{s}_G$ indexed by the multiple graphs $G_{S}=(\mathcal {V}^{*},\varvec{\mathcal {L}}^{*})$ as shown in Fig. 1.

$$\begin{aligned} p: \textbf{f} \rightarrow \textbf{s}_G \in {\mathbb {R}^{{M_s} \times {N_s}}} \quad \text {indexed by} \quad {G_S}={G_k}\times {O_i}, \quad 1 \le k \le K,\ 1 \le i \le M_s. \end{aligned}$$

(8)

By applying the Cartesian product of the inter-graph Laplacian matrix $\varvec{\mathcal {L}}_{k}$ and the intra-graph weight matrix $\textbf{W}_i$, $\varvec{\mathcal {L}}^{*}$ is constructed as

$$\begin{aligned} \varvec{\mathcal {L}}^{*} = \varvec{\mathcal {L}}_{k}\times \textbf{W}_{i}, 1 \le k \le K,1 \le i \le M_s, \end{aligned}$$

(9)

and the vertex set $\mathcal {V}^{*}$ is given as

$$\begin{aligned} \mathcal {V}^* = {\mathcal {V}}_{k}\times \mathcal {V}_{i}. \end{aligned}$$

(10)

4 The MMSE graph spectral magnitude estimator method

This section proposes a minimum mean-square error (MMSE) graph spectral magnitude estimator based on the learned multiple graphs above. Specifically, we first define the corresponding joint graph Fourier transform (JGFT) and the inverse JGFT (IJGFT). Then we investigate the representation of the MMSE graph spectral magnitude estimator in GSP by extending the classical MMSE short-time spectral amplitude (STSA) estimator.

4.1 The joint graph Fourier transform for SGSs

Differing from the joint graph Fourier transform definition in Section 2, we apply the singular value decomposition (SVD) to $\textbf{W}_i$ and obtain

$$\begin{aligned} \textbf{W}_{i}= \textbf{F}_{w} \times \mathbf {\Lambda }^{w} \times \textbf{D}_{w}^{T}, \end{aligned}$$

(11)

where $\textbf{F}_{w}$ and $\textbf{D}_{w}$ are the left unitary matrix and the right unitary matrix of $\textbf{W}_{i}$, respectively, and $\mathbf {\Lambda }^{w} = \textrm{diag}(\lambda _{1}^{w},\lambda _{2}^{w},...,\lambda _{N_{s}}^{w})$ is the corresponding diagonal matrix and its element represents the graph frequency along the intra-graph edge. Similarly, we have

$$\begin{aligned} \varvec{\mathcal {L}}_{k}= \textbf{F}_{\mathcal {L}} \times \mathbf {\Lambda }^{\mathcal {L}} \times \textbf{D}_{\mathcal {L}}^{T}, \end{aligned}$$

(12)

where $\textbf{F}_{\mathcal {L}}$ and $\textbf{D}_{\mathcal {L}}$ respectively represent the left unitary matrix and the right unitary matrix of $\varvec{\mathcal {L}}_{k}$ and the element $\lambda _{j}^{\mathcal {L}}$ $\left( 1 \le j \le M_{s} \right)$ of $\mathbf {\Lambda }^{\mathcal {L}}$ represents the graph frequency along the inter-graph edges. The joint graph Fourier transform (JGFT) for $\textbf{s}_{G}^{k}$ can be defined as

$$\begin{aligned} \textbf{S}_{\mathcal {F}}^{k}= \textrm{JGFT}\{\textbf{s}_{G}^{k}\} = (\textbf{F}_{w})^{- 1}\textbf{s}_{G}^{k}(\textbf{F}_{\mathcal {L}})^{- 1}, \end{aligned}$$

(13)

where $\textbf{S}_{\mathcal {F}}^{k}$ is the graph Fourier version of $\textbf{s}_{G}^{k}$. Moreover, the inverse IJGFT of $\textbf{S}_{\mathcal {F}}^{k}$ is defined as

$$\begin{aligned} \textbf{s}_{G}^{k}= \textrm{IJGFT} \{\textbf{S}_{\mathcal {F}}^{k}\} = \textbf{F}_{w}\textbf{S}_{\mathcal {F}}^{k}\textbf{F}_{\mathcal {L}}. \end{aligned}$$

(14)

It should be noted that by using the defined JGFT, we can get the graph magnitude spectra of speech signals belonging to the real field by mapping speech signals into the graph frequency domain.
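To make the real-valued graph spectrum concrete, the following sketch transforms a single frame along the intra-graph of (7). It is a simplification under stated assumptions: it uses the eigendecomposition of $\textbf{W}_i$ rather than the SVD (for a symmetric matrix the two agree up to signs of the singular vectors), it relies on the known closed-form eigenpairs of a path-graph adjacency matrix, and the frame values and function names are made up for illustration.

```python
import math

def path_adjacency(n):
    """Weight matrix of Eq. (7): ones on the first off-diagonals, zeros elsewhere."""
    return [[1.0 if abs(m - k) == 1 else 0.0 for k in range(n)] for m in range(n)]

def path_eigenbasis(n):
    """Closed-form eigenpairs of the path-graph adjacency matrix.

    Column k (0-based) of F is sqrt(2/(n+1)) * sin((m+1)(k+1)pi/(n+1)),
    with eigenvalue 2*cos((k+1)pi/(n+1)).
    """
    c = math.sqrt(2.0 / (n + 1))
    F = [[c * math.sin((m + 1) * (k + 1) * math.pi / (n + 1)) for k in range(n)]
         for m in range(n)]
    lam = [2.0 * math.cos((k + 1) * math.pi / (n + 1)) for k in range(n)]
    return F, lam

n = 8
W = path_adjacency(n)
F, lam = path_eigenbasis(n)

# Largest residual of W f_k = lambda_k f_k over all eigenpairs.
eig_residual = max(
    abs(sum(W[m][p] * F[p][k] for p in range(n)) - lam[k] * F[m][k])
    for m in range(n) for k in range(n)
)

frame = [0.3, -0.1, 0.4, 0.9, -0.5, 0.2, 0.0, -0.7]   # one made-up speech frame
# Forward transform along the intra-graph; F^{-1} = F^T because F is orthogonal.
spectrum = [sum(F[m][k] * frame[m] for m in range(n)) for k in range(n)]
# Inverse transform recovers the frame.
recovered = [sum(F[m][k] * spectrum[k] for k in range(n)) for m in range(n)]
roundtrip_error = max(abs(a - b) for a, b in zip(frame, recovered))

print(eig_residual, roundtrip_error)
```

Every entry of `spectrum` is a real number, in contrast to the complex coefficients produced by the DFT.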

4.2 The MMSE graph spectral magnitude estimator for SGSs

Let us now investigate the MMSE graph spectral magnitude estimator. We denote $\textbf{s}_G^k = {\textbf{x}_G} + {\textbf{n}_G}$ where ${\textbf{x}_G}$ is clean SGSs, ${\textbf{n}_G}$ is the additive graph noise signal which is independent of ${\textbf{x}_G}$. By performing the defined JGFT in (13), we have

$$\begin{aligned} \textbf{S}_\mathcal {F}^k = {\textbf{X}_\mathcal {F}} + {\textbf{N}_\mathcal {F}} \in {{\mathbb {R}}} \end{aligned}$$

(15)

where $\textbf{S}_{\mathcal {F}}^k$, ${\textbf{X}_\mathcal {F}}$ and ${\textbf{N}_\mathcal {F}}$ are the JGFT coefficients of $\textbf{s}_G^k$, ${\textbf{x}_G}$ and ${\textbf{n}_G}$, respectively. Upon denoting the $i$th row of $\textbf{S}_\mathcal {F}^k$, ${\textbf{X}_\mathcal {F}}$, and ${\textbf{N}_\mathcal {F}}$ by $\textbf{Y}^i$, $\textbf{X}^i$, and $\textbf{N}^i$, respectively, a graph speech sample on a vertex can be denoted as $\textrm{Y}_j^i = \textrm{X}_j^i + \textrm{N}_j^i$ where $i=1,2,3,...,K$, $j=0,1,2,...,{N_s}-1$.

Let us denote $R_j^i = \left| {\textrm{Y}_j^i} \right|$ and $\textrm{Z}_j^i = \left| {\textrm{X}_j^i} \right|$. Based on the work in [46], the MMSE estimator for the graph magnitude spectrum ${\textrm{Z}_j^i}$ can be obtained as

$$\begin{aligned} \widehat{Z_j^i}= E(Z_j^i|Y_j^i)= \frac{{\int _0^\infty {p(Y_j^i|Z_j^i)p(Z_j^i)(Z_j^i)dZ_j^i} }}{{\int _0^\infty {p(Y_j^i|Z_j^i)p(Z_j^i)dZ_j^i}}}. \end{aligned}$$

(16)

In case of the Gaussian statistical model for spectral components, we have

$$\begin{aligned} p(Y_j^i|Z_j^i)= \frac{1}{{\pi ({\lambda _{\textrm{n}}})_j^i}}\textrm{exp} \left\{ {-\frac{1}{{({\lambda _{\textrm{n}}})_j^i}}|Y_j^i - Z_j^i{|^2}} \right\} , \end{aligned}$$

(17)

$$\begin{aligned} p\left( {Z_j^i} \right) = \frac{{Z_j^i}}{{\pi \left( {{\lambda _{\textrm{x}}}} \right) _j^i}}\textrm{exp} \left\{ {-\frac{{{{\left( {Z_j^i} \right) }^2}}}{{\left( {{\lambda _{\textrm{x}}}} \right) _j^i}}} \right\} , \end{aligned}$$

(18)

where $\left( {{\lambda _{\textrm{x}}}} \right) _j^i = E\left[ {{{\left( {X_j^i}\right) }^2}}\right]$ and $\left( {{\lambda _{\textrm{n}}}} \right) _j^i = E\left[ {{{\left( {N_j^i} \right) }^2}}\right]$ are the variances of the $j$th clean SGS spectral component ${X}_j^i$ and of the graph noise component ${N}_j^i$, respectively. By substituting (17) and (18) into $\int _0^\infty {Z_j^ip(Y_j^i|Z_j^i)p(Z_j^i)dZ_j^i}$ and $\int _0^\infty {p(Y_j^i|Z_j^i)p(Z_j^i)dZ_j^i}$, we have

$$\begin{aligned} \xi= & {} \int _0^\infty {Z_j^ip(Y_j^i|Z_j^i)p(Z_j^i)dZ_j^i} \nonumber \\= & {} \int _0^\infty {\frac{1}{{\pi ({\lambda _{\textrm{n}}})_j^i}}\textrm{exp} \left\{ {-\frac{1}{{({\lambda _{\textrm{n}}})_j^i}}|Y_j^i - Z_j^i{|^2}} \right\} \frac{{{{(Z_j^i)}^2}}}{{\pi \left( {{\lambda _{\textrm{x}}}} \right) _j^i}}\textrm{exp} \left\{ {-\frac{{{{\left( {Z_j^i}\right) }^2}}}{{\left( {{\lambda _{\textrm{x}}}} \right) _j^i}}} \right\} dZ_j^i}, \end{aligned}$$

(19)

and

$$\begin{aligned} \zeta= & {} \int _0^\infty {p(Y_j^i|Z_j^i)p(Z_j^i)dZ{{_j^i}}} \nonumber \\= & {} \int _0^\infty {\frac{1}{{\pi ({\lambda _{\textrm{n}}})_j^i}}\textrm{exp} \left\{ {- \frac{1}{{({\lambda _{\textrm{n}}})_j^i}}|Y_j^i - Z_j^i{|^2}} \right\} \frac{{Z_j^i}}{{\pi \left( {{\lambda _{\textrm{x}}}} \right) _j^i}}\textrm{exp} \left\{ { - \frac{{{{\left( {Z_j^i} \right) }^2}}}{{\left( {{\lambda _{\textrm{x}}}} \right) _j^i}}} \right\} dZ_j^i} . \end{aligned}$$

(20)

Then, we can rewrite (16) equivalently as

$$\begin{aligned} \widehat{Z_{j}^{i}} = \frac{\xi }{\zeta } = \frac{\int _{0}^{\infty }\textrm{exp} \left\{ -\frac{1}{(\lambda _{\textrm{n}})_{j}^{i}}|Y_{j}^{i}- Z_{j}^{i}|^{2}\right\} \textrm{exp} \left\{ - \frac{\left( Z_{j}^{i} \right) ^{2}}{(\lambda _{x})_{j}^{i}}\right\} (Z_{j}^{i})^{2}dZ_{j}^{i}}{\int _{0}^{\infty }\textrm{exp}\left\{ -\frac{1}{(\lambda _{\textrm{n}})_{j}^{i}}|Y_{j}^{i} - Z_{j}^{i}|^{2} \right\} \textrm{exp} \left\{ -\frac{\left( Z_{j}^{i}\right) ^{2}}{(\lambda _{\textrm{x}})_{j}^{i}}\right\} Z_{j}^{i}dZ_{j}^{i}}. \end{aligned}$$

(21)

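Completing the square in the exponent, that is, using $\frac{1}{({\lambda _{\textrm{n}}})_j^i}|Y_j^i - Z_j^i{|^2} + \frac{{(Z_j^i)}^2}{({\lambda _{\textrm{x}}})_j^i} = \frac{({\lambda _{\textrm{n}}})_j^i + ({\lambda _{\textrm{x}}})_j^i}{({\lambda _{\textrm{n}}})_j^i({\lambda _{\textrm{x}}})_j^i}{\left( Z_j^i - \frac{Y_j^i({\lambda _{\textrm{x}}})_j^i}{({\lambda _{\textrm{n}}})_j^i + ({\lambda _{\textrm{x}}})_j^i}\right) ^2} + \frac{{(Y_j^i)}^2}{({\lambda _{\textrm{n}}})_j^i + ({\lambda _{\textrm{x}}})_j^i}$ for the real-valued coefficients, and cancelling the factor that does not depend on $Z_j^i$ from the numerator and the denominator, we arrive at

$$\begin{aligned} \widehat{Z_j^i}= \frac{\xi }{\zeta }= \frac{{\int _0^\infty {\textrm{exp} \left\{ {-\frac{{({\lambda _{\textrm{n}}})_j^i + ({\lambda _{\textrm{x}}})_j^i}}{{({\lambda _{\textrm{n}}})_j^i({\lambda _{\textrm{x}}})_j^i}}{{\left( Z_j^i - \frac{{Y_j^i({\lambda _{\textrm{x}}})_j^i}}{{({\lambda _{\textrm{n}}})_j^i + ({\lambda _{\textrm{x}}})_j^i}}\right) }^2}} \right\} {{(Z_j^i)}^2}dZ_j^i} }}{{\int _0^\infty {\textrm{exp} \left\{ { - \frac{{({\lambda _{\textrm{n}}})_j^i + ({\lambda _{\textrm{x}}})_j^i}}{{({\lambda _{\textrm{n}}})_j^i({\lambda _{\textrm{x}}})_j^i}}{{\left( Z_j^i - \frac{{Y_j^i({\lambda _{\textrm{x}}})_j^i}}{{({\lambda _{\textrm{n}}})_j^i + ({\lambda _{\textrm{x}}})_j^i}}\right) }^2}} \right\} Z_j^idZ_j^i} }}. \end{aligned}$$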

(22)

Upon denoting ${\sigma ^2} = \frac{{({\lambda _{\textrm{n}}})_j^i+({\lambda _x})_j^i}}{{({\lambda _{\textrm{n}}})_j^i({\lambda _x})_j^i}}$ and $\tau = \frac{{Y_j^i({\lambda _x})_j^i}}{{({\lambda _{\textrm{n}}})_j^i + ({\lambda _x})_j^i}}$, we can rewrite (21) equivalently as

$$\begin{aligned} \widehat{Z_j^i}= \frac{\xi }{\zeta }= \frac{{\int _0^\infty {\textrm{exp} \left\{ {-{\sigma ^2}{{(Z_j^i- \tau )}^2}}\right\} {{(Z_j^i)}^2}dZ_j^i}}}{{\int _0^\infty {\textrm{exp} \left\{ {-{\sigma ^2}{{(Z_j^i- \tau )}^2}} \right\} (Z_j^i)dZ_j^i}}}. \end{aligned}$$

(23)

Let us now analyze $\xi$. Upon denoting $y = Z_j^i\sigma - \sigma \tau$, we have

$$\begin{aligned} \xi= & {} \int _0^\infty {\textrm{exp} \left\{ {-{\sigma ^2}{{(Z_j^i- \tau )}^2}} \right\} {{(Z_j^i)}^2}dZ_j^i} \nonumber \\= & {} \frac{1}{{{\sigma ^3}}}\int _0^{\infty } \textrm{exp} \left\{ { - {{(Z_j^i\sigma - \sigma \tau )}^2}}\right\} {{(Z_j^i\sigma )}^2}d(Z_j^i\sigma ) \\= & {} \frac{1}{\sigma ^{3}}\int _{-\sigma \tau }^0 {{{(y + \sigma \tau )}^2}\textrm{exp} \left\{ {-{y^2}} \right\} dy}+ \frac{1}{{{\sigma ^3}}}\int _0^\infty {{{(y + \sigma \tau )}^2}\textrm{exp} \left\{ {-{y^2}} \right\} dy}. \nonumber \end{aligned}$$

(24)

Let us introduce the Gauss error function $erf(x) = \frac{2}{{\sqrt{\pi }}}\int _0^x {{e^{ - {\eta ^2}}}} d\eta$. Substituting erf(x) into (24) (see Appendix 1) gives

$$\begin{aligned} \xi = \frac{\tau }{{2{\sigma ^2}}}\textrm{exp} \left\{ {-{\sigma ^2}{\tau ^2}} \right\} + \frac{1}{{{\sigma ^3}}}(\frac{\pi }{4} + \frac{{\sqrt{\pi }}}{2}{\sigma ^2}{\tau ^2})-\frac{{1 + 2{\sigma ^2}{\tau ^2}}}{{2{\sigma ^3}}}\frac{{\sqrt{\pi }}}{2}erf(\sigma \tau ). \end{aligned}$$

(25)

Similarly, we have

$$\begin{aligned} \zeta = \frac{1}{{2{\sigma ^2}}}\textrm{exp} \left\{ {-{\sigma ^2}{\tau ^2}} \right\} + \frac{{\sqrt{\pi }}}{2}\frac{\tau }{\sigma }+ \frac{{\sqrt{\pi }}}{2}\frac{\tau }{\sigma }erf(\sigma \tau ). \end{aligned}$$

(26)
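The closed form for $\zeta$ in (26) can be compared against direct numerical integration of the denominator of (23). The sketch below uses a composite Simpson rule on a truncated interval; the values of `sigma` and `tau`, the truncation point, and the function names are assumptions chosen only for illustration.

```python
import math

def zeta_closed_form(sigma, tau):
    """Right-hand side of Eq. (26)."""
    a = sigma * tau
    return (math.exp(-a * a) / (2.0 * sigma ** 2)
            + (math.sqrt(math.pi) / 2.0) * (tau / sigma) * (1.0 + math.erf(a)))

def simpson(f, lo, hi, n=20000):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    total += 4.0 * sum(f(lo + i * h) for i in range(1, n, 2))
    total += 2.0 * sum(f(lo + i * h) for i in range(2, n, 2))
    return total * h / 3.0

def zeta_numeric(sigma, tau):
    """Integral of exp(-sigma^2 (z - tau)^2) * z over [0, inf), truncated."""
    upper = max(tau, 0.0) + 10.0 / sigma   # the integrand is negligible beyond this
    return simpson(lambda z: math.exp(-(sigma * (z - tau)) ** 2) * z, 0.0, upper)

sigma, tau = 1.3, 0.8                      # arbitrary illustration values
closed = zeta_closed_form(sigma, tau)
numeric = zeta_numeric(sigma, tau)
print(closed, numeric)
```

The same quadrature routine can be pointed at the numerator of (23) to evaluate the ratio $\xi / \zeta$ numerically when a quick sanity check of a gain value is needed.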

By combining (21), (25) and (26) (see Appendix 2), we arrive at

$$\begin{aligned} \widehat{Z_j^i} = \frac{{\frac{\tau }{{2{\sigma ^2}}}\textrm{exp} \left\{ { - {\sigma ^2}{\tau ^2}} \right\} + \frac{1}{{{\sigma ^3}}}(\frac{\pi }{4} + \frac{{\sqrt{\pi }}}{2}{\sigma ^2}{\tau ^2}) - \frac{{1 + 2{\sigma ^2}{\tau ^2}}}{{2{\sigma ^3}}}\frac{{\sqrt{\pi }}}{2}erf(\sigma \tau )}}{{\frac{1}{{2{\sigma ^2}}}\textrm{exp} \left\{ { - {\sigma ^2}{\tau ^2}} \right\} + \frac{{\sqrt{\pi }}}{2}\frac{\tau }{\sigma } + \frac{{\sqrt{\pi }}}{2}\frac{\tau }{\sigma }erf(\sigma \tau )}}R_j^i. \end{aligned}$$

(27)

Let us introduce the notation of the prior signal-to-noise ratio $\varphi _j^i{\text { = }}\frac{{({\lambda _{\text {x}}})_j^i}}{{({\lambda _{\text {n}}})_j^i}}$ and the posterior signal-to-noise ratio $\delta _j^i{\text { = }}\frac{{({\lambda _{\text {n}}})_j^i}}{{({\lambda _{\text {x}}})_j^i}}$. Due to ${\sigma ^2} = \frac{{({\lambda _{\text {n}}})_j^i + ({\lambda _x})_j^i}}{{({\lambda _{\text {n}}})_j^i({\lambda _x})_j^i}}$ and $\tau = \frac{{Y_j^i({\lambda _x})_j^i}}{{({\lambda _{\text {n}}})_j^i + ({\lambda _x})_j^i}}$, we have

$$\begin{aligned} \gamma _j^i = \sqrt{\frac{{({\lambda _{\text {n}}})_j^i + ({\lambda _{\text {x}}})_j^i}}{{({\lambda _{\text {n}}})_j^i({\lambda _{\text {x}}})_j^i}}} = \sqrt{\frac{{1 + \varphi _j^i}}{{\varphi _j^i ({\lambda _{\text {n}}})_j^i}}}, \quad \vartheta _j^i = \frac{{Y_j^i({\lambda _{\text {x}}})_j^i}}{{({\lambda _{\text {n}}})_j^i + ({\lambda _{\text {x}}})_j^i}}= \frac{{Y_j^i\varphi _j^i}}{{1 + \varphi _j^i}}. \end{aligned}$$

(28)

By combining (23), (27) and (28), we can obtain the gain function of the MMSE graph spectral magnitude estimator given as

$$\begin{aligned} G_{\text {GMMSE}}(\gamma _{j}^{i},\vartheta _{j}^{i})= & {} \frac{{\widehat{Z_j^i}}}{{R_j^i}} \nonumber \\= & {} \frac{{\frac{{\vartheta _j^i}}{{2{{\left( {\gamma _j^i} \right) }^2}}}\textrm{exp} \left\{ {-{{(\gamma _j^i\vartheta _j^i)}^2}} \right\} + \frac{1}{{{{\left( {\gamma _j^i} \right) }^3}}}\left( \frac{\pi }{4} + \frac{{\sqrt{\pi }}}{2}{{(\gamma _j^i\vartheta _j^i)}^2}\right) - \frac{{1 + 2{{(\gamma _j^i\vartheta _j^i)}^2}}}{{2{{\left( {\gamma _j^i} \right) }^3}}}\frac{{\sqrt{\pi }}}{2}erf(\gamma _j^i\vartheta _j^i)}}{{\frac{1}{{2{{\left( {\gamma _j^i} \right) }^2}}}\textrm{exp} \left\{ { - {{(\gamma _j^i\vartheta _j^i)}^2}} \right\} + \frac{{\sqrt{\pi }}}{2}\frac{{\vartheta _j^i}}{{\gamma _j^i}} + \frac{{\sqrt{\pi }}}{2}\frac{{\vartheta _j^i}}{{\gamma _j^i}}erf(\gamma _j^i\vartheta _j^i)}} \nonumber \\ \approx & {} \frac{{\frac{{C\vartheta _j^i}}{{2{{\left( {\gamma _j^i} \right) }^2}}} + \frac{1}{{{{\left( {\gamma _j^i} \right) }^3}}}\left( \frac{\pi }{4} + \frac{{\sqrt{\pi }}}{2}{{(\gamma _j^i\vartheta _j^i)}^2}\right) - \frac{{1 + 2{{(\gamma _j^i\vartheta _j^i)}^2}}}{{2{{\left( {\gamma _j^i} \right) }^3}}}\frac{{\sqrt{\pi }}}{2}erf(\gamma _j^i\vartheta _j^i)}}{{\frac{C}{{2{{\left( {\gamma _j^i} \right) }^2}}} + \frac{{\sqrt{\pi }}}{2}\frac{{\vartheta _j^i}}{{\gamma _j^i}} + \frac{{\sqrt{\pi }}}{2}\frac{{\vartheta _j^i}}{{\gamma _j^i}}erf(\gamma _j^i\vartheta _j^i)}}, \end{aligned}$$

(29)

where C is a constant. Following the classical decision directed approach in [25], we have $\varphi _j^i$ as

$$\begin{aligned} \varphi _j^i = (1-{\upsilon ^2})\varphi _{j - 1}^i +{\upsilon ^2}\max ((\delta _j^i - 1),0), \end{aligned}$$

(30)

where $\upsilon \in (0,1)$ is the gain parameter. For convenience, we refer to the proposed MMSE graph spectral magnitude estimator based on K-graphs learning as GMMSE-KGL in the following sections.
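The recursion in (30) can be sketched in a few lines. This is a literal transcription of (30) along the index $j$, not the authors' implementation; the function name, the initial value `phi_init`, and the sample posterior SNR values are assumptions for illustration.

```python
def decision_directed_prior_snr(delta, upsilon, phi_init=1.0):
    """Apply the recursion of Eq. (30) to a sequence of posterior SNR values.

    delta    : posterior SNR values delta_j, in order of increasing j
    upsilon  : gain parameter in (0, 1); Eq. (30) uses its square as the weight
    phi_init : assumed starting value for the prior SNR (not given in the text)
    """
    weight = upsilon ** 2
    phi_prev = phi_init
    phi = []
    for d in delta:
        # Blend the previous estimate with the rectified instantaneous term.
        phi_curr = (1.0 - weight) * phi_prev + weight * max(d - 1.0, 0.0)
        phi.append(phi_curr)
        phi_prev = phi_curr
    return phi

phi = decision_directed_prior_snr([3.0, 0.5], upsilon=0.5, phi_init=1.0)
print(phi)
```

The `max(..., 0)` term keeps the estimate non-negative when the posterior SNR drops below one, and a smaller `upsilon` makes the estimate follow its previous value more closely.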

5 Numerical results and discussions

In this section, we present the output signal-to-noise ratio (SNR), perceptual evaluation of speech quality (PESQ) [47], log-likelihood ratio (LLR) [48], and short-time objective intelligibility (STOI) [49] results of the proposed MMSE graph spectral magnitude estimator. The traditional MMSE-STSA method in [25], the optimally modified log-spectral amplitude (OMLSA) method in [26], the improved graph Wiener filtering method (GWF-SGS) in [30], the vertex-frequency graph Wiener filtering (VFGWF) in [32], the graph Wiener filtering method for directed cyclic time series (GWF-DCGS), and the graph Wiener filtering for arbitrary graph signals (GWF-AGS) in [33] are used as the benchmarks. The output LLR is defined as

$$\begin{aligned} \textrm{LLR} = \text {log}\left( \overrightarrow{d_{p}} \textbf{R}_{\textbf{c}}\overrightarrow{d_{p}}^{T} / \overrightarrow{d_{c}} \textbf{R}_{\textbf{c}}\overrightarrow{d_{c}}^{T} \right) , \end{aligned}$$

(31)

where ${\overrightarrow{{d_p}}}$ and ${\overrightarrow{{d_c}}}$ represent the LPC vectors of the enhanced speech frame and the original speech frame, respectively, and $\mathbf {R_c}$ represents the autocorrelation matrix of the original speech signal [50]. In our numerical simulations, the noisy speech signals are generated by mixing clean speech signals from the TIMIT database [51] with noise signals at input signal-to-noise ratios (SNRs) from −15 to 5 dB. Two hundred sentences from 20 speakers (10 females and 10 males) are used as the clean speech signals. White noise, Gaussian colored noise, and babble noise from the NOISEX-92 library [52] are used as noise signals. The sampling frequency is 16 kHz. The speech signals are framed by a Hamming window with a length of 256 points and an overlap of $50\%$.
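The setup described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: the helper names `mix_at_snr`, `hamming_frames`, `quad_form`, and `llr` are hypothetical, the Hamming window uses the common 0.54/0.46 coefficients, the logarithm in Eq. (31) is taken as natural, and the LPC vectors and autocorrelation matrix are assumed to be computed elsewhere.

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested input SNR in dB."""
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(clean, noise)]


def hamming_frames(signal, frame_len=256, overlap=0.5):
    """Split `signal` into Hamming-windowed frames with the given fractional overlap."""
    hop = int(frame_len * (1 - overlap))
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    return [[w * x for w, x in zip(window, signal[start:start + frame_len])]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def quad_form(d, r):
    """Return d R d^T for a row vector d and a square matrix R."""
    size = len(d)
    return sum(d[i] * r[i][j] * d[j] for i in range(size) for j in range(size))


def llr(d_p, d_c, r_c):
    """Eq. (31): log ratio of the two quadratic forms built from R_c."""
    return math.log(quad_form(d_p, r_c) / quad_form(d_c, r_c))
```

With a frame length of 256 and $50\%$ overlap the hop size is 128 samples, so a one-second utterance at 16 kHz yields 124 full frames. The LLR is zero when the enhanced and original LPC vectors coincide.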

Figure 2 shows the traditional spectrogram obtained by the discrete Fourier transform (DFT) and the graph spectrogram obtained by the proposed graph Fourier basis, which is built from the graph weight matrix $\textbf{W}_i$ of the inter-graph. Observe from Fig. 2 that the graph spectrogram is mainly distributed in the high graph frequency regions, while the traditional spectrogram is mainly distributed in the low-frequency regions. The reason is that, according to Theorem 1 on graph frequency ordering in [53], the smallest eigenvalue of $\textbf{W}_i$ represents the lowest graph frequency and its largest eigenvalue represents the highest graph frequency. These graph frequencies differ from the traditional frequencies. Hence, although the graph spectrogram is presented in the same form as the conventional spectrogram, its distribution is markedly different. In addition, the proposed graph Fourier basis maps speech signals into a real-valued graph frequency domain by applying the eigenvector matrix of $\textbf{W}_i$.
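To make the eigenvector-based transform concrete, the sketch below builds a graph Fourier basis from a small symmetric weight matrix. It is only an illustration: the 4-vertex matrix `W` and the signal `x` are made-up toy values rather than a learned inter-graph, and NumPy's `eigh` is used here as one convenient eigensolver for symmetric matrices.

```python
import numpy as np

# Toy symmetric weight matrix of an undirected 4-vertex graph (hypothetical values).
W = np.array([[0.0, 0.8, 0.0, 0.1],
              [0.8, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.9],
              [0.1, 0.0, 0.9, 0.0]])

# eigh returns real eigenvalues in ascending order with orthonormal eigenvectors,
# so column k of U pairs with the k-th smallest eigenvalue (k-th graph frequency
# under the ordering described in the text).
graph_freqs, U = np.linalg.eigh(W)

x = np.array([1.0, -2.0, 0.5, 3.0])  # toy signal on the four vertices
x_hat = U.T @ x                       # forward graph Fourier transform
x_rec = U @ x_hat                     # inverse transform recovers the signal
```

Because the weight matrix of an undirected graph is symmetric, `U` is real and orthogonal, which is why the transformed coefficients in `x_hat` are real-valued.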

Figure 3 shows the output SNR of the proposed GMMSE-KGL method under white noise versus the frame number $M_s$, where $N_s=256$. Observe from Fig. 3 that, to achieve a high output SNR, $M_s$ should be neither too small nor too large. When $M_s$ is small, the resulting small multiple graphs cannot describe the relationships among non-adjacent frames well. In contrast, when $M_s$ is large, the boundaries between sub-multiple graphs might have a similar tendency, which would degrade the output SNR. This is because larger multiple graphs lose some details of the speech samples, so that the graph spectral magnitude of the speech samples cannot be estimated well. Hence, a suitable range is $M_s \in [20,30]$ in the case of $N_s=256$, and we use $M_s=30$ in our numerical simulations below.

Figure 4 shows the output SNR results of the proposed GMMSE-KGL method under white noise for different vertex numbers $N_s$ of the intra-graph topology $O_i$, where $M_s=30$. We can see from Fig. 4 that the performance of the proposed GMMSE-KGL method decreases as $N_s$ increases. The reason is that the intra-graph topology becomes larger and more complex as $N_s$ increases, so that the gain function of the proposed GMMSE-KGL method cannot be accurately estimated from the SGS's graph power spectrum on $O_i$. Although the proposed GMMSE-KGL method would perform better with $N_s=64$, for a fair comparison we use frames of length 256 and build our intra-graph with $N_s=256$, that is, $O_i = (\mathcal {V}_{i},{\textbf {W}}_{i}^{256 \times 256})$.

Table 1 shows the PESQ results under white noise for three graph models, each followed by the GMMSE method: K-graphs learning (KGL), the graph k-shift operator (GKSO) [30], and graph learning (GL) [32]. For convenience, we denote these methods as GMMSE-KGL, GMMSE-GKSO, and GMMSE-GL, respectively. We can observe from Table 1 that the PESQ of the proposed GMMSE-KGL is higher than that of GMMSE-GKSO and GMMSE-GL, which illustrates the effectiveness of the K-graphs learning part in speech enhancement. Moreover, when the input SNR is larger than −5 dB, the PESQ of the proposed GMMSE-KGL is 0.5 higher than that of GMMSE-GKSO and 0.2 higher than that of GMMSE-GL.

Table 1 The PESQ of the different graph models followed by the GMMSE method
