Open Access

A signal subspace approach to spatio-temporal prediction for multichannel speech enhancement

EURASIP Journal on Audio, Speech, and Music Processing 2015, 2015:5

https://doi.org/10.1186/s13636-015-0051-z

Received: 23 October 2014

Accepted: 20 January 2015

Published: 10 February 2015

Abstract

The spatio-temporal-prediction (STP) method for multichannel speech enhancement has recently been proposed. This approach makes it theoretically possible to attenuate the residual noise without distorting speech. In addition, the STP method depends only on the second-order statistics and can be implemented using a simple linear filtering framework. Unfortunately, some numerical problems can arise when estimating the filter matrix in transients. In such a case, the speech correlation matrix is usually rank deficient, so that no solution exists. In this paper, we propose to implement the spatio-temporal-prediction method using a signal subspace approach. This allows for nullifying the noise subspace and processing only the noisy signal in the signal-plus-noise subspace. As a result, we are able to not only regularize the solution in transients but also to achieve higher attenuation of the residual noise. The experimental results also show that the signal subspace approach distorts speech less than the conventional method.

Keywords

Signal subspace; Spatio-temporal prediction; Speech enhancement

1 Introduction

Speech enhancement is important for many applications including mobile communications, speech coding, speech recognition, and hearing aids. The traditional objective of multichannel speech enhancement is to recover the source speech signal from the outputs of an array of microphones. It is usually achieved by using beamforming techniques [1-3]. The key idea of beamforming is to process the signals of a microphone array so as to extract the sounds that come from only one direction. In this way, it is possible to dereverberate speech, and the background noise can be reduced as well by steering away from noise directions. Unfortunately, in order to work reasonably well in a reverberant environment, these techniques usually require knowing the impulse responses of the acoustic room or their relative ratios. These parameters can be fixed, provided the geometry of the microphone array is known, or estimated adaptively [4], which in general is a difficult task, however.

Recently, the objective of multichannel speech enhancement has been reformulated, so that noise reduction can be achieved without dereverberating speech. In contrast to beamforming techniques, knowledge of the geometry of the microphone array is not required, and the optimal filter depends only on the second-order statistics of the noisy signal.

In [5], the authors presented the most common techniques of multichannel noise reduction based on linear filtering. In such solutions, the noise-free speech is estimated by a linear transformation of the observation vector. The simplest approach is to minimize the mean square error (MSE) between the noise-free and filtered speech signals at a given microphone, which leads to a multichannel version of the classical Wiener filter. In this case, some noise is reduced at the cost of the increased speech distortion, but we cannot explicitly control the trade-off between these quantities.

Speech estimation can also be considered as a constrained optimization problem, where the speech distortions are minimized subject to the residual noise power. This approach is used by the single-channel methods [6] and was implemented in a similar way using a signal subspace technique in [5]. Unlike the frequency domain methods, which are based on the discrete Fourier transform (DFT), the signal subspace approach decomposes the vector space of noisy signals into the speech-plus-noise subspace and noise-only subspace using the Karhunen-Loeve transform (KLT). Then, spectral weighting is performed only in the signal-plus-noise subspace. The components projected onto the noise-only subspace are simply nullified, which results in significantly better performance when compared to the conventional DFT-based methods, where the full-band (and thus erroneous) spectrum must be processed. Unfortunately, also in this case, it is impossible to reduce the residual noise without introducing speech distortions. Several single-channel approaches [7-9] that exploit the masking effects are known to make the speech distortion or the residual noise inaudible, but introducing psychoacoustics into multichannel speech enhancement is a challenging task. On the other hand, some hearing properties have been introduced in a beamforming technique [10], but the resulting improvement is not as great as in the single-channel case.

It seems that the major limitation of all these methods is that they use only temporal prediction. In fact, spatial correlations are implicitly embedded in the second-order statistics, or inter-channel correlation matrices, but are not explicitly used. Therefore, in [11,12], the authors proposed a novel technique based on spatio-temporal prediction (STP). A DFT-based implementation of this technique has also been proposed [13,14], but in this case, the algorithm has been restricted to use only spatial prediction. It has been verified experimentally that the STP approach outperforms the classical beamforming techniques in terms of noise reduction [11]. In [5], it was proved analytically that by using the STP method, it is theoretically possible to reduce the residual noise without distorting the speech. However, a major drawback of the STP method is its numerical instability, as this approach assumes that the speech correlation matrix is of full rank. Because this is not true for low-power speech at transients, the solution must be regularized empirically in practice. Alternatively, under uncertainty about the speech presence, conditional estimators can be used [15]. Even if the speech correlation matrix is of full rank, the STP method requires many microphones to effectively reduce the residual noise.

In this paper, we propose a signal-subspace implementation of the STP method. By decomposing the signal vector space, we are able to limit processing to the signal-plus-noise subspace only. Thus, the numerical problems can be avoided in a more natural way. Since the noisy speech projected onto the noise-only subspace can simply be nullified, the signal subspace approach allows for attenuating noise more, even for a small number of microphones. In addition, we have rederived the STP method using a notation slightly different from that in [5], in order to expose the possibility of denoising all microphone signals at once.

2 Signal model and linear filtering

Let us consider an array of N microphones with arbitrary geometry and a single speech source s(k) located inside a reverberant enclosure. The observation signal at the nth microphone is given by:
$$ y_{n}(k) = a_{n}(k) * s(k) + v_{n}(k) = x_{n}(k) + v_{n}(k), $$
(1)
where * denotes convolution, a_n(k) is the acoustic impulse response from the source to the nth microphone, and x_n(k) and v_n(k) are, respectively, the noise-free speech and the noise components received by the nth microphone. Such a mixing model is illustrated in Figure 1.
Figure 1

Multi-microphone signal model.
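The mixing model of Equation 1 can be sketched in a few lines of NumPy. The helper below is ours, not part of the original paper; it assumes a causal impulse response and truncates the convolution to the source length:

```python
import numpy as np

def microphone_signal(s, a_n, v_n):
    """y_n(k) = a_n(k) * s(k) + v_n(k)  (Equation 1).

    s: source speech, a_n: room impulse response to microphone n,
    v_n: noise at microphone n (same length as s)."""
    x_n = np.convolve(a_n, s)[:len(s)]  # noise-free speech component x_n(k)
    return x_n + v_n
```

With a_n = [1], the microphone simply observes s(k) + v_n(k); a delayed impulse response shifts the speech component accordingly.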

Usually data are processed in L-sample blocks. Thus, the signals can be represented using the vector-matrix notation as follows:
$$ \textbf{y}_{n}(k) = \left[y_{n}(k)\; y_{n}(k-1)\; \ldots\; y_{n}(k-L+1)\right]^{T}. $$
(2)
The estimate of the noise-free speech at the nth microphone can be obtained using a linear transformation of the observation vector:
$$ \hat{\mathbf{x}}_{n}(k) = \mathbf{H}_{n} \mathbf{y}(k) = \mathbf{H}_{n} \left[\mathbf{x}(k)+\mathbf{v}(k)\right], $$
(3)
where:
$$ \begin{aligned} \mathbf{y}(k) &= \left[\mathbf{y}_{1}^{T}(k)\; \mathbf{y}_{2}^{T}(k)\; \ldots\; \mathbf{y}_{N}^{T}(k) \right]^{T}, \\ \mathbf{x}(k) &= \left[\mathbf{x}_{1}^{T}(k)\; \mathbf{x}_{2}^{T}(k)\; \ldots\; \mathbf{x}_{N}^{T}(k) \right]^{T},\\ \mathbf{v}(k) &= \left[\mathbf{v}_{1}^{T}(k)\; \mathbf{v}_{2}^{T}(k)\; \ldots\; \mathbf{v}_{N}^{T}(k) \right]^{T}. \\ \end{aligned} $$
(4)
The vectors x_n(k) and v_n(k) denote the noise-free speech and the noise, respectively, and are defined similarly to Equation 2. H_n is a filtering matrix of size L×NL. The estimation error is defined by:
$$ \begin{aligned} \mathbf{e}(k) &= \hat{\mathbf{x}}_{n}(k) - \mathbf{x}_{n}(k)\\ & = \underbrace{(\mathbf{H}_{n} - \mathbf{U}_{n})\mathbf{x}(k)}_{\mathbf{e}_{x}(k)} + \underbrace{\mathbf{H}_{n} \mathbf{v}(k)}_{\mathbf{e}_{v}(k)}, \\ \end{aligned} $$
(5)
where:
$$ \mathbf{U}_{n} = \left[\mathbf{0}_{L \times (n-1)L} \; \mathbf{I}_{L} \; \mathbf{0}_{L \times (N-n)L} \right], $$
(6)

is a selection matrix of size L×NL. The terms e_x(k) and e_v(k) denote the speech distortion and the residual noise, respectively.
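The stacking in Equations 2 to 6 can be sketched as follows (helper names are ours); U_n simply extracts the nth L-sample block from the NL-dimensional stacked vector:

```python
import numpy as np

def stacked_vector(channels, k, L):
    """y(k): concatenation of y_n(k) = [y_n(k) ... y_n(k-L+1)]^T
    over all channels (Equations 2 and 4)."""
    return np.concatenate([np.asarray(ch)[k - np.arange(L)] for ch in channels])

def selection_matrix(n, N, L):
    """U_n = [0_{L x (n-1)L}  I_L  0_{L x (N-n)L}]  (Equation 6, n is 1-based)."""
    U = np.zeros((L, N * L))
    U[:, (n - 1) * L : n * L] = np.eye(L)
    return U
```

Multiplying y(k) by U_n recovers y_n(k), which is exactly how the selection matrix is used in Equation 5.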

For completeness, we also define the correlation matrix of an arbitrary vector a as:
$$ \mathbf{R_{aa}} (k) = E \left\{ \mathbf{a}(k) \mathbf{a}^{T}(k) \right\}, $$
(7)
where E{.} is the expectation operator. Assuming that the speech and noise are short-term stationary and uncorrelated processes, the correlation matrix of the noisy speech can be written as:
$$ \mathbf{R_{yy}}(k) = \mathbf{R_{xx}}(k) + \mathbf{R_{vv}}(k). $$
(8)

Unless otherwise stated, all equations hold for any arbitrarily chosen point in time. Therefore, for the sake of brevity, the time index k is often omitted in the rest of this paper.

3 Spatio-temporal prediction

The STP method is based on the assumption that the microphone signals can be predicted not only in the time domain but also in the space domain [11]. In particular, the signal x_m(k) can be predicted from the signal x_n(k) using a linear filter matrix W_{n,m} such that:
$$ \mathbf{x}_{m}(k) = \mathbf{W}_{n,m}^{T} \mathbf{x}_{n}(k), \quad m=1, 2, 3, \ldots, N, $$
(9)
with W_{n,n} = I_L. The prediction matrices can be concatenated so as to form the L×NL matrix:
$$ \mathbf{W}_{n} = \left[\mathbf{W}_{n,1}\; \mathbf{W}_{n,2}\; \ldots\; \mathbf{W}_{n,N} \right], $$
(10)
and:
$$ \mathbf{x}(k) = \mathbf{W}_{n}^{T} \mathbf{x}_{n}(k). $$
(11)
By substituting Equation 11 into Equation 5 and assuming that \(\mathbf {H}_{n} \mathbf {W}_{n}^{T} = \mathbf {I}_{L}\), we can deduce that the residual noise can be minimized without distorting speech. Thus, the constrained optimization problem is formulated as follows:
$$ \mathop{\min}\limits_{\mathbf{H}_{n}}\text{tr}\left\{E\left[\mathbf{e}_{v}(k)\mathbf{e}_{v}^{T}(k)\right]\right\} \quad \text{subject to} \quad \mathbf{H}_{n} \mathbf{W}_{n}^{T} = \mathbf{I}_{L}. $$
(12)
The optimal filter matrix is found using the Lagrange multipliers method:
$$ \mathbf{H}_{n} = \left(\mathbf{W}_{n} \mathbf{R_{vv}}^{-1}\mathbf{W}_{n}^{T}\right)^{-1}\mathbf{W}_{n} \mathbf{R_{vv}}^{-1}. $$
(13)

A solution exists if and only if R vv is positive definite, and the matrix W n is of rank L. As noise signals are usually stationary and have smooth spectra, R vv has full rank and can be estimated using long-term averaging during speech pauses.
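Equation 13 translates directly into NumPy. The sketch below (our naming) assumes W_n has full row rank and R_vv is positive definite; one can verify numerically that the constraint H_n W_n^T = I_L holds:

```python
import numpy as np

def stp_filter(W_n, R_vv):
    """H_n = (W_n R_vv^{-1} W_n^T)^{-1} W_n R_vv^{-1}  (Equation 13).

    W_n: (L, NL) prediction matrix, R_vv: (NL, NL) noise correlation."""
    A = W_n @ np.linalg.inv(R_vv)         # W_n R_vv^{-1}
    return np.linalg.solve(A @ W_n.T, A)  # (W_n R_vv^{-1} W_n^T)^{-1} W_n R_vv^{-1}
```

The distortionless property follows algebraically: H_n W_n^T = (W_n R_vv^{-1} W_n^T)^{-1} (W_n R_vv^{-1} W_n^T) = I_L.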

Unfortunately, the prediction matrices W_{n,m} for m≠n are not known and have to be estimated. They can be found by solving the following minimization problem:
$$ \mathop{\min}\limits_{\mathbf{W}_{n,m}} E\left\{ \left[\mathbf{x}_{m}(k) - \mathbf{W}_{n,m}^{T} \mathbf{x}_{n}(k)\right]^{T} \left[\mathbf{x}_{m}(k) - \mathbf{W}_{n,m}^{T} \mathbf{x}_{n}(k)\right] \right\}, $$
(14)
whose solution is given by:
$$ \mathbf{W}_{n,m}^{T} = \mathbf{R}_{\mathbf{x}_{m} \mathbf{x}_{n}} \mathbf{R}_{\mathbf{x}_{n}\mathbf{x}_{n}}^{-1} $$
(15)
where \(\mathbf {R}_{\mathbf {a}_{i} \mathbf {a}_{j}}\) stands for the (i,j)th L×L submatrix of the matrix R_{aa}. The correlation matrices of the clean speech are unknown, and the vectors x_n(k) cannot be observed directly, but by using Equation 8 we can write:
$$ \mathbf{R}_{\mathbf{x}_{n} \mathbf{x}_{m}} = \mathbf{R}_{\mathbf{y}_{n} \mathbf{y}_{m}} - \mathbf{R}_{\mathbf{v}_{n} \mathbf{v}_{m}}, \quad m = 1, 2, \ldots, N. $$
(16)
Thus, finally, we obtain the following expression for the prediction matrices:
$$ \mathbf{W}_{n,m}^{T} = \left(\mathbf{R}_{\mathbf{y}_{m} \mathbf{y}_{n}} - \mathbf{R}_{\mathbf{v}_{m} \mathbf{v}_{n}}\right) \left(\mathbf{R}_{\mathbf{y}_{n} \mathbf{y}_{n}} - \mathbf{R}_{\mathbf{v}_{n} \mathbf{v}_{n}}\right)^{-1}. $$
(17)

In order to obtain a full-rank matrix W_{n,m}, the matrices \(\mathbf {R}_{\mathbf {x}_{m} \mathbf {x}_{n}}\) and \(\mathbf {R}_{\mathbf {x}_{n}\mathbf {x}_{n}}\) have to be positive definite. In [5], the authors suggest estimating the filter matrix (Equation 13) only when the speech source is active, using a voice activity detector (VAD), but this generally does not prevent the matrix W_{n,m} from being rank deficient. Moreover, such a technique can introduce discontinuity effects at transients and/or increased residual noise during silence intervals. For low-power speech signals, the covariance matrix of the clean speech is usually positive semi-definite, or at least ill-conditioned, which means that in practice the STP method is numerically stable only at high signal-to-noise ratios (SNRs). The simplest solution is to add some white noise to the speech signal, or to replace the inverses in Equation 13 and Equation 17 with properly regularized pseudoinverses [16]. However, all these approaches are rather empirical and need careful adjustment. Thus, we need a more robust solution that can be applied also to low-power speech signals, especially at low SNRs.
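In code, the prediction matrices follow from the block structure of R_xx = R_yy − R_vv. The sketch below (our naming, with 1-based n as in the text) builds W_n block by block using the transpose of Equation 15, W_{n,m} = R_{x_n x_n}^{-1} R_{x_n x_m}:

```python
import numpy as np

def prediction_matrix(R_yy, R_vv, n, N, L):
    """W_n = [W_{n,1} ... W_{n,N}] estimated via Equations 16 and 17."""
    R_xx = R_yy - R_vv                                       # Equation 16
    blk = lambda i, j: R_xx[(i - 1) * L:i * L, (j - 1) * L:j * L]
    Rnn_inv = np.linalg.inv(blk(n, n))                       # R_{x_n x_n}^{-1}
    # W_{n,m} = R_{x_n x_n}^{-1} R_{x_n x_m}  (transpose of Equation 15)
    return np.hstack([Rnn_inv @ blk(n, m) for m in range(1, N + 1)])
```

By construction, W_{n,n} = I_L, as required below Equation 9.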

4 Signal subspace approach

In the conventional STP method, data are processed in the vector space of the noisy speech. The key idea of the signal subspace approach is to decompose that vector space into the signal-plus-noise and noise-only subspaces and to process data only in the signal-plus-noise subspace, while the projection of the noisy signal onto the noise-only subspace is simply nullified. The dimensionality of the signal-plus-noise or, simply, signal subspace is closely related to the rank of the speech correlation matrix. Thus, by introducing the signal subspace approach to the STP method, we are able to not only increase the attenuation of the residual noise during silence intervals but also to avoid the ill-conditioning issues.

Let us rewrite Equation 13 more compactly. Please notice that the prediction matrix can be alternatively written as:
$$ \mathbf{W}_{n} = \mathbf{R}_{\mathbf{x}_{n} \mathbf{x}_{n}}^{-1} \mathbf{U}_{n} \mathbf{R_{\mathbf{xx}}}, $$
(18)
and then, by substituting the above into Equation 13, we obtain:
$$ \mathbf{H}_{n} = \mathbf{R}_{\mathbf{x}_{n} \mathbf{x}_{n}} \left(\mathbf{U}_{n} \mathbf{R_{xx}}\mathbf{R}_{vv}^{-1} \mathbf{R_{xx}} \mathbf{U}_{n}^{T}\right)^{-1} \mathbf{U}_{n} \mathbf{R_{xx}}\mathbf{R}_{vv}^{-1}. $$
(19)
Since R vv is positive definite, the matrices R xx and R vv can be jointly diagonalized [17,18], i.e.:
$$ \mathbf{R}_{\mathbf{vv}}^{-1/2} \mathbf{R_{xx}} \mathbf{R_{vv}}^{-1/2} = \mathbf{V} \Lambda \mathbf{V}^{T}, $$
(20)
where V denotes the orthogonal matrix of the eigenvectors, and Λ = diag{λ_1, …, λ_{NL}} is the diagonal matrix of the corresponding eigenvalues. We also assume that the eigenvalues in Λ are arranged in descending order, i.e. λ_i ≥ λ_j for any i < j. The matrix V can also be interpreted as the KLT matrix of the whitened clean speech. Alternatively, it can be obtained using the eigendecomposition of the whitened noisy speech correlation matrix:
$$ \mathbf{R}_{\mathbf{vv}}^{-1/2} \mathbf{R_{yy}} \mathbf{R_{vv}}^{-1/2} = \mathbf{V} \left(\Lambda + \mathbf{I}\right) \mathbf{V}^{T}. $$
(21)
As shown in [17], the vector space of the noisy speech can be decomposed using the square matrix:
$$ \mathbf{B} = \mathbf{V}^{T} \mathbf{R}_{\mathbf{vv}}^{1/2} $$
(22)
which has full rank but is not necessarily orthogonal. Please notice that applying B^{-T} to the noisy signal is equivalent to whitening the data before performing the subspace decomposition, so that the resulting coefficients are perfectly decorrelated in the transform domain, i.e.:
$$ E \left[{\tilde{\mathbf{y}}}(k){\tilde{\mathbf{y}}}^{T}(k)\right] = \Lambda + \mathbf{I}, $$
(23)
where \({\tilde {\mathbf {y}}}(k) = \mathbf {B}^{-T}\mathbf {y}(k)\). Thus, our correlation matrices can be expressed as follows:
$$ \begin{aligned} \mathbf{R}_{\mathbf{yy}} &= \mathbf{B}^{T} (\Lambda + \mathbf{I}) \mathbf{B}\\ \mathbf{R}_{\mathbf{xx}} &= \mathbf{B}^{T} \Lambda \mathbf{B}\\ \mathbf{R}_{\mathbf{vv}} &= \mathbf{B}^{T} \mathbf{B}\\ \end{aligned} $$
(24)
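The decomposition in Equations 20 to 24 can be sketched as follows (our naming; the whitened matrix is formed explicitly, which is adequate for illustration though not the most efficient route):

```python
import numpy as np

def joint_diag(R_xx, R_vv):
    """Jointly diagonalize R_xx and R_vv (Equations 20, 22, and 24):
    returns eigenvalues lam (descending) and B = V^T R_vv^{1/2}."""
    w, E = np.linalg.eigh(R_vv)
    R_half = E @ np.diag(np.sqrt(w)) @ E.T          # R_vv^{1/2}
    R_mhalf = E @ np.diag(1.0 / np.sqrt(w)) @ E.T   # R_vv^{-1/2}
    lam, V = np.linalg.eigh(R_mhalf @ R_xx @ R_mhalf)
    order = np.argsort(lam)[::-1]                   # descending, as assumed in the text
    lam, V = lam[order], V[:, order]
    return lam, V.T @ R_half                        # B, Equation 22
```

One can check the factorizations of Equation 24 numerically: B^T B ≈ R_vv and B^T diag(lam) B ≈ R_xx.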
Let \(\mathbf {Q}_{n} = \Lambda \mathbf {B} \mathbf {U}_{n}^{T}\). Substituting the relations given in Equation 24 into Equation 19 results in the optimal filter matrix:
$$ \mathbf{H}_{n} = \mathbf{U}_{n} \mathbf{B}^{T} \left[\mathbf{Q}_{n} \left(\mathbf{Q}_{n}^{T} \mathbf{Q}_{n}\right)^{-1} \mathbf{Q}_{n}^{T} \right] \mathbf{B}^{-T}. $$
(25)
Since R_vv is positive definite while R_xx can be positive semi-definite, the dimension of the signal-plus-noise subspace is equal to the number of non-zero eigenvalues of the correlation matrix of the whitened clean speech. Assume that NL = L_s + L_v, where L_s and L_v denote the dimensions of the signal-plus-noise and noise-only subspaces, respectively. Thus, for L_s < NL, we can rewrite Equation 25 as follows:
$$ \mathbf{H}_{n} = \mathbf{U}_{n} \mathbf{B}^{T} {\left[ \begin{array}{ccc} \Sigma_{n} && \mathbf{0}_{L_{s} \times L_{v}}\\ \mathbf{0}_{L_{v} \times L_{s}} && \mathbf{0}_{L_{v} \times L_{v}}\\ \end{array} \right]} \mathbf{B}^{-T}, $$
(26)
where:
$$ \Sigma_{n} = \mathbf{Q}_{n,1:L_{s}} {\left[ \mathbf{Q}_{n,1:L_{s}}^{T} \mathbf{Q}_{n,1:L_{s}} \right]}^{-1} \mathbf{Q}_{n,1:L_{s}}^{T} $$
(27)

can be viewed as a reweighting matrix, with \(\mathbf {Q}_{n,1:L_{s}}\) denoting the sub-matrix of Q_n consisting of rows 1 to L_s. As can be seen, the noisy signal is transformed using the non-orthogonal matrix B^{-T}. The denoising is achieved by ‘reweighting’ the coefficients in the signal-plus-noise subspace using the matrix Σ_n and simply nullifying the noise-only subspace. In contrast to the conventional signal subspace approach, the reweighting matrix is not diagonal here but symmetric and idempotent.

Finally, the filtered signal is brought back to the time domain using the inverse transform B^T.

In practice, L_s can be estimated as the number of strictly positive eigenvalues, according to the following rule:
$$ L_{s} \approx \mathop{\arg \max}\limits_{1 \le l \le NL } \: \left\{ \lambda_{l} > \theta \right\}, $$
(28)

where the threshold θ is some small positive number.

It can be noticed that \(\mathbf {Q}_{n}^{T} \mathbf {Q}_{n}\) is invertible as long as L_s ≥ L. However, even when this condition does not hold (which is fairly common at transients or during silence intervals), the inverse can easily be regularized. For example, if L_s = L, then Q_{n,1:L} is a square matrix and Σ_n = I, which means that the filter nullifies the noise subspace without modifying the signal-plus-noise subspace, i.e. the residual noise is reduced without distorting the speech.

Therefore, in order to regularize the solution, the best we can do is to use the following rule:
$$ \Sigma_{n} = \left\{ \begin{array}{lll} \mathbf{Q}_{n,1:L_{s}} {\left[ \mathbf{Q}_{n,1:L_{s}}^{T} \mathbf{Q}_{n,1:L_{s}} \right]}^{-1} \mathbf{Q}_{n,1:L_{s}}^{T}, && L_{s} > L\\ \mathbf{I}_{L_{s}}, && \text{otherwise.}\\ \end{array} \right. $$
(29)
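Putting Equations 26 to 29 together, a minimal sketch of the subspace filter might look as follows (our naming; B and the eigenvalues lam are assumed to come from the joint diagonalization of Equation 20, and n is 1-based):

```python
import numpy as np

def subspace_filter(B, lam, n, N, L, theta):
    """H_n of Equation 26 with the reweighting rule of Equation 29."""
    NL = N * L
    L_s = int(np.count_nonzero(lam > theta))      # Equation 28
    U_n = np.zeros((L, NL))
    U_n[:, (n - 1) * L:n * L] = np.eye(L)
    Q = np.diag(lam) @ B @ U_n.T                  # Q_n, size NL x L
    if L_s > L:
        Q1 = Q[:L_s, :]                           # rows 1..L_s of Q_n
        Sigma = Q1 @ np.linalg.solve(Q1.T @ Q1, Q1.T)  # Equation 27
    else:
        Sigma = np.eye(L_s)                       # regularized branch of Equation 29
    G = np.zeros((NL, NL))
    G[:L_s, :L_s] = Sigma                         # noise-only subspace nullified
    return U_n @ B.T @ G @ np.linalg.inv(B.T)     # Equation 26, B^{-T} = (B^T)^{-1}
```

In the full-rank case, the result satisfies the distortionless constraint H_n W_n^T = I_L, and when every eigenvalue falls below theta the filter is identically zero, so silence intervals are cancelled completely.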
Please also notice that if N = 1 and L_s = L, then the filter matrix is simply the identity matrix. For N > 1, it is possible to arrange the matrices H_n, n = 1, 2, …, N, into a single filter matrix:
$$ \mathbf{H}_{\mathrm{P}} = \left[\mathbf{H}_{1}^{T} \; \mathbf{H}_{2}^{T} \; \ldots\; \mathbf{H}_{N}^{T}\right]^{T}, $$
(30)
which can be used to estimate all noise-free microphone signals at once. Namely, the vector x(k) can be estimated as follows:
$$ \mathbf{x}(k) \approx \hat{\mathbf{x}}(k) = \mathbf{H}_{\mathrm{P}} \mathbf{y}(k). $$
(31)
The filter matrix H P can also be written in a more convenient form:
$$ \mathbf{H}_{\mathrm{P}} = \left[\mathbf{U} \circ \left(\mathbf{B}^{T} \Lambda \mathbf{B}\right)\right] \left[\mathbf{U} \circ \left(\mathbf{B}^{T} \Lambda^{2} \mathbf{B}\right)\right]^{-1} \mathbf{B}^{T} \Lambda \mathbf{B}^{-T}, $$
(32)
where:
$$ \mathbf{U} = \mathbf{I}_{N} \otimes \mathbf{J}_{L \times L}, $$
(33)

where the operators ∘ and ⊗ stand for the Hadamard and the Kronecker products, respectively, and J_{L×L} is the L×L matrix of ones.
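The compact form of H_P maps directly onto NumPy (our naming). Note that, by Equation 24, the rightmost factor equals R_xx R_vv^{-1} = B^T Λ B^{-T}, which is how it is implemented here:

```python
import numpy as np

def stacked_filter(B, lam, N, L):
    """H_P built from Equations 32 and 33 (full-rank case)."""
    Lam = np.diag(lam)
    U = np.kron(np.eye(N), np.ones((L, L)))  # Equation 33
    A = U * (B.T @ Lam @ B)                  # Hadamard product keeps diagonal blocks
    C = U * (B.T @ Lam @ Lam @ B)
    return A @ np.linalg.inv(C) @ B.T @ Lam @ np.linalg.inv(B.T)
```

Its nth L-row block coincides with the channel filter H_n of Equation 19, which can be checked numerically.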

The proposed approach can be verified analytically in terms of noise reduction and speech distortion. The noise reduction factor can be defined for any filter matrix H n as follows:
$$ {\xi_{\text{nr}}}\left(\mathbf{H}_{n}\right) = \frac{\text{tr}\left\{ E \left[ \mathbf{U}_{n} \mathbf{v} \mathbf{v}^{T} \mathbf{U}_{n}^{T} \right]\right\}}{\text{tr}\left\{ E \left[ \mathbf{H}_{n} \mathbf{v} \mathbf{v}^{T} \mathbf{H}_{n}^{T} \right]\right\}} = \frac{\text{tr}\left\{ \mathbf{U}_{n} \mathbf{R_{vv}} \mathbf{U}_{n}^{T} \right\}}{\text{tr}\left\{ \mathbf{H}_{n} \mathbf{R_{vv}} \mathbf{H}_{n}^{T} \right\}}. $$
(34)
It is expected that ξ_nr(H_n) ≥ 1: the larger this factor, the lower the residual noise. Usually, the noise is reduced at the cost of attenuating speech. Therefore, in order to quantify this attenuation, we define the speech reduction factor:
$$ {\xi_{\text{sr}}}\left(\mathbf{H}_{n}\right) = \frac{\text{tr}\left\{ E \left[ \mathbf{U}_{n} \mathbf{x} \mathbf{x}^{T} \mathbf{U}_{n}^{T} \right]\right\}}{\text{tr}\left\{ E \left[ \mathbf{H}_{n} \mathbf{x} \mathbf{x}^{T} \mathbf{H}_{n}^{T} \right]\right\}} = \frac{\text{tr}\left\{ \mathbf{U}_{n} \mathbf{R_{xx}} \mathbf{U}_{n}^{T} \right\}}{\text{tr}\left\{ \mathbf{H}_{n} \mathbf{R_{xx}} \mathbf{H}_{n}^{T} \right\}} $$
(35)
and expect ξ_sr(H_n) ≥ 1. The output SNR of the filter H_n can be expressed in the following way:
$$ \text{SNR}(\mathbf{H}_{n}) = \frac{\text{tr}\left\{ \mathbf{H}_{n} \mathbf{R_{xx}} \mathbf{H}_{n}^{T} \right\}}{\text{tr}\left\{ \mathbf{H}_{n} \mathbf{R_{vv}} \mathbf{H}_{n}^{T} \right\}} = \text{SNR} \, \frac{{\xi_{\text{nr}}}(\mathbf{H}_{n})}{{\xi_{\text{sr}}}(\mathbf{H}_{n})}, $$
(36)

where SNR stands for the input SNR.
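These factors are straightforward to compute for any candidate filter (helper names are ours):

```python
import numpy as np

def noise_reduction(H_n, R_vv, U_n):
    """xi_nr(H_n), Equation 34: input vs output noise power at microphone n."""
    return np.trace(U_n @ R_vv @ U_n.T) / np.trace(H_n @ R_vv @ H_n.T)

def speech_reduction(H_n, R_xx, U_n):
    """xi_sr(H_n), Equation 35: input vs output speech power at microphone n."""
    return np.trace(U_n @ R_xx @ U_n.T) / np.trace(H_n @ R_xx @ H_n.T)
```

For the trivial filter H_n = U_n, both factors equal 1, and Equation 36 then gives SNR(H_n) = SNR, as expected.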

For L_s ≥ L, the proposed approach is theoretically equivalent to the time-domain implementation of the STP method. In order to analyse the performance of the proposed implementation for L_s < L, we consider the case of white noise, for which \(\mathbf {R}_{\mathbf {v}_{n}\mathbf {v}_{n}} = \sigma _{\mathbf {v}_{n}}^{2} \mathbf {I}\). Because the inverse \(\left (\mathbf {Q}_{n}^{T} \mathbf {Q}_{n}\right)^{-1}\) does not exist for L_s < L, we use Equation 29. Then, by replacing Σ_n in Equation 26 with the identity matrix and substituting it into Equation 34 and Equation 35, we obtain:
$$ {\xi_{\text{nr}}}(\mathbf{H}_{n}) = \frac{L}{\sum\limits_{i=1}^{L} \sum\limits_{j=1}^{L_{s}} \mathbf{V}_{(n-1)L+i,j}^{2}} > 1 $$
(37)
$$ {\xi_{\text{sr}}}(\mathbf{H}_{n}) = 1. $$
(38)

Since ξ_nr(H_n) > ξ_sr(H_n), we always have SNR(H_n) > SNR, i.e. an improvement of the SNR.

5 Simulations

Although a full evaluation of the proposed approach, including listening tests, is outside the scope of this article, we have conducted some experiments using objective measurements. In this section, we compare the performance of the conventional time-domain implementation of the STP method with that of the proposed signal subspace approach.

5.1 Implementation

Both methods have been implemented in MATLAB. Instead of recalculating the filter from sample to sample, we collect the microphone recordings in overlapped buffers and process them frame-by-frame in a similar way as in [8] or [19]. Namely, we divide the microphone signals into frames of length N_f with 50% overlap. Each frame is partitioned into M = N_f − L + 1 shorter overlapping L-dimensional vectors. The sequence of these vectors is arranged into the trajectory matrix of size L-by-M. The trajectory matrices for all microphones are concatenated together so as to form the noisy speech matrix Y(k) of size NL-by-M, so that:
$$ \mathbf{Y}(k) = \left[ {\begin{array}{*{20}c} \mathbf{y}(k) & \mathbf{y}(k-1) & \cdots & \mathbf{y}(k-M+1) \\ \end{array}} \right]. $$
(39)

Once all required parameters are estimated, the effective filter matrix H_n is computed, and then all in-frame vectors are processed using the same matrix, i.e. \(\hat {\mathbf {Y}}(k) = \mathbf {H}_{n} \mathbf {Y}(k)\). The enhanced vectors are recovered from the matrix \(\hat {\mathbf {Y}}(k)\) using the diagonal averaging technique [19]. Finally, the frames are multiplied by the Hanning window and synthesized using the overlap-add method.
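The frame processing described above can be sketched as follows (helper names are ours; we index the delayed vectors forward in time, an equivalent ordering convention to the one in Equation 39):

```python
import numpy as np

def trajectory_matrix(x, L):
    """L x M trajectory matrix of one channel's frame, M = len(x) - L + 1:
    column j holds the L consecutive samples x[j:j+L]."""
    M = len(x) - L + 1
    return np.stack([x[j:j + L] for j in range(M)], axis=1)

def diagonal_average(Y):
    """Recover a length-N_f sequence from a processed L x M matrix by
    averaging all entries that share the same time index [19]."""
    L, M = Y.shape
    out = np.zeros(L + M - 1)
    cnt = np.zeros(L + M - 1)
    for i in range(L):
        out[i:i + M] += Y[i]  # row i contributes to time indices i..i+M-1
        cnt[i:i + M] += 1
    return out / cnt
```

Applying diagonal_average directly to an unprocessed trajectory matrix returns the original frame, which is a convenient sanity check.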

The correlation matrix of the noisy speech can be estimated according to:
$$ \mathbf{R}_{\mathbf{yy}}(k) \approx \frac{1}{MN} \mathbf{Y}(k) \mathbf{Y}(k)^{T}, $$
(40)
i.e. a scaled outer product of the matrix Y(k) with itself. This estimate is the basis for computing both the noise statistics and the KLT of the whitened signal (Equation 21). The matrix R_vv is estimated only during speech pauses as:
$$ \mathbf{R}_{\mathbf{vv}}(k) \approx \left\{ \begin{array}{lll} \alpha \mathbf{R_{vv}}(k-1) + (1-\alpha) \mathbf{R}_{\mathbf{yy}}(k), && \text{if} \quad I(k)=1\\ \mathbf{R_{vv}}(k-1), && \text{otherwise}\\ \end{array} \right. $$
(41)

where 0 < α < 1 is the forgetting factor, and I(k) is the VAD output for the kth frame, with I(k) = 1 indicating a speech pause. In our simulations, the VAD was not implemented, and the speech pause/activity regions were marked manually.
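The recursion of Equation 41 is a one-liner per frame. In the sketch below (our naming), is_pause plays the role of the VAD indicator I(k) = 1, i.e. it is True when the frame is assumed to contain noise only:

```python
import numpy as np

def update_noise_correlation(R_vv_prev, R_yy_frame, is_pause, alpha=0.75):
    """Equation 41: exponential averaging of R_vv during speech pauses,
    frozen while speech is active."""
    if is_pause:
        return alpha * R_vv_prev + (1.0 - alpha) * R_yy_frame
    return R_vv_prev
```

With alpha = 0.75 (the value used in the experiments below), each pause frame moves the estimate a quarter of the way toward the current frame's correlation estimate.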

In most cases, the noise correlation matrix is positive definite, so that the computations of both the whitening and unwhitening transformations (\(\mathbf {R}_{\mathbf {vv}}^{-1/2}\) and \(\mathbf {R}_{\mathbf {vv}}^{1/2}\), respectively) are numerically stable. Both transformations can be calculated at once using the eigenstructure of the matrix \(\mathbf {R_{vv}} = \mathbf {V_{v}} \Lambda _{\mathbf {v}} \mathbf {V_{v}}^{T}\) in the following way:
$$ \begin{array}{lll} \mathbf{R}_{\mathbf{vv}}^{-1/2} &=& \mathbf{V_{v}} \Lambda_{\mathbf{v}}^{-1/2} \mathbf{V_{v}}^{T},\\ \mathbf{R}_{\mathbf{vv}}^{1/2} &=& \mathbf{V_{v}} \Lambda_{\mathbf{v}}^{1/2} \mathbf{V_{v}}^{T},\\ \end{array} $$
(42)

where V v denotes the orthogonal matrix of the eigenvectors, and Λ v is the diagonal matrix of the corresponding eigenvalues.

In our experiments, we take α = 0.75, N_f = 400, and L = 20. A proper choice of the value of the parameter θ seems to be crucial for the proposed implementation. In general, greater values of θ lead to stronger cancellation of the residual noise, but special care must be taken because low-power speech components can also be nullified. Therefore, the simplest solution is to fix this threshold so that it is large enough to give L_s = 0 (or equivalently θ ≥ λ_1) during speech pauses. We found empirically that its value depends mainly on the bias of the estimator of the noise correlation matrix, i.e. on the forgetting factor α and the frame/window size N_f. In Figure 2c, we present the variability of the estimated dimension of the signal-plus-noise subspace for θ = 3. Further experiments show that the optimal value of θ (in terms of speech distortion) does not depend on the input SNR. It can be observed that L_s < L occurs fairly commonly, not only at transients but also during speech activity.
Figure 2

Estimation of the dimension of the signal-plus-noise subspace. (a) Example noisy speech signal at SNR = 10 dB. (b) The parameter θ and major eigenvalue of the whitened clean speech. (c) Estimate of the dimension of the signal-plus-noise subspace.

In the case of the conventional implementation, all inverses in Equation 17 and Equation 13 were replaced with pseudoinverses. They were computed using singular value decomposition (SVD), and all singular values less than some tolerance were treated as zeros. In fact, that tolerance plays the same role as the parameter θ in the signal subspace approach. Thus, by setting it sufficiently large, it is possible to increase noise reduction. Unfortunately, the speech reduction factor is also increased. Additionally, we have found empirically that the optimal tolerance is SNR dependent. Therefore, during our simulations, all SVD-based pseudoinverses were computed using the default tolerance set by MATLAB.

5.2 Objective evaluation

The acoustic environment was simulated using the image method [20]. We assumed that the enclosure is rectangular with dimensions 6×5×2.8 (all dimensions and coordinates are in meters). A uniform linear array of eight microphones was placed along the x-axis, with 0.1 spacing, beginning from the first microphone at position (2.65, 4, 1). The locations of the microphones and the sound sources are shown in Figure 3. The source speech signal was sampled at 16 kHz; it was about 14 s long and comprised four short sentences uttered by male and female speakers (see Additional file 1). In order to represent general broadband signals, pink noise was chosen. The microphone signals were obtained by convolving the source speech signal with the generated room impulse responses and by adding noise signals at SNRs ranging from −5 to 20 dB, in accordance with Equation 1. An example noisy speech sample is provided as Additional file 2. In all experiments, we estimated the noise-free signal only at the first microphone, n = 1, which served as the reference microphone.
Figure 3

Floor plan of the simulated enclosure (all coordinates in meters).

The SNR-based measures were used for evaluating the objective performance. The speech distortion measure (SD) was defined as the segmental signal-to-noise ratio, in which the noise was identified with the difference between the source signal and enhanced speech. The higher the value of this factor, the better the performance. The amount of reduced noise was measured using the noise attenuation (NA) factor defined as the mean ratio between the input noise power and output noise power.

Firstly, considering only the first four microphones, we evaluated the impact of the parameter θ on speech distortion and noise attenuation. The measured speech distortion, shown in Figure 4a, indicates that the optimal value of θ depends only weakly on the input SNR: it lies between 3 and 4 for all SNRs. On the other hand, the plot of the noise attenuation factor in Figure 4b demonstrates that the higher the value of θ, the higher the noise attenuation.
Figure 4

Adjustment of the parameter θ for N =4. (a) Speech distortion measure (SD) and (b) noise attenuation factor (NA).

The subsequent simulations were performed for θ = 3 and N = 2, 3, …, 8. For conciseness, we present in Figure 5 only the results of the objective measurements for the systems with N = 2, 4, and 8 microphones. Example recordings of speech enhanced using the conventional and the proposed method are provided as Additional files 3 and 4, respectively.
Figure 5

Objective measurement of the time-domain (TD) and signal subspace (SS) implementations. (a) Speech distortion and (b) noise attenuation factor.

It can easily be seen that the proposed method outperforms the conventional one, as it provides lower speech distortion and higher noise attenuation. Surprisingly, the speech distortion for the system with N = 2 microphones was lower than for the eight-microphone system, especially at high SNRs. A possible explanation of this phenomenon is that for more microphones the correlation matrix is larger, which makes its estimation less accurate. In practice, it makes sense to use more microphones only in the conventional time-domain method (in order to improve the noise attenuation). Figure 5a shows that the speech distortion can also be decreased in this way, but only at low SNRs.

Unlike the conventional method, the signal subspace approach does not require many microphones to work reasonably well. The proposed method removes the residual noise almost completely (NA = 70 to 90 dB) without introducing speech distortions or unnatural discontinuity effects at transients. This is not surprising, since the matrix Σ_n may contain only zeros during silence intervals, which is highly desirable in speech coding or automatic speech recognition (ASR) systems. On the other hand, complete cancellation of the noise is neither necessary nor desirable in some applications, such as mobile communication. In such cases, zero diagonal coefficients in Σ_n can be replaced with some small positive numbers.

The objective evaluation has been validated using spectrograms. Figure 6a shows the spectrogram of the noisy speech signal recorded at the first microphone at SNR = 10 dB. The enhancement results for the conventional and the proposed methods with N=4 are presented in Figure 6b,c. Once again, we see that the proposed method offers considerably higher noise attenuation during both speech pauses and voice activity periods. Unlike the time-domain implementation, the signal subspace approach does not generate musical tones (random peaks in the time-frequency plane). However, one should remember that this is an idealized scenario, because no VAD was implemented; the speech/pause frames were marked manually. In practice, a VAD is difficult to implement, and its performance generally depends on the input SNR. Therefore, some performance drop can be expected in real applications.
Figure 6

Speech spectrograms. (a) Noisy speech at microphone number 1 (input SNR = 10 dB). (b) Speech enhanced with time-domain STP method. (c) Speech enhanced with the signal subspace implementation of the STP method.
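For reference, a minimal energy-based VAD can be sketched as below. This is a hypothetical illustration (the experiments above marked speech/pause frames manually); the frame length and threshold are assumed values, and such a simple detector would degrade quickly at low input SNRs:

```python
import numpy as np

def energy_vad(x, frame_len=256, threshold_db=-30.0):
    """Simple energy-based voice activity detector (illustrative only).
    Frames whose energy falls below `threshold_db` relative to the
    loudest frame are labelled as pauses (False)."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames**2, axis=1)
    # Normalize to the loudest frame; small constant avoids log10(0)
    energy_db = 10.0 * np.log10(energy / np.max(energy) + 1e-12)
    return energy_db > threshold_db  # True = speech frame

# Toy signal: near-silence followed by a loud burst
x = np.concatenate([1e-4 * np.ones(512), np.ones(512)])
print(energy_vad(x))  # silence frames → False, burst frames → True
```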

6 Conclusions

We have shown that the STP method can be implemented using a signal subspace approach. The conditions for the uniqueness of a solution have been provided. We proposed Equation 29 as a simple rule that can be applied when the speech correlation matrix is rank deficient. It has been verified analytically that the proposed approach can reduce noise without distorting the speech (as long as the parameter L s is not less than the true rank of R yy ). In order to estimate the dimension of the speech-plus-noise subspace, we also used a thresholding technique. However, we found empirically that, unlike in the conventional SVD-based regularization, the corresponding threshold (i.e., the parameter θ) is not SNR dependent and can be set to a fixed value. The objective measurements show that the signal subspace approach outperforms the conventional one, providing higher noise attenuation and lower speech distortion. We have also reported that the proposed implementation does not require as many microphones as its time-domain counterpart to work reasonably well.

Listening tests are difficult and time-consuming to conduct; therefore, they were not used to evaluate our approach.

In this article, we have introduced a novel notation that allows for estimating the speech signals at all microphones at once. This can potentially be useful if the system has to work as a preprocessor for a beamformer. Since the STP method relies only on second-order statistics, it may find other applications in areas where multi-sensor data are processed, e.g., in electroencephalography, as a means of enhancing EEG signals. These points have not been discussed here, but they are promising directions for future work.

Declarations

Acknowledgements

This work was supported by the Polish National Science Centre under Decision No. DEC-2012/07/D/ST6/02454.

Authors’ Affiliations

(1)
Department of Computer Graphics and Digital Media, Faculty of Computer Science, Bialystok University of Technology

References

  1. OL Frost, An algorithm for linearly constrained adaptive array processing. Proc. IEEE 60, 926–935 (1972).
  2. LJ Griffiths, CW Jim, An alternative approach to linearly constrained adaptive beamforming. IEEE Trans. Antennas Propag. AP-30(1), 27–34 (1982).
  3. S Gannot, D Burshtein, E Weinstein, Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. Signal Process. 49(8), 1614–1626 (2001).
  4. S Affes, Y Grenier, A signal subspace tracking algorithm for microphone array processing of speech. IEEE Trans. Speech Audio Process. 5(5), 425–437 (1997).
  5. Y Huang, J Benesty, J Chen, Analysis and comparison of multichannel noise reduction methods in a common framework. IEEE Trans. Audio, Speech, Lang. Process. 16(5), 957–968 (2008).
  6. Y Ephraim, HL Van Trees, A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process. 3(4), 251–266 (1995).
  7. D Virette, P Scalart, C Lamblin, Analysis of background noise reduction techniques for robust speech coding. Proc. EUSIPCO. 3, 297–300 (2002).
  8. A Borowicz, A Petrovsky, Signal subspace approach for psychoacoustically motivated speech enhancement. Speech Comm. 53(2), 210–219 (2011).
  9. F Jabloun, B Champagne, Incorporating the human hearing properties in the signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process. 11(6), 700–708 (2003).
  10. A Borowicz, A Petrovsky, Incorporating auditory properties into generalised sidelobe canceller. Paper presented at the Proceedings of the 20th European Signal Processing Conference (EUSIPCO). Bucharest, Romania, 27–31 August 2012.
  11. J Chen, J Benesty, Y Huang, A minimum distortion noise reduction algorithm with multiple microphones. IEEE Trans. Audio, Speech, Lang. Process. 16(3), 481–493 (2008).
  12. J Benesty, J Chen, Y Huang, Microphone Array Signal Processing (Springer, Berlin, Germany, 2008).
  13. B Cornelis, M Moonen, J Wouters, Comparison of frequency domain noise reduction strategies based on multichannel Wiener filtering and spatial prediction. Paper presented at the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. Taipei, 19–24 April 2009.
  14. J Benesty, J Chen, EAP Habets, Speech Enhancement in the STFT Domain. SpringerBriefs in Electrical and Computer Engineering (Springer, Berlin, Germany, 2012).
  15. EAP Habets, A distortionless subband beamformer for noise reduction in reverberant environments. Paper presented at the Proc. IWAENC. Tel Aviv, Israel, August 2010.
  16. PC Hansen, The truncated SVD as a method for regularization. BIT 27, 534–553 (1987).
  17. Y Hu, PC Loizou, A generalized subspace approach for enhancing speech corrupted by colored noise. IEEE Trans. Speech Audio Process. 11(4), 334–341 (2003).
  18. H Lev-Ari, Y Ephraim, Extension of the signal subspace enhancement to colored noise. IEEE Signal Process. Lett. 10(4), 104–106 (2003).
  19. R Vetter, N Virag, P Renevey, JM Vesin, Single channel speech enhancement using principal component analysis and MDL subspace selection. Paper presented at the Proc. EUROSPEECH. Budapest, Hungary, 5–9 September 1999.
  20. JB Allen, DA Berkley, Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am. 65(4), 943 (1979).

Copyright

© Borowicz; licensee Springer. 2015

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.