Single-channel acoustic echo cancellation in noise based on gradient-based adaptive filtering

Mahbub, Upal; Fattah, Shaikh Anowarul; Zhu, Wei-Ping; Ahmad, M Omair

doi:10.1186/1687-4722-2014-20

Research
Open access
Published: 03 May 2014

Single-channel acoustic echo cancellation in noise based on gradient-based adaptive filtering

Upal Mahbub¹,
Shaikh Anowarul Fattah¹,
Wei-Ping Zhu² &
…
M Omair Ahmad²

EURASIP Journal on Audio, Speech, and Music Processing volume 2014, Article number: 20 (2014) Cite this article

2626 Accesses
9 Citations
Metrics details

Abstract

In this paper, a two-stage scheme is proposed to deal with the difficult problem of acoustic echo cancellation (AEC) in single-channel scenario in the presence of noise. In order to overcome the major challenge of getting a separate reference signal in adaptive filter-based AEC problem, the delayed version of the echo and noise suppressed signal is proposed to use as reference. A modified objective function is thereby derived for a gradient-based adaptive filter algorithm, and proof of its convergence to the optimum Wiener-Hopf solution is established. The output of the AEC block is fed to an acoustic noise cancellation (ANC) block where a spectral subtraction-based algorithm with an adaptive spectral floor estimation is employed. In order to obtain fast but smooth convergence with maximum possible echo and noise suppression, a set of updating constraints is proposed based on various speech characteristics (e.g., energy and correlation) of reference and current frames considering whether they are voiced, unvoiced, or pause. Extensive experimentation is carried out on several echo and noise corrupted natural utterances taken from the TIMIT database, and it is found that the proposed scheme can significantly reduce the effect of both echo and noise in terms of objective and subjective quality measures.

1 Introduction

The phenomenon of acoustic echo occurs when the output speech signal from a loudspeaker gets reflected from different surfaces, like ceilings, walls, and floors and then fed back to the microphone. In its worst case, acoustic echo can cause howling of a significant portion of sound energy [1, 2]. In real life applications, such as a lecture in a large conference hall or in the public address system of a trade fair, the presence of acoustic echo along with the environmental noise is a very common phenomenon, which degrades the speech quality even leading to complete loss of intelligibility.

In order to deal with the problem of acoustic echo cancellation (AEC), conventionally echo suppressors, earphones, and directional microphones have been used, which generally place restrictions on the talkers’ movement [2]. As an alternate of such hardware-based solutions, adaptive filter algorithms are widely being applied where apart from the input channel, a separate echo-free reference channel is required [3–13]. Among different adaptive filter algorithms, the least mean squares (LMS) algorithm and its different variants are very popular for their satisfactory performances and less computational burden [4, 10, 12–14]. Besides these algorithms, the recursive least squares (RLS) algorithm is well-known for its fast convergence at the expense of computational complexity [13]. The adaptive filter algorithms have also been used for acoustic noise cancellation (ANC) [15].

There are some methods that deal with both acoustic echo and noise cancellation (AENC) [16–18]. The echo canceller used in [16] utilizes a sub-band noise cancellation scheme. In [17], echo cancellation is done by an adaptive LMS filter while a linear prediction error filter removes the residual echo and noise. In [18], a single Wiener filter is employed to simultaneously suppress the echo and noise. It is to be mentioned that all these AENC methods employ more than one microphone, while the solutions using single microphone are favorable in most of the real-life applications.

In this paper, an AENC scheme is proposed which can efficiently deal with the single-channel scenario. First, unlike conventional LMS algorithm, considering the delayed version of the previously echo- and noise-suppressed signal as reference, a gradient-based adaptive LMS algorithm is developed for single channel AEC. Preliminary results obtained by using this idea is reported in [19]. However, in the current paper, analytical proof of convergence towards the optimum Wiener-Hopf solution is presented. Next, a single-channel ANC algorithm based on spectral subtraction with an adaptive spectral floor estimation is developed, which reduces not only the effect of noise but also some residual echo. Finally, analyzing different speech characteristics of the reference and current frames, multiconditional updating constraints are proposed in order to obtain precise control on convergence characteristics. For performance evaluation, extensive experimentation is conducted on several real-life echo and noise corrupted speech signals at different acoustic environments.

2 Problem formulation

In order to formulate the problem of single-channel AENC, for a better understanding, first, a dual channel AENC scheme is presented in Figure 1 (according to [17]). Here, s₁(n) and s₂(n) are speech signals corresponding to near-end and far-end speakers, while v₁(n) and v₂(n) are additive noises, respectively. The noise corrupted far-end signal (s₂(n)+v₂(n)) is played through a loudspeaker at the near-end acoustic room environment and the echo signal x₂(n) is generated. Thus, the input y₁(n) to the near-end microphone is given by

\begin{array}{lcr} y_{1} (n) = s_{1} (n) + v_{1} (n) + x_{2} (n) . \end{array}

(1)

The task of the adaptive filter-based AEC block placed at the near-end is to produce an estimate ${\hat{x}}_{2} (n)$ of the echo x₂(n) by minimizing the error

\begin{array}{lcr} e_{1} (n) = y_{1} (n) - {\hat{x}}_{2} (n) . \end{array}

(2)

Two major issues in dual channel system are (i) availability of a separate reference signal required for the adaptive filter, for example, here the delayed version of (s₂(n)+v₂(n)) and (ii) different speakers for input and echo signals. Moreover, use of the double talk detector (DTD) helps in controlling the update process. Unfortunately, these features are absent in single-channel scenario as shown in Figure 2. Instead of two speakers, in this case, the microphone receives the input s(n) corrupted by noise v(n) and echo generated from the same speaker.

In the presence of noise v(n), the sole microphone input signal in single-channel scenario is given by

y (n) = s (n) + v (n) + x_{s} (n) + x_{v} (n),

(3)

where x_s(n) and x_v(n) denote the echo of the input speech and noise, respectively. The echo signals can be expressed as

\begin{array}{lcr} x_{s} (n) = a_{n}^{T} s (n - k_{0}), \end{array}

(4)

\begin{array}{lcr} x_{v} (n) = a_{n}^{T} v (n - k_{0}), \end{array}

(5)

where s(n−k₀)=[s(n−k₀−1),s(n−k₀−2),…,s(n−k₀−p)]^T and v(n−k₀)=[v(n−k₀−1),v(n−k₀−2),…,v(n−k₀−p)]^T with k₀ being a predefined flat delay and a_n=[a_n(1),a_n(2),…,a_n(p)]^T consists of the coefficients corresponding to the acoustic room transfer function A(z). The order p and coefficient values of A(z) depend on the room characteristics. It is to be noted that in this case, there is no scope of obtaining a separate echo-free reference or a separate noise-only reference, which makes the single-channel AENC problem extremely difficult to handle.

3 Proposed single-channel AENC scheme

3.1 Proposed two-stage setup

In Figure 3a, a simple block diagram showing two stages of the proposed AENC scheme is presented and in Figure 3b, more detail of the adaptive filter-based AEC algorithm involved in the first stage is shown. Similar to Figure 2, the input to the microphone y(n) can be described by (3). For the case of single-channel AEC, for example, while delivering a lecture in a large conference hall, the microphone in front of the speaker receives input speech s(n) corrupted by v(n). Once this noise-corrupted speech is transmitted through loudspeaker, echo signal is generated and thus the microphone after some initial time delay will receive noise-corrupted speech and echo of previously uttered speech. The task of AEC is to cancel the echo part from this input by using adaptive filter algorithm. In order to obtain adaptively an estimate ${\hat{x}}_{s} (n) + {\hat{x}}_{v} (n)$ of the echo signal, we propose to utilize delayed versions of the previously echo-suppressed samples of the noisy speech as reference signal [19]. A symbol hat on the variable is used to indicate estimated value. The error signal e(n) thus obtained is given by

\begin{array}{lcr} e (n) = y (n) - [{\hat{x}}_{s} (n) + {\hat{x}}_{v} (n)] . \end{array}

(6)

The estimate of the echo signal can be expressed as

\begin{array}{lcr} {\hat{x}}_{s} (n) + {\hat{x}}_{v} (n) = {\hat{w}}_{n}^{T} [\hat{s} (n - k_{0}) + \hat{v} (n - k_{0})], \end{array}

(7)

where ${\hat{w}}_{n} = {[{\hat{w}}_{n} (1), {\hat{w}}_{n} (2) \dots {\hat{w}}_{n} (p)]}^{T}$ is the estimated coefficient vector. The task of the adaptive filter is to obtain an optimum ${\hat{w}}_{n}$ by minimizing the error in (6) i.e.,

\begin{array}{lcr} e (n) = s (n) + {(v (n) + δ_{s} (n)) + δ_{v} (n)}, \end{array}

(8)

where $δ_{s} (n) = x_{s} (n) - {\hat{x}}_{s} (n)$ and $δ_{v} (n) = x_{v} (n) - {\hat{x}}_{v} (n)$ are the residual echo of the speech and noise portions of the input signal, respectively, and it is assumed that these signals exhibit the properties of white Gaussian noise. Next, e(n) is passed through a spectral subtraction-based single-channel ANC block which produces output $\tilde{s} (n) \approx s (n) + Ψ (n)$ that closely resembles s(n) provided that the residual echo-noise portion Ψ(n) becomes very small.

It is to be noted that the task of noise reduction, unlike the proposed AENC scheme, may be carried out prior to the AEC block. However, because of possible nonlinearities introduced by the prior noise reduction block, no proper reference would be available for the single-channel AEC block [17]. Hence, the arrangement shown in Figure 3a is adopted, in which the noise reduction block also serves as a post-processor for attenuating the residual echo.

3.2 Development of proposed gradient-based single-channel LMS AEC scheme

A delayed version of the adaptive filter output e(n) is proposed to use as the reference signal, and from (8), filter output e(n) can be written as

e (n) = \hat{s} (n) + \hat{v} (n),

(9)

where $\hat{s} (n) = s (n) + δ_{s} (n)$ and $\hat{v} (n) = v (n) + δ_{v} (n)$ . The objective function of the adaptive filter involves minimization of the mean square estimation of the error function and using (6) it can be written as

\begin{array}{l} E {e^{2} (n)} & = E {{(s (n) + v (n))}^{2}} + E {(x_{s} (n) + x_{v} (n) \\ - {\hat{x}}_{s} (n) - {\hat{x}}_{v} (n))^{2}} + 2 E {(s (n) + v (n)) \\ \times (x_{s} (n) + x_{v} (n) - {\hat{x}}_{s} (n) - {\hat{x}}_{v} (n))}, \end{array}

(10)

where E{.} denotes the expectation operator. In (10), it is intended to use the basic definition of cross-correlation operation, for example, the cross-correlation function between s(n) and v(n) is defined as

r_{sv} (m) = E {s (n) v (n - m)},

(11)

where m denotes the lag. Using (4), (5), (7), and the above definition, the last term of (10) can be expressed as

\begin{array}{l} 2 E {[(s (n) + v (n)) (x_{s} (n) + x_{v} (n) - {\hat{x}}_{s} (n) - {\hat{x}}_{v} (n))]} \\ = 2 \sum_{k = 1}^{k = p} {(a_{n} (k) - {\hat{w}}_{n} (k)) (r_{ss} (k_{0} + k) + r_{sv} (k_{0} + k) \\ + r_{vs} (k_{0} + k) + r_{vv} (k_{0} + k)) - r_{s δ_{s}} (k_{0} + k) \\ - r_{s δ_{v}} (k_{0} + k) - r_{v δ_{s}} (k_{0} + k) - r_{v δ_{v}} (k_{0} + k)} . \end{array}

(12)

Here, r_{s
s}(k₀+k) corresponds to the (k₀+k)th lag of the cross-correlation between s(n) and its previous samples s(n−k₀−k), and r_{s
v}(k₀+k) corresponds to the (k₀+k)th lag of the cross-correlation between s(n) and v(n−k₀−k). In a similar way, r_{v
s}(k₀+k), r_{v
v}(k₀+k), $r_{s δ_{s}} (k_{0} + k)$ , $r_{s δ_{v}} (k_{0} + k)$ , $r_{v δ_{s}} (k_{0} + k)$ , and $r_{v δ_{v}} (k_{0} + k)$ can be defined. It is well known that the value of cross-correlation decreases rapidly with the increasing lags when two signals are uncorrelated. In ideal case, the cross-correlation function between two random noise signals would be nonzero only at the zero lag. Since v(n) is assumed to be white Gaussian noise and, generally, the value of k₀ is very large, in (12), the effect of the terms r_{s
v}(k₀+k), r_{v
s}(k₀+k), and r_{v
v}(k₀+k) can be neglected. Moreover, because of noise-like characteristics of δ_s(n) and δ_v(n), in (12), one can neglect $r_{s δ_{v}} (k_{0} + k)$ , $r_{v δ_{s}} (k_{0} + k)$ , and $r_{v δ_{v}} (k_{0} + k)$ too. Hence, it can easily be comprehended that optimal filter performance occurs when r_{s
s}(n) is minimum, i.e., the least possible correlation between s(n−k₀−k) and s(n) is desired. As a result, (10) reduces to

\begin{array}{l} E {e^{2} (n)} & = E {{(s (n) + v (n))}^{2}} \\ + E {{[x_{s} (n) + x_{v} (n) - {\hat{x}}_{s} (n) - {\hat{x}}_{v} (n)]}^{2}} \\ + 2 \sum_{k = 1}^{k = p} (a_{n} (k) - {\hat{w}}_{n} (k)) r_{ss} (k_{0} + k) . \end{array}

(13)

Here, the magnitude of r_{s
s}(k₀+k) strongly depends on speech characteristics and the amount of flat delay k₀. For a reasonably large k₀, the effect of r_{s
s}(k₀+k) in 13 can be neglected, and minimization of (13) results in

\begin{array}{l} \frac{∂E {e^{2} (n)}}{\partial {\hat{w}}_{n}^{T}} = 0 \\ E [{x_{s} (n) + x_{v} (n) - {\hat{x}}_{s} (n) - {\hat{x}}_{v} (n)} {\hat{s} (n - k_{0}) + \hat{v} (n - k_{0})}] = 0 . \end{array}

(14)

Hence, we obtain

\begin{array}{l} E {(x_{s} (n) + x_{v} (n)) (\hat{s} (n - k_{0}) + \hat{v} (n - k_{0}))} \\ = {\hat{w}}_{n}^{T} E [{\hat{s} (n - k_{0}) + \hat{v} (n - k_{0})} {\hat{s} (n - k_{0}) + \hat{v} (n - k_{0})}] . \end{array}

(15)

The above equation is similar to Wiener-Hopf equation and its solution can be written as

{\hat{w}}_{n}^{T} = {R_{(s + v) (s + v)} (n - k_{0})}^{- 1} r_{(x_{s} + x_{v}) (s + v)} (n - k_{0}),

(16)

where $r_{(x_{s} + x_{v}) (s + v)} (n - k_{0})$ consists of different lags of cross-correlation between the echo signal x_s(n)+x_v(n) and the noisy input signal s(n)+v(n), while R_(s+v)(s+v) is the auto-correlation matrix of s(n)+v(n). There is no doubt that ${\hat{w}}_{n}$ is the most optimum solution possible. Hence, it is shown that even for a single-channel noise corrupted AEC problem, the most optimum solution ${\hat{w}}_{n}$ can be achieved under the assumptions stated earlier.

For iterative estimation of optimal filter coefficients, the adaptive LMS algorithm is very popular. It is fast and efficient, and it does not require any correlation measurements or matrix inversion [13]. The update equation of the LMS adaptive algorithm is generally expressed as

{\hat{w}}_{n + 1}^{T} = {\hat{w}}_{n}^{T} - μ \nabla ξ (n),

(17)

where μ is the step factor controlling the stability and rate of convergence, ξ(n) is the cost function, and ∇ is the gradient operator. The LMS algorithm simply approximates the mean square error by the square of the instantaneous error, i.e., ξ(n)=e²(n), and therefore, from (6) and (7), the gradient of ξ(n) can be expressed as

\begin{array}{lcr} \nabla ξ (n) = \frac{∂ξ (n)}{\partial {\hat{w}}_{n}^{T}} = - 2 e (n) (\hat{s} (n - k_{0}) + \hat{v} (n - k_{0})) . \end{array}

Thus, the update equation for the proposed single-channel LMS adaptive scheme can be written as

\begin{array}{lcr} {\hat{w}}_{n + 1}^{T} = {\hat{w}}_{n}^{T} + 2 μe (n) (\hat{s} (n - k_{0}) + \hat{v} (n - k_{0})) . \end{array}

(18)

3.3 Convergence analysis of the proposed AEC scheme

Considering expectation operation on both sides of the update Eq. 18, one can obtain

\begin{array}{lcr} {\underset{̲}{\hat{w}}}_{n + 1}^{T} = {\underset{̲}{\hat{w}}}_{n}^{T} + 2 μE {e (n) (\hat{s} (n - k_{0}) + \hat{v} (n - k_{0}))} . \end{array}

(19)

Here, an underline beneath ${\hat{w}}_{n}$ is introduced to represent the expected value $E {{\hat{w}}_{n}}$ . For the k th unknown weight vector (where k=1,2,…,p), using (6) and neglecting the effect of r_{s
s}(n) that has already been discussed in the previous subsection, the last term of (19) can be written as

\begin{array}{l} 2 μE {e (n) (\hat{s} (n - k_{0}) + \hat{v} (n - k_{0}))} \\ = 2 μE {[x_{s} (n) + x_{v} (n) - {\hat{x}}_{s} (n) - {\hat{x}}_{v} (n)] \times (\hat{s} (n - k_{0}) \\ + \hat{v} (n - k_{0}))} . \end{array}

(20)

Based on the assumptions on cross-correlation terms stated in the previous subsection, one can obtain

\begin{array}{lcr} E {e (n) (\hat{s} (n - k_{0}) + \hat{v} (n - k_{0}))} & = & r_{(x_{s} + x_{v}) (s + v)} (n - k_{0}) \\ - R_{(s + v) (s + v)} (n - k_{0}) {\hat{w}}_{n}^{T} . \end{array}

(21)

Using (21), the update Eq. 19 can be written as

\begin{array}{lcr} {\underset{̲}{\hat{w}}}_{n + 1}^{T} & = & {\underset{̲}{\hat{w}}}_{n}^{T} - 2 μ R_{(s + v) (s + v)} (n - k_{0}) {\underset{̲}{\hat{w}}}_{n}^{T} \\ + 2 μ r_{(x_{s} + x_{v}) (s + v)} (n - k_{0}) . \end{array}

(22)

Evaluating the homogeneous and particular solutions of (22), the total solution can be obtained as (see Appendix)

\begin{array}{lcr} {\underset{̲}{\hat{w}}}_{n + 1}^{U} (k) = C_{k} {(1 - 2 μλ (k))}^{n} + \frac{1}{λ (k)} r^{U} (n - k_{0} - k), \end{array}

(23)

where λ(k) is the k th diagonal element of the eigenvalue matrix obtained by eigenvalue decomposition of R_(s+v)(s+v)(n−k₀) and r^U(n−k₀−k) is the k th element of $U^{T} r_{(x_{s} + x_{v}) (s + v)} (n - k_{0}) = r_{(x_{s} + x_{v}) (s + v)}^{U} (n - k_{0})$ with the matrix U consisting of eigenvectors corresponding to eigenvalues. Since in the iterative update procedure, the homogeneous part (1−2μ λ(k))ⁿ diminishes with iterations, (23) in a matrix form can be expressed as

\begin{array}{lcr} {\underset{̲}{\hat{w}}}^{T} & = & U Λ^{- 1} U^{T} r_{(x_{s} + x_{v}) (s + v)} (n - k_{0}) \\ = & R_{(s + v) (s + v)}^{- 1} (n - k_{0}) r_{(x_{s} + x_{v}) (s + v)} (n - k_{0}) . \end{array}

(24)

Thus, it is found that the average value of the weight vector converges to the Wiener-Hopf solution, which is the optimum solution with increasing number of iteration.

3.4 Noise reduction in spectral domain

In the proposed AENC scheme, the operation of the ANC block is processed frame by frame for noise reduction based on single-channel spectral subtraction algorithm [20–22]. According to (9), for the i th frame, the error signal for the duration of a frame length can be written as

e_{i} (n) = {\hat{s}}_{i} (n) + {\hat{v}}_{i} (n) .

(25)

Corresponding frequency domain representation is given by

E_{i} (ω) = {\hat{S}}_{i} (ω) + {\hat{V}}_{i} (ω) .

(26)

The magnitude squared spectrum of ${\hat{s}}_{i} (n)$ can be written as

\begin{array}{lcr} ∣ {\hat{S}}_{i} (ω) ∣^{2} = ∣ E_{i} (ω) ∣^{2} - ∣ {\hat{V}}_{i} (ω) ∣^{2} - {\hat{V}}_{i} (ω) {\hat{S}}_{i}^{*} (ω) - {\hat{S}}_{i} (ω) {\hat{V}}_{i}^{*} (ω) . \end{array}

(27)

It is desired to choose an estimate ${\tilde{S}}_{i} (ω)$ that will minimize

Er r_{i} (ω) = ∣ ∣ {\tilde{S}}_{i} (ω) ∣^{2} - ∣ {\hat{S}}_{i} (ω) ∣^{2} ∣ .

(28)

Since the noise is assumed to be zero mean and uncorrelated with the signal, the expected values of the last two terms of (27) can be neglected. Thus, (28) can be expressed as

Er r_{i} (ω) = ∣ {\tilde{S}}_{i} (ω) ∣^{2} - ∣ E_{i} (ω) ∣^{2} + E {∣ {\hat{V}}_{i} (ω) ∣^{2}} .

(29)

This expression of E r r_i(ω) can be minimized by choosing

∣ {\tilde{S}}_{i} (ω) ∣^{2} = ∣ E_{i} (ω) ∣^{2} - E {∣ {\hat{V}}_{i} (ω) ∣^{2}} .

(30)

With an estimate of noise spectrum $E {∣ {\hat{V}}_{i} (ω) ∣^{2}}$ , signal spectrum ${\tilde{S}}_{i} (ω)$ can be computed as

{\tilde{S}}_{i} (ω) = ∣ {\tilde{S}}_{i} (ω) ∣ e^{jarg [E_{i} (ω)]},

(31)

where the phase (arg[E_i(ω)]) is generally assumed to be the phase of the noise corrupted signal without causing significant degradation in terms of loss of intelligibility of the speech signal [20]. It can be seen that an estimate of the magnitude spectrum $∣ {\tilde{S}}_{i} (ω) ∣$ of the signal can be obtained provided an estimate of noise spectrum $E {∣ {\hat{V}}_{i} (ω) ∣^{2}}$ is available, which is generally computed during the periods when speech is known a priori not to be present.

Final output of the AENC system is the speech frame $({\tilde{s}}_{i} (n))$ , which consists of the original speech s_i(n) and a negligible amount of noise-like signal Ψ_i(n). The signal Ψ_i(n), although very weak, may contain some signature of the input noise v(n), the residual echo δ_s(n), and the residual noise δ_v(n). In order to overcome the problem of musical noise and to avoid the speech distortion caused by speech subtraction, in (31), an over estimate of the noise power spectrum can be subtracted carefully such that the spectral floor is preserved [21]. Thus, (31) can be modified as

\begin{array}{lcr} ∣ {\tilde{S}}_{i} (ω) ∣^{2} & = & ∣ {\hat{E}}_{i} (ω) ∣^{2} - α_{ss} E {∣ {\hat{V}}_{i} (ω) ∣^{2}}, \\ if ∣ {\tilde{S}}_{i} (ω) ∣^{2} > β_{ss} {∣ {\hat{V}}_{i} (ω) ∣^{2}} \\ = & β_{ss} {∣ {\hat{V}}_{i} (ω) ∣^{2}}, otherwise. \end{array}

(32)

Here, α_{s
s} is the subtraction factor and β_{s
s} is the spectral floor parameter with α_{s
s}≥1 and 0≤β_{s
s}≤1. The task of noise power spectral density estimation is carried out based on the minimum statistics noise estimator proposed in [23] which can handle the time-varying nature of the noise.

4 Development of adaptive update constraints

The AEC part of the proposed AENC scheme may suffer from some common problems of adaptive filter-based algorithms, such as slow convergence rate and fluctuation around the desired estimates, especially in practical cases where the assumption on negligibility of cross-correlation terms (stated in the previous section) may not strictly hold. In order to overcome such problems, some updating constraints are proposed based on the following speech characteristics:

(i)
The level of cross-correlation
(ii)
The amount of signal power
(iii)
The mean square error (MSE) between consecutive estimates of the unknown filter coefficients.

Through extensive experimentation on different speech frames, it is found that the negligibility of the cross-correlation terms r_{s
s}(n), $r_{s δ_{v}} (n)$ , $r_{v δ_{s}} (n)$ , and $r_{v δ_{v}} (n)$ (as described after (12)) strongly depends on the voicing characteristics of speech frames and the input noise. Because of inherent periodicity of the voiced speech frame, the degree of cross-correlation between two voiced speech frames of a person becomes higher in comparison to that between two unvoiced speech frames which are random in nature. Regarding signal power, the ratio of power of a voiced speech frame and an unvoiced speech frame is found to be higher in comparison to that of the two voiced speech frames. As white Gaussian noise is considered, the degree of cross-correlation between the speech and noise is found to be negligible and the noise powers in two different frames may not differ significantly. As a result, the effect of input noise is found to be negligible on the power ratio.

For a flat delay of k₀ samples, the initial k₀ samples of the utterance s(n)+v(n) can be treated as a reference signal (echo-free signal) responsible for the generation of echo signal that corrupts the current samples at or after k₀ samples. Considering a window of M samples with M≪K₀, power of the reference signal $(\hat{s} (n - k_{0}) + \hat{v} (n - k_{0}))$ can be computed as

\begin{array}{lcr} P_{ref} (n) = \frac{1}{M} \sum_{i = - \frac{M}{2}}^{\frac{M}{2} - 1} {[\hat{s} (n - k_{0} + i) + \hat{v} (n - k_{0} + i)]}^{2} . \end{array}

(33)

For a window of last M samples of the echo-suppressed speech signal $\hat{s} (n)$ , the average power P_sup(n) can be computed as

\begin{array}{lcr} P_{sup} (n) = \frac{1}{M} \sum_{j = 0}^{M - 1} {[\hat{s} (n - j) + \hat{v} (n - j)]}^{2} . \end{array}

(34)

The ratio of P_ref(n) and P_sup(n) is denoted as the power ratio P_rs(n) and considered as one of the control characteristics.

Another important characteristic criterion is the correlation coefficient C_rs(n) between a frame of the noisy reference signal $(\hat{s} (n - k_{0}) + \hat{v} (n - k_{0}))$ and a frame of the current noisy signal $(\hat{s} (n) + \hat{v} (n))$ . For a frame length of M samples, correlation coefficient C_rs(n) is defined as

\begin{array}{lcr} C_{rs} (n) & = & \frac{1}{σ_{\hat{s} (n - k_{0} + i) + \hat{v} (n - k_{0} + i)} σ_{\hat{s} (n - j) + \hat{v} (n - j)}} \\ \times {cov ((\hat{s} (n - k_{0} + i) + \hat{v} (n - k_{0} + i)) \\ \times (\hat{s} (n - j) + \hat{v} (n - j)))} \end{array}

(35)

where −M/2≤i≤M/2−1 and 0≤j≤(M−1).

Finally, the parameter estimation accuracy is also considered for the purpose of analyzing the convergence property. In this regard, the mean square error MSE_ideal(n) between the values of estimated coefficients ${\hat{w}}_{n}$ and those of true coefficients a_n is computed as

\begin{array}{lcr} {MSE}_{ideal} (n) = \frac{1}{p} \sum_{k = 1}^{p} {[{\hat{w}}_{n} (k) - a_{n} (k)]}^{2} . \end{array}

(36)

In Figure 4, considering a real-life speech utterance of 250 ms corrupted by echo and noise, behavior of the control parameters obtained by using (33), (34), (35), and (36) is shown. The speech utterance (/i y/−/i x/) contains a voiced phoneme followed by another voiced phoneme [24]. Here k₀=1,000, M=100, N_f=1002, sampling frequency 16 kHz and S N R=15 db is used.

In a similar fashion, in Figure 5, a speech utterance consisting of a voiced phoneme /ih/ followed by an unvoiced phoneme /sh/ and, in Figure 6, a voiced phoneme /ih/ followed by pause are considered. It is observed that the characteristic parameters vary depending on the nature of reference and current frames. When the current frame is a pause or weakly unvoiced, the power ratio becomes higher in comparison to the case when the current frame is a voiced one. On the contrary, the correlation coefficient becomes smaller when measured between a voiced and an unvoiced frame, but it becomes quite larger when measured between two voiced frames. It is also found that the presence of voiced frame as a reference strongly governs the rate of convergence and the estimation error of the proposed LMS algorithm. In Figure 4, because of all through presence of the voiced frame as the reference as well as the current frame, it is found that the convergence performance is not very satisfactory and the estimation error is relatively higher. On the other hand, in Figure 6, it is observed that when the current frame is pause, even in the presence of voiced reference frame, a very fast convergence is obtained with a little estimation error. In Figure 5, as the current frame is unvoiced instead of pause, a comparatively slower convergence is observed with higher estimation error.

Next, in Figures 7, 8, 9, the reference frame is considered unvoiced, and in Figures 10, 11, 12, it is considered pause. When the reference frame is considered unvoiced because of the existence of a little correlation between the current and reference frames, the convergence performance of the proposed LMS algorithm is found quite satisfactory irrespective of the power of the reference signal (strong unvoiced or weakly unvoiced). In the case when the current frame is pause, no matter whether the reference frame is voiced or unvoiced, a fast convergence with high estimation accuracy is achieved using the proposed LMS algorithm. The reasons behind are (i) negligible cross-correlation between reference frame and current frame and (ii) a comparatively higher power ratio. In Figures 10, 11, 12, it is observed that even the reference frame is a pause or stop because of the presence of additive white noise, the reference frame may contain significant energy. In these cases, a reasonable estimation of the room response can be obtained given that the noise power is quite high. Findings in the above cases are summarized in Table 1.

Table 1 Variation of LMS updating performance due to various characteristics of reference and current speech frame

Full size table

First of all, it is observed that a better convergence in terms of iterations and estimation error is obtained when the current frame is a pause (P) or stop and the reference frame is either voiced (V) or unvoiced (U), namely, V-P and U-P. This fact leads to a decision that the updating needs to be carried out at high level of power ratio, i.e.,

P_{rs} (n) = \frac{P_{ref} (n)}{P_{sup} (n)} \geq ζ,

(37)

where P_ref(n) and P_sup(n) are defined in (33) and (34), respectively. If the value of the lower bound ζ is chosen too large, the updating would be postponed for most of the instances resulting in very slow convergence. On the other hand, a very small value of ζ may cause more frequent updates where possibility of wrong estimations of filter coefficients would be higher, especially in V-P, U-P, and P-P cases. It is to be noted that considering only a lower bound of P_rs(n) may not always be sufficient to ensure that the reference frame possesses significant energy. For example in Figure 13, it is shown that high value of P_rs(n) may arise (marked block in the figure) from an initial silence frame where only a very little amount of noise is present. In order to prevent the updating in these situations, a lower bound β on the power of the reference frame is employed, i.e., P_ref(n)≥β. The value of β should surpass the power of speech pauses and ensure that the LMS update is postponed even if a frame of speech containing a partial pause is available as the reference. Hence, the first constraint for updating the algorithm is proposed as Condition I: P_rs(n)≥ζ and P_ref(n)≥β.

In some cases, it is observed that though the power ratio is very small, quite satisfactory updating is obtained, such as the U-V case shown in Figure 7. Another characteristic observed here is lower value of correlation coefficient C_rs(n) with higher value of P_ref(n). It is to be mentioned that the proposed AEC algorithm is developed on the assumption of negligibility of the cross correlation between current frame and reference frame. However, since both reference and current frame may belong to the same person, in case of high degree of correlation, the adaptive algorithm would try to suppress portion from the echo-corrupted signal resulting in unusual degradation= in convergence performance. Hence, introducing an upper bound on C_rs(n), the second condition is proposed as Condition II: C_rs(n)≤Υ 1 and P_ref≥β.

The presence of a certain level of noise can be utilized as an advantage in pause instances where generally the updating is not performed. Since noise is considered uncorrelated to itself, updating at frames where only noise is present would be quite satisfactory. In this case, the value of C_rs(n) must be very small and thus another condition on updating is proposed as Condition III: C_rs(n)≤Υ 2≤Υ 1.

Another important factor is the MSE of the estimations of successive iterations, which is defined as

e_{coeff} (n) = \sum_{K = 1}^{p} {({\hat{w}}_{n} (k) - {\hat{w}}_{n - 1} (k))}^{2} / p.

(38)

In order to continue the updating, an upper bound on the variation of successive estimates is set as following condition: Condition IV: e_{c
o
e
f
f}(n)≤ℵ.

Considering smaller values of e_coeff(n) allows to avoid updating at those instances where abrupt and significant changes occur in the estimated coefficients. In the proposed method, in order to carry out the LMS update, at least one of the above four conditions must be fulfilled.

5 Simulation results and comments

Performance of the proposed algorithm is investigated in different echo-generating environments at various input noise levels considering several male and female utterances available in the TIMIT database [24]. An acoustic room environment is simulated using an FIR filter of length N_f, where as per conventional approaches, filter coefficients during the flat delay portion are assumed to be zero. The flat delay time (k₀) can be pre-calculated based on the distance between the microphone and the speaker [25]. Because of the implicit zeros corresponding to the flat delay, it is evident that a few number (N_f−k₀) of unknown coefficients has to be determined. In the proposed method, a smaller step size is used to obtain a smooth convergence.

First, a subjective evaluation is carried out based on the feedback about the quality of the echo- and noise-suppressed signal provided by five individual listeners at different noisy echo-generating environments. From the overall response of the listeners in terms of mean objective score (MOS), a very satisfactory performance of the proposed method is obtained even under severe echo-generating conditions in noise.

Next, two objective measures, namely, echo return loss enhancement (ERLE) and signal-to-distortion ratio (SDR) are employed. The ERLE is defined as the ratio of the instantaneous power of the residual echo signal η_ς(n) and that of the input echo signal η_x(n) and expressed in dB as [1]

ERLE (n) = - 10 log \frac{η_{ς} (n)}{η_{x} (n)} .

(39)

The average value of ERLE(n) over time is considered. The input and output SDRs in dB are respectively defined as

\begin{array}{lcr} SD R_{in} & = & 10 log \frac{P_{s}}{P_{x + v}} \end{array}

(40)

\begin{array}{lcr} SD R_{out} & = & 10 log \frac{P_{s}}{P_{\hat{s} + \hat{v} - s}}, \end{array}

(41)

where P_s is the power of original signal s(n), P_x+v is the power of microphone input, and $P_{\hat{s} + \hat{v} - s} (n)$ is the power of distortion present in the echo-suppressed output signal. The SDR improvement is given by

SDRI = SD R_{out} - SD R_{in},

(42)

which indicates the overall distortion removal.

The proposed algorithm has been tested on several different sentences taken from the TIMIT database. In order to demonstrate the principle of selecting different threshold values required in the proposed updating constraints, as a typical example, a sample utterance ‘Good service should be rewarded by big tips’ is shown in Figure 14[24]. Voicing decisions are marked in the figure as ‘P’ for pause, ‘V’ for voiced, and ‘U’ for unvoiced. Considering white Gaussian noise with SNR = 15 dB, N_f=1,002, k₀=1,000, and M=100 in Figure 14b,c,d,e, P_rs(n), P_ref(n), C_rs(n), and MSE_ideal(n) are shown, respectively. Note that in this case, the proposed algorithm is used without the update constraints, and thus, the MSE_ideal(n) exhibits some higher values. The comments provided in Table 1 can be better visualized from different marked zones of this figure. From extensive experimentations, it is found that a better update requires P_ref(n) to be at least twice of P_supp(n) and a small percentage (1% to 5%) of the power of a regular voiced frame can be chosen as the lower bound of β for P_ref(n). Analyzing C_rs(n) in different speech frames, Υ 1 in condition 2 is chosen as 0.25 to ensure that no speech is being suppressed during the update procedure by confusing it with the echo and Υ 2 is kept very small, i.e, Υ 2≈0.1 to allow updating for cases where there exists no correlation or extremely low correlation between the reference signal and echo-suppressed signal. The value of the threshold ℵ for e_coeff(n) in condition IV is chosen to be very small (0.7×10⁻⁴) such that there will be no update of the LMS algorithm when the magnitude of e_coeff(n) is comparatively much larger.

In Figure 15, the effect of incorporating the proposed conditions is shown. It is vividly observed from Figure 15 that by employing the proposed conditions, the convergence is improved to a greater extent. Moreover, in order to demonstrate the performance in frequency domain, spectrograms of the original signal, echo- and noise-corrupted signal, and the output of the proposed AENC block are depicted in Figure 16a,b, respectively. For convenience, some zones are marked on the spectrograms where significant reduction in echo and noise can easily be observed

In order to show the effectiveness of the proposed conditions, the MSE_ideal(n) obtained in Figure 14e is redrawn in Figure 15. In Figure 15, the effect of incorporating the conditions is shown. It is vividly observed from Figure 15 that by employing the proposed conditions, the convergence is improved to a greater extent. Moreover, in order to demonstrate the performance in frequency domain, spectrograms of the original signal, echo- and noise-corrupted signal, and the output of the proposed AENC block are depicted in Figure 16a,b, respectively. For convenience, some zones are marked on the spectrograms where significant reduction in echo and noise can easily be observed. For a better understanding, another TIMIT utterance ‘She had your dark suit in greasy wash water all year’, under similar acoustic environment as used in Figure 14, is considered and corresponding echo- and noise-corrupted speech signal is shown in Figure 17a. The MSEs obtained by using the proposed method with and without the conditions are presented in Figure 17b,c, which clearly demonstrate the performance improvement in the later case.

In Table 2, the performance of the proposed algorithm with and without applying the conditions is shown in terms of the SDR improvement (dB) and ERLE (dB) for utterance 1. In order to evaluate the performance under different room environments, length (N_f) and parameter values of the room response filter are varied while keeping the input SNR constant to 15 dB. Considering k₀=1,000, N_f−k₀ is varied from 2 to 14. Results shown in the table clearly demonstrate the effectiveness of using the conditions on performance measures; in all cases, higher values of SDR and ERLE are obtained.

Table 2 Performance comparison with varying room acoustics

Full size table

In Table 3, the performance of the proposed algorithm with and without applying the conditions is evaluated for different levels of input SNR ranging from 25 to −5 dB for the first utterance considering white Gaussian noise and N_f=1014. It can be seen that the proposed method provides satisfactory performance at all SNR levels. Especially, the use of proposed conditions exhibits comparatively better performance.

Table 3 Performance comparison with noise level variation

Full size table

6 Conclusion

The problem of echo cancellation in the presence of noise, especially in single-channel environment, is a very challenging task, which has been efficiently tackled in this paper. First, the single-channel AEC block is designed based on the gradient-based adaptive LMS filter where to overcome the problem of getting a separate reference signal, we propose to use the delayed version of the echo-suppressed signal. Such a unique proposal of getting the reference signal is justified by presenting a detailed mathematical proof of achieving the most optimum Wiener-Hopf solution of the estimated filter coefficients, and a convergence analysis is carried out. Moreover, in order to achieve fast and smooth convergence, a set of updating constraints is proposed by analyzing the speech characteristics of different types of speech frames, such as voiced, unvoiced, and pause. In the ANC block, a modified single-channel spectral subtraction method is considered for its robust performance. It is shown that the proposed AENC scheme with updating constraints provides a very satisfactory performance in different echo-generating conditions and various levels of SNR in terms of SDR and ERLE.

Appendix

Derivation of the solution of the LMS update

In order to obtain a homogeneous solution of the update Eq. 22, one may consider

{\underset{̲}{\hat{w}}}_{n + 1}^{T} = {\underset{̲}{\hat{w}}}_{n}^{T} - 2 μ R_{(s + v) (s + v)} (n - k_{0}) {\underset{̲}{\hat{w}}}_{n}^{T} .

(43)

Eigenvalue decomposition of the correlation matrix R_(s+v)(s+v)(n−k₀) results in

R_{(s + v) (s + v)} (n - k_{0}) = U Λ U^{T},

(44)

where each column of the matrix U consists of eigenvectors corresponding to eigenvalues constituting the diagonal elements of the matrix Λ and U^TU=I. Forward multiplication by U^T on both sides of (43) results in

\begin{array}{lcr} {\underset{̲}{\hat{w}}}_{n + 1}^{T^{U}} = {\underset{̲}{\hat{w}}}_{n}^{T^{U}} - 2 μ Λ {\underset{̲}{\hat{w}}}_{n}^{T^{U}}, \end{array}

(45)

where $U^{T} {\underset{̲}{\hat{w}}}_{n}^{T} = {\underset{̲}{\hat{w}}}_{n}^{T^{U}}$ . The k th coefficient of the weight vector can be expressed as

\begin{array}{lcr} {\underset{̲}{\hat{w}}}_{n + 1}^{U} (k) = (1 - 2 μλ (k)) {\underset{̲}{\hat{w}}}_{n}^{U} (k), \end{array}

(46)

where λ(k) is the k th diagonal element of the eigenvalue matrix obtained by eigenvalue decomposition of R_(s+v)(s+v)(n−k₀). Hence, the homogeneous solution can be obtained as

\begin{array}{lcr} {\hat{w}}_{h.s} = C_{k} {(1 - 2 μλ (k))}^{n}, \end{array}

(47)

where C_k is a constant. Next, in order to obtain the particular solution for the k th coefficient, based on (22) one can get

\begin{array}{lcr} {\hat{w}}_{p.s} = {\hat{w}}_{p.s} - 2 μλ (k) {\hat{w}}_{p.s} + 2 μ r^{U} (n - k_{0} - k) . \end{array}

(48)

Here, r^U(n−k₀−k) is the k th element of U^T $r_{(x_{s} + x_{v}) (s + v)} (n - k_{0}) = r_{(x_{s} + x_{v}) (s + v)}^{U} (n - k_{0})$ . For a particular solution ${\hat{w}}_{p.s} = K_{p} r^{U} (n - k_{0} - k)$ , (48) can be written as

\begin{array}{lcr} K_{p} r^{U} (n - k_{0} - k) & = & K_{p} r^{U} (n - k_{0} - k) \\ - 2 μλ (k) K_{p} r^{U} (n - k_{0} - k) \\ + 2 μ r^{U} (n - k_{0} - k), \end{array}

(49)

which leads to $K_{p} = \frac{1}{λ (k)}$ and the particular solution

\begin{array}{lcr} {\hat{w}}_{p.s} = \frac{1}{λ (k)} r^{U} (n - k_{0} - k) . \end{array}

(50)

References

Vaseghi SV: Advanced Digital Signal Processing and Noise Reduction. Wiley, Chichester; 2000.
Google Scholar
Kuo SM, Lee BH: Real-Time Digital Signal Processing. Wiley; 2001.
Book Google Scholar
Breining C, Dreiseitel P, Hänsler E, Mader A, Nitsch B, Puder H, Schertler T, Schmidt G, Tilp J: Acoustic echo control - an application of very-high-order adaptive filters. IEEE Signal Process. Mag 1999, 16(4):42-69. 10.1109/79.774933
Article Google Scholar
Hänsler E: The hands-free telephone problem: an annotated bibliography. Signal Process 1992, 27(3):259-271. 10.1016/0165-1684(92)90074-7
Article Google Scholar
Khong AWH, Naylor PA: Stereophonic acoustic echo cancellation employing selective-tap adaptive algorithms. IEEE Trans. Audio, Speech, Lang. Process 2006, 14(3):785-796.
Article Google Scholar
Lindstrom F, Schuldt C, Claesson I: An improvement of the two-path algorithm transfer logic for acoustic echo cancellation. IEEE Trans. Audio, Speech, Lang. Process 2007, 15(4):1320-1326.
Article Google Scholar
Wu S, Qiu X, Wu M: Stereo acoustic echo cancellation employing frequency-domain preprocessing and adaptive filter. IEEE Trans. Audio, Speech, Lang. Process 2011, 19(3):614-623.
Article Google Scholar
Nath R: Adaptive echo cancellation based on a multipath model of acoustic channel. Circuits, Syst. Signal Process., Springer US 2013, 32(4):1673-1698. 10.1007/s00034-012-9529-4
Article Google Scholar
Yukawa M, de Lamare RC, Sampaio-Neto R: Efficient acoustic echo cancellation with reduced-rank adaptive filtering based on selective decimation and adaptive interpolation. IEEE Trans. Audio, Speech, Lang. Process 2008, 16(4):696-710.
Article Google Scholar
Hänsler E, Schmidt G: Acoustic Echo and Noise Control: a Practical Approach. Wiley, New York; 2004.
Book Google Scholar
Myllylä V: Residual echo filter for enhanced acoustic echo control. Signal Process 2006, 86(6):1193-1205. 10.1016/j.sigpro.2005.07.036
Article Google Scholar
Topa R, Muresan I, Kirei BS, Homana I: A digital adaptive echo-canceller for room acoustics improvement. Adv. Electrical Comput. Eng 2004, 10: 450-453.
Google Scholar
Haykin S: Adaptive Filter Theory. Prentice-Hall, Inc., Upper Saddle River, NJ; 1996.
Google Scholar
Schmidt G: Applications of acoustic echo control: an overview. In Proc. Eur. Signal Process. Conf.. EUSIPCO, Vienna; 2004:9-16.
Google Scholar
Widrow B, Glover JRJ, McCool JM, Kaunitz J, Williams CS, Hearn RH, Zeidler JR, Dong JE, Goodlin RC: Adaptive noise cancelling: principles and applications. Proc. IEEE 1975, 63(12):1692-1716.
Article Google Scholar
Yasukawa H: An acoustic echo canceller with sub-band noise cancelling. IEICE Trans. Fundamentals Electron. Commun. Comput. Sci 1992, E75–A(11):1516-1523.
Google Scholar
Park SJ, Cho CG, Lee C, Youn DH: Integrated echo and noise canceller for hands-free applications. IEEE Trans. Circuits Syst.-II: Analog Digital Signal Process 2002., 49(3):
Beaugeant C, Turbin V, Scalart P, Gilloire A: New optimal filtering approaches for hands-free telecommunication terminals. Signal Process 1998, 64(1):33-47. 10.1016/S0165-1684(97)00174-6
Article Google Scholar
Mahbub U, Fattah SA: Gradient based adaptive filter algorithm for single channel acoustic echo cancellation in noise. In Proc. Int. Conf. Electrical Computer Engineering (ICECE), 2012 7th International Conference On. Dhaka, 688 Bangladesh; 2012:880-883.
Chapter Google Scholar
Boll S: A spectral subtraction algorithm for suppression of acoustic noise in speech. Proc. IEEE Int. Conf. Acoust. Speech, Signal Process. (ICASSP) ’79 1979, 200-203.
Chapter Google Scholar
Berouti M, Schwartz R, Makhoul J: Enhancement of speech corrupted by acoustic noise. IEEE Conf. Acoust. Speech Signal Process. (ICASSP) 1979, 208-211.
Google Scholar
Lim JS: Evaluation of a correlation subtraction method for enhancing speech degraded by additive white noise. IEEE Trans. Acoust. Speech Signal Process 1978, 26(5):471-472. 10.1109/TASSP.1978.1163129
Article Google Scholar
Martin R: Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process 2001, 9(5):504-512. 10.1109/89.928915
Article Google Scholar
Garofolo JS, Lamel LF, Fisher WM, Fiscus JG, Pallett DS, Dahlgren NL, Zue V: Timit acoustic-phonetic continuous speech corpus. Linguistic Data Consortium, Philadelphia; 1993.
Google Scholar
Guangzeng F, Feng L: A new echo caneller with the estimation of flat delay. In IEEE Region Ten Conf. TENCON 92. Melbourne, Australia; 1992. vol. 1, pp. 1–5, Print ISBN 0-7803-0849-2, DOI- 10.1109/TENCON.1992.271995
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1000, Bangladesh
Upal Mahbub & Shaikh Anowarul Fattah
Department of Electrical and Computer Engineering, Concordia University, Montreal, Quebec, H3G 1M8, Canada
Wei-Ping Zhu & M Omair Ahmad

Authors

Upal Mahbub
View author publications
You can also search for this author in PubMed Google Scholar
Shaikh Anowarul Fattah
View author publications
You can also search for this author in PubMed Google Scholar
Wei-Ping Zhu
View author publications
You can also search for this author in PubMed Google Scholar
M Omair Ahmad
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Upal Mahbub.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Authors’ original file for figure 9

Authors’ original file for figure 10

Authors’ original file for figure 11

Authors’ original file for figure 12

Authors’ original file for figure 13

Authors’ original file for figure 14

Authors’ original file for figure 15

Authors’ original file for figure 16

Authors’ original file for figure 17

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Mahbub, U., Fattah, S.A., Zhu, WP. et al. Single-channel acoustic echo cancellation in noise based on gradient-based adaptive filtering. J AUDIO SPEECH MUSIC PROC. 2014, 20 (2014). https://doi.org/10.1186/1687-4722-2014-20

Download citation

Received: 12 November 2013
Accepted: 25 March 2014
Published: 03 May 2014
DOI: https://doi.org/10.1186/1687-4722-2014-20

Single-channel acoustic echo cancellation in noise based on gradient-based adaptive filtering

Abstract

1 Introduction

2 Problem formulation

3 Proposed single-channel AENC scheme

3.1 Proposed two-stage setup

3.2 Development of proposed gradient-based single-channel LMS AEC scheme

3.3 Convergence analysis of the proposed AEC scheme

3.4 Noise reduction in spectral domain

4 Development of adaptive update constraints

5 Simulation results and comments

6 Conclusion

Appendix

Derivation of the solution of the LMS update

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords