
# Single-channel acoustic echo cancellation in noise based on gradient-based adaptive filtering

Upal Mahbub^{1} (corresponding author), Shaikh Anowarul Fattah^{1}, Wei-Ping Zhu^{2} and M Omair Ahmad^{2}

**2014**:20

https://doi.org/10.1186/1687-4722-2014-20

© Mahbub et al.; licensee Springer. 2014

**Received:** 12 November 2013

**Accepted:** 25 March 2014

**Published:** 3 May 2014

## Abstract

In this paper, a two-stage scheme is proposed to deal with the difficult problem of acoustic echo cancellation (AEC) in a single-channel scenario in the presence of noise. In order to overcome the major challenge of obtaining a separate reference signal in the adaptive filter-based AEC problem, it is proposed to use a delayed version of the echo- and noise-suppressed signal as the reference. A modified objective function is thereby derived for a gradient-based adaptive filter algorithm, and proof of its convergence to the optimum Wiener-Hopf solution is established. The output of the AEC block is fed to an acoustic noise cancellation (ANC) block where a spectral subtraction-based algorithm with an adaptive spectral floor estimation is employed. In order to obtain fast but smooth convergence with maximum possible echo and noise suppression, a set of updating constraints is proposed based on various speech characteristics (e.g., energy and correlation) of the reference and current frames, considering whether they are voiced, unvoiced, or pause. Extensive experimentation is carried out on several echo- and noise-corrupted natural utterances taken from the TIMIT database, and it is found that the proposed scheme can significantly reduce the effect of both echo and noise in terms of objective and subjective quality measures.

## Keywords

## 1 Introduction

The phenomenon of acoustic echo occurs when the output speech signal from a loudspeaker is reflected from different surfaces, such as ceilings, walls, and floors, and is then fed back to the microphone. In its worst case, acoustic echo can cause howling of a significant portion of sound energy [1, 2]. In real-life applications, such as a lecture in a large conference hall or the public address system of a trade fair, the presence of acoustic echo along with environmental noise is very common; it degrades the speech quality and can even lead to complete loss of intelligibility.

In order to deal with the problem of acoustic echo cancellation (AEC), echo suppressors, earphones, and directional microphones have conventionally been used, which generally place restrictions on the talkers’ movement [2]. As an alternative to such hardware-based solutions, adaptive filter algorithms are widely applied, where, apart from the input channel, a separate echo-free reference channel is required [3–13]. Among different adaptive filter algorithms, the least mean squares (LMS) algorithm and its variants are very popular for their satisfactory performance and low computational burden [4, 10, 12–14]. Besides these algorithms, the recursive least squares (RLS) algorithm is well known for its fast convergence at the expense of computational complexity [13]. Adaptive filter algorithms have also been used for acoustic noise cancellation (ANC) [15].

There are some methods that deal with both acoustic echo and noise cancellation (AENC) [16–18]. The echo canceller used in [16] utilizes a sub-band noise cancellation scheme. In [17], echo cancellation is done by an adaptive LMS filter while a linear prediction error filter removes the residual echo and noise. In [18], a single Wiener filter is employed to simultaneously suppress the echo and noise. It is to be mentioned that all these AENC methods employ more than one microphone, while solutions using a single microphone are preferable in most real-life applications.

In this paper, an AENC scheme is proposed which can efficiently deal with the single-channel scenario. First, unlike the conventional LMS algorithm, a gradient-based adaptive LMS algorithm is developed for single-channel AEC by considering the delayed version of the previously echo- and noise-suppressed signal as the reference. Preliminary results obtained by using this idea are reported in [19]. However, in the current paper, an analytical proof of convergence towards the optimum Wiener-Hopf solution is presented. Next, a single-channel ANC algorithm based on spectral subtraction with an adaptive spectral floor estimation is developed, which reduces not only the effect of noise but also some residual echo. Finally, by analyzing different speech characteristics of the reference and current frames, multiconditional updating constraints are proposed in order to obtain precise control over the convergence characteristics. For performance evaluation, extensive experimentation is conducted on several real-life echo- and noise-corrupted speech signals in different acoustic environments.

## 2 Problem formulation

*s*_{1}(*n*) and *s*_{2}(*n*) are speech signals corresponding to the near-end and far-end speakers, while *v*_{1}(*n*) and *v*_{2}(*n*) are the respective additive noises. The noise-corrupted far-end signal (*s*_{2}(*n*)+*v*_{2}(*n*)) is played through a loudspeaker in the near-end acoustic room environment, and the echo signal *x*_{2}(*n*) is generated. Thus, the input *y*_{1}(*n*) to the near-end microphone is given by

$$y_{1}(n)=s_{1}(n)+v_{1}(n)+x_{2}(n). \qquad (1)$$

*x*_{2}(*n*) by minimizing the error

*s*_{2}(*n*)+*v*_{2}(*n*)) and (ii) different speakers for input and echo signals. Moreover, use of the double talk detector (DTD) helps in controlling the update process. Unfortunately, these features are absent in the single-channel scenario, as shown in Figure 2. Instead of two speakers, in this case, the microphone receives the input *s*(*n*) corrupted by noise *v*(*n*) and the echo generated from the same speaker.

*v*(*n*), the sole microphone input signal in the single-channel scenario is given by

$$y(n)=s(n)+v(n)+x_{s}(n)+x_{v}(n), \qquad (3)$$

where *x*_{s}(*n*) and *x*_{v}(*n*) denote the echo of the input speech and noise, respectively. The echo signals can be expressed as

$$x_{s}(n)=\mathbf{a}_{n}^{T}\mathbf{s}(n-k_{0}), \qquad (4)$$

$$x_{v}(n)=\mathbf{a}_{n}^{T}\mathbf{v}(n-k_{0}), \qquad (5)$$

where **s**(*n*−*k*_{0})=[*s*(*n*−*k*_{0}−1),*s*(*n*−*k*_{0}−2),…,*s*(*n*−*k*_{0}−*p*)]^{T} and **v**(*n*−*k*_{0})=[*v*(*n*−*k*_{0}−1),*v*(*n*−*k*_{0}−2),…,*v*(*n*−*k*_{0}−*p*)]^{T}, with *k*_{0} being a predefined flat delay, and **a**_{n}=[*a*_{n}(1),*a*_{n}(2),…,*a*_{n}(*p*)]^{T} consists of the coefficients corresponding to the acoustic room transfer function *A*(*z*). The order *p* and the coefficient values of *A*(*z*) depend on the room characteristics. It is to be noted that in this case, there is no scope for obtaining a separate echo-free reference or a separate noise-only reference, which makes the single-channel AENC problem extremely difficult to handle.
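As a rough illustration of the single-channel model described in this section, the following Python sketch builds a microphone signal from a speech stand-in, white Gaussian noise, and their echoes through a short room response with a flat delay. The helper names (`echo_of`, `microphone_signal`), the two-tap room response, and the random stand-in for speech are illustrative assumptions, not taken from the paper.

```python
import random


def echo_of(signal, room_coeffs, flat_delay):
    """Echo x(n) = sum_{k=1..p} a(k) * signal(n - k0 - k); samples before n = 0 count as zero."""
    p = len(room_coeffs)
    echo = []
    for n in range(len(signal)):
        acc = 0.0
        for k in range(1, p + 1):
            idx = n - flat_delay - k
            if idx >= 0:
                acc += room_coeffs[k - 1] * signal[idx]
        echo.append(acc)
    return echo


def microphone_signal(speech, noise, room_coeffs, flat_delay):
    """Single-channel input y(n) = s(n) + v(n) + x_s(n) + x_v(n)."""
    x_s = echo_of(speech, room_coeffs, flat_delay)
    x_v = echo_of(noise, room_coeffs, flat_delay)
    return [s + v + xs + xv for s, v, xs, xv in zip(speech, noise, x_s, x_v)]


random.seed(0)
speech = [random.uniform(-1.0, 1.0) for _ in range(50)]  # hypothetical stand-in for s(n)
noise = [random.gauss(0.0, 0.1) for _ in range(50)]      # white Gaussian v(n)
y = microphone_signal(speech, noise, room_coeffs=[0.5, 0.3], flat_delay=10)
```

With a flat delay of 10 samples, the first 11 samples of `y` contain no echo at all, which is why an initial stretch of the input can serve as an echo-free reference.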

## 3 Proposed single-channel AENC scheme

### 3.1 Proposed two-stage setup

*y*(*n*) can be described by (3). For the case of single-channel AEC, for example, while delivering a lecture in a large conference hall, the microphone in front of the speaker receives the input speech *s*(*n*) corrupted by *v*(*n*). Once this noise-corrupted speech is played through the loudspeaker, an echo signal is generated, and thus, after some initial time delay, the microphone receives the noise-corrupted speech together with the echo of previously uttered speech. The task of AEC is to cancel the echo part from this input by using an adaptive filter algorithm. In order to adaptively obtain an estimate $\hat{x}_{s}(n)+\hat{x}_{v}(n)$ of the echo signal, we propose to utilize delayed versions of the previously echo-suppressed samples of the noisy speech as the reference signal [19]. A hat on a variable indicates an estimated value. The error signal *e*(*n*) thus obtained is given by

$$e(n)=y(n)-\hat{x}_{s}(n)-\hat{x}_{v}(n)=s(n)+v(n)+\delta_{s}(n)+\delta_{v}(n),$$

where $\delta_{s}(n)=x_{s}(n)-\hat{x}_{s}(n)$ and $\delta_{v}(n)=x_{v}(n)-\hat{x}_{v}(n)$ are the residual echo of the speech and noise portions of the input signal, respectively, and it is assumed that these signals exhibit the properties of white Gaussian noise. Next, *e*(*n*) is passed through a spectral subtraction-based single-channel ANC block, which produces the output $\tilde{s}(n)\approx s(n)+\Psi(n)$ that closely resembles *s*(*n*) provided that the residual echo-noise portion *Ψ*(*n*) becomes very small.

It is to be noted that the task of noise reduction, unlike the proposed AENC scheme, may be carried out prior to the AEC block. However, because of possible nonlinearities introduced by the prior noise reduction block, no proper reference would be available for the single-channel AEC block [17]. Hence, the arrangement shown in Figure 3a is adopted, in which the noise reduction block also serves as a post-processor for attenuating the residual echo.

### 3.2 Development of proposed gradient-based single-channel LMS AEC scheme

*e*(*n*) is proposed to use as the reference signal, and from (8), filter output *e*(*n*) can be written as

Here, *E*{.} denotes the expectation operator. In (10), it is intended to use the basic definition of the cross-correlation operation; for example, the cross-correlation function between *s*(*n*) and *v*(*n*) is defined as

$$r_{sv}(m)=E\{s(n)\,v(n-m)\}, \qquad (11)$$

where *m* denotes the lag. Using (4), (5), (7), and the above definition, the last term of (10) can be expressed as

Here, *r*_{ss}(*k*_{0}+*k*) corresponds to the (*k*_{0}+*k*)th lag of the cross-correlation between *s*(*n*) and its previous samples *s*(*n*−*k*_{0}−*k*), and *r*_{sv}(*k*_{0}+*k*) corresponds to the (*k*_{0}+*k*)th lag of the cross-correlation between *s*(*n*) and *v*(*n*−*k*_{0}−*k*). In a similar way, *r*_{vs}(*k*_{0}+*k*), *r*_{vv}(*k*_{0}+*k*), $r_{s\delta_{s}}(k_{0}+k)$, $r_{s\delta_{v}}(k_{0}+k)$, $r_{v\delta_{s}}(k_{0}+k)$, and $r_{v\delta_{v}}(k_{0}+k)$ can be defined. It is well known that the value of the cross-correlation decreases rapidly with increasing lag when two signals are uncorrelated. In the ideal case, the cross-correlation function between two random noise signals would be nonzero only at the zero lag. Since *v*(*n*) is assumed to be white Gaussian noise and, generally, the value of *k*_{0} is very large, in (12), the effect of the terms *r*_{sv}(*k*_{0}+*k*), *r*_{vs}(*k*_{0}+*k*), and *r*_{vv}(*k*_{0}+*k*) can be neglected. Moreover, because of the noise-like characteristics of *δ*_{s}(*n*) and *δ*_{v}(*n*), in (12), one can neglect $r_{s\delta_{v}}(k_{0}+k)$, $r_{v\delta_{s}}(k_{0}+k)$, and $r_{v\delta_{v}}(k_{0}+k)$ too. Hence, it can easily be comprehended that optimal filter performance occurs when *r*_{ss}(*n*) is minimum, i.e., the least possible correlation between *s*(*n*−*k*_{0}−*k*) and *s*(*n*) is desired. As a result, (10) reduces to

*r*_{ss}(*k*_{0}+*k*) strongly depends on speech characteristics and the amount of flat delay *k*_{0}. For a reasonably large *k*_{0}, the effect of *r*_{ss}(*k*_{0}+*k*) in (13) can be neglected, and minimization of (13) results in

$$\hat{\mathbf{w}}_{n}=\mathbf{R}_{(s+v)(s+v)}^{-1}(n-k_{0})\,\mathbf{r}_{(x_{s}+x_{v})(s+v)}(n-k_{0}),$$

where $\mathbf{r}_{(x_{s}+x_{v})(s+v)}(n-k_{0})$ consists of different lags of the cross-correlation between the echo signal *x*_{s}(*n*)+*x*_{v}(*n*) and the noisy input signal *s*(*n*)+*v*(*n*), while **R**_{(s+v)(s+v)} is the auto-correlation matrix of *s*(*n*)+*v*(*n*). This $\hat{\mathbf{w}}_{n}$ is the optimum (Wiener-Hopf) solution. Hence, it is shown that even for a single-channel noise-corrupted AEC problem, the optimum solution $\hat{\mathbf{w}}_{n}$ can be achieved under the assumptions stated earlier.
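The argument above relies on cross-correlation terms at large lags being negligible for white noise. A small numerical check of that behaviour is sketched below, using the same convention as the text (the lag-*m* term correlates *a*(*n*) with *b*(*n*−*m*)). The function name `cross_correlation`, the sample length, and the lag of 50 are illustrative assumptions.

```python
import random


def cross_correlation(a, b, lag):
    """Sample estimate of r_ab(lag) = E{a(n) * b(n - lag)} for lag >= 0."""
    terms = [a[n] * b[n - lag] for n in range(lag, len(a))]
    return sum(terms) / len(terms)


random.seed(0)
v = [random.gauss(0.0, 1.0) for _ in range(5000)]  # white Gaussian noise, unit variance

r_vv_zero = cross_correlation(v, v, 0)   # close to the noise variance
r_vv_far = cross_correlation(v, v, 50)   # close to zero for white noise
```

For voiced speech, which is quasi-periodic, the same estimate at a large lag would generally not be small; this is the case the update constraints of Section 4 are intended to guard against.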

Here, *μ* is the step factor controlling the stability and rate of convergence, *ξ*(*n*) is the cost function, and ∇ is the gradient operator. The LMS algorithm simply approximates the mean square error by the square of the instantaneous error, i.e., *ξ*(*n*)=*e*^{2}(*n*), and therefore, from (6) and (7), the gradient of *ξ*(*n*) can be expressed as

$$\nabla\xi(n)=-2e(n)\left[\hat{\mathbf{s}}(n-k_{0})+\hat{\mathbf{v}}(n-k_{0})\right].$$
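A minimal sketch of the single-channel LMS idea developed in this subsection is given below. It assumes the common instantaneous-gradient form of the update (weights moved by 2*μ* times the error times the reference vector) and a reference built from the delayed, previously echo-suppressed samples. The function name, the white stand-in for *s*(*n*)+*v*(*n*), the two-tap echo path, and the step size are assumptions chosen for illustration; real speech is not white, which is what the update constraints of Section 4 are meant to handle.

```python
import random


def single_channel_lms(y, p, k0, mu):
    """Adapt p echo-path weights using delayed echo-suppressed output as the reference."""
    w = [0.0] * p
    e = [0.0] * len(y)
    for n in range(len(y)):
        # Reference vector e(n-k0-1), ..., e(n-k0-p); zero before the signal starts.
        ref = [e[n - k0 - k] if n - k0 - k >= 0 else 0.0 for k in range(1, p + 1)]
        echo_estimate = sum(wk * rk for wk, rk in zip(w, ref))
        e[n] = y[n] - echo_estimate
        # Instantaneous-gradient step: w <- w + 2 * mu * e(n) * ref.
        w = [wk + 2.0 * mu * e[n] * rk for wk, rk in zip(w, ref)]
    return w, e


random.seed(1)
true_a = [0.5, 0.3]   # hypothetical echo-path coefficients
k0 = 20               # hypothetical flat delay in samples
u = [random.uniform(-1.0, 1.0) for _ in range(20000)]  # white stand-in for s(n) + v(n)

y = []
for n in range(len(u)):
    echo = sum(a * u[n - k0 - k] for k, a in enumerate(true_a, start=1) if n - k0 - k >= 0)
    y.append(u[n] + echo)

w_hat, e = single_channel_lms(y, p=2, k0=k0, mu=0.001)
```

In this toy setting, the estimated weights `w_hat` should settle near `true_a`, since the delayed input is uncorrelated with the current input sample.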

### 3.3 Convergence analysis of the proposed AEC scheme

*k*th unknown weight vector (where *k*=1,2,…,*p*), using (6) and neglecting the effect of *r*_{ss}(*n*) that has already been discussed in the previous subsection, the last term of (19) can be written as

Here, *λ*(*k*) is the *k*th diagonal element of the eigenvalue matrix obtained by eigenvalue decomposition of **R**_{(s+v)(s+v)}(*n*−*k*_{0}), and *r*^{U}(*n*−*k*_{0}−*k*) is the *k*th element of $\mathbf{U}^{T}\mathbf{r}_{(x_{s}+x_{v})(s+v)}(n-k_{0})=\mathbf{r}_{(x_{s}+x_{v})(s+v)}^{U}(n-k_{0})$, with the matrix **U** consisting of the eigenvectors corresponding to the eigenvalues. Since in the iterative update procedure, the homogeneous part (1−2*μλ*(*k*))^{n} diminishes with iterations, (23) in a matrix form can be expressed as

Thus, it is found that, with an increasing number of iterations, the average value of the weight vector converges to the Wiener-Hopf solution, which is the optimum solution.
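The convergence claim can be checked numerically with the textbook mean-weight recursion for LMS, in which the averaged weights follow (**I**−2*μ***R**) times the previous weights plus 2*μ***r**. The sketch below is an illustration under that standard assumption and is not the paper's own derivation; the 2×2 matrix, the vector, and the step size are made-up values.

```python
def wiener_solution_2x2(R, r):
    """Solve R w = r for a 2x2 matrix R by Cramer's rule."""
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    return [
        (R[1][1] * r[0] - R[0][1] * r[1]) / det,
        (R[0][0] * r[1] - R[1][0] * r[0]) / det,
    ]


def mean_weight_recursion(R, r, mu, iterations):
    """Iterate w <- w + 2*mu*(r - R w), i.e. w <- (I - 2*mu*R) w + 2*mu*r."""
    w = [0.0, 0.0]
    for _ in range(iterations):
        Rw = [R[0][0] * w[0] + R[0][1] * w[1], R[1][0] * w[0] + R[1][1] * w[1]]
        w = [w[i] + 2.0 * mu * (r[i] - Rw[i]) for i in range(2)]
    return w


R = [[1.0, 0.3], [0.3, 1.0]]   # hypothetical auto-correlation matrix (eigenvalues 1.3 and 0.7)
r = [0.5, 0.2]                 # hypothetical cross-correlation vector
w_wiener = wiener_solution_2x2(R, r)
w_mean = mean_weight_recursion(R, r, mu=0.1, iterations=300)
```

With *μ*=0.1, the two modes decay as 0.74^{n} and 0.86^{n}, matching the form (1−2*μλ*(*k*))^{n} of the homogeneous part, so `w_mean` agrees with `w_wiener` to numerical precision after a few hundred iterations.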

### 3.4 Noise reduction in spectral domain

*i*th frame, the error signal for the duration of a frame length can be written as

*Err*_{i}(*ω*) can be minimized by choosing

where the phase (arg[*E*_{i}(*ω*)]) is generally assumed to be the phase of the noise-corrupted signal without causing significant degradation in terms of loss of intelligibility of the speech signal [20]. It can be seen that an estimate of the magnitude spectrum $|\tilde{S}_{i}(\omega)|$ of the signal can be obtained provided an estimate of the noise spectrum $E\{|\hat{V}_{i}(\omega)|^{2}\}$ is available, which is generally computed during the periods when speech is known *a priori* not to be present.

*s*_{i}(*n*) and a negligible amount of noise-like signal *Ψ*_{i}(*n*). The signal *Ψ*_{i}(*n*), although very weak, may contain some signature of the input noise *v*(*n*), the residual echo *δ*_{s}(*n*), and the residual noise *δ*_{v}(*n*). In order to overcome the problem of musical noise and to avoid the speech distortion caused by speech subtraction, in (31), an overestimate of the noise power spectrum can be subtracted carefully such that the spectral floor is preserved [21]. Thus, (31) can be modified as

Here, *α*_{ss} is the subtraction factor and *β*_{ss} is the spectral floor parameter, with *α*_{ss}≥1 and 0≤*β*_{ss}≤1. The task of noise power spectral density estimation is carried out based on the minimum statistics noise estimator proposed in [23], which can handle the time-varying nature of the noise.
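The over-subtraction with a spectral floor described above can be sketched per frequency bin as follows. The rule used here is the widely cited form from [21] (subtract *α*_{ss} times the noise power, but never go below *β*_{ss} times the noise power), which may differ in detail from the paper's own modified expression. The function names and the default values 4.0 and 0.01 are illustrative assumptions that merely satisfy *α*_{ss}≥1 and 0≤*β*_{ss}≤1.

```python
import cmath
import math


def subtract_noise_power(noisy_power, noise_power, alpha_ss=4.0, beta_ss=0.01):
    """Per-bin power subtraction with over-subtraction factor and spectral floor."""
    cleaned = []
    for e_pow, v_pow in zip(noisy_power, noise_power):
        difference = e_pow - alpha_ss * v_pow
        floor = beta_ss * v_pow
        cleaned.append(difference if difference > floor else floor)
    return cleaned


def enhance_frame_spectrum(noisy_spectrum, noise_power, alpha_ss=4.0, beta_ss=0.01):
    """Combine the cleaned magnitude with the phase of the noisy spectrum."""
    noisy_power = [abs(c) ** 2 for c in noisy_spectrum]
    clean_power = subtract_noise_power(noisy_power, noise_power, alpha_ss, beta_ss)
    return [
        cmath.rect(math.sqrt(p), cmath.phase(c))
        for p, c in zip(clean_power, noisy_spectrum)
    ]


# Hypothetical two-bin frame: the first bin is speech-dominated, the second noise-dominated.
frame = enhance_frame_spectrum([3 + 4j, 0.5 + 0.5j], noise_power=[1.0, 1.0])
```

In the noise-dominated bin the output is clamped to the floor rather than set to zero, which is the mechanism intended to limit musical noise.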

## 4 Development of adaptive update constraints

- (i) The level of cross-correlation
- (ii) The amount of signal power
- (iii) The mean square error (MSE) between consecutive estimates of the unknown filter coefficients.

Through extensive experimentation on different speech frames, it is found that the negligibility of the cross-correlation terms *r*_{ss}(*n*), $r_{s\delta_{v}}(n)$, $r_{v\delta_{s}}(n)$, and $r_{v\delta_{v}}(n)$ (as described after (12)) strongly depends on the voicing characteristics of the speech frames and the input noise. Because of the inherent periodicity of a voiced speech frame, the degree of cross-correlation between two voiced speech frames of a person is higher than that between two unvoiced speech frames, which are random in nature. Regarding signal power, the ratio of the power of a voiced speech frame to that of an unvoiced speech frame is found to be higher than the ratio between two voiced speech frames. As white Gaussian noise is considered, the degree of cross-correlation between the speech and noise is found to be negligible, and the noise powers in two different frames may not differ significantly. As a result, the effect of input noise on the power ratio is found to be negligible.

*k*_{0} samples, the initial *k*_{0} samples of the utterance *s*(*n*)+*v*(*n*) can be treated as a reference signal (echo-free signal) responsible for the generation of the echo signal that corrupts the current samples at or after *k*_{0} samples. Considering a window of *M* samples with *M*≪*k*_{0}, the power of the reference signal $(\hat{s}(n-k_{0})+\hat{v}(n-k_{0}))$ can be computed as

*M* samples of the echo-suppressed speech signal $\hat{s}(n)$, the average power *P*_{sup}(*n*) can be computed as

The ratio of *P*_{ref}(*n*) and *P*_{sup}(*n*) is denoted as the power ratio *P*_{rs}(*n*) and considered as one of the control characteristics.

*C*_{rs}(*n*) between a frame of the noisy reference signal $(\hat{s}(n-k_{0})+\hat{v}(n-k_{0}))$ and a frame of the current noisy signal $(\hat{s}(n)+\hat{v}(n))$. For a frame length of *M* samples, the correlation coefficient *C*_{rs}(*n*) is defined as

where −*M*/2≤*i*≤*M*/2−1 and 0≤*j*≤(*M*−1).
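The frame-level control characteristics can be sketched as follows. The power is taken as the mean square over a frame of *M* samples and the power ratio as reference power over suppressed power, in line with the description above. The correlation measure in this sketch is a simplified zero-lag normalized correlation used only as a stand-in; the paper's definition ranges over lags and is not reproduced here. All function names are hypothetical.

```python
import math


def frame_power(frame):
    """Average power (mean square) of a frame of M samples."""
    return sum(x * x for x in frame) / len(frame)


def power_ratio(reference_frame, suppressed_frame):
    """P_rs = P_ref / P_sup; returns infinity when the suppressed frame is silent."""
    p_sup = frame_power(suppressed_frame)
    if p_sup == 0.0:
        return float("inf")
    return frame_power(reference_frame) / p_sup


def normalized_correlation(reference_frame, current_frame):
    """Magnitude of the zero-lag normalized correlation (assumed stand-in for C_rs)."""
    energy_ref = sum(x * x for x in reference_frame)
    energy_cur = sum(x * x for x in current_frame)
    if energy_ref == 0.0 or energy_cur == 0.0:
        return 0.0
    dot = sum(a * b for a, b in zip(reference_frame, current_frame))
    return abs(dot) / math.sqrt(energy_ref * energy_cur)
```

For example, `power_ratio([2.0, 2.0], [1.0, 1.0])` evaluates to 4.0, and two identical frames give a normalized correlation of 1.0.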

MSE_{ideal}(*n*) between the values of the estimated coefficients $\hat{w}_{n}$ and those of the true coefficients *a*_{n} is computed as

*iy*/−/*ix*/) contains a voiced phoneme followed by another voiced phoneme [24]. Here, *k*_{0}=1,000, *M*=100, *N*_{f}=1,002, a sampling frequency of 16 kHz, and *SNR*=15 dB are used.

**Table 1 Variation of LMS updating performance due to various characteristics of reference and current speech frame**

Reference speech sample | Current noise- and echo-corrupted speech sample | LMS update performance |
---|---|---|
Voiced | Voiced | Poor |
Voiced | Unvoiced | Unsatisfactory |
Voiced | Pause | Satisfactory/Excellent |
Unvoiced | Voiced | Excellent |
Unvoiced | Unvoiced | Excellent |
Unvoiced | Pause | Excellent |
Pause | Voiced | Poor |
Pause | Unvoiced | Poor |
Pause | Pause | Poor |

*P*_{ref}(*n*) and *P*_{sup}(*n*) are defined in (33) and (34), respectively. If the value of the lower bound *ζ* is chosen too large, the updating would be postponed for most of the instances, resulting in very slow convergence. On the other hand, a very small value of *ζ* may cause more frequent updates where the possibility of wrong estimations of the filter coefficients would be higher, especially in V-P, U-P, and P-P cases. It is to be noted that considering only a lower bound of *P*_{rs}(*n*) may not always be sufficient to ensure that the reference frame possesses significant energy. For example, in Figure 13, it is shown that a high value of *P*_{rs}(*n*) may arise (marked block in the figure) from an initial silence frame where only a very small amount of noise is present. In order to prevent the updating in these situations, a lower bound *β* on the power of the reference frame is employed, i.e., *P*_{ref}(*n*)≥*β*. The value of *β* should surpass the power of speech pauses and ensure that the LMS update is postponed even if a frame of speech containing a partial pause is available as the reference. Hence, the first constraint for updating the algorithm is proposed as

**Condition I:** *P*_{rs}(*n*)≥*ζ* and *P*_{ref}(*n*)≥*β*.

In some cases, it is observed that although the power ratio is very small, quite satisfactory updating is obtained, such as the U-V case shown in Figure 7. Another characteristic observed here is a lower value of the correlation coefficient *C*_{rs}(*n*) together with a higher value of *P*_{ref}(*n*). It is to be mentioned that the proposed AEC algorithm is developed on the assumption that the cross-correlation between the current frame and the reference frame is negligible. However, since both the reference and current frames may belong to the same person, in the case of a high degree of correlation, the adaptive algorithm would try to suppress part of the speech in the echo-corrupted signal, resulting in an unusual degradation in convergence performance. Hence, introducing an upper bound on *C*_{rs}(*n*), the second condition is proposed as

**Condition II:** *C*_{rs}(*n*)≤*Υ*_{1} and *P*_{ref}(*n*)≥*β*.

The presence of a certain level of noise can be utilized as an advantage in pause instances, where generally the updating is not performed. Since the noise is considered uncorrelated with its own delayed versions, updating at frames where only noise is present would be quite satisfactory. In this case, the value of *C*_{rs}(*n*) must be very small, and thus another condition on updating is proposed as

**Condition III:** *C*_{rs}(*n*)≤*Υ*_{2}≤*Υ*_{1}.

In order to continue the updating, an upper bound on the variation of successive estimates is set as the following condition:

**Condition IV:** *e*_{coeff}(*n*)≤*ℵ*.

Restricting updates to smaller values of *e*_{coeff}(*n*) avoids updating at those instances where abrupt and significant changes occur in the estimated coefficients. In the proposed method, in order to carry out the LMS update, at least one of the above four conditions must be fulfilled.
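The decision rule stated above can be sketched as a single predicate that returns true when at least one of the four conditions holds. The function and parameter names are hypothetical. The default thresholds follow the values reported in Section 5 (*ζ*=2, *Υ*_{1}=0.25, *Υ*_{2}≈0.1, *ℵ*=0.7×10^{−4}), while the default for *β* is a placeholder, since the paper ties it to 1% to 5% of the power of a regular voiced frame.

```python
def lms_update_allowed(p_rs, p_ref, c_rs, e_coeff,
                       zeta=2.0, beta=0.01,
                       upsilon_1=0.25, upsilon_2=0.1, aleph=0.7e-4):
    """Return True when at least one of Conditions I to IV is fulfilled."""
    condition_1 = p_rs >= zeta and p_ref >= beta       # enough power ratio and reference power
    condition_2 = c_rs <= upsilon_1 and p_ref >= beta  # low correlation, energetic reference
    condition_3 = c_rs <= upsilon_2                    # very low correlation (e.g. noise only)
    condition_4 = e_coeff <= aleph                     # successive estimates barely change
    return condition_1 or condition_2 or condition_3 or condition_4


# A frame pair with an energetic reference and a high power ratio permits an update.
allowed = lms_update_allowed(p_rs=3.0, p_ref=0.5, c_rs=0.9, e_coeff=1.0)
```

In an adaptive loop, the weight update would simply be skipped for any sample or frame where this predicate is false, leaving the previous coefficients in place.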

## 5 Simulation results and comments

Performance of the proposed algorithm is investigated in different echo-generating environments at various input noise levels, considering several male and female utterances available in the TIMIT database [24]. An acoustic room environment is simulated using an FIR filter of length *N*_{f}, where, as per conventional approaches, the filter coefficients during the flat delay portion are assumed to be zero. The flat delay time (*k*_{0}) can be pre-calculated based on the distance between the microphone and the speaker [25]. Because of the implicit zeros corresponding to the flat delay, only a small number (*N*_{f}−*k*_{0}) of unknown coefficients has to be determined. In the proposed method, a smaller step size is used to obtain a smooth convergence.

First, a subjective evaluation is carried out based on feedback about the quality of the echo- and noise-suppressed signal provided by five individual listeners in different noisy echo-generating environments. From the overall response of the listeners in terms of the mean opinion score (MOS), a very satisfactory performance of the proposed method is obtained even under severe echo-generating conditions in noise.

*η*_{ς}(*n*) and that of the input echo signal *η*_{x}(*n*) and expressed in dB as [1]

*n*) over time is considered. The input and output SDRs in dB are respectively defined as

Here, *P*_{s} is the power of the original signal *s*(*n*), *P*_{x+v} is the power of the microphone input, and $P_{\hat{s}+\hat{v}-s}(n)$ is the power of the distortion present in the echo-suppressed output signal. The SDR improvement is given by

which indicates the overall distortion removal.
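The objective measures can be sketched as follows, assuming the usual decibel definitions: ERLE as the ratio of echo power before cancellation to residual echo power after it, SDR as the ratio of clean-signal power to distortion power, and the SDR improvement as output SDR minus input SDR. The function names and the toy signals are illustrative assumptions; the paper's own expressions may be evaluated over time rather than over a whole signal.

```python
import math


def signal_power(x):
    """Mean-square power of a sequence."""
    return sum(v * v for v in x) / len(x)


def erle_db(echo, residual_echo):
    """Echo return loss enhancement: 10*log10(echo power / residual echo power)."""
    residual = signal_power(residual_echo)
    if residual == 0.0:
        return float("inf")
    return 10.0 * math.log10(signal_power(echo) / residual)


def sdr_db(clean, observed):
    """Signal-to-distortion ratio, where the distortion is observed minus clean."""
    distortion = signal_power([o - c for o, c in zip(observed, clean)])
    if distortion == 0.0:
        return float("inf")
    return 10.0 * math.log10(signal_power(clean) / distortion)


def sdr_improvement_db(clean, microphone_input, output):
    """SDR improvement: output SDR minus input SDR."""
    return sdr_db(clean, output) - sdr_db(clean, microphone_input)


clean = [1.0, -1.0, 1.0, -1.0]
mic = [2.0, 0.0, 2.0, 0.0]      # clean signal plus a distortion of amplitude 1
out = [1.1, -0.9, 1.1, -0.9]    # clean signal plus a distortion of amplitude 0.1
improvement = sdr_improvement_db(clean, mic, out)
```

In this toy case, the input SDR is 0 dB and the output SDR is 20 dB, so `improvement` is 20 dB up to floating-point rounding.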

*N*_{f}=1,002, *k*_{0}=1,000, and *M*=100 in Figure 14b,c,d,e, *P*_{rs}(*n*), *P*_{ref}(*n*), *C*_{rs}(*n*), and MSE_{ideal}(*n*) are shown, respectively. Note that in this case, the proposed algorithm is used without the update constraints, and thus, the MSE_{ideal}(*n*) exhibits some higher values. The comments provided in Table 1 can be better visualized from the different marked zones of this figure. From extensive experimentation, it is found that a better update requires *P*_{ref}(*n*) to be at least twice *P*_{sup}(*n*), and that a small percentage (1% to 5%) of the power of a regular voiced frame can be chosen as the lower bound *β* for *P*_{ref}(*n*). Analyzing *C*_{rs}(*n*) in different speech frames, *Υ*_{1} in condition II is chosen as 0.25 to ensure that no speech is suppressed during the update procedure by confusing it with the echo, and *Υ*_{2} is kept very small, i.e., *Υ*_{2}≈0.1, to allow updating for cases where there exists no correlation or extremely low correlation between the reference signal and the echo-suppressed signal. The value of the threshold *ℵ* for *e*_{coeff}(*n*) in condition IV is chosen to be very small (0.7×10^{−4}) such that there will be no update of the LMS algorithm when the magnitude of *e*_{coeff}(*n*) is comparatively much larger.

The MSE_{ideal}(*n*) obtained in Figure 14e is redrawn in Figure 15, where the effect of incorporating the conditions is also shown. It is clearly observed from Figure 15 that employing the proposed conditions improves the convergence considerably. Moreover, in order to demonstrate the performance in the frequency domain, spectrograms of the original signal, the echo- and noise-corrupted signal, and the output of the proposed AENC block are depicted in Figure 16a,b, respectively. For convenience, some zones are marked on the spectrograms where a significant reduction in echo and noise can easily be observed. For a better understanding, another TIMIT utterance, ‘She had your dark suit in greasy wash water all year’, under a similar acoustic environment as used in Figure 14, is considered, and the corresponding echo- and noise-corrupted speech signal is shown in Figure 17a. The MSEs obtained by using the proposed method with and without the conditions are presented in Figure 17b,c, which clearly demonstrate the performance improvement in the latter case.

*N*_{f}) and parameter values of the room response filter are varied while keeping the input SNR constant at 15 dB. Considering *k*_{0}=1,000, *N*_{f}−*k*_{0} is varied from 2 to 14. Results shown in the table clearly demonstrate the effectiveness of using the conditions on the performance measures; in all cases, higher values of SDR and ERLE are obtained.

**Table 2 Performance comparison with varying room acoustics**

*N*_{f}−*k*_{0} | SDRI (dB), no condition | Avg. ERLE (dB), no condition | SDR (dB), with conditions | Avg. ERLE (dB), with conditions |
---|---|---|---|---|
2 | 4.9921 | 8.8496 | 6.9848 | 10.6772 |
4 | 4.9027 | 2.0696 | 5.7731 | 2.2787 |
6 | 8.391 | 4.6507 | 9.2744 | 5.0313 |
8 | 6.4551 | 2.4214 | 6.5558 | 2.6797 |
10 | 6.0507 | 2.6341 | 6.1730 | 2.854 |
12 | 6.7127 | 3.0277 | 7.0978 | 3.2048 |
14 | 7.8763 | 3.7481 | 8.2515 | 3.8909 |

*N*_{f}=1,014. It can be seen that the proposed method provides satisfactory performance at all SNR levels. In particular, the use of the proposed conditions yields comparatively better performance.

**Table 3 Performance comparison with noise level variation**

Input noise level (dB) | SDRI (dB), no condition | Avg. ERLE (dB), no condition | SDR (dB), with conditions | Avg. ERLE (dB), with conditions |
---|---|---|---|---|
25 | 7.4065 | 3.183 | 7.8189 | 3.2759 |
20 | 7.613 | 3.5382 | 7.9346 | 3.6171 |
15 | 7.8763 | 3.7481 | 8.2515 | 3.8909 |
10 | 8.2085 | 3.5999 | 8.386 | 3.6064 |
5 | 8.2434 | 3.0533 | 8.8839 | 3.0765 |
0 | 8.7968 | 2.4493 | 9.4557 | 2.542 |
-5 | 8.2259 | 2.0032 | 10.5136 | 2.2912 |

## 6 Conclusion

The problem of echo cancellation in the presence of noise, especially in a single-channel environment, is a very challenging task, which has been efficiently tackled in this paper. First, the single-channel AEC block is designed based on a gradient-based adaptive LMS filter in which, to overcome the problem of obtaining a separate reference signal, we propose to use the delayed version of the echo-suppressed signal. This way of obtaining the reference signal is justified by a detailed mathematical proof that the estimated filter coefficients achieve the optimum Wiener-Hopf solution, and a convergence analysis is carried out. Moreover, in order to achieve fast and smooth convergence, a set of updating constraints is proposed by analyzing the speech characteristics of different types of speech frames, such as voiced, unvoiced, and pause. In the ANC block, a modified single-channel spectral subtraction method is considered for its robust performance. It is shown that the proposed AENC scheme with updating constraints provides a very satisfactory performance in different echo-generating conditions and at various levels of SNR in terms of SDR and ERLE.

## Appendix

### Derivation of the solution of the LMS update

**R**_{(s+v)(s+v)}(*n*−*k*_{0}) results in

$$\mathbf{R}_{(s+v)(s+v)}(n-k_{0})=\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{T},$$

where **U** consists of the eigenvectors corresponding to the eigenvalues constituting the diagonal elements of the matrix **Λ**, and **U**^{T}**U**=**I**. Forward multiplication by **U**^{T} on both sides of (43) results in

*k*th coefficient of the weight vector can be expressed as

Here, *λ*(*k*) is the *k*th diagonal element of the eigenvalue matrix obtained by eigenvalue decomposition of **R**_{(s+v)(s+v)}(*n*−*k*_{0}). Hence, the homogeneous solution can be obtained as

Here, *C*_{k} is a constant. Next, in order to obtain the particular solution for the *k*th coefficient, based on (22) one can get

Here, *r*^{U}(*n*−*k*_{0}−*k*) is the *k*th element of $\mathbf{U}^{T}\mathbf{r}_{(x_{s}+x_{v})(s+v)}(n-k_{0})=\mathbf{r}_{(x_{s}+x_{v})(s+v)}^{U}(n-k_{0})$. For a particular solution $\hat{w}_{\mathrm{p.s}}=K_{p}r^{U}(n-k_{0}-k)$, (48) can be written as

## Declarations

## Authors’ Affiliations

## References

1. Vaseghi SV: *Advanced Digital Signal Processing and Noise Reduction*. Wiley, Chichester; 2000.
2. Kuo SM, Lee BH: *Real-Time Digital Signal Processing*. Wiley; 2001.
3. Breining C, Dreiseitel P, Hänsler E, Mader A, Nitsch B, Puder H, Schertler T, Schmidt G, Tilp J: Acoustic echo control - an application of very-high-order adaptive filters. *IEEE Signal Process. Mag* 1999, 16(4):42-69. doi:10.1109/79.774933
4. Hänsler E: The hands-free telephone problem: an annotated bibliography. *Signal Process* 1992, 27(3):259-271. doi:10.1016/0165-1684(92)90074-7
5. Khong AWH, Naylor PA: Stereophonic acoustic echo cancellation employing selective-tap adaptive algorithms. *IEEE Trans. Audio, Speech, Lang. Process* 2006, 14(3):785-796.
6. Lindstrom F, Schuldt C, Claesson I: An improvement of the two-path algorithm transfer logic for acoustic echo cancellation. *IEEE Trans. Audio, Speech, Lang. Process* 2007, 15(4):1320-1326.
7. Wu S, Qiu X, Wu M: Stereo acoustic echo cancellation employing frequency-domain preprocessing and adaptive filter. *IEEE Trans. Audio, Speech, Lang. Process* 2011, 19(3):614-623.
8. Nath R: Adaptive echo cancellation based on a multipath model of acoustic channel. *Circuits, Syst. Signal Process., Springer US* 2013, 32(4):1673-1698. doi:10.1007/s00034-012-9529-4
9. Yukawa M, de Lamare RC, Sampaio-Neto R: Efficient acoustic echo cancellation with reduced-rank adaptive filtering based on selective decimation and adaptive interpolation. *IEEE Trans. Audio, Speech, Lang. Process* 2008, 16(4):696-710.
10. Hänsler E, Schmidt G: *Acoustic Echo and Noise Control: a Practical Approach*. Wiley, New York; 2004.
11. Myllylä V: Residual echo filter for enhanced acoustic echo control. *Signal Process* 2006, 86(6):1193-1205. doi:10.1016/j.sigpro.2005.07.036
12. Topa R, Muresan I, Kirei BS, Homana I: A digital adaptive echo-canceller for room acoustics improvement. *Adv. Electrical Comput. Eng* 2004, 10:450-453.
13. Haykin S: *Adaptive Filter Theory*. Prentice-Hall, Inc., Upper Saddle River, NJ; 1996.
14. Schmidt G: Applications of acoustic echo control: an overview. In *Proc. Eur. Signal Process. Conf.* EUSIPCO, Vienna; 2004:9-16.
15. Widrow B, Glover JRJ, McCool JM, Kaunitz J, Williams CS, Hearn RH, Zeidler JR, Dong JE, Goodlin RC: Adaptive noise cancelling: principles and applications. *Proc. IEEE* 1975, 63(12):1692-1716.
16. Yasukawa H: An acoustic echo canceller with sub-band noise cancelling. *IEICE Trans. Fundamentals Electron. Commun. Comput. Sci* 1992, E75–A(11):1516-1523.
17. Park SJ, Cho CG, Lee C, Youn DH: Integrated echo and noise canceller for hands-free applications. *IEEE Trans. Circuits Syst.-II: Analog Digital Signal Process* 2002, 49(3).
18. Beaugeant C, Turbin V, Scalart P, Gilloire A: New optimal filtering approaches for hands-free telecommunication terminals. *Signal Process* 1998, 64(1):33-47. doi:10.1016/S0165-1684(97)00174-6
19. Mahbub U, Fattah SA: Gradient based adaptive filter algorithm for single channel acoustic echo cancellation in noise. In *Proc. Int. Conf. Electrical Computer Engineering (ICECE), 2012 7th International Conference On*. Dhaka, Bangladesh; 2012:880-883.
20. Boll S: A spectral subtraction algorithm for suppression of acoustic noise in speech. *Proc. IEEE Int. Conf. Acoust. Speech, Signal Process. (ICASSP) ’79* 1979, 200-203.
21. Berouti M, Schwartz R, Makhoul J: Enhancement of speech corrupted by acoustic noise. *IEEE Conf. Acoust. Speech Signal Process. (ICASSP)* 1979, 208-211.
22. Lim JS: Evaluation of a correlation subtraction method for enhancing speech degraded by additive white noise. *IEEE Trans. Acoust. Speech Signal Process* 1978, 26(5):471-472. doi:10.1109/TASSP.1978.1163129
23. Martin R: Noise power spectral density estimation based on optimal smoothing and minimum statistics. *IEEE Trans. Speech Audio Process* 2001, 9(5):504-512. doi:10.1109/89.928915
24. Garofolo JS, Lamel LF, Fisher WM, Fiscus JG, Pallett DS, Dahlgren NL, Zue V: *TIMIT acoustic-phonetic continuous speech corpus*. Linguistic Data Consortium, Philadelphia; 1993.
25. Guangzeng F, Feng L: A new echo canceller with the estimation of flat delay. In *IEEE Region Ten Conf. TENCON 92*. Melbourne, Australia; 1992, vol. 1, pp. 1–5. Print ISBN 0-7803-0849-2. doi:10.1109/TENCON.1992.271995

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.