Single-channel acoustic echo cancellation in noise based on gradient-based adaptive filtering
EURASIP Journal on Audio, Speech, and Music Processing volume 2014, Article number: 20 (2014)
Abstract
In this paper, a two-stage scheme is proposed to deal with the difficult problem of acoustic echo cancellation (AEC) in a single-channel scenario in the presence of noise. In order to overcome the major challenge of obtaining a separate reference signal in the adaptive filter-based AEC problem, the delayed version of the echo- and noise-suppressed signal is used as the reference. A modified objective function is thereby derived for a gradient-based adaptive filter algorithm, and proof of its convergence to the optimum Wiener-Hopf solution is established. The output of the AEC block is fed to an acoustic noise cancellation (ANC) block where a spectral subtraction-based algorithm with an adaptive spectral floor estimation is employed. In order to obtain fast but smooth convergence with maximum possible echo and noise suppression, a set of updating constraints is proposed based on various speech characteristics (e.g., energy and correlation) of the reference and current frames, considering whether they are voiced, unvoiced, or pause. Extensive experimentation is carried out on several echo- and noise-corrupted natural utterances taken from the TIMIT database, and it is found that the proposed scheme can significantly reduce the effect of both echo and noise in terms of objective and subjective quality measures.
1 Introduction
The phenomenon of acoustic echo occurs when the output speech signal from a loudspeaker is reflected from different surfaces, such as ceilings, walls, and floors, and is then fed back to the microphone. In the worst case, acoustic echo can cause howling of a significant portion of sound energy [1, 2]. In real-life applications, such as a lecture in a large conference hall or the public address system of a trade fair, the presence of acoustic echo along with environmental noise is very common; it degrades the speech quality and can even lead to complete loss of intelligibility.
In order to deal with the problem of acoustic echo cancellation (AEC), echo suppressors, earphones, and directional microphones have conventionally been used, which generally place restrictions on the talkers' movement [2]. As an alternative to such hardware-based solutions, adaptive filter algorithms are widely applied, where, apart from the input channel, a separate echo-free reference channel is required [3–13]. Among different adaptive filter algorithms, the least mean squares (LMS) algorithm and its variants are very popular for their satisfactory performance and low computational burden [4, 10, 12–14]. Besides these algorithms, the recursive least squares (RLS) algorithm is well known for its fast convergence at the expense of computational complexity [13]. Adaptive filter algorithms have also been used for acoustic noise cancellation (ANC) [15].
There are some methods that deal with both acoustic echo and noise cancellation (AENC) [16–18]. The echo canceller used in [16] utilizes a subband noise cancellation scheme. In [17], echo cancellation is done by an adaptive LMS filter while a linear prediction error filter removes the residual echo and noise. In [18], a single Wiener filter is employed to simultaneously suppress the echo and noise. It is to be mentioned that all these AENC methods employ more than one microphone, whereas solutions using a single microphone are preferable in most real-life applications.
In this paper, an AENC scheme is proposed which can efficiently deal with the single-channel scenario. First, unlike the conventional LMS algorithm, a gradient-based adaptive LMS algorithm is developed for single-channel AEC, in which the delayed version of the previously echo- and noise-suppressed signal is used as the reference. Preliminary results obtained by using this idea are reported in [19]. In the current paper, however, an analytical proof of convergence towards the optimum Wiener-Hopf solution is presented. Next, a single-channel ANC algorithm based on spectral subtraction with an adaptive spectral floor estimation is developed, which reduces not only the effect of noise but also some residual echo. Finally, by analyzing different speech characteristics of the reference and current frames, multi-conditional updating constraints are proposed in order to obtain precise control over the convergence characteristics. For performance evaluation, extensive experimentation is conducted on several real-life echo- and noise-corrupted speech signals in different acoustic environments.
2 Problem formulation
In order to formulate the problem of single-channel AENC, a dual-channel AENC scheme is first presented in Figure 1 (according to [17]) for a better understanding. Here, s_{1}(n) and s_{2}(n) are speech signals corresponding to the near-end and far-end speakers, while v_{1}(n) and v_{2}(n) are the respective additive noises. The noise-corrupted far-end signal (s_{2}(n)+v_{2}(n)) is played through a loudspeaker in the near-end acoustic room environment, and the echo signal x_{2}(n) is generated. Thus, the input y_{1}(n) to the near-end microphone is given by
$$y_{1}(n)=s_{1}(n)+v_{1}(n)+x_{2}(n). \tag{1}$$
The task of the adaptive filter-based AEC block placed at the near end is to produce an estimate $\hat{x}_{2}(n)$ of the echo x_{2}(n) by minimizing the error between the microphone input and the echo estimate, $y_{1}(n)-\hat{x}_{2}(n)$.
Two key features of the dual-channel system are (i) the availability of a separate reference signal required for the adaptive filter, here, for example, the delayed version of (s_{2}(n)+v_{2}(n)), and (ii) different speakers for the input and echo signals. Moreover, the use of a double-talk detector (DTD) helps in controlling the update process. Unfortunately, these features are absent in the single-channel scenario shown in Figure 2. Instead of two speakers, in this case, the microphone receives the input s(n) corrupted by noise v(n) and the echo generated from the same speaker.
In the presence of noise v(n), the sole microphone input signal in the single-channel scenario is given by
$$y(n)=s(n)+v(n)+x_{s}(n)+x_{v}(n), \tag{3}$$
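where x_{s}(n) and x_{v}(n) denote the echo of the input speech and noise, respectively. The echo signals can be expressed as
$$x_{s}(n)=\mathbf{a}_{n}^{T}\mathbf{s}(n-k_{0}), \tag{4}$$
$$x_{v}(n)=\mathbf{a}_{n}^{T}\mathbf{v}(n-k_{0}), \tag{5}$$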
where $\mathbf{s}(n-k_{0})=[s(n-k_{0}-1),s(n-k_{0}-2),\dots,s(n-k_{0}-p)]^{T}$ and $\mathbf{v}(n-k_{0})=[v(n-k_{0}-1),v(n-k_{0}-2),\dots,v(n-k_{0}-p)]^{T}$, with k_{0} being a predefined flat delay, and $\mathbf{a}_{n}=[a_{n}(1),a_{n}(2),\dots,a_{n}(p)]^{T}$ consists of the coefficients corresponding to the acoustic room transfer function A(z). The order p and the coefficient values of A(z) depend on the room characteristics. It is to be noted that in this case, there is no scope for obtaining a separate echo-free reference or a separate noise-only reference, which makes the single-channel AENC problem extremely difficult to handle.
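To make the signal model concrete, the following minimal Python sketch builds the microphone input from a speech sequence, a noise sequence, and a room coefficient vector. It assumes a time-invariant coefficient vector and zero signal history before n=0; the function name `simulate_microphone` and its argument names are illustrative and are not taken from the original work.

```python
def simulate_microphone(s, v, a, k0):
    """Return y(n) = s(n) + v(n) + echo(n) for a single-channel setup.

    s, v : lists of speech and noise samples (same length)
    a    : room coefficients [a(1), ..., a(p)] (assumed time-invariant here)
    k0   : flat delay in samples
    """
    p = len(a)
    y = []
    for n in range(len(s)):
        echo = 0.0
        for k in range(1, p + 1):
            idx = n - k0 - k
            if idx >= 0:
                # a(k) weights the noisy sample emitted k0 + k samples earlier
                echo += a[k - 1] * (s[idx] + v[idx])
        y.append(s[n] + v[n] + echo)
    return y
```

For instance, a unit impulse at n=0 with `a=[0.5, 0.25]` and `k0=2` reappears at n=3 and n=4, scaled by the two coefficients.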
3 Proposed single-channel AENC scheme
3.1 Proposed two-stage setup
In Figure 3a, a simple block diagram showing the two stages of the proposed AENC scheme is presented, and in Figure 3b, the adaptive filter-based AEC algorithm involved in the first stage is shown in more detail. Similar to Figure 2, the input to the microphone y(n) can be described by (3). In single-channel AEC, for example, while delivering a lecture in a large conference hall, the microphone in front of the speaker receives the input speech s(n) corrupted by v(n). Once this noise-corrupted speech is transmitted through the loudspeaker, an echo signal is generated, and thus, after some initial time delay, the microphone receives the noise-corrupted speech together with the echo of previously uttered speech. The task of AEC is to cancel the echo part from this input by using an adaptive filter algorithm. In order to adaptively obtain an estimate $\hat{x}_{s}(n)+\hat{x}_{v}(n)$ of the echo signal, we propose to utilize delayed versions of the previously echo-suppressed samples of the noisy speech as the reference signal [19]. A hat symbol on a variable indicates an estimated value. The error signal e(n) thus obtained is given by
$$e(n)=y(n)-\hat{x}_{s}(n)-\hat{x}_{v}(n). \tag{6}$$
The estimate of the echo signal can be expressed as
where $\hat{\mathbf{w}}_{n}=[\hat{w}_{n}(1),\hat{w}_{n}(2),\dots,\hat{w}_{n}(p)]^{T}$ is the estimated coefficient vector. The task of the adaptive filter is to obtain an optimum $\hat{\mathbf{w}}_{n}$ by minimizing the error in (6), i.e.,
$$e(n)=s(n)+v(n)+\delta_{s}(n)+\delta_{v}(n), \tag{8}$$
where $\delta_{s}(n)=x_{s}(n)-\hat{x}_{s}(n)$ and $\delta_{v}(n)=x_{v}(n)-\hat{x}_{v}(n)$ are the residual echoes of the speech and noise portions of the input signal, respectively, and it is assumed that these signals exhibit the properties of white Gaussian noise. Next, e(n) is passed through a spectral subtraction-based single-channel ANC block, which produces an output $\tilde{s}(n)\approx s(n)+\Psi(n)$ that closely resembles s(n) provided that the residual echo-noise portion Ψ(n) becomes very small.
It is to be noted that the task of noise reduction, unlike in the proposed AENC scheme, may be carried out prior to the AEC block. However, because of possible nonlinearities introduced by the prior noise reduction block, no proper reference would be available for the single-channel AEC block [17]. Hence, the arrangement shown in Figure 3a is adopted, in which the noise reduction block also serves as a post-processor for attenuating the residual echo.
3.2 Development of proposed gradient-based single-channel LMS AEC scheme
A delayed version of the adaptive filter output e(n) is used as the reference signal, and from (8), the filter output e(n) can be written as
$$e(n)=\hat{s}(n)+\hat{v}(n), \tag{9}$$
where $\hat{s}(n)=s(n)+\delta_{s}(n)$ and $\hat{v}(n)=v(n)+\delta_{v}(n)$. The objective function of the adaptive filter is the mean square value of the error, and using (6) it can be written as
where E{.} denotes the expectation operator. In (10), the basic definition of the cross-correlation operation is used; for example, the cross-correlation function between s(n) and v(n) is defined as
$$r_{sv}(m)=E\{s(n)v(n-m)\}, \tag{11}$$
where m denotes the lag. Using (4), (5), (7), and the above definition, the last term of (10) can be expressed as
Here, r_{ss}(k_{0}+k) corresponds to the (k_{0}+k)th lag of the cross-correlation between s(n) and its previous samples s(n−k_{0}−k), and r_{sv}(k_{0}+k) corresponds to the (k_{0}+k)th lag of the cross-correlation between s(n) and v(n−k_{0}−k). In a similar way, r_{vs}(k_{0}+k), r_{vv}(k_{0}+k), $r_{s\delta_{s}}(k_{0}+k)$, $r_{s\delta_{v}}(k_{0}+k)$, $r_{v\delta_{s}}(k_{0}+k)$, and $r_{v\delta_{v}}(k_{0}+k)$ can be defined. It is well known that the value of the cross-correlation decreases rapidly with increasing lag when two signals are uncorrelated. In the ideal case, the cross-correlation function between two random noise signals would be nonzero only at the zero lag. Since v(n) is assumed to be white Gaussian noise and, generally, the value of k_{0} is very large, the effect of the terms r_{sv}(k_{0}+k), r_{vs}(k_{0}+k), and r_{vv}(k_{0}+k) in (12) can be neglected. Moreover, because of the noise-like characteristics of δ_{s}(n) and δ_{v}(n), one can also neglect $r_{s\delta_{v}}(k_{0}+k)$, $r_{v\delta_{s}}(k_{0}+k)$, and $r_{v\delta_{v}}(k_{0}+k)$ in (12). Hence, it can easily be seen that optimal filter performance occurs when r_{ss}(n) is minimum, i.e., the least possible correlation between s(n−k_{0}−k) and s(n) is desired. As a result, (10) reduces to
Here, the magnitude of r_{ss}(k_{0}+k) strongly depends on the speech characteristics and the amount of flat delay k_{0}. For a reasonably large k_{0}, the effect of r_{ss}(k_{0}+k) in (13) can be neglected, and minimization of (13) results in
Hence, we obtain
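The above equation is similar to the Wiener-Hopf equation, and its solution can be written as
$$\hat{\mathbf{w}}_{n}=\mathbf{R}_{(s+v)(s+v)}^{-1}(n-k_{0})\,\mathbf{r}_{(x_{s}+x_{v})(s+v)}(n-k_{0}), \tag{16}$$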
where $\mathbf{r}_{(x_{s}+x_{v})(s+v)}(n-k_{0})$ consists of different lags of the cross-correlation between the echo signal x_{s}(n)+x_{v}(n) and the noisy input signal s(n)+v(n), while R_{(s+v)(s+v)} is the autocorrelation matrix of s(n)+v(n). This $\hat{\mathbf{w}}_{n}$ is the optimum solution in the mean square sense. Hence, it is shown that even for a single-channel noise-corrupted AEC problem, the optimum solution $\hat{\mathbf{w}}_{n}$ can be achieved under the assumptions stated earlier.
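For iterative estimation of the optimal filter coefficients, the adaptive LMS algorithm is very popular. It is fast and efficient, and it does not require any correlation measurements or matrix inversion [13]. The update equation of the LMS adaptive algorithm is generally expressed as
$$\hat{\mathbf{w}}_{n+1}=\hat{\mathbf{w}}_{n}-\mu\nabla\xi(n), \tag{17}$$
where μ is the step factor controlling the stability and rate of convergence, ξ(n) is the cost function, and ∇ is the gradient operator. The LMS algorithm simply approximates the mean square error by the square of the instantaneous error, i.e., ξ(n)=e^{2}(n), and therefore, from (6) and (7), the gradient of ξ(n) can be expressed as
$$\nabla\xi(n)=-2e(n)\left[\hat{\mathbf{s}}(n-k_{0})+\hat{\mathbf{v}}(n-k_{0})\right],$$
where $\hat{\mathbf{s}}(n-k_{0})+\hat{\mathbf{v}}(n-k_{0})$ denotes the reference vector formed from the delayed samples of the filter output e(n). Thus, the update equation for the proposed single-channel LMS adaptive scheme can be written as
$$\hat{\mathbf{w}}_{n+1}=\hat{\mathbf{w}}_{n}+2\mu e(n)\left[\hat{\mathbf{s}}(n-k_{0})+\hat{\mathbf{v}}(n-k_{0})\right]. \tag{18}$$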
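The single-channel LMS recursion described in this subsection can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the reference vector is taken as the p samples of the filter output e(n) delayed by k_{0}+1 to k_{0}+p, the history before n=0 is zero, the weight step is 2μe(n) times the reference vector, and none of the updating constraints of Section 4 are applied. The function name `single_channel_lms` is hypothetical.

```python
def single_channel_lms(y, p, k0, mu):
    """Echo-suppress y(n) using the delayed filter output e(n) as reference.

    Returns (e, w): the echo-suppressed (error) signal and the final weights.
    """
    w = [0.0] * p
    e = [0.0] * len(y)
    for n in range(len(y)):
        # Reference vector [e(n-k0-1), ..., e(n-k0-p)], zero before the start
        ref = [e[n - k0 - k] if n - k0 - k >= 0 else 0.0
               for k in range(1, p + 1)]
        # Echo estimate from the current weights
        echo_hat = sum(wk * rk for wk, rk in zip(w, ref))
        e[n] = y[n] - echo_hat
        # Stochastic-gradient step: w <- w + 2*mu*e(n)*ref
        w = [wk + 2.0 * mu * e[n] * rk for wk, rk in zip(w, ref)]
    return e, w
```

Because the reference is derived from the filter's own output, no second channel is needed; the price is that early, poorly suppressed samples feed back into later updates, which is one motivation for a small step size.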
3.3 Convergence analysis of the proposed AEC scheme
Considering expectation operation on both sides of the update Eq. 18, one can obtain
Here, an underline beneath $\hat{\mathbf{w}}_{n}$ is introduced to represent the expected value $E\{\hat{\mathbf{w}}_{n}\}$. For the kth unknown weight (where k=1,2,…,p), using (6) and neglecting the effect of r_{ss}(n), as already discussed in the previous subsection, the last term of (19) can be written as
Based on the assumptions on the cross-correlation terms stated in the previous subsection, one can obtain
Using (21), the update Eq. 19 can be written as
Evaluating the homogeneous and particular solutions of (22), the total solution can be obtained as (see Appendix)
where λ(k) is the kth diagonal element of the eigenvalue matrix obtained by eigenvalue decomposition of R_{(s+v)(s+v)}(n−k_{0}) and r^{U}(n−k_{0}−k) is the kth element of $\mathbf{U}^{T}\mathbf{r}_{(x_{s}+x_{v})(s+v)}(n-k_{0})=\mathbf{r}_{(x_{s}+x_{v})(s+v)}^{U}(n-k_{0})$, with the matrix U consisting of the eigenvectors corresponding to the eigenvalues. Since the homogeneous part (1−2μλ(k))^{n} diminishes with iterations in the iterative update procedure, (23) can be expressed in matrix form as
Thus, it is found that, with an increasing number of iterations, the average value of the weight vector converges to the Wiener-Hopf solution, which is the optimum solution.
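The decay of the homogeneous part can be checked numerically for a single decoupled mode. The sketch below assumes that, in the eigenvector coordinates, each mean weight obeys the scalar recursion w ← (1−2μλ)w + 2μr, which is the standard form consistent with the factor (1−2μλ(k))^{n} quoted above; the function name and the sample values are illustrative only.

```python
def mean_weight_trajectory(lam, r_u, mu, steps, w0=0.0):
    """Iterate w <- (1 - 2*mu*lam) * w + 2*mu*r_u and return all iterates.

    lam : eigenvalue of the reference autocorrelation matrix for this mode
    r_u : corresponding element of the rotated cross-correlation vector
    """
    w = w0
    trajectory = [w]
    for _ in range(steps):
        w = (1.0 - 2.0 * mu * lam) * w + 2.0 * mu * r_u
        trajectory.append(w)
    return trajectory
```

With |1−2μλ|<1, the iterates approach r_u/λ, the Wiener-Hopf value for that mode; a step size with 2μλ≥2 makes the same recursion diverge, which is why μ bounds stability as well as speed.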
3.4 Noise reduction in spectral domain
In the proposed AENC scheme, the ANC block operates frame by frame for noise reduction based on the single-channel spectral subtraction algorithm [20–22]. According to (9), for the ith frame, the error signal for the duration of a frame length can be written as
Corresponding frequency domain representation is given by
The magnitude squared spectrum of ${\hat{s}}_{i}\left(n\right)$ can be written as
It is desired to choose an estimate $\tilde{S}_{i}(\omega)$ that will minimize
Since the noise is assumed to be zero mean and uncorrelated with the signal, the expected values of the last two terms of (27) can be neglected. Thus, (28) can be expressed as
This expression of Err_{i}(ω) can be minimized by choosing
With an estimate of the noise spectrum $E\{|\hat{V}_{i}(\omega)|^{2}\}$, the signal spectrum $\tilde{S}_{i}(\omega)$ can be computed as
where the phase (arg[E_{i}(ω)]) is generally assumed to be the phase of the noise-corrupted signal without causing significant degradation in terms of loss of intelligibility of the speech signal [20]. It can be seen that an estimate of the magnitude spectrum $|\tilde{S}_{i}(\omega)|$ of the signal can be obtained provided an estimate of the noise spectrum $E\{|\hat{V}_{i}(\omega)|^{2}\}$ is available, which is generally computed during the periods when speech is known a priori not to be present.
The final output of the AENC system is the speech frame $\tilde{s}_{i}(n)$, which consists of the original speech s_{i}(n) and a negligible amount of noise-like signal Ψ_{i}(n). The signal Ψ_{i}(n), although very weak, may contain some signature of the input noise v(n), the residual echo δ_{s}(n), and the residual noise δ_{v}(n). In order to overcome the problem of musical noise and to avoid the speech distortion caused by speech subtraction, an overestimate of the noise power spectrum can be subtracted carefully in (31) such that the spectral floor is preserved [21]. Thus, (31) can be modified as
Here, α_{ss} is the subtraction factor and β_{ss} is the spectral floor parameter with α_{ss}≥1 and 0≤β_{ss}≤1. The task of noise power spectral density estimation is carried out based on the minimum statistics noise estimator proposed in [23], which can handle the time-varying nature of the noise.
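The over-subtraction rule with a spectral floor can be illustrated per frequency bin as follows. The sketch assumes the commonly used form in which α_{ss} times the noise power estimate is subtracted from the error power and the result is never allowed below β_{ss} times the noise power; the exact expression of the original equation is not reproduced here, and the default values of `alpha_ss` and `beta_ss` are hypothetical.

```python
def subtract_with_floor(err_power, noise_power, alpha_ss=2.0, beta_ss=0.01):
    """Per-bin power spectral subtraction with over-subtraction and a floor.

    err_power   : power spectrum of the AEC output frame, one value per bin
    noise_power : estimated noise power spectrum, one value per bin
    alpha_ss    : subtraction factor (>= 1), illustrative default
    beta_ss     : spectral floor parameter in [0, 1], illustrative default
    """
    cleaned = []
    for e_pow, n_pow in zip(err_power, noise_power):
        diff = e_pow - alpha_ss * n_pow   # over-subtracted estimate
        floor = beta_ss * n_pow           # spectral floor for this bin
        cleaned.append(diff if diff > floor else floor)
    return cleaned
```

The enhanced magnitude is the square root of each returned value, and the frame is resynthesized with the phase of the noisy input, as described above. Keeping a nonzero floor masks the isolated spectral peaks that are otherwise heard as musical noise.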
4 Development of adaptive update constraints
The AEC part of the proposed AENC scheme may suffer from some common problems of adaptive filter-based algorithms, such as slow convergence rate and fluctuation around the desired estimates, especially in practical cases where the assumption on negligibility of the cross-correlation terms (stated in the previous section) may not strictly hold. In order to overcome such problems, some updating constraints are proposed based on the following speech characteristics:

(i) The level of cross-correlation
(ii) The amount of signal power
(iii) The mean square error (MSE) between consecutive estimates of the unknown filter coefficients.
Through extensive experimentation on different speech frames, it is found that the negligibility of the cross-correlation terms r_{ss}(n), $r_{s\delta_{v}}(n)$, $r_{v\delta_{s}}(n)$, and $r_{v\delta_{v}}(n)$ (as described after (12)) strongly depends on the voicing characteristics of the speech frames and the input noise. Because of the inherent periodicity of voiced speech, the degree of cross-correlation between two voiced speech frames of a person is higher than that between two unvoiced speech frames, which are random in nature. Regarding signal power, the ratio of the power of a voiced speech frame to that of an unvoiced speech frame is found to be higher than the ratio between two voiced speech frames. As white Gaussian noise is considered, the degree of cross-correlation between the speech and noise is found to be negligible, and the noise powers in two different frames may not differ significantly. As a result, the effect of input noise on the power ratio is found to be negligible.
For a flat delay of k_{0} samples, the initial k_{0} samples of the utterance s(n)+v(n) can be treated as a reference signal (echo-free signal) responsible for the generation of the echo signal that corrupts the current samples at or after k_{0} samples. Considering a window of M samples with M≪k_{0}, the power of the reference signal $\hat{s}(n-k_{0})+\hat{v}(n-k_{0})$ can be computed as
For a window of the last M samples of the echo-suppressed speech signal $\hat{s}(n)$, the average power P_{sup}(n) can be computed as
The ratio of P_{ref}(n) and P_{sup}(n) is denoted as the power ratio P_{rs}(n) and considered as one of the control characteristics.
Another important characteristic criterion is the correlation coefficient C_{rs}(n) between a frame of the noisy reference signal $\hat{s}(n-k_{0})+\hat{v}(n-k_{0})$ and a frame of the current noisy signal $\hat{s}(n)+\hat{v}(n)$. For a frame length of M samples, the correlation coefficient C_{rs}(n) is defined as
where −M/2≤i≤M/2−1 and 0≤j≤(M−1).
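The two frame-level control quantities can be sketched as below. The average power follows the description in the text directly. For the correlation coefficient, the sketch uses the magnitude of the zero-lag normalized cross-correlation of two equal-length frames, which is a simplifying assumption: the original definition also involves a lag index, and that form is not reproduced here. All function names and the guard constant `eps` are illustrative.

```python
import math

def frame_power(frame):
    """Average power of a frame of M samples."""
    return sum(x * x for x in frame) / len(frame)

def power_ratio(ref_frame, cur_frame, eps=1e-12):
    """P_rs = P_ref / P_sup; eps guards against division by zero in silence."""
    return frame_power(ref_frame) / (frame_power(cur_frame) + eps)

def correlation_coefficient(ref_frame, cur_frame, eps=1e-12):
    """Magnitude of the zero-lag normalized cross-correlation of two frames."""
    num = sum(r * c for r, c in zip(ref_frame, cur_frame))
    den = math.sqrt(sum(r * r for r in ref_frame) *
                    sum(c * c for c in cur_frame))
    return abs(num) / (den + eps)
```

Two scaled copies of the same frame give a coefficient close to 1, while frames with no common component give a value close to 0, matching the voiced-voiced versus voiced-unvoiced behavior discussed in this section.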
Finally, the parameter estimation accuracy is also considered for the purpose of analyzing the convergence property. In this regard, the mean square error MSE_{ideal}(n) between the values of estimated coefficients ${\hat{w}}_{n}$ and those of true coefficients a_{ n } is computed as
In Figure 4, considering a real-life speech utterance of 250 ms corrupted by echo and noise, the behavior of the control parameters obtained by using (33), (34), (35), and (36) is shown. The speech utterance (/iy/−/ix/) contains a voiced phoneme followed by another voiced phoneme [24]. Here, k_{0}=1,000, M=100, N_{f}=1,002, a sampling frequency of 16 kHz, and SNR = 15 dB are used.
In a similar fashion, a speech utterance consisting of a voiced phoneme /ih/ followed by an unvoiced phoneme /sh/ is considered in Figure 5, and a voiced phoneme /ih/ followed by a pause is considered in Figure 6. It is observed that the characteristic parameters vary depending on the nature of the reference and current frames. When the current frame is a pause or weakly unvoiced, the power ratio becomes higher than when the current frame is voiced. In contrast, the correlation coefficient becomes smaller when measured between a voiced and an unvoiced frame, but it becomes considerably larger when measured between two voiced frames. It is also found that the presence of a voiced frame as the reference strongly governs the rate of convergence and the estimation error of the proposed LMS algorithm. In Figure 4, because voiced frames are present throughout as both the reference and the current frame, the convergence performance is not very satisfactory and the estimation error is relatively higher. On the other hand, in Figure 6, it is observed that when the current frame is a pause, even in the presence of a voiced reference frame, very fast convergence is obtained with little estimation error. In Figure 5, as the current frame is unvoiced instead of a pause, comparatively slower convergence is observed with higher estimation error.
Next, in Figures 7, 8, 9, the reference frame is unvoiced, and in Figures 10, 11, 12, it is a pause. When the reference frame is unvoiced, there is little correlation between the current and reference frames, and the convergence performance of the proposed LMS algorithm is found to be quite satisfactory irrespective of the power of the reference signal (strongly or weakly unvoiced). When the current frame is a pause, no matter whether the reference frame is voiced or unvoiced, fast convergence with high estimation accuracy is achieved using the proposed LMS algorithm. The reasons are (i) negligible cross-correlation between the reference frame and the current frame and (ii) a comparatively higher power ratio. In Figures 10, 11, 12, it is observed that even when the reference frame is a pause or stop, it may contain significant energy because of the presence of additive white noise. In these cases, a reasonable estimate of the room response can be obtained given that the noise power is quite high. Findings in the above cases are summarized in Table 1.
First of all, it is observed that better convergence in terms of iterations and estimation error is obtained when the current frame is a pause (P) or stop and the reference frame is either voiced (V) or unvoiced (U), namely, VP and UP. This leads to the decision that the updating needs to be carried out at a high level of the power ratio, i.e.,
$$P_{\mathrm{rs}}(n)=\frac{P_{\mathrm{ref}}(n)}{P_{\mathrm{sup}}(n)}\ge\zeta, \tag{37}$$
where P_{ref}(n) and P_{sup}(n) are defined in (33) and (34), respectively. If the value of the lower bound ζ is chosen too large, the updating would be postponed in most instances, resulting in very slow convergence. On the other hand, a very small value of ζ may cause more frequent updates, where the possibility of wrong estimates of the filter coefficients would be higher, especially in VP, UP, and PP cases. It is to be noted that considering only a lower bound on P_{rs}(n) may not always be sufficient to ensure that the reference frame possesses significant energy. For example, Figure 13 shows that a high value of P_{rs}(n) may arise (marked block in the figure) from an initial silence frame where only a very small amount of noise is present. In order to prevent updating in these situations, a lower bound β on the power of the reference frame is employed, i.e., P_{ref}(n)≥β. The value of β should exceed the power of speech pauses and ensure that the LMS update is postponed even if a frame of speech containing a partial pause is available as the reference. Hence, the first constraint for updating the algorithm is proposed as Condition I: P_{rs}(n)≥ζ and P_{ref}(n)≥β.
In some cases, it is observed that although the power ratio is very small, quite satisfactory updating is obtained, such as the UV case shown in Figure 7. Another characteristic observed here is a lower value of the correlation coefficient C_{rs}(n) together with a higher value of P_{ref}(n). It is to be mentioned that the proposed AEC algorithm is developed on the assumption that the cross-correlation between the current frame and the reference frame is negligible. However, since both the reference and current frames may belong to the same person, in the case of a high degree of correlation, the adaptive algorithm would try to suppress a portion of the speech from the echo-corrupted signal, resulting in unusual degradation in convergence performance. Hence, introducing an upper bound on C_{rs}(n), the second condition is proposed as Condition II: C_{rs}(n)≤Υ_{1} and P_{ref}(n)≥β.
The presence of a certain level of noise can be utilized as an advantage in pause instances, where updating is generally not performed. Since the noise is considered uncorrelated with its own delayed versions, updating at frames where only noise is present would be quite satisfactory. In this case, the value of C_{rs}(n) must be very small, and thus another condition on updating is proposed as Condition III: C_{rs}(n)≤Υ_{2}≤Υ_{1}.
Another important factor is the MSE of the estimations of successive iterations, which is defined as
In order to continue the updating, an upper bound on the variation of successive estimates is set as the following condition: Condition IV: e_{coeff}(n)≤ℵ.
Considering smaller values of e_{coeff}(n) makes it possible to avoid updating at those instances where abrupt and significant changes occur in the estimated coefficients. In the proposed method, in order to carry out the LMS update, at least one of the above four conditions must be fulfilled.
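The four updating conditions can be combined into a single gate that decides whether the LMS update is carried out at a given instant. In the sketch below, the defaults for ζ, Υ_{1}, Υ_{2}, and ℵ are the values reported in Section 5, whereas the default for β is purely hypothetical, since the text only states that β is a small percentage of the power of a regular voiced frame. The function and parameter names are illustrative.

```python
def should_update(p_rs, p_ref, c_rs, e_coeff,
                  zeta=2.0, beta=0.01,
                  upsilon1=0.25, upsilon2=0.1, aleph=0.7e-4):
    """Return True when at least one of Conditions I to IV is fulfilled.

    p_rs    : power ratio P_ref / P_sup
    p_ref   : power of the reference frame
    c_rs    : correlation coefficient between reference and current frames
    e_coeff : MSE between successive coefficient estimates
    beta    : lower bound on p_ref (hypothetical default, signal dependent)
    """
    cond1 = p_rs >= zeta and p_ref >= beta      # Condition I
    cond2 = c_rs <= upsilon1 and p_ref >= beta  # Condition II
    cond3 = c_rs <= upsilon2                    # Condition III
    cond4 = e_coeff <= aleph                    # Condition IV
    return cond1 or cond2 or cond3 or cond4
```

In an adaptive loop, the weight step would simply be skipped whenever this function returns False, while the echo estimate and the error signal are still computed with the most recent weights.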
5 Simulation results and comments
The performance of the proposed algorithm is investigated in different echo-generating environments at various input noise levels, considering several male and female utterances available in the TIMIT database [24]. An acoustic room environment is simulated using an FIR filter of length N_{f}, where, as per conventional approaches, the filter coefficients during the flat delay portion are assumed to be zero. The flat delay time (k_{0}) can be precalculated based on the distance between the microphone and the speaker [25]. Because of the implicit zeros corresponding to the flat delay, only a small number (N_{f}−k_{0}) of unknown coefficients have to be determined. In the proposed method, a small step size is used to obtain smooth convergence.
First, a subjective evaluation is carried out based on feedback about the quality of the echo- and noise-suppressed signal provided by five individual listeners in different noisy echo-generating environments. From the overall response of the listeners in terms of the mean opinion score (MOS), a very satisfactory performance of the proposed method is obtained even under severe echo-generating conditions in noise.
Next, two objective measures, namely, echo return loss enhancement (ERLE) and signal-to-distortion ratio (SDR), are employed. The ERLE is defined as the ratio of the instantaneous power of the input echo signal η_{x}(n) to that of the residual echo signal η_{ς}(n) and is expressed in dB as [1]
The average value of ERLE(n) over time is considered. The input and output SDRs in dB are respectively defined as
where P_{s} is the power of the original signal s(n), P_{x+v} is the power of the microphone input, and $P_{\hat{s}+\hat{v}-s}(n)$ is the power of the distortion present in the echo-suppressed output signal. The SDR improvement is given by
which indicates the overall distortion removal.
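The objective measures can be computed from sample sequences as sketched below. The sketch assumes the usual convention that ERLE is the input echo power divided by the residual echo power, that the distortion in each SDR is the difference between the observed signal and the clean speech, and that the SDR improvement is the output SDR minus the input SDR; it uses average rather than instantaneous powers, and all function names are illustrative.

```python
import math

def power(x):
    """Average power of a sample sequence."""
    return sum(v * v for v in x) / len(x)

def erle_db(echo_in, echo_residual):
    """ERLE in dB: input echo power over residual echo power."""
    return 10.0 * math.log10(power(echo_in) / power(echo_residual))

def sdr_db(clean, observed):
    """SDR in dB, with distortion taken as observed minus clean."""
    distortion = [o - c for o, c in zip(observed, clean)]
    return 10.0 * math.log10(power(clean) / power(distortion))

def sdr_improvement_db(clean, mic_input, output):
    """Output SDR minus input SDR, in dB."""
    return sdr_db(clean, output) - sdr_db(clean, mic_input)
```

Both measures require access to the clean speech and the true echo, so they apply to simulated conditions such as those described in this section rather than to live recordings.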
The proposed algorithm has been tested on several different sentences taken from the TIMIT database. In order to demonstrate the principle of selecting the different threshold values required in the proposed updating constraints, a sample utterance 'Good service should be rewarded by big tips' is shown in Figure 14 as a typical example [24]. Voicing decisions are marked in the figure as 'P' for pause, 'V' for voiced, and 'U' for unvoiced. Considering white Gaussian noise with SNR = 15 dB, N_{f}=1,002, k_{0}=1,000, and M=100, Figure 14b,c,d,e shows P_{rs}(n), P_{ref}(n), C_{rs}(n), and MSE_{ideal}(n), respectively. Note that in this case, the proposed algorithm is used without the update constraints, and thus, MSE_{ideal}(n) exhibits some higher values. The comments provided in Table 1 can be better visualized from the different marked zones of this figure. From extensive experimentation, it is found that a better update requires P_{ref}(n) to be at least twice P_{sup}(n), and a small percentage (1% to 5%) of the power of a regular voiced frame can be chosen as the lower bound β for P_{ref}(n). Analyzing C_{rs}(n) in different speech frames, Υ_{1} in Condition II is chosen as 0.25 to ensure that no speech is suppressed during the update procedure by confusing it with the echo, and Υ_{2} is kept very small, i.e., Υ_{2}≈0.1, to allow updating in cases where there is no correlation or extremely low correlation between the reference signal and the echo-suppressed signal. The value of the threshold ℵ for e_{coeff}(n) in Condition IV is chosen to be very small (0.7×10^{−4}) such that there is no update of the LMS algorithm when the magnitude of e_{coeff}(n) is comparatively much larger.
In order to show the effectiveness of the proposed conditions, the MSE_{ideal}(n) obtained in Figure 14e is redrawn in Figure 15. In Figure 15, the effect of incorporating the conditions is shown. It is vividly observed from Figure 15 that by employing the proposed conditions, the convergence is improved to a greater extent. Moreover, in order to demonstrate the performance in frequency domain, spectrograms of the original signal, echo and noisecorrupted signal, and the output of the proposed AENC block are depicted in Figure 16a,b, respectively. For convenience, some zones are marked on the spectrograms where significant reduction in echo and noise can easily be observed. For a better understanding, another TIMIT utterance ‘She had your dark suit in greasy wash water all year’, under similar acoustic environment as used in Figure 14, is considered and corresponding echo and noisecorrupted speech signal is shown in Figure 17a. The MSEs obtained by using the proposed method with and without the conditions are presented in Figure 17b,c, which clearly demonstrate the performance improvement in the later case.
In Table 2, the performance of the proposed algorithm with and without applying the conditions is shown in terms of the SDR improvement (dB) and ERLE (dB) for utterance 1. In order to evaluate the performance under different room environments, length (N_{ f }) and parameter values of the room response filter are varied while keeping the input SNR constant to 15 dB. Considering k_{0}=1,000, N_{ f }−k_{0} is varied from 2 to 14. Results shown in the table clearly demonstrate the effectiveness of using the conditions on performance measures; in all cases, higher values of SDR and ERLE are obtained.
In Table 3, the performance of the proposed algorithm with and without applying the conditions is evaluated for input SNR levels ranging from 25 to −5 dB for the first utterance, considering white Gaussian noise and N_{f}=1,014. The proposed method provides satisfactory performance at all SNR levels, and the use of the proposed conditions yields comparatively better performance.
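The two measures reported in Tables 2 and 3 can be illustrated with a minimal sketch. The paper defines SDR and ERLE in an earlier section; the forms below are common textbook definitions and may differ from the authors' exact expressions. The function names (`erle_db`, `sdr_db`, `sdr_improvement_db`) and the toy signals are hypothetical and serve only as an illustration.

```python
import math

def power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def erle_db(echo, residual_echo):
    """Assumed ERLE: echo power before cancellation over residual echo power, in dB."""
    return 10.0 * math.log10(power(echo) / power(residual_echo))

def sdr_db(clean, estimate):
    """Assumed SDR: clean-signal power over the power of (estimate - clean), in dB."""
    distortion = [e - c for c, e in zip(clean, estimate)]
    return 10.0 * math.log10(power(clean) / power(distortion))

def sdr_improvement_db(clean, corrupted, enhanced):
    """SDR of the enhanced signal minus SDR of the corrupted input."""
    return sdr_db(clean, enhanced) - sdr_db(clean, corrupted)

# Toy data (not from the paper): a sinusoid plus a delayed, attenuated copy as echo.
clean = [math.sin(0.1 * n) for n in range(1000)]
echo = [0.5 * math.sin(0.1 * (n - 20)) for n in range(1000)]
corrupted = [c + e for c, e in zip(clean, echo)]

# Pretend a canceller leaves 10% of the echo amplitude behind.
residual = [0.1 * e for e in echo]
enhanced = [c + r for c, r in zip(clean, residual)]

print(erle_db(echo, residual))                         # about 20 dB
print(sdr_improvement_db(clean, corrupted, enhanced))  # about 20 dB
```

In this toy case the echo amplitude is reduced by a factor of 10, so both measures come out at roughly 20 dB; with real signals the two generally differ because residual noise also contributes to the distortion term.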
6 Conclusion
Echo cancellation in the presence of noise, especially in a single-channel environment, is a very challenging task, which has been addressed in this paper. First, the single-channel AEC block is designed based on the gradient-based adaptive LMS filter; to overcome the problem of obtaining a separate reference signal, we propose to use the delayed version of the echo-suppressed signal. This choice of reference signal is justified by a detailed mathematical proof that the estimated filter coefficients reach the optimum Wiener-Hopf solution, and a convergence analysis is carried out. Moreover, to achieve fast and smooth convergence, a set of updating constraints is proposed by analyzing the speech characteristics of different types of speech frames, such as voiced, unvoiced, and pause. In the ANC block, a modified single-channel spectral subtraction method is used for its robust performance. It is shown that the proposed AENC scheme with updating constraints provides satisfactory performance, in terms of SDR and ERLE, under different echo-generating conditions and at various SNR levels.
Appendix
Derivation of the solution of the LMS update
In order to obtain a homogeneous solution of the update Eq. 22, one may consider
Eigenvalue decomposition of the correlation matrix R_{(s+v)(s+v)}(n−k_{0}) results in
where the columns of the matrix U are the eigenvectors corresponding to the eigenvalues that form the diagonal elements of the matrix Λ, and U^{T}U=I. Pre-multiplying both sides of (43) by U^{T} results in
where $\mathbf{U}^{T}\underline{\hat{\mathbf{w}}}_{n}^{T}=\underline{\hat{\mathbf{w}}}_{n}^{T^{U}}$. The kth coefficient of the weight vector can be expressed as
where λ(k) is the kth diagonal element of the eigenvalue matrix obtained by eigenvalue decomposition of R_{(s+v)(s+v)}(n−k_{0}). Hence, the homogeneous solution can be obtained as
where C_{k} is a constant. Next, in order to obtain the particular solution for the kth coefficient, based on (22) one can get
Here, r^{U}(n−k_{0}−k) is the kth element of $\mathbf{U}^{T}\mathbf{r}_{(x_{s}+x_{v})(s+v)}(n-k_{0})=\mathbf{r}_{(x_{s}+x_{v})(s+v)}^{U}(n-k_{0})$. For a particular solution $\hat{w}_{\mathrm{p.s}}=K_{p}r^{U}(n-k_{0}-k)$, (48) can be written as
which leads to $K_{p}=\frac{1}{\lambda(k)}$ and the particular solution
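The steps above can be sketched in equation form as follows. This is a reconstruction under stated assumptions, not the authors' exact expressions: it assumes that the mean update in Eq. 22 has the standard LMS form given in the first line, with a step-size constant μ that may absorb a factor of 2; it abbreviates R_{(s+v)(s+v)}(n−k_{0}) as R and the cross-correlation vector as r; it writes the rotated weight vector as w^{U}_{n}; and it treats r^{U}(n−k_{0}−k) as constant over the iterations.

```latex
\begin{align*}
% Assumed mean-update form standing in for Eq. 22 (mu may absorb a factor of 2)
\mathbf{w}_{n+1} &= \mathbf{w}_{n} + \mu\left(\mathbf{r} - \mathbf{R}\,\mathbf{w}_{n}\right) \\
% Homogeneous part of the update
\mathbf{w}_{n+1} &= \left(\mathbf{I} - \mu\mathbf{R}\right)\mathbf{w}_{n} \\
% Eigenvalue decomposition of the correlation matrix
\mathbf{R} &= \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{T}, \qquad \mathbf{U}^{T}\mathbf{U} = \mathbf{I} \\
% Pre-multiplication by U^T, with w^U = U^T w
\mathbf{w}^{U}_{n+1} &= \left(\mathbf{I} - \mu\boldsymbol{\Lambda}\right)\mathbf{w}^{U}_{n} \\
% k-th coefficient of the rotated weight vector
w^{U}_{n+1}(k) &= \left(1 - \mu\lambda(k)\right) w^{U}_{n}(k) \\
% Homogeneous solution
\hat{w}_{\mathrm{h.s}}(k) &= C_{k}\left(1 - \mu\lambda(k)\right)^{n} \\
% Full recursion for the k-th coefficient, used for the particular solution
w^{U}_{n+1}(k) &= \left(1 - \mu\lambda(k)\right) w^{U}_{n}(k) + \mu\, r^{U}(n-k_{0}-k) \\
% Substituting the trial solution K_p r^U
K_{p}\, r^{U}(n-k_{0}-k) &= \left(1 - \mu\lambda(k)\right) K_{p}\, r^{U}(n-k_{0}-k) + \mu\, r^{U}(n-k_{0}-k)
\;\Rightarrow\; K_{p} = \frac{1}{\lambda(k)} \\
% Particular solution
\hat{w}_{\mathrm{p.s}}(k) &= \frac{r^{U}(n-k_{0}-k)}{\lambda(k)}
\end{align*}
```

Under these assumptions, the result K_{p}=1/λ(k) does not depend on the step-size factor. The homogeneous term decays whenever |1−μλ(k)|<1, leaving the particular solution, which in unrotated form is R^{-1}r, that is, the Wiener-Hopf solution.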
References
1. Vaseghi SV: Advanced Digital Signal Processing and Noise Reduction. Wiley, Chichester; 2000.
2. Kuo SM, Lee BH: Real-Time Digital Signal Processing. Wiley; 2001.
3. Breining C, Dreiseitel P, Hänsler E, Mader A, Nitsch B, Puder H, Schertler T, Schmidt G, Tilp J: Acoustic echo control - an application of very-high-order adaptive filters. IEEE Signal Process. Mag 1999, 16(4):42–69. 10.1109/79.774933
4. Hänsler E: The hands-free telephone problem: an annotated bibliography. Signal Process 1992, 27(3):259–271. 10.1016/0165-1684(92)90074-7
5. Khong AWH, Naylor PA: Stereophonic acoustic echo cancellation employing selective-tap adaptive algorithms. IEEE Trans. Audio, Speech, Lang. Process 2006, 14(3):785–796.
6. Lindstrom F, Schuldt C, Claesson I: An improvement of the two-path algorithm transfer logic for acoustic echo cancellation. IEEE Trans. Audio, Speech, Lang. Process 2007, 15(4):1320–1326.
7. Wu S, Qiu X, Wu M: Stereo acoustic echo cancellation employing frequency-domain preprocessing and adaptive filter. IEEE Trans. Audio, Speech, Lang. Process 2011, 19(3):614–623.
8. Nath R: Adaptive echo cancellation based on a multipath model of acoustic channel. Circuits, Syst. Signal Process., Springer US 2013, 32(4):1673–1698. 10.1007/s00034-012-9529-4
9. Yukawa M, de Lamare RC, Sampaio-Neto R: Efficient acoustic echo cancellation with reduced-rank adaptive filtering based on selective decimation and adaptive interpolation. IEEE Trans. Audio, Speech, Lang. Process 2008, 16(4):696–710.
10. Hänsler E, Schmidt G: Acoustic Echo and Noise Control: a Practical Approach. Wiley, New York; 2004.
11. Myllylä V: Residual echo filter for enhanced acoustic echo control. Signal Process 2006, 86(6):1193–1205. 10.1016/j.sigpro.2005.07.036
12. Topa R, Muresan I, Kirei BS, Homana I: A digital adaptive echo-canceller for room acoustics improvement. Adv. Electrical Comput. Eng 2004, 10:450–453.
13. Haykin S: Adaptive Filter Theory. Prentice-Hall, Inc., Upper Saddle River, NJ; 1996.
14. Schmidt G: Applications of acoustic echo control: an overview. In Proc. Eur. Signal Process. Conf. EUSIPCO, Vienna; 2004:9–16.
15. Widrow B, Glover JRJ, McCool JM, Kaunitz J, Williams CS, Hearn RH, Zeidler JR, Dong JE, Goodlin RC: Adaptive noise cancelling: principles and applications. Proc. IEEE 1975, 63(12):1692–1716.
16. Yasukawa H: An acoustic echo canceller with subband noise cancelling. IEICE Trans. Fundamentals Electron. Commun. Comput. Sci 1992, E75-A(11):1516–1523.
17. Park SJ, Cho CG, Lee C, Youn DH: Integrated echo and noise canceller for hands-free applications. IEEE Trans. Circuits Syst. II: Analog Digital Signal Process 2002, 49(3).
18. Beaugeant C, Turbin V, Scalart P, Gilloire A: New optimal filtering approaches for hands-free telecommunication terminals. Signal Process 1998, 64(1):33–47. 10.1016/S0165-1684(97)00174-6
19. Mahbub U, Fattah SA: Gradient based adaptive filter algorithm for single channel acoustic echo cancellation in noise. In Proc. Int. Conf. Electrical Computer Engineering (ICECE), 2012 7th International Conference On. Dhaka, Bangladesh; 2012:880–883.
20. Boll S: A spectral subtraction algorithm for suppression of acoustic noise in speech. Proc. IEEE Int. Conf. Acoust. Speech, Signal Process. (ICASSP) ’79 1979, 200–203.
21. Berouti M, Schwartz R, Makhoul J: Enhancement of speech corrupted by acoustic noise. IEEE Conf. Acoust. Speech Signal Process. (ICASSP) 1979, 208–211.
22. Lim JS: Evaluation of a correlation subtraction method for enhancing speech degraded by additive white noise. IEEE Trans. Acoust. Speech Signal Process 1978, 26(5):471–472. 10.1109/TASSP.1978.1163129
23. Martin R: Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process 2001, 9(5):504–512. 10.1109/89.928915
24. Garofolo JS, Lamel LF, Fisher WM, Fiscus JG, Pallett DS, Dahlgren NL, Zue V: TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium, Philadelphia; 1993.
25. Guangzeng F, Feng L: A new echo canceller with the estimation of flat delay. In IEEE Region Ten Conf. TENCON 92. Melbourne, Australia; 1992. vol. 1, pp. 1–5, Print ISBN 0-7803-0849-2, DOI 10.1109/TENCON.1992.271995
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Mahbub, U., Fattah, S.A., Zhu, W. et al. Single-channel acoustic echo cancellation in noise based on gradient-based adaptive filtering. J AUDIO SPEECH MUSIC PROC. 2014, 20 (2014) doi:10.1186/1687-4722-2014-20
Keywords
 Adaptive filter
 Convergence analysis
 Echo cancellation
 Least mean squares algorithm
 Noise reduction
 Spectral subtraction
 Single-channel communication