3.1 Proposed two-stage setup
Figure 3a presents a simple block diagram of the two stages of the proposed AENC scheme, and Figure 3b shows in more detail the adaptive filter-based AEC algorithm used in the first stage. As in Figure 2, the input to the microphone y(n) is described by (3). Consider single-channel AEC in, for example, a lecture delivered in a large conference hall: the microphone in front of the speaker receives the input speech s(n) corrupted by the noise v(n). Once this noise-corrupted speech is transmitted through the loudspeaker, an echo signal is generated, so that after some initial time delay the microphone receives both the noise-corrupted speech and the echo of previously uttered speech. The task of the AEC is to cancel the echo part of this input using an adaptive filter algorithm. To obtain an estimate of the echo signal adaptively, we propose to use delayed versions of the previously echo-suppressed samples of the noisy speech as the reference signal [19]. A hat on a variable indicates an estimated value. The error signal e(n) thus obtained is given by
(6)
The estimate of the echo signal can be expressed as
(7)
where is the estimated coefficient vector. The task of the adaptive filter is to obtain an optimum by minimizing the error in (6) i.e.,
(8)
where δ_s(n) and δ_v(n) are the residual echoes of the speech and noise portions of the input signal, respectively, and these signals are assumed to exhibit the properties of white Gaussian noise. Next, e(n) is passed through a spectral subtraction-based single-channel ANC block, which produces an output that closely resembles s(n), provided that the residual echo-noise portion Ψ(n) becomes very small.
It is to be noted that the task of noise reduction, unlike the proposed AENC scheme, may be carried out prior to the AEC block. However, because of possible nonlinearities introduced by the prior noise reduction block, no proper reference would be available for the single-channel AEC block [17]. Hence, the arrangement shown in Figure 3a is adopted, in which the noise reduction block also serves as a post-processor for attenuating the residual echo.
3.2 Development of proposed gradient-based single-channel LMS AEC scheme
We propose to use a delayed version of the adaptive filter output e(n) as the reference signal. From (8), the filter output e(n) can be written as
(9)
where and . The objective function of the adaptive filter involves minimization of the mean square estimation of the error function and using (6) it can be written as
(10)
where E{.} denotes the expectation operator. In (10), the basic definition of cross-correlation is used; for example, the cross-correlation function between s(n) and v(n) is defined as
(11)
where m denotes the lag. Using (4), (5), (7), and the above definition, the last term of (10) can be expressed as
(12)
Here, r_ss(k0+k) corresponds to the (k0+k)th lag of the cross-correlation between s(n) and its previous samples s(n−k0−k), and r_sv(k0+k) corresponds to the (k0+k)th lag of the cross-correlation between s(n) and v(n−k0−k). In a similar way, r_vs(k0+k), r_vv(k0+k), and the corresponding cross-correlation terms involving the residuals δ_s(n) and δ_v(n) can be defined. It is well known that the cross-correlation decreases rapidly with increasing lag when two signals are uncorrelated. In the ideal case, the cross-correlation function between two random noise signals would be nonzero only at the zero lag. Since v(n) is assumed to be white Gaussian noise and the value of k0 is generally very large, the effect of the terms r_sv(k0+k), r_vs(k0+k), and r_vv(k0+k) in (12) can be neglected. Moreover, because of the noise-like characteristics of δ_s(n) and δ_v(n), the cross-correlation terms in (12) that involve these residuals can be neglected as well. Hence, optimal filter performance occurs when r_ss(n) is minimum, i.e., the least possible correlation between s(n−k0−k) and s(n) is desired. As a result, (10) reduces to
(13)
Here, the magnitude of r_ss(k0+k) strongly depends on the speech characteristics and the amount of flat delay k0. For a reasonably large k0, the effect of r_ss(k0+k) in (13) can be neglected, and minimization of (13) results in
(14)
Hence, we obtain
(15)
The above equation is similar to the Wiener-Hopf equation, and its solution can be written as
(16)
where the cross-correlation vector in (16) consists of different lags of the cross-correlation between the echo signal x_s(n)+x_v(n) and the noisy input signal s(n)+v(n), while R_(s+v)(s+v) is the auto-correlation matrix of s(n)+v(n). The solution in (16) is the optimum solution. Hence, it is shown that even for a single-channel noise-corrupted AEC problem, the optimum solution can be achieved under the assumptions stated earlier.
For iterative estimation of optimal filter coefficients, the adaptive LMS algorithm is very popular. It is fast and efficient, and it does not require any correlation measurements or matrix inversion [13]. The update equation of the LMS adaptive algorithm is generally expressed as
(17)
where μ is the step factor controlling the stability and rate of convergence, ξ(n) is the cost function, and ∇ is the gradient operator. The LMS algorithm simply approximates the mean square error by the square of the instantaneous error, i.e., ξ(n)=e^2(n), and therefore, from (6) and (7), the gradient of ξ(n) can be expressed as
Thus, the update equation for the proposed single-channel LMS adaptive scheme can be written as
(18)
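The idea of feeding delayed, previously echo-suppressed samples back as the reference can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact update in (18): the function name `lms_delayed_error`, the tap indexing k = 0, …, p−1, the zero initial conditions, and the factor 2μ (which follows from taking the gradient of e^2(n)) are choices made here.

```python
import numpy as np

def lms_delayed_error(y, p, k0, mu):
    """LMS echo canceller whose reference is its own delayed error signal."""
    if k0 < 1:
        raise ValueError("k0 must be at least 1 so the reference is causal")
    y = np.asarray(y, dtype=float)
    w = np.zeros(p)        # estimated filter coefficients
    e = np.zeros(len(y))   # echo-suppressed output
    for n in range(len(y)):
        # Reference vector e(n-k0-k), k = 0..p-1 (zero before the start).
        u = np.array([e[n - k0 - k] if n - k0 - k >= 0 else 0.0
                      for k in range(p)])
        e[n] = y[n] - w @ u              # subtract the echo estimate
        w = w + 2.0 * mu * e[n] * u      # stochastic-gradient step
    return e, w
```

A call such as `e, w = lms_delayed_error(y, p=64, k0=200, mu=1e-3)` would return the echo-suppressed signal and the final coefficients; the parameter values in this call are placeholders, not values taken from the paper.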
3.3 Convergence analysis of the proposed AEC scheme
Taking the expectation of both sides of the update equation (18), one obtains
(19)
Here, an underline beneath the weight vector is introduced to represent its expected value. For the kth unknown weight (where k=1,2,…,p), using (6) and neglecting the effect of r_ss(n), as already discussed in the previous subsection, the last term of (19) can be written as
(20)
Based on the assumptions on cross-correlation terms stated in the previous subsection, one can obtain
(21)
Using (21), the update equation (19) can be written as
(22)
Evaluating the homogeneous and particular solutions of (22), the total solution can be obtained as (see Appendix)
(23)
where λ(k) is the kth diagonal element of the eigenvalue matrix obtained by eigenvalue decomposition of R_(s+v)(s+v)(n−k0), and rU(n−k0−k) is the kth element of the corresponding vector defined through the matrix U, which consists of the eigenvectors corresponding to these eigenvalues. Since the homogeneous part (1−2μλ(k))^n diminishes with the iterations of the update procedure, provided that |1−2μλ(k)|<1, (23) can be expressed in matrix form as
(24)
Thus, as the number of iterations increases, the average value of the weight vector converges to the Wiener-Hopf solution, which is the optimum solution.
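The convergence of the mean weight vector can be checked numerically with a generic recursion of the form w(n+1) = w(n) + 2μ(r − R w(n)), whose homogeneous part decays as (1−2μλ(k))^n. The sketch below assumes this generic form; the matrix `R`, the vector `r`, and the step size are hypothetical values, not quantities from the paper.

```python
import numpy as np

def mean_weight_recursion(R, r, mu, iterations):
    """Iterate the averaged LMS update starting from zero weights."""
    w = np.zeros(len(r))
    for _ in range(iterations):
        w = w + 2.0 * mu * (r - R @ w)
    return w

R = np.array([[1.0, 0.5], [0.5, 1.0]])   # hypothetical auto-correlation matrix
r = np.array([0.7, 0.2])                 # hypothetical cross-correlation vector
w_opt = np.linalg.solve(R, r)            # Wiener-Hopf solution
# Keep |1 - 2*mu*lambda| < 1 for every eigenvalue of R.
mu = 0.4 / np.linalg.eigvalsh(R).max()
w_avg = mean_weight_recursion(R, r, mu, 500)
```

With this step size, every factor 1−2μλ(k) lies strictly between 0 and 1, so `w_avg` approaches `w_opt`; choosing μ larger than the reciprocal of the largest eigenvalue would make the recursion diverge.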
3.4 Noise reduction in spectral domain
In the proposed AENC scheme, the ANC block operates frame by frame and performs noise reduction using a single-channel spectral subtraction algorithm [20–22]. According to (9), for the ith frame, the error signal over one frame length can be written as
(25)
The corresponding frequency-domain representation is given by
(26)
The magnitude-squared spectrum of E_i(ω) can be written as
(27)
It is desired to choose an estimate that will minimize
(28)
Since the noise is assumed to be zero mean and uncorrelated with the signal, the expected values of the last two terms of (27) can be neglected. Thus, (28) can be expressed as
(29)
This expression of Err_i(ω) can be minimized by choosing
(30)
With an estimate of the noise spectrum, the signal spectrum can be computed as
(31)
where the phase (arg[E_i(ω)]) is generally assumed to be the phase of the noise-corrupted signal, which does not cause significant degradation in the intelligibility of the speech signal [20]. It can be seen that an estimate of the magnitude spectrum of the signal can be obtained provided that an estimate of the noise spectrum is available, which is generally computed during periods when speech is known a priori to be absent.
The final output of the AENC system is a speech frame that consists of the original speech s_i(n) and a negligible amount of the noise-like signal Ψ_i(n). The signal Ψ_i(n), although very weak, may contain some signature of the input noise v(n), the residual echo δ_s(n), and the residual noise δ_v(n). To overcome the problem of musical noise and to avoid the speech distortion caused by the subtraction, an overestimate of the noise power spectrum can be subtracted carefully in (31) such that the spectral floor is preserved [21]. Thus, (31) can be modified as
(32)
Here, α_ss is the subtraction factor and β_ss is the spectral floor parameter, with α_ss≥1 and 0≤β_ss≤1. The task of noise power spectral density estimation is carried out using the minimum statistics noise estimator proposed in [23], which can handle the time-varying nature of the noise.
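A per-frame version of spectral subtraction with a subtraction factor and a spectral floor can be sketched as below. The exact form of (32) is assumed here to be the commonly used power-domain rule (subtract α times the noise power and clamp to β times the noise power); the function name `spectral_subtract_frame` and the default parameter values are hypothetical. Windowing, overlap-add, and the minimum statistics estimator of [23] are omitted, so the noise power spectrum is simply passed in.

```python
import numpy as np

def spectral_subtract_frame(frame, noise_psd, alpha=2.0, beta=0.01):
    """Over-subtract the noise power in one frame and keep a spectral floor."""
    spectrum = np.fft.rfft(frame)
    power = np.abs(spectrum) ** 2
    cleaned = power - alpha * noise_psd   # over-subtraction
    floor = beta * noise_psd              # spectral floor
    cleaned = np.where(cleaned > floor, cleaned, floor)
    # Reuse the phase of the noisy frame, as described for (31).
    phase = np.angle(spectrum)
    return np.fft.irfft(np.sqrt(cleaned) * np.exp(1j * phase), n=len(frame))
```

Here `noise_psd` must have `len(frame) // 2 + 1` entries to match the one-sided spectrum returned by `np.fft.rfft`. Setting `alpha=1` and `beta=0` gives plain power spectral subtraction without the floor.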