Robust time delay estimation for speech signals using information theory: A comparison study

Time delay estimation (TDE) is a fundamental component of a speaker localization and tracking system. Most traditional TDE methods are based on second-order statistics (SOS) under a Gaussian assumption for the source. This article addresses the TDE problem using two information-theoretic measures, joint entropy and mutual information (MI), which can be considered to indirectly include higher order statistics (HOS). The TDE solutions using the two measures are presented for both Gaussian and Laplacian models. We show that, for stationary signals, the two measures are equivalent for TDE. However, for non-stationary signals (e.g., noisy speech signals), maximizing the MI gives more consistent estimates than minimizing the joint entropy. Moreover, an existing idea of using a modified MI to embed information about reverberation is generalized to the multiple-microphone case. Experimental results for speech signals show that this scheme with the Gaussian model is the most robust in various noisy and reverberant environments.


Introduction
Time delay estimation (TDE) is a basic problem in modern signal processing and has found extensive applications, such as localizing and tracking radiating sources in radar and sonar. Nowadays, the same technique is used to localize and track acoustic sources in room environments. For example, in automatic camera tracking for video conferencing [1,2], the location of the current speaker is required for the camera to turn toward them; in speech enhancement [3,4] with a steerable microphone array, the speaker location is required for noise cancellation.
TDE for speech signals in adverse acoustic environments with strong noise and reverberation has long been a challenging problem. Among the traditional methods for TDE, the most popular is the generalized cross-correlation (GCC) method proposed by Knapp and Carter [5]. The relative delay is estimated by maximizing the cross-correlation between filtered versions of the received signals. It has been shown in [6,7] that the GCC method performs fairly well in moderately noisy and lightly reverberant environments; however, it degrades dramatically when noise or reverberation is high. In an attempt to deal better with noise and reverberation, an effective approach was introduced based on the multichannel cross-correlation coefficient (MCCC) [8], which combats both noise and reverberation by taking advantage of the redundant information from multiple sensor pairs. Its robustness improves as the number of sensors increases.
As a second-order statistics (SOS) measure of the dependence among multiple random variables, the MCCC is ideal for Gaussian signals. For non-Gaussian source signals, however, higher order statistics (HOS) have more to say about their dependence. More recently, the two information-theoretic concepts of joint entropy and mutual information (MI), which can be considered as higher order statistics [9], have been used to develop new TDE estimators [10,11]. In [10], the Laplacian is employed to model the speech source, and the relative delay is estimated by minimizing the joint entropy of the multiple microphone output signals. In [11], based on characterizing the speech source as Gaussian, the MI measure is used for TDE; however, the method is restricted to the two-microphone case.
Building on the work of [10,11], this article presents a framework that treats the TDE problem from an information-theoretic point of view. Since the two information-theoretic measures leave the choice of a distribution model for the source signal free, the solutions based on minimizing the joint entropy and maximizing the MI of the multichannel output signals are derived for both Gaussian and Laplacian models. The experimental results show that the Gaussian is a better model than the Laplacian for the short frames of noisy speech signals used for TDE. Moreover, we show that the two measures are equivalent for TDE when the source signal is stationary. For non-stationary signals, however, maximizing the MI gives more stable and consistent estimates of the relative delay than minimizing the joint entropy.
In addition, in order to combat reverberation more effectively, the MI of multichannel outputs is modified to embed information about reverberation, which helps to improve the estimator's robustness against reverberation. The proposed scheme is verified by simulations in various noisy and reverberant environments.
This article is organized as follows. 'Signal model' section describes the signal model used throughout this article. 'TDE based on information theory' section presents the joint entropy and MI based methods for both Gaussian and Laplacian models. 'Modified MI of multichannel outputs' section details how to modify the MI based estimator to be more robust against reverberation for multiple microphones. Simulations are presented in 'Simulations' section. 'Conclusions' section summarizes the conclusions of the article.

Signal model
In an attempt to estimate only one time delay, two sensors are enough. However, it has been shown in [8,10] that employing more than two sensors can significantly improve the estimator's robustness against noise and reverberation by taking advantage of the available redundant information. Consider a linear microphone array consisting of N microphones positioned in an acoustical enclosure. When reverberation is ignored, the received signals from a single far-field source can be written as

x_n(k) = α_n s(k − t − ϕ_n(τ)) + ω_n(k), n = 1, 2, ..., N, (1)

where α_n are the attenuation factors, t is the propagation time from the source s(k) to microphone 1 (without loss of generality, microphone 1 is selected as the reference point), the noise term ω_n(k) is assumed to be zero-mean white Gaussian and uncorrelated with the source signal and with the noise signals at the other microphones, and ϕ_n(τ) is the relative delay between microphones 1 and n (with ϕ_1(τ) = 0 and ϕ_2(τ) = τ). Since we consider only linear equispaced arrays and the far-field case, the function ϕ_n(τ) depends solely on the delay τ:

ϕ_n(τ) = (n − 1)τ. (2)
In other scenarios with linear but non-equispaced or non-linear arrays, the mathematical formulation of ϕ_n(τ) can be obtained from the array geometry. In addition, we assume that the sampling rate is sufficiently high that the value of ϕ_n(τ) can be treated as an integer. However, the model described by (1) does not include the effect of reverberation in real room acoustic environments. In order to describe the TDE problem in a room environment, where each microphone often receives a large number of echoes due to reflections of the wavefront from objects and room boundaries, we can use a more realistic reverberation model, which expresses the received signals as [12]

x_n(k) = h_n * s(k) + ω_n(k), n = 1, 2, ..., N, (3)

where h_n denotes the reverberant impulse response between the source and the nth microphone and the symbol * denotes convolution. In this model, h_n contains not only the effect of the direct-path delay but also that of the other reflected-path delays. The length of h_n is generally a function of the reverberation time.
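As an illustration of the anechoic model (1), the following sketch (the function name and parameters are our own, not from the article) generates N delayed, attenuated, and optionally noise-corrupted copies of a source signal for a linear equispaced array, with ϕ_n(τ) = (n − 1)τ and the common propagation time t omitted:

```python
import numpy as np

def simulate_array(s, tau, N, alpha=None, snr_db=None, rng=None):
    """Generate N microphone signals under the anechoic far-field model
    x_n(k) = alpha_n * s(k - phi_n(tau)) + w_n(k), phi_n(tau) = (n - 1) * tau,
    for a linear equispaced array.  snr_db=None means noise-free."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.ones(N) if alpha is None else np.asarray(alpha, dtype=float)
    K = len(s)
    x = np.zeros((N, K))
    for n in range(N):
        d = n * tau                      # phi_{n+1}(tau) for 0-based index n
        x[n, d:] = alpha[n] * s[:K - d]  # delayed, attenuated source copy
    if snr_db is not None:               # add white Gaussian noise per channel
        for n in range(N):
            p_noise = np.mean(x[n] ** 2) / 10 ** (snr_db / 10)
            x[n] += rng.normal(0.0, np.sqrt(p_noise), K)
    return x
```

The reverberant model (with h_n) would replace the single delayed copy by a convolution with a measured or simulated impulse response.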

TDE based on information theory
Most of the traditional TDE algorithms are based on a SOS criterion. Since the sensor output signals are random variables, it makes more sense to quantify the dependence among them by taking their probability density functions (pdfs) into account, i.e., by employing a HOS criterion.

Entropy and MI
In general, entropy is a measure of the uncertainty of a random variable. Shannon, using an axiomatic approach [13], defined the entropy of a random variable x with pdf f(x) as

H(x) = −∫ f(x) ln f(x) dx. (4)

Let us now consider N random variables with joint density f(x), where

x = [x_1, x_2, ..., x_N]^T (5)

and [·]^T denotes vector/matrix transposition. The corresponding joint entropy of the N random variables can be considered to be the entropy of the single vector-valued random variable x:

H(x) = −∫ f(x) ln f(x) dx. (6)

The MI is an information-theoretic measure of the information that one random variable contains about another. If we consider two variables x_1 and x_2, the MI I(x_1, x_2) is the Kullback-Leibler (KL) divergence between the joint density f(x_1, x_2) and the product of the marginal densities f(x_1) and f(x_2) [9], i.e.,

I(x_1, x_2) = ∫∫ f(x_1, x_2) ln [ f(x_1, x_2) / ( f(x_1) f(x_2) ) ] dx_1 dx_2. (7)

When multiple random variables are concerned, we use the total correlation [14], one of several generalizations of the MI in probability theory and, in particular, in information theory, to express the amount of dependence existing among the variables. The multivariate MI of x can be formulated as

I(x) = ∫ f(x) ln [ f(x) / ∏_{n=1}^{N} f(x_n) ] dx = ∑_{n=1}^{N} H(x_n) − H(x). (8)

According to (1), we consider the following parameterized vector:

x(k, m) = [x_1(k + ϕ_1(m)), x_2(k + ϕ_2(m)), ..., x_N(k + ϕ_N(m))]^T. (9)

Obviously, when we find the correct delay m = τ, the signal components at the different microphones are synchronized, and the information that one microphone signal carries about the others is maximal. In this case, the entropy and the MI of x(k, m) reach their minimum and maximum, respectively. Thus, the relative delay can be estimated by minimizing the entropy or maximizing the MI:

τ̂_H = arg min_m H(x(k, m)), (10)
τ̂_I = arg max_m I(x(k, m)). (11)

In order to apply the two measures, the joint and marginal densities of the multichannel output signals are required. Since the information-theoretic concepts leave the source model free to choose, densities other than the Gaussian, such as the Laplacian, can be tried, as in this article and in [10].
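The parameterized-vector construction and the argmin/argmax search described above can be sketched as follows; this is a minimal illustration with helper names of our own, assuming non-negative candidate delays, with the dependence measure (MI, or minus the joint entropy) passed in as a scoring function to be maximized:

```python
import numpy as np

def aligned_frame(x, m):
    """Columns of the returned (N, K') matrix are the parameterized vectors
    x(k, m): channel n is advanced by phi_n(m) = (n - 1) * m (m >= 0 assumed),
    so the true delay m = tau synchronizes the source components."""
    N, K = x.shape
    shift = [n * m for n in range(N)]
    L = K - shift[-1]                    # usable length after alignment
    return np.stack([x[n, shift[n]:shift[n] + L] for n in range(N)])

def estimate_delay(x, candidates, score):
    """Grid search over candidate delays: score(.) returns the quantity
    to be maximized (the MI, or minus the joint entropy)."""
    return max(candidates, key=lambda m: score(aligned_frame(x, m)))
```

Any dependence measure, SOS or information-theoretic, plugs into the same search skeleton.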

Gaussian signals
A Gaussian random variable x with zero mean and variance σ_x² has the pdf

f(x) = (1/√(2πσ_x²)) exp(−x²/(2σ_x²)). (12)

The resulting entropy is

H(x) = (1/2) ln(2πe σ_x²). (13)

Let x_1, x_2, ..., x_N follow a multivariate Gaussian distribution with mean 0 and covariance matrix

R = E{x x^T}. (14)

The joint pdf of x_1, x_2, ..., x_N is

f(x) = (1/((2π)^{N/2} det^{1/2}(R))) exp(−x^T R^{−1} x / 2). (15)

By substituting (15) into (6), the entropy of x can be obtained as [10]

H(x) = (1/2) ln[(2πe)^N det(R)]. (16)
Accordingly, the MI of the jointly Gaussian distributed random vector x can be formulated as [11]

I(x) = ∑_{n=1}^{N} (1/2) ln(2πe σ_{x_n}²) − (1/2) ln[(2πe)^N det(R)] = −(1/2) ln [ det(R) / ∏_{n=1}^{N} σ_{x_n}² ]. (17)

In practice, with K observations of x, we first estimate the covariance matrix

R̂(m) = (1/K) ∑_k x(k, m) x^T(k, m). (18)

Then, we compute the entropy H(x(k, m)) (or the MI I(x(k, m))) for different m and choose the one that minimizes the entropy (or maximizes the MI) as the optimal estimate of the relative delay.
It can easily be checked that maximizing the MI for Gaussian signals (17) is, indeed, equivalent to maximizing the squared MCCC among the N random variables, which is defined as [8]

ρ²(m) = 1 − det(R̃(m)), (19)

where R̃(m) is the normalized (correlation-coefficient) covariance matrix, so that det(R̃) = det(R)/∏_{n=1}^{N} σ_{x_n}². Furthermore, note that the time-shift-independent variances σ_{x_n}² are constant if the signals are stationary and the data sample length K is sufficiently large (ideally K → ∞). In this case, it is obvious that minimizing the entropy (16) is equivalent to maximizing the MI (17) or the MCCC (19) for TDE. However, for non-stationary signals, the entropy (16) is affected by the variance change. These findings will be verified by simulations later.
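A small numerical sketch of the Gaussian formulas (16), (17), and (19) (the function name is ours) computes all three scores from the sample covariance; the identity ρ² = 1 − exp(−2I), which follows from (17) and (19), makes the claimed equivalence easy to check:

```python
import numpy as np

def gaussian_scores(frames):
    """Entropy (16), MI (17) and squared MCCC (19) of zero-mean frames
    (an N x K matrix of aligned observations), using the sample
    covariance R = (1/K) X X^T."""
    N, K = frames.shape
    R = frames @ frames.T / K
    var = np.diag(R)
    H = 0.5 * np.log((2 * np.pi * np.e) ** N * np.linalg.det(R))
    I = 0.5 * np.sum(np.log(2 * np.pi * np.e * var)) - H
    Rt = R / np.sqrt(np.outer(var, var))  # correlation-coefficient matrix
    rho2 = 1.0 - np.linalg.det(Rt)        # squared MCCC
    return H, I, rho2
```

Because ρ² and I are deterministic functions of each other, they always pick the same delay; only the entropy H additionally depends on the (possibly time-varying) channel variances.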

Laplacian signals
The univariate Laplacian distribution with zero mean and variance σ_x² is given by

f(x) = (1/√(2σ_x²)) exp(−√(2/σ_x²) |x|). (20)

The corresponding entropy is

H(x) = 1 + (1/2) ln(2σ_x²). (21)

Suppose that the elements of the random vector x follow a multivariate Laplacian distribution with mean 0 and covariance matrix R. The joint density is given by [15]

f(x) = (2/((2π)^{N/2} det^{1/2}(R))) (β/2)^{P/2} B_P(√(2β)), (22)

where β = x^T R^{−1} x, P = 1 − N/2, and B_P(·) is the modified Bessel function of the second kind.
The joint entropy can be obtained as [10]

H(x) = ln[(2π)^{N/2} det^{1/2}(R) / 2] − (P/2) E{ln(β/2)} − E{ln B_P(√(2β))}. (23)

By substituting (21) and (23) into (8), the MI is given by

I(x) = ∑_{n=1}^{N} [1 + (1/2) ln(2σ_{x_n}²)] − H(x). (25)

When the entropy (23) or the MI (25) is applied to TDE, we estimate E{ln(β/2)} and E{ln B_P(√(2β))} numerically from the observed data, since they do not seem to have closed forms. Suppose that we have K samples of the observation vector x(k, m); we then replace ensemble averages by time averages:

Ê{ln(β/2)} = (1/K) ∑_k ln(β(k, m)/2), (26)
Ê{ln B_P(√(2β))} = (1/K) ∑_k ln B_P(√(2β(k, m))), (27)

where β(k, m) = x^T(k, m) R̂^{−1}(m) x(k, m). In practice, we first estimate the covariance matrix R̂(m); then (26) and (27) follow immediately, and the entropy (23) or the MI (25) can be computed to estimate the relative delay.
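The time-averaging step can be sketched as follows (a numpy-only illustration with names of our own choosing). It estimates R̂, the quadratic forms β(k, m), and the sample mean (26); the Bessel term (27) would be averaged the same way, with B_P evaluated by, e.g., scipy.special.kv:

```python
import numpy as np

def laplacian_terms(frames):
    """Quantities needed by the Laplacian entropy (23): the quadratic
    forms beta(k) = x^T(k) R^{-1} x(k) and the time average of
    ln(beta/2), cf. (26).  The companion term E{ln B_P(sqrt(2*beta))}
    of (27) is averaged identically, using scipy.special.kv for B_P."""
    N, K = frames.shape
    R = frames @ frames.T / K            # sample covariance R_hat
    Rinv = np.linalg.inv(R)
    # beta[k] = frames[:, k]^T Rinv frames[:, k] for every column k
    beta = np.einsum('nk,nm,mk->k', frames, Rinv, frames)
    return beta, np.mean(np.log(beta / 2.0))
```

Note that β(k, m) must be recomputed for every candidate delay m, since the alignment changes R̂(m).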
It has been shown that the Laplacian distribution models speech samples during voice-activity intervals better than the Gaussian, generalized Gaussian, and gamma distributions [16], which motivated its use for the entropy-based estimation for speech signals in [10]. However, since the noise is typically Gaussian, assuming a Laplacian distribution for the noisy microphone-array outputs is questionable, particularly under low-SNR conditions.
In addition, similar to the Gaussian case, the MI (25) is insensitive to variance changes of the sensor outputs, in contrast to the entropy (23).

Modified MI of multichannel outputs
It is shown in [11] that the estimator that searches for the relative delay between two microphone signals by directly maximizing the MI suffers from the same limitations as the GCC, and is not robust enough in reverberant acoustic environments.
Consider that the relative delay between the two signals x_1(k) and x_2(k) is τ. In the absence of reverberation, only a single delay is present between the two signals. Thus, the information contained in a sample l of x_1(k) depends only on the information contained in the sample l − τ of x_2(k). When reverberation is present, the information contained in a sample l of x_1(k) is also contained in the neighboring samples of the sample l − τ of x_2(k), so the plain MI is no longer representative enough. Thus, in order to better estimate the information conveyed by the two signals, a modified MI that jointly considers Q neighboring samples can be formulated as [11]

I_Q(x_1, x_2) = I([x_1(l), x_1(l − 1), ..., x_1(l − Q)]^T, [x_2(l − τ), x_2(l − τ − 1), ..., x_2(l − τ − Q)]^T).

When multiple sensors are used, the modified MI of x(k, m) is obtained by applying the same idea to the parameterized vector (9), i.e., by stacking Q + 1 successive sample vectors:

x_Q(k, m) = [x^T(k, m), x^T(k − 1, m), ..., x^T(k − Q, m)]^T. (31)

The length of x_Q is N(Q + 1). We call Q the order of the system. Accordingly, with the K data samples, we compute the MI I(x_Q(k, m)) for different m and choose the one that maximizes the MI as the estimate of the relative delay:

τ̂_Q = arg max_m I(x_Q(k, m)). (32)
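The construction of the stacked vector x_Q(k, m), assuming it is formed from Q + 1 successive aligned sample vectors as described above, can be sketched as (function name ours):

```python
import numpy as np

def stacked_frames(frames, Q):
    """Stack Q + 1 successive aligned sample vectors x(k, m) into the
    extended vector x_Q(k, m) of length N(Q + 1), so the MI also
    captures dependence on neighboring (reverberation-smeared) samples.
    frames is N x K; the result is N(Q + 1) x (K - Q)."""
    N, K = frames.shape
    return np.concatenate([frames[:, q:K - Q + q] for q in range(Q + 1)],
                          axis=0)
```

The stacked matrix then feeds the same covariance-based MI computation as before, only with an N(Q + 1)-dimensional covariance matrix.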

Simulations
In this section, we conduct experiments with speech signals to evaluate the estimators, using both simulated and real impulse responses in reverberant room environments. A recorded female speech signal is convolved with the room impulse responses to generate the microphone signals, which are partitioned into non-overlapping frames of 600 samples. In addition, mutually independent zero-mean white Gaussian noise is added to each microphone signal to control the SNR.
For each set of experimental conditions, 100 frames are processed to generate 100 estimates. The TDE performance is evaluated in terms of the root mean-squared error (RMSE) of the estimates.
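The RMSE criterion over the per-frame estimates is a one-liner (a trivial helper, named by us):

```python
import numpy as np

def rmse(estimates, true_tau):
    """Root mean-squared error of the per-frame delay estimates."""
    e = np.asarray(estimates, dtype=float) - true_tau
    return np.sqrt(np.mean(e ** 2))
```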

Simulated reverberant channels
The image model technique [17,18] is used to simulate the reverberant acoustic environment of a room with dimensions [8 6.5 3] m. A linear equispaced microphone array of six omnidirectional receivers with an inter-element spacing of 10 cm is considered. Two reverberation conditions are simulated with different reverberation times T_60, defined as the time for the sound to decay to a level 60 dB below its original level; the two reverberation times are approximately 200 and 500 ms. The results are averaged over twenty random displacements and rotations of the relative geometry between the source and the array inside the room. Figure 1 shows two examples of the simulated channel responses between the source and the first microphone for the two reverberation conditions: (a) T_60 = 200 ms and (b) T_60 = 500 ms.
In the first experiment, the entropy, MI, and modified MI based estimators with both Gaussian and Laplacian models are evaluated under the different noise and reverberation conditions. For example, for two microphones, the RMSE of each approach at SNR = -5 dB is more than six times that at SNR = 25 dB in the moderate reverberation condition with T_60 = 200 ms. Meanwhile, when the number of microphones is fixed and the noise conditions are the same, each approach shows a much higher RMSE in the highly reverberant environment than in the moderately reverberant one. However, for the same noise and reverberation conditions, the RMSE drops evidently as the number of microphones increases for all the algorithms, particularly in the high-noise condition. This indicates that better performance can be achieved by employing more microphones.
Moreover, it can be seen that the entropy and MI measures have comparable performance in the low-noise condition with SNR = 25 dB. In the high-noise condition with SNR = -5 dB, however, the MI based approaches perform distinctly better than the entropy based ones. This can be attributed to the fact that the MI, unlike the entropy, is insensitive to the variance changes caused by the non-stationarity of the noise-corrupted speech signals.
In addition, each of the three measures with the Gaussian model performs better than with the Laplacian model, especially in the high-noise condition. This can be explained as follows. The speech samples during voice-activity intervals are Laplacian random variables [16], while the noise is typically Gaussian. Thus, the noisy microphone output, which is a mixture of Laplacian and Gaussian random variables, cannot be well modeled as Laplacian, particularly when the noise is strong. Moreover, it has been shown that the joint distribution of two speech samples 0.1 ms apart looks very much like a Gaussian [16]. That is the case in this article, where the sampling period is approximately 0.1 ms.
In general, for the same number of microphones and the same noise and reverberation conditions, the modified MI based algorithms with an order of Q = 4 clearly outperform their entropy based and MI based counterparts, as demonstrated by their distinctly lower RMSE in most cases.

Real reverberant channels
In this subsection, we repeat the first experiment using real measured room impulse responses from the Multichannel Acoustic Reverberation Database at York (MARDY) to evaluate the algorithms. The database comprises a collection of room impulse responses measured with a linear array for various source-array separations in a varechoic room. The collected data are available at http://www.commsp.ee.ic.ac.uk/sap/. Figure 4 shows one of the recorded channel responses; the reverberation time of the channel responses used is approximately 447 ms. Figure 5 presents the RMSE of the estimates versus the number of microphones for two noise conditions, SNR = -5 dB and SNR = 25 dB. The modified MI based algorithms perform distinctly better than the other algorithms except in the six-microphone case with SNR = 25 dB. Moreover, while the Gaussian model outperforms the Laplacian model in the low-SNR condition (SNR = -5 dB), the two models in general give comparable performance in the high-SNR condition (SNR = 25 dB).

Conclusions
In this article, the TDE problem is viewed from an information theory point of view. It is shown that maximizing the MI for TDE gives more consistent results than minimizing the joint entropy, since the MI is insensitive to variance changes of the sensor outputs. Moreover, an existing idea of using a modified MI to embed information about reverberation is generalized to the multiple-microphone case. The effectiveness of the proposed scheme is verified by simulations for speech signals in different reverberant environments. The simulation results also demonstrate that the Gaussian distribution models short segments of noisy speech signals better than the Laplacian distribution for TDE.