N-dimensional N-microphone sound source localization

Parsayan, Ali; Ahadi, Seyed Mohammad

doi:10.1186/1687-4722-2013-27

Research
Open access
Published: 06 December 2013


EURASIP Journal on Audio, Speech, and Music Processing volume 2013, Article number: 27 (2013)


A Correction to this article was published on 27 September 2022

This article has been updated

Abstract

This paper investigates real-time N-dimensional wideband sound source localization in outdoor (far-field), low-reverberation cases using a simple N-microphone arrangement. Outdoor sound source localization in different climates needs highly sensitive, high-performance microphones, which are very expensive, so reducing the microphone count is our goal. Time delay estimation (TDE)-based methods are common for N-dimensional wideband sound source localization in outdoor cases and use at least N + 1 microphones. These methods need numerical analysis to solve closed-form non-linear equations, which leads to large computational overheads, and they need a good initial guess to avoid local minima. Methods that combine TDE with intensity level difference or interaural level difference (ILD) can reduce the microphone count to two for indoor two-dimensional cases. However, ILD-based methods require that only one dominant source be active for accurate localization. Also, with a linear array, two mirror points are produced simultaneously (half-plane localization). We apply this method to outdoor cases and propose a novel approach for N-dimensional entire-space outdoor far-field, low-reverberation localization of a dominant wideband sound source using TDE, ILD, and head-related transfer function (HRTF) simultaneously with only N microphones. Our proposed TDE-ILD-HRTF method addresses the problems mentioned above using source counting, noise reduction by spectral subtraction, and HRTF. A special reflector is designed to avoid mirror points, and source counting is used to make sure that only one dominant source is active in the localization area. The simple microphone arrangement linearizes the non-linear closed-form equations and removes the need for an initial guess.
Experimental results indicate that our implemented method features less than 0.2 degree error for angle of arrival and less than 10% error for three-dimensional location finding as well as less than 150-ms processing time for localization of a typical wideband sound source such as a flying object (helicopter).

1. Introduction

Source localization has been one of the fundamental problems in sonar [1], radar [2], teleconferencing or videoconferencing [3], mobile phone location [4], navigation and global positioning systems (GPS) [5], localization of earthquake epicenters and underground explosions [6], microphone arrays [7], robots [8], microseismic events in mines [9], sensor networks [10, 11], tactile interaction in novel tangible human-computer interfaces [12], speaker tracking [13], surveillance [14], and sound source tracking [15]. Our goal is real-time sound source localization in outdoor environments, which necessitates a few points to be considered. For localizing such sound sources, a far-field assumption is usual. Furthermore, our experiments confirm that placing localization system in suitable higher heights often reduces the reverberation degree, especially for flying objects. Also, many such sound source signals are wideband signals. Moreover, outdoor high-accuracy sound source localization in different climates needs highly sensitive and high-performance microphones which are very expensive. Therefore, reduction of the number of microphones is very important, which in turn leads to reduced localization accuracies using conventional methods. Here, we intend to introduce a real-time accurate wideband sound source localization system in low degree reverberation far-field outdoor cases using fewer microphones.

The structure of this paper is as follows. After a literature review, we explain HRTF-, ILD-, and TDE-based methods and discuss TDE-based phase transform (PHAT). In Section 4, we explain sound source angle of arrival and location calculations using ILD and PHAT. Section 5 covers the introduction of TDE-ILD-based method to two-dimensional (2D) half-plane sound source localization using only two microphones. Section 6 includes simulation of this method for 2D cases, where according to simulation results and due to the use of ILD, we introduce source counting. In Section 7, we propose, and in Section 8, we implement our TDE-ILD-HRTF-based method for 2D whole-plane and three-dimensional (3D) entire-space sound source localization. Section 9 includes the implementations. Finally conclusions will be made in Section 10.

2. Literature review

Passive sound source localization methods can, in general, be divided into direction of arrival (DOA)- [16], time difference of arrival (TDOA)-, TDE-, or interaural time difference (ITD)- [17–20], ILD- [21–24], and HRTF-based methods [25–30]. DOA-based beamforming and sub-space methods typically need a large number of microphones to estimate narrowband source locations in far-field cases and wideband source locations in near-field cases. They also have higher processing needs than other methods. Many localization methods for near-field cases have been proposed in the literature, such as maximum likelihood (ML), covariance approximation, multiple signal classification (MUSIC), and estimation of signal parameters via rotational invariance techniques (ESPRIT) [16]. However, these methods are not applicable to the localization of wideband signal sources in far-field cases with a small number of microphones. On the other hand, ILD-based methods are mostly applicable to the case of a single dominant sound source (high signal-to-noise ratio (SNR)) [21–24]. TDE-based methods with high sampling rates are commonly used for 2D and 3D high-accuracy wideband near-field and far-field sound source localization. For ILD- or TDE-based methods, the minimum number of microphones required is three for 2D positioning and four for the 3D case [17–20, 31–39]. Finally, HRTF-based methods are applicable only to calculating the arrival angle in azimuth or elevation [30].

TDE- and ILD-based outdoor far-field accurate wideband sound source localization in different climates needs highly sensitive and high-performance microphones which are very expensive. In the last decade, some papers were published which introduce 2D sound source localization methods using just two microphones in indoor cases using TDE- and ILD-based methods simultaneously. However, it is noticeable that using ILD-based methods requires only one dominant source to be active, and it is known that, by using a linear array in the proposed TDE-ILD-based method, two mirror points will be produced simultaneously (half-plane localization) [40]. In this paper, we apply this method in outdoor (low-degree reverberation) cases for a dominant sound source. We also propose a novel method to have 2D whole-plane (without producing two mirror points) and 3D entire-space dominant sound source localization using TDE-, ILD-, and HRTF-based methods simultaneously (TDE-ILD-HRTF method). Based on the proposed method, a special reflector for the implemented simple microphone arrangement is designed, and source counting method is used to find that only one dominant sound source is active in the localization area.

In TDE and ILD localization approaches, calculations are carried out in two stages: estimation of time delay or intensity level differences and location calculation. Correlation is most widely used for time delay estimation [17–20, 31–39]. The most important issue in this approach is high-accuracy time delay estimation between microphone pairs. Meanwhile, the most important issue in ILD-based approach is high accuracy level difference measurement between microphone pairs [21–24]. Also, numerous results were published in the last decades for the second stage, i.e., location calculation. Equation complexities and large processing times are the important obstacles faced at this stage. In this paper, we propose a simple microphone arrangement that solves both these problems simultaneously.

In the above-mentioned first stage, the classic methods of source localization from time delay estimates by detecting radio waves were Loran and Decca [31]. However, generalized cross-correlation (GCC) using a ML estimator, proposed by Knapp and Carter [32], is the most widely used method for TDE. Later, a number of techniques were proposed to improve GCC in the presence of noise [3, 33–36]. As GCC is based on an ideal signal propagation model, it is believed to have a fundamental weakness of inability to cope well with reverberant environments. Some improvements were obtained by cepstral prefiltering by Stephenne and Champagne [37]. Even though more sophisticated techniques exist, they tend to be computationally expensive and are thus not well suited for real-time applications. Later, PHAT was proposed by Omologo and Svaizer [38]. More recently, a new PHAT-based method was proposed for high-accuracy robust speaker localization, known as steered response pattern-phase or power-phase transform (SRP-PHAT) [39]. Its disadvantage is higher processing time in comparison to PHAT, as it requires a search of a large number of candidate locations. According to the fact that in the application of this research, the number of candidate locations is much higher due to direction and distance estimation; this disadvantage does not allow us to use it in real-time applications.

In the last decade, according to the fact that the received signal energy is inversely proportional to the squared distance between the source and the receiving sensor, there has been some interest in using the received signal level at different sensors for source localization. Due to the spatial separation of the sensors, the source signal will arrive at the sensors with different levels so that the level differences can be utilized for source localization. Sheng and Hu [41] followed by Blatt and Hero [42] have proposed different algorithms for locating sources using a sensor network based on energy measurements. Birchfield and Gangishetty [22] applied ILD to sound source localization. While these works used only ILD-based methods to locate a source, Cui et al. [40] tried 2D sound source localization by a pair of microphones using TDE- and ILD-based methods simultaneously. When the source signal is captured at the sensors, both time delay and signal level information are used for source localization. This technique is applicable for 2D localization with two sensors only. Also, due to the use of a linear array, it generates two mirror points simultaneously (half-plane localization). Ho and Sun [24] addressed a more common scenario of 3D localization using more than four sensors to improve the source location accuracy.

Human hearing system allows finding sound sources direction of arrival in 3D with just two ears. Pinnas, shoulders, and head diffract the incoming sound waves [43]. These propagation effects collectively are termed the HRTF. Batteau reported that the external ears play an important role in estimating the elevation angle of arrival [44]. Roffler and Butler [45], Oldfield and Parker [46], and Hofman et al. [47] have tried to find experimental evidence for this claim by using a Plexiglas headband to flatten the pinna against the head [45]. Based on HRTF measuring, Brown and Duda have made an extensive experimental study and provided empirical formulas for the multipath delays produced by pinna [48]. Although more sophisticated HRTF models have been proposed [49], the Brown-Duda model has the advantage that it provides an analytical relationship between the multipath delays and the azimuth and elevation angles. Recently, Sen and Nehorai considered the Brown-Duda HRTF model as an example to model the frequency-dependent head-shadow effects and the multipath delays close to the sensors for analyzing a 3D direction finding system with only two sensors inspired by the human auditory system [43]. However, they did not consider white noise gain error or spatial aliasing error in their model. They computed the asymptotic frequency domain Cramer-Rao bound (CRB) on the error of the 3D direction estimate for zero-mean wide-sense stationary Gaussian source signals. It should be noted that HRTF-based works are just able to estimate the azimuth and elevation angles [50]. In the last decades, some papers were published which tried to apply HRTF along with TDE for azimuth and elevation angle of arrival estimation [30]. According to this ability, we apply HRTF in our TDE-ILD-based localization system for solving the ambiguity in the generation of two mirror location points. We named it TDE-ILD-HRTF method.

Given a set of TDEs and ILDs from a small set of microphones using PHAT- and ILD-based methods, respectively, the second stage of a two-stage algorithm determines the best point for the source location. The measurement equations are non-linear. The most straightforward way is to perform an exhaustive search in the solution space. However, this is computationally expensive and inefficient. If the sensor array is known to be linear, the position measurement equations are simplified. Carter focused on a simple beamforming technique [1]. However, it requires a search in the range and bearing space. Also, beamforming methods need many more microphones for high-accuracy source localization. The linearization solution based on Taylor-series expansion by Foy [51] involves iterative processing, typically incurs high computational complexity, and for convergence, requires a tolerable initial estimate of the position. Hahn proposed an approach [20] that assumes a distant source. Abel and Smith proposed an explicit solution that can achieve the Cramer-Rao Lower Bound (CRLB) in the small error region [52]. The situation is more complex when sensors are distributed arbitrarily. In this case, emitter position is determined from the intersection of a set of hyperbolic curves defined by the TDOA estimates using non-Euclidean geometry [53, 54]. Finding the solution is not easy as the equations are non-linear. Schmidt has proposed a formulation [18] in which the source location is found as the focus of a conic passing through three sensors. This method can be extended to an optimal closed-form localization technique [55]. Delosme [56] proposed a gradient method for search in a localization procedure leading to computation of optimal source locations from noisy TDOA's. Fang [57] gave an exact solution when the number of TDOA measurements is equal to the number of unknowns (coordinates of transmitter). 
This solution, however, cannot make use of extra measurements, available when there are extra sensors, to improve position accuracy. The more general situation with extra measurements was considered by Friedlander [58], Schau and Robinson [59], and Smith and Abel [55]. These methods are not optimum in the least-squares sense and perform worse in comparison to the Taylor-series method.

Although closed-form solutions have been developed, their estimators are not optimum. The divide and conquer (DAC) method [60] from Abel can achieve optimum performance, but it requires that the Fisher information is sufficiently large. To obtain a precise position estimate at reasonable noise levels, the Taylor-series method [51] is commonly employed. It is an iterative method that starts with an initial guess and improves the estimate at each step by determining the local linear least-squares (LS) solution. An initial guess close to the true solution is needed to avoid local minima. Selection of such a starting point is not simple in practice. Moreover, convergence of the iterative process is not assured. It also suffers from convergence problem and large LS computational burden as the method is iterative. Within the last few years, some papers were published on improving LS and closed-form methods [16, 23, 61–65]. Based on closed-form hyperbolic-intersection method, we will explain our proposed method, which using a simple arrangement of two microphones for 2D cases and three microphones for 3D cases, can simplify non-linear equations of this method to have a linear equation. Although there have been attempts to linearize closed-form non-linear equations through algebraic means, such as [7, 16, 56, 63], our proposed method with simple pure geometrical linearization needs less microphones and features accurate localization and less processing time.

3. Basic methods

3.1. HRTF

Human beings are able to locate sound sources in three dimensions in range and direction, using only two ears. The location of a source is estimated by comparing difference cues received at both ears, among which are time differences of arrival and intensity differences. The sound source specifications are modified before entering the ear canal. The head-related impulse response (HRIR) is the name given to the impulse response relating the source location and the ear location. Therefore, convolution of an arbitrary sound source with the HRIR leads to what would have been received at the ear canal. The HRTF is the Fourier transform of HRIR and thus represents the filtering properties due to diffractions and reflections at the head, pinna, and torso. Hence, HRTF can be computed through comparing the original and modified signals [25–29, 43–50]. Consider X_c(k) being the Fourier transform of sound source signal at one ear (modified signal at either left or right ear) and Y_c(k) to be that of the original signal. HRTF of that source can be found as [30]:

H_{c} (k) = \frac{Y_{c} (k)}{X_{c} (k)}

(1)

In detail:

|H_{c} (k)| = \frac{|Y_{c} (k)|}{|X_{c} (k)|}

(2)

arg H_{c} (k) = arg Y_{c} (k) - arg X_{c} (k)

(3)

H_{c} (k) = |H_{c} (k)| e^{j \arg H_{c} (k)} .

(4)

In fact, H_c(k) contains all the direction-dependent and direction-independent components. Therefore, in order to have pure HRTF, the direction-independent elements have to be removed from H_c(k). If C_c(k) is the known CTF (common transfer function), then DTF (directional transfer function), D_c(k), can be found as:

D_{c} (k) = \frac{Y_{c} (k)}{C_{c} (k) X_{c} (k)} .

(5)
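As a rough illustration of (1) to (5), the following sketch computes the per-bin transfer function, its magnitude and phase, and the DTF from spectra that are already available as lists of complex numbers. The function names, the `eps` guard against division by zero, and the two-bin toy spectra are assumptions made for this example and are not taken from the paper.

```python
import cmath

def transfer_function(Y, X, eps=1e-12):
    """Per-bin ratio H_c(k) = Y_c(k) / X_c(k), as in (1).

    Bins where |X_c(k)| is below eps are set to 0 to avoid dividing by
    zero (an illustrative choice, not part of the paper).
    """
    return [y / x if abs(x) > eps else 0j for y, x in zip(Y, X)]

def magnitude_and_phase(H):
    """Split H_c(k) into |H_c(k)| and arg H_c(k), as in (2)-(4)."""
    return [abs(h) for h in H], [cmath.phase(h) for h in H]

def directional_transfer_function(Y, X, C, eps=1e-12):
    """D_c(k) = Y_c(k) / (C_c(k) X_c(k)), as in (5)."""
    return [y / (c * x) if abs(c * x) > eps else 0j
            for y, x, c in zip(Y, X, C)]

# Toy two-bin spectra (made-up numbers, for illustration only).
X = [1 + 0j, 2j]
Y = [2 + 0j, -2 + 0j]
C = [2 + 0j, 1 + 0j]
H = transfer_function(Y, X)
mag, phase = magnitude_and_phase(H)
D = directional_transfer_function(Y, X, C)
```

In practice the spectra would come from a windowed FFT of the recorded signals; that step is left out here to keep the sketch short.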

3.2. ILD-based localization

We consider two microphones for localizing a sound source. Signal s(t) propagates through a generic free space with noise and no (or low degree of) reverberation. According to the so-called inverse-square law, the signal received by the two microphones can be modeled as [21–24, 40–42]:

s_{1} (t) = \frac{s (t - T_{1})}{d_{1}} + n_{1} (t)

(6)

and

s_{2} (t) = \frac{s (t - T_{2})}{d_{2}} + n_{2} (t),

(7)

where d₁ and d₂ are the distances and T₁ and T₂ are time delays from source to the first and second microphones, respectively. Also n₁(t) and n₂(t) are additive white Gaussian noises. The relative time shift between the signals is important for TDE but can be ignored in ILD. Therefore, if we find the delay between the two signals and shift the delayed signal in respect to the other one, the signal received by the two microphones can be modeled as:

s_{1} (t) = \frac{s (t)}{d_{1}} + n_{1} (t)

(8)

and

s_{2} (t) = \frac{s (t)}{d_{2}} + n_{2} (t) .

(9)

Now, we assume that the sound source is audible and in a fixed location. Also, it is available during the time interval [0, W] where W is the window size. The energy received by the two microphones can be obtained by integrating the square of the signal over this time interval [21–24, 40–42]:

E_{1} = \int_{0}^{W} s_{1}^{2} (t) dt = \frac{1}{d_{1}^{2}} \int_{0}^{W} s^{2} (t) dt + \int_{0}^{W} n_{1}^{2} (t) dt

(10)

E_{2} = \int_{0}^{W} s_{2}^{2} (t) dt = \frac{1}{d_{2}^{2}} \int_{0}^{W} s^{2} (t) dt + \int_{0}^{W} n_{2}^{2} (t) dt .

(11)

According to (10) and (11), the received energy decreases in relation to the inverse of the square of the distance to the source. These equations lead us to a simple relationship between the energies and distances:

E_{1} \cdot d_{1}^{2} = E_{2} \cdot d_{2}^{2} + η,

(12)

where $η = \int_{0}^{W} [d_{1}^{2} n_{1}^{2} (t) - d_{2}^{2} n_{2}^{2} (t)] dt$ is the error term obtained by multiplying (10) by d₁² and (11) by d₂² and subtracting. If (x₁,y₁) are the coordinates of the first microphone, (x₂,y₂) are the coordinates of the second microphone, and (x_s,y_s) are the coordinates of the sound source with respect to the origin located at the array center, then:

d_{1} = \sqrt{{(x_{1} - x_{s})}^{2} + {(y_{1} - y_{s})}^{2}}

(13)

d_{2} = \sqrt{{(x_{2} - x_{s})}^{2} + {(y_{2} - y_{s})}^{2}} .

(14)

Now using (12), (13), and (14), we can localize the sound source (Section 4.1.).
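The energy relation in (10) to (12) can be sketched for sampled signals as follows. This is a minimal illustration, assuming time-aligned, noise-free signals sampled at a rate `fs`; the function names are hypothetical. With no noise, (12) reduces to E₁d₁² = E₂d₂², so the square root of the energy ratio equals d₂/d₁.

```python
def window_energy(samples, fs):
    """Discrete approximation of the energy integral in (10)/(11):
    sum of squared samples times the sampling period 1/fs."""
    return sum(x * x for x in samples) / fs

def energy_ratio(s1, s2, fs):
    """Ratio E1/E2 of the energies received at the two microphones."""
    return window_energy(s1, fs) / window_energy(s2, fs)

# Toy example: the same source seen at distances d1 = 2 and d2 = 4.
source = [1.0, -1.0, 2.0]
s1 = [x / 2.0 for x in source]
s2 = [x / 4.0 for x in source]
ratio = energy_ratio(s1, s2, fs=8000)  # equals (d2/d1)**2 when noise is absent
```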

3.3. TDE-based localization

Correlation-based methods are the most widely used time delay estimation approaches. These methods use the following simple reasoning for the estimation of time delay. The autocorrelation function of s₁(t) can be written in time domain as [17–20, 31–39]:

R_{s_{1} s_{1}} (τ) = \int_{- \infty}^{+ \infty} s_{1} (t) \cdot s_{1} (t - τ) dt .

(15)

By the duality between the time and frequency domains, the autocorrelation function of s₁(t), whose Fourier transform is S₁(f), can be written in the frequency domain as:

R_{s_{1} s_{1}} (τ) = \int_{- \infty}^{+ \infty} S_{1} (f) \cdot S_{1}^{*} (f) e^{j 2 πfτ} df .

(16)

According to (15) and (16), if the time delay τ is zero, this function's value will be maximized and will be equal to the energy of s₁(t). The cross-correlation of two signals s₁(t) and s₂(t) is defined as:

R_{s_{1} s_{2}} (τ) = \int_{- \infty}^{+ \infty} S_{1} (f) \cdot S_{2}^{*} (f) e^{j 2 πfτ} df .

(17)

If s₂(t) is considered to be the delayed version of s₁(t), this function features a peak at the point equal to the time delay. This delay can be expressed as:

τ_{12} = \arg\max_{τ} R_{s_{1} s_{2}} (τ) .

(18)
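The peak-picking idea behind (18) can be sketched in the time domain for sampled signals. This is a minimal illustration with hypothetical names; it uses the lag convention of the time-domain cross-correlation R(τ) = Σ s₁[n]·s₂[n + τ], so a positive result means that s₂ is a delayed copy of s₁. The search range `max_lag` is an assumption of the example.

```python
def cross_correlation_delay(s1, s2, max_lag):
    """Return the lag (in samples) that maximizes sum_n s1[n] * s2[n + lag].

    Samples outside the recorded range are treated as zero.
    """
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = sum(
            s1[n] * s2[n + lag]
            for n in range(len(s1))
            if 0 <= n + lag < len(s2)
        )
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

# s2 is s1 delayed by three samples.
s1 = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
s2 = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]
lag = cross_correlation_delay(s1, s2, max_lag=5)
```

Dividing the lag by the sampling rate gives the delay in seconds.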

In an overall view, the time delay estimation methods are as follows [17–20, 31–39]:

Correlation-Based Methods: Cross-Correlation (CC), ML, PHAT, Average Square Difference Function (ASDF)
Adaptive Filter-Based Methods: Sync Filter, LMS.

Advantages of PHAT are accurate delay estimation in the case of wideband and quasi-periodic/periodic signals, good performance in noisy and reflective environments, sharper spectrum due to the use of better weighting function and higher recognition rate. Therefore, PHAT is used in cases where signals are detected using arrays of microphones and additive environmental and reflective noises are observed. In such cases, the signal delays cannot be accurately found using typical correlation-based methods as the correlation peaks cannot be precisely extracted. PHAT is a cross-correlation-based method used for finding the time delay between the signals. In PHAT, similar to ML, weighting functions are used along with the correlation function, i.e.,

ϕ_{PHAT} (f) = \frac{1}{|G_{s_{1} s_{2}} (f)|},

(19)

where $G_{s_{1} s_{2}} (f)$ is the cross-correlation-based power spectrum. The overall function used in PHAT for the estimation of delay between two signals is defined as:

R_{s_{1} s_{2}} (τ) = \int_{- \infty}^{+ \infty} ϕ_{PHAT} (f) \cdot G_{s_{1} s_{2}} (f) e^{j 2 πfτ} df

(20)

D_{PHAT} = \arg\max_{τ} R_{s_{1} s_{2}} (τ),

(21)

where D_PHAT is the delay calculated using PHAT. $G_{s_{1} s_{2}} (f)$ is found as:

G_{s_{1} s_{2}} (f) = \int_{- \infty}^{+ \infty} r_{s_{1} s_{2}} (τ) e^{- j 2 πfτ} dτ

(22)

where

r_{s_{1} s_{2}} (τ) = E [s_{1} (t) s_{2} (t + τ)] .

(23)
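The PHAT weighting in (19) to (21) can be illustrated with a small self-contained sketch. To avoid external dependencies it uses a naive O(n²) DFT, which is only suitable for short test signals; a real-time system would use an FFT. The function names, the zero-padding length, the `eps` guard, and the sign convention (cross-spectrum taken as conj(S₁)·S₂, so that a positive result means s₂ lags s₁, matching (23)) are assumptions of this example.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), for short signals only."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def gcc_phat_delay(s1, s2, eps=1e-12):
    """Estimate the delay of s2 relative to s1 (in samples) with PHAT."""
    n = len(s1) + len(s2)  # zero-pad so the circular correlation does not wrap
    S1 = dft(list(s1) + [0.0] * (n - len(s1)))
    S2 = dft(list(s2) + [0.0] * (n - len(s2)))
    G = [a.conjugate() * b for a, b in zip(S1, S2)]   # cross-power spectrum
    W = [g / max(abs(g), eps) for g in G]             # PHAT weighting, (19)
    r = [v.real for v in dft(W, inverse=True)]        # weighted correlation, (20)
    peak = max(range(n), key=lambda i: r[i])          # peak location, (21)
    return peak if peak <= n // 2 else peak - n       # map to a signed lag

s1 = [1.0, -2.0, 3.0, 0.5, 0.0, 0.0, 0.0, 0.0]
s2 = [0.0, 0.0] + s1[:-2]   # s1 delayed by two samples
delay = gcc_phat_delay(s1, s2)
```

Because the weighting keeps only the phase of the cross-spectrum, the weighted correlation of a pure delay is close to a single impulse, which is why the peak is sharper than with plain cross-correlation.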

4. ILD- and PHAT-based angle of arrival and location calculation methods

4.1. Using ILD method

Assuming two microphones are on x-axis and have a distance of R (R = D/2) from origin (Figure 1), we can rewrite (13) and (14) as:

d_{1} = \sqrt{{(R - x_{s})}^{2} + {y_{s}}^{2}}

(24)

d_{2} = \sqrt{{(- R - x_{s})}^{2} + {y_{s}}^{2}} .

(25)

Therefore, we can rewrite (12) as:

(\frac{E_{1}}{E_{2}}) \cdot d_{1}^{2} = d_{2}^{2} + (η / E_{2}) .

(26)

Assuming $m = \frac{E_{1}}{E_{2}}$ and n = η/E₂, (26) is written as:

m [{(R - x_{s})}^{2} + {y_{s}}^{2}] = [{(- R - x_{s})}^{2} + {y_{s}}^{2}] + n .

(27)

Using x and y instead of x_s and y_s, (27) will become:

x^{2} - [\frac{2 R (m + 1)}{m - 1}] x + y^{2} = \frac{n}{m - 1} - R^{2}

(28)

{(x - \frac{R (m + 1)}{m - 1})}^{2} + y^{2} = \frac{1}{m - 1} (n + \frac{4 m}{m - 1} R^{2})

(29)

\begin{cases} {(x - k)}^{2} + y^{2} = l \\ k = \frac{R (m + 1)}{m - 1} \\ l = \frac{1}{m - 1} (n + \frac{4 m}{m - 1} R^{2}) \end{cases} .

(30)

Therefore, the source lies on a circle with center (k, 0) and radius $\sqrt{l}$. Adding a third microphone and pairing it with either the first or the second microphone gives a second circle with a different center and radius. The intersection of the first and second circles gives the source location x and y [22, 40].
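The circle parameters in (30) can be computed directly from the two measured energies. The sketch below is only an illustration under the assumptions of Section 4.1 (microphones at (R, 0) and (−R, 0)); the function name and the default of a zero noise term are choices made for this example.

```python
import math

def ild_circle(E1, E2, R, eta=0.0):
    """Center abscissa k and radius sqrt(l) of the ILD circle in (30).

    E1, E2: energies at the microphones at (R, 0) and (-R, 0).
    eta: error term of (12); zero corresponds to the noise-free case.
    """
    m = E1 / E2
    if abs(m - 1.0) < 1e-12:
        raise ValueError("m = 1: the circle degenerates to the line x = 0")
    n = eta / E2
    k = R * (m + 1.0) / (m - 1.0)
    l = (n + 4.0 * m * R ** 2 / (m - 1.0)) / (m - 1.0)
    return k, math.sqrt(l)

# Noise-free check with a source at (3, 4) and R = 1:
# d1^2 = 20 and d2^2 = 32, and the energies scale as 1/d^2.
k, radius = ild_circle(E1=1.0 / 20.0, E2=1.0 / 32.0, R=1.0)
residual = (3.0 - k) ** 2 + 4.0 ** 2 - radius ** 2   # should be close to 0
```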

4.2. Using PHAT method

Assuming a single frequency sound source with a wavelength equal to λ to have a distance from the center of two microphones equal to r, this source will be in far-field if [51–65]:

r > \frac{2 D^{2}}{λ},

(31)

where D is the distance between the two microphones. In the far-field case, the sound can be considered as having the same angle of arrival at all microphones, as shown in Figure 1. If s₁(t) is the output signal of the first microphone and s₂(t) is that of the second microphone (Figure 1), then, taking into account the environmental noise and the so-called inverse-square law, the signals received by the two microphones can be modeled as (6) and (7). The relative time shift between the signals is important for TDOA but can be ignored in ILD. Conversely, the attenuation coefficients (1/d₁ and 1/d₂) are important for ILD but can be ignored in TDOA. Therefore, assuming T_D is the time delay between the two received signals, the cross-correlation between s₁(t) and s₂(t) is [51–65]:

R_{s_{1} s_{2}} (τ) = \int_{- \infty}^{+ \infty} s_{1} (t) s_{2} (t + τ) dt .

(32)

Since n₁(t) and n₂(t) are independent, we can write

R_{s_{1} s_{2}} (τ) = \int_{- \infty}^{+ \infty} s (t) s (t - T_{D} + τ) dt .

(33)

Now, the time delay between these two signals can be measured as:

τ_{21} = \arg\max_{τ} R_{s_{1} s_{2}} (τ) .

(34)

Correct measurement of the time delay needs the distance between the two microphones to be:

D \leq \frac{λ}{2},

(35)

since when D is greater than $\frac{λ}{2}$ , T_D is greater than π, and therefore, time delay is measured as τ = − (T_D − π). According to Figure 1, the cosine of the angle of arrival is

\cos (ϕ) = \frac{d_{2} - d_{1}}{D} = \frac{(t_{2} - t_{1}) \cdot v_{sound}}{D} = \frac{T_{D} \cdot v_{sound}}{D} = \frac{τ_{21} \cdot v_{sound}}{D} .

(36)

Here, v_sound is sound velocity in air. The delay time τ₂₁ is measurable using the cross-correlation function between the two signals. However, the location of source cannot be measured this way. We can measure the distance between source and each of the microphones as (13) and (14). The difference between these two distances will be

d_{2} - d_{1} = τ_{21} \cdot v_{sound} .

(37)
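Equation (36) translates directly into code once τ₂₁ has been estimated. The following is a minimal sketch with a hypothetical function name; clamping the cosine to [−1, 1] is an assumption added here so that small measurement errors do not push the argument outside the domain of the arccosine.

```python
import math

def angle_of_arrival_deg(tau21, D, v_sound):
    """Far-field angle of arrival from (36): cos(phi) = tau21 * v_sound / D."""
    c = tau21 * v_sound / D
    c = max(-1.0, min(1.0, c))   # guard against |c| slightly above 1
    return math.degrees(math.acos(c))

# Zero delay means the source is broadside to the microphone pair.
broadside = angle_of_arrival_deg(0.0, D=2.0, v_sound=340.0)
```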

Using x and y instead of x_s and y_s, τ₂₁ will be

τ_{21} = \frac{\sqrt{{(x - x_{2})}^{2} + {(y - y_{2})}^{2}} - \sqrt{{(x - x_{1})}^{2} + {(y - y_{1})}^{2}}}{v_{sound}} .

(38)

This is an equation with two unknowns, x and y. Assuming the distances of both microphones from the origin to be R (D = 2R) and both located on x axis,

τ_{21} = \frac{\sqrt{{(x + R)}^{2} + y^{2}} - \sqrt{{(x - R)}^{2} + y^{2}}}{v_{sound}} .

(39)

Simplifying the above equation will result in:

\begin{cases} y^{2} = a \cdot x^{2} + b \\ a = \frac{4 R^{2}}{{v_{sound}}^{2} \cdot {τ_{21}}^{2}} - 1 \\ b = \frac{{v_{sound}}^{2} \cdot {τ_{21}}^{2}}{4} - R^{2} \end{cases},

(40)

where y has hyperbolic geometrical location relative to x, as shown in Figure 1. In order to find x and y, we need to add another equation to (38) for the first and a new (third) microphone so that:

\begin{cases} τ_{21} = \frac{\sqrt{{(x - x_{2})}^{2} + {(y - y_{2})}^{2}} - \sqrt{{(x - x_{1})}^{2} + {(y - y_{1})}^{2}}}{v_{sound}} \\ τ_{31} = \frac{\sqrt{{(x - x_{3})}^{2} + {(y - y_{3})}^{2}} - \sqrt{{(x - x_{1})}^{2} + {(y - y_{1})}^{2}}}{v_{sound}} \end{cases} .

(41)

It is noticeable that these are non-linear equations (Hyperbolic-intersection Closed-Form method) and numerical analysis should be used to calculate x and y, which will increase localization processing times. Also in this case, the solution may not converge.
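As a numerical sanity check of the hyperbola in (40), the sketch below computes the coefficients a and b for microphones at (R, 0) and (−R, 0) and verifies that a known source position satisfies y² = a·x² + b. The function name and the test geometry are assumptions of this example; the coefficients follow from squaring (39) twice, which gives a = 4R²/(v_sound·τ₂₁)² − 1 and b = (v_sound·τ₂₁)²/4 − R².

```python
def hyperbola_coefficients(tau21, R, v_sound):
    """Coefficients a and b of y^2 = a x^2 + b for mics at (R, 0), (-R, 0)."""
    c2 = (v_sound * tau21) ** 2      # squared range difference (d2 - d1)^2
    if c2 == 0.0:
        raise ValueError("tau21 = 0: the hyperbola degenerates to the line x = 0")
    a = 4.0 * R ** 2 / c2 - 1.0
    b = c2 / 4.0 - R ** 2
    return a, b

# Source at (3, 4), R = 1: d1 = sqrt(20), d2 = sqrt(32).
v = 340.0
tau = (32.0 ** 0.5 - 20.0 ** 0.5) / v
a, b = hyperbola_coefficients(tau, R=1.0, v_sound=v)
residual = a * 3.0 ** 2 + b - 4.0 ** 2   # should be close to 0
```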

5. TDE-ILD-based 2D sound source localization

Using either TDE or ILD alone to calculate the source location (x and y) in 2D cases needs at least three microphones. Using TDE and ILD simultaneously lets us calculate the source location with only two microphones. According to (26) and (37), and given that in a high-SNR environment the noise term η/E₂ can be neglected, after some algebraic manipulations we derive [40]

{(x_{s} - x_{1})}^{2} + {(y_{s} - y_{1})}^{2} = {(\frac{τ_{21} \cdot v_{sound}}{1 - \sqrt{m}})}^{2} = r_{1}^{2}

(42)

and

{(x_{s} - x_{2})}^{2} + {(y_{s} - y_{2})}^{2} = {(\frac{τ_{21} \cdot v_{sound} \cdot \sqrt{m}}{1 - \sqrt{m}})}^{2} = r_{2}^{2}

(43)

The intersection of the two circles determined by (42) and (43), with centers (x₁,y₁) and (x₂,y₂) and radii r₁ and r₂, respectively, gives the exact source position. In the case E₁ = E₂ (m = 1), both the hyperbola and the circle determined by (26) and (37) degenerate into a line, the perpendicular bisector of the microphone pair. Consequently, there will be no intersection to determine the source position. To obtain a closed-form solution to this problem, we transform the expressions into [40]:

x_{1} x_{s} + y_{1} y_{s} = \frac{1}{2} (k_{1}^{2} - r_{1}^{2} + R_{s}^{2})

(44)

and

x_{2} x_{s} + y_{2} y_{s} = \frac{1}{2} (k_{2}^{2} - r_{2}^{2} + R_{s}^{2}),

(45)

where

k_{1}^{2} = x_{1}^{2} + y_{1}^{2}, k_{2}^{2} = x_{2}^{2} + y_{2}^{2} and R_{s}^{2} = x_{s}^{2} + y_{s}^{2} .

(46)

Rewriting (44) and (45) into matrix form results in:

\begin{bmatrix} x_{1} & y_{1} \\ x_{2} & y_{2} \end{bmatrix} \begin{bmatrix} x_{s} \\ y_{s} \end{bmatrix} = \frac{1}{2} \left\{ \begin{bmatrix} k_{1}^{2} - r_{1}^{2} \\ k_{2}^{2} - r_{2}^{2} \end{bmatrix} + R_{s}^{2} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \right\}

(47)

and

\begin{bmatrix} x_{s} \\ y_{s} \end{bmatrix} = {\begin{bmatrix} x_{1} & y_{1} \\ x_{2} & y_{2} \end{bmatrix}}^{- 1} \left( \frac{1}{2} \left\{ \begin{bmatrix} k_{1}^{2} - r_{1}^{2} \\ k_{2}^{2} - r_{2}^{2} \end{bmatrix} + R_{s}^{2} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \right\} \right) .

(48)

If we define

p = \begin{bmatrix} p_{1} \\ p_{2} \end{bmatrix} = \frac{1}{2} {\begin{bmatrix} x_{1} & y_{1} \\ x_{2} & y_{2} \end{bmatrix}}^{- 1} \begin{bmatrix} 1 \\ 1 \end{bmatrix}

(49)

and

q = \begin{bmatrix} q_{1} \\ q_{2} \end{bmatrix} = \frac{1}{2} {\begin{bmatrix} x_{1} & y_{1} \\ x_{2} & y_{2} \end{bmatrix}}^{- 1} \begin{bmatrix} k_{1}^{2} - r_{1}^{2} \\ k_{2}^{2} - r_{2}^{2} \end{bmatrix},

(50)

then the source coordinates can be expressed with respect to Rs:

X = \begin{bmatrix} x_{s} \\ y_{s} \end{bmatrix} = \begin{bmatrix} q_{1} + p_{1} R_{s}^{2} \\ q_{2} + p_{2} R_{s}^{2} \end{bmatrix} .

(51)

Inserting (46) into (51), the solution to Rs is obtained as:

R_{s}^{2} = \frac{o_{1} \pm o_{2}}{o_{3}},

(52)

where

o_{1} = 1 - 2 (p_{1} q_{1} + p_{2} q_{2}),

o_{2} = \sqrt{{(1 - 2 (p_{1} q_{1} + p_{2} q_{2}))}^{2} - 4 (p_{1}^{2} + p_{2}^{2}) (q_{1}^{2} + q_{2}^{2})},

o_{3} = 2 (p_{1}^{2} + p_{2}^{2}) .

The positive root gives the square of distance from source to origin. Substituting Rs into (51), the final source coordinate will be obtained [40].
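The closed-form procedure of (44) to (51) can be sketched as follows. This is an illustration rather than the authors' implementation: the function name is hypothetical, and the quadratic in R_s² is obtained here by substituting (51) into R_s² = x_s² + y_s², which gives (p₁² + p₂²)R_s⁴ + (2(p₁q₁ + p₂q₂) − 1)R_s² + (q₁² + q₂²) = 0. Both non-negative roots are returned as candidate positions, since the choice between them needs prior information about the search region.

```python
import math

def closed_form_location(mic1, mic2, r1, r2):
    """Candidate source positions from two microphone positions and ranges."""
    (x1, y1), (x2, y2) = mic1, mic2
    det = x1 * y2 - x2 * y1
    if abs(det) < 1e-12:
        raise ValueError("microphone coordinate matrix is singular")
    # Inverse of [[x1, y1], [x2, y2]].
    inv = ((y2 / det, -y1 / det), (-x2 / det, x1 / det))
    b1 = (x1 ** 2 + y1 ** 2) - r1 ** 2      # k1^2 - r1^2
    b2 = (x2 ** 2 + y2 ** 2) - r2 ** 2      # k2^2 - r2^2
    p1 = 0.5 * (inv[0][0] + inv[0][1])      # (49)
    p2 = 0.5 * (inv[1][0] + inv[1][1])
    q1 = 0.5 * (inv[0][0] * b1 + inv[0][1] * b2)   # (50)
    q2 = 0.5 * (inv[1][0] * b1 + inv[1][1] * b2)
    # Quadratic A*u^2 + B*u + C = 0 in u = Rs^2.
    A = p1 ** 2 + p2 ** 2
    B = 2.0 * (p1 * q1 + p2 * q2) - 1.0
    C = q1 ** 2 + q2 ** 2
    disc = max(B * B - 4.0 * A * C, 0.0)    # clip tiny negative rounding errors
    roots = [(-B + s * math.sqrt(disc)) / (2.0 * A) for s in (1.0, -1.0)]
    return [(q1 + p1 * u, q2 + p2 * u) for u in roots if u >= 0.0]   # (51)

# Source at (3, 4); microphones chosen so that the matrix is invertible.
candidates = closed_form_location((1.0, 0.5), (-1.0, 1.0),
                                  r1=math.sqrt(16.25), r2=5.0)
```

With both microphones on the x-axis the determinant is zero and the function raises an error, which is the singular case discussed next.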

However, a rational solution requires prior information about the evaluation region. It is known that a linear array produces two mirror points simultaneously. If the two microphones are on the x-axis (y₁ = y₂ = 0) at distance R from the origin (Figure 1), the coordinate matrix in (49) and (50) has a zero second column and is singular, so p and q cannot be found. Therefore, the closed-form solution above cannot be used with such a microphone arrangement. However, this arrangement simplifies the equations. According to (26) and (37), we can intersect the circle and the hyperbola to find the source location, x and y. To do so, we first rewrite (42) and (43), respectively, as

{(x_{s} - x_{1})}^{2} + {(y_{s} - y_{1})}^{2} = r_{1}^{2}

(53)

and

{(x_{s} - x_{2})}^{2} + {(y_{s} - y_{2})}^{2} = r_{2}^{2} .

(54)

Substituting the microphone coordinates (x₁ = R, x₂ = −R, y₁ = y₂ = 0) and writing x and y instead of x_s and y_s, we will have

{(x - R)}^{2} + {(y - 0)}^{2} = r_{1}^{2}

(55)

and

{(x + R)}^{2} + {(y - 0)}^{2} = r_{2}^{2};

(56)

therefore,

r_{1}^{2} - {(x - R)}^{2} = r_{2}^{2} - {(x + R)}^{2}

(57)

which results in

r_{2}^{2} - r_{1}^{2} = 4 Rx .

(58)

Hence, the sound source location can be calculated as:

x = (r_{2}^{2} - r_{1}^{2}) / 4 R

(59)

and

y = \pm \sqrt{r_{1}^{2} - {(x - R)}^{2}}

(60)

Recall that a linear array produces two mirror points simultaneously, as the ± sign in (60) shows. This means that a 2D sound source can be localized only in a half-plane.
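A minimal Python sketch of (59) and (60) is given below, assuming the ranges r₁ and r₂ have already been estimated and the microphones sit at (+R, 0) and (−R, 0). The function name `locate_linear_array` is hypothetical. Both mirror points are returned, since the linear array alone cannot distinguish them.

```python
import math

def locate_linear_array(r1, r2, R):
    """Intersect the two range circles for microphones at (+R, 0) and (-R, 0)."""
    x = (r2 ** 2 - r1 ** 2) / (4.0 * R)       # (59)
    y_sq = r1 ** 2 - (x - R) ** 2             # radicand of (60)
    if y_sq < 0:
        raise ValueError("ranges are inconsistent with the array geometry")
    y = math.sqrt(y_sq)
    return (x, y), (x, -y)                    # the two mirror points
```

For example, with R = 1 m and a source at (10, 10), r₁² = 181 and r₂² = 221, so x = (221 − 181)/4 = 10 and y = ±10.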

6. Simulations of TDE-ILD-based method and discussion

Before using the introduced method for sound source localization in low-reverberation outdoor cases, we first performed simulations. We evaluated the accuracy of this method over a variety of SNRs for some environmental noises. We considered two microphones on the x-axis (y₁ = y₂ = 0) at 1-m distance from the origin (x₁ = 1 m, x₂ = −1 m, D = 2R = 2 m) (Figure 1). In order to use PHAT for the calculation of the time delay between the signals of the two microphones, we used a helicopter sound (a wideband and quasi-periodic signal) with a length of approximately 4 s, downloaded from the internet (http://www.freesound.com), as our sound source. For different source locations and for an ambient temperature of 15°C, we first calculated the sound speed in air using:

v_{sound} = 20.05 \sqrt{273.15 + Temperature (Centigrade)} .

(61)

Then, we calculated d₁ and d₂ using (24) and (25), and using (37), we calculated the time delay between the received signals of the two microphones. For positive time delays, i.e., a sound source nearer to the first microphone (mic1 in Figure 1), we delayed the second microphone signal with respect to the first microphone signal, and for negative values, i.e., a sound source nearer to the second microphone (mic2 in Figure 1), we did the opposite. Then, using (6) and (7), we divided the first microphone signal by d₁ and the second microphone signal by d₂ to obtain the correct attenuation of the signals according to the source distances from the microphones. Finally, we calculated the source location using the proposed TDE-ILD method (Section 5) over a variety of SNRs for some environmental noises.
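The signal synthesis described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the microphones are at (+R, 0) and (−R, 0), the delay is rounded to the nearest sample, attenuation is modeled as division by distance, and the names `speed_of_sound` and `simulate_two_mics` are hypothetical.

```python
import numpy as np

def speed_of_sound(temp_c):
    """Sound speed in air (m/s) from the temperature in degrees Celsius, (61)."""
    return 20.05 * np.sqrt(273.15 + temp_c)

def simulate_two_mics(source, xs, ys, R=1.0, fs=96000, temp_c=15.0):
    """Return (s1, s2, delay_samples) for microphones at (+R, 0) and (-R, 0)."""
    source = np.asarray(source, dtype=float)
    d1 = np.hypot(xs - R, ys)                  # distance source -> mic 1
    d2 = np.hypot(xs + R, ys)                  # distance source -> mic 2
    v = speed_of_sound(temp_c)
    delay = int(round((d2 - d1) / v * fs))     # > 0: source nearer to mic 1
    pad = np.zeros(abs(delay))
    if delay >= 0:                             # delay mic 2 relative to mic 1
        s1 = np.concatenate([source, pad])
        s2 = np.concatenate([pad, source])
    else:                                      # delay mic 1 relative to mic 2
        s1 = np.concatenate([pad, source])
        s2 = np.concatenate([source, pad])
    return s1 / d1, s2 / d2, delay             # attenuation by distance
```

Noise at a chosen SNR would then be added to `s1` and `s2` before running the localization.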

Simulation results for a source location of (x = 10, y = 10) are shown in Figure 2 for a number of signal-to-noise ratios (SNRs) with white Gaussian, pink, and babble noises from ‘NATO RSG-10 Noise Data,’ 16-bit quantization, and 96,000-Hz sampling frequency (hence, we upsampled NATO RSG-10 from its original 190,00 to 96,000 Hz). Assuming a high audio sampling rate, fractional delays are negligible, and delays are rounded to the nearest sampling point. The simulation results show larger localization errors for SNRs lower than 10 dB. This occurs because of the use of ILD. Hence, this method gives accurate localization only when a single source is dominant. Spectral subtraction and source counting are useful for reducing the localization error at SNRs lower than 0 dB.

7. Our proposed TDE-ILD-HRTF method

Using the TDE-ILD-based method, dual-microphone 2D sound source localization is possible. However, it is known that a linear array in the TDE-ILD-based method produces two mirror points simultaneously (half-plane localization) [40]. Also, the TDE-ILD-based simulation results (Section 6) show that the ILD-based method requires that only one dominant, high-SNR source be active in the localization area. Our proposed TDE-ILD-HRTF method tries to solve these problems using source counting, noise reduction using spectral subtraction, and HRTF.

7.1. Source counting method

According to the previous discussion, the ILD-based method needs source counting to verify that only one dominant source is active for high-resolution localization. If more than one source is active in the localization area, m = E₁/E₂ cannot be calculated correctly. Therefore, we need to count the active and dominant sound sources and localize a sound source only if a single source is dominant enough. PHAT gives us the cross-correlation vector of the two microphone output signals. The number of dominant peaks of the cross-correlation vector gives us the number of dominant sound sources [66]. We first consider a single source whose signal is periodic:

s (t) = s (t + T) .

(62)

If the signal window is longer than T, the cross-correlation between the output signals of the two microphones has one dominant peak and some weak peaks at distances that are multiples of T. However, a signal window approximately equal to T, or a non-periodic source signal, leads to only one dominant peak in the cross-correlation between the output signals of the two microphones. The position of this peak equals the delay, in samples, between the two microphones' output signals. Therefore, if one sound source is dominant in the localization area, there will be only one dominant peak in the cross-correlation vector. Now, we consider two sound sources s(t) and s'(t) in a high-SNR localization area. According to (6) and (7), we have

s_{1} (t) = s (t - T_{1}) + s' (t - T'_{1})

(63)

s_{2} (t) = s (t - T_{2}) + s' (t - T'_{2}) .

(64)

According to (32), we have

R_{s_{1} s_{2}} (τ) = \int_{- \infty}^{+ \infty} (s (t - T_{1}) + s' (t - T'_{1})) (s (t - T_{2} + τ) + s' (t - T'_{2} + τ)) dt

(65)

R_{s_{1} s_{2}} (τ) = R1_{s_{1} s_{2}} (τ) + R2_{s_{1} s_{2}} (τ) + R3_{s_{1} s_{2}} (τ) + R4_{s_{1} s_{2}} (τ)

(66)

where

R1_{s_{1} s_{2}} (τ) = \int_{- \infty}^{+ \infty} s (t - T_{1}) \cdot s (t - T_{2} + τ) dt

(67)

R2_{s_{1} s_{2}} (τ) = \int_{- \infty}^{+ \infty} s (t - T_{1}) \cdot s' (t - T'_{2} + τ) dt

(68)

R3_{s_{1} s_{2}} (τ) = \int_{- \infty}^{+ \infty} s' (t - T'_{1}) \cdot s (t - T_{2} + τ) dt

(69)

R4_{s_{1} s_{2}} (τ) = \int_{- \infty}^{+ \infty} s' (t - T'_{1}) \cdot s' (t - T'_{2} + τ) dt .

(70)

Using (34), $τ_{1} = T_{2} - T_{1}$ gives a maximum value for $R1_{s_{1} s_{2}} (τ)$, $τ_{2} = T'_{2} - T_{1}$ gives a maximum value for $R2_{s_{1} s_{2}} (τ)$, $τ_{3} = T_{2} - T'_{1}$ gives a maximum value for $R3_{s_{1} s_{2}} (τ)$, and $τ_{4} = T'_{2} - T'_{1}$ gives a maximum value for $R4_{s_{1} s_{2}} (τ)$. Therefore, we will have four peak values in the cross-correlation vector. However, because (67) and (70) are cross-correlation functions of a signal with a delayed version of itself, whereas (68) and (69) are cross-correlation functions of two different signals, the maxima at τ₁ and τ₄ are dominant with respect to those at τ₂ and τ₃. We conclude that, in an area with two dominant sound sources, the cross-correlation vector has two dominant values; likewise, for more than two dominant sound sources, the number of dominant values equals the number of sources, analogous to the multiple power spectrum peaks in DOA-based multiple sound source beamforming methods [16]. Therefore, by counting the dominant values of the cross-correlation vector, we can find the number of active and dominant sound sources in the localization area.
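The peak-counting idea can be sketched in Python with a PHAT-weighted cross-correlation. This is a minimal illustration, not the implementation used in the paper: the function names, the relative threshold `rel_threshold` and the search range `max_lag` are assumptions chosen for the example.

```python
import numpy as np

def gcc_phat(s1, s2):
    """PHAT-weighted cross-correlation; zero lag sits at index len(cc) // 2."""
    n = len(s1) + len(s2)                        # zero-pad to avoid wrap-around
    cross = np.fft.rfft(s2, n) * np.conj(np.fft.rfft(s1, n))
    cross /= np.maximum(np.abs(cross), 1e-12)    # keep only the phase (PHAT)
    return np.fft.fftshift(np.fft.irfft(cross, n))

def count_dominant_sources(s1, s2, max_lag, rel_threshold=0.5):
    """Count local maxima within +/-max_lag that reach rel_threshold * highest peak."""
    cc = gcc_phat(s1, s2)
    mid = len(cc) // 2
    window = cc[mid - max_lag: mid + max_lag + 1]
    inner = window[1:-1]
    is_peak = (inner > window[:-2]) & (inner >= window[2:])
    strong = inner >= rel_threshold * window.max()
    lags = np.arange(-max_lag + 1, max_lag)[is_peak & strong]
    return len(lags), lags                       # lags are in samples
```

With this sign convention, a positive lag means that the second signal is delayed with respect to the first. In practice `max_lag` would be set from the microphone spacing, for example D · fs / v_sound samples.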

7.2. Noise reduction using spectral subtraction

In order to apply ILD in TDE-ILD-based dual-microphone 2D sound source localization, source counting is used to verify that one dominant, high-SNR source is active in the localization area. Furthermore, spectral subtraction can be used for noise reduction, thereby increasing the SNR of the active source. Depending on the background noise, such as wind, rain, and babble, a background spectrum estimator can also be considered. In most practical cases, we can assume that the noisy signal can be modeled as the sum of the clean signal and the noise [67]:

s_{n} (t) = s (t) + n (t) .

(71)

Because the signal and the noise are generated by independent sources, they are considered uncorrelated. Therefore, the noisy spectrum can be written as:

S_{n} (w) = S (w) + N (w) .

(72)

During silent periods, i.e., periods without the target sound, the background noise spectrum can be estimated, considering the noise to be stationary. Then, the noise magnitude spectrum can be subtracted from the noisy input magnitude spectrum. In non-stationary noise cases, an adaptive background noise spectrum estimation procedure can be used [67].
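A minimal Python sketch of magnitude spectral subtraction for stationary noise follows. It assumes that a noise-only recording from a silent period is available; the frame length, the Hann window with 50% overlap, and the spectral floor are illustrative choices rather than values taken from the paper, and the function names are hypothetical.

```python
import numpy as np

def _frames(x, frame, hop):
    """Slice x into overlapping frames (one frame per row)."""
    n = 1 + (len(x) - frame) // hop
    idx = np.arange(frame)[None, :] + hop * np.arange(n)[:, None]
    return x[idx]

def spectral_subtraction(noisy, noise_only, frame=1024, floor=0.02):
    """Subtract the average noise magnitude spectrum from each frame of `noisy`."""
    hop = frame // 2
    win = np.hanning(frame + 1)[:-1]           # periodic Hann, sums to 1 at 50% overlap
    noise_spec = np.fft.rfft(_frames(noise_only, frame, hop) * win, axis=1)
    noise_mag = np.abs(noise_spec).mean(axis=0)            # estimate of |N(w)|
    spec = np.fft.rfft(_frames(noisy, frame, hop) * win, axis=1)
    mag = np.abs(spec)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)   # floor avoids negative magnitudes
    clean = np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), n=frame, axis=1)
    out = np.zeros(hop * (len(clean) - 1) + frame)
    for i, f in enumerate(clean):              # overlap-add, noisy phase is kept
        out[i * hop: i * hop + frame] += f
    return out
```

The output is slightly shorter than the input because only complete frames are processed; the first and last half-frames are not fully overlapped.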

7.3. Using HRTF method

Using a linear array in the TDE-ILD-based dual-microphone 2D sound source localization method produces two mirror points simultaneously. Adding an HRTF-inspired idea makes whole-plane dual-microphone 2D sound source localization possible. This idea was published in [68] and is reviewed here. Researchers have used an artificial ear with a spiral shape, a special type of pinna with a varying distance from a microphone placed at the center of its spiral [30]. We instead consider a half-cylinder [68]. With such a reflector, a constant notch position is created for all angles of arrival of the sound signal (0 to 180°) (Figure 3). However, as shown in the figure, a complete half-cylinder scatters the sound waves from sources behind it. To overcome this problem, we consider slits on the surface of the half-cylinder (Figure 3). If d is the distance between the reflector (half-cylinder) and the microphone (placed at the center), a notch is created when d equals a quarter of the sound wavelength, λ, plus any multiple of λ/2. For such wavelengths, the incident waves are reduced by the reflected waves [30]:

n \cdot (\frac{λ}{2}) + (\frac{λ}{4}) = d, \quad n = 0, 1, 2, 3, \ldots

(73)

These notches will appear at the following frequencies:

f = \frac{c}{λ} = \frac{(2 n + 1) . c}{4 d} .

(74)
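As a quick numerical illustration of (74), the Python sketch below lists the notch frequencies that fall below a chosen upper limit. The default sound speed and upper frequency are assumptions for the example, and the function name is hypothetical.

```python
def notch_frequencies(d, c=340.3, f_max=48000.0):
    """Notch frequencies f_n = (2n + 1) c / (4 d), n = 0, 1, 2, ..., up to f_max (Hz)."""
    freqs = []
    n = 0
    while (2 * n + 1) * c / (4.0 * d) <= f_max:
        freqs.append((2 * n + 1) * c / (4.0 * d))
        n += 1
    return freqs
```

For instance, assuming d = 1 cm and c = 340 m/s, the notches below 48 kHz lie at about 8.5, 25.5, and 42.5 kHz.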

Covering only microphone 2 in Figure 1 by reflector, calculating the interaural spectral difference results in

\begin{array}{ll} |ΔH (f)| & = |10 \log_{10} H_{1} (f) - 10 \log_{10} H_{2} (f)| \\ & = |10 \log_{10} \frac{H_{1} (f)}{H_{2} (f)}| . \end{array}

(75)

High |ΔH(f)| values indicate that the sound source is in front, while negligible values indicate that sound source is at the back. One important point is that in order to have the same spectrum in both microphones when the sound source is at the back, careful design of the slits is necessary.
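The front/back decision based on (75) can be sketched in Python as follows. The decision rule used here, the mean of |ΔH(f)| over all frequency bins compared with a threshold in dB, is an assumption made for illustration; the paper does not specify how |ΔH(f)| is aggregated, and the threshold would have to be tuned on real recordings. The function name is hypothetical.

```python
import numpy as np

def is_source_in_front(sig_open, sig_covered, threshold_db=1.0):
    """Compare the spectra of the uncovered and the reflector-covered microphone.

    Both signals must have the same length.
    Returns (front, delta_h), where delta_h is |ΔH(f)| per frequency bin in dB.
    """
    win = np.hanning(len(sig_open))
    h1 = np.abs(np.fft.rfft(sig_open * win)) + 1e-12     # small offset avoids log(0)
    h2 = np.abs(np.fft.rfft(sig_covered * win)) + 1e-12
    delta_h = np.abs(10.0 * np.log10(h1 / h2))           # (75)
    return bool(delta_h.mean() > threshold_db), delta_h
```

A source in front adds a reflected copy of the signal at the covered microphone, which produces the notches of (74) and therefore a large |ΔH(f)|; for a source at the back, the two spectra should be nearly equal.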

7.4. Extension of dimensions to three

Because a half-cylinder reflector is used in the proposed localization system, this approach is only applicable in 2D cases. Using a half-sphere reflector instead of the half-cylinder makes it usable in 3D sound source localization (Figure 4). Adding the half-sphere reflector only to microphone 2 in Figure 5 allows the localization system to localize sound sources in 3D cases.

For the 3D case, we consider a plane that passes through the source position and the x-axis and makes an angle θ with the x-y plane; we call it the source-plane (Figure 5). This plane contains microphones 1 and 2. Under these assumptions, using (36), we can calculate ϕ, which is the angle of arrival in the source-plane and is equal to the angle of arrival in the x-y plane. Then, using (59) and (60), we can also calculate x and y in the 2D source-plane (not in the x-y plane of the 3D case) and therefore the sound source distance $r = \sqrt{x^{2} + y^{2}}$. Introducing a new microphone (mic3 in Figure 5) with y = 0 and x = z = R helps us calculate the angle θ. Therefore, using (36), we can calculate θ as:

θ = \cos^{-1} (\frac{v_{sound} \cdot τ_{31}}{R}) .

(76)

Using the half-sphere reflector decreases the accuracy of the time delay and intensity level difference estimation between microphones 1 and 2, because it changes the spectrum of the second microphone's signal. However, covering the third microphone instead of the second with the half-sphere reflector only decreases the accuracy of the time delay estimation between the first and third microphones (Figure 5). Multiplying the third microphone's spectrum by the inverse of the notch-filter response increases the accuracy. Now, using r, ϕ, and θ, we can calculate the source location x, y, and z in the 3D case:

\begin{cases} x = r \cdot \cos (ϕ) \cdot \sin (θ) \\ y = r \cdot \sin (ϕ) \cdot \sin (θ) \\ z = r \cdot \cos (θ) \end{cases}

(77)
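The angle calculation of (76) and the coordinate conversion of (77) can be sketched in Python as below. The clipping of the cosine argument is an added safeguard against measurement noise pushing it slightly outside [−1, 1]; it and the function names are assumptions, not part of the original method.

```python
import math

def angle_from_delay(tau, baseline, v_sound):
    """Far-field angle (rad) with cos(angle) = v_sound * tau / baseline, as in (76)."""
    c = v_sound * tau / baseline
    c = max(-1.0, min(1.0, c))          # guard against rounding or noise overshoot
    return math.acos(c)

def to_cartesian(r, phi, theta):
    """Source coordinates (x, y, z) from r, phi and theta, as written in (77)."""
    return (r * math.cos(phi) * math.sin(theta),
            r * math.sin(phi) * math.sin(theta),
            r * math.cos(theta))
```

For example, `angle_from_delay(tau_31, R, v_sound)` evaluates (76), and a zero delay gives an angle of 90°.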

The reasons for choosing the shape of sphere for the reflector are as follows [69]. The simplest type of reflector is a plane reflector introduced to direct signal in a desired direction. Clearly, using this type of reflector, the distance between reflector and microphone, d in (73), varies with respect to source position in 3D cases leading to a change in notch position within spectrum. The change in notch position may not be suitable as it might occur out of the spectral band of interest. To better adjust the energy in the forward direction, the geometrical shape of the plane reflector must be changed so as not to allow radiation in the back and side directions.

A possible arrangement for this purpose consists of two plane reflectors joined on one side to form a corner. This type of reflector returns the signal exactly in the direction from which it is received. Because of this feature, the reflected wave acquired by the microphone is unique, which is our aim, whereas more than one reflected wave would produce a higher energy at the microphone than a single reflected wave and hence a deep notch. However, with this type of reflector, d still varies with the source position in 3D cases, and d determines the notch position in the spectrum. It has been shown that if a beam of parallel waves is incident upon a parabolic reflector, the radiation focuses at a spot known as the focal point. In a spherical reflector, this point does not coincide with the center, although the center has equal distance (d) to all points of the reflector surface. Because of this feature, the reflected wave received by the microphone is unique, whatever the position of the sound source in 3D cases, and belongs to the wave passing through the center. The center of the spherical reflector with radius R is located at O (Figure 6). The sound wave axis strikes the reflector at B. From the law of reflection and the angle geometry of parallel lines, the marked angles are equal. Hence, BFO is an isosceles triangle. Dropping the median from F onto BO and using trigonometry, the focal point is calculated as:

\frac{R}{2 b} = cos (θ) \to b = \frac{R}{2 cos θ}

(78)

\to F = R - \frac{R}{2 cos (θ)} .

(79)

F is not equal to R for all θ values. The half-spherical reflector scatters the sound waves hitting it from the back. Therefore, we consider some slits on its surface (Figure 4). The slits need to be designed so that both microphones have the same spectrum when the sound source is at the back. When a plane wave hits a barrier with a single circular slit narrower than the signal wavelength (λ), the wave bends and emerges from the slit as a circular wave [70]. If L is the distance between the slit and the viewing screen and d is the slit width, then, based on the assumption that L ≫ d (Fraunhofer diffraction), the wave intensity distribution observed (on a screen) at an angle θ with respect to the incident direction is given by:

\text{if } k = \frac{π \cdot d}{λ} \sin (θ) \to \frac{I (θ)}{I (0)} = \frac{\sin^{2} (k)}{k^{2}},

(80)

where I(θ) is wave intensity in direction of observation and I(0) is maximum wave intensity of diffraction pattern (central fringe). This relationship will have a value of zero each time sin²(k) = 0. This occurs when

k = \pm mπ \quad \text{or} \quad \frac{π \cdot d}{λ} \sin (θ) = \pm mπ,

(81)

yielding the following condition for observing minimum wave intensity from a single circular slit:

\sin (θ) = \frac{mλ}{d}, \quad m = \pm 1, \pm 2, \ldots

(82)

This relationship is satisfied for non-zero integer values of m. Increasing values of m give minima at correspondingly larger angles. The first minimum is found for m = 1, the second for m = 2, and so forth. If $\frac{d}{λ} \sin (θ)$ is less than one for all values of θ, i.e., when the size of the aperture is smaller than a wavelength (d < λ), there will be no minima. As we need more than one circular slit on the surface of the half-sphere reflector, we consider a parallel wave incident on a barrier that consists of two closely spaced narrow circular slits S1 and S2. The narrow circular slits split the incident wave into two coherent waves. After passing through the slits, the two waves spread out due to diffraction and interfere with one another. If this transmitted wave falls on a screen some distance away, an interference pattern of bright and dark fringes is observed on the screen, with the bright fringes corresponding to regions of maximum wave intensity and the dark fringes to regions of minimum wave intensity. For all points on the screen where the path difference is an integer multiple of the wavelength, the two waves from the slits S1 and S2 arrive in phase, and bright fringes are observed. Thus, the condition for producing bright fringes is as in (82). Similarly, dark fringes are produced on the screen if the two waves arriving from slits S1 and S2 are exactly out of phase. This happens if the path difference between the two waves is an odd integer multiple of half-wavelengths:

\sin (θ) = \frac{(m + \frac{1}{2}) \cdot λ}{d}, \quad m = 0, \pm 1, \pm 2, \ldots

(83)

Of course, this condition changes when a half-sphere is used instead of a plane (Figure 7): because the slits S1 and S2 are at the same distance R from the center, the two waves from them arrive in phase at the center, and bright fringes are observed there. This magnification of the signal intensity is not suitable for estimating the intensity level difference between two microphones when one of them is covered by this reflector. However, covering the third microphone with the half-sphere reflector is suitable, since there is no need to estimate the intensity level difference between microphones 1 and 3.
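The single-slit relations (80) to (82) for a plane barrier can be checked numerically with the short Python sketch below. It only illustrates the Fraunhofer pattern of one slit and does not model the half-sphere geometry; the function names are hypothetical.

```python
import numpy as np

def single_slit_intensity(theta, d, wavelength):
    """Relative intensity I(θ)/I(0) = sin²(k)/k² with k = π d sin(θ)/λ, as in (80)."""
    k = np.pi * d * np.sin(theta) / wavelength
    return np.sinc(k / np.pi) ** 2            # np.sinc(x) = sin(πx)/(πx)

def minima_angles(d, wavelength):
    """Angles (rad) of the minima sin(θ) = mλ/d for m = 1, 2, ..., as in (82)."""
    m_max = int(np.floor(d / wavelength))     # no minima exist when d < λ
    return [float(np.arcsin(min(1.0, m * wavelength / d)))
            for m in range(1, m_max + 1)]
```

For a slit narrower than the wavelength, `minima_angles` returns an empty list, in line with the statement that there are no minima when d < λ.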

8. TDE-ILD-HRTF approach algorithm

According to previous discussions, the following steps can be considered for our proposed method:

1. Set up the microphones and hardware.
2. Calculate the amplification normalizing factor of the sound recording hardware set (microphone, preamplifier, and sound card).
3. Apply voice activity detection. Is there a valuable signal? Yes → go to 4; no → go to 3.
4. Obtain s₁(t), s₂(t), and s₃(t) → m = E₁/E₂.
5. Remove DC from the signals. Then normalize them regarding the sound intensity.
6. Apply a Hamming window to the signals regarding their stationary parts (at least about 100 ms for wideband quasi-periodic helicopter sound or twice that).
7. Apply FFT to the signals.
8. Cancel noise using spectral subtraction.
9. Apply PHAT to the signals in order to calculate τ₂₁ and τ₃₁ in the frequency domain (index of the first maximum value). Then find the second maximum value of the cross-correlation vector.
10. If the first maximum value is not dominant enough with respect to the second maximum value, go to the next windows of the signals and do not calculate the sound source location; otherwise:
  a. $ϕ = \cos^{-1} (\frac{v_{sound} \cdot τ_{21}}{2 R})$ and $θ = \cos^{-1} (\frac{v_{sound} \cdot τ_{31}}{R})$
  b. $v_{sound} = 20.05 \sqrt{273.15 + Temperature (Centigrade)}$
  c. $r_{1} = \frac{τ_{21} \cdot v_{sound}}{1 - \sqrt{m}}$ and $r_{2} = \frac{τ_{21} \cdot v_{sound} \cdot \sqrt{m}}{1 - \sqrt{m}}$
  d. $x = (r_{2}^{2} - r_{1}^{2}) / 4 R$ and $y = \pm \sqrt{r_{1}^{2} - {(x - R)}^{2}}$
  e. $|ΔH (f)| = |10 \log_{10} \frac{H_{1} (f)}{H_{3} (f)}|$
  f. if $(|ΔH (f)| ≈ 0)$ $y = - \sqrt{r_{1}^{2} - {(x - R)}^{2}}$

     else $y = \sqrt{r_{1}^{2} - {(x - R)}^{2}}$

     $⇒ \begin{cases} x_{s} = r \cdot \cos (ϕ) \cdot \sin (θ) \\ y_{s} = r \cdot \sin (ϕ) \cdot \sin (θ) \\ z_{s} = r \cdot \cos (θ) \end{cases}$
  g. Go to 3.
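Steps 10b to 10f for the two microphones on the x-axis can be sketched in Python as follows. This is a hedged illustration, not the authors' Visual C++ code: it assumes the sign convention τ₂₁ · v_sound = r₁ − r₂ and m = (r₂/r₁)², which is what makes the expressions in step 10c return positive ranges, and it takes the front/back decision of steps 10e and 10f as a boolean input. The function name `localize_2d` is hypothetical.

```python
import math

def localize_2d(tau21, m, R, temp_c, front):
    """Source position (x, y) for microphones at (+R, 0) and (-R, 0).

    tau21  : time delay (s), assumed to satisfy tau21 * v_sound = r1 - r2
    m      : energy ratio E1/E2, assumed equal to (r2/r1)^2 and different from 1
    temp_c : air temperature in degrees Celsius
    front  : True if |ΔH(f)| is not close to 0 (source in front of the reflector)
    """
    v_sound = 20.05 * math.sqrt(273.15 + temp_c)         # step 10b
    root_m = math.sqrt(m)
    r1 = tau21 * v_sound / (1.0 - root_m)                # step 10c
    r2 = root_m * r1
    x = (r2 ** 2 - r1 ** 2) / (4.0 * R)                  # step 10d
    y_mag = math.sqrt(max(r1 ** 2 - (x - R) ** 2, 0.0))  # clamp rounding errors
    y = y_mag if front else -y_mag                       # step 10f
    return x, y
```

When m is close to 1 (source near the perpendicular bisector of the array), the denominator 1 − √m approaches zero and the range estimates become very sensitive to errors in m.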

9. Hardware and software implementations and results

We implemented our proposed hardware using an ‘MB800H’ motherboard and a DELTA IOIO LT sound card from M-Audio featuring eight analogue sound inputs, 96-kHz sampling frequency, and up to 16-bit resolution. Three MicroMic microphones of type ‘C 417’ with B29L preamplifier were used for the implementation (Figure 8). This type of microphone was chosen for its very flat spectral response up to 5 kHz. Visual C++ was used for the software implementation. Because of the very low sensitivity of the microphones used (10 mV/Pa), a loudspeaker was used as the sound source. Initially, we used microphones one and two to evaluate the accuracy of the angle of arrival ϕ and microphones one and three to evaluate the accuracy of the angle of arrival θ. We placed every microphone at a distance of 1 m from the origin (R = D/2 = 1). The sound localization results showed that the sound velocity changes with temperature. Since the sound velocity is used in most steps of the proposed method, the sound source location calculations became inaccurate. Therefore, (61) was used with a thermometer reading in order to find the sound velocity more accurately at different temperatures. Angle of arrival calculations using the downloaded wideband quasi-periodic helicopter sound gave the results in Tables 1 and 2. The acquisition hardware was initialized to 96-kHz sampling frequency and 16-bit resolution, and the signal acquisition window length was set to 100 ms. A custom 2-cm-diameter reflector (Figure 8) was placed behind the third microphone. Using the aforementioned sound source, the sound spectra shown in Figure 9 were measured with the third microphone. Figure 9a shows the spectrum when the sound source is placed behind the reflector and Figure 9b when it is placed in front of it. Notches can be seen in the second spectrum, in accordance with the reflector diameter and (74).

Finally, using our proposed TDE-ILD-HRTF approach, we first checked that a single dominant source was active in the localization area. Then, using (72), we reduced the background noise and localized the sound source in 3D space. Table 3 shows the improved results after the application of the noise reduction method. Although PHAT features high performance in noisy and reflective environments, this feature could not be evaluated because of the rather constant and high signal-to-reverberation ratio (SRR) in outdoor cases. We tried to find the effect of probable reverberation on the results of our localization system. Our experiments indicated that a distance of at least 1.2 m from surrounding buildings is necessary to alleviate reverberation. Because the loudspeaker's power, the amplification factor of the hardware set (microphone, preamplifier, and sound card), and the distance of the microphones from the surrounding buildings determine the SNR of reverberant signals in a real environment, this distance can be calculated; however, it is easier to determine it experimentally. Tables 1 and 2 indicate that our implemented 3D sound source localization method features less than 0.2 degree error for the angle of arrival. Table 3 indicates less than 10% error for 3D location finding. It also indicates ambiguity in sign finding between 0.5° and −0.5° as well as between 175° and 185°, due to the edges of the implemented reflector. We measured processing times of less than 150 ms for every location calculation. Comparing our angle of arrival results with those reported in various cited papers, the best reported results (on approximately similar tasks) were a 1 degree limit error in [71], a 2 degree limit error in [72], and within 8 degree error in [73]. Note that the fundamental limit of bearing estimates determined by the Cramer-Rao Lower Bound (CRLB) for D = 1 cm spacing of the microphone pair, a sampling frequency of 25 kHz, and an estimation duration of 100 ms is computed to be 1° [71]. This is found to be 1.4° for our system. However, only a few respectable references have calculated 2D or 3D source locations accurately. Most such research reports, besides calculating the angle of arrival, used either least-square error estimation or Kalman filtering techniques to predict the motion path of sound sources in order to overcome their local calculation errors. Examples are [2], [13], [15], and [19]. Meanwhile, comparison of our results with these and other works (results on approximately similar tasks), such as [72] and [73], shows that our proposed source location measurement approach has approximately the same accuracy.

Table 1 Results of the azimuth angle of arrival based on hardware implementation


Table 2 Results of the elevation angle of arrival based on hardware implementation


Table 3 Results for proposed 3D sound source localization method (Section 7) using noise reduction procedure


10. Conclusion

In this paper, we reported on the simulation of TDE-ILD-based 2D half-plane sound source localization using only two microphones. Reduction of the microphone count was our goal. Therefore, we also proposed and implemented TDE-ILD-HRTF-based 3D entire-space sound source localization using only three microphones. We also used spectral subtraction and source counting methods in low-degree reverberation outdoor cases to increase the localization accuracy. According to Table 3, the implementation results show that the proposed method leads to less than 10% error for 3D location finding. This is a higher accuracy in source location measurement than that of similar research that did not use spectral subtraction and source counting. We also showed that partly covering one of the microphones with a half-sphere reflector leads to entire-space N-dimensional sound source localization using only N microphones.

Authors’ information

AP was born in Azerbaijan. He has a Ph.D. in Electrical Engineering (Signal Processing) from the Electrical Engineering Department, Amirkabir University of Technology, Tehran, Iran. He is now an invited lecturer at the Electrical Engineering Department, Amirkabir University of Technology and has been teaching several courses (C++ programming, multimedia systems, digital signal processing, digital audio processing, and digital image processing). His research interests include statistical signal processing and applications, digital signal processing and applications, digital audio processing and applications (acoustic modeling, speech coding, text to speech, audio watermarking and steganography, sound source localization, determined and underdetermined blind source separation and scene analyzing), digital image processing and applications (image and video coding, image watermarking and steganography, video watermarking and steganography, object tracking and scene matching) as well as BCI.

SMA received his B.S. and M.S. degrees in Electronics from the Electrical Engineering Department, Amirkabir University of Technology, Tehran, Iran, in 1984 and 1987, respectively, and his Ph.D. in Engineering from the University of Cambridge, Cambridge, U.K., in 1996, in the field of speech processing. Since 1988, he has been a Faculty Member at the Electrical Engineering Department, Amirkabir University of Technology, where he is currently an associate professor and teaches several courses and conducts research in electronics and communications. His research interests include speech processing, acoustic modeling, robust speech recognition, speaker adaptation, speech enhancement as well as audio and speech watermarking.

Change history

27 September 2022
A Correction to this paper has been published: https://doi.org/10.1186/s13636-022-00258-3

References

Carter GC: Time delay estimation for passive sonar signal processing. IEEE T Acoust S. 1981, ASSP-29: 462-470.
Google Scholar
Weinstein E: Optimal source localization and tracking from passive array measurements. IEEE T. Acoust. S. 1982, ASSP-30: 69-76.
Article Google Scholar
Wang H, Chu P: Voice source localization for automatic camera pointing system in videoconferencing, in Proceedings of the ICASSP, New Paltz, NY, 19–22. 1997.
Google Scholar
Caffery J Jr, Stüber G: Subscriber location in CDMA cellular networks. IEEE Trans. Veh. Technol. 1998, 47: 406-416.
Article Google Scholar
Tsui JB: Fundamentals of Global Positioning System Receivers. New York: Wiley; 2000.
Book Google Scholar
Klein F: Finding an earthquake's location with modern seismic networks. Northern California: Earthquake Hazards Program; 2000.
Google Scholar
Huang Y, Benesty J, Elko GW, Mersereau RM: Real-time passive source localization: A practical linear-correction least-squares approach. IEEE T. Audio P. 2001, 9: 943-956.
Google Scholar
Michaud VJM, Rouat F, Letourneau J: Robust Sound Source Localization Using a Microphone Array on a Mobile Robot. Las Vegas: in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 2; 1228-1233.
Daku BLF, Salt JE, Sha L, Prugger AF: An algorithm for locating microseismic events, in Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering . Niagara Falls. May 2004, 2–5: 2311-2314.
Google Scholar
Patwari N, Ash JN, Kyperountas S, Hero AO III, Moses RL, Correal NS: Locating the nodes. IEEE Signal Process. Mag. 2005, 22: 54-69.
Article Google Scholar
Gezici S, Zhi T, Giannakis G, Kobayashi H, Molisch A, Poor H, Sahinoglu Z: Localization via ultra-wideband radios: a look at positioning aspects for future sensor networks. IEEE Signal Process. Mag. 2005, 22: 70-84.
Article Google Scholar
Lee JY, Ji SY, Hahn M, Young-Jo C: Real-Time Sound Localization Using Time Difference for Human-Robot Interaction. Prague: in Proceedings of the 16th IFAC World Congress; 2005.
Book Google Scholar
Ma WK, Vo BN, Singh SS, Baddeley A: Tracking an unknown time-varying number of speakers using TDOA measurements: a random finite set approach. IEEE Trans. Signal. Process. 2006, 54(9):3291-3304.
Article MATH Google Scholar
Ho KC, Lu X, Kovavisaruch L: Source localization using TDOA and FDOA measurements in the presence of receiver location errors: analysis and solution. IEEE Trans Signal. Process. 2007, 55(2):684-696.
Article MathSciNet MATH Google Scholar
Cevher V, Sankaranarayanan AC, McClellan JH, Chellappa R: Target tracking using a joint acoustic video system. IEEE Trans. Multimedia. 2007, 9(4):715-727.
Article Google Scholar
Zhengyuan X, Liu N, Sadler BM: A Simple Closed-Form Linear Source Localization Algorithm. Orlando, FL: in IEEE MILCOM; 2007:1-7.
Google Scholar
Etten JPV: Navigation systems: fundamentals of low and very-low frequency hyperbolic techniques. Elec. Comm. 1970, 45(3):192-212.
Google Scholar
Schmidt RO: A new approach to geometry of range difference location. IEEE Trans. Aerosp. Electron Syst. 1972, AES-8(6):821-835.
Article Google Scholar
Lee HB: A novel procedure for assessing the accuracy of hyperbolic multilateration systems. IEEE Trans. Aerosp. Electron Syst. 1975, 110(1):2-15.
Article MathSciNet Google Scholar
Hahn WR: Optimum signal processing for passive sonar range and bearing estimation. J. Acoust. Soc. Am. 1975, 58: 201-207.
Article Google Scholar
Julian P, Comparative A: Study of Sound Localization Algorithms for Energy Aware Sensor Network Nodes. IEEE Trans. Circuits and Systems - I. 2004., 51(4):
Birchfield ST, Gangishetty R: Acoustic Localization by Interaural Level Difference. Philadelphia, PA: in Proceedings of the ICASSP; 2005:1109-1112.
Google Scholar
Ho KC, Sun M: An accurate algebraic Closed-Form solution for energy-based source localization. IEEE Trans. Audio, Speech. Lang. Process 2007, 15: 2542-2550.
Article Google Scholar
Ho KC, Sun M: Passive Source Localization Using Time Differences of Arrival and Gain Ratios of Arrival. IEEE Trans. Signal. Process 2008, 56: 2.
Article MathSciNet MATH Google Scholar
Hebrank JH, Wright D: Spectral cues used in the localization of sound sources on the median plane. J. Acoust. Soc. Am. 1974, 56: 1829-1834.
Middlebrooks JC, Makous JC, Green DM: Directional sensitivity of sound-pressure level in the human ear canal. J. Acoust. Soc. Am. 1989, 86(1).
Asano F, Suzuki Y, Sone T: Role of spectral cues in median plane localization. J. Acoust. Soc. Am. 1990, 88: 159-168.
Duda RO, Martens WL: Range dependence of the response of a spherical head model. J. Acoust. Soc. Am. 1998, 104(5):3048-3058.
Cheng CI, Wakefield GH: Introduction to head-related transfer functions (HRTFs): representations of HRTFs in time, frequency, and space. J. Audio Eng. Soc. 2001, 49(4):231-248.
Kulaib A, Al-Mualla M, Vernon D: 2D Binaural Sound Localization for Urban Search and Rescue Robotic. Istanbul: in Proceedings of the 12th Int. Conference on Climbing and Walking Robots; 2009:9-11.
Razin S: Explicit (non-iterative) Loran solution. J. Inst. Navigation 1967, 14(3).
Knapp C, Carter G: The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Process. 1976, ASSP-24(4):320-327.
Brandstein MS, Silverman H: A practical methodology for speech localization with microphone arrays. Comput Speech Lang. 1997, 11(2):91-126.
Brandstein MS, Adcock JE, Silverman HF: A Closed-Form location estimator for use with room environment microphone arrays. IEEE Trans. Speech Audio Process. 1997, 5(1):45-50.
Svaizer P, Matassoni M, Omologo M: Acoustic Source Location in a Three-Dimensional Space Using Crosspower Spectrum Phase. Munich: in Proceedings of the ICASSP; 1997:231.
Lleida E, Fernandez J, Masgrau E: Robust continuous speech recognition system based on a microphone array. Seattle, WA: in Proceedings of the ICASSP; 1998:241-244.
Stephenne A, Champagne B: Cepstral prefiltering for time delay estimation in reverberant environments. Detroit, MI: in Proceedings of the ICASSP; 1995:3055-3058.
Omologo M, Svaizer P: Acoustic Event Localization using a Crosspower-Spectrum Phase based Technique. Adelaide: in Proceedings of the ICASSP, vol. 2; 1994:273-276.
Zhang C, Florencio D, Zhang Z: Why Does PHAT Work Well in Low Noise, reverberative Environments. Las Vegas, NV: in Proceedings of the ICASSP; 2008:2565-2568.
Cui W, Cao Z, Wei J: Dual-Microphone Source Location Method in 2-D Space. Toulouse: in Proceedings of the ICASSP; 2006:845-848.
Sheng X, Hu YH: Maximum likelihood multiple-source localization using acoustic energy measurements with wireless sensor networks. IEEE Trans Signal Process 2005, 53(1):44-53.
Blatt D, Hero AO III: Energy-based sensor network source localization via projection onto convex sets. IEEE Trans. Signal Process. 2006, 54(9):3614-3619.
Sen S, Nehorai A: Performance analysis of 3-D direction estimation based on head-related transfer function. IEEE Trans. Audio Speech Lang. Process. 2009, 17(4).
Batteau DW: The role of the pinna in human localization. Proc. R. Soc. Lond. Series B, Biol. Sci. 1967, 168: 158-180.
Roffler SK, Butler RA: Factors that influence the localization of sound in the vertical plane. J. Acoust. Soc. Am. 1968, 43(6):1255-1259.
Oldfield SR, Parker SPA: Acuity of sound localization: a topography of auditory space. Perception 1984, 13(5):601-617.
Hofman PM, Van Riswick JGA, Van Opstal AJ: Relearning sound localization with new ears. Nature Neurosci. 1998, 1(5):417-421.
Brown CP: Modeling the Elevation Characteristics of the Head-Related Impulse Response. M.S. Thesis: San Jose State University, San Jose, CA; 1996.
Satarzadeh P, Algazi VR, Duda RO: Physical and filter pinna models based on anthropometry. Vienna: in Proceedings of the 122nd Audio Eng. Soc. Conv.; 2007:7098.
Hwang S, Park Y, Park Y: Sound Source Localization using HRTF database. Gyeonggi-Do: in ICCAS2005; 2005.
Foy WH: Position-location solution by Taylor-series estimation. IEEE Trans. Aerosp. Electron Syst. 1976, AES-12: 187-194.
Abel JS, Smith JO: Source range and depth estimation from multipath range difference measurements. IEEE Trans. Acoust. Speech. Signal. Process. 1989, 37: 1157-1165.
Sommerville DMY: Elements of Non-Euclidean Geometry. New York: Dover; 1958:260.
Eisenhart LP: A Treatise on the Differential Geometry of Curves and Surfaces. New York: Dover; 1960:270-271. Originally Ginn, 1909.
Smith JO, Abel JS: Closed-Form least-squares source location estimation from range-difference measurements. IEEE Trans. Acoust. Speech Signal Process. 1987, ASSP-35: 1661-1669.
Delosme JM, Morf M, Friedlander B: A linear equation approach to locating sources from time-difference-of-arrival measurements. Denver, CO: in Proceedings of the ICASSP; 1980:09-11.
Fang BT: Simple solutions for hyperbolic and related position fixes. IEEE Trans. Aerosp. Electron Syst. 1990, 26: 748-753.
Friedlander B: A passive localization algorithm and its accuracy analysis. IEEE J. Oceanic Eng. 1987, OE-12: 234-245.
Schau HC, Robinson AZ: Passive source localization employing intersecting spherical surfaces from time-of-arrival differences. IEEE Trans. Acoust. Speech Signal Process. 1987, ASSP-35: 1223-1225.
Abel JS: A divide and conquer approach to least-squares estimation. IEEE Trans. Aerosp. Electron. Syst. 1990, 26: 423-427.
Beck A, Stoica P, Li J: Exact and approximate solutions of source localization problems. IEEE Trans. Signal. Process. 2008, 56(5):1770-1778.
So HC, Chan YT, Chan FKW: Closed-Form Formulae for Time-Difference-of-Arrival Estimation. IEEE Trans Signal Process. 2008, 56: 6.
Gillette MD, Silverman HF: A linear closed-form algorithm for source localization from time-differences of arrival. IEEE Signal Process. Lett. 2008, 15: 1-4.
Larsson EG, Danev D: Accuracy comparison of LS and squared-range LS for source localization. IEEE Trans. Signal Process. 2010, 58: 2.
Ono N, Sagayama S: R-means localization: a simple iterative algorithm for range-difference-based source localization. In Proceedings of the ICASSP; 14-19 March 2010:2718-2721.
Chami ZE, Guerin A, Pham A, Servière C: A Phase-based Dual Microphone Method to Count and Locate Audio Sources in Reverberant Rooms. New Paltz, NY: in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics; 2000:18-21.
Vaseghi SV: Advanced Digital Signal Processing and Noise Reduction. 3rd edn. Hoboken, NJ: John Wiley & Sons, Ltd; 2006.
Pourmohammad A, Ahadi SM: TDE-ILD-HRTF-based 2D whole-plane sound source localization using only two microphones and source counting. Int. J. Info. Eng. 2012, 2(3):307-313.
Balanis CA: Antenna Theory: Analysis and Design. Hoboken, NJ: John Wiley & Sons Inc.; 2005.
Giancoli DC: Physics for Scientists and Engineers, Vol. I. Upper Saddle River: Prentice Hall; 2000.
Friedlander B: On the Cramer-Rao bound for time delay and Doppler estimation. IEEE Trans. Inf. Theory. 1984, 30(3):575-580.
Gore A, Fazel A, Chakrabartty S: Far-Field Acoustic Source Localization and Bearing Estimation Using ∑∆ Learners. IEEE Trans. Circuits and Systems I: Regular Papers 2010, 57(4).
Chen SH, Wang JF, Chen MH, Sun ZW, Liao MJ, Lin SC, Chang SJ: A Design of Far-Field Speaker Localization System using Independent Component Analysis with Subspace Speech Enhancement. San Diego, CA: in Proc. 11th IEEE Int. Symp. Mult.; 2009:14-16.

Author information

Authors and Affiliations

Electrical Engineering Department, Amirkabir University of Technology, 424 Hafez Ave., Tehran, 15914, Iran
Ali Parsayan & Seyed Mohammad Ahadi

Corresponding author

Correspondence to Ali Parsayan.

Additional information

Competing interests

The authors declare that they have no competing interests.

The authors wish to correct Dr. Ali Parsayan’s family name. The incorrect author name is Ali Pourmohammad; the correct author name is Ali Parsayan. The original article has been corrected.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Parsayan, A., Ahadi, S.M. N-dimensional N-microphone sound source localization. J AUDIO SPEECH MUSIC PROC. 2013, 27 (2013). https://doi.org/10.1186/1687-4722-2013-27

Received: 24 June 2013
Accepted: 21 November 2013
Published: 06 December 2013
DOI: https://doi.org/10.1186/1687-4722-2013-27
