The TDE-ILD-based method enables dual-microphone 2D sound source localization. However, it is known that using a linear array in the TDE-ILD-based method produces two mirror points simultaneously (half-plane localization) [40]. Also, the TDE-ILD-based simulation results (Section 6) show that the ILD-based method requires exactly one dominant high-SNR source to be active in the localization area. Our proposed TDE-ILD-HRTF method addresses these problems using source counting, noise reduction by spectral subtraction, and HRTF.

### 7.1. Source counting method

As discussed above, the ILD-based method needs source counting to verify that a single dominant source is active before attempting high-resolution localization. If more than one source is active in the localization area, it cannot calculate *m* = *E*_{1}/*E*_{2} correctly. Therefore, we need to count the active, dominant sound sources and proceed with localization only when a single source is sufficiently dominant. PHAT gives us the cross-correlation vector of the two microphone output signals, and the number of dominant peaks in this vector gives the number of dominant sound sources [66]. First, consider a single source emitting a periodic signal:

s\left(t\right)=s\left(t+T\right).

(62)

If the analysis window is longer than T, the cross-correlation between the output signals of the two microphones contains one dominant peak plus weaker peaks spaced at multiples of T. However, a window of length approximately T, or a non-periodic source signal, yields only one dominant peak. This peak is located at the lag, in samples, between the two microphones' output signals. Therefore, if one sound source is dominant in the localization area, the cross-correlation vector contains only one dominant peak. Now consider two sound sources *s*(*t*) and *s*'(*t*) in a high-SNR localization area. According to (6) and (7), we have

{s}_{1}\left(t\right)=s\left(t-{T}_{1}\right)+s\text{'}\left(t-T{\text{'}}_{1}\right)

(63)

{s}_{2}\left(t\right)=s\left(t-{T}_{2}\right)+s\text{'}\left(t-T{\text{'}}_{2}\right).

(64)

According to (32), we have

{R}_{{s}_{1}{s}_{2}}\left(\tau \right)={\displaystyle {\int}_{-\infty}^{+\infty}\left(s\left(t-{T}_{1}\right)+s\text{'}\left(t-T{\text{'}}_{1}\right)\right)\cdot \left(s\left(t-{T}_{2}+\tau \right)+s\text{'}\left(t-T{\text{'}}_{2}+\tau \right)\right)\mathit{dt}}

(65)

{R}_{{s}_{1}{s}_{2}}\left(\tau \right)=R{1}_{{s}_{1}{s}_{2}}\left(\tau \right)+R{2}_{{s}_{1}{s}_{2}}\left(\tau \right)+R{3}_{{s}_{1}{s}_{2}}\left(\tau \right)+R{4}_{{s}_{1}{s}_{2}}\left(\tau \right)

(66)

where

R{1}_{{s}_{1}{s}_{2}}\left(\tau \right)={\displaystyle {\int}_{-\infty}^{+\infty}s\left(t-{T}_{1}\right)\cdot s\left(t-{T}_{2}+\tau \right)\mathit{dt}}

(67)

R{2}_{{s}_{1}{s}_{2}}\left(\tau \right)={\displaystyle {\int}_{-\infty}^{+\infty}s\left(t-{T}_{1}\right)\cdot s\text{'}\left(t-T{\text{'}}_{2}+\tau \right)\mathit{dt}}

(68)

R{3}_{{s}_{1}{s}_{2}}\left(\tau \right)={\displaystyle {\int}_{-\infty}^{+\infty}s\text{'}\left(t-T{\text{'}}_{1}\right)\cdot s\left(t-{T}_{2}+\tau \right)\mathit{dt}}

(69)

R{4}_{{s}_{1}{s}_{2}}\left(\tau \right)={\displaystyle {\int}_{-\infty}^{+\infty}s\text{'}\left(t-T{\text{'}}_{1}\right)\cdot s\text{'}\left(t-T{\text{'}}_{2}+\tau \right)\mathit{dt}}.

(70)

Using (34), *τ*_{1} = *T*_{2} − *T*_{1} maximizes R{1}_{{s}_{1}{s}_{2}}\left(\tau \right); *τ*_{2} = *T*'_{2} − *T*_{1} maximizes R{2}_{{s}_{1}{s}_{2}}\left(\tau \right); *τ*_{3} = *T*_{2} − *T*'_{1} maximizes R{3}_{{s}_{1}{s}_{2}}\left(\tau \right); and *τ*_{4} = *T*'_{2} − *T*'_{1} maximizes R{4}_{{s}_{1}{s}_{2}}\left(\tau \right). Therefore, the cross-correlation vector contains four peaks. However, because (67) and (70) are cross-correlations of a signal with a delayed version of itself, while (68) and (69) are cross-correlations of two different signals, the maxima at *τ*_{1} and *τ*_{4} dominate those at *τ*_{2} and *τ*_{3}. We conclude that when two dominant sound sources are present, the cross-correlation vector has two dominant values, and correspondingly as many dominant values as there are dominant sources when more than two are present, analogous to the multiple power-spectrum peaks in DOA-based multiple-source beamforming methods [16]. Therefore, by counting the dominant values of the cross-correlation vector, we can find the number of active and dominant sound sources in the localization area.
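The counting rule above can be sketched in code. The following is a minimal illustration rather than the paper's implementation; the relative peak threshold `ratio` is an assumed tuning parameter, not a value from the text:

```python
import numpy as np

def gcc_phat(x1, x2):
    """PHAT-weighted cross-correlation of two microphone signals."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12   # PHAT weighting: keep phase, drop magnitude
    return np.fft.irfft(cross, n=n)

def count_dominant_sources(x1, x2, ratio=0.5):
    """Count local maxima whose height exceeds `ratio` times the largest peak."""
    cc = np.abs(gcc_phat(x1, x2))
    thresh = ratio * cc.max()
    inner = cc[1:-1]
    is_peak = (inner > cc[:-2]) & (inner > cc[2:]) & (inner >= thresh)
    return int(np.count_nonzero(is_peak))
```

A localizer would call `count_dominant_sources` first and run the ILD stage only when the count is one.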

### 7.2. Noise reduction using spectral subtraction

To apply ILD in TDE-ILD-based dual-microphone 2D sound source localization, source counting is used to verify that a single dominant high-SNR source is active in the localization area. Furthermore, spectral subtraction can be used for noise reduction, thereby increasing the active source's SNR. For background noise such as wind, rain, and babble, we can employ a background spectrum estimator. In most practical cases, the noisy signal can be modeled as the sum of the clean signal and the noise [67]:

{s}_{n}\left(t\right)=s\left(t\right)+n\left(t\right).

(71)

Since the signal and noise are generated by independent sources, they can be considered uncorrelated. Therefore, the noisy spectrum can be written as:

{S}_{n}\left(w\right)=S\left(w\right)+N\left(w\right).

(72)

During silent periods, i.e., periods without the target sound, the background noise spectrum can be estimated, assuming the noise is stationary. The noise magnitude spectrum is then subtracted from the noisy input magnitude spectrum. For non-stationary noise, an adaptive background noise spectrum estimation procedure can be used [67].
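A minimal sketch of this noise-reduction step follows, assuming a stationary noise magnitude estimate taken from a silent period; the frame length, hop size, and half-wave rectification are illustrative choices, not taken from the paper:

```python
import numpy as np

def estimate_noise_mag(noise, frame=256, hop=128):
    """Average magnitude spectrum of a noise-only (silent-period) recording."""
    window = np.hanning(frame)
    frames = [np.abs(np.fft.rfft(noise[s:s + frame] * window))
              for s in range(0, len(noise) - frame + 1, hop)]
    return np.mean(frames, axis=0)

def spectral_subtraction(noisy, noise_mag, frame=256, hop=128):
    """Magnitude spectral subtraction with overlap-add resynthesis."""
    window = np.hanning(frame)
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame + 1, hop):
        seg = noisy[start:start + frame] * window
        spec = np.fft.rfft(seg)
        # subtract the noise magnitude, clipping negative values to zero
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
        out[start:start + frame] += clean * window
        norm[start:start + frame] += window ** 2
    return out / np.maximum(norm, 1e-12)
```

Half-wave rectification introduces some musical noise; smoother gain rules exist, but the simple form above matches the subtraction described here.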

### 7.3. Using HRTF method

Using a linear array in the TDE-ILD-based dual-microphone 2D sound source localization method produces two mirror points simultaneously. Adding an HRTF-inspired idea makes whole-plane dual-microphone 2D localization possible. This idea was published in [68] and is reviewed here. Researchers have used an artificial ear with a spiral shape: a special type of pinna with a varying distance from a microphone placed at the center of its spiral [30]. We instead consider a half-cylinder in place of the artificial ear [68]. Such a reflector creates a notch at a fixed spectral position for all sound signal angles of arrival (0 to 180°) (Figure 3). However, as shown in the figure, a complete half-cylinder scatters the sound waves arriving from sources behind it. To overcome this problem, we cut slits in the surface of the half-cylinder (Figure 3). If *d* is the distance between the reflector (half-cylinder) and the microphone (placed at the center), a notch is created when *d* equals a quarter of the sound wavelength, λ, plus any multiple of λ/2. For such wavelengths, the incident waves are canceled by the reflected waves [30]:

n\cdot \left(\frac{\lambda}{2}\right)+\frac{\lambda}{4}=d,\phantom{\rule{0.5em}{0ex}}n=0,1,2,3,\dots

(73)

These notches will appear at the following frequencies:

f=\frac{c}{\lambda}=\frac{\left(2n+1\right)\cdot c}{4d}.

(74)
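As a quick numerical check of (73) and (74), the notch positions for a given reflector-microphone distance can be listed; the speed of sound `c = 343 m/s` is an assumed value:

```python
def notch_frequencies(d, n_max=5, c=343.0):
    """Notch frequencies f = (2n+1)*c/(4*d) from (73)-(74), for distance d in meters."""
    return [(2 * n + 1) * c / (4 * d) for n in range(n_max)]
```

For example, with *d* = 5 cm the first notch falls at 1715 Hz, well inside the audio band.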

Covering only microphone 2 in Figure 1 with the reflector and calculating the interaural spectral difference results in

\begin{array}{ll}\left|\mathrm{\Delta H}\left(f\right)\right|& =\left|10{log}_{10}{H}_{1}\left(f\right)-10{log}_{10}{H}_{2}\left(f\right)\right|\\ =\left|10{log}_{10}\frac{{H}_{1}\left(f\right)}{{H}_{2}\left(f\right)}\right|.\end{array}

(75)

High |Δ*H*(*f*)| values indicate that the sound source is in front, while negligible values indicate that it is behind. Note that careful design of the slits is necessary so that both microphones see the same spectrum when the sound source is behind.
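The front/back decision based on (75) can be sketched as below; the decision threshold `threshold_db` is an illustrative choice, not a value from the paper:

```python
import numpy as np

def front_back_decision(H1_mag, H2_mag, threshold_db=6.0):
    """Front/back decision from the interaural spectral difference of (75).

    H1_mag, H2_mag: magnitude spectra at microphones 1 and 2 (positive values).
    """
    delta = np.abs(10 * np.log10(H1_mag) - 10 * np.log10(H2_mag))
    return "front" if np.mean(delta) > threshold_db else "back"
```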

### 7.4. Extension of dimensions to three

Because the proposed localization system uses a half-cylinder reflector, it is applicable only to 2D cases. Replacing the half-cylinder with a half-sphere reflector makes it usable for 3D sound source localization (Figure 4). Adding the half-sphere reflector only to microphone 2 in Figure 5 allows the system to localize sound sources in 3D.

For the 3D case, consider the plane that passes through the source position and the *x*-axis and makes an angle *θ* with the *x-y* plane; we name it the source-plane (Figure 5). This plane contains microphones 1 and 2. Under these assumptions, using (36), we can calculate *ф*, the angle of arrival in the source-plane, which equals the angle of arrival in the *x-y* plane. Then, using (59) and (60), we can also calculate *x* and *y* in the 2D source-plane (not in the *x-y* plane of the 3D case) and hence the source distance r=\sqrt{{x}^{2}+{y}^{2}}. Introducing a new microphone (mic3 in Figure 5) at *y* = 0 and *x* = *z* = *R* allows us to calculate the angle *θ*. Using (36), we obtain *θ* as:

\theta ={\mathrm{cos}}^{-1}\left(\frac{{v}_{\mathrm{sound}}\cdot {\tau}_{31}}{R}\right).

(76)

Using the half-sphere reflector decreases the accuracy of the time delay and intensity level difference estimation between microphones 1 and 2, because it changes the spectrum of the second microphone's signal. Covering the third microphone instead of the second only decreases the accuracy of the time delay estimation between the first and third microphones (Figure 5); multiplying the third microphone's spectrum by the inverse of the notch filter's transfer function restores this accuracy. Now, using *r*, *ф*, and *θ*, we can calculate the source location *x*, *y*, and *z* in the 3D case:

\left\{\begin{array}{l}x=r\cdot cos\left(\varphi \right)\cdot sin\left(\theta \right)\\ y=r\cdot sin\left(\varphi \right)\cdot sin\left(\theta \right)\\ z=r\cdot cos\left(\theta \right)\end{array}\right..

(77)
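The final conversion step, combining (76) and (77), can be sketched as follows; `r` and `phi` are assumed to have been obtained already from (36), (59), and (60), and the speed of sound `c = 343 m/s` is an assumed constant:

```python
import math

def theta_from_delay(tau31, R, c=343.0):
    """Angle theta from the mic1-mic3 time delay, as in (76)."""
    return math.acos(c * tau31 / R)

def source_position(r, phi, theta):
    """Cartesian source coordinates from range r, azimuth phi, and theta, as in (77)."""
    x = r * math.cos(phi) * math.sin(theta)
    y = r * math.sin(phi) * math.sin(theta)
    z = r * math.cos(theta)
    return x, y, z
```

A zero mic1-mic3 delay corresponds to *θ* = 90°, i.e., a source lying in the *x-y* plane.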

The reasons for choosing a spherical shape for the reflector are as follows [69]. The simplest reflector is a plane reflector, introduced to direct the signal in a desired direction. Clearly, with this type of reflector, the distance between reflector and microphone, *d* in (73), varies with the source position in 3D, shifting the notch position in the spectrum. This shift may be unacceptable, as the notch can move out of the spectral band of interest. To better concentrate the energy in the forward direction, the geometry of the plane reflector must be changed so as to prevent radiation in the back and side directions.

A possible arrangement for this purpose consists of two plane reflectors joined on one side to form a corner. This type of reflector returns the signal exactly in the direction from which it was received. Because of this unique feature, the reflected wave acquired by the microphone is unique, which is our aim: more than one reflected wave would produce a higher energy at the microphone than a single reflected wave, causing a deep notch. However, with this type of reflector too, *d* varies with the source position in 3D, which shifts the notch position in the spectrum. It is known that a beam of parallel waves incident on a parabolic reflector focuses at a spot known as the focal point. For a spherical reflector, this point does not coincide with the center, even though the center has the same distance (*d*) to all points of the reflector surface. Because of this property, the reflected wave received by a microphone at the center is unique, whatever the 3D position of the sound source, and corresponds to the wave passing through the center. Let the center of the spherical reflector with radius *R* be located at *O* (Figure 6), and let the sound wave axis strike the reflector at *B*. From the law of reflection and the angle geometry of parallel lines, the marked angles are equal; hence, *BFO* is an isosceles triangle. Dropping the perpendicular from *F* to *BO* and using trigonometry, the focal point is calculated as:

\frac{R}{2b}=cos\left(\mathrm{\theta}\right)\to b=\frac{R}{2cos\mathrm{\theta}}

(78)

\to F=R-\frac{R}{2cos\left(\theta \right)}.

(79)
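A quick numerical check of (78) and (79) shows the spherical-reflector behavior: for paraxial rays (*θ* → 0) the focal distance approaches *R*/2, but it changes as *θ* grows, so the focus is not a single fixed point. A minimal sketch:

```python
import math

def focal_point(R, theta):
    """Focal distance F = R - R/(2*cos(theta)) from (78)-(79)."""
    return R - R / (2.0 * math.cos(theta))
```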

*F* is not the same for all *θ* values. The half-spherical reflector scatters the sound waves hitting it from the back. Therefore, we cut slits in its surface (Figure 4). Obviously, the slits must be designed so that both microphones see the same spectrum when the sound source is behind. When a plane wave hits a barrier with a single circular slit narrower than the signal wavelength (λ), the wave bends and emerges from the slit as a circular wave [70]. If *L* is the distance between the slit and the viewing screen and *d* is the slit width, then under the assumption *L* >> *d* (Fraunhofer scattering), the wave intensity observed (on a screen) at an angle *θ* with respect to the incident direction is given by:

\mathit{if}\phantom{\rule{0.12em}{0ex}}k=\frac{\pi \cdot d}{\lambda}sin\left(\theta \right)\to \frac{I\left(\theta \right)}{I\left(0\right)}=\frac{{sin}^{2}\left(k\right)}{{k}^{2}},

(80)

where *I*(*θ*) is wave intensity in direction of observation and *I*(*0*) is maximum wave intensity of diffraction pattern (central fringe). This relationship will have a value of zero each time sin^{2}(*k*) = 0. This occurs when

k=\pm \mathit{m\pi}\phantom{\rule{0.12em}{0ex}}\mathrm{or}\phantom{\rule{0.12em}{0ex}}\frac{\pi .d}{\lambda}sin\left(\theta \right)=\pm \mathit{m\pi},

(81)

yielding the following condition for observing minimum wave intensity from a single circular slit:

sin\left(\theta \right)=\frac{\mathit{m\lambda}}{d},\phantom{\rule{0.5em}{0ex}}m=\pm 1,\pm 2,....

(82)

This relationship is satisfied for integer values of *m*, with increasing *m* giving minima at correspondingly larger angles: the first minimum occurs at *m* = 1, the second at *m* = 2, and so forth. If \frac{d}{\lambda}sin\left(\theta \right) is less than one for all values of *θ*, i.e., when the aperture is smaller than a wavelength (*d* < *λ*), there are no minima. As we need more than one circular slit on the half-sphere reflector's surface, consider a parallel wave incident on a barrier with two closely spaced narrow circular slits S1 and S2. The slits split the incident wave into two coherent waves which, after passing through the slits, spread out by diffraction and interfere with one another. If the transmitted wave falls on a screen some distance away, an interference pattern of bright and dark fringes is observed, the bright fringes corresponding to regions of maximum wave intensity and the dark ones to regions of minimum wave intensity. As discussed before, at all points on the screen where the path difference is an integer multiple of the wavelength, the two waves from slits S1 and S2 arrive in phase and bright fringes are observed; thus, the condition for bright fringes has the form of (82). Similarly, dark fringes are produced where the two waves arrive exactly out of phase, i.e., where the path difference is an odd integer multiple of half-wavelengths:

sin\left(\theta \right)=\frac{\left(m+\frac{1}{2}\right).\lambda}{d}m=0,\pm 1,\pm 2,....

(83)
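Conditions (82) and (83) can be evaluated numerically when sizing the slits. A small sketch, where `d` is the slit width (or spacing for the two-slit case) and angles are returned in radians:

```python
import math

def single_slit_minima(d, wavelength, m_max=3):
    """Angles of single-slit intensity minima, sin(theta) = m*lambda/d, as in (82)."""
    angles = []
    for m in range(1, m_max + 1):
        s = m * wavelength / d
        if s <= 1.0:          # no minimum exists beyond sin(theta) = 1
            angles.append(math.asin(s))
    return angles

def double_slit_dark_fringes(d, wavelength, m_max=3):
    """Angles of double-slit dark fringes, sin(theta) = (m + 1/2)*lambda/d, as in (83)."""
    angles = []
    for m in range(0, m_max + 1):
        s = (m + 0.5) * wavelength / d
        if s <= 1.0:
            angles.append(math.asin(s))
    return angles
```

As the code makes explicit, when *d* < *λ* neither condition can be met and no minima appear, matching the discussion above.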

Of course, this condition changes when a half-sphere is used instead of a plane (Figure 7): since both slits lie at the same distance *R* from the center, the two waves from slits S1 and S2 arrive in phase at the center, and a bright fringe is observed there. This intensity magnification is unsuitable when estimating the intensity level difference between two microphones, one of which is covered by the reflector. However, covering the third microphone with the half-sphere reflector is suitable, since there is no need to estimate the intensity level difference between microphones 1 and 3.