This section is organized as follows. Section 3.1 presents the VAD based on the MP coefficients and the LRT. Section 3.2 provides a goodness-of-fit (GOF) test for the distribution of the MP coefficients. Section 3.3 discusses the MP feature in more detail.

### 3.1 Statistical modeling of the MP coefficients and decision rule

Assuming that the noisy speech *x* consists of a clean speech *s* and an uncorrelated additive noise signal *n*, that is, *x* = *s* + *n*.

Applying the signal atomic decomposition by using the conjugate MP, the noisy MP coefficient extracted from *x* at each pursuit iteration has the following form

{\alpha}_{k}={\alpha}_{s,k}+{\alpha}_{n,k},\phantom{\rule{1em}{0ex}}k=1,\dots ,K,

(8)

where *α*_{s,k} and *α*_{n,k} are the MP coefficients of the clean speech and the noise, respectively. The variance of the noisy MP coefficient *α*_{k} is given by

{\lambda}_{k}={\lambda}_{s,k}+{\lambda}_{n,k},\phantom{\rule{1em}{0ex}}k=1,\dots ,K,

(9)

where *λ*_{s,k} and *λ*_{n,k} are the variances of the MP coefficients of the clean speech and the noise, respectively.

The *K*-dimensional MP coefficient vectors of speech, noise, and noisy speech are denoted as *α*_{s}, *α*_{n}, and *α*, with *k* th elements *α*_{s,k}, *α*_{n,k}, and *α*_{k}, respectively. Given the two hypotheses *H*_{0} and *H*_{1}, which indicate speech absence and speech presence, respectively, we assume that

\begin{array}{c}{H}_{0}:\alpha ={\alpha}_{n}\\ {H}_{1}:\alpha ={\alpha}_{n}+{\alpha}_{s}\end{array}

For implementation of the above statistical model, a suitable distribution of the MP coefficients is required. In this article, we assume that the MP coefficients of the noisy speech and of the noise are asymptotically independent complex Gaussian random variables with zero mean. We also assume that the variances of the noise MP coefficients, {*λ*_{n,k}, *k* = 1, ..., *K*}, are known. Thus, the probability density functions (PDFs) conditioned on *H*_{0} and *H*_{1}, with the set of *K* unknown parameters *Θ* = {*λ*_{s,k}, *k* = 1, ..., *K*}, are given by

p\left(\alpha |{H}_{0}\right)=\prod _{k=1}^{K}\frac{1}{\pi {\lambda}_{n,k}}exp\left\{-\frac{|{\alpha}_{k}{|}^{2}}{{\lambda}_{n,k}}\right\}

(10)

p\left(\alpha |\Theta ,{H}_{1}\right)=\prod _{k=1}^{K}\frac{1}{\pi \left({\lambda}_{n,k}+{\lambda}_{s,k}\right)}exp\left\{-\frac{|{\alpha}_{k}{|}^{2}}{{\lambda}_{n,k}+{\lambda}_{s,k}}\right\}

(11)
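To make the two conditional densities concrete, the log-likelihoods of Equations (10) and (11) can be evaluated directly. The sketch below is illustrative only; the function names and the NumPy-based setup are our assumptions, not part of the original method:

```python
import numpy as np

def log_likelihood_h0(alpha, lam_n):
    # Log of Eq. (10): independent zero-mean complex Gaussians with
    # noise-only variances lam_n[k].
    return float(np.sum(-np.log(np.pi * lam_n) - np.abs(alpha) ** 2 / lam_n))

def log_likelihood_h1(alpha, lam_n, lam_s):
    # Log of Eq. (11): under speech presence the variance of the k-th
    # coefficient is lam_n[k] + lam_s[k].
    lam = lam_n + lam_s
    return float(np.sum(-np.log(np.pi * lam) - np.abs(alpha) ** 2 / lam))
```

For coefficients whose energy clearly exceeds the noise variance, the *H*_{1} likelihood dominates; moreover, among all choices of *λ*_{s,k}, the ML estimate of Equation (13), *λ̂*_{s,k} = |*α*_{k}|² − *λ*_{n,k}, yields the largest *H*_{1} likelihood.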

The maximum likelihood estimate \widehat{\Theta}=\left\{{\widehat{\lambda}}_{s,k},k=1,\dots ,K\right\} of *Θ* is obtained by

\widehat{\Theta}=\underset{\Theta}{\arg\max}\left\{\log p\left(\alpha |\Theta ,{H}_{1}\right)\right\},

(12)

and equals

{\widehat{\lambda}}_{s,k}=|{\alpha}_{k}{|}^{2}-{\lambda}_{n,k},\phantom{\rule{1em}{0ex}}k=1,\dots ,K.

(13)

By substituting Equation (13) into Equation (11), the decision rule using the likelihood ratio is obtained as follows

\begin{array}{c}{\Lambda}_{g}=\frac{1}{K}\log \frac{p\left(\alpha |\widehat{\Theta},{H}_{1}\right)}{p\left(\alpha |{H}_{0}\right)}\\ =\frac{1}{K}\sum _{k=1}^{K}\left\{\frac{|{\alpha}_{k}{|}^{2}}{{\lambda}_{n,k}}-\log \frac{|{\alpha}_{k}{|}^{2}}{{\lambda}_{n,k}}-1\right\}\begin{array}{c}\stackrel{{H}_{1}}{\ge}\\ \underset{{H}_{0}}{<}\end{array}\eta \end{array}

(14)

where *η* denotes a threshold value.
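The decision statistic of Equation (14) reduces to a simple function of the per-coefficient a-posteriori SNR |*α*_{k}|²/*λ*_{n,k}. A minimal sketch, assuming NumPy arrays of noisy MP coefficients and known noise variances (function names are illustrative):

```python
import numpy as np

def lrt_statistic(alpha, lam_n):
    # Generalized LRT of Eq. (14): the ML estimate of Eq. (13),
    # lam_s_hat = |alpha|^2 - lam_n, has been plugged into Eq. (11).
    snr = np.abs(alpha) ** 2 / lam_n   # per-coefficient a-posteriori SNR
    return float(np.mean(snr - np.log(snr) - 1.0))

def vad_decide(alpha, lam_n, eta):
    # True -> H1 (speech present), False -> H0 (speech absent)
    return lrt_statistic(alpha, lam_n) >= eta
```

Note that *x* − log *x* − 1 ≥ 0 with equality only at *x* = 1, so *Λ*_{g} is nonnegative and vanishes exactly when |*α*_{k}|² = *λ*_{n,k} for all *k*, i.e., when the observation is fully consistent with noise only.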

### 3.2 GOF test for MP coefficients

The MP coefficients were assumed to follow a Gaussian distribution in the section above. To test this assumption, we carried out a statistical fitting test for the noisy MP coefficients conditioned on both hypotheses under various noise conditions. To this end, the Kolmogorov-Smirnov (KS) test [22], which serves as a GOF test, is employed to provide a reliable assessment of the statistical assumption.

With the KS test, the empirical cumulative distribution function (CDF) *F*_{α} is compared to a given distribution function *F*, where *F* is the complex Gaussian CDF. Let *α* = {*α*_{1}, *α*_{2}, ..., *α*_{N}} be a set of MP coefficients extracted from the noisy speech data; the empirical CDF is then defined by

{F}_{\alpha}\left(z\right)=\left\{\begin{array}{c}\hfill 0,\phantom{\rule{1em}{0ex}}z<{\alpha}_{\left(1\right)}\hfill \\ \hfill \frac{n}{N},\phantom{\rule{1em}{0ex}}{\alpha}_{\left(n\right)}\le z<{\alpha}_{\left(n+1\right)},\hfill \\ \hfill 1,\phantom{\rule{1em}{0ex}}z\ge {\alpha}_{\left(N\right)}\hfill \end{array}\phantom{\rule{1em}{0ex}}n=1,\dots ,N-1\right.

(15)

where *α*_{(n)}, *n* = 1, ..., *N*, are the order statistics of the data *α*. To compute the order statistics, the elements of *α* are sorted so that *α*_{(1)} is the smallest element of *α* and *α*_{(N)} is the largest.

To simulate noisy environments, the white and factory noises from the NOISEX'92 database are added to a clean speech signal at 0 dB SNR. From the noisy speech, the mean and variance are calculated and substituted into the Gaussian distribution. Figure 4 compares the empirical CDF with the Gaussian CDF. As can be seen, the empirical CDF curves of the noisy speech signal are very close to the Gaussian CDF under both the white and factory noise conditions. Therefore, the Gaussian distribution is suitable for modeling the MP coefficients.
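The empirical CDF of Equation (15) and the KS distance can be sketched as follows for a real-valued sample (e.g., the real parts of the MP coefficients). The Gaussian reference is fitted with the sample mean and standard deviation, which is a simplification of the full procedure; all names are illustrative:

```python
import numpy as np
from statistics import NormalDist

def ks_distance_to_gaussian(x):
    # KS statistic between the empirical CDF of Eq. (15) and a Gaussian CDF
    # fitted with the sample mean and standard deviation.
    x = np.sort(np.asarray(x, dtype=float))   # order statistics x_(1) <= ... <= x_(N)
    n = len(x)
    ref = NormalDist(mu=float(x.mean()), sigma=float(x.std(ddof=1)))
    f = np.array([ref.cdf(v) for v in x])
    # The empirical CDF jumps from i/n to (i+1)/n at x_(i+1); the supremum
    # deviation is attained at one of these jump points.
    ecdf_hi = np.arange(1, n + 1) / n
    ecdf_lo = np.arange(0, n) / n
    return float(max(np.max(np.abs(ecdf_hi - f)), np.max(np.abs(ecdf_lo - f))))
```

A Gaussian sample yields a small KS distance, while a clearly non-Gaussian sample (e.g., exponential) yields a noticeably larger one.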

### 3.3 Obtaining MP features

As mentioned before, the DFT coefficients suffer from several shortcomings for modeling a signal and exposing its structure. We instead use the MP coefficients, {\left\{{\alpha}_{k}\right\}}_{k=1}^{K}, obtained by the MP as the new feature for discriminating speech from nonspeech. Owing to the atomic decomposition, the MP coefficients can capture the characteristics of speech [17] and are insensitive to environmental noise. Therefore, the MP coefficients are more suitable for the classification task than the DFT coefficients.

With the decomposition of a speech signal by the conjugate MP, the MP feature also captures the harmonic structure of the speech signal. Such harmonic components can be viewed as a series of sinusoids with different amplitudes, frequencies, and phases, buried in noise. The *k* th harmonic component *h*_{k} extracted at the *k* th pursuit iteration has the following form

{h}_{k}={A}_{k}cos\left({\omega}_{k}+{\varphi}_{k}\right)=2\mathsf{\text{Re}}\left\{{\alpha}_{k}{g}_{{\gamma}_{k}}\right\}

(16)

where *A*_{k}, *ω*_{k}, and *ϕ*_{k} are the amplitude, frequency, and phase of the sinusoidal component *h*_{k}, respectively. These harmonic structures are prominent when speech is present but not when only noise is present.

In a practical implementation, the procedure for extracting the MP feature is as follows. The input signal is segmented into non-overlapping frames, and each frame is decomposed by the conjugate subspace MP, yielding the complex MP coefficients of that frame. Rather than fully reconstructing the signal, the goal of the MP here is only to extract the MP coefficients. These coefficients capture the most significant characteristics of the signal, so that a VAD detector based on them can determine whether speech is present. Naturally, the choice of the iteration number *K* depends on the number of sinusoidal components in the speech signal.
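The extraction procedure above can be sketched as follows. This is a deliberately simplified stand-in: it uses DFT atoms on integer frequency bins as the dictionary rather than the full conjugate subspace MP over a Gabor dictionary, and the function name is our assumption:

```python
import numpy as np

def mp_features(frame, K):
    # Greedy pursuit sketch: at each of K iterations, pick the complex
    # exponential (DFT atom) most correlated with the residual, record its
    # complex coefficient, and subtract the real projection
    # 2 * Re{alpha * g}, mirroring Eq. (16).
    N = len(frame)
    t = np.arange(N)
    residual = np.asarray(frame, dtype=float).copy()
    coeffs = []
    for _ in range(K):
        spectrum = np.fft.rfft(residual) / N          # inner products with DFT atoms
        k = int(np.argmax(np.abs(spectrum[1:])) + 1)  # strongest non-DC bin
        alpha = spectrum[k]
        atom = np.exp(2j * np.pi * k * t / N)
        residual = residual - 2.0 * np.real(alpha * atom)
        coeffs.append(alpha)
    return np.array(coeffs)
```

For a frame built from sinusoids on integer bins, each iteration recovers one component: a sinusoid *A* cos(*ωt* + *ϕ*) on bin *k* yields the coefficient (*A*/2) e^{i*ϕ*} at that bin, and subtracting 2 Re{*α* g} removes it from the residual.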