### 2.1 An overview of the VAD algorithm

Heuristic-rule-based and statistical-model-based VAD methods each have advantages and disadvantages against different kinds of noise. We combine the strengths of both approaches to make the VAD algorithm more robust. The proposed method is shown in Figure 1. It consists of three submodules: a noise estimation submodule, a feature extraction submodule, and an HMM/GMM-based classification submodule.

In our study, the MSRA Mandarin speech corpus is first used to train HMM/GMM hybrid models at different SNR levels (e.g., SNR = 5 dB, SNR = 10 dB) under the maximum-likelihood criterion with the Baum-Welch (BW) algorithm. Then, during VAD, the SNR of the noisy speech is estimated by the noise estimation submodule, and the HMM/GMM hybrid model of the corresponding SNR level is selected. After that, the speech features, namely MFCCs, harmonic structure information, and HOS, are extracted to represent each speech/non-speech segment. Finally, the non-speech segments are distinguished from the speech segments by phoneme recognition using the trained HMM/GMM hybrid model.
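The model-selection step described above can be sketched as a nearest-SNR lookup. This is only an illustrative sketch: the dict of pretrained models and the SNR grid are assumptions, not the paper's actual data structures.

```python
def select_model(est_snr_db, models):
    """Pick the HMM/GMM hybrid model trained at the SNR level
    closest to the estimate from the noise estimation submodule.

    models: dict mapping training SNR level in dB -> trained hybrid model.
    """
    nearest = min(models, key=lambda snr: abs(snr - est_snr_db))
    return models[nearest]


# Hypothetical pretrained models keyed by their training SNR (dB).
models = {5: "hmm_gmm_5dB", 10: "hmm_gmm_10dB"}
chosen = select_model(7.2, models)  # estimate of 7.2 dB is closest to 5 dB
```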

Note that, in this article, the noise estimation submodule is realized with the well-known minima-controlled recursive averaging (MCRA) method; see [17] for details.

### 2.2 Feature extraction

Different features have their own strengths in ASR systems, and no single feature can cope with all noisy environments. Combining several features to discriminate speech from non-speech has therefore become a popular strategy in recent years. In this article, three complementary features, harmonic structure information, HOS, and MFCCs, are combined to represent the speech signal: harmonic structure information is robust for high-pitched sounds, HOS is robust against Gaussian and Gaussian-like noise, and MFCCs are the standard features of the phoneme recognizer.

#### 2.2.1 Harmonic structure information

Harmonic structure information is a well-known acoustic cue for improving noise robustness and has been introduced in many VAD algorithms [11, 18]. In [11], Fukuda et al. incorporated harmonic structure information into a GMM model and achieved a significant improvement in ASR. Their method assumes that the harmonic (pitch) structure is contained only in the middle range of the cepstral coefficients. The feature extraction procedure is shown in Figure 2.

First, the log power spectrum *y*_{t}(*j*) of each frame is converted into the cepstrum *p*_{t}(*i*) by the discrete cosine transform (DCT).

{p}_{t}\left(i\right)=\sum _{j}{M}_{a}\left(i,j\right)\cdot {y}_{t}\left(j\right),

(1)

where *M*_{a}(*i*, *j*) is the DCT matrix, *j* is the frequency bin index, and *i* indicates the bin index of the cepstral coefficients.
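Eq. (1) can be sketched in a few lines using an orthonormal DCT in place of the explicit matrix *M*_{a}. The FFT size, window, and flooring constant below are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.fft import dct

def log_power_cepstrum(frame, n_fft=512):
    """Cepstrum p_t(i): DCT of the log power spectrum y_t(j) of one frame (Eq. 1)."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2   # power spectrum of the frame
    y_t = np.log(power + 1e-12)                      # log power spectrum y_t(j), floored
    return dct(y_t, type=2, norm='ortho')            # p_t(i) = sum_j M_a(i, j) * y_t(j)
```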

Then, the harmonic structure information *q*_{t} is obtained from the observed cepstra *p*_{t} by suppressing the lower and higher cepstra:

\begin{array}{c}{q}_{t}\left(i\right)={p}_{t}\left(i\right)\phantom{\rule{1em}{0ex}}{D}_{L}<i<{D}_{H},\\ {q}_{t}\left(i\right)=\lambda {p}_{t}\left(i\right)\phantom{\rule{1em}{0ex}}\text{otherwise},\end{array}

(2)

where *λ* is a small constant.
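The suppression in Eq. (2) is a simple cepstral liftering step. In the sketch below, the cut-off indices `d_low`, `d_high` and the constant `lam` are hypothetical values chosen for illustration; the paper does not specify them here.

```python
import numpy as np

def suppress_cepstra(p_t, d_low=20, d_high=120, lam=0.01):
    """Eq. (2): keep the mid-range cepstral bins D_L < i < D_H,
    scale the remaining bins by a small constant lambda."""
    q_t = lam * p_t.copy()                         # lambda * p_t(i) everywhere ...
    q_t[d_low + 1:d_high] = p_t[d_low + 1:d_high]  # ... except the harmonic mid-range
    return q_t
```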

After the lower and higher cepstra are suppressed, the harmonic structure information *q*_{t}(*i*) is converted back to the linear domain *w*_{t}(*j*) by the inverse DCT (IDCT) and an exponential transform. Then, *w*_{t}(*j*) is integrated into *b*_{t}(*k*) by the *K*-channel mel-scaled band-pass filter.

Finally, the harmonic structure-based mel cepstral coefficients are obtained by converting *b*_{t}(*k*) into the mel-cepstrum *c*_{t}(*n*) with the DCT matrix *M*_{b}(*n*, *k*):

{c}_{t}\left(n\right)=\sum _{k=1}^{K}{M}_{b}\left(n,k\right)\cdot {b}_{t}\left(k\right).

(3)
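The remaining steps of the pipeline (IDCT, exponential transform, mel integration, final DCT of Eq. (3)) can be sketched as below. Assumptions: `mel_fb` is a precomputed *K* × *J* mel filterbank matrix (not built here), and a log is applied to the filter outputs before the final DCT, following standard mel-cepstrum practice; the paper does not state that step explicitly.

```python
import numpy as np
from scipy.fft import dct, idct

def harmonic_mel_cepstra(q_t, mel_fb, n_ceps=13):
    """Map liftered cepstra q_t(i) to harmonic-structure mel cepstra c_t(n)."""
    w_t = np.exp(idct(q_t, type=2, norm='ortho'))   # back to the linear spectral domain w_t(j)
    b_t = np.log(mel_fb @ w_t + 1e-12)              # K-channel mel band integration b_t(k)
    return dct(b_t, type=2, norm='ortho')[:n_ceps]  # Eq. (3): c_t(n) via DCT matrix M_b
```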

#### 2.2.2 Higher-order statistics

Generally, the HOS of speech are nonzero and sufficiently distinct from those of Gaussian noise. Moreover, Nemer et al. [19] reported that the skewness and kurtosis of the linear predictive coding (LPC) residual of steady voiced speech discriminate speech from noise more effectively.

Assume that {*x*(*n*)}, *n* = 0, ±1, ±2,..., is a real stationary discrete-time signal whose moments up to order *k* exist. Then the *k*th-order moment function is given as follows:

{m}_{k}\left({\tau}_{1},{\tau}_{2}\dots {\tau}_{k-1}\right)\equiv E\left[x\left(n\right)x\left(n+{\tau}_{1}\right)\dots x\left(n+{\tau}_{k-1}\right)\right],

(4)

where *τ*_{1}, *τ*_{2},..., *τ*_{*k*-1} = 0, ±1, ±2,..., and *E*[·] denotes the statistical expectation. If the signal has zero mean, then the cumulant sequences of {*x*(*n*)} can be defined as follows:

Second-order cumulant

{C}_{2}\left({\tau}_{1}\right)={m}_{2}\left({\tau}_{1}\right).

(5)

Third-order cumulant

{C}_{3}\left({\tau}_{1},{\tau}_{2}\right)={m}_{3}\left({\tau}_{1},{\tau}_{2}\right).

(6)

Fourth-order cumulant

\begin{array}{c}{C}_{4}\left({\tau}_{1},{\tau}_{2},{\tau}_{3}\right)={m}_{4}\left({\tau}_{1},{\tau}_{2},{\tau}_{3}\right)-{m}_{2}\left({\tau}_{1}\right)\cdot {m}_{2}\left({\tau}_{2}-{\tau}_{3}\right)\\ -{m}_{2}\left({\tau}_{2}\right)\cdot {m}_{2}\left({\tau}_{3}-{\tau}_{1}\right)-{m}_{2}\left({\tau}_{3}\right)\cdot {m}_{2}\left({\tau}_{1}-{\tau}_{2}\right).\end{array}

(7)

Let *τ*_{1}, *τ*_{2},..., *τ*_{*k*-1} = 0; then the higher-order statistics, namely the variance *γ*_{2}, skewness *γ*_{3}, and kurtosis *γ*_{4}, can be expressed as follows:

{\gamma}_{2}=E\left[{x}^{2}\left(n\right)\right]={m}_{2},

(8a)

{\gamma}_{3}=E\left[{x}^{3}\left(n\right)\right]={m}_{3},

(8b)

{\gamma}_{4}=E\left[{x}^{4}\left(n\right)\right]-3{\gamma}_{2}^{2}={m}_{4}-3{m}_{2}^{2}.

(8c)
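The zero-lag statistics of Eqs. (8a)-(8c) are straightforward to compute per frame. A minimal sketch (the zero-mean assumption is enforced by subtracting the sample mean):

```python
import numpy as np

def hos_features(x):
    """Zero-lag moments/cumulants of a frame (Eqs. 8a-8c)."""
    x = x - x.mean()                   # enforce the zero-mean assumption
    m2 = np.mean(x ** 2)               # variance gamma_2 (Eq. 8a)
    m3 = np.mean(x ** 3)               # skewness gamma_3 (Eq. 8b)
    g4 = np.mean(x ** 4) - 3 * m2**2   # kurtosis gamma_4, fourth cumulant (Eq. 8c)
    return m2, m3, g4
```

For Gaussian noise, both *γ*_{3} and *γ*_{4} are close to zero, which is exactly the property the VAD exploits to separate speech from Gaussian-like noise.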

Moreover, steady voiced speech can be modeled as a sum of *M* coherent sine waves, and the skewness and kurtosis of the LPC residual of the steady voiced speech can be written as functions of the signal energy *E*_{s} and the number of harmonics *M* [12]:

{\gamma}_{3}=\frac{3}{2\sqrt{2}}{\left({E}_{s}\right)}^{\frac{3}{2}}\left[\frac{M-1}{M}\right],

(9)

and

{\gamma}_{4}={{E}_{s}}^{2}\left[\frac{4}{3}M-4+\frac{7}{6M}\right].

(10)
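The closed forms of Eqs. (9) and (10) are easy to evaluate directly; a small sketch (function names are ours, chosen for illustration):

```python
import math

def residual_skewness(E_s, M):
    """Eq. (9): skewness of the LPC residual of M coherent sine waves."""
    return 3.0 / (2.0 * math.sqrt(2.0)) * E_s ** 1.5 * (M - 1) / M

def residual_kurtosis(E_s, M):
    """Eq. (10): kurtosis of the LPC residual of M coherent sine waves."""
    return E_s ** 2 * (4.0 * M / 3.0 - 4.0 + 7.0 / (6.0 * M))
```

As a sanity check, a single sine wave (*M* = 1) gives zero skewness, consistent with Eq. (9).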

### 2.3 VAD in HMM/GMM model

One of the most widely used methods to model speech characteristics is the Gaussian function or Gaussian mixture model (GMM). GMM-based VAD has attracted considerable attention for its high accuracy in speech/non-speech detection. However, the number of mixtures must be very large to distinguish speech from non-speech, which increases the computational cost dramatically. Moreover, *N*-order GMMs cannot discriminate non-speech from speech precisely, since the boundary between speech and non-speech is not clear enough. In this article, we improve on this method by treating non-speech as an additional phoneme (named *'usp'*) alongside the conventional Mandarin phonemes (such as *'zh'*, *'ang'*, etc.), and by using the HMM/GMM hybrid model to discriminate non-speech from speech.

In HMM/GMM-based speech recognition [20], it is assumed that the sequence of observed speech vectors corresponding to each word is generated by a hidden Markov model, as shown in Figure 3. Here, *a*_{ij} and *b*(*o*) denote the transition probabilities and the output probabilities, respectively; 2, 3, and 4 are the states of the state sequence \mathcal{X}, and *O*_{i} represents the observations of the observation sequence \mathcal{O}.

Since only the observation sequence \mathcal{O} is known and the underlying state sequence \mathcal{X} is hidden, the required likelihood is computed by summing over all possible state sequences \mathcal{X}=x\left(1\right),x\left(2\right),x\left(3\right),\dots ,x\left(T\right), that is

P(\mathcal{O}|\mathcal{M})={\displaystyle \sum _{\mathcal{X}}{a}_{x(0)x(1)}}{\displaystyle \prod _{t=1}^{T}{b}_{x(t)}}({O}_{t}){a}_{x(t)x(t+1)},

(11)

where *x*(0) is constrained to be the model entry state and *x*(*T* + 1) is constrained to be the model exit state. In the hybrid model, the output distributions are represented by GMMs as

{b}_{j}\left({\mathbf{o}}_{t}\right)=\sum _{m=1}^{M}{c}_{jm}\mathcal{N}\left({\mathbf{o}}_{t},{\mu}_{jm},{\mathbf{\Sigma}}_{jm}\right),

(12)

where *M* is the number of mixture components, *c*_{jm} is the weight of the *m*th component, and \mathcal{N}\left(\mathbf{o},\mu ,\mathbf{\Sigma}\right) is a multivariate Gaussian with mean vector *μ* and covariance matrix **Σ**, that is

\mathcal{N}\left(\mathbf{o},\mu ,\mathbf{\Sigma}\right)=\frac{1}{\sqrt{{\left(2\pi \right)}^{n}\left|\mathbf{\Sigma}\right|}}{e}^{-\frac{1}{2}{\left(\mathbf{o}-\mu \right)}^{T}{\mathbf{\Sigma}}^{-1}\left(\mathbf{o}-\mu \right)},

(13)

where *n* is the dimensionality of **o**.
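Eqs. (12) and (13) translate directly into code. A minimal sketch with small toy dimensions (function names are ours; a practical system would work in the log domain for numerical stability):

```python
import numpy as np

def gauss_pdf(o, mu, sigma):
    """Multivariate Gaussian density N(o; mu, Sigma) of Eq. (13)."""
    n = o.size
    diff = o - mu
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

def gmm_output_prob(o, weights, means, covs):
    """State output probability b_j(o_t) as a weighted mixture (Eq. 12)."""
    return sum(c * gauss_pdf(o, mu, S) for c, mu, S in zip(weights, means, covs))
```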

In the HMM/GMM-based VAD method, we use the same phoneme-recognition procedure that is usually employed in ASR systems. First, each phoneme (including the conventional phonemes and the non-speech phoneme) in the HMM/GMM hybrid model is initialized. Then, the underlying HMM parameters are re-estimated with the Baum-Welch algorithm. In the discrimination step, the Viterbi algorithm is employed to search for the maximum-likelihood interpretation of the observed signals; see [20] for details. Note that the triphones essential for ASR are not adopted here, because we consider monophone-based recognition appropriate for discriminating speech from non-speech.
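The decoding step can be sketched as a generic log-domain Viterbi search over HMM states. This is a textbook sketch, not the paper's trained recognizer: the initial, transition, and emission probabilities below are toy values, and in the real system the emissions would come from the GMM output distributions of Eq. (12).

```python
import numpy as np

def viterbi(log_pi, log_a, log_b):
    """Most likely state path through an HMM.
    log_pi: (S,) log initial probs; log_a: (S, S) log transition probs;
    log_b: (T, S) log output probs for each frame."""
    T, S = log_b.shape
    delta = log_pi + log_b[0]              # best log score ending in each state at t = 0
    psi = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_a    # scores[prev, next]
        psi[t] = scores.argmax(axis=0)     # best predecessor for each state
        delta = scores.max(axis=0) + log_b[t]
    path = [int(delta.argmax())]           # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

Running the decoder over the frame-level emission scores yields a state (phoneme) sequence; frames decoded as the *'usp'* phoneme are labeled non-speech.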