Voice activity detection based on conjugate subspace matching pursuit and likelihood ratio test
- Shiwen Deng^{1, 2} and
- Jiqing Han^{1}
https://doi.org/10.1186/1687-4722-2011-12
© Deng and Han; licensee Springer. 2011
Received: 29 June 2011
Accepted: 21 December 2011
Published: 21 December 2011
Abstract
Most voice activity detection (VAD) schemes operate in the discrete Fourier transform (DFT) domain, classifying each sound frame as speech or noise based on its DFT coefficients. Because these coefficients serve as the VAD features, their robustness strongly affects the performance of the VAD scheme. However, shortcomings of modeling a signal in the DFT domain can easily degrade VAD performance in a noisy environment. Instead of using the DFT coefficients, this article presents a novel approach that uses the complex coefficients derived from a complex exponential atomic decomposition of the signal. Using a goodness-of-fit test, we show that these coefficients can suitably be modeled by a Gaussian probability distribution. A statistical model is employed to derive the decision rule from the likelihood ratio test. According to the experimental results, the proposed VAD method performs better than the VAD based on the DFT coefficients in various noise environments.
1 Introduction
Voice activity detection (VAD) refers to the problem of distinguishing active speech from non-speech regions in a given audio stream. It has become an indispensable component of many speech processing applications and modern speech communication systems [1–3], such as robust speech recognition, speech enhancement, and coding systems. Traditional VAD algorithms in the earlier literature are based on energy, zero-crossing rate, and spectral difference [1, 4, 5]. However, these algorithms are easily degraded by environmental noise.
Recently, much work has been done to improve VAD performance in high-noise environments by incorporating a statistical model and a likelihood ratio test (LRT) [6]. These algorithms assume that the distributions of the noise and noisy speech spectra follow a parametric model such as the complex Gaussian [7], complex Laplacian [8], generalized Gaussian [9], or generalized Gamma distribution [10]. Moreover, some LRT-based algorithms consider more complex statistical structure of the signals, such as the multiple observation likelihood ratio test (MO-LRT) [11, 12], higher order statistics (HOS) [13, 14], and the modified maximum a posteriori (MAP) criterion [15, 16].
In this article, we present an approach for VAD based on the conjugate subspace matching pursuit (MP) and a statistical model. Specifically, the MP is carried out in each frame by first selecting the most dominant component, then subtracting its contribution from the signal and iterating the estimation on the residual. Because a component is subtracted at each iteration, the next component selected from the residual does not interfere with the previous one. Subsequently, the coefficients extracted in each frame, called the MP feature [21], are modeled with a complex Gaussian distribution, and the LRT is then applied. Experimental results indicate that the proposed VAD algorithm outperforms the conventional algorithms based on the DFT coefficients in various noise environments.
The rest of this article is organized as follows. Section 2 reviews the method of the conjugate subspace MP. Section 3 presents our proposed approach for VAD based on the MP coefficients and the statistical model. Implementation issues and the experimental results are shown in Section 4. Section 5 concludes this study.
2 Signal atomic decomposition based on conjugate subspace MP
In this section, we briefly review the process of signal decomposition using the conjugate subspace MP [19, 20]. The conjugate subspace MP algorithm is described in Section 2.1. A demonstration of the algorithm and a comparison between the MP coefficients and the DFT coefficients are presented in Section 2.2.
2.1 Conjugate subspace MP
where i and n are the frequency and time indexes, and S is a constant that normalizes each atom to unit norm. The complex exponential dictionary is denoted D = [g_{1}, ..., g_{ M }], where M is the number of dictionary elements, with M > N. Note that this dictionary encodes prior knowledge of the statistical structure of the signals of interest. Here, the prior knowledge is that speech is a sum of a few complex exponentials with complex weights. Hence, speech can be represented by a few atoms in the dictionary, whereas noise cannot.
The conjugate subspace MP is a method of subspace pursuit. In subspace pursuit, the residual of a signal is projected onto a set of subspaces, each of which is spanned by some atoms from the dictionary, and the most dominant component in the corresponding subspace is selected and subtracted from the residual. Each of the subspaces in the conjugate subspace MP is the two-dimensional subspace spanned by an atom and its complex conjugate. With the given complex dictionary, the conjugate subspace MP proceeds as follows.
where g* is the complex conjugate of g and c = ⟨g, g*⟩ is the conjugate cross-correlation coefficient. To obtain the atomic decomposition of a signal, the MP iteration continues until a halting criterion is met.
where ${\left\{{\alpha}_{k}\right\}}_{k=1}^{K}$ are referred to as the complex MP coefficients of atomic decomposition.
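The selection and subtraction steps described above can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: the dictionary frequencies (a uniform grid strictly inside (0, π)), the normalization S = 1/sqrt(N), the use of a fixed iteration count K as the halting criterion, and all function names are assumptions. The coefficient formula α = (b − c·b*)/(1 − |c|²), with b = ⟨g, r⟩, follows from projecting a real residual r onto the subspace spanned by g and g*.

```python
import cmath
import math


def make_dictionary(N, M):
    """Build M unit-norm atoms g_i[n] = S * exp(j * w_i * n), n = 0..N-1.

    Assumption: the frequencies w_i lie strictly inside (0, pi), so no atom
    is real-valued and |c| < 1 for every atom.
    """
    S = 1.0 / math.sqrt(N)
    omegas = [math.pi * (i + 0.5) / M for i in range(M)]
    atoms = [[S * cmath.exp(1j * w * n) for n in range(N)] for w in omegas]
    return atoms, omegas


def inner(g, r):
    """Inner product <g, r> = sum_n conj(g[n]) * r[n]."""
    return sum(gn.conjugate() * rn for gn, rn in zip(g, r))


def conjugate_subspace_mp(x, atoms, K):
    """Run K iterations of conjugate subspace MP on a real-valued frame x.

    Returns the complex coefficients, the selected atom indexes and the
    final residual.
    """
    r = [float(v) for v in x]
    # Conjugate cross-correlation c_i = <g_i, g_i*> of every atom.
    cs = [sum(gn.conjugate() ** 2 for gn in g) for g in atoms]
    alphas, picked = [], []
    for _ in range(K):
        best = None
        for i, (g, c) in enumerate(zip(atoms, cs)):
            b = inner(g, r)
            # Coefficient of the projection of r onto span{g, g*}.
            a = (b - c * b.conjugate()) / (1.0 - abs(c) ** 2)
            # Energy of that projection: ||2 Re(a g)||^2 = 2 Re(conj(a) b).
            energy = 2.0 * (a.conjugate() * b).real
            if best is None or energy > best[0]:
                best = (energy, i, a)
        _, i, a = best
        alphas.append(a)
        picked.append(i)
        # Subtract the selected component 2 Re(a g_i) from the residual.
        r = [rn - 2.0 * (a * gn).real for rn, gn in zip(r, atoms[i])]
    return alphas, picked, r
```

For a single sinusoid whose frequency lies on the dictionary grid, one iteration selects the matching atom and leaves an essentially zero residual. In practice the loops would be vectorized with a numerical library; plain Python is used here only to keep the sketch self-contained.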
2.2 Demonstration of algorithm and comparison between MP coefficients and DFT coefficients
where F_{ s } = 4,000 Hz is the sampling frequency, and the frequencies f_{1}, f_{2}, ..., f_{5} are 100, 115, 130, 160, and 200 Hz, respectively.
As shown in Figure 2, the MP coefficients accurately capture all the frequency components of the original signal x[m] from the noisy signal y[m], whereas the DFT coefficients capture only two of them. Moreover, the MP coefficients represent the frequency components without interference between spectral components, such as the components at A, B, and C shown in Figure 2d, whereas the DFT coefficients fail to do so even in the noise-free case. Therefore, the MP coefficients are more robust than the DFT coefficients and are not sensitive to the noise.
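One way to see why closely spaced components can merge in the DFT is to compare their spacing with the DFT bin spacing F_s/N. The sketch below does this arithmetic for the five test frequencies. The frame length of 128 samples is purely an illustrative assumption, since the frame length used for Figure 2 is not stated in this section, and the function names are hypothetical.

```python
def dft_bin_spacing(fs_hz, frame_len):
    """Frequency spacing between adjacent DFT bins, in Hz."""
    return fs_hz / frame_len


def nearest_bins(freqs_hz, fs_hz, frame_len):
    """Index of the DFT bin closest to each frequency."""
    spacing = dft_bin_spacing(fs_hz, frame_len)
    return [round(f / spacing) for f in freqs_hz]


FS_HZ = 4000                          # sampling frequency from the text
FREQS_HZ = [100, 115, 130, 160, 200]  # test frequencies from the text
FRAME_LEN = 128                       # hypothetical frame length (assumption)

SPACING_HZ = dft_bin_spacing(FS_HZ, FRAME_LEN)
BINS = nearest_bins(FREQS_HZ, FS_HZ, FRAME_LEN)
```

Under this assumed frame length the bin spacing (31.25 Hz) is about twice the 15 Hz separation between the three lowest components, so two of them map to the same bin and all five fall in adjacent bins, where spectral leakage makes them hard to separate.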
3 Decision rule based on MP coefficients and LRT
Section 3.1 presents the VAD based on the MP coefficients and the LRT. To test the distribution of the MP coefficients, a goodness-of-fit (GOF) test for those coefficients is provided in Section 3.2. More details about the MP feature are discussed in Section 3.3.
3.1 Statistical modeling of the MP coefficients and decision rule
where λ_{ s,k } and λ_{ n,k } are the variances of MP coefficients of clean speech and noise, respectively.
where η denotes a threshold value.
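A decision rule of this kind can be sketched as follows. The sketch assumes the standard zero-mean complex Gaussian model used in LRT-based VAD [7], in which a coefficient has variance λ_{n,k} under the noise-only hypothesis H_0 and λ_{n,k} + λ_{s,k} under the speech hypothesis H_1, so that the per-coefficient likelihood ratio is Λ_k = exp(γ_k ξ_k / (1 + ξ_k)) / (1 + ξ_k) with ξ_k = λ_{s,k}/λ_{n,k} and γ_k = |α_k|²/λ_{n,k}. Comparing the geometric mean of the Λ_k with η is done in the log domain. The function names are hypothetical and the exact form used in the article may differ.

```python
import math


def log_likelihood_ratio(alpha_k, lam_s, lam_n):
    """log of p(alpha_k | H1) / p(alpha_k | H0) under the assumed
    zero-mean complex Gaussian model."""
    xi = lam_s / lam_n                  # a priori SNR
    gamma = abs(alpha_k) ** 2 / lam_n   # a posteriori SNR
    return gamma * xi / (1.0 + xi) - math.log1p(xi)


def is_speech(alphas, lam_s, lam_n, eta):
    """Compare the geometric mean of the K likelihood ratios with eta.

    The geometric mean exceeds eta exactly when the arithmetic mean of the
    log likelihood ratios exceeds log(eta).
    """
    logs = [log_likelihood_ratio(a, s, n)
            for a, s, n in zip(alphas, lam_s, lam_n)]
    return sum(logs) / len(logs) > math.log(eta)
```

Working with log likelihood ratios avoids overflow when |α_k|² is large relative to the noise variance.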
3.2 GOF test for MP coefficients
The MP coefficients were assumed to follow a Gaussian distribution in the section above. To test this assumption, we carried out a statistical fitting test for the noisy MP coefficients conditioned on both hypotheses under various noise conditions. To this end, the Kolmogorov-Smirnov (KS) test [22], which serves as a GOF test, is employed to provide a reliable check of the statistical assumption.
where α_{(n)}, n = 1, ..., N, are the order statistics of the data α. To compute the order statistics, the elements of α are sorted so that α_{(1)} is the smallest element of α and α_{(N)} is the largest.
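The KS statistic built from these order statistics can be sketched as below. The sketch uses the usual two-sided form, which takes the largest gap between the empirical distribution function and a Gaussian model distribution on either side of each order statistic. Applying it to real-valued samples, for instance the real and imaginary parts of the coefficients separately, is an assumption, as are the function names.

```python
import math


def gaussian_cdf(z, mean, std):
    """Distribution function of a Gaussian with the given mean and std."""
    return 0.5 * (1.0 + math.erf((z - mean) / (std * math.sqrt(2.0))))


def ks_statistic(samples, mean, std):
    """Largest distance between the empirical CDF of the samples and the
    Gaussian CDF; smaller values indicate a better fit."""
    ordered = sorted(samples)  # order statistics, smallest to largest
    n = len(ordered)
    distance = 0.0
    for i, z in enumerate(ordered, start=1):
        model = gaussian_cdf(z, mean, std)
        # The empirical CDF jumps from (i - 1) / n to i / n at z.
        distance = max(distance, abs(i / n - model), abs(model - (i - 1) / n))
    return distance
```

The model with the smallest distance is the best fit among the candidate distributions.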
3.3 Obtaining MP features
As mentioned before, the DFT coefficients suffer from several shortcomings in modeling a signal and exposing its structure. We use the MP coefficients, ${\left\{{\alpha}_{k}\right\}}_{k=1}^{K}$, obtained by the MP as the new feature for discriminating between speech and non-speech. Owing to the atomic decomposition, the MP coefficients capture the characteristics of speech [17] and are insensitive to environmental noise. Therefore, the MP coefficients are a more suitable feature for the VAD classification task than the DFT coefficients.
where A_{ k }, ω_{ k }, and ϕ_{ k } are the amplitude, frequency, and phase of the sinusoidal component h_{ k }, respectively. These harmonic structures are prominent in a signal when speech is present, but not when only noise is present.
In a practical implementation, the MP feature is extracted as follows. The input signal is segmented into non-overlapping frames, and each frame is decomposed by the conjugate subspace MP, which yields the complex MP coefficients of that frame. The goal of the MP here is to extract the MP coefficients rather than to fully reconstruct the signal. These coefficients capture most of the characteristics of the signal, so that a VAD based on them can detect whether speech is present. Naturally, the choice of the iteration number K depends on the number of sinusoidal components in the speech signal.
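The frame-wise extraction can be sketched as follows. The decomposition itself is passed in as a callable, so `decompose` stands for any routine that returns the K coefficients of one frame. Discarding a trailing partial frame is an assumption, and the function names are hypothetical.

```python
def split_frames(signal, frame_len):
    """Cut the signal into consecutive non-overlapping frames.

    Assumption: samples left over after the last full frame are discarded.
    """
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def extract_mp_features(signal, frame_len, decompose, n_iter):
    """Apply decompose(frame, n_iter) to every frame and collect the results,
    giving one list of coefficients per frame."""
    return [decompose(frame, n_iter) for frame in split_frames(signal, frame_len)]
```

Because each frame is processed independently, the per-frame cost grows with the iteration number K, which is what limits online use.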
4 Experiments and results
4.1 Noise statistic update
To implement the VAD scheme, the variances of the noise MP coefficients, which are assumed to be known in Equation (14), must be estimated. We assume that the signal consists of noise only during a short initialization period, during which the initial noise characteristics are learned. Because the background noise is usually non-stationary, the estimate must be adaptively updated or tracked. The update is performed frame by frame using minimum mean square error (MMSE) estimation.
where ε = P(H_{1})/P(H_{0}) and ${\Lambda}_{k}^{\left(m\right)}=p\left({\alpha}_{k}^{\left(m\right)}|{H}_{1}\right)/p\left({\alpha}_{k}^{\left(m\right)}|{H}_{0}\right)$. Since the decision is made by observing all K MP coefficients, we replace the likelihood ratio of the k th MP coefficient, ${\Lambda}_{k}^{\left(m\right)}$, with the geometric mean ${\Lambda}_{g}^{\left(m\right)}$ in Equation (14).
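A soft-decision update of this kind can be sketched as below. It follows the form commonly used in statistical model-based VAD [7], which is an assumption here: the new estimate is the posterior-weighted average of the current frame's coefficient power (appropriate if the frame is noise only) and the previous estimate (appropriate if speech is present), with ε taken as the prior ratio P(H_1)/P(H_0). The function name, the default ε = 1, and the absence of any additional smoothing are hypothetical choices.

```python
def update_noise_variance(alpha_k, lam_n_prev, lr_geo, eps=1.0):
    """One-frame update of the noise variance of the k-th MP coefficient.

    alpha_k    : complex MP coefficient of the current frame
    lam_n_prev : noise variance estimate from the previous frame
    lr_geo     : geometric-mean likelihood ratio of the current frame
    eps        : assumed prior ratio P(H1) / P(H0)
    """
    p_h0 = 1.0 / (1.0 + eps * lr_geo)  # posterior probability of noise only
    p_h1 = 1.0 - p_h0                  # posterior probability of speech
    return p_h0 * abs(alpha_k) ** 2 + p_h1 * lam_n_prev
```

When the likelihood ratio is large the estimate stays close to its previous value, so speech frames do not inflate the noise variance, and when it is close to zero the estimate follows the observed power.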
4.2 Experimental results
Performance evaluation in different noise conditions
Noise | SNR (dB) | LRT-MP P_{d} (%) | LRT-MP P_{f} (%) | LRT-Laplacian P_{d} (%) | LRT-Laplacian P_{f} (%) |
---|---|---|---|---|---|
White | 0 | 87.9 | 10.7 | 88.7 | 10.3 |
White | 5 | 94.3 | 9.9 | 94.2 | 9.7 |
White | 10 | 96.4 | 9.5 | 95.8 | 9.6 |
White | 20 | 97.2 | 9.4 | 96.8 | 9.2 |
Vehicle | 0 | 85.3 | 10.9 | 80.3 | 11.4 |
Vehicle | 5 | 93.3 | 10.7 | 89.7 | 10.5 |
Vehicle | 10 | 95.4 | 9.1 | 92.5 | 10.2 |
Vehicle | 20 | 97.2 | 8.8 | 95.2 | 9.3 |
Babble | 0 | 63.3 | 11.1 | 58.7 | 11.9 |
Babble | 5 | 79.3 | 11.1 | 78.9 | 11.7 |
Babble | 10 | 84.2 | 9.3 | 80.6 | 10.4 |
Babble | 20 | 87.4 | 9.1 | 83.7 | 9.6 |
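The two figures of merit in the table can be computed from frame-level decisions as sketched below. The surviving text does not define them, so the sketch assumes the usual meaning: P_d is the percentage of speech frames correctly detected as speech, and P_f is the percentage of noise-only frames falsely detected as speech. The function name and the boolean label format are hypothetical.

```python
def detection_rates(decisions, labels):
    """Return (P_d, P_f) in percent.

    decisions : per-frame VAD outputs, True when speech is detected
    labels    : per-frame reference labels, True when speech is present
    """
    on_speech = [d for d, is_sp in zip(decisions, labels) if is_sp]
    on_noise = [d for d, is_sp in zip(decisions, labels) if not is_sp]
    p_d = 100.0 * sum(on_speech) / len(on_speech)  # detection rate
    p_f = 100.0 * sum(on_noise) / len(on_noise)    # false-alarm rate
    return p_d, p_f
```

Both numbers depend on the threshold η, so two detectors are best compared at similar P_f values, as the rows of the table roughly are.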
5 Conclusion
In this article, we presented a novel approach for VAD. The method is based on the complex atomic decomposition of a signal using the conjugate subspace MP. The decomposition yields the complex MP coefficients, which are modeled with a complex Gaussian distribution, a suitable choice according to the results of the GOF test. Based on this statistical model, the decision rule for VAD is derived from the LRT. In a practical implementation, the decision is made frame by frame.
The advantage of the proposed approach is that the MP coefficients are insensitive to environmental noise, and hence the VAD remains robust in high-noise environments. Note that this advantage comes at the expense of computation, which is proportional to the iteration number. Online detection can be implemented when the iteration number is smaller than 20. Furthermore, the experimental results show that the proposed approach outperforms the traditional VADs based on DFT coefficients in white, vehicle, and babble noise conditions.
Declarations
Acknowledgements
This study was supported by the Natural Science Foundation of China (No. 61071181 and 91120303).
Authors’ Affiliations
References
- Benyassine A, Shlomot E, Su HY, Massaloux D, Lamblin C, Petit JP: ITU-T Recommendation G.729, Annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications. IEEE Commun Mag 1997, 35(9):64-73. 10.1109/35.620527
- Itoh K, Mizushima M: Environmental noise reduction based on speech/non-speech identification for hearing aids. Proc Int Conf Acoust, Speech, and Signal Process 1997, 1: 419-422.
- Virag N: Single channel speech enhancement based on masking properties of the human auditory system. IEEE Trans Speech Audio Process 1999, 7(2):126-137. 10.1109/89.748118
- Woo K, Yang T, Park K, Lee C: Robust voice activity detection algorithm for estimating noise spectrum. Electron Lett 2000, 36(2):180-181. 10.1049/el:20000192
- Marzinzik M, Kollmeier B: Speech pause detection for noise spectrum estimation by tracking power envelope dynamics. IEEE Trans Speech Audio Process 2002, 10(6):341-351. 10.1109/TSA.2002.803420
- Kay SM: Fundamentals of Statistical Signal Processing. Prentice-Hall, Englewood Cliffs; 1998.
- Sohn J, Kim NS, Sung W: A statistical model-based voice activity detection. IEEE Signal Process Lett 1999, 6(1):1-3. 10.1109/97.736233
- Chang JH, Shin JW, Kim NS: Likelihood ratio test with complex Laplacian model for voice activity detection. In Proc Eurospeech. Geneva, Switzerland; 2003:1065-1068.
- Shin JW, Chang JH, Kim NS: Voice activity detection based on a family of parametric distributions. Pattern Recogn Lett 2007, 28(11):1295-1299. 10.1016/j.patrec.2006.11.015
- Shin JW, Chang JH, Yun HS, Kim NS: Voice activity detection based on generalized gamma distribution. Proc IEEE Internat Conf on Acoustics, Speech, and Signal Processing 2005, 1: 781-784. Corfu, Greece 17-19
- Ramirez J, Segura JC, Benitez C, Garcia L, Rubio A: Statistical voice activity detection using a multiple observation likelihood ratio test. IEEE Signal Process Lett 2005, 12(10):689-692.
- Gorriz JM, Ramirez J, Lang EW, Puntonet CG: Jointly Gaussian PDF-based likelihood ratio test for voice activity detection. IEEE Trans Speech Audio Process 2008, 16(8):1565-1578.
- Ramirez J, Gorriz JM, Segura JC, Puntonet CG, Rubio AJ: Speech/non-speech discrimination based on contextual information integrated bispectrum LRT. IEEE Signal Process Lett 2006, 13(8):497-500.
- Gorriz JM, Ramirez J, Puntonet CG, Segura JC: Generalized LRT-based voice activity detector. IEEE Signal Process Lett 2006, 13(10):636-639.
- Shin JW, Kwon HJ, Kim NS: Voice activity detection based on conditional MAP criterion. IEEE Signal Process Lett 2008, 15: 257-260.
- Deng S, Han J: A modified MAP criterion based on hidden Markov model for voice activity detection. Proc Int Conf Acoust, Speech, Signal Process 2011, 5220-5223. Prague 22-27
- Mallat SG, Zhang Z: Matching pursuit in a time-frequency dictionary. IEEE Trans Signal Process 1993, 41(12):3397-3415. 10.1109/78.258082
- Goodwin M: Matching pursuit with damped sinusoids. Proc IEEE Internat Conf on Acoustics, Speech, and Signal Processing 1997, 3: 2037-2040. Munich, Germany 21-24
- Goodwin M, Vetterli M: Matching pursuit and atomic signal models based on recursive filter banks. IEEE Trans Signal Process 1999, 47(7):1890-1902. 10.1109/78.771038
- McClure MR, Carin L: Matching pursuits with a wave-based dictionary. IEEE Trans Signal Process 1997, 45(12):2912-2927. 10.1109/78.650250
- Deng S, Han J: Voice activity detection based on complex exponential atomic decomposition and likelihood ratio test. In 20th Int Conf Pattern Recognition, ICPR 2010. Istanbul, Turkey; 2010:89-92.
- Reininger RC, Gibson JD: Distributions of the two dimensional DCT coefficients for images. IEEE Trans Commun 1983, 31(6):835-839. 10.1109/TCOM.1983.1095893
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.