Voice activity detection algorithm based on long-term pitch information
© The Author(s). 2016
Received: 12 January 2016
Accepted: 23 June 2016
Published: 7 July 2016
A new voice activity detection algorithm based on long-term pitch divergence is presented. The long-term pitch divergence not only decomposes speech signals with a bionic decomposition but also makes full use of long-term information. It is more discriminative than other feature sets, such as long-term spectral divergence. Experimental results show that, among the six algorithms analyzed, the proposed algorithm performs best, with the highest non-speech hit rate and a reasonably high speech hit rate.
Keywords: Voice activity detection; Non-stationary noise; Long-term pitch envelope; Long-term pitch divergence
1 Introduction
Voice activity detection (VAD) is an essential module in almost every audio signal processing application, including coding, enhancement, and recognition. VAD can increase efficiency and improve recognition rates by removing insignificant parts of the audio signal, such as silence or background noise, while retaining human voices. In high signal-to-noise ratio (SNR) conditions, this is a relatively simple task, since a satisfactory result can be reached merely by computing the frame energies and setting an appropriate classification threshold. However, in modern real-life applications, audio signals are usually corrupted by background noise, which causes the performance of such simple VAD algorithms to deteriorate dramatically.
A considerable amount of research has addressed VAD under extremely noisy conditions [2–5]. The main difference between these algorithms lies in the feature sets they exploit, which include spectrum-based features, cepstrum-based features, fundamental frequency-based features, entropy, harmonic, and energy-based features. Among these, the long-term spectral divergence (LTSD) feature stands out because of its simplicity, adaptability, and good performance. Nevertheless, its performance still leaves room for improvement in non-stationary noise, especially in environmental noise such as factory or battlefield noise, which is usually characterized by large, irregular random bursts embedded in a relatively stationary background.
In this paper, we propose a new VAD algorithm based on long-term pitch divergence (LTPD) features. Unlike the LTSD feature, LTPD takes advantage of time-varying pitch information and can cope with the difficult noises mentioned above. In a sense, pitch is a special type of spectrum. Both decompose audio signals into spectral bands; however, the pitch bands are spaced not linearly but logarithmically, following what is termed the equal-tempered scale. In music-related tasks, this logarithmic decomposition has proved to be better suited to human perception, and pitch-based features are more discriminative. Thus, compared to LTSD, LTPD benefits not only from long-term information about speech signals but also from a logarithmic decomposition, which is more reasonable than a linear spectral decomposition. The experimental results show that the average performance of the proposed method is the best among the VADs analyzed.
The outline of this paper is as follows. Pitch-based audio features are given in Section 2. We then present our LTPD-VAD algorithm in Section 3. Section 4 describes the database and experimental setup and analyzes the evaluation results. Finally, the conclusion is given in Section 5.
2 Long-term pitch divergence features
2.1 Pitch-based audio features
Pitch-based audio features are extracted by decomposing audio signals into 88 frequency bands, where each band corresponds to a pitch of the equal-tempered scale. The decomposition is realized by a suitable multi-rate filter bank consisting of elliptic filters. This representation of audio signals can then be used as a basis for deriving audio features with various characteristics [13, 14], such as chroma pitch, chroma log pitch, and chroma energy normalized statistics.
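To make the 88-band layout concrete, the following minimal sketch computes the center frequencies of an equal-tempered pitch scale. It assumes, as an illustration only, that the 88 bands are the MIDI pitches 21 to 108 (the range of a standard piano keyboard) tuned to A4 = 440 Hz; the names used here are hypothetical, and the sketch does not implement the elliptic multi-rate filter bank itself.

```python
# Assumption: the 88 pitch bands are MIDI pitches 21 (A0) to 108 (C8),
# with the reference tuning A4 = 440 Hz. Both choices are illustrative.
A4_MIDI = 69
A4_HZ = 440.0
PITCH_RANGE = range(21, 109)  # 88 pitches


def pitch_center_hz(midi_pitch: float) -> float:
    """Center frequency of an equal-tempered pitch given its MIDI number."""
    return A4_HZ * 2.0 ** ((midi_pitch - A4_MIDI) / 12.0)


def pitch_band_edges_hz(midi_pitch: int) -> tuple:
    """Band edges half a semitone below and above the center frequency."""
    return (pitch_center_hz(midi_pitch - 0.5), pitch_center_hz(midi_pitch + 0.5))


centers = [pitch_center_hz(p) for p in PITCH_RANGE]
```

Adjacent center frequencies differ by the constant ratio 2^(1/12), so the bands are equally spaced on a logarithmic frequency axis rather than on a linear one.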
2.2 Definition of LTPD
The definitions of LTPD and LTSD are nearly identical. The main difference is the scale of the spectral bands, which is logarithmic rather than linear. However, this subtlety is of considerable importance because logarithmic spectral decomposition is superior to the linear form in theory as well as in practice. This conclusion is supported by comparing the distributions of LTPD and LTSD shown in Section 2.3.
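Since the defining equations are not reproduced in this text, the sketch below only illustrates the idea under a stated assumption: that LTPD mirrors the usual LTSD form, with a long-term envelope taken as the per-band maximum over 2M + 1 neighboring frames and the divergence taken as the band-averaged ratio of squared envelope to squared noise estimate, in dB. The function names, the edge handling, and the input layout (one list of band magnitudes per frame) are hypothetical.

```python
import math


def long_term_envelope(frames, t, order):
    """Per-band maximum over frames t-order .. t+order (clipped at the edges)."""
    lo = max(0, t - order)
    hi = min(len(frames), t + order + 1)
    n_bands = len(frames[0])
    return [max(frames[j][p] for j in range(lo, hi)) for p in range(n_bands)]


def long_term_divergence_db(frames, noise, t, order):
    """Assumed LTSD-style divergence between the envelope and a noise estimate."""
    env = long_term_envelope(frames, t, order)
    ratio = sum((e * e) / (n * n) for e, n in zip(env, noise)) / len(env)
    return 10.0 * math.log10(ratio)


# Toy input: three frames, two bands, unit noise magnitude in each band.
frames = [[1.0, 1.0], [2.0, 1.0], [1.0, 4.0]]
noise = [1.0, 1.0]
ltpd_mid = long_term_divergence_db(frames, noise, t=1, order=1)
```

In this toy input the envelope at the middle frame is [2, 4], so the averaged power ratio is (4 + 16) / 2 = 10, that is, 10 dB.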
2.3 LTPD distributions of speech and non-speech
In this section, we present the distributions of the LTPD as a function of the window order M in order to clarify the motivation for the proposed algorithm. To study the distribution of the LTPD feature, speech from the TIMIT corpus and noises (factory, fighter jet, destroyer, and tank noise) from the NOISEX-92 corpus were used in the analyses. More details about the databases are presented in Section 4.
The conclusion about LTPD above is identical to the published conclusion concerning the effect of window length on the LTSD feature. Consequently, LTPD can also take advantage of the long-term information of speech, as LTSD does.
3 The proposed VAD algorithm
where SNR(t) is the SNR estimated at frame t, SNR0 and SNR1 are the SNRs in the cleanest and noisiest background noises, and γ0 and γ1 are their optimal thresholds, respectively.
where K is a constant. The estimation of SNR is based only on the K + 1 frames before frame t; thus, it can diminish the effect of time variation of the SNR.
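The threshold equation itself is not reproduced in this text. As a hedged illustration, the sketch below assumes the piecewise-linear form commonly used with LTSD-style detectors: the threshold equals γ0 at or above SNR0, γ1 at or below SNR1, and is linearly interpolated in between. The function names and all numeric values are placeholders, not the parameters tuned in the paper.

```python
def adaptive_threshold(snr_db, snr0_db, snr1_db, gamma0, gamma1):
    """Assumed piecewise-linear threshold; requires snr0_db > snr1_db."""
    if snr0_db <= snr1_db:
        raise ValueError("snr0_db (cleanest) must exceed snr1_db (noisiest)")
    if snr_db >= snr0_db:
        return gamma0
    if snr_db <= snr1_db:
        return gamma1
    frac = (snr_db - snr1_db) / (snr0_db - snr1_db)
    return gamma1 + frac * (gamma0 - gamma1)


def is_speech(ltpd_db, snr_db, snr0_db, snr1_db, gamma0, gamma1):
    """Frame decision: speech if the divergence exceeds the adapted threshold."""
    return ltpd_db > adaptive_threshold(snr_db, snr0_db, snr1_db, gamma0, gamma1)


# Placeholder values for illustration only.
mid_threshold = adaptive_threshold(7.5, snr0_db=20.0, snr1_db=-5.0,
                                   gamma0=6.0, gamma1=2.0)
```

With these placeholder values, an estimated SNR of 7.5 dB lies halfway between the two anchor SNRs, so the adapted threshold is halfway between the two anchor thresholds.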
4 Experiments and results
To illustrate the effectiveness of LTPD-VAD, several up-to-date voice activity detection methods that have been shown to be noise robust are chosen for comparison. They are Sohn, Harmfreq, LTSD, LTSV, and LSFM.
4.1 Data and experimental setup
To evaluate the proposed method, utterances from the TIMIT corpus are used. Utterances in TIMIT are on average no longer than 4 s and contain very few non-speech segments. Thus, a single utterance is too short to evaluate a VAD algorithm properly. Hence, a number of randomly chosen utterances from every dialect region (i.e., DR1 to DR8) have been concatenated into a single speech recording, with 2.5 s of silence added at the beginning, at the end, and at the junctions between utterances. The amplitude of each utterance has been normalized in order to equalize the power. The initial labels have been obtained by a simple energy VAD and examined visually. Of all samples, 47.06 % are labeled as active speech. Two datasets, development and test, have been constructed, each with a duration of about 600 s. The development and test datasets are used to estimate the parameters and to evaluate the performance, respectively.
The noises, taken from the NOISEX-92 corpus, are:
factory1 (noise near plate-cutting and electrical welding equipment);
factory2 (noise in a car production hall);
leopard (military vehicle noise);
m109 (tank noise);
opsroom (destroyer operations room background noise);
f16 (F-16 cockpit noise);
buccaneer1 (Buccaneer jet traveling at 190 knots);
buccaneer2 (Buccaneer jet traveling at 450 knots);
babble (100 people speaking in a canteen);
engine (destroyer engine room noise);
hfchannel (noise in an HF radio channel after demodulation);
machinegun (a .50-caliber gun fired repeatedly);
pink (pink noise);
volvo (Volvo 340 noise);
white (white noise).
Among these noises, only white and pink noises are stationary.
To add noise to speech at a desired SNR, the open-source Filtering and Noise Adding Tool (FaNT) is used.
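The basic idea behind mixing at a target SNR can be sketched as follows. This is a minimal illustration that scales the noise according to plain mean power over the whole signal; it does not attempt to reproduce FaNT's filtering or level-measurement behavior, and the function names are hypothetical.

```python
import math


def mean_power(x):
    """Average squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db, then add."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals for illustration only.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(speech, noise, snr_db=-5.0)
```

Note that a power measured over the whole recording includes the inserted silences, so the effective SNR during active speech would be higher than the nominal value in such a simple scheme.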
The audio signals have been divided into 50 ms-long non-overlapping frames and windowed with a periodic Hamming window. The pitch features are extracted using the Chroma Toolbox. The MMSE-based noise estimator is based on the MATLAB implementation estnoiseg in Voicebox. The order of LTSD is 3. To compute LTSV and LSFM, the long-term window length is 6 and the parameter of the Welch-Bartlett method is 2. All of these parameter values are smaller than those recommended in the corresponding references because of the longer frame length and the absence of frame overlap.
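The framing step described above can be sketched as follows, assuming a 16 kHz sampling rate (the rate of the TIMIT recordings), which gives 800 samples per 50 ms frame. The function names are hypothetical, and dropping the trailing partial frame is a choice made for this sketch rather than something stated in the text.

```python
import math


def periodic_hamming(n):
    """Periodic Hamming window: the denominator is n rather than n - 1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def frame_signal(samples, sample_rate_hz, frame_ms=50.0):
    """Split into non-overlapping windowed frames; a trailing partial frame is dropped."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    window = periodic_hamming(frame_len)
    n_frames = len(samples) // frame_len
    return [
        [samples[k * frame_len + i] * window[i] for i in range(frame_len)]
        for k in range(n_frames)
    ]


# 2000 samples at 16 kHz yield two full 800-sample frames.
example_frames = frame_signal([1.0] * 2000, sample_rate_hz=16000)
```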
The receiver operating characteristic (ROC) curves and area under curve (AUC) values are used to describe the average performance of the VAD algorithms. Detection performance at different SNR levels is also assessed in terms of the non-speech hit rate (HR0) and the speech hit rate (HR1).
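These metrics can be computed from frame-level labels as in the sketch below. It takes HR0 as the fraction of non-speech frames classified as non-speech and HR1 as the fraction of speech frames classified as speech, and it computes the AUC with the rank-based (Mann-Whitney) formulation, which equals the area under the ROC curve. The function names are hypothetical, and the quadratic-time AUC is written for clarity rather than speed.

```python
def hit_rates(labels, decisions):
    """Return (HR0, HR1) for binary labels and decisions (0 = non-speech, 1 = speech)."""
    n0 = sum(1 for y in labels if y == 0)
    n1 = len(labels) - n0
    hr0 = sum(1 for y, d in zip(labels, decisions) if y == 0 and d == 0) / n0
    hr1 = sum(1 for y, d in zip(labels, decisions) if y == 1 and d == 1) / n1
    return hr0, hr1


def auc(labels, scores):
    """Probability that a speech frame scores higher than a non-speech frame."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: four frames, decisions obtained by thresholding the scores at 0.5.
labels = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]
decisions = [1 if s > 0.5 else 0 for s in scores]
hr0, hr1 = hit_rates(labels, decisions)
area = auc(labels, scores)
```

In the toy example, three of the four speech/non-speech score pairs are ordered correctly, so the AUC is 0.75, while the single threshold at 0.5 gives HR0 = HR1 = 0.5.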
4.2 Evaluation results
AUC values of the evaluated VAD algorithms under −5 dB SNR
For other noises such as m109, opsroom, engine, and hfchannel, the best performance is obtained by the LTSV-based VAD algorithm, which means that the LTSV measure can effectively distinguish these noises from the corresponding noisy speech. The LTSV method not only takes advantage of long-term information but also benefits from the signal variability defined in LTSV. However, the LTPD-based VAD algorithm still outperforms all other algorithms except LTSV.
For vehicle interior noises such as leopard and volvo, the characteristics of noisy speech do not change significantly compared with those of clean speech, resulting in excellent performance for all evaluated methods. As an exceptional case, none of the methods performs very well under babble noise, which is composed of the voices of 100 people speaking. In this case, however, the LTSD-based VAD algorithm is superior to the other algorithms, which means that the linear spectrum-based LTSD measure is successful in distinguishing noise consisting of human voices from the corresponding noisy speech.
According to the average AUC value, which measures the overall performance of each VAD algorithm across the different noise environments, the LTPD-based VAD algorithm is significantly superior to the other algorithms and remains more robust even at low SNR.
The Sohn-VAD algorithm shows moderate behavior, with a relatively high speech hit rate but a slightly low non-speech hit rate.
The Harmfreq-VAD, LTSV-VAD, and LSFM-VAD algorithms also show moderate behavior, with a relatively high non-speech hit rate but a slightly low speech hit rate.
The LTSD-VAD algorithm yields the best speech hit rate, but its non-speech hit rate is poor.
The LTPD-VAD achieves the best compromise among the six evaluated VADs. The speech hit rate of LTPD-VAD is lower than that of all the other methods in clean conditions (above 5 dB) but better than that of Harmfreq, LTSV, and LSFM in noisy conditions (−5 dB). Moreover, its non-speech hit rate is much better than that of all the other methods in all cases.
Average speech and non-speech hit rates for SNR levels ranging from 20 to −5 dB
5 Conclusion
In this paper, a new VAD algorithm is presented for improving the robustness of speech detection in various noisy environments. The algorithm is based on the estimation of the long-term pitch envelope and the measurement of the long-term pitch divergence between speech and noise. An LTPD decision threshold adapted to the measured signal-to-noise ratio is also given. The experimental results show that the proposed method outperforms the other up-to-date VAD algorithms in most non-stationary noise environments and is more robust than the other VAD algorithms even at low SNR, owing to the highest non-speech hit rate and a moderate speech hit rate.
However, the experimental results also show that the LTSV-based VAD method is superior to the LTPD-based algorithm in some noisy environments (m109, opsroom, engine, and hfchannel). This may indicate that a long-term signal variability measure based on logarithmic spectral decomposition, constructed by combining the pitch feature with the LTSV feature, could be suitable for VAD tasks. Further, compared with a strict logarithmic scale, some critical-band-based scales conform more closely to human perception of speech signals. Hence, combining such critical-band-based spectral decompositions with long-term spectral divergence or long-term signal variability is worth further exploration.
This work was supported in part by the National Natural Science Foundation of China (No. 61175017, No. 61370034, and No. 61403224).
The term pitch here differs from its use in speech signal processing. Here, pitches are the spectral bands corresponding to the equal-tempered scale as used in Western music.
The authors declare that they have no competing interests.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
1. LR Rabiner, MR Sambur, An algorithm for determining the endpoints of isolated utterances. The Bell System Technical Journal 54(2), 297–315 (1975)
2. PK Ghosh, A Tsiartas, S Narayanan, Robust voice activity detection using long-term signal variability. IEEE Transactions on Audio, Speech and Language Processing 19(3), 600–613 (2011)
3. Y Datao, H Jiqing, Z Guibin, Z Tieran, Sparse power spectrum based robust voice activity detector (IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, 2012)
4. W Hongzhi, X Yuchao, L Meijing, Study on the MFCC similarity-based voice activity detection algorithm (International Conference on Artificial Intelligence, Management Science and Electronic Commerce (AIMSEC), 2011)
5. G Martin, A Abeer, E Dan et al., All for one: feature combination for highly channel-degraded speech activity detection (INTERSPEECH, Lyon, 2013), pp. 709–713
6. T Kristjansson, S Deligne, P Olsen, Voicing features for robust speech detection (INTERSPEECH, 2005), pp. 369–372
7. S Ahmadi, AS Spanias, Cepstrum-based pitch detection using a new statistical V/UV classification algorithm. IEEE Transactions on Speech and Audio Processing 7, 333–338 (1999)
8. BF Wu, KC Wang, Robust endpoint detection algorithm based on the adaptive band partitioning spectral entropy in adverse environments. IEEE Transactions on Speech and Audio Processing 13, 762–775 (2005)
9. Z Tuske, P Mihajlik, Z Tobler, T Fegyo, Robust voice activity detection based on the entropy of noise-suppressed spectrum (INTERSPEECH, 2005)
10. LN Tan, BJ Borgstrom, A Alwan, Voice activity detection using harmonic frequency components in likelihood ratio test (IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2010)
11. J Ramirez, JC Segura, C Benitez, A de la Torre, A Rubio, Efficient voice activity detection algorithms using long-term speech information. Speech Communication 42(3–4), 271–287 (2004)
12. K Manohar, P Rao, Speech enhancement in nonstationary noise environments using noise properties. Speech Communication 48(1), 96–109 (2006)
13. M Muller, Information retrieval for music and motion (Springer Verlag, 2007)
14. M Meinard, E Sebastian, Chroma Toolbox: MATLAB implementations for extracting variants of chroma-based audio features, in Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR) (2011)
15. MA Bartsch, GH Wakefield, Audio thumbnailing of popular music using chroma-based representations. IEEE Transactions on Multimedia 7(1), 96–104 (2005)
16. EH Berger, LH Royster, DP Driscoll, JD Royster, M Layne, The Noise Manual, 5th edn. (American Industrial Hygiene Association, 2003)
17. J Rodman, The effect of bandwidth on speech intelligibility, White paper (POLYCOM Inc., USA, 2003)
18. T Gerkmann, RC Hendriks, Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Transactions on Audio, Speech and Language Processing 20(4), 1383–1393 (2012)
19. JS Garofolo, LF Lamel, WM Fisher et al., DARPA TIMIT acoustic phonetic continuous speech corpus CDROM (NIST, 1993)
20. A Varga, HJM Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication 12(3), 247–251 (1993)
21. J Sohn, NS Kim, W Sung, A statistical model-based voice activity detection. IEEE Signal Processing Letters 6(1), 1–3 (1999)
22. M Yanna, A Nishihara, Efficient voice activity detection algorithm using long-term spectral flatness measure. EURASIP Journal on Audio, Speech and Music Processing, 21 (2013)