Voice activity detection (VAD) based on deep neural networks (DNN) has demonstrated good performance in adverse acoustic environments. Current DNN-based VAD optimizes a surrogate function, e.g., minimum cross-entropy or minimum squared error, at a given decision threshold. However, VAD usually works on-the-fly with a dynamic decision threshold, and the receiver operating characteristic (ROC) curve is a global evaluation metric for VAD at all possible decision thresholds. In this paper, we propose to maximize the area under the ROC curve (MaxAUC) by DNN, which can maximize the performance of VAD in terms of the entire ROC curve. However, the objective of the AUC maximization is nondifferentiable. To overcome this difficulty, we relax the nondifferentiable loss function to two differentiable approximation functions: sigmoid loss and hinge loss. To study the effectiveness of the proposed MaxAUC-DNN VAD, we take either a standard feedforward neural network or a bidirectional long short-term memory network as the DNN model, with either the state-of-the-art multi-resolution cochleagram or short-term Fourier transform as the acoustic feature. We conducted noise-independent training for all comparison methods. Experimental results show that taking AUC as the optimization objective results in higher performance than the common objectives of the minimum squared error and minimum cross-entropy. The experimental conclusion is consistent across different DNN structures, acoustic features, noise scenarios, training sets, and languages.

Introduction

Voice activity detection (VAD) aims to detect target voices from background noises. It has demonstrated its effectiveness in many speech processing tasks, such as speech communications, speech recognition, speaker recognition, keyword spotting, and acoustic event detection. A major challenge of VAD is how to deal with low signal-to-noise ratio (SNR) environments. To address this issue, many methods in the early research stage of VAD focused on extracting the statistics of acoustic features. Typical features include energy in the time domain, zero-crossing rate, pitch detection [1], cepstral coefficients [2], and higher-order statistics [3]. Later on, the focus shifted to building statistical models from acoustic features. Such methods fit signals to predefined models and learn the parameters of a prior probability distribution on-the-fly. A crucial problem of statistical VAD is how to make an accurate model assumption for the real-world distribution of speech data. Existing model assumptions include Gaussian [4, 5], Laplacian [6], Gamma distributions [7], and their combinations [8]. A substantial difficulty that limits statistical VAD in adverse environments is that the model parameters are updated using limited local data, leaving a large amount of prior knowledge unexplored. Moreover, real-world data distributions may be too complicated to be modeled accurately by a predefined model assumption.

Machine learning-based VAD has received much attention recently. It regards VAD as a classification problem. It is flexible in incorporating prior knowledge, such as manually labeled data. It is also good at fusing multiple acoustic features. Existing supervised models include linear discriminant analysis [9], support vector machines (SVM) [10], multi-modal methods [11], sparse coding [12, 13], and deep neural networks (DNN) [14,15,16,17,18,19,20,21,22,23,24,25]. Particularly, DNN has demonstrated strong scalability in building multiple layers of nonlinear transforms on a large-scale training corpus, e.g. [18], which is important for making off-line supervised training methods practical for real-world applications. Hence, there has been a boom in the development of DNN-based VAD methods, which has focused mainly on two aspects: acoustic features, e.g. [14, 18, 23, 24], and deep models, e.g. [16, 17, 19, 21, 25].

Recently, some new forms of deep learning based VAD have been studied as well. For example, VAD has been jointly studied with speech enhancement. Some work uses advanced speech enhancement models, like denoising variational autoencoders [25], convolutional-recurrent-network-based speech enhancement, residual-convolutional neural-network [26] and U-Net [27], to extract denoised features for VAD. The works in [28, 29] optimize VAD and speech enhancement jointly in the framework of multitask learning. In [30], information about the speaker was exploited for VAD, which makes VAD able to extract speaker-dependent speech segments.

Although deep learning-based VAD has been extensively studied, a fundamental missing research aspect is the training target. To our knowledge, the training targets of VAD are limited to either classification-loss-based minimum cross-entropy (MCE) [14] or regression-loss-based minimum mean square error (MMSE) [18]. It is known that the decision threshold of VAD is usually determined on-the-fly, and different applications may have different minimum requirements on the missed-detection rate. Hence, the performance of VAD needs to be optimized over a wide range of decision thresholds. Moreover, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) are two standard evaluation metrics to measure the global performance of VAD. However, MCE and MMSE are both surrogate loss functions that do not optimize the ROC curve or AUC directly.

Motivated by the above issue, this paper proposes MaxAUC-DNN VAD, which optimizes the AUC directly by DNN. Specifically, the AUC optimization is originally formulated as an NP-hard integer programming problem. We first relax this nondifferentiable problem to a polynomial-time solvable convex optimization problem by two approximation functions, a sigmoid-loss function and a hinge-loss function, and then calculate the gradient of the relaxed AUC loss. Finally, we take the relaxed AUC loss as the training target of DNN, and back-propagate the gradient to the entire DNN. To benefit from both the relaxed AUC loss and other loss functions, we also propose a hybrid loss to optimize the loss functions jointly.

To demonstrate the strong generalization ability of the MaxAUC-DNN VAD systematically, we test the MaxAUC-DNN VAD with two conventional DNN models, which are a standard feedforward neural network and a bidirectional long short-term memory (BLSTM) network. We also adopt two kinds of acoustic features, which are the short-term Fourier transform (STFT) and multi-resolution cochleagram (MRCG). The above settings amount to six MaxAUC-DNN VADs. To evaluate their generalization ability to unknown test scenarios, we train them with large-scale noise-independent training, and evaluate their performance extensively in both noise-mismatching and language-mismatching test scenarios. We compared MaxAUC-DNN VAD with the other two common DNN-based VADs—MMSE-DNN VAD and MCE-DNN VAD, using the same types of the basic deep model and acoustic feature. Experimental results show that MaxAUC-DNN VAD yields significantly higher performance than the MMSE-DNN VAD and MCE-DNN VAD. The experimental conclusion is consistent across different DNN structures, acoustic features, noise scenarios, training sets, and languages.

This paper differs from our preliminary work [31] in several major aspects, which include the use of two relaxation functions in this paper (but not in [31]), several MaxAUC-DNN VADs with BLSTM in this paper (but not in [31]), noise-independent training in this paper (but not in [31]), different parameter settings, training and evaluation datasets, and experiments for evaluating the generalization ability of DNN models (but not in [31]). Particularly, the proposed MaxAUC in [31] is only a special case of the proposed MaxAUC_{hinge} in this paper. Consequently, experimental results in this paper show that the relative improvement of MaxAUC over the comparison methods in the mismatched environments is at least as good as that in the matched environments, which was not observed in [31].

The paper is organized as follows. In Section 3, we present the motivation and problem formulation of the proposed algorithm. In Section 4, we present the MaxAUC-DNN VAD algorithm. In Section 6, we present results with noise-independent training. Finally, we conclude in Section 7.

Notations

We first introduce some notations here. Regular small letters, e.g. s, t, and \(\gamma\), indicate scalars. Bold small letters, e.g. \(\mathbf {y}\) and \(\varvec{\alpha }\), indicate vectors. Bold capital letters, e.g. \(\mathbf {P}\) and \(\varvec{\Phi }\), indicate matrices. Letters in calligraphic fonts, e.g. \(\mathcal {X}\), indicate sets. \(\mathbf {0}\) (\(\mathbf {1}\)) is a vector with all entries being 0 (1). The operator \(^T\) denotes the transpose. The operator \(\circ\) denotes the element-wise product.

Motivation

Supervised learning based VAD aims to distinguish speech from nonspeech, which can be viewed as a typical binary classification problem. More precisely, because nonspeech covers many noise scenarios, VAD is essentially a problem of discriminating one class (i.e. noisy speech) from the remaining classes (i.e. various kinds of noises). Here we formulate the supervised learning-based VAD problem as follows.

We are given a training corpus \(\mathcal {X} = \{(\mathbf {x}_i,y_i)\}_{i=1}^M\), where \(\mathbf {x}_i\) is a high-dimensional acoustic feature of the i-th frame and \(y_i\) is the ground-truth label of \(\mathbf {x}_i\). If \(\mathbf {x}_i\) is labeled as a speech frame, then \(y_i=1\); otherwise, \(y_i=0\). In the model training stage, supervised VAD learns a mapping function \(f_{\alpha }(\cdot )\) given \(\mathcal {X}\), where \({\alpha }\) is the model parameter. In the test stage, VAD conducts:

where \(\eta\) is a decision threshold. \(f_\alpha (\cdot )\) can be various supervised models. In this paper, we set \(f_\alpha (\cdot )\) to a deep neural network. To restrict the output of \(f_{\alpha }(\cdot )\) to a range of [0, 1], we set the output units of \(f_{\alpha }(\cdot )\) to sigmoid functions or softmax units.
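As a minimal sketch of the test-stage decision, assume the common convention that a frame is declared speech when its score reaches the threshold, i.e., \(\hat{y}=1\) if \(f_{\alpha }(\mathbf {x})\ge \eta\) and \(\hat{y}=0\) otherwise. The function name `vad_decisions` and the scores below are illustrative and not taken from the original system.

```python
def vad_decisions(scores, eta):
    """Label each frame as speech (1) or nonspeech (0) by thresholding its score."""
    # Assumed convention: a score equal to the threshold counts as speech.
    return [1 if s >= eta else 0 for s in scores]


# Hypothetical frame scores in [0, 1], as a sigmoid output unit would produce.
scores = [0.05, 0.40, 0.62, 0.91]
print(vad_decisions(scores, eta=0.5))  # [0, 0, 1, 1]
print(vad_decisions(scores, eta=0.3))  # [0, 1, 1, 1]
```

Lowering \(\eta\) turns the second frame into a detection, which illustrates why a single fixed threshold does not characterize the detector.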

For DNN-based VAD, there are mainly two training objectives, i.e., MMSE and MCE. MMSE minimizes \(\sum \nolimits _{i=1}^{n}||{\mathbf {y}}_i - f_{\alpha }({\mathbf {x}}_{i})||^2\). MCE minimizes \(-\sum \nolimits _{i=1}^n (y_i\log (f_{\alpha }(\mathbf {x}_i))+(1-y_i)\log (1-f_{\alpha }(\mathbf {x}_i)) )\). However, neither of them was designed specifically for VAD. In real-world applications, \(\eta\) is usually determined on-the-fly. For example, it is set close to zero in relatively clean environments, and far away from zero in noisy environments. \(\eta\) also varies across applications. It is tuned for high speech detection rates in speech communications, and tuned for low false alarm rates in speaker recognition. Hence, the ROC curve and its corresponding AUC, which do not depend on a particular \(\eta\), are used as the global evaluation metrics of VAD instead of classification accuracy. Because the mean squared error and cross entropy do not have direct connections to AUC, traditional DNN-based VADs yield suboptimal performance in terms of AUC.
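The gap between these objectives and ranking quality can be illustrated with a small sketch. The scores are hypothetical, the helper names are ours, and `eps` is an assumption added only to keep the logarithm finite. The second score set has lower squared error and lower cross entropy than the first, yet it orders fewer speech-nonspeech pairs correctly.

```python
import math


def squared_error(labels, scores):
    """Sum of squared differences between labels and scores (MMSE objective)."""
    return sum((y - s) ** 2 for y, s in zip(labels, scores))


def cross_entropy(labels, scores, eps=1e-12):
    """Binary cross entropy (MCE objective); eps is an assumed clipping constant."""
    total = 0.0
    for y, s in zip(labels, scores):
        s = min(max(s, eps), 1.0 - eps)
        total -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return total


def pair_accuracy(labels, scores):
    """Fraction of (speech, nonspeech) pairs in which speech gets the higher score."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    return sum(1 for p in pos for n in neg if p > n) / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
scores_b = [0.55, 0.55, 0.45, 0.45]  # poorly calibrated but perfectly ranked
scores_c = [1.00, 0.40, 0.50, 0.00]  # better calibrated on average, one wrong pair

for name, scores in (("b", scores_b), ("c", scores_c)):
    print(name, squared_error(labels, scores),
          cross_entropy(labels, scores), pair_accuracy(labels, scores))
```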

MaxAUC-DNN-based VAD

In this section, we first present how to calculate AUC in Section 4.1, then present the optimization objective—MaxAUC in Section 4.2, and finally present the optimization algorithm of the MaxAUC-DNN VAD in Section 4.3.

AUC calculation

The ROC curve of \(f_{\alpha }(\cdot )\) is defined as a curve of the speech detection rate \(P_{\mathrm {D}}\) against false alarm rate \(P_{\mathrm {FA}}\) at all possible decision thresholds \(\eta\):

where \(p_{\mathrm {FA}}(\eta )\) is the probability density function of the random variable \(f_{\alpha }(\mathbf {x}^-)\) at point \(\eta\).
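On a finite set of scored frames, the ROC curve can be traced empirically by sweeping \(\eta\) over the observed scores. The following is a minimal sketch under that assumption; `roc_points` and `trapezoid_area` are illustrative names, the scores are hypothetical, and both classes are assumed to be present.

```python
def roc_points(labels, scores):
    """Return (P_FA, P_D) pairs obtained by sweeping the threshold over all scores."""
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is detected
    for eta in sorted(set(scores), reverse=True):
        detected = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= eta)
        false_alarms = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= eta)
        points.append((false_alarms / num_neg, detected / num_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.75, 0.6, 0.3, 0.2]
curve = roc_points(labels, scores)
print(curve)
print(trapezoid_area(curve))
```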

When the number of the training samples M is finite, i.e., \(M<+\infty\), the AUC of \(f_{\alpha }(\mathbf {x})\) on the training data is calculated as follows. We denote the subsets of \(\mathcal {X}\) containing the speech and nonspeech frames as \(\mathcal {X}^+=\{(\mathbf {x}^+_i,y^+_i)\}_{i=1}^P\) and \(\mathcal {X}^-=\{(\mathbf {x}^-_j,y^-_j)\}_{j=1}^N\), respectively, with \(M = P+N\). The AUC on the finite training set equals the normalized Wilcoxon-Mann-Whitney statistic [32] of \(f_{\alpha }(\mathbf {x})\) in the following form:

Note that if we merge \(\{f_{\alpha }(\mathbf {x}^+_i)\}_{i=1}^P\) and \(\{f_{\alpha }(\mathbf {x}^-_j)\}_{j=1}^N\), and sort the merged data in an ascending order, (6) can be calculated efficiently by:

where \(r_i \in \{1,2,\ldots ,M\}\), \(i=1,\ldots ,P\), is the rank of the score \(f_{\alpha }(\mathbf {x}_i^+)\) in the merged data. We maximize (6) for MaxAUC-DNN VAD, and use (8) as the calculation method of AUC in the evaluation stage (Fig. 1).
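The two ways of computing AUC can be compared in a short sketch. It assumes the standard Wilcoxon-Mann-Whitney forms: the pairwise version counts a pair as correct when \(f_{\alpha }(\mathbf {x}^+_i) > f_{\alpha }(\mathbf {x}^-_j)\) and divides by \(PN\), and the rank version evaluates \((\sum _i r_i - P(P+1)/2)/(PN)\). It further assumes that no speech score ties with a nonspeech score; function names and scores are illustrative.

```python
def auc_pairwise(pos_scores, neg_scores):
    """AUC as the fraction of correctly ordered speech-nonspeech pairs, O(P*N)."""
    correct = sum(1 for p in pos_scores for n in neg_scores if p > n)
    return correct / (len(pos_scores) * len(neg_scores))


def auc_from_ranks(pos_scores, neg_scores):
    """AUC from the ranks of the speech scores in the merged, ascending list."""
    merged = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores])
    # Ranks start at 1; only the ranks of speech frames are summed.
    rank_sum = sum(rank for rank, (_, is_pos) in enumerate(merged, start=1) if is_pos)
    num_pos, num_neg = len(pos_scores), len(neg_scores)
    return (rank_sum - num_pos * (num_pos + 1) / 2) / (num_pos * num_neg)


pos = [0.9, 0.8, 0.6]   # hypothetical scores of speech frames
neg = [0.75, 0.3, 0.2]  # hypothetical scores of nonspeech frames
print(auc_pairwise(pos, neg), auc_from_ranks(pos, neg))
```

The rank version needs only one sort, so its cost is dominated by \(O(M\log M)\) instead of the \(O(PN)\) pair enumeration.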

Objective formulation: MaxAUC

The ideal objective of maximizing AUC is to maximize (6). However, \(g(f_{\alpha }(\mathbf {x}^+_i),f_{\alpha }(\mathbf {x}^-_j))\) in (6) is nondifferentiable. To overcome this problem, we have to replace it with a differentiable approximation function. This paper considers the sigmoid function:

where \(\beta >0\) is a free parameter. When \(\beta <1\), (9) is too smooth to approximate (7) well. The larger \(\beta\) is, the better (9) approximates (7). However, when \(\beta\) is too large, the gradient of (10) will encounter numerical problems.
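The effect of \(\beta\) can be seen in a small sketch. It assumes the usual form of the sigmoid relaxation, \(1/(1+e^{-\beta (f_{\alpha }(\mathbf {x}^+_i)-f_{\alpha }(\mathbf {x}^-_j))})\), averaged over all \(PN\) pairs; the function names and scores are illustrative.

```python
import math


def sigmoid_pair_score(pos_score, neg_score, beta):
    """Smooth stand-in for the 0-1 indicator of a correctly ordered pair."""
    return 1.0 / (1.0 + math.exp(-beta * (pos_score - neg_score)))


def relaxed_auc_sigmoid(pos_scores, neg_scores, beta):
    """Average of the sigmoid pair scores over all speech-nonspeech pairs."""
    total = sum(sigmoid_pair_score(p, n, beta) for p in pos_scores for n in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))


pos = [0.9, 0.8, 0.6]
neg = [0.75, 0.3, 0.2]
exact_auc = sum(1 for p in pos for n in neg if p > n) / (len(pos) * len(neg))
for beta in (1, 25, 100):
    relaxed = relaxed_auc_sigmoid(pos, neg, beta)
    print(beta, relaxed, abs(relaxed - exact_auc))
```

With scores confined to [0, 1], the exponent stays within \(\pm \beta\), so this sketch does not overflow for the values of \(\beta\) shown. The derivative of each pair term with respect to the score difference is \(\beta \sigma (1-\sigma )\), which becomes very small for most pairs once \(\beta\) is large.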

Another approximation function is the p-order hinge-loss function:

where \(0<\gamma \le 1\) is a predefined discriminative margin indicating that, if \(f_{\alpha }(\mathbf {x}^+_i)<\gamma +f_{\alpha }(\mathbf {x}^-_j)\), the speech-nonspeech pair \((f_{\alpha }(\mathbf {x}^+_i), f_{\alpha }(\mathbf {x}^-_j))\) is regarded as a wrong pair produced by \(f_{\alpha }(\cdot )\), and \(p\ge 1\) is a predefined parameter that assigns different losses to different wrong pairs according to their distances from the margin. The optimization objective of the MaxAUC-DNN VAD with (11) is:

Note that the optimization objective in our conference version [31] is a special case of (12) with \(p=1\).
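A minimal sketch of the hinge relaxation follows. It assumes the pair loss \(\max (0, \gamma -(f_{\alpha }(\mathbf {x}^+_i)-f_{\alpha }(\mathbf {x}^-_j)))^p\), which is zero exactly when the pair clears the margin as described above, and it assumes averaging over all \(PN\) pairs. The defaults \(\gamma =0.2\) and \(p=1\) are the values reported in the experimental settings; names and scores are illustrative.

```python
def hinge_pair_loss(pos_score, neg_score, gamma, p):
    """p-order hinge loss of one speech-nonspeech pair with margin gamma."""
    return max(0.0, gamma - (pos_score - neg_score)) ** p


def hinge_auc_loss(pos_scores, neg_scores, gamma=0.2, p=1):
    """Average hinge loss over all pairs; smaller means better ranking."""
    total = sum(hinge_pair_loss(s_pos, s_neg, gamma, p)
                for s_pos in pos_scores for s_neg in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))


pos = [0.9, 0.8, 0.6]
neg = [0.75, 0.3, 0.2]
print(hinge_auc_loss(pos, neg, gamma=0.2, p=1))
print(hinge_auc_loss(pos, neg, gamma=0.2, p=2))
```

In this example, three pairs violate the margin even though only one of them is actually mis-ordered, so the margin also penalizes correctly ordered pairs that are too close.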

Optimization algorithm

In this paper, we employ the mini-batch stochastic gradient descent algorithm to solve (10) and (12). Because the gradient \(\nabla f_\alpha (\mathbf {x}_i)\) with respect to \(\mathbf {x}_i\) can be easily backpropagated throughout the network in a standard procedure, we only need to derive the gradient at the output layer.
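The output-layer gradient can be sketched for the hinge relaxation on one mini-batch. This is an illustration under assumptions, not the derivation of the paper: the loss is taken as the average of \(\max (0,\gamma -(s^+_i-s^-_j))^p\) over all pairs in the batch, where \(s\) denotes a network output, and the batch is assumed to contain both speech and nonspeech frames. The function name is ours.

```python
def hinge_auc_loss_and_grads(pos_scores, neg_scores, gamma, p):
    """Return the batch loss and its gradients w.r.t. each speech/nonspeech score."""
    num_pos, num_neg = len(pos_scores), len(neg_scores)
    loss = 0.0
    grad_pos = [0.0] * num_pos
    grad_neg = [0.0] * num_neg
    for i, s_pos in enumerate(pos_scores):
        for j, s_neg in enumerate(neg_scores):
            slack = gamma - (s_pos - s_neg)
            if slack > 0.0:  # only pairs inside the margin contribute
                loss += slack ** p
                g = p * slack ** (p - 1)
                grad_pos[i] -= g  # raising a speech score lowers the loss
                grad_neg[j] += g  # raising a nonspeech score raises the loss
    scale = 1.0 / (num_pos * num_neg)
    return (loss * scale,
            [g * scale for g in grad_pos],
            [g * scale for g in grad_neg])


pos = [0.9, 0.8, 0.6]
neg = [0.75, 0.3, 0.2]
loss, grad_pos, grad_neg = hinge_auc_loss_and_grads(pos, neg, gamma=0.2, p=2)
print(loss, grad_pos, grad_neg)
```

These per-score gradients are what a framework would pass backward through the output units; frames whose pairs all clear the margin receive zero gradient.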

Although the original purpose of the proposed method is to optimize the evaluation metric of VAD directly, MaxAUC, which takes hinge loss or sigmoid loss to relax the 0–1 loss, is actually a surrogate function of the AUC maximization. Therefore, there is no guarantee that MaxAUC will outperform other loss functions in all cases.

To combine the advantage of multiple loss functions, here we propose a hybrid loss:

where \(\ell _i\) is a base loss function that can be AUC, cross entropy, squared error, etc., C is the number of the base loss functions in the hybrid loss, and \(\{ \lambda _i\}_{i=1}^C\) are learnable parameters that balance the loss functions. Note that \(\{ \lambda _i\}_{i=1}^C\) are jointly optimized with the parameters of the deep model by backpropagation. In this paper, we jointly optimize MaxAUC_{hinge} and MCE as a special case of the hybrid loss.
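As a sketch of how the hybrid loss combines its components, assume it is the weighted sum \(\sum _{i=1}^C \lambda _i \ell _i\). The example uses the learned weights reported later in the experiments (0.7764 for MaxAUC and 0.2236 for MCE); the base-loss values are hypothetical, and how the weights are parameterized or constrained during training is not modeled here.

```python
def hybrid_loss(base_losses, lambdas):
    """Weighted sum of base losses; lambdas are treated as given in this sketch."""
    if len(base_losses) != len(lambdas):
        raise ValueError("need one weight per base loss")
    return sum(lam * loss for lam, loss in zip(lambdas, base_losses))


# Hypothetical mini-batch values of the relaxed AUC loss and the cross entropy.
auc_hinge_loss_value = 0.10
cross_entropy_value = 0.50
print(hybrid_loss([auc_hinge_loss_value, cross_entropy_value], [0.7764, 0.2236]))
```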

Experiments

In this section, we first present the datasets and experimental settings in Sections 6.1 and 6.2, respectively, then present the main results in Section 6.3, and finally discuss the effects of the hyperparameters of the MaxAUC-DNN VAD on performance in Section 6.4.

Datasets

We used the LibriSpeech ASR database,^{Footnote 1} CHiME-4 challenge,^{Footnote 2} and THCHS-30^{Footnote 3} corpora as the source of clean speech. We used a large-scale sound effect library^{Footnote 4} and the NOISEX-92 database as the source of additive noise. All audio files were sampled at 16 kHz. The LibriSpeech ASR corpus is a large-scale corpus of 1000 h of read English speech. The single-channel clean speech data of CHiME-4, named the “tr05_org” subset, are a read English speech corpus based on the original WSJ0 training data, which contains 7138 utterances. THCHS-30 is an open Chinese speech database consisting of 35 h of clean speech signals. The sound effect library contains over 20,000 sound effects. NOISEX-92 is a widely used noise database containing 9 noise scenarios, each of which is about 5 minutes long.

Construction of the training and test sets

We constructed a noisy training set by mixing the “train-clean-100” subset of LibriSpeech ASR with the sound effect library at an SNR range from \(-10\) to 20 dB, which generated over 200 h of noisy English speech. We denote the training data as “Noisy-LibriSpeech.” We also constructed a development set for hyperparameter selection by adding the babble and factory noise in NOISEX-92 to part of the clean speech of THCHS-30 at \(-5\) dB.
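Mixing at a target SNR can be sketched as follows. The sketch assumes that the SNR is defined from the average power over the whole utterance and that the noise segment has the same length as the speech; the text does not specify these details, and the signals below are synthetic placeholders.

```python
import math
import random


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db, then add."""
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * 10.0 ** (snr_db / 10.0)))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


rng = random.Random(0)
speech = [math.sin(0.05 * t) for t in range(2000)]       # placeholder "speech"
noise = [rng.uniform(-1.0, 1.0) for _ in range(2000)]    # placeholder noise
for target in (-10, 0, 20):
    _, scaled = mix_at_snr(speech, noise, target)
    achieved = 10.0 * math.log10(mean_power(speech) / mean_power(scaled))
    print(target, round(achieved, 6))
```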

We constructed two noisy test sets, one in English and the other in Chinese, by mixing the “tr05_org” subset of CHiME-4 and THCHS-30 with all 9 noise scenarios of NOISEX-92 at SNR levels of \(\{-10,-5,0,5,10,15,20\}\) dB, respectively. The three datasets do not have sample-level ground-truth labels. To address this issue, we applied Sohn VAD to the clean speech of the data sets, and used its predictions on the clean speech as the ground-truth labels, which has been shown to be reliable in [33]. We denote the English test data as “Noisy-CHiME-4” and the Chinese test data as “Noisy-THCHS-30.”

All DNN models were trained with the Noisy-LibriSpeech dataset unless otherwise stated. All evaluations were conducted in mismatching conditions, including the mismatches of noise types, SNR levels, and languages.

Note that, to save the space of the paper, we only report the results in the babble, factory, and volvo noise scenarios of NOISEX-92, leaving the results in all 9 noise scenarios listed in the Supplementary material.

Experimental settings

To verify the effectiveness of the proposed algorithm with different acoustic features and different models, we took the STFT and MRCG features respectively as its input. We set the frame length to 30 and 20 ms for the STFT and MRCG features, respectively, and set the frame shift to 10 ms for both features. The hyperparameter \(\beta\) of the MaxAUC_{sigm}-DNN VAD was set to 25 when MRCG was used, and set to 45 when STFT was adopted. The hyperparameters \(\gamma\) and p of the MaxAUC_{hinge} were set to 0.2 and 1, respectively. We took the feedforward neural network and BLSTM as two basic deep models. Their hyperparameter settings are as follows.

The feedforward neural network contains two hidden layers. The number of the hidden units per hidden layer was set to 256. The activation functions of the hidden units and output units were set to the rectified linear units and sigmoid functions, respectively. The dropout rate was set to 0.2. We used stochastic gradient descent as the optimizer with an initial learning rate of 0.01 and a decay coefficient of 0.05. The number of training epochs was set to 30. The momentum of the first 3 epochs was set to 0.5, and the momentum of other epochs was set to 0.9. The batch size was set to 4096. A contextual window was used to expand each input frame to its context along the time axis. The window size was set to 3.
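Context expansion can be sketched as stacking neighboring frames into one input vector. The sketch assumes a symmetric window of `half_width` frames on each side and edge padding by repeating the boundary frame; whether the reported window size of 3 counts frames per side or in total, and how utterance edges were handled, is not stated, so both choices here are assumptions.

```python
def expand_context(frames, half_width):
    """Concatenate each frame with its neighbors t-half_width .. t+half_width."""
    num_frames = len(frames)
    expanded = []
    for t in range(num_frames):
        stacked = []
        for offset in range(-half_width, half_width + 1):
            # Clamp the index so that edges reuse the first or last frame.
            idx = min(max(t + offset, 0), num_frames - 1)
            stacked.extend(frames[idx])
        expanded.append(stacked)
    return expanded


# Three hypothetical frames with two feature dimensions each.
frames = [[1, 10], [2, 20], [3, 30]]
for row in expand_context(frames, half_width=1):
    print(row)
```

With feature dimension \(d\), each expanded vector has \((2\,\text{half\_width}+1)\,d\) entries, which is the input size the first network layer would need.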

The BLSTM model comprises a fully connected layer with the rectified linear units as the activation functions, followed by a BLSTM layer with the tangent functions as the activation functions, and an output layer with the sigmoid functions. The numbers of the hidden units for the fully connected layer and BLSTM layer were set to 512 and 256, respectively. The dropout rate was set to 0.2. The batch size was set to 4096. The network was randomly initialized, and optimized by stochastic gradient descent with the Adam optimizer. The number of training epochs was set to 30. The learning rate was initialized to 1 and decreased with a decay coefficient of 0.05. The BLSTM model adopted the same contextual window as the DNN model for expanding the input. Note that for specific features and networks, the computational resources required by the proposed AUC loss-based method and the baselines are the same. We list the computational resources required by these approaches in Table 1.

We compared the MaxAUC-DNN VAD with the MMSE-DNN VAD and MCE-DNN VAD. To do the comparison fairly, we compared the loss functions, i.e., MMSE and MCE, with the two variants of MaxAUC only, leaving the other parts of the comparison methods the same. All experiments were conducted in a non-reverberant environment. We adopted the ROC curve and AUC as the evaluation metrics.

Main results

We first compared the VAD methods with the feedforward neural network and STFT feature on the Noisy-CHiME-4 dataset. From the comparison results in Table 2, we see that, when the SNR levels are below 10 dB, the MaxAUC_{hinge}-DNN VAD outperforms the MCE-DNN VAD and the MMSE-DNN VAD by relatively \(2.21\%\) and \(6.90\%\), respectively. Meanwhile, the MaxAUC_{sigm}-DNN VAD outperforms the two competitive VADs by relatively \(1.36\%\) and \(6.07\%\), respectively. The MaxAUC-DNN VADs perform similarly to the MCE-DNN VAD in the other scenarios, and both outperform the MMSE-DNN VAD significantly.
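For readers who want to reproduce such percentages from AUC tables, a relative improvement can be computed as below. The definition used here, the AUC gain divided by the baseline AUC, is an assumption, since the text does not spell out the formula; the AUC values are hypothetical and are not taken from Table 2.

```python
def relative_improvement(auc_proposed, auc_baseline):
    """Relative AUC gain of the proposed method over a baseline, in percent."""
    return 100.0 * (auc_proposed - auc_baseline) / auc_baseline


# Hypothetical AUC values, for illustration only.
print(relative_improvement(0.92, 0.90))
```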

Robustness to different DNN models

To evaluate how different types of DNN models affect the performance, we replaced the feedforward neural network by BLSTM. Table 3 lists the comparison results on Noisy-CHiME-4. From the table, we see that the experimental phenomenon is consistent with that in Table 2. Moreover, the MaxAUC_{hinge}-DNN VAD outperforms the MCE-DNN VAD by relatively \(5.66\%\) when the SNR levels are greater than or equal to 10 dB, which is an interesting phenomenon unobserved in Table 2.

Robustness to mismatched test languages

To further evaluate the generalization ability of the proposed method on different languages, we compared the VAD methods that adopted the BLSTM model and STFT feature on the Chinese Noisy-THCHS-30 test corpus. Table 4 lists the comparison results. From the table, we see that the experimental phenomenon is similar to that in Table 3, though all methods suffer some performance degradation due to the mismatch between the training and test languages.

Robustness to different acoustic features

To study how different acoustic features affect the performance, we replaced STFT with MRCG as the acoustic feature for all comparison methods, and conducted the experiment on the Chinese Noisy-THCHS-30 test corpus. Table 5 lists the comparison results. More results with the STFT feature can be found in the Supplementary materials. From the table, we see that the proposed methods outperform the competitive VADs in most cases at low SNR levels. Comparing Table 5 with Table 4, we also see that the performance of the comparison methods with MRCG is better than that with STFT.

Robustness to different training sets

To further investigate how different training sets affect the effectiveness of the proposed methods, we conducted a comparison on the Noisy-CHiME-4 test dataset, with the Chinese THCHS-30 dataset as the clean speech source of the training set. Note that the generation process of the Noisy-THCHS-30 training set, which was mixed from the training subset of THCHS-30 and the large-scale sound effect library at an SNR range of \(-10\) to 20 dB, was similar to the generation process of the Noisy-LibriSpeech training set. Table 6 lists the comparison results, which again demonstrate the superiority of the proposed methods.

Summary and analysis of the comparison results

Figure 2 summarizes the relative improvement of the MaxAUC VADs over the competitive VADs when the STFT feature is used as the acoustic feature, where we summarize not only the results in Tables 2, 3, and 4 but also the result in the Supplementary materials. From the figure, we see that the relative improvement reaches the maximum around 0 dB. The MaxAUC_{sigm}-DNN VAD performs better than the MaxAUC_{hinge}-DNN VAD when the basic deep model is the feedforward neural network. Although the curves in Figs. 2c and d tend to decrease along with the increase of the SNR level, the relative improvement of the MaxAUC_{hinge}-DNN VAD drops slower than that of the MaxAUC_{sigm}-DNN VAD. Importantly, we find that the mismatch of the test languages does not affect the relative improvement of the MaxAUC-DNN VADs over their comparison methods.

Figure 3 shows the relative AUC improvement when the MRCG feature is used as the acoustic feature. From the figure, we see that the MaxAUC-DNN VADs outperform the competitive VAD methods in most test scenarios, except the scenario in Fig. 3a at 20 dB. Interestingly, although the relative improvements of the two MaxAUC-DNN VADs over the competitive methods are similar at the low SNR levels, the relative improvement of the MaxAUC_{hinge} VAD over the comparison methods tends to grow as the SNR increases in Fig. 3b, c, and d, while the relative improvement of the MaxAUC_{sigm} VAD over the comparison methods is, on the contrary, reduced. Moreover, we find that the relative improvement of the MaxAUC-DNN VADs over their comparison methods on the mismatched test language is higher than that on the matched test language.

Comparing Figs. 2 and 3, we conclude that the MaxAUC_{hinge}-DNN VAD has a slightly stronger generalization ability than the MaxAUC_{sigm}-DNN VAD in most cases. The better the basic deep model and acoustic feature are, the larger the advantage of the MaxAUC_{hinge}-DNN VAD becomes.

At last, we exemplify the ROC curves of the comparison methods with the BLSTM model and MRCG feature on the Chinese Noisy-THCHS-30 at \(-5\) dB in Fig. 4. From the figure, it is clear that both of the proposed VADs outperform the competitive VADs, and the MaxAUC_{hinge}-DNN VAD performs the best in most cases except the machinegun scenario. The above phenomena are observed in most other evaluations too.

Effects of hyperparameters on performance

In this subsection, we evaluated the hyperparameters of the MaxAUC-DNN VAD with the BLSTM model and MRCG feature on the babble and factory noise scenarios of the Chinese development dataset at \(-5\) dB, and applied the optimal hyperparameters to all other test scenarios in this paper. The hyperparameter \(\beta\) of the MaxAUC_{sigm}-DNN VAD was selected from a range of [2, 50]. Figure 5 shows the experimental result. From the figure, we observe that \(\beta\) behaves robustly in a wide range of [20, 50]. The hyperparameters \(\gamma\) and p of the MaxAUC_{hinge}-DNN VAD were selected from [0.1:0.1:0.9] and [1:1:9], respectively, where the symbol [a:b:c] denotes a sequence of numbers starting from a and ending at c with a step size of b. We searched \((\gamma ,p)\) jointly in a mesh grid. Figure 6 shows the experimental result. From the figure, it seems that the two hyperparameters have a strong correlation. If one of the hyperparameters is enlarged and the other one is enlarged accordingly, then the performance remains stable across the two evaluation scenarios. The best performance appears around \(\gamma = 0.2\) and \(p = 1\).
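The joint search over \((\gamma ,p)\) can be sketched as an exhaustive loop over the mesh grid. In the sketch, `grid` expands the [a:b:c] notation and `grid_search` returns the best setting; `placeholder_dev_auc` is a made-up stand-in for training a model and measuring its development-set AUC, constructed only so that the example runs, and it says nothing about the real response surface.

```python
def grid(start, step, stop):
    """Expand [start:step:stop] into a list, including both end points."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + k * step, 10) for k in range(count)]


def grid_search(evaluate, gammas, orders):
    """Return (score, gamma, p) with the highest score over the mesh grid."""
    return max((evaluate(g, p), g, p) for g in gammas for p in orders)


def placeholder_dev_auc(gamma, p):
    # Hypothetical smooth score; a real run would train and evaluate a model here.
    return -((gamma - 0.2) ** 2) - 0.01 * (p - 1) ** 2


gammas = grid(0.1, 0.1, 0.9)
orders = grid(1, 1, 9)
best_score, best_gamma, best_p = grid_search(placeholder_dev_auc, gammas, orders)
print(len(gammas) * len(orders), best_gamma, best_p)
```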

Effects of the hybrid loss on performance

In this subsection, we evaluated the hybrid loss of MaxAUC and MCE with the BLSTM model and STFT feature in the challenging babble and factory noise scenarios of the Noisy-CHiME-4 test dataset, where the Chinese Noisy-THCHS-30 dataset was used as the training set. The weights \(\lambda\) of MaxAUC and MCE, which were obtained automatically by optimizing (16) on the Chinese Noisy-THCHS-30 training data, are 0.7764 and 0.2236, respectively. This suggests that MaxAUC is a more effective training loss than MCE in reaching a good local minimum of the BLSTM model. Table 7 lists the comparison results. From the table, we see that the hybrid-loss-based VAD outperforms the MaxAUC-DNN VAD and MCE-DNN VAD in the babble and volvo noise scenarios. However, it does not outperform the MaxAUC-DNN VAD in the factory noise scenario, which needs further investigation.

Conclusions

In this paper, we have proposed the MaxAUC-DNN VAD for improving the performance of the DNN-based VAD at any decision threshold. Specifically, we first relax the AUC calculation, which is an integer optimization problem, to a polynomial-time solvable problem by a differentiable function, then compute the gradient of the relaxed AUC loss with respect to the parameters of the output layer of DNN, and finally back-propagate the gradient to its hidden layers. We proposed two approximation functions: a sigmoid loss approximation and a hinge loss approximation. To integrate the advantage of the proposed loss with existing VAD loss functions, we proposed a hybrid loss framework that jointly optimizes the loss functions. We evaluated the effectiveness of the MaxAUC-DNN VAD in a wide range of test scenarios covering different DNN models, acoustic features, training sets, and noise-mismatched and language-mismatched conditions. Empirical results show that the MaxAUC-DNN VAD outperforms the MMSE-DNN VAD and MCE-DNN VAD in most test scenarios, and the relative improvement over the comparison methods tends to be enlarged when the training and test conditions are mismatched; it is also insensitive to the hyperparameter selection. Finally, the hybrid loss has shown its potential in outperforming its components.

J.-C. Junqua, H. Wakita, in Acoustics, Speech, and Signal Processing, 1989. ICASSP-89., 1989 International Conference On. A comparative study of cepstral lifters and distance measures for all pole models of speech in noise (IEEE, 1989), pp. 476–479

E. Nemer, R. Goubran, S. Mahmoud, Robust voice activity detection using higher-order statistics in the LPC residual domain. IEEE Trans. Speech Audio Process. 9(3), 217–231 (2001)

J. Ramírez, J.C. Segura, C. Benítez, L. García, A. Rubio, Statistical voice activity detection using a multiple observation likelihood ratio test. IEEE Signal Process. Lett. 12(10), 689–692 (2005)

J.W. Shin, J.-H. Chang, N.S. Kim, Statistical modeling of speech signals based on generalized gamma distribution. IEEE Signal Process. Lett. 12(3), 258–261 (2005)

J. Padrell, D. Macho, C. Nadeu, in Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP’05). IEEE International Conference On. Robust speech activity detection using LDA applied to FF parameters, vol. 1 (IEEE, 2005), p. 557

J. Wu, X.L. Zhang, Efficient multiple kernel support vector machine based voice activity detection. IEEE Signal Process. Lett. 18(8), 466–499 (2011)

D. Dov, R. Talmon, I. Cohen, Multimodal kernel method for activity detection of sound sources. IEEE/ACM Trans. Audio Speech Lang. Process. 25(6), 1322–1334 (2017)

X.-L. Zhang, J. Wu, in the 38th IEEE International Conference on Acoustic, Speech, and Signal Processing. Denoising deep neural networks based voice activity detection (2013), pp. 853–857

T. Hughes, K. Mierle, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Recurrent neural networks for voice activity detection (2013), pp. 7378–7382

F. Eyben, F. Weninger, S. Squartini, B. Schuller, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Real-life voice activity detection with LSTM recurrent neural networks and an application to Hollywood movies (IEEE, 2013), pp. 483–487

X.-L. Zhang, D. Wang, Boosting contextual information for deep neural network based voice activity detection. IEEE/ACM Trans. Audio Speech Lang. Process. 24(2), 252–264 (2016)

I. Hwang, H.-M. Park, J.-H. Chang, Ensemble of deep neural networks using acoustic environment classification for statistical model-based voice activity detection. Comput. Speech Lang. 38, 1–12 (2016)

Q. Wang, J. Du, X. Bao, Z.-R. Wang, L.-R. Dai, C.-H. Lee, in Sixteenth Annual Conference of the International Speech Communication Association. A universal VAD based on jointly trained deep neural networks (2015)

L. Wang, K. Phapatanaburi, Z. Go, S. Nakagawa, M. Iwahashi, J. Dang, in Proceedings of ICME. Limiting numerical precision of neural networks to achieve real-time voice activity detection (2018), pp. 1087–1092

Y. Tachioka, in Proceedings of ICASSP. Limiting numerical precision of neural networks to achieve real-time voice activity detection (2018), pp. 2236–2240

Y. Tachioka, in Proceedings of ICASSP. DNN-based voice activity detection using auxiliary speech models in noisy environments (2018), pp. 5529–5533

W.A. Jassim, N. Harte, in Proceedings of ICASSP. Voice activity detection using neurograms (2018), pp. 5524–5528

Y. Jung, Y. Kim, Y. Choi, H. Kim, in Interspeech. Joint learning using denoising variational autoencoders for voice activity detection (2018), pp. 1210–1214

T. Xu, H. Zhang, X. Zhang, in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Joint training ResCNN-based voice activity detection with speech enhancement (IEEE, 2019), pp. 1157–1162

G.W. Lee, H.K. Kim, Multi-task learning U-Net for single-channel speech enhancement and mask-based voice activity detection. Appl. Sci. 10(9), 3230 (2020)

Y. Zhuang, S. Tong, M. Yin, Y. Qian, K. Yu, in 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP). Multi-task joint-learning for robust voice activity detection (IEEE, 2016), pp. 1–5

X. Tan, X.-L. Zhang, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Speech enhancement aided end-to-end multi-task learning for voice activity detection (IEEE, 2021), pp. 6823–6827

Y. Chen, S. Wang, Y. Qian, K. Yu, End-to-end speaker-dependent voice activity detection. arXiv preprint arXiv:2009.09906 (2020)

Z.-C. Fan, Z. Bai, X.-L. Zhang, S. Rahardja, J. Chen, in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). AUC optimization for deep learning based voice activity detection (IEEE, 2019), pp. 6760–6764

H.B. Mann, D.R. Whitney, On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18(1), 50–60 (1947)

X.-L. Zhang, D. Wang, Boosting contextual information for deep neural network based voice activity detection. IEEE/ACM Trans. Audio Speech Lang. Process. 24(2), 252–264 (2016)

The authors would like to thank the editors and anonymous reviewers for their voluntary efforts on this paper, which greatly improved its quality.

Funding

This work was supported in part by the National Science Foundation of China under Grant No. 62176211, and in part by the Project of the Science, Technology, and Innovation Commission of Shenzhen Municipality under Grant Nos. JSGG20210802152546026 and JCYJ20210324143006016.

Author information

Authors and Affiliations

Research & Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen, China

Xiao-Lei Zhang & Menglong Xu

School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Zhang, XL., Xu, M. AUC optimization for deep learning-based voice activity detection. J AUDIO SPEECH MUSIC PROC. 2022, 27 (2022). https://doi.org/10.1186/s13636-022-00260-9