AUC optimization for deep learning-based voice activity detection

Voice activity detection (VAD) based on deep neural networks (DNNs) has demonstrated good performance in adverse acoustic environments. Current DNN-based VAD optimizes a surrogate function, e.g., minimum cross-entropy or minimum squared error, at a given decision threshold. However, VAD usually works on-the-fly with a dynamic decision threshold, and the receiver operating characteristic (ROC) curve is a global evaluation metric for VAD at all possible decision thresholds. In this paper, we propose to maximize the area under the ROC curve (MaxAUC) with a DNN, which maximizes the performance of VAD in terms of the entire ROC curve. However, the AUC maximization objective is nondifferentiable. To overcome this difficulty, we relax the nondifferentiable loss function to two differentiable approximation functions: a sigmoid loss and a hinge loss. To study the effectiveness of the proposed MaxAUC-DNN VAD, we take either a standard feedforward neural network or a bidirectional long short-term memory network as the DNN model, with either the state-of-the-art multi-resolution cochleagram or the short-term Fourier transform as the acoustic feature. We applied noise-independent training to all comparison methods. Experimental results show that taking AUC as the optimization objective yields higher performance than the common objectives of minimum squared error and minimum cross-entropy. This conclusion is consistent across different DNN structures, acoustic features, noise scenarios, training sets, and languages.


Introduction
Voice activity detection (VAD) aims to detect target voices from background noises. It has demonstrated its effectiveness in many speech processing tasks, such as speech communications, speech recognition, speaker recognition, keyword spotting, and acoustic event detection. A major challenge of VAD is how to deal with low signal-to-noise ratio (SNR) environments. To address this issue, many methods in the early research stage of VAD focused on extracting the statistics of acoustic features. Typical features include energy in the time domain, zero-crossing rate, pitch detection [1], cepstral coefficients [2], and higher-order statistics [3]. Later on, the focus shifted to building statistical models from acoustic features. Statistical VAD fits signals to predefined models and learns the parameters of a prior probability distribution on-the-fly. A crucial problem of statistical VAD is how to make an accurate model assumption for the real-world distribution of speech data. Existing model assumptions include Gaussian [4,5], Laplacian [6], and Gamma distributions [7], and their combinations [8]. A substantial difficulty that hinders statistical VAD in adverse environments is that the model parameters are updated using limited local data, leaving a large amount of prior knowledge unexplored. Moreover, real-world data distributions may be too complicated to be modeled accurately by a predefined model assumption.
Recently, some new forms of deep learning based VAD have been studied as well. For example, VAD has been jointly studied with speech enhancement. Some work uses advanced speech enhancement models, such as denoising variational autoencoders [25], convolutional-recurrent-network-based speech enhancement, residual convolutional neural networks [26], and U-Net [27], to extract denoised features for VAD. In [28,29], the works optimize VAD and speech enhancement jointly in the framework of multitask learning. In [30], information about the speaker was exploited for VAD, which makes VAD able to extract speaker-dependent speech segments.
Although deep learning-based VAD has been extensively studied, a fundamental missing research aspect is the training target. To our knowledge, the training targets of VAD are limited to either the classification-loss-based minimum cross-entropy (MCE) [14] or the regression-loss-based minimum mean squared error (MMSE) [18]. It is known that the decision threshold of VAD is usually determined on-the-fly, and different applications may have different minimum requirements on the miss detection rate. Hence, it is necessary to optimize the performance of VAD over a wide range of decision thresholds. Moreover, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) are two standard evaluation metrics that measure the global performance of VAD. However, MCE and MMSE are both surrogate loss functions that do not optimize the ROC curve or AUC directly.
Motivated by the above issue, this paper proposes the MaxAUC-DNN VAD, which optimizes the AUC directly by a DNN. Specifically, the AUC optimization is originally formulated as an NP-hard integer programming problem. We first relax this nondifferentiable problem to a polynomial-time solvable convex optimization problem by two approximation functions, a sigmoid-loss function and a hinge-loss function, and then calculate the gradient of the relaxed AUC loss. Finally, we take the relaxed AUC loss as the training target of the DNN and back-propagate the gradient through the entire DNN. To benefit from both the relaxed AUC loss and other loss functions, we also propose a hybrid loss to optimize the loss functions jointly.
To demonstrate the strong generalization ability of the MaxAUC-DNN VAD systematically, we test it with two conventional DNN models, a standard feedforward neural network and a bidirectional long short-term memory (BLSTM) network. We also adopt two kinds of acoustic features, the short-term Fourier transform (STFT) and the multi-resolution cochleagram (MRCG). The above settings amount to six MaxAUC-DNN VADs. To evaluate their generalization ability to unknown test scenarios, we train them with large-scale noise-independent training, and evaluate their performance extensively in both noise-mismatching and language-mismatching test scenarios. We compared the MaxAUC-DNN VAD with the two other common DNN-based VADs, the MMSE-DNN VAD and the MCE-DNN VAD, using the same types of basic deep model and acoustic feature. Experimental results show that the MaxAUC-DNN VAD yields significantly higher performance than the MMSE-DNN VAD and MCE-DNN VAD. The experimental conclusion is consistent across different DNN structures, acoustic features, noise scenarios, training sets, and languages. This paper differs from our preliminary work [31] in several major aspects, which include the use of two relaxation functions in this paper (but not in [31]), several MaxAUC-DNN VADs with BLSTM in this paper (but not in [31]), noise-independent training in this paper (but not in [31]), different parameter settings, training and evaluation datasets, and experiments for evaluating the generalization ability of DNN models (but not in [31]). In particular, the MaxAUC proposed in [31] is only a special case of the MaxAUC_hinge proposed in this paper. Consequently, experimental results in this paper show that the relative improvement of MaxAUC over the comparison methods in the mismatched environments is at least as good as that in the matched environments, which was not observed in [31].
The paper is organized as follows. In Section 3, we present the motivation and problem formulation of the proposed algorithm. In Section 4, we present the MaxAUC-DNN VAD algorithm. In Section 6, we present results with noise-independent training. Finally, we conclude in Section 7.

Notations
We first introduce some notations here. Regular small letters, e.g., s, t, and γ, indicate scalars. Bold small letters, e.g., y and α, indicate vectors. Bold capital letters, e.g., P, indicate matrices. Letters in calligraphic fonts, e.g., X, indicate sets.

Motivation
Supervised learning based VAD aims to detect speech from nonspeech, which can be viewed as a typical binary classification problem. More precisely, because nonspeech contains many noise scenarios, VAD is essentially a problem of discriminating one class (i.e., noisy speech) from the rest of the classes (i.e., various kinds of noises). Here we formulate the supervised learning-based VAD problem as follows.
Given a training corpus X = {(x_i, y_i)}_{i=1}^M, where x_i is a high-dimensional acoustic feature of the i-th frame and y_i is the ground-truth label of x_i. If x_i is labeled as a speech frame, then y_i = 1; otherwise, y_i = 0. In the model training stage, supervised VAD learns a mapping function f_α(·) given X, where α is the model parameter. In the test stage, VAD conducts:

ŷ_i = 1, if f_α(x_i) > η; otherwise, ŷ_i = 0,

where η is a decision threshold. f_α(·) can be various supervised models. In this paper, we set f_α(·) to a deep neural network. To restrict the output of f_α(·) to the range [0, 1], we set the output units of f_α(·) to sigmoid functions or softmax units.
For DNN-based VAD, there are mainly two training objectives, i.e., MMSE and MCE. MMSE minimizes (1/M) Σ_{i=1}^M (y_i − f_α(x_i))^2, while MCE minimizes −(1/M) Σ_{i=1}^M [y_i log f_α(x_i) + (1 − y_i) log(1 − f_α(x_i))]. However, neither of them was carefully designed for VAD. In real-world applications, η is usually determined on-the-fly. For example, it is set close to zero in relatively clean environments, and far away from zero in noisy environments. η also varies across applications: it is tuned for high speech detection rates in speech communications, and for low false alarm rates in speaker recognition. Hence, the ROC curve and its corresponding AUC, which are independent of η, are used as the global evaluation metrics of VAD instead of classification accuracy. Because the mean squared error and the cross entropy have no direct connection to AUC, traditional DNN-based VADs yield suboptimal performance in terms of AUC.
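To see why a surrogate such as MCE need not improve AUC, consider the following NumPy sketch (our own illustration, not from the paper): two score vectors with the same ranking have identical AUC but very different cross-entropy, so lowering the cross-entropy does not necessarily change the ROC curve.

```python
import numpy as np

def cross_entropy(y, s):
    # mean binary cross-entropy between labels y and scores s in (0, 1)
    s = np.clip(s, 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(s) + (1 - y) * np.log(1 - s))

def auc(y, s):
    # pairwise (Wilcoxon-Mann-Whitney) estimate of the AUC
    pos, neg = s[y == 1], s[y == 0]
    return np.mean(pos[:, None] > neg[None, :])

y = np.array([1, 1, 0, 0])
s1 = np.array([0.9, 0.8, 0.2, 0.1])    # confident scores
s2 = np.array([0.6, 0.55, 0.45, 0.4])  # same ranking, less confident scores

# both rankings are perfect, so both AUCs equal 1.0,
# but the cross-entropy of s2 is much larger than that of s1
```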

MaxAUC-DNN-based VAD
In this section, we first present how to calculate AUC in Section 4.1, then present the optimization objective, MaxAUC, in Section 4.2, and finally present the optimization algorithm of the MaxAUC-DNN VAD in Section 4.3.

AUC calculation
The ROC curve of f_α(·) is defined as the curve of the speech detection rate P_D against the false alarm rate P_FA over all possible decision thresholds η ∈ R:

P_D(η) = P(f_α(x) > η | y = 1),  P_FA(η) = P(f_α(x) > η | y = 0),

where R denotes the set of real numbers and P(·) denotes probability. The AUC is calculated by:

AUC = ∫_0^1 P_D dP_FA.

When the number of training samples M is finite, i.e., M < +∞, the AUC of f_α(x) on the training data is calculated as follows. We denote the subsets of X containing the speech and nonspeech frames as X^+ = {x_i^+}_{i=1}^P and X^− = {x_j^−}_{j=1}^N, respectively, with M = P + N. The AUC on the finite training set equals the normalized Wilcoxon-Mann-Whitney statistic [32] of f_α(x) in the following form:

AUC = (1/(PN)) Σ_{i=1}^P Σ_{j=1}^N I(f_α(x_i^+) > f_α(x_j^−)),   (6)

where I(·) is the indicator function, which returns 1 if its argument is true and 0 otherwise. If we merge X^+ and X^− and sort the merged data in ascending order of their scores, (6) can be calculated efficiently by:

AUC = (Σ_{i=1}^P r_i − P(P + 1)/2) / (PN),   (8)

where r_i ∈ {1, 2, ..., P + N}, i = 1, ..., P, is the rank of the score f_α(x_i^+) in the merged data. We maximize (6) for the MaxAUC-DNN VAD, and use (8) to calculate AUC in the evaluation stage (Fig. 1).
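To make the equivalence of the two computations concrete, the following NumPy sketch (illustrative code with our own variable names, not from the paper) computes the AUC both as the normalized Wilcoxon-Mann-Whitney statistic of (6) and via the rank-based form of (8):

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=50)   # scores f(x+) on speech frames
neg = rng.normal(0.0, 1.0, size=80)   # scores f(x-) on nonspeech frames
P, N = len(pos), len(neg)

# Eq. (6): normalized Wilcoxon-Mann-Whitney statistic, O(P*N) pair comparisons
auc_pairs = np.mean(pos[:, None] > neg[None, :])

# Eq. (8): rank-based computation, O(M log M) after one sort (no ties assumed)
merged = np.concatenate([pos, neg])
ranks = np.empty(P + N)
ranks[np.argsort(merged)] = np.arange(1, P + N + 1)  # 1-based ascending ranks
auc_ranks = (ranks[:P].sum() - P * (P + 1) / 2) / (P * N)
```

Both expressions count the fraction of speech/nonspeech pairs that are correctly ordered, so they agree exactly when there are no tied scores.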

Objective formulation: MaxAUC
The ideal objective of maximizing AUC is to maximize (6). However, (6) is nondifferentiable. To overcome this problem, we replace it with a differentiable approximation function. This paper first considers the sigmoid function:

σ_β(z) = 1 / (1 + exp(−βz)),   (9)

as an approximation of the indicator function, which results in the following optimization objective of the MaxAUC-DNN VAD:

max_α (1/(PN)) Σ_{i=1}^P Σ_{j=1}^N σ_β(f_α(x_i^+) − f_α(x_j^−)),   (10)

where β > 0 is a free parameter. When β < 1, (9) is too smooth to approximate the indicator function well. The larger β is, the better (9) approximates the indicator function. However, when β is too large, the gradient of (10) encounters numerical problems.
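The effect of β can be illustrated with a short NumPy sketch (our own code; the score distributions are synthetic): as β grows, the sigmoid surrogate (10) approaches the exact AUC of (6).

```python
import numpy as np

def sigmoid_auc(pos, neg, beta):
    # differentiable surrogate for AUC: the indicator I(d > 0) is replaced by
    # sigma_beta(d) = 1 / (1 + exp(-beta * d)), with d = f(x+) - f(x-)
    d = pos[:, None] - neg[None, :]
    return np.mean(1.0 / (1.0 + np.exp(-beta * d)))

rng = np.random.default_rng(1)
pos = rng.normal(1.0, 1.0, 200)   # synthetic speech scores
neg = rng.normal(0.0, 1.0, 200)   # synthetic nonspeech scores
true_auc = np.mean(pos[:, None] > neg[None, :])

# approximation error shrinks as beta grows
errs = [abs(sigmoid_auc(pos, neg, b) - true_auc) for b in (1, 5, 25)]
```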
Another approximation function is the p-order hinge-loss function:

ℓ_hinge(x_i^+, x_j^−) = [max(0, γ − (f_α(x_i^+) − f_α(x_j^−)))]^p,   (11)

where γ > 0 is a margin, a pair satisfying f_α(x_i^+) − f_α(x_j^−) < γ is regarded as a wrong pair produced by f_α(·), and p ≥ 1 is a predefined parameter that enforces a different loss on different wrong pairs according to their distances from the margin. The optimization objective of the MaxAUC-DNN VAD with (11) is:

min_α (1/(PN)) Σ_{i=1}^P Σ_{j=1}^N [max(0, γ − (f_α(x_i^+) − f_α(x_j^−)))]^p.   (12)

Note that the optimization objective in our conference version [31] is a special case of (12) with p = 1.
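A minimal NumPy sketch of the p-order hinge relaxation (our own illustration; the toy scores are made up) is:

```python
import numpy as np

def hinge_auc_loss(pos, neg, gamma=0.2, p=1):
    # p-order hinge relaxation of the AUC loss: a pair (x+, x-) with
    # f(x+) - f(x-) < gamma is penalized as a "wrong pair"
    d = pos[:, None] - neg[None, :]
    return np.mean(np.maximum(0.0, gamma - d) ** p)

pos = np.array([0.9, 0.7])
neg = np.array([0.1, 0.8])
# pairs: (0.9, 0.1) safe; (0.9, 0.8) violates the margin by 0.1;
#        (0.7, 0.1) safe; (0.7, 0.8) is misordered, violating by 0.3
loss = hinge_auc_loss(pos, neg, gamma=0.2, p=1)   # (0.1 + 0.3) / 4 = 0.1
```

With p = 1 this reduces to the conference-version loss mentioned above; larger p penalizes badly misordered pairs more strongly.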

Optimization algorithm
In this paper, we employ the mini-batch stochastic gradient descent algorithm to solve (10) and (12). Because the gradient with respect to the network parameters can be obtained from the output-layer gradient by standard back-propagation, we only need to derive the gradient of the loss at the output layer.
We can easily derive the gradient of (10) as:

∇(10) = (β/(PN)) Σ_{i=1}^P Σ_{j=1}^N σ_β(Δ(i, j))(1 − σ_β(Δ(i, j))) (∇f_α(x_i^+) − ∇f_α(x_j^−)),

and the gradient of (12) as:

∇(12) = −(p/(PN)) Σ_{i=1}^P Σ_{j=1}^N [max(0, γ − Δ(i, j))]^{p−1} (∇f_α(x_i^+) − ∇f_α(x_j^−)),

where ∇f_α(x) denotes the gradient of f_α(x) at the output layer, and Δ(i, j) is defined as:

Δ(i, j) = f_α(x_i^+) − f_α(x_j^−).
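As a sketch of the derivation, the following NumPy code (our own illustration; function and variable names are not from the paper) computes the sigmoid-relaxed AUC objective and its output-layer gradients, and verifies one gradient entry by finite differences:

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def surrogate_and_grads(pos, neg, beta):
    # L = (1/(P*N)) * sum_ij sigma(beta * (pos_i - neg_j));
    # returns L and its gradients w.r.t. the output-layer scores
    P, N = len(pos), len(neg)
    d = pos[:, None] - neg[None, :]
    s = sigm(beta * d)
    L = s.mean()
    w = beta * s * (1.0 - s) / (P * N)       # dL/dd for each (i, j) pair
    return L, w.sum(axis=1), -w.sum(axis=0)  # dL/dpos_i, dL/dneg_j

rng = np.random.default_rng(2)
pos = rng.normal(1.0, 1.0, 5)
neg = rng.normal(0.0, 1.0, 7)
beta = 5.0
L, gpos, gneg = surrogate_and_grads(pos, neg, beta)

# finite-difference check of dL/dpos_0
eps = 1e-6
pos_eps = pos.copy()
pos_eps[0] += eps
L_eps, _, _ = surrogate_and_grads(pos_eps, neg, beta)
```

In training, these per-score gradients would then be back-propagated through the network in the standard way.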

Hybrid loss
Although the original purpose of the proposed method is to optimize the evaluation metric of VAD directly, MaxAUC, which takes the hinge loss or the sigmoid loss to relax the 0-1 loss, is actually a surrogate of the exact AUC maximization. Therefore, there is no guarantee that MaxAUC will outperform other loss functions in all cases. To combine the advantages of multiple loss functions, here we propose a hybrid loss:

min Σ_{i=1}^C ω_i ℓ_i,   subject to Σ_{i=1}^C ω_i = 1, ω_i ≥ 0,   (16)

where ℓ_i is a base loss function that can be the AUC loss, cross entropy, squared error, etc., C is the number of base loss functions in the hybrid loss, and {ω_i}_{i=1}^C are learnable parameters that balance the loss functions. Note that {ω_i}_{i=1}^C are jointly optimized with the parameters of the deep model by backpropagation. In this paper, we jointly optimize MaxAUC_hinge and MCE as a special case of the hybrid loss.
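As an illustration of how a simplex-constrained weighting can be realized in practice, the sketch below parameterizes the weights by a softmax over unconstrained free parameters, so ω_i ≥ 0 and Σ ω_i = 1 hold automatically; this particular parameterization is our assumption, not necessarily the paper's implementation.

```python
import numpy as np

def hybrid_loss(base_losses, omega_logits):
    # weights on the probability simplex (omega_i >= 0, sum omega_i = 1),
    # realized here by a softmax over unconstrained parameters that could
    # be trained jointly with the network by backpropagation
    w = np.exp(omega_logits - np.max(omega_logits))
    w = w / w.sum()
    return float(np.dot(w, base_losses)), w

# e.g., C = 2 base losses: a MaxAUC hinge loss and a cross-entropy loss
base_losses = np.array([0.10, 0.35])
total, w = hybrid_loss(base_losses, np.zeros(2))  # zero logits: equal weights
```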

Experiments
In this section, we first present the datasets and experimental settings in Sections 6.1 and 6.2, respectively, then present the main results in Section 6.3, and finally discuss the effects of the hyperparameters of the MaxAUC-DNN VAD on performance in Section 6.4.

Datasets
We used the LibriSpeech ASR database, the CHiME-4 challenge data, and the THCHS-30 corpus as the sources of clean speech. We used a large-scale sound effect library and the NOISEX-92 database as the sources of additive noise. All audio files were sampled at 16 kHz. The LibriSpeech ASR corpus is a large-scale corpus of 1000 h of read English speech. The single-channel clean speech data of CHiME-4, named the "tr05_org" subset, is a read English speech corpus based on the original WSJ0 training data, which contains 7138 utterances. THCHS-30 is an open Chinese speech database consisting of 35 h of clean speech signals. The sound effect library contains over 20,000 sound effects. NOISEX-92 is a widely used noise database containing 9 noise scenarios, each of which is about 5 min long.

Construction of the training and test sets
We constructed a noisy training set by mixing the "train-clean-100" subset of LibriSpeech ASR with the sound effect library at an SNR range from −10 to 20 dB, which generated over 200 h of noisy English speech. We denote the training data as "Noisy-LibriSpeech". We also constructed a development set for hyperparameter selection by adding the babble and factory noise of NOISEX-92 to part of the clean speech of THCHS-30 at −5 dB. We constructed two noisy test sets, one in English and the other in Chinese, by mixing the "tr05_org" subset of CHiME-4 and THCHS-30 with all 9 noise scenarios of NOISEX-92 at SNR levels of {−10, −5, 0, 5, 10, 15, 20} dB, respectively. We denote the English test data as "Noisy-CHiME-4" and the Chinese test data as "Noisy-THCHS-30". The three datasets do not have sample-level ground-truth labels. To address this issue, we applied the Sohn VAD to the clean speech of the datasets and used its predictions on the clean speech as the ground-truth labels, which has been proven to be reliable in [33]. All DNN models were trained on the Noisy-LibriSpeech dataset unless otherwise stated. All evaluations were conducted in mismatched conditions, including mismatches of noise types, SNR levels, and languages.
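The noisy-set construction can be sketched as follows. This is a common SNR-mixing recipe written under our own assumptions (the paper's exact mixing pipeline may differ): the noise is scaled so that the speech-to-noise power ratio equals the target SNR before addition.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # scale the noise so that 10*log10(P_speech / P_noise) == snr_db,
    # then add it to the clean speech
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(3)
speech = rng.normal(0.0, 1.0, 16000)   # stand-in for 1 s of clean speech
noise = rng.normal(0.0, 0.5, 16000)    # stand-in for a noise recording
noisy = mix_at_snr(speech, noise, snr_db=-5)

# achieved SNR of the mixture
achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))
```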
Note that, to save space, we only report the results for the babble, factory, and volvo noise scenarios of NOISEX-92; the results for all 9 noise scenarios are listed in the Supplementary material.

Experimental settings
To verify the effectiveness of the proposed algorithm with different acoustic features and different models, we took the STFT and MRCG features, respectively, as its input. We set the frame length to 30 and 20 ms for the STFT and MRCG features, respectively, and set the frame shift to 10 ms for both features. The hyperparameter β of the MaxAUC_sigm-DNN VAD was set to 25 when MRCG was used, and to 45 when STFT was adopted. The hyperparameters γ and p of MaxAUC_hinge were set to 0.2 and 1, respectively. We took the feedforward neural network and BLSTM as the two basic deep models. Their hyperparameter settings are as follows.
The feedforward neural network contains two hidden layers. The number of hidden units per hidden layer was set to 256. The activation functions of the hidden units and output units were set to rectified linear units and sigmoid functions, respectively. The dropout rate was set to 0.2. We used stochastic gradient descent as the optimizer with an initial learning rate of 0.01 and a decay coefficient of 0.05. The number of training epochs was set to 30. The momentum of the first 3 epochs was set to 0.5, and the momentum of the other epochs was set to 0.9. The batch size was set to 4096. A contextual window was used to expand each input frame with its context along the time axis. The window size was set to 3.
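The contextual window expansion can be sketched as below; note that the exact window convention (3 frames per side, as assumed here, vs. 3 frames in total) and the edge-padding scheme are our assumptions, not stated in the paper.

```python
import numpy as np

def expand_context(frames, w=3):
    # stack each frame with its w left and w right neighbors along time,
    # padding the sequence edges by repeating the boundary frames
    T, D = frames.shape
    padded = np.pad(frames, ((w, w), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * w + 1].reshape(-1) for t in range(T)])

X = np.arange(12, dtype=float).reshape(4, 3)   # 4 frames, 3-dim features
Xc = expand_context(X, w=3)                    # shape (4, 3 * 7)
```

The center slice of each expanded vector is the original frame, with neighboring frames concatenated on either side.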
The BLSTM model comprises a fully connected layer with rectified linear units as the activation functions, followed by a BLSTM layer with hyperbolic tangent functions as the activation functions, and an output layer with sigmoid functions. The numbers of hidden units for the fully connected layer and the BLSTM layer were set to 512 and 256, respectively. The dropout rate was set to 0.2. The batch size was set to 4096. The network was randomly initialized and optimized by the Adam optimizer. The number of training epochs was set to 30. The learning rate was initialized to 1 and decreased with a decay coefficient of 0.05. The BLSTM model adopted the same contextual window as the DNN model for expanding the input. Note that for specific features and networks, the computational resources required by the proposed AUC-loss-based method and the baselines are the same. We list the computational resources required by these approaches in Table 1.
We compared the MaxAUC-DNN VAD with the MMSE-DNN VAD and the MCE-DNN VAD. To make the comparison fair, we compared the loss functions, i.e., MMSE and MCE, with the two variants of MaxAUC only, leaving all other parts of the comparison methods the same. All experiments were conducted in a non-reverberant environment. We adopted the ROC curve and AUC as the evaluation metrics.

Main results
We first compared the VAD methods with the feedforward neural network and STFT feature on the Noisy-CHiME-4 dataset. From the comparison results in Table 2, we see that, when the SNR levels are below 10 dB, the MaxAUC_hinge-DNN VAD outperforms the MCE-DNN VAD and the MMSE-DNN VAD by relatively 2.21% and 6.90%, respectively; meanwhile, the MaxAUC_sigm-DNN VAD outperforms the two competitive VADs by relatively 1.36% and 6.07%, respectively. The MaxAUC-DNN VADs perform similarly to the MCE-DNN VAD in the other scenarios, and both outperform the MMSE-DNN VAD significantly.

Robustness to different DNN models
To evaluate how different types of DNN models affect the performance, we replaced the feedforward neural network with BLSTM. Table 3 lists the comparison results on Noisy-CHiME-4. From the table, we see that the experimental phenomenon is consistent with that in Table 2. Moreover, the MaxAUC_hinge-DNN VAD outperforms the MCE-DNN VAD by relatively 5.66% when the SNR levels are greater than or equal to 10 dB, which is an interesting phenomenon unobserved in Table 2.

Robustness to mismatched test languages
To further evaluate the generalization ability of the proposed method on different languages, we compared the VAD methods that adopted the BLSTM model and STFT feature on the Chinese Noisy-THCHS-30 test corpus. Table 4 lists the comparison results. From the table, we see that the experimental phenomenon is similar to that in Table 3, though all methods suffer some performance degradation due to the mismatch between the training and test languages.

Robustness to different acoustic features
To study how different acoustic features affect the performance, we replaced STFT with MRCG as the acoustic feature for all comparison methods, and conducted the experiment on the Chinese Noisy-THCHS-30 test corpus. Table 5 lists the comparison results. More results with the STFT feature can be found in the Supplementary materials. From the table, we see that the proposed methods outperform the competitive VADs in most cases at low SNR levels. Comparing Table 5 with Table 4, we also see that the performance of the comparison methods with MRCG is better than that with STFT.

Robustness to different training sets
To further investigate how different training sets affect the effectiveness of the proposed methods, we conducted a comparison on the Noisy-CHiME-4 test dataset, with the Chinese THCHS-30 dataset as the clean speech source of the training set. Note that the generation process of the Noisy-THCHS-30 training set, which was mixed from the training subset of THCHS-30 and the large-scale sound effect library at an SNR range of −10 to 20 dB, was similar to the generation process of the Noisy-LibriSpeech training set. Table 6 lists the comparison results, which again demonstrate the superiority of the proposed methods. Importantly, we find that the mismatch of the test languages does not affect the relative improvement of the MaxAUC-DNN VADs over their comparison methods. Figure 3 shows the relative AUC improvement when MRCG is used as the acoustic feature. From the figure, we see that the MaxAUC-DNN VADs outperform the competitive VAD methods in most test scenarios, except the scenario in Fig. 3a at 20 dB. We also find, interestingly, that although the relative improvements of the two MaxAUC-DNN VADs over the competitive methods are similar at low SNR levels, the relative improvement of the MaxAUC_hinge VAD over the comparison methods tends to be enlarged as the SNR increases in Fig. 3b, c, and d, while the relative improvement of the MaxAUC_sigm VAD is reduced on the contrary. Moreover, we find that the relative improvement of the MaxAUC-DNN VADs over their comparison methods on the mismatched test language is higher than that on the matched test language. Comparing Figs. 2 and 3, we summarize that the MaxAUC_hinge-DNN VAD has a slightly stronger generalization ability than the MaxAUC_sigm-DNN VAD in most cases. The better the basic deep model and acoustic feature are, the larger the advantage the MaxAUC_hinge-DNN VAD achieves.

Summary and analysis of the comparison results
At last, we exemplify the ROC curves of the comparison methods with the BLSTM model and MRCG feature on the Chinese Noisy-THCHS-30 at −5 dB in Fig. 4. From the figure, it is clear that both of the proposed VADs outperform the competitive VADs, and the MaxAUC hinge -DNN VAD performs the best in most cases except the machinegun scenario. The above phenomena are observed in most other evaluations too.

Effects of hyperparameters on performance
In this subsection, we evaluated the hyperparameters of the MaxAUC-DNN VAD with the BLSTM model and MRCG feature on the babble and factory noise scenarios of the Chinese development dataset at −5 dB, and applied the optimal hyperparameters to all other test scenarios in this paper. The hyperparameter β of the MaxAUC_sigm-DNN VAD was selected from the range [2, 50]. Figure 5 shows the experimental result. From the figure, we observe that β behaves robustly in the wide range of [20, 50]. The hyperparameters γ and p of the MaxAUC_hinge-DNN VAD were selected from [0.1:0.1:0.9] and [1:1:9], respectively, where the symbol [a:b:c] denotes a series of numbers starting from a and ending at c with a step size of b. We searched (γ, p) jointly on a mesh grid. Figure 6 shows the experimental result. From the figure, it seems that the two hyperparameters have a strong correlation: if one hyperparameter is enlarged and the other is enlarged accordingly, the performance is stable across the two evaluation scenarios. The best performance appears around γ = 0.2 and p = 1.

Fig. 2 Relative AUC improvement of the proposed methods over the competitive methods, when STFT is used as the acoustic feature. a Feedforward neural network is used as the basic deep model; the evaluation is conducted on the English Noisy-CHiME-4 dataset. b Feedforward neural network is used; the evaluation is conducted on the Chinese Noisy-THCHS-30 dataset. c BLSTM is used; the evaluation is conducted on the English Noisy-CHiME-4 dataset. d BLSTM is used; the evaluation is conducted on the Chinese Noisy-THCHS-30 dataset.

Fig. 3 Relative AUC improvement of the proposed methods over the competitive methods, when MRCG is used as the acoustic feature. The terms "EN" and "CH" are short for English and Chinese, respectively. The term "NN" is short for neural networks. a Feedforward neural network is used as the basic deep model; the evaluation is conducted on the English Noisy-CHiME-4 dataset. b Feedforward neural network is used; the evaluation is conducted on the Chinese Noisy-THCHS-30 dataset. c BLSTM is used; the evaluation is conducted on the English Noisy-CHiME-4 dataset. d BLSTM is used; the evaluation is conducted on the Chinese Noisy-THCHS-30 dataset.

Conclusions
In this paper, we have proposed the MaxAUC-DNN VAD for improving the performance of DNN-based VAD at any decision threshold. Specifically, we first relax the AUC calculation, which is an integer optimization problem, to a polynomial-time solvable problem by a differentiable function, then compute the gradient of the relaxed AUC loss with respect to the parameters of the output layer of the DNN, and finally back-propagate the gradient to its hidden layers. We proposed two approximation functions: a sigmoid loss approximation and a hinge loss approximation. To integrate the advantages of the proposed loss with existing VAD loss functions, we also proposed a hybrid loss framework that jointly optimizes the loss functions. We evaluated the effectiveness of the MaxAUC-DNN VAD in a wide range of test scenarios with respect to different DNN models, acoustic features, training sets, and noise-mismatching and language-mismatching conditions. Empirical results show that the MaxAUC-DNN VAD outperforms the MMSE-DNN VAD and the MCE-DNN VAD in most test scenarios, that the relative improvement over the comparison methods tends to increase when the training and test conditions are mismatched, and that the method is insensitive to the hyperparameter selection. Finally, the hybrid loss has shown its potential to outperform its individual components.