Deep neural network-based bottleneck feature and denoising autoencoder-based dereverberation for distant-talking speaker identification

Zhang, Zhaofeng; Wang, Longbiao; Kai, Atsuhiko; Yamada, Takanori; Li, Weifeng; Iwahashi, Masahiro

doi:10.1186/s13636-015-0056-7

Research
Open access
Published: 12 May 2015

EURASIP Journal on Audio, Speech, and Music Processing volume 2015, Article number: 12 (2015)

Abstract

Deep neural network (DNN)-based approaches have been shown to be effective in many automatic speech recognition systems. However, few works have focused on DNNs for distant-talking speaker recognition. In this study, a bottleneck feature derived from a DNN and a cepstral domain denoising autoencoder (DAE)-based dereverberation are presented for distant-talking speaker identification, and a combination of these two approaches is proposed. For the DNN-based bottleneck feature, we noted that DNNs can transform the reverberant speech feature to a new feature space with greater discriminative classification ability for distant-talking speaker recognition. Conversely, cepstral domain DAE-based dereverberation tries to suppress the reverberation by mapping the cepstrum of reverberant speech to that of clean speech with the expectation of improving the performance of distant-talking speaker recognition. Since the DNN-based discriminant bottleneck feature and DAE-based dereverberation have a strong complementary nature, the combination of these two methods is expected to be very effective for distant-talking speaker identification. A speaker identification experiment was performed on a distant-talking speech set, with reverberant environments differing from the training environments. In suppressing late reverberation, our method outperformed some state-of-the-art dereverberation approaches such as the multichannel least mean squares (MCLMS). Compared with the MCLMS, we obtained a reduction in relative error rates of 21.4% for the bottleneck feature and 47.0% for the autoencoder feature. Moreover, the combination of likelihoods of the DNN-based bottleneck feature and DAE-based dereverberation further improved the performance.

1 Introduction

Although speaker recognition has been researched for many years, most applications still require a microphone located near the speaker. However, many applications would benefit from speaker recognition through distant-talking speech capture, where the speaker is able to speak at some distance from the microphones. In this task, even in quiet conditions, the microphone records not only the direct sound of the specific speaker but also reverberation signals. A reverberation signal is created when a sound or signal is reflected, causing a large number of reflections to build up and then decay as the sound is absorbed by the surfaces of objects in the space, which could include walls, furniture, people, and air.

Owing to the effects of reverberation, the accuracy of distant-talking speaker identification is significantly reduced. According to [1], approaches for dealing with reverberation can be classified as front-end- or back-end-based approaches. Approaches of the former type attempt to reduce the effect of reverberation from the observed speech signal [2-5], while the latter methods attempt to modify the acoustic model and/or decoder to suit a reverberant environment [6,7]. In this paper, we focus on front-end-based approaches for distant-talking speaker identification.

Many front-end-based techniques have been proposed for robust automatic speech recognition (ASR) and speaker recognition in distant-talking environments [2,4,5,8-18].

Cepstral mean normalization (CMN) [19-22] is considered the most general approach for dereverberation. However, the length of an impulse response in a distant-talking environment is usually much longer than the size of the analysis window in short-term spectral analysis. Therefore, CMN cannot compensate for late reverberation. Several studies have focused on mitigating this problem [4,5,13,17,23].

Beamforming [8,24], which is a simple and robust means of spatial filtering, can be used to suppress any signal arriving from the direction of noise or reflection; therefore, it is effective for dereverberation [13,25]. Recently, a two-stage beamforming approach [26] was presented for dereverberation and noise reduction. The first stage comprises a delay-and-sum beamformer that generates a reference signal containing a spatially filtered version of the desired speech and the interference. The second stage uses the filtered microphone signals and the noisy reference signal to estimate the desired speech. However, good performance cannot be achieved, particularly when the reverberation is very strong.

In [27,28], a method based on mean subtraction using a long-term spectral analysis window was proposed. The results showed that while subtracting the mean of the log magnitude spectrum improved ASR performance, the improvement was not sufficient, especially in the presence of significant late reverberation. A reverberation compensation method for speaker recognition using spectral subtraction [29], in which late reverberation is treated as additive noise, was proposed in [4], while a method based on multistep linear prediction (MSLP) was proposed in [5,17] for both single and multiple microphones. This method first estimates late reverberation using long-term MSLP and then suppresses this with the subsequent spectral subtraction. Wang et al. proposed a distant-talking speech recognition method based on generalized spectral subtraction (SS) [30] employing the multichannel least mean squares (MCLMS) algorithm [13,31,32]. The authors further extended their method to distant-talking speaker recognition and proposed an efficient computational method for combining the likelihoods of dereverberant speech using multiple compensation parameter sets [23]. The drawback of the above approaches is that the estimation of late reverberation is not very accurate, and thus, adequate improvement cannot be achieved.

To construct a more robust representation of each cepstral feature distribution, a feature warping method was proposed [4,33]. Such methods warp the distribution of a cepstral feature stream to a standardized distribution over a specified time interval. In addition, a feature transformation approach was presented for robust distant-talking speaker recognition [34]. The transformation is applied to distorted features before mapping them to a normal distribution and aims to decorrelate the feature vectors, making them more amenable to the diagonal covariance Gaussian mixture model (GMM).

Neural network-based approaches have been proposed for feature mapping and dereverberation for speech/speaker recognition [35,36] because of their flexible representations. Bottleneck features extracted by a multilayer perceptron (MLP) can be used for nonlinear feature transformation and dimensionality reduction [35]. The MLP is trained by a backpropagation algorithm from random initial parameters. Then, the bottleneck features are extracted by dimensionality reduction of several frames of cepstral coefficients. The combination of bottleneck features and cepstral coefficients is better than the conventional mel-frequency cepstral coefficients (MFCCs). However, deep networks of MLPs with many hidden layers have a high computational cost and cannot learn in layers further away from the top layer. Nugraha et al. proposed a neural network-based method to map a reverberant feature in a log-melspectral domain to its corresponding anechoic feature [36]. The results show that cascading neural network-based dereverberation significantly improves speaker recognition compared with other dereverberation approaches. Many studies have shown that cepstral features such as MFCCs are very efficient for speaker recognition; however, extending this method directly to cepstral domain dereverberation is very difficult.

Recently, deep neural network (DNN)-based approaches have been successful in many speech and image processing fields [37-40]. Deep belief networks, which employ an unsupervised pre-training method using a restricted Boltzmann machine (RBM) [39,41], have also been proposed to train better initial values of deep networks [37]. DNNs with pre-training achieve better performance than, for example, conventional MLPs without pre-training on ASR [39,40] and large vocabulary business search tasks [38]. Denoising autoencoders (DAEs) have been shown to be effective in many noise reduction applications because higher level representations and increased flexibility of the feature mapping function can be learned [42,43]. Ishii et al. applied a DAE to spectral domain dereverberation [44] and found that the word accuracy of large vocabulary continuous speech recognition improved from 61.4% to 65.2% for the JNAS (speech corpus for large vocabulary continuous speech recognition research) database [45]. However, the suppressed spectral domain feature needs to be converted to a cepstral domain feature, and the subsequent performance improvement is not sufficient.

Few studies have focused on a DNN-based approach for distant-talking speaker recognition. By removing reverberation, we can expect to improve the speech/speaker recognition performance. However, very little research has focused on the differences between speech and speaker recognition in a distant-talking environment. For speech recognition, it is necessary to maximize the inter-phoneme variation while minimizing the intra-phoneme variation in the feature space. For speaker recognition, on the other hand, the focus is on speaker variation instead of phoneme variation. These characteristics mean some methods that are effective in speech recognition may not be as effective for speaker recognition, especially in a hands-free environment [46]. Therefore, the effect of DNN-based feature mapping and dereverberation on distant-talking speaker recognition is still unknown.

In our preliminary experiment, we found that DNN-based cepstral domain feature mapping is efficient for distant-talking speaker recognition [47]. In this paper, we present DNN-based bottleneck feature mapping, DAE-based cepstral domain dereverberation, and a combination of the two for distant-talking speaker recognition. For the DNN-based bottleneck feature (BF-DNN), we noted that DNNs can transform the reverberant speech feature to a new feature space with greater discriminative classification ability for distant-talking speaker recognition. In addition, by using multiple contexts (frames) for input data, the bottleneck features can reduce the influence of reverberation over several frames.

For neural network-based dereverberation, previous studies have shown that the spectral domain feature is efficient for the ASR task [44]. Because many speaker recognition systems adopt a cepstral domain feature as the direct input, it is worthwhile to evaluate the performance of the cepstral domain DAE-based dereverberation method. Cepstral domain DAE-based dereverberation transforms the cepstrum of reverberant speech to that of clean speech. Moreover, the dimensions of the spectral domain-based features are greater than those of the cepstral domain-based ones, which makes it more difficult to learn a DAE with a deep architecture. Thus, it is expected that DAE-based cepstral domain dereverberation would be more efficient than DAE-based spectral domain dereverberation for speaker identification under distant-talking environments.

The DNN-based bottleneck feature is a method for extracting discriminant features while DAE-based dereverberation is a method for suppressing reverberation. Thus, they have a strong complementary nature, and a combination of the two methods should be very efficient in distant-talking speaker identification. Therefore, the likelihood of the bottleneck features extracted from the DNN and that of cepstral domain DAE-based dereverberation are combined linearly. A block diagram of the complete system is shown in Figure 1. In the training stage, DAE and BF-DNN models for feature transformation and speaker models with transformed features are trained. In the test stage, first, MFCCs extracted from the reverberant speech are input to the DAE and BF-DNN models for feature transformation. Then, the transformed features and speaker models are used to calculate the likelihood of each speaker. Finally, the likelihoods of DAE-based and BF-DNN-based features are combined and the target speaker is determined.

We also analyzed the optimal neural network architecture and parameters of the DNN-based bottleneck feature and DAE-based dereverberation for distant-talking speaker identification.

The remainder of this paper is organized as follows: Section 2 presents some basic theory for constructing and training DNNs, while an outline of the DNN-based bottleneck feature and DAE-based dereverberation method is given in Section 3. Section 4 discusses the development and evaluation of an experiment for distant-talking speaker recognition in reverberant environments. Finally, Section 5 summarizes the paper.

2 Overview of restricted Boltzmann machine

In speech recognition, DNNs have been successfully used for modeling the posterior probabilities of states. In this work, we used a DNN for non-linear feature transformation, which can suppress the reverberation and transform the original feature into a discriminative feature for reverberant speech. The basic training strategy involved multiple phases. First, the DNN was pre-trained by training RBMs in an unsupervised manner and stacking them in a deep belief network (DBN). Second, optimization by backpropagation, referred to as fine-tuning, discriminatively trained the DNN using supervised signals. Meanwhile, in the pre-training phase of the DAE task, the encoder network was also trained layer by layer as a stack of RBMs. In this section, we briefly introduce the RBM [39,41].

2.1 Restricted Boltzmann machine

The RBM is a bipartite graph as shown in Figure 2.

It has both visible and hidden layers in which visible units representing observations are connected to hidden units that learn to represent features using weighted connections. An RBM is restricted in that there are no visible-visible or hidden-hidden connections. Different types of RBMs are used for binary and real-valued input. Bernoulli-Bernoulli RBMs are used to convert binary stochastic variables to binary stochastic variables, while Gaussian-Bernoulli RBMs are used to convert real-valued stochastic variables to binary stochastic variables.

In a Bernoulli-Bernoulli RBM, the weights on the connections and the biases of the individual units define a probability distribution over the joint states of the visible and hidden units via an energy function. The energy of the joint configuration is given by

$$ E(\textbf{v},\textbf{h}|\theta)=-\sum^{\mathcal V}_{i=1}\sum^{\mathcal H}_{j=1}w_{ij}v_{i} h_{j}-\sum^{\mathcal V}_{i=1}a_{i} v_{i}-\sum^{\mathcal H}_{j=1}{b_{j}} {h_{j}}, \tag{1} $$

where $\theta=(w,a,b)$, and $w_{ij}$ represents the symmetric interaction term between visible unit $i$ and hidden unit $j$, with $a_{i}$ and $b_{j}$ their respective bias terms. $\mathcal V$ and $\mathcal H$ denote the numbers of visible and hidden units, respectively.
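As an illustration of the energy function in Equation (1), the following minimal sketch evaluates the energy of one joint configuration. The function name `rbm_energy` and the toy parameter values are hypothetical and chosen only for this example.

```python
def rbm_energy(v, h, w, a, b):
    """Energy E(v, h | theta) of a Bernoulli-Bernoulli RBM, following Equation (1).

    v, h : binary states of the visible and hidden units
    w    : weights, w[i][j] connects visible unit i and hidden unit j
    a, b : visible and hidden biases
    """
    interaction = sum(w[i][j] * v[i] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    visible_bias = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_bias = sum(b_j * h_j for b_j, h_j in zip(b, h))
    return -interaction - visible_bias - hidden_bias


# Toy example with 2 visible and 2 hidden units (illustrative values only).
w = [[0.5, -1.0], [2.0, 0.0]]
a = [0.1, 0.2]
b = [-0.3, 0.4]
energy = rbm_energy([1, 1], [1, 0], w, a, b)
```

Lower energy corresponds to a more probable joint configuration under the model.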

The maximum likelihood estimation of an RBM is to maximize the log likelihood log p(v|θ) of parameter θ. Therefore, the weight update equation is given by

$$ \Delta w_{ij}=\epsilon (\langle v_{i}h_{j} \rangle_{\text{data}}-\langle v_{i}h_{j}\rangle_{\text{model}}), \tag{2} $$

where $\epsilon$ is the learning rate, $\langle\cdot\rangle_{\text{data}}$ is the expectation that $v_{i}$ and $h_{j}$ are on together in the training set, and $\langle\cdot\rangle_{\text{model}}$ is the same expectation calculated from the model. Because computing $\langle v_{i}h_{j}\rangle_{\text{model}}$ exactly is expensive, we use a contrastive divergence approximation to compute the gradient, in which this term is approximated by applying Gibbs sampling.
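The weight update in Equation (2) can be sketched for a single training vector as follows. This is a minimal illustration assuming one step of Gibbs sampling (often called CD-1) and sigmoid conditional probabilities for a Bernoulli-Bernoulli RBM; the number of Gibbs steps, the function names, and the toy parameters are assumptions and are not taken from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, w, b):
    """p(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(b[j] + sum(v[i] * w[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, w, a):
    """p(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(a[i] + sum(h[j] * w[i][j] for j in range(len(h))))
            for i in range(len(a))]


def sample(probs, rng):
    """Draw binary states from a list of Bernoulli probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_weight_update(v0, w, a, b, epsilon, rng):
    """Return delta_w[i][j] = epsilon * (<v_i h_j>_data - <v_i h_j>_recon)."""
    # Positive phase: statistics driven by the training vector.
    ph0 = hidden_probs(v0, w, b)
    h0 = sample(ph0, rng)
    # One Gibbs step: reconstruct the visible units, then recompute the hidden units.
    v1 = sample(visible_probs(h0, w, a), rng)
    ph1 = hidden_probs(v1, w, b)
    return [[epsilon * (v0[i] * ph0[j] - v1[i] * ph1[j])
             for j in range(len(b))]
            for i in range(len(a))]


# Toy RBM with 3 visible and 2 hidden units (illustrative values only).
rng = random.Random(0)
w = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.2]]
a = [0.0, 0.0, 0.0]
b = [0.0, 0.0]
delta_w = cd1_weight_update([1, 0, 1], w, a, b, epsilon=0.03, rng=rng)
```

In practice the update would be averaged over a mini-batch, and the biases would be updated analogously.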

2.2 DNN structure and training

DBNs are configured hierarchically by connecting pre-trained RBMs. The top layer of a DBN is a softmax layer, with the softmax operation given as

$$ p(l|\textbf{h})=\frac{\exp(b_{l}+{\sum_{i}} {h_{i}} w_{il})}{{\sum_{m}}\exp({b_{m}}+{\sum_{i}} {h_{i}} w_{im})}, \tag{3} $$

where $b_{l}$ is the bias of label $l$ and $w_{il}$ is the weight from hidden unit $i$ in the top layer to label $l$.
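The softmax operation in Equation (3) can be sketched as follows. The function name `softmax_layer` and the toy values are hypothetical; subtracting the maximum logit before exponentiation is a common numerical safeguard and does not change the result.

```python
import math


def softmax_layer(h, w, b):
    """Return p(l | h) for every label l, following Equation (3).

    h : activations of the top hidden layer
    w : weights, w[i][l] connects hidden unit i to label l
    b : label biases
    """
    logits = [b[l] + sum(h[i] * w[i][l] for i in range(len(h)))
              for l in range(len(b))]
    shift = max(logits)  # numerical safeguard only
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: 2 hidden units, 3 labels (illustrative values only).
posteriors = softmax_layer([1.0, 0.5],
                           [[0.2, -0.1, 0.0], [0.4, 0.3, -0.2]],
                           [0.0, 0.1, -0.1])
```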

After configuring the DBN using RBMs, it is discriminatively trained using the backpropagation algorithm [48] to maximize the log probability of the class labels. In general, after discriminative training, a DBN is called a DNN.

In particular, we used the algorithm from [37] to train a DNN. In the pre-training phase, we first initialized the RBMs with random values. We then subdivided the training data into mini-batches of 128 data vectors for unsupervised pre-training. Each hidden layer was pre-trained for 50 passes, and the weights were updated after each mini-batch. For the DNN training phase, also referred to as the fine-tuning phase, we used the conjugate gradient algorithm. We repeated the fine-tuning for 100 epochs over the entire training set. The learning rate was 0.03 for the weights and 0.1 for the biases.
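To make the pre-training schedule concrete, the sketch below splits a dataset into mini-batches of 128 vectors and counts the weight updates per hidden layer. It assumes that one pass means one full sweep over the training data and uses a hypothetical dataset size of 1,000 vectors; neither assumption is stated in the paper.

```python
def minibatches(data, batch_size=128):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


PRETRAIN_PASSES = 50   # passes per hidden layer, as stated in the text
NUM_VECTORS = 1000     # hypothetical number of training vectors

batches = list(minibatches(list(range(NUM_VECTORS))))
# One weight update per mini-batch, repeated for every pass.
updates_per_layer = PRETRAIN_PASSES * len(batches)
```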

3 DNN-based bottleneck feature and DAE-based dereverberation

3.1 Bottleneck features extracted from a DNN

Bottleneck features were generated from an MLP [35] in which one of the internal layers has a small number of hidden units relative to the size of the other layers. The multilayer network to obtain the bottleneck features is shown in Figure 3. In this example, the number of hidden layers (including the bottleneck layer) is set to 5. The number of hidden units in the innermost layer is smaller than that in the other layers. We call this the bottleneck layer.

In our work, both MLPs without pre-training and DNNs with pre-training were used as multilayer networks. In the pre-training step, we trained an RBM for each layer and stacked them to construct a DBN using the common DBN training procedure. With the pre-training step, the DBN provided better initial values for the neural network. The bottleneck layer could be treated as a nonlinear mapping of the input features. In addition, it was possible to enhance the identification ability of the bottleneck features by discriminative training, which was expected to mitigate the influence of reverberation on speaker identification.

We used the speaker labels as the teacher signal. DNNs can be trained by backpropagating derivatives of a cost function that measures the cross entropy between the target outputs and the actual outputs produced for each training case.

The initial value of the MLP was generated randomly in the range −0.5 to 0.5, while the initial value of the DBN was determined by unsupervised pre-training. After initialization, supervised discriminative training was performed for both the MLP without pre-training and DBN with pre-training. Finally, the bottleneck features extracted from the bottleneck layer of the DNN were used to train the speaker model.
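Extracting a bottleneck feature amounts to a forward pass that stops at the bottleneck layer. The sketch below illustrates this with five hidden layers and randomly initialized weights in the range −0.5 to 0.5. The layer sizes, the sigmoid activation, the zero biases, and all function names are assumptions made for this example, and no training is performed.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(x, weights, biases):
    """One fully connected layer; weights[j] holds the input weights of unit j."""
    return [sigmoid(b_j + sum(w * x_i for w, x_i in zip(w_j, x)))
            for w_j, b_j in zip(weights, biases)]


def bottleneck_features(x, layers, bottleneck_index):
    """Propagate x through the hidden layers and return the bottleneck activations."""
    activation = x
    for index, (weights, biases) in enumerate(layers):
        activation = dense(activation, weights, biases)
        if index == bottleneck_index:
            return activation
    raise ValueError("bottleneck_index is beyond the last hidden layer")


def random_layer(n_in, n_out, rng):
    """Weights drawn uniformly from [-0.5, 0.5]; biases set to zero (assumption)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


# Hypothetical sizes: 8 inputs, hidden layers of 16, 16, 4 (bottleneck), 16, 16 units.
rng = random.Random(0)
sizes = [8, 16, 16, 4, 16, 16]
layers = [random_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
features = bottleneck_features([0.1] * 8, layers, bottleneck_index=2)
```

The layers after the bottleneck, including the speaker softmax layer, are needed only during training and are not evaluated when extracting features.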

3.2 Denoising autoencoder for cepstral domain dereverberation

An autoencoder is a type of artificial neural network whose output is a reconstruction of the input and which is often used for dimensionality reduction.

The autoencoder training phase aims to find a value for the parameter vector that minimizes the error between the network output and the teacher signal. This minimization is usually carried out by minimizing the cross entropy using conjugate gradients. Because it is difficult to directly optimize the weights in a deep autoencoder with many layers, an initialization step called pre-training was conducted.

DAEs share the same structure as autoencoders, but the input data are a noisy version of the output data. DAEs use feature mapping to convert noisy input data into clean output and, thus, have been used for noise removal in the field of image processing [42,49]. Ishii et al. applied a DAE to spectral domain dereverberation [44]. However, the suppressed spectral domain feature needs to be converted to a cepstral domain feature, and the resulting improvement in performance was inadequate. In this paper, we applied a DAE to cepstral domain dereverberation because many speaker recognition systems adopt cepstral domain features as their direct input. It is therefore meaningful to evaluate speaker recognition performance with cepstral domain DAE-based features. Given pairs of speech samples, that is, clean speech and the corresponding reverberant speech, the DAE learns the nonlinear conversion function that converts reverberant speech features into clean speech features. In general, reverberation is dependent on both the current and several previous observation frames. Therefore, in addition to the vector of the current frame, vectors of past frames were concatenated to form the input.

For cepstral feature $X_{i}$ of the observed reverberant speech of the $i$-th frame, the cepstral features of the $N-1$ frames before the current frame are concatenated with those of the current frame to form a cepstral vector of $N$ frames. Output $O_{i}$ of the nonlinear transformer based on the DAE is given by

$$ O_{i} = f_{L}(\ldots f_{l}(\ldots f_{2}(f_{1}(X_{i},X_{i-1},\ldots,X_{i-N+1}))\ldots)\ldots), \tag{4} $$

where $f_{l}$ is the nonlinear transformation function in layer $l$, and $N$ is the number of frames to be used as input features.
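The construction of the DAE input, in which the current frame is concatenated with the preceding frames to form a vector of N frames, can be sketched as follows. The function name `stack_context` is hypothetical, and repeating the first frame when fewer than N−1 past frames exist is an assumption, since the text does not specify how the start of an utterance is handled.

```python
def stack_context(frames, n_context):
    """For each frame i, concatenate X_i, X_{i-1}, ..., X_{i-N+1} into one vector.

    frames    : list of per-frame cepstral vectors
    n_context : N, the total number of frames in each stacked vector
    """
    stacked = []
    for i in range(len(frames)):
        vector = []
        for k in range(n_context):
            j = max(i - k, 0)  # assumption: pad with the first frame at the start
            vector.extend(frames[j])
        stacked.append(vector)
    return stacked


# Three toy frames with two cepstral coefficients each, stacked with N = 2.
toy_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
stacked = stack_context(toy_frames, n_context=2)
```

Each stacked vector has N times the dimension of a single frame, which determines the size of the DAE input layer.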

The topology of the cepstral domain DAE for dereverberation is shown in Figure 4. In this example, the number of hidden layers was set to five. In Figure 4, $W_{i}$ $(i=1,2,3)$ denotes the weights of the different layers and ${W_{i}^{T}}$ denotes the transpose of $W_{i}$ᵃ. That is, $W_{1}$, $W_{2}$, and $W_{3}$ were the encoder matrices and ${W_{1}^{T}}$, ${W_{2}^{T}}$, and ${W_{3}^{T}}$ were the decoder matrices, respectively. To train a DAE, we used DBNs [50] for pre-training because they can obtain accurate initial values of the deep-layer neural networks. To obtain a pre-trained RBM, we trained the second hidden layer using a Bernoulli-Bernoulli RBM and the third hidden layer using a Gaussian-Bernoulli RBM. DBNs are hierarchically configured by connecting these pre-trained RBMs. Here $W_{1}$, $W_{2}$, and $W_{3}$ are learned automatically, while ${W_{1}^{T}}$, ${W_{2}^{T}}$, and ${W_{3}^{T}}$ are generated from $W_{1}$, $W_{2}$, and $W_{3}$, respectively.
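The relationship between the encoder and decoder matrices can be illustrated by unrolling three encoder matrices into a network with five hidden layers whose decoder weights are the transposes of the encoder weights. This minimal sketch covers only the initialization before fine-tuning; the layer sizes, the sigmoid hidden units, the linear output layer, the omission of biases, and the function names are assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def unrolled_autoencoder(x, encoder_weights):
    """Apply W1, W2, W3 and then their transposes in reverse order."""
    decoder_weights = [transpose(w) for w in reversed(encoder_weights)]
    all_weights = encoder_weights + decoder_weights
    activation = x
    for index, weights in enumerate(all_weights):
        activation = matvec(weights, activation)
        if index < len(all_weights) - 1:
            # Hidden layers use a sigmoid; the final output layer is kept linear.
            activation = [sigmoid(z) for z in activation]
    return activation


def toy_matrix(n_out, n_in):
    """Deterministic illustrative weights (not learned values)."""
    return [[0.1 * ((r + c) % 3 - 1) for c in range(n_in)] for r in range(n_out)]


# Hypothetical sizes: 6 inputs, hidden layers of 4, 3, 2, 3, 4 units, 6 outputs.
W1, W2, W3 = toy_matrix(4, 6), toy_matrix(3, 4), toy_matrix(2, 3)
output = unrolled_autoencoder([0.5] * 6, [W1, W2, W3])
```

Because the decoder reuses the transposed encoder matrices, the output has the same dimension as the input.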

After pre-training, a backpropagation algorithm was applied to adjust the parameters of the autoencoder. The backpropagation algorithm modified the weights of the autoencoder to reduce the cross entropy error between the teacher signal and the output value when a pair of signals (an input signal and an ideal teacher signal) was given. In this paper, the input signal is the cepstral feature of reverberant speech and the ideal teacher signal is the cepstral feature of clean speech. The conjugate gradient algorithm was used to adjust the relative weightings of the units to minimize the cross entropy error for each training case [37].

3.3 Combination of DNN-based bottleneck feature and DAE-based dereverberation

We used a GMM as our speaker model owing to its convenience and effectiveness in conventional speaker recognition. In this paper, our two methods were combined at the GMM likelihood level. The GMM likelihood of the DNN-based bottleneck feature was linearly combined with that of the DAE-based feature to produce a new score $L_{\text {comb}}^{n}$ given by

$$ L_{\text{comb}}^{n} = (1-\alpha)L_{\text{BF}}^{n} + {\alpha}L_{\text{DAE}}^{n}, \quad n= 1,2,\ldots,N, \tag{5} $$

where $L_{\text {BF}}^{n}$ and $L_{\text {DAE}}^{n}$ are the likelihoods produced by the $n$-th bottleneck feature-based model and DAE-based model, respectively, $N$ is the number of registered speakers, and $\alpha$ denotes the weighting coefficient. The speaker with the maximum combined likelihood was selected as the target speaker.
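The linear combination in Equation (5) and the final speaker decision can be sketched as follows. The function name `combine_and_identify`, the score values, and the choice of α = 0.5 are hypothetical and serve only to illustrate the computation.

```python
def combine_and_identify(l_bf, l_dae, alpha):
    """Combine per-speaker scores as in Equation (5) and pick the best speaker.

    l_bf, l_dae : scores of the bottleneck-feature and DAE-based models,
                  one entry per registered speaker
    alpha       : weighting coefficient between 0 and 1
    Returns the index of the selected speaker and the combined scores.
    """
    combined = [(1.0 - alpha) * bf + alpha * dae for bf, dae in zip(l_bf, l_dae)]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined


# Toy scores for three registered speakers (illustrative values only).
l_bf = [-42.0, -40.5, -41.0]
l_dae = [-39.0, -41.5, -38.0]
speaker, scores = combine_and_identify(l_bf, l_dae, alpha=0.5)
```

Setting α to 0 reduces the decision to the bottleneck feature-based model alone, and setting it to 1 reduces it to the DAE-based model alone.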

4 Experiments

Our proposed method was evaluated on both simulated and actual data. Settings for the simulated data and speaker identification experiment are discussed in Section 4.1, while experimental results are presented in Sections 4.2.1 to 4.2.3. Section 4.2.1 describes the development experiment, while Section 4.2.2 evaluates our proposed method on simulated data. Section 4.2.3 investigates the effect of different training data. Regarding the experiment on actual data, details of the training data (comprising artificially created reverberant speech), the actual evaluation data, and evaluation experiment are described in Section 4.2.4.

4.1 Experimental setup

We used clean speech convolved with various impulse responses to generate simulated data for the dereverberation experiment. For the simulated data, eight multichannel impulse responses were selected from the Real World Computing Partnership (RWCP) sound scene database [51] and the CENSREC-4 database [52]. These were convolved with clean speech to create artificial reverberant speech. A large-scale database, the Japanese Newspaper Article Sentence (JNAS) [45] corpus, was used as the source for clean speech. Table 1 describes the development, training, and test datasets. Since the training and development datasets are the same, we refer to both as the training dataset. Utterances from 100 speakers (50 male and 50 female) were used for development and to train parameters for the DAE, BF-DNN, and GMMs. For each speaker, unless otherwise stated, we used three types of impulse responses (CENSREC-4), each convolved with five different sentences. Thus, in total, 1,500 sentences (15 sentences per speaker × 100 speakers) were used to train the DAE, BF-DNN, and GMMs. Each speaker provided 20 utterances for the test data. The average duration of training and test utterances was about 3.9 and 5.6 s, respectively.
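The generation of artificial reverberant speech and the size of the training set can be illustrated with a short sketch. The direct-form convolution below is a plain reference implementation, and the signal and impulse response values are toy data rather than samples from JNAS, RWCP, or CENSREC-4.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of a signal with a room impulse response."""
    output = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, sample in enumerate(signal):
        for k, tap in enumerate(impulse_response):
            output[n + k] += sample * tap
    return output


# Toy data: a unit impulse as "clean speech" and a decaying 3-tap impulse response.
clean = [1.0, 0.0, 0.0]
impulse_response = [1.0, 0.5, 0.25]
reverberant = convolve(clean, impulse_response)

# Training set size as described in the text.
IMPULSE_RESPONSES = 3
SENTENCES_PER_RESPONSE = 5
SPEAKERS = 100
sentences_per_speaker = IMPULSE_RESPONSES * SENTENCES_PER_RESPONSE
total_training_sentences = sentences_per_speaker * SPEAKERS
```

For long recordings, an FFT-based convolution would typically be used instead of the direct form shown here.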

Table 1 Dataset descriptions
