Single-channel dereverberation by feature mapping using cascade neural networks for robust distant speaker identification and speech recognition

Nugraha, Aditya Arie; Yamamoto, Kazumasa; Nakagawa, Seiichi

doi:10.1186/1687-4722-2014-13

Research
Open access
Published: 10 April 2014

Aditya Arie Nugraha¹,
Kazumasa Yamamoto¹,² &
Seiichi Nakagawa¹

EURASIP Journal on Audio, Speech, and Music Processing volume 2014, Article number: 13 (2014)

Abstract

We present a feature enhancement method that uses neural networks (NNs) to map a reverberant feature in the log-melspectral domain to its corresponding anechoic feature. The mapping is done by cascade NNs trained using the Cascade2 algorithm together with segment-based normalization. The method was evaluated with speaker identification (SID) and automatic speech recognition (ASR) systems. The SID experiments were conducted using our own simulated and real reverberant datasets, while the CENSREC-4 evaluation framework was used for the ASR system. The proposed method remarkably improved the performance of both systems while using only a limited amount of stereo data with low speaker variation as the training data. In the SID evaluation, we reached 26.0% and 34.8% error rate reduction (ERR) relative to the baseline using simulated and real data, respectively, with only one pair of utterances in matched-condition cases. Then, using a combined dataset containing 15 pairs of utterances by one speaker from three positions in a room, we reached an average identification rate of 93.7% (over three known and two unknown positions), which was a 42.2% ERR relative to cepstral mean normalization (CMN). In the ASR evaluation, using 40 pairs of utterances as the NN training data, we reached 78.4% ERR relative to the baseline with simulated utterances by five speakers. Moreover, we reached 75.4% and 71.6% ERR relative to the baseline with real utterances by five speakers and one speaker, respectively.

1 Introduction

The use of distant-talking microphones for automatic speech recognition (ASR) or automatic speaker identification (SID) systems can improve user convenience. Such microphones are essential for certain applications, e.g., ASR and/or SID in a smart home, where it is not practical for users to hold or wear a microphone every time they want to interact with the system. However, distant-talking microphones make the captured signal vulnerable to reverberation, where the signal travels from the speaker to the microphone not only directly but also through reflections, which can be seen as delayed and attenuated versions of the direct signal. Reverberation therefore causes a smearing effect because the microphone captures the currently spoken utterance along with utterances spoken in the past [1]. Because ASR and SID systems are usually trained using anechoic speech data, this signal degradation lowers their performance.

Several approaches have already been proposed to deal with the reverberation problem from the ASR point of view. According to [2], which discusses the state of the art in reverberant speech processing, there are two classes of approaches: front-end-based and back-end-based. The front-end-based approaches attempt to remove the effect of reverberation from the observed feature vectors. They can be divided into linear filtering, spectrum enhancement, and feature enhancement. Linear filtering dereverberates time-domain signals or STFT coefficients, e.g., [3, 4]; spectrum enhancement dereverberates the corrupted power spectra of the signal, e.g., [5–7]; and feature enhancement dereverberates the corrupted feature vectors, e.g., [8–10]. Meanwhile, the back-end-based approaches attempt to modify the acoustic model and/or decoder so that they are suitable for reverberant environments, e.g., [11, 12].

Among the front-end-based approaches, there are single-channel approaches (using a single microphone) and multi-channel approaches (using a microphone array). Much recent dereverberation research focuses on microphone arrays, e.g., multi-channel linear prediction [13], the minimum variance distortionless response (MVDR) beamformer [14], and multi-channel least mean squares (LMS) [15]. Compared to a single microphone, the main benefit of a microphone array is the spatial information it can provide. Despite this benefit, a single microphone is much easier and cheaper to implement in real applications. Thus, research on single-channel dereverberation methods is still worth pursuing.

Many works have focused on the feature enhancement approach, and several single-channel feature enhancement methods have been proposed. Some of them do not need stereo data at all, e.g., cepstral mean normalization (CMN) [16, 17], long-term feature normalization [18], vector Taylor series (VTS) [19], particle filter [8, 20], and extended Kalman filter [9, 21]. Others assume that stereo training data can be acquired. In the context of distant SID or ASR systems, stereo data are simultaneously recorded pairs of close-talking and distant-talking utterances. In general, the stereo training data are used to train a mapping function from the distant-talking utterance to its corresponding close-talking utterance. Several existing approaches that need stereo training data are reviewed in Section 2.

This research focused on developing a single-channel dereverberation method for automatic speaker identification and speech recognition under real environmental conditions by doing feature enhancement and assuming that stereo data can be acquired. In order to increase its feasibility for real-world applications, the method should have good performance by using a limited number of stereo data.

We propose a single-channel non-linear regression-based dereverberation method using cascade neural networks (NNs). The NNs were trained on stereo data to compensate for the reverberation effect by mapping a segment of reverberant 24-dimensional log-melspectral feature vectors to its corresponding anechoic feature vector. The two most important parts of the proposed method are the segment-based normalization and the feature mapping using NNs. The segment-based normalization normalizes the current frame of the anechoic and the reverberant features while preserving the power envelope of the reverberant input segment. For the feature mapping, we used cascade NNs trained using the Cascade2 algorithm with the Resilient Backpropagation (RPROP) weight update algorithm, which is a variation of the batch backpropagation algorithm. These two parts are most likely the reason why the proposed method could generalize and perform remarkably well with a limited amount of stereo data (one or five pairs of utterances, corresponding to less than 1 min of speech).

The proposed method was evaluated on SID and ASR systems. Both evaluations used simulated and real data. The SID evaluation used our own simulated and real data, while the ASR evaluation used simulated and real data from CENSREC-4 [22]. The proposed method performed very well using only a small amount of stereo data with low speaker variation as the NN training data.

The experimental result of SID using simulated data shows that in matched condition cases, the error rate reduction (ERR) relative to the baseline by using only one pair of utterances could reach 26.0% when single NN (‘1 NNs’) configuration was used. Meanwhile, by using 15 pairs of utterances and multiple NNs (‘24 NNs’) configuration, the ERR reached 62.6%. Also in matched condition cases, the experimental result using real data shows that by using ‘6 NNs’ configuration and only one pair of utterances for the NN training data, we could reach 34.8% of ERR relative to the use of CMN. Then, by combining the training data from known positions in a room, we could train NNs which performed well for unknown positions in the same room. By using multiple NNs (‘24 NNs’) and 15 pairs of utterances by one speaker from three positions, we could reach 93.7% of average identification rate over three known and two unknown positions, which was 42.2% of ERR relative to the use of CMN.

The experimental result of ASR using simulated data shows that we could reach 78.4% of ERR relative to the baseline by using dataset containing 40 pairs of utterances by five speakers as the NN training data. Meanwhile, by using the same number of real reverberant data, we could reach 75.4% and 71.6% of ERR relative to the baseline by using dataset containing utterances by five speakers and one speaker, respectively.

2 Related works

The most popular feature enhancement method using stereo data is stereo-based piece-wise linear compensation for environments (SPLICE), which estimates the clean cepstral feature from the noisy feature using a Gaussian mixture model (GMM) of noisy feature[23]. In general, SPLICE tries to represent a non-linear relation by using piece-wise linear relation in each subspace of noisy feature. SPLICE could perform well for both simulated and real noisy data[24]. However, SPLICE is designed specifically for dealing with the noise problem.

For the reverberation problem, in[25], 13 multi-layer perceptron (MLP) NNs were trained using stereo data to map the 13-dimensional reverberant cepstral feature, where one NN was used for one dimension of feature, to its corresponding anechoic feature. The input of each NN was a sequence of cepstral feature coefficients from nine consecutive frames, and the output was a cepstral feature coefficient. The approach was evaluated using vector quantization (VQ)-based speaker identification method and could reach 80.2% of ERR relative to baseline.

In [26], a linear regression by the least squares method (LSM) was used to map melspectral feature vectors from a four-frame sequence of reverberant speech to a frame of clean speech. Several schemes of dynamic time warping (DTW) were introduced because only a non-stereo dataset was available for the experiments. The non-stereo dataset contained close-talking utterances recorded from a distance of 25 cm and distant-talking utterances recorded from various positions in a room. These close- and distant-talking utterances were not recorded simultaneously, although the speakers and the utterances were the same. DTW was used to align the frames of a distant-talking utterance to the frames of its corresponding close-talking utterance before they were used as the training data. Nonetheless, the approach should also work on a stereo dataset.

In[10, 27], a joint sparse representation (JSR) technique was used to capture the relationship between clean and reverberant speech. The dictionary for clean feature space and the dictionary for reverberant feature space were jointly trained using the stereo data in order to have common representation coefficients. Basically, the approach did a mapping of log-melspectral feature vectors from N frames of reverberant speech to N frames of estimated clean speech. In[27], besides the 24-dimensional log-melspectral feature vectors, the mapping included the log-energy coefficients. In the same paper, the sequence of N frames included the use of left and right context (past, current, and future frames).

Stereo data have also been used in linear filtering approaches. In [28], the dereverberation was done using linear and binary-weighted least squares techniques in the time and fast Fourier transform (FFT) domains. The stereo data were needed to calculate the inverse filter coefficients, which were then used to transform N frames of reverberant complex-valued FFT coefficient vectors to N frames of estimated clean FFT coefficient vectors. The experiments were done using 512-, 1,024-, and 2,048-dimensional vectors, where the vector length corresponds to the length of the DFT.

Recently, in[29, 30], a denoising autoencoder (DAE), which is one of deep neural network (DNN) approaches, was used to do a mapping of coefficient vectors from a sequence of reverberant speech to a sequence of clean speech. They also introduced the use of short and long window. The short window is used to extract 256 dimensions of power spectral coefficients and the log energy. On the other hand, the long window is used to extract 24 dimensions of melspectral coefficients and the log energy. Thus, by using both windows, the DNN was used to map from and to 2,538-dimensional vectors, which are constructed by power spectral, melspectral, and log-energy coefficients of a nine-frame segment. In addition, DAE was also used for speech enhancement by mapping the power spectral coefficients[31] and the melspectral coefficients[32]. These DNN-based approaches are effective, but they require much training data for training a huge number of parameters.

In summary, the approaches proposed in [25, 26] map an N-frame segment to a one-frame segment. Meanwhile, the approaches proposed in [10, 27–30] map an N-frame segment to an N-frame segment. The method proposed in this work maps an N-frame segment to a one-frame segment of log-melspectral coefficients by using cascade NNs and requires only a small amount of training data. The NN is used because it should be able to capture a non-linear relation across the frames, which arises because the analysis window (frame) is too short to capture the reverberation effect and because of other complex factors.

3 Overview of neural network

3.1 Artificial neural network

NN, or more properly called artificial neural network (ANN), is a computational model inspired by the biological nervous systems, such as the brain. In a simple way, a biological nervous system consists of interconnected webs of neurons, where each neuron has dendrites, soma, and axon. The dendrites receive input signals and when the soma feels that the input signals are strong enough, it emits an output signal through the axon. This signal then can be sent to other neuron’s dendrites through the synapses, which are the end points of axon’s branches.

How a biological neuron works is modeled by an artificial neuron, as depicted in Figure 1. The neuron has inputs x_n with their associated weights w_n. The weighted inputs are integrated in the neuron, which in most cases is simply done by summation, and then evaluated by an activation function f, e.g., the hyperbolic tangent function, to determine the output. An ANN is simply a network of interconnected artificial neurons and has three important elements: the structure of the nodes/neurons (how the inputs are integrated and the activation function), the topology/architecture (the way the artificial neurons are interconnected), and the training algorithm (which determines the weights in the network) [33].
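As a minimal sketch of the artificial neuron described above, the following Python function integrates the weighted inputs by summation and applies a hyperbolic tangent activation. The function name and the numeric values are hypothetical, and no bias input is modeled because the text does not mention one.

```python
import math

def neuron_output(inputs, weights, activation=math.tanh):
    """Integrate weighted inputs by summation, then apply the activation f."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    net = sum(x * w for x, w in zip(inputs, weights))
    return activation(net)

# Hypothetical inputs x_n and weights w_n, chosen only for illustration.
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, 0.1]
y = neuron_output(x, w)  # weighted sum is 0.2 - 0.3 + 0.2, passed through tanh
```

Passing a different callable as `activation` changes the structure of the neuron without touching the integration step.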

Various types of NN are defined based on their architectures. The architecture itself is a combination of their framework and their interconnection scheme[34].

The framework is defined by the number of clusters and the number of neurons in each cluster. The clusters are called layers if they are ordered and are called slabs otherwise. There are input, hidden, and output clusters, where each cluster contains one or more neurons. The neurons within a cluster are not necessarily ordered.

The interconnection scheme is mainly defined by the connectivity (describes which neurons are connected) and the types of connections. In layered NN, connections can be divided into interlayer, intralayer, and supralayer connections. Interlayer connection connects neurons from adjacent layers, intralayer connection connects neurons within a layer, and supralayer connection connects neurons from different (non-adjacent) layers. In slabbed NN, where the clusters are not ordered, there are only interslab connection, which connects neurons from different slabs, and intraslab connection, which connects neurons within a slab. Further, in regard to the directionality, connections can be divided into symmetric (bidirectional) and asymmetric (unidirectional) connections.

3.2 Conventional multi-layer perceptron and cascade networks

Both conventional MLP and the cascade networks used in the proposed method may use the same structure of neurons. The main differences between them are in the architecture and how to build the architecture, which then cause a difference in the training algorithm.

3.2.1 Conventional MLP network

In the conventional MLP approach, the NN is fully defined in advance before the training is started. The NN is a layered NN with asymmetric interlayer connections (Figure 2A). The NN contains an input layer, one or more hidden layers, and an output layer. Except for the output layer, each layer commonly contains more than one neuron. The training (weight update) algorithm is then used to update the previously initialized weights and determine the most appropriate weights based on the training data. Backpropagation can be regarded as the most popular weight update algorithm for training an MLP. It propagates an input through the network, then propagates the error back and adjusts the weights to minimize the error. The algorithm can be used in both incremental training, in which the weights are updated for each training datum in the training set, and batch training, in which the weights are updated only after all training data in the training set have been presented.
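The difference between incremental and batch training can be sketched with a toy case. The example below is an illustration only: it uses a single linear neuron with one weight and a squared-error gradient, so no backward pass through hidden layers is shown, and the function names, data, and learning rate are hypothetical.

```python
def incremental_epoch(w, data, lr):
    """Update the weight after every training datum."""
    for x, target in data:
        error = w * x - target
        w -= lr * error * x
    return w

def batch_epoch(w, data, lr):
    """Accumulate the gradient over the whole set, then update once."""
    grad = sum((w * x - target) * x for x, target in data)
    return w - lr * grad

# Hypothetical training pairs that follow target = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0)]
w_inc = incremental_epoch(0.0, data, lr=0.1)
w_bat = batch_epoch(0.0, data, lr=0.1)
```

After one epoch the two schemes give different weights, because the incremental scheme evaluates the second datum with an already updated weight.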

3.2.2 Cascade network

Besides the common approach above, there are dynamic approaches in which the architecture of the NN is altered during training by adding neurons and/or clusters. Thus, the training algorithm consists not only of the weight update algorithm but also of an architecture algorithm, e.g., cascade. The two most common cascade algorithms are Cascade-Correlation (CasCor) and Cascade2 [35]. The Cascade2 algorithm is a variation of the CasCor algorithm. Instead of using covariance maximization as in CasCor, Cascade2 uses direct error minimization. As a result, Cascade2 is the better algorithm for regression tasks, while CasCor is better for classification tasks [36].

In the cascade algorithm, the NN can be regarded as a layered NN with asymmetric interlayer and supralayer connections (Figure 2B). Usually, the NN contains an input layer, many hidden layers, and an output layer. Unlike in the conventional MLP approach, each hidden layer in the cascade algorithm contains one hidden neuron. The other difference is that in the cascade NN, every hidden neuron and also the output neurons are directly connected to the input neurons. Meanwhile, in a conventional MLP, only the hidden neurons of the first hidden layer are directly connected to the input neurons, and the output neurons are directly connected only to the hidden neurons of the last hidden layer.

Before the training of a cascade NN is started, the NN contains only an input layer and an output layer, with interlayer connections between the neurons of these two layers. The NN is then grown by adding hidden neurons/layers during the training. Each newly added hidden neuron connects to the neurons of the input layer and the output layer. The newly added hidden neuron also connects to the previously added hidden neurons. The hidden neuron addition is controlled by the cascade algorithm and is done after the weight update algorithm can no longer find weights that generate the correct output with the existing architecture. The same backpropagation weight update algorithm as for conventional MLP training can also be used for cascade NN training.
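The connectivity of a cascade NN can be sketched as a forward pass. This is an illustrative sketch, not the Cascade2 training procedure: it assumes one output neuron with a linear activation, tanh hidden neurons, no bias inputs, and that every hidden neuron receives the outputs of all earlier hidden neurons, as in Cascade-Correlation. The function and argument names are hypothetical.

```python
import math

def cascade_forward(x, hidden_weights, output_weights):
    """Forward pass through a cascade topology with one output neuron.

    hidden_weights[h] holds the weights of hidden neuron h over
    [inputs..., outputs of hidden neurons 0..h-1].
    output_weights holds the weights over [inputs..., all hidden outputs].
    """
    signals = list(x)  # grows by one value per hidden neuron
    for h, w in enumerate(hidden_weights):
        if len(w) != len(x) + h:
            raise ValueError("hidden neuron %d has the wrong fan-in" % h)
        signals.append(math.tanh(sum(s * wi for s, wi in zip(signals, w))))
    if len(output_weights) != len(signals):
        raise ValueError("output neuron has the wrong fan-in")
    return sum(s * wi for s, wi in zip(signals, output_weights))

# Initial network: inputs connected directly to the output, no hidden neurons.
initial = cascade_forward([1.0, 2.0], [], [0.5, 0.25])

# Grown network: two hidden neurons, the second one also fed by the first.
grown = cascade_forward(
    [1.0, 2.0],
    [[0.1, 0.2], [0.0, 0.0, 1.0]],
    [0.0, 0.0, 0.0, 1.0],
)
```

Adding a hidden neuron only appends one weight list and one output weight, which is why the architecture can grow without redefining the existing connections.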

According to[37] and[35], the cascade algorithm offers several advantages, including:

The algorithm automatically builds a reasonably small network, so there is no need to define the NN in advance,
The algorithm learns fast because it employs weight freezing to overcome the moving target problem, training one unit at a time instead of training the whole network at once as in a conventional MLP, and
The algorithm can build a deep network (high-order feature representation) without the dramatic slowdown seen in a conventional MLP with more than one hidden layer.

The deep network generated by the cascade algorithm can represent very strong non-linearity. This is good for some problems but may be bad for others, where it can be regarded as an overfitting problem caused by the use of too many layers and neurons. As explained in [37], CasCor employs a ‘patience’ parameter to stop the training when the error has not changed significantly for a period of time. However, according to [35], overfitting will still occur if the NN is allowed to grow too much. Therefore, we need to define a proper maximum number of hidden neurons. Besides using fewer hidden neurons, we can also use more training data to reduce the possibility of overfitting.

For further details of cascade NN in general and Cascade2 algorithm in particular, please refer to[35].

4 The estimation function

4.1 Reverberation model

In the time domain, the relation between the anechoic and reverberant signals (disregarding noise) can be expressed as

y (t) = s (t) * h (t),

(1)

where s(t) and y(t) are the clean and reverberant signals, respectively, * denotes convolution, and h(t) is the room impulse response (RIR), which defines the room transfer function (RTF).
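Equation 1 can be illustrated with a discrete-time sketch in which the reverberant signal is the convolution of a clean signal with an impulse response. The sample values below are hypothetical and much shorter than any real RIR; they only show how delayed, attenuated copies of the clean signal add up.

```python
def convolve(s, h):
    """Full discrete convolution y = s * h, of length len(s) + len(h) - 1."""
    y = [0.0] * (len(s) + len(h) - 1)
    for n, s_n in enumerate(s):
        for k, h_k in enumerate(h):
            y[n + k] += s_n * h_k
    return y

s = [1.0, 0.5, -0.25, 0.0]  # hypothetical clean samples
h = [1.0, 0.0, 0.6, 0.3]    # hypothetical RIR: direct path plus two reflections
y = convolve(s, h)          # reverberant samples
```

The first two output samples equal the clean ones because the reflections in this toy response arrive two and three samples after the direct path.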

The relation between anechoic and reverberant signal in log-melspectral domain should be represented as a non-linear model as shown in[12, 21]. However, for simplicity, we defined it as

Y (t) = \sum_{i = 0}^{N} α_{i} S (t - i)

(2)

= α_{0} S (t) + \sum_{i = 1}^{N} α_{i} S (t - i),

(3)

where S(t) and Y(t) represent the log-melspectral coefficients of the anechoic and reverberant signals, respectively, for frame index t, while α₀, α₁, …, α_N represent the RTF. This formulation was introduced in [26] and also employed in [38–40].

The first term of Equation 3 corresponds to the direct-path signal captured by the microphone and is represented by the solid line in Figure 3. Meanwhile, the second term corresponds to the sum of signal reflections and is represented by the dotted lines in Figure 3. Each reflection can be regarded as an attenuated and delayed version of the direct signal.

4.1.1 Causal model

From Equation 3, S(t) could be expressed as

S (t) = \frac{1}{α_{0}} Y (t) - \sum_{i = 1}^{N} \frac{α_{i}}{α_{0}} S (t - i),

(4)

and by recursively substituting the last term, S(t) could be expressed as

\begin{aligned} S (t) & = \frac{1}{α_{0}} Y (t) - \sum_{i = 1}^{N} \frac{α_{i}}{α_{0}} \frac{1}{α_{0}} Y (t - i) \\ & \quad + \sum_{i = 1}^{N} \frac{α_{i}}{α_{0}} \sum_{j = 1}^{N} \frac{α_{j}}{α_{0}} S (t - i - j), \end{aligned}

(5)

\begin{aligned} S (t) & = \frac{1}{α_{0}} Y (t) - \sum_{i = 1}^{N} \frac{α_{i}}{α_{0}^{2}} Y (t - i) \\ & \quad + \sum_{i = 1}^{N} \sum_{j = 1}^{N} \frac{α_{i} α_{j}}{α_{0}^{3}} Y (t - i - j) - \dots . \end{aligned}

(6)

The first term of Equation 6 considers only the current frame Y(t), while the succeeding terms consider the left context. The second term considers Y(t - N) until Y(t - 1), the third term considers Y(t - 2N) until Y(t - 2), the fourth term considers Y(t - 3N) until Y(t - 3), and so on.

The RIR is characterized by the reverberation time (T₆₀), which is the time required for the reflections of a direct-path signal to decay by 60 dB, i.e., to one-millionth of the original energy. If N is selected such that the N-frame segment covers T₆₀, we may limit the calculation to frame N because the energy of reflections in the frames after N is very low and negligible. In practice, N should be a trade-off between dereverberation performance and computational cost: a longer segment may capture the reverberation effect better and thus potentially improve the dereverberation, but the computational cost of processing it is higher. In order to simplify the equation, β_i is used to represent the variables formed by combinations of α_i, e.g., $β_{0} = α_{0}^{- 1}$ , $β_{1} = - α_{1} α_{0}^{- 2}$ , $β_{2} = - α_{2} α_{0}^{- 2} + α_{1}^{2} α_{0}^{- 3}$ , $β_{3} = - α_{3} α_{0}^{- 2} + 2 α_{1} α_{2} α_{0}^{- 3} - α_{1}^{3} α_{0}^{- 4}$ , and so on. Thus, the estimated anechoic coefficient $\hat{S} (t)$ could be expressed as a function of the reverberant signal Y(t)

\hat{S} (t) = β_{0} Y (t) + \sum_{k = 1}^{L} β_{k} Y (t - k) + ε,

(7)

\hat{S} (t) \approx \sum_{k = 0}^{L} β_{k} Y (t - k),

(8)

where β₀, β₁, …, β_L denote the weights used to compensate for the RTF and L denotes the number of past frames in the segment. L is used in place of N in order to indicate that the frames are the left context. By using Equation 8, we could estimate the current source signal S(t) from an (L + 1)-frame segment of the observed signal, consisting of the current observed signal Y(t) and L frame(s) of the past observed signal.
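The weights β_k can also be obtained without expanding the nested sums by hand. Substituting Equation 8 into Equation 4 and matching the coefficients of Y(t - k) gives β₀ = 1/α₀ and β_k = -(1/α₀) Σ_{i=1}^{min(k,N)} α_i β_{k-i} for k ≥ 1. The sketch below implements this recursion; the function name and the α values are hypothetical, and the recursion is our restatement of the derivation above rather than a procedure given in the paper.

```python
def inverse_weights(alpha, num_past):
    """Return beta_0..beta_L for given alpha_0..alpha_N (Equation 8 weights)."""
    n_max = len(alpha) - 1
    beta = [1.0 / alpha[0]]
    for k in range(1, num_past + 1):
        acc = sum(alpha[i] * beta[k - i] for i in range(1, min(k, n_max) + 1))
        beta.append(-acc / alpha[0])
    return beta

alpha = [2.0, 1.0, 1.0, 1.0]  # hypothetical alpha_0..alpha_3
beta = inverse_weights(alpha, num_past=3)

# Closed forms for the first three weights, for comparison.
beta0 = 1.0 / alpha[0]
beta1 = -alpha[1] / alpha[0] ** 2
beta2 = -alpha[2] / alpha[0] ** 2 + alpha[1] ** 2 / alpha[0] ** 3
```

Choosing `num_past` smaller than the number of frames with significant reflections truncates the sum, which is the source of the error term ε in Equation 7.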

4.1.2 Non-causal model

By considering a typical RIR, intuitively, we know that the information of current frame will remain in its reflections in the future, especially in its early reflections part where the reflections still have considerable amount of energy.

Let 0 < n < N and N > 1, Equation 3 can be rewritten as

\begin{aligned} Y (t) & = \sum_{i = 0}^{n - 1} α_{i} S (t - i) + α_{n} S (t - n) \\ & \quad + \sum_{i = n + 1}^{N} α_{i} S (t - i), \end{aligned}

(9)

and S(t-n) could be expressed as

\begin{aligned} S (t - n) & = \frac{1}{α_{n}} Y (t) - \sum_{i = 0}^{n - 1} \frac{α_{i}}{α_{n}} S (t - i) \\ & \quad - \sum_{i = n + 1}^{N} \frac{α_{i}}{α_{n}} S (t - i) . \end{aligned}

(10)

Then, by substituting S(t-i) using the causal model on Equation 8, S(t-n) could be expressed as

\begin{aligned} S (t - n) & = \frac{1}{α_{n}} Y (t) - \sum_{i = 0}^{n - 1} \frac{α_{i}}{α_{n}} \sum_{j = 0}^{N} β_{j} Y (t - i - j) \\ & \quad - \sum_{i = n + 1}^{N} \frac{α_{i}}{α_{n}} \sum_{j = 0}^{N} β_{j} Y (t - i - j) . \end{aligned}

(11)

Equation 11 comprises three terms. The first term considers only Y(t). The second term considers the frames that follow frame t - n, which are Y(t) until Y(t - n + 1); the frame t - n itself, which is Y(t - n); and the frames that precede it, which are Y(t - n - 1) until Y(t - n - N + 1). Meanwhile, the third term considers only frames that precede frame t - n, which are Y(t - n - 1) until Y(t - 2N). In order to simplify the equation, γ_i is used to substitute the variables formed by the combination of α_i and β_i. Thus, the estimated anechoic coefficient $\hat{S} (t - n)$ could be expressed as

\begin{aligned} \hat{S} (t - n) & = \sum_{k = 0}^{n - 1} γ_{k} Y (t - k) + γ_{n} Y (t - n) \\ & \quad + \sum_{k = n + 1}^{2 N} γ_{k} Y (t - k) + ε . \end{aligned}

(12)

Then, by making generalization on Equation 12, the estimated anechoic coefficient $\hat{S} (t)$ could be expressed as

\begin{aligned} \hat{S} (t) & = \sum_{k = - R}^{- 1} γ_{k} Y (t - k) + γ_{0} Y (t) \\ & \quad + \sum_{k = 1}^{L} γ_{k} Y (t - k) + ε, \end{aligned}

(13)

\hat{S} (t) \approx \sum_{k = - R}^{L} γ_{k} Y (t - k),

(14)

where γ_{-R}, …, γ_{-1}, γ₀, γ₁, …, γ_L denote the weights used to compensate for the RTF, L denotes the number of past frames in the segment, and R denotes the number of future frames in the segment. L and R are used to indicate that the frames are the left context and right context, respectively. By using Equation 14, we could estimate the current source signal S(t) from an (L + 1 + R)-frame segment of the observed signal, consisting of the current observed signal Y(t), L frame(s) of the past observed signal (left context), and R frame(s) of the future observed signal (right context).

Equation 14 could be seen as the general form of reverberation model. We could get Equation 8 from Equation 14 by setting R = 0.

Hereafter, we refer to Equation 8 as the causal reverberation model and Equation 14 as the non-causal reverberation model (for R > 0). For the causal reverberation model, the estimation of $\hat{S} (t)$ can be seen as removing the unwanted information (reflections of previous frames), which is estimated from the left context. For the non-causal reverberation model, besides removing the unwanted information, the estimation can be seen as gathering more information about the current frame to be processed, which is estimated from the right context.

4.2 Frame selection

Inspired by the use of window skipping in[41], besides Equation 14 we also defined

\hat{S} (t) \approx \sum_{k = - R}^{L} γ_{k} Y (t - 2 k) .

(15)

Hereafter, we refer to Equation 14 as ‘linear’ frame selection and Equation 15 as ‘skip1’ frame selection.

In this work, we named the frame selection used in the experiments using L-C-R notation. L, C, and R show the number of left context (past), current, and right context (future) frames, respectively. For example, frame selection 4-1-0 means that four past and one current frames of reverberant speech are used to estimate the dereverberated version of current frame, and frame selection 4-1-4 means that four past, one current, and four future frames are used to do the dereverberation.

The use of skipping frame selection could be regarded as dimensionality reduction strategy by minimizing the redundant parts caused by the windowing. Therefore, we could get a representation of longer context of time-domain signal using smaller number of frames, which is beneficial for the NN training. For example, if we use 25 ms window with 10 ms shift, 145 ms of context can be represented by using 13-frame linear frame selection, e.g., 12-1-0 or 6-1-6, or 7-frame skip1 frame selection, e.g., 6-1-0 or 3-1-3.
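The arithmetic behind the 145 ms example can be checked with a short sketch. The helper names below are hypothetical; offsets are expressed in frames relative to the current frame, with negative values for past frames and positive values for future frames, and the window and shift default to the 25 ms and 10 ms used in the example.

```python
def frame_offsets(left, right, skip=0):
    """Frame offsets for an L-C-R selection.

    skip=0 gives the 'linear' selection; skip=1 gives 'skip1',
    which takes every other frame.
    """
    step = skip + 1
    return [step * k for k in range(-left, right + 1)]

def context_ms(offsets, window_ms=25, shift_ms=10):
    """Signal duration covered from the first to the last selected window."""
    return (max(offsets) - min(offsets)) * shift_ms + window_ms

linear_12_1_0 = frame_offsets(12, 0)         # 13 frames
linear_6_1_6 = frame_offsets(6, 6)           # 13 frames
skip1_6_1_0 = frame_offsets(6, 0, skip=1)    # 7 frames
skip1_3_1_3 = frame_offsets(3, 3, skip=1)    # 7 frames
spans = [context_ms(o) for o in
         (linear_12_1_0, linear_6_1_6, skip1_6_1_0, skip1_3_1_3)]
```

All four selections span the same duration, while the skip1 selections feed roughly half as many frames to the NN.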

4.3 Assumptions on the log-melspectral feature

In our works, we made several assumptions on the log-melspectral feature. The first assumption is about the dependency of certain feature dimension to the other feature dimensions, and the second is about the RTF of feature dimension.

On the dependency of feature dimensions, we defined several channel selections, as follows:

All-dimension selection, where the dimensions are assumed to be fully dependent on each other.
Single-dimension selection, where the dimensions are assumed to be independent of each other, so that a given dimension is affected only by the same dimension.
Neighboring-dimension selection, where a given dimension is assumed to be affected by the same dimension and its neighboring dimensions.

The estimation function above was derived using the assumption used in all-dimension selection. However, our experiments mainly used the single-dimension selection, which can be expressed as

{\hat{S}}_{d} (t) \approx \sum_{k = - R}^{L} γ_{d, k} Y_{d} (t - k), \quad \text{for } d = 1, 2, \dots, D,

(16)

where d is the feature dimension index and D is the total number of feature dimensions.

By using the single-dimension selection, we could define several assumptions on the transformation of each feature dimension caused by the RTF. In our case, the assumptions on the transformation affect the number of NNs that should be used. Thus, we defined several NN configurations as follows:

Single NN configuration, where it is assumed that the transformation is the same for every dimension, so one NN is used to transform all dimensions of the feature.
Basic multiple NNs configuration, where it is assumed that the transformation differs for every dimension, so one NN is used to transform one specific dimension.
Modified multiple NNs configuration, where it is assumed that the transformation is the same for several neighboring dimensions, so one NN is used to transform more than one dimension (neighboring dimensions).

The basic multiple NNs configuration, in which one transformation function (in the form of NN) is used to transform one dimension, corresponds to Equation 16. Meanwhile, the single NN configuration, in which one transformation function (in the form of NN) is used to transform all dimensions, corresponds to Equation 17. Note that Equations 16 and 17 are linear mapping functions, while transformation by NN is a non-linear mapping. As shown in[26], the use of linear mapping is not good enough to do dereverberation.

{\hat{S}}_{d} (t) \approx \sum_{k = - R}^{L} γ_{k} Y_{d} (t - k), for d = 1, 2, \dots, D .

(17)
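The difference between Equations 16 and 17 can be illustrated with a minimal sketch. It assumes frames are stored as plain lists and that the weights γ are already known. The function name `linear_map` and the numeric values are hypothetical and are not taken from the experiments. A single shared weight set plays the role of Equation 17, while one weight set per dimension plays the role of Equation 16.

```python
def linear_map(Y, gamma, t, L, R):
    """Linear estimate of the anechoic frame at time t (Eqs. 16 and 17).

    Y     : list of frames, each a list of D log-melspectral coefficients
    gamma : dict {k: weight} shared by all dimensions (Eq. 17), or a list
            of D such dicts, one per dimension (Eq. 16)
    k runs from -R (future frames) to L (past frames), as in the equations.
    """
    D = len(Y[t])
    shared = isinstance(gamma, dict)
    s_hat = []
    for d in range(D):
        g = gamma if shared else gamma[d]
        s_hat.append(sum(g[k] * Y[t - k][d] for k in range(-R, L + 1)))
    return s_hat


# Hypothetical toy data: 3 frames, D = 2 dimensions, one past and one future frame.
Y = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
shared_gamma = {-1: 0.5, 0: 1.0, 1: 0.25}
per_dim_gamma = [shared_gamma, {-1: 0.0, 0: 1.0, 1: 0.0}]

shared_result = linear_map(Y, shared_gamma, t=1, L=1, R=1)
per_dim_result = linear_map(Y, per_dim_gamma, t=1, L=1, R=1)
```

In the proposed method, each weighted sum is replaced by an NN, so this sketch only shows how the input frames are indexed and how weights are shared across dimensions.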

5 The proposed dereverberation method

Figure 4 shows the block diagram of the proposed dereverberation method. In general, the method can be divided into segment-based normalization, feature scaling, and feature mapping using NNs. The inputs of the method are Y(t - L),…,Y(t - 1),Y(t),Y(t + 1),…,Y(t + R), which are the past frames, the current frame, and the future frames of the reverberant log-melspectral coefficient vector, and the output is $\hat{S} (t)$ , which is the estimated current anechoic log-melspectral coefficient vector.

5.1 Segment-based normalization

Segment-based normalization is employed to deal with the power difference between the anechoic speech signal and the reverberant signal captured by a distant-talking microphone and to normalize the loudness of the speech utterance. In the NN training stage, this is done by normalizing the current reverberant feature vector and the current anechoic feature vector (which is the training target) to the normalization target. In addition, segment-based normalization preserves the relative variation of the power envelope within a segment by normalizing the surrounding frames relative to the current frame. The normalization is done using Equations 18 and 19 below.

δ (t) = Δ - \frac{1}{D} \sum_{d = 1}^{D} Y_{d} (t),

(18)

{\bar{Y}}_{d} (t - k) = Y_{d} (t - k) + δ (t), for d = 1, 2, \dots, D, and - R \leq k \leq L,

(19)

where δ(t) is the normalization factor, ${\bar{Y}}_{d} (t)$ is the normalized log-melspectral coefficient for feature dimension d and time index t, D is the number of feature dimensions, and Δ is the normalization target.

The mean of NN output $\bar{S} (t)$ should be equal to the normalization target because the target of NN training was also normalized. Therefore, the denormalization (Equation 20) is used to recover its original mean of power.

{\hat{S}}_{j} (t) = {\bar{S}}_{j} (t) - δ (t) .

(20)

Strictly speaking, using the normalization factor δ(t) calculated from the reverberant feature vector (distant-talking speech utterance) is not the best way to estimate the clean feature vector (close-talking speech utterance), because the power levels of the clean and reverberant signals are most likely not identical due to the influence of the RIR. However, the distant-talking speech utterance is the only input to the dereverberation method, and we assume that the distance is unknown, so using δ(t) is the most reasonable way to recover the power level of a frame relative to its surrounding frames. Thus, the estimate $\hat{S} (t)$ in Equation 20 can be regarded as an attenuated close-talking speech utterance. Although it is not the best approach, the denormalization could remarkably improve the output, as can be seen in Figure 5.
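Equations 18 to 20 can be sketched as follows. This is a minimal illustration that assumes the segment is a list of frames ordered from t - L to t + R. The function names and the normalization target Δ = 0 are hypothetical choices for the example, not values reported here.

```python
def normalize_segment(segment, current_index, target):
    """Shift every frame of the segment by delta(t) (Eqs. 18 and 19).

    segment       : list of frames ordered from t - L to t + R
    current_index : position of frame t inside the segment (equal to L)
    target        : normalization target (Delta)
    """
    current = segment[current_index]
    delta = target - sum(current) / len(current)                    # Eq. 18
    normalized = [[y + delta for y in frame] for frame in segment]  # Eq. 19
    return normalized, delta


def denormalize(s_bar, delta):
    """Recover the original mean power of the NN output (Eq. 20)."""
    return [s - delta for s in s_bar]


# Hypothetical segment: one past frame, the current frame, one future frame.
segment = [[1.0, 3.0], [2.0, 4.0], [0.0, 2.0]]
normalized, delta = normalize_segment(segment, current_index=1, target=0.0)
restored = denormalize(normalized[1], delta)
```

Because the same δ(t) is added to every frame of the segment, the differences between frames are unchanged, which is how the relative variation of the power envelope is preserved.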

Figure 5 shows spectrograms of the input and outputs of a dereverberation process. Utterance-based normalization (based on the maximum value) was applied when creating the spectrograms to ease observation, because there is a power difference between the close-talking speech utterance (clean feature vectors) and the distant-talking speech utterance (reverberant feature vectors). Figure 5A shows the feature vectors of the close-talking utterance, which was recorded from a distance of 25 cm. Figure 5B shows the feature vectors of the corresponding distant-talking utterance, which was recorded from a distance of about 4.0 m. Figure 5C,D shows the dereverberated feature vectors without and with denormalization, respectively. With denormalization, we obtained a better estimate of the clean feature vectors, especially for non-speech segments. The difference in the non-speech segments between Figure 5C and D can be easily observed, for example, between frames 290 and 330.

5.2 Feature scaling

The feature scaling consists of scaling and de-scaling processes. In general, the scaling and de-scaling can be regarded as the pre-processing and post-processing for the NNs. The scaling is done so that the NN input and output have values ranging from about -1 to 1. The constants τ and κ were used for this purpose, and their values were determined empirically from preliminary experiments. Conversely, the de-scaling recovers the log-melspectral coefficient value from its scaled value. The scaling and de-scaling are done using Equations 21 and 22, respectively.

{\bar{Y}}_{d}^{'} (t - k) = \frac{{\bar{Y}}_{d} (t - k) + τ}{2^{κ}}, for - R \leq k \leq L,

(21)

{\bar{S}}_{d} (t) = {\bar{S}}_{d}^{'} (t) * 2^{κ} - τ .

(22)
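The scaling and de-scaling in Equations 21 and 22 are inverses of each other, as the short sketch below illustrates. The values τ = 2 and κ = 3 are hypothetical and chosen only so that the arithmetic is easy to follow; the empirically determined constants are not given in this section.

```python
def scale(y_bar, tau, kappa):
    """Map a normalized coefficient into the NN input range (Eq. 21)."""
    return (y_bar + tau) / 2 ** kappa


def descale(s_scaled, tau, kappa):
    """Map a scaled NN output back to a log-melspectral value (Eq. 22)."""
    return s_scaled * 2 ** kappa - tau


TAU, KAPPA = 2.0, 3  # hypothetical constants, for illustration only

upper = scale(6.0, TAU, KAPPA)     # (6 + 2) / 8
lower = scale(-10.0, TAU, KAPPA)   # (-10 + 2) / 8
round_trip = descale(scale(1.5, TAU, KAPPA), TAU, KAPPA)
```

With these example constants, normalized coefficients between -10 and 6 are mapped into the interval from -1 to 1.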

5.3 Feature mapping using neural networks

In matrix form, Equation 14 could be written as

\hat{S} = G Y,

(23)

where $\hat{S}$ denotes the estimated anechoic feature vector, Y denotes the supervector which consists of reverberant feature vectors, and G denotes the transformation matrix which represents the RTF compensation. In our work, a non-linear regression is done to determine the function G such that

\underset{G}{argmin} {∥ S - (G \otimes Y) ∥}^{2},

(24)

where S is the anechoic (reference) feature vector and ⊗ denotes a non-linear transformation. The regression is done by the NN training algorithm, and the NNs resulting from the training are used as the transformation function G. Thus, the NNs are the functions that map the reverberant feature vectors Y to the anechoic feature vector S.

We use cascade NNs trained using the Cascade2 algorithm. The algorithm is chosen because our task is a regression task and Cascade2 is a better algorithm for this task than CasCor [36].

Figure 6 shows an illustration of the cascade NN that we used in our work, with N input neurons, M hidden neurons, and one output neuron. Figure 7 shows the same NN in the conventional MLP representation. The input neurons are represented by y₁,y₂,…,y_N, the hidden neurons are represented by h₁,h₂,…,h_M, and the output neuron is represented by s. In addition, we use one bias neuron b. The neurons y_i and b are connected to h₁,h₂,…,h_M,s and the neuron s is connected to y₁,y₂,…,y_N,h₁,h₂,…,h_M,b. The connection weight between neurons n₁ and n₂ is represented by w(n₁,n₂).

The NN inputs y₁,y₂,…,y_N in Figure 6 correspond to ${\bar{Y}}_{d}^{'} (t - L), \dots, {\bar{Y}}_{d}^{'} (t - 1), {\bar{Y}}_{d}^{'} (t), {\bar{Y}}_{d}^{'} (t + 1), \dots, {\bar{Y}}_{d}^{'} (t + R)$ in Figure 4, which are the scaled values of the dereverberation input segment. Meanwhile, the NN output s corresponds to ${\bar{S}}_{d}^{'} (t)$ , which is the scaled value of the estimated clean log-melspectral coefficient for frame t and dimension d.

We use the implementation of the Cascade2 algorithm with the RPROP (resilient propagation) weight update algorithm, which is an advanced variation of the batch backpropagation algorithm [35, 42], in the Fast Artificial Neural Network library (FANN) [43, 44]. A linear activation function is used for the output neuron, while the hidden neurons use a symmetric sigmoid (tanh) function. For the hidden neurons, we provide four candidate steepness values, i.e., 0.25, 0.50, 0.75, and 1.00, and the training algorithm chooses the best steepness value for each hidden neuron. Equation 25 expresses the linear activation function and Equation 26 expresses the tanh activation function, where x_af and y_af are the input and output of the activation function, respectively, and z is the steepness.

y_{af} = z x_{af},

(25)

y_{af} = tanh (z x_{af}) = \frac{2}{1 + exp (- 2 z x_{af})} - 1 .

(26)

Thus, the function of our output neuron s, which corresponds to ${\bar{S}}_{d}^{'} (t)$, can be expressed as

s = z (\sum_{i = 1}^{N} y_{i} w (y_{i}, s) + \sum_{j = 1}^{M} h_{j} w (h_{j}, s) + w (b, s)),

(27)

where z is the activation function steepness, y_i are the input neurons, which correspond to Y, w(n₁,n₂) are the connection weights between neurons n₁ and n₂, b is the bias neuron, and h_j are the hidden neurons, whose function can be expressed as

h_{j} = \left\{ \begin{array}{ll} tanh (z p) & for j = 1, \\ tanh (z q) & otherwise, \end{array} \right.

(28)

in which

p = \sum_{i = 1}^{N} y_{i} w (y_{i}, h_{j}) + w (b, h_{j}),

(29)

q = \sum_{i = 1}^{N} y_{i} w (y_{i}, h_{j}) + \sum_{k = 1}^{j - 1} h_{k} w (h_{k}, h_{j}) + w (b, h_{j}) .

(30)
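The forward pass defined by Equations 27 to 30 can be sketched in a few lines. The sketch assumes that the bias neuron outputs 1, so that it contributes only its weight, and that weights are stored in flat lists. The function name `cascade_forward`, the storage layout, and the numeric values are hypothetical. This is not the FANN implementation.

```python
import math


def cascade_forward(y, w_hidden, w_out, z_hidden, z_out=1.0):
    """Forward pass of a cascade network following Eqs. 27-30.

    y        : list of N scaled inputs
    w_hidden : list of M weight lists; the list for hidden neuron j
               (0-based) holds N input weights, then j weights from the
               earlier hidden neurons, then one bias weight
    w_out    : N input weights, then M hidden weights, then one bias weight
    z_hidden : list of M steepness values for the hidden neurons
    z_out    : steepness of the linear output neuron
    """
    n = len(y)
    h = []
    for j, w in enumerate(w_hidden):
        act = sum(yi * wi for yi, wi in zip(y, w[:n]))         # inputs
        act += sum(hk * wk for hk, wk in zip(h, w[n:n + j]))   # earlier hidden neurons
        act += w[n + j]                                        # bias weight
        h.append(math.tanh(z_hidden[j] * act))                 # Eq. 28
    m = len(h)
    out = sum(yi * wi for yi, wi in zip(y, w_out[:n]))
    out += sum(hj * wj for hj, wj in zip(h, w_out[n:n + m]))
    out += w_out[n + m]
    return z_out * out, h                                      # Eq. 27


# Hypothetical network with N = 2 inputs and M = 2 hidden neurons.
y = [1.0, 2.0]
w_hidden = [[0.5, 0.0, 0.0],        # h1 sees the inputs and the bias
            [0.0, 0.0, 1.0, 0.0]]   # h2 additionally sees h1
w_out = [0.0, 0.0, 0.0, 1.0, 0.0]
s, h = cascade_forward(y, w_hidden, w_out, z_hidden=[1.0, 0.5])

# With no hidden neurons, the network reduces to a linear model.
s_linear, _ = cascade_forward(y, [], [0.5, -0.25, 0.1], z_hidden=[])
```

Each hidden neuron receives the outputs of all earlier hidden neurons, which is why every hidden neuron forms its own layer in the conventional MLP representation.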

Because the output neuron uses a linear activation function, training with the CasCor algorithm initially fits the transformation function by linear regression. The fit becomes a non-linear regression once the training process starts to add hidden neurons to the network.

We use several termination criteria for the training, i.e., the maximum number of hidden neurons, the maximum and minimum numbers of epochs for candidate and output training, and the mean squared error. The number of hidden neurons is set proportionally to the number of NN inputs (neurons). Therefore, the size of the network depends on the length of the input segment. For the default configuration, we set the maximum number of hidden neurons to twice the number of NN inputs (M ≤ 2N). The number of hidden layers is equal to the number of hidden neurons.

With the above limit on the number of hidden neurons, our NN remains compact. For example, if we use a nine-frame input segment, the network has at most 29 neurons in total, consisting of 9 input neurons, 1 bias neuron, 18 hidden neurons, and 1 output neuron. In addition, in our experiments, we obtained good performance by training this compact NN with relatively few training samples. This agrees with the statement in [37] that the CasCor algorithm can learn fast while still creating a reasonably small network that generalizes well.
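The neuron count in this example can be reproduced with a short calculation. The connection count below is not stated in the text. It is derived here from Equations 27 to 30 under the assumption that the network is fully grown to M = 2N hidden neurons, so it should be read as an upper bound. The function name `cascade_size` is hypothetical.

```python
def cascade_size(n_inputs, hidden_ratio=2):
    """Upper bounds on the size of a cascade NN with M = hidden_ratio * N."""
    m = hidden_ratio * n_inputs
    neurons = n_inputs + 1 + m + 1  # inputs + bias + hidden + output
    # Hidden neuron j (0-based) has N input weights, j weights from the
    # earlier hidden neurons, and one bias weight (Eqs. 29 and 30).
    hidden_weights = sum(n_inputs + j + 1 for j in range(m))
    # The output neuron has N input, M hidden, and one bias weight (Eq. 27).
    output_weights = n_inputs + m + 1
    return {"hidden": m, "neurons": neurons,
            "weights": hidden_weights + output_weights}


size = cascade_size(9)  # nine-frame input segment
```

For the nine-frame segment, this gives 18 hidden neurons, 29 neurons, and at most 361 connection weights per NN.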

6 Evaluation using automatic speaker identification system

6.1 Overview of the automatic speaker identification system

Figure 8 depicts the general experimental setup used in the evaluation using the SID system. The NN training and feature mapping used 24-dimensional log-melspectral features, while the speaker model training and identification used 12-dimensional melcepstral features (MFCC). For these experiments, we needed to create three datasets, i.e., a speaker model training dataset, an NN training dataset, and a testing dataset.

The figure also shows the use of CMN after cepstral feature extraction. For the experiments using the real dataset, CMN is necessary because we need to remove the noise and reverberation in the close-talking utterances used to train the speaker models. It may not be necessary for the experiments using the simulated dataset because we have anechoic utterances to train the speaker models. Nevertheless, we also used CMN in the experiments using the simulated dataset. For the experiments using the real dataset, we evaluated speaker models trained using the original MFCC and also the normalized MFCC (by CMN). Meanwhile, for the experiments using the simulated dataset, we evaluated speaker models trained using the normalized MFCC only.

The SID system used speaker-specific GMMs as the speaker models [45]. Each speaker was represented by a D-variate GMM as

λ = \{c_{i}, μ_{i}, Σ_{i}\}, for i = 1, 2, \dots, M,

(31)

where c_i is the mixture weight, μ_i is the mean vector, Σ_i is the covariance matrix, and M is the number of mixture components. In our experiments, M = 32 was used. The GMM parameters were estimated using the standard maximum likelihood (ML) estimation method via the expectation maximization (EM) algorithm. For a sequence of T test vectors X = x₁,x₂,…,x_T, the GMM log-likelihood can be calculated using

L (X | λ) = log p (X | λ) = \sum_{t = 1}^{T} log p (x_{t} | λ) .

(32)
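Equation 32 can be sketched as below. Two assumptions are made that the text does not state: the covariance matrices are diagonal, and the identified speaker is the one whose model gives the highest log-likelihood. All function names and the toy models are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumption)."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(X, gmm):
    """Sum of the frame log-likelihoods log p(x_t | lambda) (Eq. 32).

    gmm : list of (weight, mean vector, variance vector) tuples
    """
    total = 0.0
    for x in X:
        logs = [math.log(c) + log_gauss_diag(x, mu, var) for c, mu, var in gmm]
        peak = max(logs)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(v - peak) for v in logs))
    return total


def identify(X, speaker_models):
    """Pick the speaker whose GMM maximizes the log-likelihood (assumed rule)."""
    return max(speaker_models,
               key=lambda name: gmm_log_likelihood(X, speaker_models[name]))


# Hypothetical one-dimensional, single-component models for two speakers.
models = {"A": [(1.0, [0.0], [1.0])],
          "B": [(1.0, [5.0], [1.0])]}
best = identify([[0.1], [-0.2]], models)
single = gmm_log_likelihood([[0.0]], models["A"])
```

The log-sum-exp step does not change the value of Equation 32. It only avoids numerical underflow when the component densities are very small.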

A GMM is used to model the speaker identity because the Gaussian components can represent some general speaker-dependent spectral shapes and Gaussian mixtures can model arbitrary densities [46]. In this work, we focused on developing a dereverberation approach rather than on improving the identification accuracy with a discriminative classification approach, so the GMM approach should be sufficient for evaluating the proposed dereverberation method. Consideration of the state of the art in speaker identification is beyond the scope of this work.

Voice activity detection (VAD) was also employed to remove the silent parts at the beginning and end of the recordings. For the simulated reverberant data, the VAD was done automatically in the melcepstral domain based on the frame log-energy coefficients. For the real reverberant data, however, the VAD was done manually because the SNRs were low in utterances recorded from very distant positions, e.g., from 4.0 m, which made our current automatic VAD unreliable.

6.2 Experiments using simulated reverberant data

6.2.1 Dataset description

The clean speech data were taken from the newspaper reading part of the JNAS database [47]. The speech data of 100 male speakers were used. On average, each speaker has 105 utterances. The RIR and noise data were taken from Aurora-5 [48], while the simulation program was SImulation of REal Acoustics (SIREAC) [49, 50]. The simulation program was used to generate the reverberant speech data from the clean speech and the RIR. The program can also add noise to the signal.

We created the simulated reverberant data by using the RIRs of ‘office’ and ‘livingroom’ with a reverberation time (T₆₀) of 400 ms. In addition, we created simulated noisy reverberant data by adding the ‘office’ noise with signal-to-noise ratios (SNRs) of 20 and 10 dB to the ‘office’ reverberant data.

6.2.2 Experimental setup

The GMMs for the speaker identification system were trained using 500 clean utterances (100 speakers, 5 utterances for each speaker). The utterances were selected randomly but constrained by a file size requirement so that the average duration of the utterances after VAD was about 3 s. CMN was also employed as pre-processing of the GMM training data.

A pool of training data was created for each environment. Each pool consisted of 25 pairs of a clean utterance and the corresponding simulated (noisy) reverberant utterance. The utterances were selected randomly but constrained by a file size requirement so that the average duration of the utterances after VAD was about 7 s. From this pool of training data, the ‘1u’ (1 pair of utterances by 1 speaker), ‘5u’ (5 pairs of utterances by 5 speakers), ‘10u’ (10 pairs of utterances by 10 speakers), and ‘15u’ (15 pairs of utterances by 15 speakers) NN training datasets were created.

A testing dataset was also created for each type of simulated (noisy) reverberant data. Each dataset consisted of 1,000 simulated (noisy) reverberant utterances (100 speakers, 10 utterances for each speaker). The utterances were selected randomly but constrained by the file size requirement so that the average duration of the utterances after VAD was about 5 s. Note that the utterances in the testing dataset contained different contents (sentences) from the utterances used to train the GMMs.

The experiments were done by using causal reverberation model (using left context only) and non-causal reverberation model (using left and right context) on known environments (matched conditions). We did experiments on the use of single NN (1 NN for 24 feature dimensions) and multiple NNs (1 NN for 1 feature dimension) configurations. In addition, we did experiments using linear and skip1 frame selection.

The NN training used random weight initialization, so variations in the final NN were expected. All experimental results below are therefore averages of three experiments, where each experiment consists of a training phase and a testing phase.

6.2.3 Experimental results

The baseline for each type of simulated (noisy) reverberant data is shown in Table 1. For the noisy reverberant data, we only experimented on the RIR of ‘office’, so the baseline for the RIR of ‘livingroom’ is not available. The identification rate for the clean version of the testing dataset was 98.0%. Note that we can regard this baseline as the result of enhancement using CMN because CMN was used as pre-processing of the GMM training data.

Table 1 Speaker identification baseline for each type of simulated (noisy) reverberant data
