Distant-talking speaker identification by generalized spectral subtraction-based dereverberation and its efficient computation

Abstract

Previously, a dereverberation method based on generalized spectral subtraction (GSS) using multi-channel least mean-squares (MCLMS) has been proposed. The results of speech recognition experiments showed that this method achieved a significant improvement over conventional methods. In this paper, we apply this method to distant-talking (far-field) speaker recognition. However, for far-field speech, the GSS-based dereverberation method using clean speech models degrades the speaker recognition performance. This may be because GSS-based dereverberation causes some distortion between clean speech and dereverberant speech. In this paper, we address this problem by training speaker models using dereverberant speech obtained by suppressing reverberation from arbitrary artificial reverberant speech. Furthermore, we propose an efficient computational method for a combination of the likelihood of dereverberant speech using multiple compensation parameter sets. This addresses the problem of determining optimal compensation parameters for GSS. We report the results of a speaker recognition experiment performed on large-scale far-field speech in reverberant environments that differ from the training environments. The proposed GSS-based dereverberation method achieves a recognition rate of 92.2%, which compares well with conventional cepstral mean normalization with delay-and-sum beamforming using a clean speech model (49.0%) and a reverberant speech model (88.4%). We also compare the proposed method with another dereverberation technique, multi-step linear prediction-based spectral subtraction (MSLP-GSS). The proposed method achieves a better recognition rate than the 90.6% of MSLP-GSS. The use of multiple compensation parameters further improves the speaker recognition performance, giving our approach a recognition rate of 93.6%. We implement this method in a real environment using the optimal compensation parameters estimated from an artificial environment. The results show a recognition rate of 87.8% compared with 72.5% for delay-and-sum beamforming using a reverberant speech model.

1 Introduction

Because of the existence of reverberation in far-field environments, the recognition performance for distant-talking speech/speakers is drastically degraded. The current approaches to automatic speech recognition (ASR)/speaker recognition that are robust to reverberation can be classified as speech signal processing (pre-processing), robust feature extraction, or model adaptation [1–4].

In this paper, we focus on speech signal processing for speaker identification. Beamforming is one of the simplest and most robust means of spatial filtering to suppress reverberation and background noise. This means it is able to discriminate between signals based on the physical location of their source [5]. Another general approach is cepstral mean normalization (CMN) [6, 7], which has been extensively examined as a simple and effective way of reducing reverberation by normalizing the cepstral features. Because of multiple reflections and diffusions of the sound waves, the energy of previous speech is smeared over time, and overlaps with subsequent speech. This results in a duration that is much longer than the window size of short-term spectral analysis, a problem known as late reverberation [8]. Therefore, the dereverberation of CMN is not completely effective in environments with late reverberation. Several studies have focused on mitigating the above problem [9–18]. In [9, 10], a method based on mean subtraction using a long-term spectral analysis window was proposed. The result showed that subtracting the mean of the log magnitude spectrum improved ASR performance. A blind deconvolution-based approach for restoring speech that has been degraded by the acoustic environment was proposed in [19]. This scheme processed the phase-only output from two microphones using cepstrum operations and signal reconstruction theory. In [12], a multi-channel speech dereverberation method based on spectral subtraction using a statistical model to estimate the power spectrum was proposed. In the study of [13], a new set of feature parameters based on the Hilbert envelope of Gammatone filterbank outputs was proposed to reduce the effect of room reverberation in speaker recognition. A novel approach for multi-microphone speech dereverberation was proposed in [14]. The method was based on the construction of a null subspace of the data matrix in the presence of colored noise, employing generalized singular-value decomposition or generalized eigenvalue decomposition of the respective correlation matrices. A method based on multi-step linear prediction (MSLP) was proposed in [15, 20]. The method first estimates late reverberations using long-term multi-step linear prediction, and then suppresses them with subsequent spectral subtraction. A reverberation compensation method for speaker recognition using spectral subtraction [16], in which late reverberation is treated as additive noise, was proposed in [18, 21]. However, the drawback of this approach is that the optimum parameters for spectral subtraction are empirically estimated from a development dataset, meaning that the late reverberation cannot be subtracted correctly as it is not precisely modeled.

Previously, Wang et al. presented a distant-talking speech recognition method based on generalized spectral subtraction (GSS) employing the multi-channel least mean-squares (MCLMS) algorithm [22]. They treated late reverberation as additive noise, and proposed a noise reduction technique based on GSS [23, 24] to estimate the spectrum of the clean speech using an approximated spectrum of the impulse response. To estimate the spectra of the impulse responses, a variable step-size unconstrained MCLMS algorithm for identifying the impulse responses in the time domain [1] was extended to the frequency domain. In theory, the early reverberation can also be removed by the GSS method. In practice, however, the estimation error of the channel impulse responses in the MCLMS step is inevitable, which results in an unreliable estimate of the power spectrum of clean speech. CMN, on the other hand, robustly reduces channel distortion within the spectral analysis window [25], so early reverberation was suppressed by CMN. A speech recognition experiment showed that the GSS-based dereverberation method achieved an average relative word error reduction rate of 32.6% compared with conventional CMN with beamforming [22].

GSS-based dereverberation was applied to the field of speech recognition in a previous study [22]. However, the effect of GSS-based dereverberation on distant-talking speaker recognition is still unknown. A preliminary experiment on speaker recognition with a GSS-based method showed that dereverberation using clean speech models degraded the speaker recognition performance, but was very effective for speech recognition. This may be because the GSS-based dereverberation method causes some distortion between the speaker characteristics of clean speech and dereverberant speech. We address this problem by training speaker models using dereverberant speech obtained by suppressing early and late reverberation from arbitrary artificial reverberant speech. We assumed that the distortion of speaker characteristics in the training and test data is similar, so the GSS-based dereverberation method should be effective for speaker recognition.

It is difficult to obtain optimal compensation parameter values (that is, the noise overestimation factor α and exponent parameter n defined in Equation 5) for GSS under different conditions. We assume that the optimal compensation parameters for GSS are dependent on the acoustic environment and utterance content. A fixed compensation parameter cannot robustly suppress reverberation for all conditions. Therefore, we propose a combination of the likelihood of dereverberant speech using multiple compensation parameters for GSS. However, the computational time of this combination method is proportional to the number of compensation parameter sets. To reduce the computational cost, N speaker models with the highest likelihood are obtained using a GSS without tuning (that is, α=n=1). Only these N-best speaker models are used to calculate the likelihood using GSS with other compensation parameters.

With regard to speaker recognition, various models have been studied. The Gaussian mixture model (GMM) has been widely used as a speaker model [26–28]. Its use is motivated by the fact that the Gaussian components represent some general speaker-dependent spectral shapes, and by the capability of Gaussian mixtures to model arbitrary densities. Artificial neural networks [29] and support vector machines [30] have been proposed as discriminative models for the boundary between speakers. Recently, joint factor analysis and total factors [31, 32] have been demonstrated as very effective mechanisms for speaker verification by compensating channel variability. The consideration of state-of-the-art speaker models is beyond the scope of the current study. Thus, in this paper, we use GMMs for speaker identification.

The remainder of this paper is organized as follows: Section 2 describes our distant-talking speaker identification system employing a dereverberation method. The outline of blind dereverberation based on SS is described in section 3. The combination of likelihoods with various compensation parameters and its efficient computation is proposed in section 4, and section 5 describes the experimental results of distant-talking speaker recognition in a reverberant environment. Finally, section 6 summarizes the paper.

2 Distant-talking speaker recognition system employing a dereverberation method

The performance of distant-talking speech/speaker recognition is degraded remarkably by reverberation. By removing reverberation, we can expect to improve the speech/speaker recognition performance. However, very little research has studied the difference between speech recognition and speaker recognition in a distant-talking environment. For speech recognition, it is necessary to maximize the inter-phoneme variation while minimizing the intra-phoneme variation in the feature space, whereas for speaker recognition, the focus is on speaker variation instead of phoneme variation. These characteristics mean that some methods that are effective in speech recognition may not be effective for speaker recognition, especially in a hands-free environment. For example, a simple and popular channel normalization method, CMN, removes both the transmission characteristics and speaker characteristics, leading to differences in the speaker recognition and speech recognition performance. A previous study [28] on distant-talking speaker recognition showed that conventional CMN gave much worse results than those without CMN, although it was very effective for speech recognition in a reverberant environment with a short reverberation time. In environments with little reverberation, CMN gives worse speaker recognition performance than no CMN, while the opposite is true in environments with strong reverberation. This is because CMN removes the speaker characteristics, which outweighs its benefit when the channel distortion (reverberation) is not very large. In the speech recognition field, GSS-based dereverberation using clean speech models showed a significant improvement [22]. However, in terms of speaker recognition, the experiment we describe in section 5 shows that it degrades the speaker recognition performance. This could be due to the GSS-based dereverberation method causing distortion between the speaker characteristics of clean speech and those of dereverberant speech.

To mitigate the distortion of speaker characteristics caused by dereverberation in the test stage, we obtain dereverberant speech by suppressing early and late reverberation from arbitrary artificial reverberant speech, and use this to train the speaker models. We assume that the speaker characteristics suffer similar distortion in the training data and test data. By employing dereverberation in both the training and test stages, the transmission characteristics can be removed and the relative speaker characteristics can be maximized. Compared with speaker models trained with reverberant speech, our method is expected to exhibit a better speaker recognition performance. In previous research, GMMs trained with reverberant speech have been used for distant-talking speaker recognition. However, the mismatch of distant-talking environments between the training condition and the test condition has still not been addressed. Furthermore, when late reverberations have a large amount of energy, the performance of speech/speaker recognition cannot be improved sufficiently, even with GMMs or hidden Markov models trained with a matched reverberant condition [4, 33]. This means that GMMs and hidden Markov models cannot handle severe late reverberations precisely. We can see the effect of the dereverberation step in speaker recognition in papers such as [18, 21, 34].

In this paper, we propose a distant-talking speaker recognition system employing a GSS-based dereverberation method. A schematic diagram of our proposed method is shown in Figure 1. In the training stage, clean speech is convoluted by arbitrary impulse responses to create artificial reverberant speech. This can reduce the experimental cost, because real reverberant speech is not necessary. We introduce GSS-based dereverberation in section 3. This is performed to suppress both early and late reverberations. Finally, the dereverberant speech is used to train speaker models. In the test stage, the reverberation of multi-channel distorted speech (artificial reverberant speech or real reverberant speech) is removed by the GSS-based dereverberation method, and then the dereverberant speech is used to perform distant-talking speaker recognition.
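The first training step described above, creating artificial reverberant speech, can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_reverberant` is hypothetical, and white noise with synthetic decaying tails stands in for JNAS speech and measured impulse responses.

```python
import numpy as np

def make_reverberant(clean, impulse_responses):
    """Convolve one clean utterance with each channel's impulse response.

    clean: 1-D array of samples.
    impulse_responses: list of 1-D arrays, one per microphone channel.
    Returns an array of shape (channels, samples), truncated to len(clean).
    """
    return np.stack([np.convolve(clean, h)[: len(clean)] for h in impulse_responses])

# Toy data (assumed): noise instead of speech, random decaying tails
# instead of impulse responses measured in a room.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
decay = np.exp(-np.arange(2000) / 400.0)
impulse_responses = [rng.standard_normal(2000) * decay for _ in range(4)]
reverberant = make_reverberant(clean, impulse_responses)  # shape (4, 16000)
```

In the system described here, the multi-channel output would then go through dereverberation before the speaker models are trained.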

Figure 1 Schematic diagram of distant-talking speaker recognition system.

3 Outline of blind dereverberation

3.1 Dereverberation based on GSS

If speech $s[t]$ is corrupted by convolutional noise $h[t]$, the observed speech $x[t]$ becomes

$$x[t] = h[t] * s[t], \tag{1}$$

where * denotes the convolution operation. If the length of the impulse response is much smaller than the size T of the analysis window used for the discrete-time Fourier transform (DTFT), the DTFT of the distorted speech equals that of the clean speech multiplied by the DTFT of the impulse response h[t]. However, if the length of the impulse response is much greater than the analysis window size, the DTFT of the distorted speech is usually approximated by

$$X(f,\omega) \approx S(f,\omega) * H(\omega) = S(f,\omega)H(0,\omega) + \sum_{d=1}^{D-1} S(f-d,\omega)H(d,\omega), \tag{2}$$

where f is the frame index, H(ω) is the STFT of the impulse response, S(f,ω) is the STFT of clean speech s, and H(d,ω) denotes the part of H(ω) corresponding to the frame delay d. That is, with a long impulse response, the channel distortion is no longer of a multiplicative nature in a linear spectral domain, but is instead convolutional.

In [22], Wang et al. proposed a dereverberation method based on GSS to estimate the STFT of the clean speech Ŝ(f,ω) based on Equation 2. Assuming that phases of different frames are noncorrelated for simplification, the power spectrum of Equation 2 can be approximated as Equation 3:

$$|X(f,\omega)|^2 \approx |S(f,\omega)|^2 |H(0,\omega)|^2 + \sum_{d=1}^{D-1} |S(f-d,\omega)|^2 |H(d,\omega)|^2. \tag{3}$$

The power spectrum $|\hat{X}(f,\omega)|^2$ obtained by reducing the late reverberation can be estimated as

$$\begin{aligned}
|\hat{X}(f,\omega)|^2 &= |\hat{S}(f,\omega)|^2 |\hat{H}(0,\omega)|^2 \\
&\approx \max\left\{ |X(f,\omega)|^2 - \alpha \sum_{d=1}^{D-1} |\hat{S}(f-d,\omega)|^2 |\hat{H}(d,\omega)|^2,\ \beta\,|X(f,\omega)|^2 \right\} \\
&= \max\left\{ |X(f,\omega)|^2 - \alpha \sum_{d=1}^{D-1} \frac{|\hat{X}(f-d,\omega)|^2 |\hat{H}(d,\omega)|^2}{|\hat{H}(0,\omega)|^2},\ \beta\,|X(f,\omega)|^2 \right\},
\end{aligned} \tag{4}$$

where $\alpha$ is the noise overestimation factor, $\beta$ is the spectral floor parameter for avoiding negative or underflow values, $|\hat{S}(f,\omega)|^2$ is the power spectrum of estimated clean speech, and $\hat{H}(d,\omega)$, $d = 0, 1, \ldots, D-1$, is the STFT of the impulse response, which can be blindly estimated by the MCLMS algorithm described in [22]. $D$ is the number of reverberation windows.

Previous studies have shown that GSS with an arbitrary exponent parameter is more effective than power SS for noise reduction [23, 24]. In this paper, GSS is used to suppress late reverberation, and early reverberation is compensated by subtracting the cepstral mean of the utterance at the feature extraction stage.

The spectrum $|\hat{X}(f,\omega)|^{2n}$ obtained by reducing the late reverberation can be estimated as

$$|\hat{X}(f,\omega)|^{2n} \approx \max\left\{ |X(f,\omega)|^{2n} - \alpha \sum_{d=1}^{D-1} \frac{|\hat{X}(f-d,\omega)|^{2n} |\hat{H}(d,\omega)|^{2n}}{|\hat{H}(0,\omega)|^{2n}},\ \beta\,|X(f,\omega)|^{2n} \right\}, \tag{5}$$

where $|\hat{X}(f,\omega)|^{2n} = |\hat{S}(f,\omega)|^{2n} |\hat{H}(0,\omega)|^{2n}$, $|\hat{S}(f,\omega)|^{2n}$ is the spectrum of estimated clean speech, and $n$ is the exponent parameter. When $n=1$, Equation 5 is a power spectral subtraction-based method.
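The recursion in Equation 5 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `gss_dereverb`, the array layout (frames by frequency bins), and the default value of `beta` are assumptions, and the skip-window handling used in the experiments is ignored. It assumes the impulse-response spectra are already available, for example from MCLMS.

```python
import numpy as np

def gss_dereverb(X_pow, H_pow, alpha=1.0, n=1.0, beta=0.15):
    """Frame-recursive generalized spectral subtraction (Equation 5).

    X_pow: |X(f, w)|^2, shape (frames, bins).
    H_pow: |H_hat(d, w)|^2, shape (D, bins), with H_pow[0] nonzero.
    beta=0.15 is a placeholder floor, not a value taken from the paper.
    Returns the estimate of |X_hat(f, w)|^(2n), shape (frames, bins).
    """
    Xn = X_pow ** n
    ratio = (H_pow[1:] / H_pow[0]) ** n          # |H(d)|^2n / |H(0)|^2n, d >= 1
    Xhat = np.empty_like(Xn)
    for f in range(Xn.shape[0]):
        late = np.zeros(Xn.shape[1])
        for d in range(1, H_pow.shape[0]):
            if f - d >= 0:                        # earlier frames only
                late += Xhat[f - d] * ratio[d - 1]
        # Subtract the estimated late reverberation, then apply the floor.
        Xhat[f] = np.maximum(Xn[f] - alpha * late, beta * Xn[f])
    return Xhat
```

With $\alpha = n = 1$ and exact impulse-response spectra, the recursion recovers $|S(f,\omega)|^2 |H(0,\omega)|^2$ whenever the observation follows Equation 3 exactly, which is a convenient sanity check for an implementation.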

A schematic diagram of our proposed GSS-based dereverberation method is shown in Figure 2. It uses the spectra of impulse responses, which are estimated by MCLMS, to reduce the late reverberation in reverberant speech. The spectrum of dereverberant speech is then inverted into the time domain, and delay-and-sum beamforming (see endnote a) is performed on the multi-channel speech. Finally, the early reverberation is normalized by CMN at the feature extraction stage.

Figure 2 Schematic diagram of GSS-MCLMS-based dereverberation method.
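The final CMN step of the pipeline is simple enough to show directly. The sketch below assumes a hypothetical feature matrix with one row per frame. It illustrates why CMN compensates distortion that is short relative to the analysis window: such distortion is approximately a constant additive offset in the cepstral domain, which the mean subtraction removes.

```python
import numpy as np

def cepstral_mean_normalize(cepstra):
    """Subtract the per-utterance mean of each cepstral coefficient.

    cepstra: array of shape (frames, coefficients).
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Toy check (assumed data): a constant channel offset does not survive CMN.
rng = np.random.default_rng(0)
features = rng.standard_normal((50, 12))
channel_offset = rng.standard_normal(12)
same = np.allclose(cepstral_mean_normalize(features + channel_offset),
                   cepstral_mean_normalize(features))
```

The same subtraction also removes the long-term average of the speaker's own spectrum, which is consistent with the observation in section 2 that CMN can hurt speaker recognition when reverberation is mild.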

3.2 Compensation parameter estimation for GSS by MCLMS

In [1, 35–37], an adaptive MCLMS algorithm for blind single-input multiple-output (SIMO) system identification was proposed.

A variable step-size unconstrained MCLMS (VSS-UMCLMS) algorithm was proposed to minimize the cost function J in the time-domain [37]. Wang et al. [38] extended the time-domain VSS-UMCLMS algorithm to the frequency domain to estimate the compensation parameters for GSS-based dereverberation.

In the absence of additive noise, we can take advantage of the fact that

$$X_i * H_j = S * H_i * H_j = X_j * H_i, \quad i,j = 1,2,\ldots,N,\ i \neq j, \tag{6}$$

and have the following relation at frequency $\omega$ of frame $d$:

$$\mathbf{X}_i^T(d)\,\mathbf{H}_j(d) = \mathbf{X}_j^T(d)\,\mathbf{H}_i(d), \quad i,j = 1,2,\ldots,N,\ i \neq j, \tag{7}$$

where $\mathbf{H}_i(d)$ is the $i$th impulse response at frame index $d$ and

$$\mathbf{X}_i(d) = \left[ X_i(d)\ \ X_i(d-1)\ \ \cdots\ \ X_i(d-D+1) \right]^T, \quad i = 1,2,\ldots,N,$$

where $X_i(d)$ is the speech signal received from the $i$th channel at frame $d$ and $D$ is the number of frames of the impulse response. Multiplying Equation 7 by $\mathbf{X}_i(d)$ and taking the expectation yields

$$\mathbf{R}_{X_iX_i}(d)\,\mathbf{H}_j(d) = \mathbf{R}_{X_iX_j}(d)\,\mathbf{H}_i(d), \quad i,j = 1,2,\ldots,N,\ i \neq j, \tag{8}$$

where $\mathbf{R}_{X_iX_j}(d) = E\{\mathbf{X}_i(d)\,\mathbf{X}_j^T(d)\}$. Equation 8 comprises $N(N-1)$ distinct equations. By summing the $N-1$ cross-correlations associated with one particular channel $\mathbf{H}_j(d)$, we get

$$\sum_{i=1,\,i\neq j}^{N} \mathbf{R}_{X_iX_i}(d)\,\mathbf{H}_j(d) = \sum_{i=1,\,i\neq j}^{N} \mathbf{R}_{X_iX_j}(d)\,\mathbf{H}_i(d), \quad j = 1,2,\ldots,N. \tag{9}$$

Over all channels, we then have a total of N equations. In matrix form, this set of equations is written as

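$$\mathbf{R}_{X+}(d)\,\mathbf{H}(d) = \mathbf{0}, \tag{10}$$

where

$$\mathbf{R}_{X+}(d) = \begin{bmatrix}
\sum_{n\neq 1}\mathbf{R}_{X_nX_n}(d) & -\mathbf{R}_{X_2X_1}(d) & \cdots & -\mathbf{R}_{X_NX_1}(d) \\
-\mathbf{R}_{X_1X_2}(d) & \sum_{n\neq 2}\mathbf{R}_{X_nX_n}(d) & \cdots & -\mathbf{R}_{X_NX_2}(d) \\
\vdots & \vdots & \ddots & \vdots \\
-\mathbf{R}_{X_1X_N}(d) & -\mathbf{R}_{X_2X_N}(d) & \cdots & \sum_{n\neq N}\mathbf{R}_{X_nX_n}(d)
\end{bmatrix}, \tag{11}$$

$$\mathbf{H}(d) = \left[ \mathbf{H}_1(d)^T\ \ \mathbf{H}_2(d)^T\ \ \cdots\ \ \mathbf{H}_N(d)^T \right]^T, \tag{12}$$

$$\mathbf{H}_n(d) = \left[ H_n(d,0)\ \ H_n(d,1)\ \ \cdots\ \ H_n(d,D-1) \right]^T, \tag{13}$$

where $H_n(d,l)$ is the $l$th frame of the $n$th impulse response at the corresponding frame $d$. If the SIMO system is blindly identifiable, the matrix $\mathbf{R}_{X+}$ is rank deficient by 1 (in the absence of noise) and the channel impulse responses can be uniquely determined.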

When the estimated channel impulse responses deviate from the true value, the error vector at frame d is produced by:

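$$\mathbf{e}(d) = \tilde{\mathbf{R}}_{X+}(d)\,\hat{\mathbf{H}}(d), \tag{14}$$

$$\tilde{\mathbf{R}}_{X+}(d) = \begin{bmatrix}
\sum_{n\neq 1}\tilde{\mathbf{R}}_{X_nX_n}(d) & -\tilde{\mathbf{R}}_{X_2X_1}(d) & \cdots & -\tilde{\mathbf{R}}_{X_NX_1}(d) \\
-\tilde{\mathbf{R}}_{X_1X_2}(d) & \sum_{n\neq 2}\tilde{\mathbf{R}}_{X_nX_n}(d) & \cdots & -\tilde{\mathbf{R}}_{X_NX_2}(d) \\
\vdots & \vdots & \ddots & \vdots \\
-\tilde{\mathbf{R}}_{X_1X_N}(d) & -\tilde{\mathbf{R}}_{X_2X_N}(d) & \cdots & \sum_{n\neq N}\tilde{\mathbf{R}}_{X_nX_n}(d)
\end{bmatrix}, \tag{15}$$

where $\tilde{\mathbf{R}}_{X_iX_j}(d) = \mathbf{X}_i(d)\,\mathbf{X}_j^T(d)$, $i,j = 1,2,\ldots,N$, and $\hat{\mathbf{H}}(d)$ is the estimated model filter at frame $d$. Here, the tilde in $\tilde{\mathbf{R}}_{X_iX_j}$ distinguishes this instantaneous value from its mathematical expectation $\mathbf{R}_{X_iX_j}$.

This error can be used to define a cost function at frame $d$

$$J(d) = \|\mathbf{e}(d)\|^2 = \mathbf{e}(d)^T \mathbf{e}(d). \tag{16}$$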

By minimizing the cost function J in Equation 16, the impulse response can be blindly derived.
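The cross relation in Equation 6, on which the cost function is built, can be checked numerically. The sketch below is a simplified time-domain illustration, not the frequency-domain VSS-UMCLMS algorithm: the function name `cross_relation_cost` and the toy signals are assumptions, and the cost is the plain sum of squared pairwise cross-relation errors rather than the matrix form of Equations 14 to 16.

```python
import numpy as np

def cross_relation_cost(x, h_hat):
    """Sum of squared errors x_i * h_j - x_j * h_i over all channel pairs.

    x: list of observed channel signals (equal lengths).
    h_hat: list of candidate impulse responses (equal lengths).
    """
    cost = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            e = np.convolve(x[i], h_hat[j]) - np.convolve(x[j], h_hat[i])
            cost += float(e @ e)
    return cost

# Toy SIMO system (assumed): one noise source, three random channels, no noise.
rng = np.random.default_rng(0)
s = rng.standard_normal(400)
h_true = [rng.standard_normal(16) for _ in range(3)]
x = [np.convolve(s, h) for h in h_true]

cost_true = cross_relation_cost(x, h_true)                    # numerically zero
h_wrong = [h + 0.1 * rng.standard_normal(16) for h in h_true]
cost_wrong = cross_relation_cost(x, h_wrong)                  # clearly positive
```

Scaling all true responses by the same constant also gives a zero cost, so the channels are identified only up to a common scale factor.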

3.3 Dereverberation method based on multiple-step linear prediction

In [15], MSLP was implemented for our reverberation calculation. Linear prediction is a method of generating an inverse filter through a prediction coefficient, which is an effective means of estimating the inverse system. In particular, multi-channel linear prediction can estimate the inverse filter blindly. For comparison with our proposed method, we introduce this alternative approach. A schematic diagram of MSLP is shown in Figure 3.

Figure 3 Schematic diagram of MSLP-based dereverberation method.

The reverberant speech $x(t)$ is

$$x[t] = h[t] * s[t] = \sum_{i=0}^{K-1} s(t-i)\,h(i), \tag{17}$$

where $K$ is the length of the impulse response. Considering the step size $D$ of early reverberation, Equation 17 can be rewritten as

$$x_m[t] = \sum_{k=0}^{D-1} h_m(k)\,s(t-k) + \sum_{k=D}^{K-1} h_m(k)\,s(t-k), \tag{18}$$

where $x_m(t)$ is the observed signal from the $m$th microphone. The first part of the right-hand side of this equation is the early reverberation, and the second part is the late reverberation. Using the MSLP method, we have

$$x_m[t] = \sum_{i=1}^{M} \sum_{k=0}^{L-1} w_{m,i}(k)\,x_i(t-D-k) + d_m(t), \tag{19}$$

where $M$ is the number of microphones, $L$ is the linear prediction order, and $w_{m,i}$ is the prediction coefficient. When $D=1$, we have multi-channel linear prediction. To calculate the appropriate $w_{m,i}$, the present signal of the $m$th microphone $x_m(t)$ should be represented as the sum of the weighted past signals delayed by at least $D$ samples (first term of Equation 19) and the signal $d_m(t)$ without late reverberation (second term of Equation 19).

After the optimization of $w_{m,i}$, the dereverberant speech can be calculated by the SS method. In [15], the $w_{m,i}$ are calculated by minimizing the mean square energy of the prediction residual.
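For intuition, the prediction in Equation 19 can be sketched for a single channel ($M = 1$) as an ordinary least-squares problem. This is an assumption-laden simplification of the multi-channel method in [15]: the function name `delayed_linear_prediction` is hypothetical, and only the time-domain prediction is shown, without the subsequent spectral subtraction.

```python
import numpy as np

def delayed_linear_prediction(x, delay, order):
    """Fit x[t] from x[t - delay - k], k = 0..order-1, by least squares.

    Returns (w, predicted, residual). `predicted` plays the role of the
    first term of Equation 19 and `residual` the role of d_m(t).
    """
    start = delay + order - 1                 # first t with all past samples available
    t = np.arange(start, len(x))
    # Column k holds x[t - delay - k] for every target index t.
    A = np.stack([x[t - delay - k] for k in range(order)], axis=1)
    target = x[t]
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    predicted = A @ w
    return w, predicted, target - predicted

# Toy reverberant signal (assumed): white noise through a decaying random tail.
rng = np.random.default_rng(0)
h = rng.standard_normal(300) * np.exp(-np.arange(300) / 60.0)
x = np.convolve(rng.standard_normal(4000), h)[:4000]
w, predicted, residual = delayed_linear_prediction(x, delay=32, order=64)
```

Because the fit minimizes the mean square energy of the residual, the residual energy can never exceed that of the target signal.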

4 Combination method and its efficient computation

It is difficult to determine the optimum exponent parameter n and the noise overestimation factor α for GSS. In this study, we use a combination of the various speaker model likelihoods with different compensation parameter sets.

When a combination of multiple methods is used to identify the speaker, the likelihood of speaker models with different compensation parameter sets is linearly coupled to produce a new score $L_{\mathrm{comb}}^{k}$, given by:

$$L_{\mathrm{comb}}^{k} = \frac{1}{I} \sum_{i=1}^{I} L_i^k, \quad k = 1,2,\ldots,K, \tag{20}$$

where $L_i^k$ is the likelihood produced by the $k$th speaker model with the $i$th compensation parameter set. $K$ is the number of registered speakers and $I$ denotes the number of compensation parameter sets. The speaker with the maximum likelihood is determined as the target speaker. As a result of this procedure, special tuning is not necessary for GSS.
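Equation 20 amounts to averaging a table of scores over the parameter sets and picking the best speaker. A minimal sketch follows, assuming a hypothetical array `L` of shape (I, K) whose entries are per-utterance scores such as GMM log-likelihoods; the numbers are invented purely for illustration.

```python
import numpy as np

def combine_likelihoods(L):
    """Average the scores over parameter sets and pick the best speaker.

    L[i, k]: likelihood of speaker model k under compensation parameter set i.
    Returns (combined scores of shape (K,), index of the identified speaker).
    """
    scores = L.mean(axis=0)
    return scores, int(np.argmax(scores))

# Invented scores for I = 2 parameter sets and K = 3 speakers.
L = np.array([[-10.0, -12.0, -11.0],
              [-13.0,  -9.0, -12.0]])
scores, best = combine_likelihoods(L)
print(scores, best)  # [-11.5 -10.5 -11.5] 1
```

Note that speaker 0 wins under the first parameter set alone, while the combined score selects speaker 1.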

However, the computational time increases linearly according to the number of compensation parameter sets. In this study, an efficient computational method is proposed. Coverage of the N-best speaker recognitions is illustrated in Figure 4b. The number of target speakers is 260. The result shows that the coverage is over 99% for the 10-best likelihoods, and almost 100% for the 50-best likelihoods, even in a distant-talking environment. That is, there is no need to calculate the likelihood of all speaker models in the combination stage.

The efficient computational method can be summarized as follows: Initially, the power SS (that is, compensation parameter $n=1$) is used to suppress the reverberation, and the likelihoods of all speaker models are calculated. Second, the speaker models with the top N-best likelihoods are used to calculate new likelihoods with the other compensation parameter sets. Finally, the likelihoods calculated with the different compensation parameter sets are combined to determine the target speaker.

In our previous work [22], the speech recognition performances using the DTFT of the impulse response estimated by MCLMS for each sentence and for each impulse response condition were almost the same. Therefore, in this paper, the same impulse response was used within each impulse response condition. The total computational time $T_A$ for speaker identification is about $T_M/s + T_F + T_L$, where $T_F$ and $T_L$ are the computational times for the feature extraction and the likelihood calculation of $K$ speaker models, and $T_M$ is the time for the MCLMS algorithm. As we run the MCLMS algorithm only once for each reverberation condition, the share of $T_M$ for a single utterance is $T_M/s$, where $s$ is the number of test sets. Because our experiment uses a large number of test sets, the value of $T_M/s$ is very small, and can be neglected here. The computational time for the conventional combination of results with $I$ parameter sets is $T_{A}^{\mathrm{comb}} = I(T_F + T_L) = I\,T_A$. The computational time for our proposed efficient combination method using the N-best likelihoods is

$$T_{E}^{\mathrm{comb}} = T_F + T_L + (I-1)\,T_F + (I-1)\frac{N}{K}\,T_L = T_A + \frac{1}{\gamma+1}\left( (I-1)\frac{N}{K}\gamma + I - 1 \right) T_A, \tag{21}$$

where $T_L$ equals $\gamma T_F$. The computational cost has therefore been decreased compared with the conventional combination method.

Figure 4 N-best coverage of distant-talking speaker recognition.
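The three-step procedure and the cost in Equation 21 can be sketched as follows. The names `nbest_combination`, `rescoring_fn`, and `efficient_cost_ratio` are hypothetical, the score table is random toy data, and restricting the final decision to the N-best candidates is an assumption about how the combined scores are compared.

```python
import numpy as np

def nbest_combination(first_pass, rescoring_fn, n_best, num_sets):
    """Identify the speaker using only the N-best models for rescoring.

    first_pass: (K,) likelihoods of all K speaker models with alpha = n = 1.
    rescoring_fn(i, speakers): likelihoods of `speakers` under parameter
        set i, for i = 1 .. num_sets - 1.
    """
    candidates = np.argsort(first_pass)[::-1][:n_best]   # top-N speaker indices
    total = first_pass[candidates].astype(float)
    for i in range(1, num_sets):
        total += rescoring_fn(i, candidates)
    scores = total / num_sets                             # Equation 20 on N models
    return int(candidates[np.argmax(scores)])

def efficient_cost_ratio(num_sets, n_best, num_speakers, gamma):
    """T_E^comb / T_A from Equation 21, with gamma = T_L / T_F."""
    extra = (num_sets - 1) * (n_best / num_speakers) * gamma + (num_sets - 1)
    return 1.0 + extra / (gamma + 1.0)

# Toy likelihood table (assumed): 10 parameter sets, 260 speakers.
rng = np.random.default_rng(0)
table = rng.standard_normal((10, 260))
best = nbest_combination(table[0], lambda i, spk: table[i, spk],
                         n_best=5, num_sets=10)
```

When `n_best` equals the number of speakers, the procedure reduces to the conventional combination and the cost ratio equals $I$, as Equation 21 predicts. The ratio falls as $\gamma$ grows, that is, when likelihood calculation dominates feature extraction.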

5 Experiments

5.1 Experimental setup

First, the proposed method for hands-free speaker identification was evaluated using artificial reverberant speech to determine the most suitable parameters. We then applied the method to real reverberant speech with these parameters (see endnote c).

To compare our work with another dereverberation method, we evaluated both our proposed method and multi-step linear prediction (MSLP) [15] in artificial and real reverberant environments.

Eight multi-channel impulse responses were selected from the Real World Computing Partnership (RWCP) sound scene database [39] and the CENSREC-4 database [40]. These were convolved with clean speech to create artificial reverberant speech. A large-scale database, the Japanese Newspaper Article Sentence (JNAS) [41] corpus, was used as clean speech. The training data comprised 10 utterances from each of 130 male and 130 female speakers. Each speaker gave 20 utterances for the test data. The average duration of the utterances was about 5.8 s.

Table 1 lists the impulse responses for the training and test sets. The microphone arrays are illustrated in Figure 5, and the channel numbers (corresponding to Figure 5) used for dereverberation are given in Table 2. For the RWCP database, a four-channel circular or linear microphone array was taken from a circular + linear microphone array (30 channels). The circular array had a diameter of 30 cm. The microphones in the linear microphone array were located at 2.83-cm intervals. Impulse responses were measured at several positions 2 m from the microphone array. For the CENSREC-4 database, four-channel microphones were taken from a linear microphone array (seven channels), with the microphones located at 2.125-cm intervals. Impulse responses were measured at several positions 0.5 m from the microphone array.

We also used reverberant speech from a real environment in our experiment. The speech was collected in a meeting room of size 7.7 m × 3.3 m × 2.5 m (D×W×H). The utterances were collected from 20 male speakers. Each speaker made 9 training utterances. In total, 400 test utterances were recorded. Speakers were seated on chairs (labeled A to E in Figure 6), and were recorded by a multi-channel recording device. The heights of the microphone array and the utterance position of each speaker were about 0.8 and 1.0 m, respectively. We used a nine-channel microphone array (Figure 6), and collected the test data using four distant channels (microphones 6, 7, 8, and 9). A pin microphone recorded speech in the distant-talking and close-talking environments. The training data were collected by a close microphone, and the CENSREC-4 database (CENSREC-4 impulse response) was used to produce artificial reverberant speech.

Table 1 Details of recording conditions for impulse response measurement

Figure 5 Illustration of microphone array. (a) CENSREC-4. (b) RWCP.

Table 2 Channel numbers corresponding to Figure 5 used for dereverberation

Figure 6 Illustration of recording settings and microphone array.

Table 3 gives the conditions for speaker identification. We used 25-dimensional mel-frequency cepstral coefficients (MFCCs) and GMMs with 128 mixtures. Table 4 gives the conditions for GSS-based dereverberation (the same for MCLMS- and MSLP-based methods). The parameters shown in Table 4 were determined empirically. An illustration of the analysis window is shown in Figure 7. For the proposed dereverberation method based on spectral subtraction, the previous clean spectra estimated with a skip window were used to estimate the current clean spectrum, since the frame shift was half the frame length in this study (see endnote d). The spectrum of the impulse response H(d,ω) was estimated for each utterance to be recognized.

This study compares five methods, which are described in Table 5. For each method, we performed CMN with delay-and-sum beamforming. Clean speech models, which were directly trained by clean speech, were used as speaker models for method 1 and method 2. For method 1, only CMN with beamforming was used to reduce the reverberation. For method 2, GSS-MCLMS-based dereverberation was performed at the test stage, which is the same condition as for hands-free speech recognition [22]. Reverberant speech models, which were trained using artificial reverberant speech with three types of CENSREC-4 impulse responses (see Table 1a), were used as speaker models for method 3. For comparison, method 4 used an existing MSLP-GSS with dereverberant speech in both the training and test data. Method 5 is our proposed method: the reverberation in both the training and test data was suppressed by GSS-MCLMS-based dereverberation, and the dereverberant speech was used to train dereverberant speech GMMs.

Table 3 Conditions for speaker recognition

Table 4 Conditions for GSS-based dereverberation

Figure 7 Illustration of the analysis window for spectral subtraction.

Table 5 Description of each speaker recognition method

5.2 Experimental results

5.2.1 Experimental results of artificial reverberant speech

The hands-free speaker identification results for the five methods are compared in Table 6. ‘Number of impulse response conditions for test’ in Table 6 denotes the ‘Array no.’ in Table 1b. In previous research, the speech recognition results for reverberant environments with clean speech models improved when using the GSS-based dereverberation method [22]. However, method 2, which was proposed in [22], degraded the speaker identification performance. Method 3, which was based on reverberant speech models, improved speaker recognition significantly because multiple reverberant environments were trained. However, the reverberation was not suppressed, so employing blind dereverberation may give a further improvement. The proposed method without parameter tuning (that is, α=n=1), which suppressed the reverberation in both training and test data, outperformed all the other methods under all reverberant environments. The proposed method achieved a relative error reduction of 83.7% compared with the baseline (method 1) and 28.4% compared with reverberant speech models (method 3). Furthermore, the proposed method performed better than the existing method 4 with a relative error reduction of 11.7%.

Table 6 Distant-talking speaker recognition rates of artificial data (%)

The performance of the proposed GSS-based dereverberation method may vary with the compensation parameters. To confirm this, we compared the performance of the proposed method with different parameters (noise overestimation factor α and exponent parameter n). The results are given in Table 7. For GSS, the exponent parameter n is often set in the range 0.1 to 1 [23, 24]. Thus, in this study, the exponent parameter n was set to 0.1, 0.3, 0.5, 0.7, and 1.0, and the noise overestimation factor α was set to α=n or α=2n. The results show that the optimum parameter depends on the reverberant environment and is very difficult to determine. By combining the results obtained with the various compensation parameter sets, we achieved a relative error reduction of 17.9% compared with the individual result for the optimum parameter. However, combining parameter sets increases the computational cost. For the conventional combination method, the computational time T^A_comb is 10 times the computational time T^A of the individual method, because the number of parameter sets I is 10. The computational time T^E_comb of our proposed efficient combination method is 1.27 T^A (see endnote e), which is about 1/8 of T^A_comb, while its performance is the same as that of the conventional combination method, which uses the likelihoods of all the speaker models. As a result, the proposed efficient combination method achieved a relative error reduction of 87.5% compared with the baseline and 44.8% compared with reverberant speech models, at almost the same computational cost as the individual method. Figure 8 compares the performance and computational cost of the individual method with those of the proposed efficient combination method using the 2-best, 5-best, and 260-best likelihoods (the last being the conventional combination method).
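
The ratio of 1.27 can be reproduced with a simple cost model. Equation 21 is not restated in this section, so the model below is an assumption rather than the authors' formula: feature extraction for one parameter set costs one unit, scoring all K speaker models for one parameter set costs γ units, and the efficient method scores all K models for one parameter set and only the N best speakers for the remaining I−1 sets. The values I=10, N=5, K=260, and γ=92 are those given in endnote e. All function names are hypothetical.

```python
def individual_cost(gamma):
    # One feature extraction plus one full scoring pass over all speakers.
    return 1.0 + gamma


def conventional_cost(num_sets, gamma):
    # Every parameter set needs feature extraction and a full scoring pass.
    return num_sets * (1.0 + gamma)


def efficient_cost(num_sets, n_best, num_speakers, gamma):
    # Feature extraction for every set, one full scoring pass, and
    # N-best-only scoring for the remaining sets (assumed model).
    pruned_pass = gamma * n_best / num_speakers
    return num_sets + gamma + (num_sets - 1) * pruned_pass


base = individual_cost(92)
print(round(conventional_cost(10, 92) / base, 2))
print(round(efficient_cost(10, 5, 260, 92) / base, 2))
# prints 10.0 and 1.27
```

Under this assumed model, the efficient method costs roughly 0.127 of the conventional combination, which is consistent with the factor of about 1/8 reported above.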

Table 7 Comparison of results of artificial data with different compensation parameter sets and combination methods for speaker identification
Figure 8 Comparison of distant-talking speaker recognition performance and computational cost. Methods 1 to 4 are described in Table 5. ‘Effi. comb.’ denotes the ‘Efficient combination’ described in section 4.
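
The idea behind the efficient combination can be sketched as follows. Only the N-best pruning is taken from the text; the callback `loglik`, the use of the first parameter set for the initial pass, and the summation of log-likelihoods across parameter sets are hypothetical choices made for illustration. The actual combination rule is the one defined in section 4.

```python
def efficient_combination(loglik, num_sets, speakers, n_best):
    """Pick a speaker by combining scores over several parameter sets.

    loglik(set_index, speaker) is a hypothetical scoring callback that
    returns the log-likelihood of the test utterance for one speaker
    model under one compensation parameter set.
    """
    # Pass 1: score every speaker with a single parameter set.
    first_pass = {spk: loglik(0, spk) for spk in speakers}
    shortlist = sorted(speakers, key=first_pass.get, reverse=True)[:n_best]

    # Pass 2: score only the shortlisted speakers with the other sets.
    combined = {}
    for spk in shortlist:
        rest = sum(loglik(i, spk) for i in range(1, num_sets))
        combined[spk] = first_pass[spk] + rest

    best = max(shortlist, key=combined.get)
    return best, combined


# Toy scores for three speakers and two parameter sets (made-up values).
TOY_SCORES = {
    0: {"A": -10.0, "B": -11.0, "C": -30.0},
    1: {"A": -15.0, "B": -9.0, "C": -1.0},
}
calls = []


def toy_loglik(set_index, speaker):
    calls.append((set_index, speaker))
    return TOY_SCORES[set_index][speaker]


best, combined = efficient_combination(toy_loglik, 2, ["A", "B", "C"], n_best=2)
print(best, combined, len(calls))
```

In the toy run, speaker C is pruned after the first pass, so only five likelihood evaluations are needed instead of six. Setting `n_best` to the number of speakers recovers the conventional combination, which corresponds to the 260-best case.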

Our previous work [22] showed that changes in the spectral floor parameter β, which is used to avoid negative or underflow values, have little effect on speech recognition performance. Because β influences the spectral distortion caused by the algorithm, we also conducted speaker recognition experiments with different values of β. The results are shown in Table 8. When β is too small (β=0.05), the dereverberation distortion is too large, which worsens the results. When β is too large (β=0.25), much of the reverberation cannot be suppressed, so the improvement is insufficient. We therefore empirically set β to 0.15, the same value as for speech recognition. Speaker recognition is more sensitive to β than speech recognition is.

Table 8 Comparison of results of artificial data with different values of β and combination methods for speaker identification

5.2.2 Experimental results of real reverberant speech

We also verified the proposed method in a real reverberant environment, using the optimal compensation parameters estimated in the artificial environment (α=n=0.5). The results from the real environment (Table 9) exhibited the same tendency as those from the artificial environment. Our proposed method (method 5) achieved a relative error reduction of 68.3% compared with the baseline (method 1) and 55.6% compared with reverberant speech models (method 2). For comparison, we conducted the same experiments with two other blind reverberation compensation strategies, namely LTLSS (method 3) and an MSLP-GSS-based method (method 4). The proposed method gave a relative error reduction of 35.8% compared with LTLSS and 24.7% compared with MSLP-GSS.

Table 9 Speaker recognition rates in real environment

6 Conclusions

Previously, Wang et al. proposed a blind dereverberation method based on GSS that employed MCLMS for hands-free speech recognition [22]. In this study, we applied this method to hands-free speaker identification. For speaker identification, however, the method proposed in [22] performed worse than the baseline method, which is the opposite of the result for speech recognition. We addressed this problem by training speaker models using dereverberant speech, which was obtained by suppressing reverberation from arbitrary artificial reverberant speech. The reverberant test speech was also compensated using MCLMS-GSS-based dereverberation. By combining various compensation parameter sets for GSS and efficiently calculating the speaker likelihoods, a more robust result was obtained without parameter tuning. Using dereverberant speech models, the proposed method achieved a recognition rate of 93.6%, which compares well with conventional CMN with beamforming using clean speech models (49.0%) and reverberant speech models (88.4%). In addition, the efficient combination method introduced in this paper has almost the same computational cost as the individual method. Furthermore, we implemented this method in a real environment with optimal compensation parameters estimated from an artificial environment. The proposed technique achieved a recognition rate of 87.8%, compared with 72.5% using a reverberant speech model. We also compared our proposed method with another dereverberation method, MSLP-GSS, in both artificial and real environments under the same spectral subtraction conditions. The proposed method achieved a recognition rate of 91.7%, compared with 90.6% using MSLP-GSS, in an artificial environment, and 87.8%, compared with 83.8%, in a real environment.

Endnotes

a Delay-and-sum beamforming reduces the directivity of each microphone channel, especially when using many microphones that are far away from each other (as in the test condition). In our previous work [22], beamforming was shown to produce better results. The time delay information was calculated according to each speech recording.

b Details of the experimental setup are described in section 5.

c For real reverberant speech, the processing step is the same as for artificial reverberant speech.

d For example, to estimate the clean spectrum of the 2i-th window W_{2i}, the estimated clean spectra of the 2(i−1)-th window W_{2(i−1)} and the 2(i−2)-th window W_{2(i−2)} were used.

e In this study, the values of I, N, and K in Equation 21 were set to 10, 5, and 260, respectively. γ was 92; that is, the computational time for the likelihood calculation of K speaker models was 92 times that for feature extraction. The experiments were conducted on a 2.0-GHz Intel(R) Xeon(R) server running Linux with 12 GB of main memory.

References

  1. Huang Y, Benesty J, Chen J: Acoustic MIMO Signal Processing. Berlin: Springer-Verlag; 2006.
  2. Maganti H, Matassoni M: An auditory based modulation spectral feature for reverberant speech recognition. In Proceedings of INTERSPEECH-2010. Makuhari, Chiba, 26-30 September, Curran Associates, Inc., Red Hook, NY; 2010:570-573.
  3. Raut C, Nishimoto T, Sagayama S: Adaptation for long convolutional distortion by maximum likelihood based state filtering approach. In Proceedings of the 2006 ICASSP. Toulouse, France, 14-19 May 2006, vol. 1. IEEE, Piscataway; 2006:1133-1136.
  4. Yoshioka T, Sehr A, Delcroix M, Kinoshita K, Maas R, Nakatani T, Kellermann W: Making machines understand us in reverberant rooms: robustness against reverberation for automatic speech recognition. IEEE Signal Process. Mag 2012, 29(6):114-126.
  5. Hughes TB, Kim HS, DiBiase JH, Silverman HF: Performance of an HMM speech recognizer using a real-time tracking microphone array as input. IEEE Trans. Speech Audio Process 1999, 7(3):346-349. 10.1109/89.759045
  6. Furui S: Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Process 1981, 29(2):254-272. 10.1109/TASSP.1981.1163530
  7. Liu F, Stern R, Huang X, Acero A: Efficient cepstral normalization for robust speech recognition. In Proceedings of the Workshop on Human Language Technology. Princeton. Association for Computational Linguistics, Stroudsburg; 1993:69-74.
  8. Lebart K, Boucher J, Denbigh P: A new method based on spectral subtraction for speech dereverberation. Acta Acustica 2001, 87: 359-366.
  9. Gelbart D, Morgan N: Double the trouble: handling noise and reverberation in far-field automatic speech recognition. In INTERSPEECH 2002. Denver, 16-20 September 2002; 968-971.
  10. Gelbart D, Morgan N: Evaluating long-term spectral subtraction for reverberant ASR. In ASRU 2001. Madonna di Campiglio, Italy, 9-13 December 2001.
  11. Wu M, Wang D: A two-stage algorithm for one-microphone reverberant speech enhancement. IEEE Trans. ASLP 2006, 14(3):774-784.
  12. Habets EA: Multi-channel speech dereverberation based on a statistical model of late reverberation. In Proceedings of IEEE ICASSP. Philadelphia, 18-23 March, vol. 4. IEEE, Piscataway; 2005:173-176.
  13. Sadjadi SO, Hansen JHL: Hilbert envelope based features for robust speaker identification under reverberant mismatched conditions. In Proceedings of IEEE ICASSP. Prague, Czech Republic, 22-27 May 2011; 5448-5451.
  14. Gannot S, Moonen M: Subspace methods for multimicrophone speech dereverberation. EURASIP J. Appl. Signal Process 2003, 2003(1):1074-1090.
  15. Kinoshita K, Delcroix M, Nakatani T, Miyoshi M: Spectral subtraction steered by multi-step forward linear prediction for single channel speech dereverberation. In Proceedings of IEEE ICASSP 2006. Toulouse, France, 14-19 May 2006; 817-820.
  16. Boll S: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoustics Speech Signal Process 1979, 27(2):113-120. 10.1109/TASSP.1979.1163209
  17. Delcroix M, Hikichi T, Miyoshi M: Precise dereverberation using multi-channel linear prediction. IEEE Trans. ASLP 2007, 15(2):430-440.
  18. Jin Q, Schultz T, Waibel A: Far-field speaker recognition. IEEE Trans. ASLP 2007, 15(7):2023-2032.
  19. Subramaniam S, Petropulu AP, Wendt C: Cepstrum-based deconvolution for speech dereverberation. IEEE Trans. Speech Audio Process 1996, 4(5):392-396. 10.1109/89.536934
  20. Kinoshita K, Delcroix M, Nakatani T, Miyoshi M: Suppression of late reverberation effect on speech signal using long-term multiple-step linear prediction. IEEE Trans. Audio Speech Lang. Process 2009, 17(4):534-545.
  21. Jin Q, Pan Y, Schultz T: Far-field speaker recognition. In Proceedings ICASSP 2006. Toulouse, France, 14-19 May, vol. 1. IEEE, Piscataway; 2006:937-940.
  22. Wang L, Odani K, Kai A: Dereverberation and denoising based on generalized spectral subtraction by multi-channel LMS algorithm using a small-scale microphone array. EURASIP J. Adv. Signal Process 2012, 2012(12).
  23. Sim BL, Tong YC, Chang JS, Tan CT: A parametric formulation of the generalized spectral subtraction method. IEEE Trans. Speech Audio Process 1998, 6(4):328-337. 10.1109/89.701361
  24. Inoue T, Saruwatari H, Takahashi Y, Shikano K, Kondo K: Theoretical analysis of musical noise in generalized spectral subtraction based on higher-order statistics. IEEE Trans. Audio Speech Lang. Process 2011, 19(6):1770-1779.
  25. Wang L, Nakagawa S, Kitaoka N: Blind dereverberation based on CMN and spectral subtraction by multi-channel LMS algorithm. In Proceedings of InterSpeech 2008. Brisbane, 22-26 September 2008; 1032-1035.
  26. Reynolds DA: Speaker identification and verification using Gaussian mixture speaker models. Speech Commun 1995, 17: 91-108. 10.1016/0167-6393(95)00009-D
  27. Reynolds DA, Quatieri TF, Dunn R: Speaker verification using adapted Gaussian mixture models. Dig. Signal Process 2000, 10(1-3):19-41. 10.1006/dspr.1999.0361
  28. Wang L, Kitaoka N, Nakagawa S: Robust distant speaker recognition based on position-dependent CMN by combining speaker-specific GMM with speaker-adapted HMM. Speech Commun 2007, 49(6):501-513. 10.1016/j.specom.2007.04.004
  29. Farrell K, Mammone R, Assaleh K: Speaker recognition using neural networks and conventional classifiers. IEEE Trans. Speech Audio Process 1994, 2(1):194-205. 10.1109/89.260362
  30. Campbell W, Campbell J, Reynolds D, Singer E, Torres-Carrasquillo P: Support vector machines for speaker and language recognition. Comput. Speech Lang 2006, 20(2-3):210-229.
  31. Kenny P, Ouellet P, Dehak N, Gupta V, Dumouchel P: A study of inter-speaker variability in speaker verification. IEEE Trans. Audio Speech Lang. Process 2008, 15(7):980-988.
  32. Dehak N, Kenny P, Dehak R, Dumouchel P, Ouellet P: Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process 2011, 19(4):788-798.
  33. Kingsbury B, Morgan N: Recognizing reverberant speech with RASTA-PLP. In Proceedings of IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP). Munich, 21-24 April, vol. 2. IEEE, Piscataway; 1997:1259-1262.
  34. Surendran AC, Flanagan JL: Stable dereverberation using microphone arrays for speaker verification. J. Acoust. Soc. Am 1994, 96(5):3261-3262.
  35. Huang Y, Benesty J: Adaptive blind channel identification: multi-channel least mean square and Newton algorithms. In ICASSP. Orlando, 13-17 May, vol. 2. IEEE, Piscataway; 2002:1637-1640.
  36. Huang Y, Benesty J: Adaptive multichannel least mean square and Newton algorithms for blind channel identification. Signal Process 2002, 82: 1127-1138. 10.1016/S0165-1684(02)00247-5
  37. Huang Y, Benesty J, Chen J: Optimal step size of the adaptive multi-channel LMS algorithm for blind SIMO identification. IEEE Signal Process. Lett 2005, 12(3):173-175.
  38. Wang L, Kitaoka N, Nakagawa S: Distant-talking speech recognition based on spectral subtraction by multi-channel LMS algorithm. IEICE Trans. Inf. Syst. 2011, E94-D(3):659-667. 10.1587/transinf.E94.D.659
  39. Nakamura S, Hiyane K, Asano F, Nishiura T, Yamada T: Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition. In Proceedings of LREC 2000. Athens, Greece, 31 May - 2 June 2000; 965-968.
  40. Nishiura T, Nakayama M, Denda Y, Kitaoka N, Yamamoto K, Yamada T, Tsuge S, Miyajima C, Fujimoto M, Takiguchi T, Tamura S, Kuroiwa S, Takeda K, Nakamura S: Evaluation framework for distant-talking speech recognition under reverberant environments. In Proceedings of INTERSPEECH 2008. Brisbane, Australia, 22-26 September 2008; 968-971.
  41. Itou K, Takeda K, Kakezawa T, Matsuoka T, Kobayashi T, Shikano K, Itahashi S, Yamamoto M: Japanese speech corpus for large vocabulary continuous speech recognition research. J. Acoust. Soc. Jpn. (E) 1999, 20(3):199-206. 10.1250/ast.20.199
  42. Naylor PA: Signal-based performance evaluation of dereverberation algorithms. J. Electrical Comput. Eng 2010, 2010(5): Article ID 127513. doi:10.1155/2010/127513

Acknowledgements

This work was partially supported by a research grant from the Tateisi Science and Technology Foundation.

Author information

Correspondence to Longbiao Wang.

Additional information

Competing interests

The authors declare that they have no competing interests.

About this article

Cite this article

Zhang, Z., Wang, L. & Kai, A. Distant-talking speaker identification by generalized spectral subtraction-based dereverberation and its efficient computation. J AUDIO SPEECH MUSIC PROC. 2014, 15 (2014) doi:10.1186/1687-4722-2014-15

Keywords

  • Hands-free speaker recognition
  • Blind dereverberation
  • Multi-channel least mean-squares
  • Generalized spectral subtraction
  • Gaussian Mixture Model