Analysis of transition cost and model parameters in speaker diarization for meetings

Martínez-González, Beatriz; Pardo, José M.; Vallejo-Pinto, José A.; San-Segundo, Rubén; Ferreiros, Javier

doi:10.1186/s13636-021-00196-6

Research
Open access
Published: 24 February 2021


Beatriz Martínez-González¹^na1,
José M. Pardo ORCID: orcid.org/0000-0002-1009-590X²^na1,
José A. Vallejo-Pinto³,
Rubén San-Segundo² &
…
Javier Ferreiros²

EURASIP Journal on Audio, Speech, and Music Processing volume 2021, Article number: 12 (2021)


Abstract

There has been little work in the literature on the speaker diarization of meetings with multiple distant microphones since the publications in 2012 related to the last National Institute of Standards and Technology (NIST) Rich Transcription Evaluation Campaign in 2009 (RT09). Lately, the Second DIHARD Challenge Evaluation has also covered diarization at dinner party meetings that include multiple distant microphones. Dinner party meetings are somewhat harder than office meetings because their participants can move freely around the room. In this paper, we study some of the algorithms for speaker diarization of meetings with multiple distant microphones, using the NIST Rich Transcription Evaluation Campaign in 2007 (RT07) and RT09, and provide definite and clear improvements. On the one hand, little or no attention has been paid to the problem of penalizing or favoring transitions between speakers other than proposing a minimum duration of a speaker turn or calculating the speakers’ probabilities using Variational Bayes (VB). We have studied this issue and determined that a transition penalty term is needed that should be independent of both the number of active speakers and the minimum duration of speaker turns. On the other hand, a method to automatically select the right number of parameters is crucial in developing good models for speakers. Previous studies have proposed the dynamic selection of the number of parameters based on the duration of the speaker’s speech, with mixed performance when tested on meetings with one distant microphone or with multiple distant microphones. In this paper, we propose a new method that takes into account both the duration of the speaker’s speech, to determine a minimum number of parameters, and the overfitting issue, to determine a maximum number of them, while also taking the computation time into account in order to reduce it.

We have carried out experiments to support our findings, and we have been able to improve our baseline speaker error rate with multiple distant-microphone meetings. Both methods achieve improved performance over the baseline. The first method obtains a 21.6% decrease in relative speaker error for the development set and a 4.6% decrease in relative speaker error for the test set (RT09). The second method obtains a 46.47% decrease in relative speaker error for the development set and a 17.54% decrease in relative speaker error for the test set. Both methods complement each other, and when they are applied in combination, we obtain a 47.2% decrease in relative speaker error for the development set and a 22.02% decrease in relative speaker error for the test set.

The performance obtained with our proposal is outstanding in some subsets of the development set, such as the NIST RT07, and among the best for RT09, using our proposed simple modifications. Furthermore, with our algorithm we obtain gains in computation time without jeopardizing performance. Results with a different publicly available database, augmented multiparty interaction (AMI), show a 28.44% decrease in relative speaker error, confirming the validity of our methods. Preliminary experiments with a single stream (mfcc) endorse the validity of our findings. In comparisons with an x-vector system, our system delivers superior performance on unseen test data.

1 Introduction

Speaker diarization consists of annotating a recording with speaker labels. This task is usually done with no knowledge of the number or identity of the speakers. Thus, two tasks are necessary: the first is to identify the number of speakers, and the second is to identify the specific regions in which each speaker intervenes. Speaker diarization is needed when transcribing a recording with multiple speakers. An overview of automatic speaker diarization systems is given in [1,2,3].

The National Institute of Standards and Technology (NIST) held evaluations of speaker diarization for meetings with multiple distant microphones (MDM) in 2005, 2006, 2007, and 2009. No further NIST evaluations have been held since then. Recently, new interest in speaker diarization has appeared with the launch of the First, Second, and Third DIHARD Speech Diarization Challenges, which include diarization in complex acoustic environments such as broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, extended child language acquisition recordings, and YouTube videos [4]. However, recordings with multiple distant microphones are only available for dinner parties, which differ from the NIST meetings. Also, no multiple-microphone meetings are included in the Third DIHARD Challenge.

The components of a typical speaker diarization system are (1) the speech activity detector, (2) the feature extractor, and (3) the segmenting and clustering algorithm. The objective of the speech activity detector is to separate speech from other sounds, such as silence, using two models (speech and non-speech) [5] or more models (i.e., speech, non-speech, and silence) [6, 7]. The feature extractor processes the speech and calculates different spectral characteristics such as the Mel-frequency cepstral coefficients (MFCC) [8, 9], fundamental frequency (F0) [10, 11], the combination of neural network features with MFCC features [12, 13], the use of a phoneme background model [14], and other long-term features [15, 16] or energy features in the case of using multiple distant microphones [17].

The segmenting and clustering algorithm can be either bottom-up [6, 18] or top-down [19]. A comparison between both methods is made in [20]. In another work, information on the role of speakers is used to adjust the segmentation [21]. Speaker models can be established using Gaussian mixture models (GMM) [22] or more recent I-Vector models [23, 24], CNN-I-Vectors [25], or X-Vectors [26]. X-Vectors have shown good performance; however, they need much more training data than is available in our experiments, since we do not use any external data other than that available from the recording session. In this sense, our GMM system is self-contained both for speaker modeling and for speech activity detection and is independent of any external sources ^{Footnote 1}. The most widely used distance metrics depend on the speaker models; the most common are the Bayesian information criterion (BIC) [27], T test distance [28], information theoretic approach [29, 30], and cosine distance and probabilistic linear discriminant analysis (PLDA) for I-Vectors [16, 31].

Since the duration of a speaker’s turn is not known, a significant problem is how to decide when a speaker’s turn is feasible. One way of doing it is through comparisons of acoustic models before and after the turn. Some authors use Viterbi segmentation [32] but penalize transitions depending on the number of active clusters. Another possible parameter is the minimum duration of a speaker turn, which limits the total number of speaker turns [2] in a recording. The problem of penalizing transitions has also been analyzed in [33] and [34], which propose a different alternative, although the number of speakers is known in their experiments. Recent research focuses on this topic and proposes learning the speaker turn priors [24, 35]. In this work, we revisit this problem for classic methods and present alternative solutions that produce better and more robust results.

As regards the cluster (speaker) models, an important decision when modeling a speaker with a GMM or other models is the number of mixtures or parameters needed. In general, the amount of data available for training plays a crucial role in defining the number of parameters of a model: with little data, it is impossible to create good models if the model has many parameters. On the other hand, if we have plenty of data and as many parameters as we want, we encounter the problem of overfitting and the model does not generalize well. This topic is addressed in most pattern recognition books; see for instance [36]. In [37], this problem is analyzed and the number of frames needed to create a model is determined using the so-called “cluster complexity ratio,” a parameter that relates the number of frames of data available to the number of mixtures in the GMM that models these data. After each change in the amount of data assigned to a cluster due to segmentation, a new number of mixtures is defined according to the number of frames now assigned to the model. Some positive results were obtained in single distant microphone (SDM) experiments with a database of 16 meetings in the development set and 10 meetings in the test set (relative improvements in the diarization error (DER) of 2.9% for the development set and 19.39% for the test set). However, when new experiments were run with a bigger development set (24 meetings) and a new test set of 8 meetings (from the NIST Rich Transcription Evaluation Campaign in 2006 (RT06)), in both the SDM and MDM scenarios, the DER improved by only 2.7% relative for the development set and did not improve at all for the test set [32]. Contradictory results are again obtained in [38], in which the SDM results in the test set do not improve but degrade by 17.5% relative in the DER. Furthermore, this procedure does not take the overfitting issue into account: more frames, even if they add no new information, are modeled with more parameters, so the speaker model may overfit and not generalize sufficiently. Other researchers [39] have demonstrated that the number of Gaussians used to model a speaker is important in the creation of a good segmentation. Their experiments include a consensus based on different models, each trained with a different number of Gaussians.
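As a rough illustration of a duration-driven rule of this kind, the sketch below ties the number of mixtures to the number of frames through a fixed ratio. The exact rule used in [37] is not reproduced in this section, so the rounding rule, the function name `mixtures_from_frames`, and the ratio value of 700 frames per Gaussian are assumptions chosen only for illustration.

```python
def mixtures_from_frames(n_frames, ccr, min_mixtures=1):
    """Assumed rule: one Gaussian per `ccr` frames, rounded to the nearest integer.

    `ccr` stands in for a cluster complexity ratio; its value here is hypothetical.
    """
    return max(min_mixtures, int(n_frames / ccr + 0.5))


# Hypothetical ratio of 700 frames (7 s at a 10 ms frame shift) per Gaussian.
short_cluster = mixtures_from_frames(1_000, ccr=700)    # about 10 s of speech
long_cluster = mixtures_from_frames(25_000, ccr=700)    # about 250 s of speech
```

Under such a rule the model size grows without bound as frames accumulate, which is the overfitting risk pointed out above.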

The problem of selecting the number of parameters is also important when mixing acoustic features with delay features in a weighted model [40]. The delay features do not need as many parameters as the spectral features, since their dimensionality is usually lower, so the two streams should not receive the same treatment.

The objective of this paper is to study the complexity of the models in the context of the diarization of MDM meetings, carry out a thorough analysis of it, propose two parameters and their interrelation for solving the problem, and obtain justified conclusions. To our knowledge, such a study has not been done before. Furthermore, we propose a new strategy to prevent overfitting and save computation time without significantly decreasing performance. A preliminary analysis of our methodology applied to single-channel recordings is also presented.

The paper is organized as follows. In Section 2, the baseline system is described. In Section 3, the database used for experiments is explained. In Section 4, we present the analysis of the problem of the transition penalty when segmenting speakers. In Section 5, we introduce the second objective: how to select the right model for a speaker. Section 6 merges the approaches of Sections 4 and 5. Section 7 presents results for the best systems with a publicly available database and a set of comparisons of our results with other published data. Finally, Section 8 is the discussion, and Section 9 presents our conclusions.

2 Description of the baseline system

2.1 Introduction

The architecture of the system is presented in Fig. 1. Every microphone produces a signal that is filtered to suppress some channel noise. After that, there is a module that calculates the time difference of arrival (TDOA) between two signals. In our case, these signals are the output of the microphones. The method used is the generalized cross-correlation method (GCC) [41]. The cross-correlations between any pair of channels are calculated as well. The channel with the highest cross-correlation is used as a reference [42]. The next step is the creation of a beamformed signal by delaying and summing the signals coming from the different microphones (weighted sum).
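The delay estimation and delay-and-sum steps can be sketched on toy signals as follows. This is a simplified illustration, not the system's implementation: it uses a plain time-domain cross-correlation over integer lags instead of the GCC method of [41], and the function names, signals, and weights are hypothetical.

```python
def cross_correlation(ref, sig, lag):
    """Sum of ref[n] * sig[n + lag] over the samples where both exist."""
    total = 0.0
    for n, r in enumerate(ref):
        m = n + lag
        if 0 <= m < len(sig):
            total += r * sig[m]
    return total


def estimate_delay(ref, sig, max_lag):
    """Integer lag (in samples) at which sig best matches ref."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cross_correlation(ref, sig, lag))


def delay_and_sum(channels, delays, weights):
    """Weighted sum of the channels after compensating each channel's delay."""
    length = len(channels[0])
    out = [0.0] * length
    for channel, delay, weight in zip(channels, delays, weights):
        for n in range(length):
            m = n + delay
            if 0 <= m < len(channel):
                out[n] += weight * channel[m]
    return out


# Toy data: the second channel is the reference delayed by 3 samples.
ref = [0, 0, 1, 3, -2, 1, 0, 0, 0, 0, 0, 0]
sig = [0, 0, 0] + ref[:-3]
tdoa = estimate_delay(ref, sig, max_lag=5)
beamformed = delay_and_sum([ref, sig], [0, tdoa], [0.5, 0.5])
```

With equal weights and a correctly estimated delay, the beamformed toy signal coincides with the reference channel.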

The Mel-frequency cepstral coefficients (MFCC) are extracted from the beamformed signal every 10 ms using a window width of 30 ms. The MFCCs form what we call the mfcc vector. The beamformed signal is also processed by a voice activity detector (VAD) that classifies speech frames versus non-speech frames using a two-model Gaussian mixture model (GMM) and Viterbi resegmentation [5]. The output of the VAD module is fed into the agglomerative clustering module.
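The framing implied by a 10 ms step and a 30 ms window can be sketched as below. The sampling rate of 16 kHz and the function name `frame_starts` are assumptions for illustration; the text does not state the sampling rate at this point.

```python
def frame_starts(n_samples, sample_rate=16000, win_ms=30, hop_ms=10):
    """Start index of every full analysis window in a signal of n_samples."""
    win = sample_rate * win_ms // 1000   # 480 samples at the assumed 16 kHz
    hop = sample_rate * hop_ms // 1000   # 160 samples at the assumed 16 kHz
    return list(range(0, n_samples - win + 1, hop))


starts = frame_starts(16000)  # one second of audio at the assumed rate
```

Consecutive windows overlap by 20 ms, so each second of audio yields roughly 100 feature vectors.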

The localization feature estimation creates a vector of TDOAs for each 10 ms frame. This vector is obtained by choosing an optimized set of channel pairs and calculating a TDOA for every pair. The concatenation of the TDOAs forms what we call the tdoa vector. Several methods were tried and tested in order to find the optimum representation of the localization features, including principal component analysis (PCA) transformations and the use of cross-correlation as a measure of quality. The method finally used was selection through the cross-correlation between channels; see [43].

Both mfcc vectors and tdoa vectors are fed to the next block which is the segmentation and agglomerative clustering of speech regions. This block has several parts (see Fig. 2). There is an initialization module that creates a first set of segments based on a maximum number of clusters (speakers) L (we use the maximum number of expected speakers). The full recording (only the speech part) is divided uniformly into L parts.
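The uniform initialization can be sketched as follows: the speech frames are cut into L contiguous blocks of nearly equal size, one per initial cluster. The function name `uniform_init` and the toy frame list are hypothetical.

```python
def uniform_init(speech_frames, n_clusters):
    """Split the speech frames into n_clusters contiguous, near-equal blocks."""
    n = len(speech_frames)
    bounds = [(i * n) // n_clusters for i in range(n_clusters + 1)]
    return [speech_frames[bounds[i]:bounds[i + 1]] for i in range(n_clusters)]


# Ten speech frames (indices only) split into L = 4 initial clusters.
parts = uniform_init(list(range(10)), 4)
```

Every frame is assigned to exactly one initial cluster, and the block sizes differ by at most one frame.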

Each cluster is modeled by a Gaussian mixture model (GMM). There is a minimum duration per cluster, typically 2.5 s (see Fig. 3) [22]. The minimum duration per cluster is determined empirically. Due to this minimum duration, short interjections such as “yes,” “yeah,” and “no” will be ignored by the system. However, the scoring mechanism does not ignore such words, so they will be counted as errors of our system. The handling of short words or affirmations is one of the drawbacks of our method. The GMM initially consists of a minimum number of components, 5 for the mfcc vector and 1 for the tdoa vector. The next module is the segmentation and training module. The recording is segmented by the Viterbi algorithm using the original cluster models. Then, after segmentation, a new training is carried out, followed by a subsequent segmentation. This process is repeated several times (from 3 to 5). The next module is “Cluster pair comparison.” Every combination of two clusters is compared to determine whether they should be merged. If the stopping criterion is not met, a pair of clusters is selected to be merged. When this happens, the number of components in the merged cluster is the sum of the components of the individual clusters. When the stopping criterion is met, the process ends. The number of components of any cluster model therefore depends on the number of times that the cluster has participated in a merging, regardless of the duration of the final cluster once the resegmentation has been carried out.
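The control flow of the merging loop can be sketched as follows. This is a toy illustration under stated assumptions: the Viterbi resegmentation and retraining between merges are omitted, the pair score `toy_score` is a hypothetical stand-in for the real comparison, and stopping when the best score is not above zero is an assumed threshold.

```python
def mean(values):
    return sum(values) / len(values)


def toy_score(a, b):
    """Hypothetical pair score: positive when the cluster means are close."""
    return 1.0 - abs(mean(a["frames"]) - mean(b["frames"]))


def agglomerate(clusters, score_pair, threshold=0.0):
    """Merge the best-scoring pair until no pair scores above the threshold."""
    clusters = [dict(c) for c in clusters]
    while len(clusters) > 1:
        pairs = [(score_pair(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = max(pairs)
        if best <= threshold:
            break  # stopping criterion met
        merged = {
            "frames": clusters[i]["frames"] + clusters[j]["frames"],
            # The merged model keeps the sum of the components of both clusters.
            "components": clusters[i]["components"] + clusters[j]["components"],
        }
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters


# Four initial clusters with 5 components each (one toy 1-D frame per cluster).
initial = [{"frames": [x], "components": 5} for x in (0.0, 0.2, 5.0, 5.1)]
final = agglomerate(initial, toy_score)
```

In this toy run each surviving cluster has taken part in one merge, so its component count has doubled regardless of how many frames it ends up holding.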

We use the ΔBIC measure to decide if any merging is still possible (see Eq. 1) [27]. Notice that there is no penalty term λ in the BIC score because the two modeling hypotheses have the same number of parameters, as shown in [22]. In the following equation, X_A represents the part of the recording assigned to speaker A, X_B represents the part of the recording assigned to speaker B, and X represents the data of both clusters pooled together.

$$ \begin{array}{l} \Delta BIC = \log p\left(X \mid \xi\right) - \log p\left(X_A \mid \xi_A\right) - \log p\left(X_B \mid \xi_B\right) \\ X = X_A \cup X_B \\ \xi_A \text{ is the model created with } X_A \\ \xi_B \text{ is the model created with } X_B \\ \xi \text{ is the model created with } X \end{array} $$

(1)
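A minimal numerical sketch of Eq. 1 on 1-D toy data is given below. It is a simplification under stated assumptions: each cluster is modeled by a single Gaussian, and the merged model ξ is built by pooling the two components with weights proportional to the frame counts, without the retraining on X that the definition of ξ implies. All function names and data values are hypothetical.

```python
import math


def fit_gaussian(xs, var_floor=1e-3):
    """Maximum-likelihood mean and (floored) variance of 1-D data."""
    mu = sum(xs) / len(xs)
    var = max(sum((x - mu) ** 2 for x in xs) / len(xs), var_floor)
    return mu, var


def log_gauss(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def gmm_loglik(xs, components):
    """Log-likelihood of xs under a mixture given as (weight, mean, var) tuples."""
    total = 0.0
    for x in xs:
        terms = [math.log(w) + log_gauss(x, mu, var) for w, mu, var in components]
        top = max(terms)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total


def delta_bic(xa, xb):
    """Eq. 1 with a pooled (not retrained) merged model: a simplifying assumption."""
    mu_a, var_a = fit_gaussian(xa)
    mu_b, var_b = fit_gaussian(xb)
    n_a, n_b = len(xa), len(xb)
    merged = [(n_a / (n_a + n_b), mu_a, var_a), (n_b / (n_a + n_b), mu_b, var_b)]
    return (gmm_loglik(xa + xb, merged)
            - gmm_loglik(xa, [(1.0, mu_a, var_a)])
            - gmm_loglik(xb, [(1.0, mu_b, var_b)]))


xa = [0.0, 0.1, -0.1, 0.2, -0.2]
xb_far = [10.0, 10.1, 9.9, 10.2, 9.8]     # clearly a different "speaker"
xb_near = [0.05, -0.05, 0.15, -0.15, 0.0]  # similar to xa
far_score = delta_bic(xa, xb_far)
near_score = delta_bic(xa, xb_near)
```

Because the merged model has as many components as the two separate models together, no parameter-count penalty appears, and the pair of similar clusters scores higher than the pair of dissimilar ones.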

The combination of the mfcc vector and the tdoa vector is made using the methodology presented in [40]. We apply a weight factor to the mfcc vector of 0.85 as in [43] since we use the same set of localization features.
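A sketch of how such a weighting can be applied at the frame level is given below. It assumes that the two streams are combined as a weighted sum of their log-likelihoods with weights adding up to one, so that the tdoa stream receives 0.15; that form, the function name, and the log-likelihood values are assumptions for illustration rather than a restatement of [40].

```python
MFCC_WEIGHT = 0.85  # weight applied to the mfcc stream, as stated in the text


def combined_loglik(loglik_mfcc, loglik_tdoa, w_mfcc=MFCC_WEIGHT):
    """Assumed combination: weighted sum of per-stream log-likelihoods."""
    return w_mfcc * loglik_mfcc + (1.0 - w_mfcc) * loglik_tdoa


# Hypothetical per-frame log-likelihoods of one cluster model.
score = combined_loglik(-10.0, -2.0)
```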

2.2 Baseline segmentation method

The model of a cluster consists of a series of hidden Markov model (HMM) states that share the same GMM. The number of these states is equal to the minimum number of frames assigned to a speaker turn (in the baseline this is 250, equivalent to 2.5 s). In the last state, following the recommendation in [32, 37], both the weight of staying in the last state (alpha) and the weight of jumping to another cluster (beta) are set to 1. At this point, these values can no longer be considered probabilities, since they do not add up to 1. However, when the accumulated Viterbi probability is calculated, alpha and beta do not add any extra duration model to the last state of a cluster. After the jump to another cluster, the value of beta changes to beta/M, M being the number of remaining active clusters (see Fig. 4).
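The topology described above can be sketched as a small Viterbi decoder in which each cluster is a chain of `min_dur` states, the last state has a free self-loop (log alpha = 0), and a jump into another cluster costs log(beta/M). This is a toy illustration, not the system's implementation: the frame log-likelihoods are hypothetical, M is taken as the total number of clusters passed in, and the decoder is allowed to end in any state.

```python
import math


def viterbi_min_duration(frame_loglik, min_dur, beta=1.0):
    """Best cluster label per frame; frame_loglik[t][k] is a log-likelihood."""
    n_frames = len(frame_loglik)
    n_clusters = len(frame_loglik[0])
    last = min_dur - 1
    log_jump = math.log(beta / n_clusters)  # beta/M applied on a cluster change
    neg_inf = float("-inf")

    # score[k][d]: best path ending in state d of the chain of cluster k.
    score = [[neg_inf] * min_dur for _ in range(n_clusters)]
    for k in range(n_clusters):
        score[k][0] = frame_loglik[0][k]
    backptrs = []

    for t in range(1, n_frames):
        new = [[neg_inf] * min_dur for _ in range(n_clusters)]
        ptr = [[None] * min_dur for _ in range(n_clusters)]
        for k in range(n_clusters):
            emit = frame_loglik[t][k]
            candidates = []  # (previous score, target state, source state)
            for d in range(1, min_dur):           # advance along the chain
                candidates.append((score[k][d - 1], d, (k, d - 1)))
            candidates.append((score[k][last], last, (k, last)))  # self-loop
            for j in range(n_clusters):           # jump from another cluster
                if j != k:
                    candidates.append((score[j][last] + log_jump, 0, (j, last)))
            for value, d, source in candidates:
                if value + emit > new[k][d]:
                    new[k][d] = value + emit
                    ptr[k][d] = source
        score = new
        backptrs.append(ptr)

    states = [(k, d) for k in range(n_clusters) for d in range(min_dur)]
    k, d = max(states, key=lambda s: score[s[0]][s[1]])
    labels = [k]
    for ptr in reversed(backptrs):
        k, d = ptr[k][d]
        labels.append(k)
    labels.reverse()
    return labels


# Two clusters; frame 1 is an outlier that looks like cluster 1.
GOOD, BAD = -1.0, -5.0
loglik = [[GOOD, BAD], [BAD, GOOD], [GOOD, BAD], [GOOD, BAD],
          [BAD, GOOD], [BAD, GOOD], [BAD, GOOD], [BAD, GOOD]]
with_min_dur = viterbi_min_duration(loglik, min_dur=3)
without_min_dur = viterbi_min_duration(loglik, min_dur=1)
```

With a minimum duration of three frames the isolated outlier is absorbed into the surrounding turn, whereas with a minimum duration of one frame it produces two extra speaker changes.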

This value beta/M adds a new penalization factor to a transition. Furthermore, this penalization factor depends on the number of active clusters, so it changes after every iteration of the clustering and merging process, which starts from the L initial clusters and removes one cluster at each step. The penalization factor therefore increases at each iteration (M is lower). This increase is somewhat artificial and totally independent of the number of speakers in the recording (which is not known). This factor is not usually taken into account in classic diarization systems. Some recent research proposes methods to learn this factor [24, 35]. One of the objectives of this paper is to study this factor and propose an alternative that improves the baseline system. Preliminary experiments on this topic have been presented in [44].
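The dependence on M can be made concrete by tabulating beta/M as clusters are merged. The initial value of 16 clusters and the function name below are hypothetical and chosen only for illustration.

```python
import math


def jump_factors(initial_clusters, beta=1.0, min_clusters=2):
    """(M, beta/M, log(beta/M)) for M from initial_clusters down to min_clusters."""
    return [(m, beta / m, math.log(beta / m))
            for m in range(initial_clusters, min_clusters - 1, -1)]


schedule = jump_factors(16)  # hypothetical L = 16 initial clusters
```

The factor moves from 1/16 at the first iteration to 1/2 when two clusters remain, so the same acoustic evidence is weighed against a different transition term at every merging iteration, whatever the true number of speakers is.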

3 Database and metrics

There has been little work in the literature on speaker diarization of meetings with multiple distant microphones since RT09. There is some new work on RT09, but it uses only one distant microphone and an oracle speech/non-speech detector [12] or assumes that the number of speakers is known a priori [45].

We do not have a training set. Our development set, used to tune hyper-parameters, consists of a subset of 12 meetings extracted from the NIST Rich Transcription 2002–2005 sets (RT02-05). This set was previously used by us in published work (devel06 in [40]). We add the RT06 set and the RT07 set to form what will be called DEVELSET from now on; see Table 1. The evaluation set is RT09. The performance of the systems was calculated using the scored speaker time and the segments of the recordings officially selected by NIST for the annual evaluations. The amount of time is 15,484.34 s or 4.3 h (1,548,434 frames) for the DEVELSET and 5932.88 s or 1.64 h (593,288 frames) for the RT09 set. We included overlap regions and a forgiveness factor of 0.25 s, as in the official evaluations. The calculation of the DER and the speaker error (SER) is carried out using the tools provided by NIST [46]. We focus primarily on the SER, since the miss speaker error (MISS) and the false alarm error (FA) are fixed in all our experiments. The DER is also presented for comparison with other published works on the same data sets.
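The bookkeeping behind these figures can be sketched as follows. The frame counts follow from the 10 ms frame shift, and the DER is taken here as the sum of MISS, FA, and SER, which is the usual NIST definition. The error values passed to `der` and `relative_reduction` are toy numbers, not results from this paper, and the function names are hypothetical.

```python
FRAME_SHIFT_S = 0.01  # 10 ms per frame


def seconds_to_frames(seconds):
    return round(seconds / FRAME_SHIFT_S)


def der(miss, fa, ser):
    """Diarization error as the sum of its three components (all in %)."""
    return miss + fa + ser


def relative_reduction(baseline, improved):
    """Relative decrease of an error rate, in percent of the baseline."""
    return 100.0 * (baseline - improved) / baseline


devel_frames = seconds_to_frames(15484.34)
rt09_frames = seconds_to_frames(5932.88)
toy_der = der(miss=2.0, fa=1.0, ser=7.0)    # toy values
toy_gain = relative_reduction(20.0, 15.0)   # toy values
```

Since MISS and FA are fixed across experiments, any change in DER between two systems equals the change in SER.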

Table 1 List of meetings used for the development set (DEVELSET)
