Analysis of transition cost and model parameters in speaker diarization for meetings

There has been little work in the literature on the speaker diarization of meetings with multiple distant microphones since the publications in 2012 related to the last National Institute of Standards and Technology (NIST) Rich Transcription Evaluation Campaign in 2009 (RT09). Lately, the Second DIHARD Challenge Evaluation has also covered diarization at dinner party meetings that include multiple distant microphones. Dinner party meetings are somewhat harder than office meetings because their participants can move freely around the room. In this paper, we study some of the algorithms for speaker diarization of meetings with multiple distant microphones developed for the NIST Rich Transcription Evaluation Campaign in 2007 (RT07) and RT09, and provide clear improvements. On the one hand, little or no attention has been paid to the problem of penalizing or favoring transitions between speakers, other than proposing a minimum duration of a speaker turn or calculating the speakers’ probabilities using Variational Bayes (VB). We have studied this issue and determined that a transition penalty term is needed that should be independent of both the number of active speakers and the minimum duration of speaker turns. On the other hand, a method to automatically select the right number of parameters is crucial in developing good models for speakers. Previous studies have proposed the dynamic selection of the number of parameters based on the duration of the speaker’s speech, with mixed performance when tested on meetings with a single distant microphone or with multiple distant microphones. In this paper, we propose a new method that takes into account both the duration of the speaker’s speech, to determine a minimum number of parameters, and the risk of overfitting, to determine a maximum number of them, while also reducing computation time.
We have carried out experiments to support our findings, and we have been able to improve our baseline speaker error rate with multiple distant-microphone meetings. Both methods achieve improved performance over the baseline. The first method obtains a 21.6% relative decrease in speaker error for the development set and a 4.6% relative decrease for the test set (RT09). The second method obtains a 46.47% relative decrease in speaker error for the development set and a 17.54% relative decrease for the test set. Both methods complement each other, and when they are applied in combination, we obtain a 47.2% relative decrease in speaker error for the development set and a 22.02% relative decrease for the test set. With these simple modifications, the performance obtained is outstanding in some subsets of the development set, such as NIST RT07, and among the best for RT09. Furthermore, our algorithm obtains gains in computation time without jeopardizing performance. Results with a different publicly available database, augmented multiparty interaction (AMI), show a 28.44% relative decrease in speaker error, confirming the validity of our methods. Preliminary experiments with a single stream (mfcc) support our findings. In comparisons with an x-vector system, our system delivers superior performance on unseen test data.


1 Introduction
Speaker diarization consists of transcribing a recording with speaker labels. This task is usually done with no knowledge of the number or identity of the speakers. Thus, two tasks are necessary: the first is to identify the number of speakers, and the second is to identify the specific regions in which each speaker intervenes. Speaker diarization is needed when transcribing a recording with multiple speakers. An overview of automatic speaker diarization systems is given in [1][2][3].
The National Institute of Standards and Technology (NIST) held evaluations of speaker diarization for meetings with multiple distant microphones (MDM) in 2005, 2006, 2007, and 2009. No further NIST evaluations have been held since then. Recently, new interest in speaker diarization has appeared with the launch of the First, Second, and Third DIHARD Speech Diarization Challenges, which include diarization in complex acoustic environments such as broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, extended child language acquisition recordings, and YouTube videos [4]. However, recordings with multiple distant microphones are only available for dinner parties, which differ from the NIST meetings. Moreover, the Third DIHARD Challenge includes no multiple-microphone meetings.
The components of a typical speaker diarization system are (1) the speech activity detector, (2) the feature extractor, and (3) the segmenting and clustering algorithm. The objective of the speech activity detector is to separate speech from other sounds, such as silence, using two models (speech and non-speech) [5] or more models (e.g., speech, non-speech, and silence) [6,7]. The feature extractor processes the speech and calculates different characteristics such as the Mel-frequency cepstral coefficients (MFCC) [8,9], fundamental frequency (F0) [10,11], the combination of neural network features with MFCC features [12,13], features from a phoneme background model [14], other long-term features [15,16], or energy features in the case of multiple distant microphones [17].
The segmenting and clustering algorithm can be either bottom-up [6,18] or top-down [19]. A comparison between both methods is made in [20]. In another work, information on the role of speakers is used to adjust the segmentation [21]. Speaker models can be built using Gaussian mixture models (GMM) [22] or the more recent I-Vector models [23,24], CNN-I-Vectors [25], or X-Vectors [26]. X-Vectors have shown good performance; however, they need much more training data than is used in our experiments, because we do not use any external data other than that available from the recording session. In this sense, our GMM system is self-contained both for speaker modeling and for speech activity detection and is independent of any external sources (see footnote 1). The distance metrics used depend on the speaker models; the most common are the Bayesian information criterion (BIC) [27], the T-test distance [28], information-theoretic approaches [29,30], and cosine distance and probabilistic linear discriminant analysis (PLDA) for I-Vectors [16,31].
Since the duration of a speaker's turn is not known, a significant problem is how to decide when a speaker's turn is feasible. One way of doing so is to compare acoustic models before and after the turn. Some systems use Viterbi segmentation [32], penalizing transitions depending on the number of active clusters. Another possible parameter is the minimum duration of a speaker turn, which limits the total number of speaker turns in a recording [2]. The problem of penalizing transitions has also been analysed in [33] and [34], which propose a different alternative, although the number of speakers is known in their experiments. Recent research focuses on this topic and proposes learning the speaker turn priors [24,35]. In this work, we revisit this problem in classic methods and present alternative solutions that produce better and more robust results.
Footnote 1: At the time that this technology was created, voice activity detection, for instance, depended heavily on the type of background noise, so an external VAD could produce unstable results. Equally, if the system had to be used in different rooms, scenarios, or types of background, the use of external sources would deliver spurious results. If we assume a model universal enough to be used as a background model and adapted in a second step to our room, the method could certainly be more robust.
As regards the cluster (speaker) models, an important decision when modeling a speaker with a GMM or other models is the number of mixtures or parameters needed. In general, the amount of data available for training plays a crucial role in defining the number of parameters of a model: with little data, it is impossible to create good models if the model has many parameters. On the other hand, if we have plenty of data and as many parameters as we want, we encounter the problem of overfitting and the model does not generalize well. This topic is addressed in most pattern recognition books; see for instance [36]. In [37], this problem is analyzed and the number of frames needed to create a model is determined using the so-called "cluster complexity ratio", a parameter that relates the number of frames of data available to the number of mixtures in the GMM that models this data. After each change in the amount of data assigned to a cluster due to segmentation, a new number of mixtures is defined that is related to the number of frames now assigned to the new model. Some positive results were obtained in single distant microphone (SDM) experiments with a database of 16 meetings in the development set and 10 meetings in the test set (relative improvements in the diarization error rate (DER) of 2.9% for the development set and 19.39% for the test set). However, in new experiments with a bigger development set (24 meetings) and a new test set of 8 meetings (from the NIST Rich Transcription Evaluation Campaign in 2006 (RT06)), tested in both the SDM and MDM scenarios, the DER improved by only 2.7% relative for the development set and not at all for the test set [32]. Contradictory results are again obtained in [38], in which the SDM results on the test set do not improve but degrade by 17.5% relative in the DER.
Furthermore, their procedure does not take into account the overfitting issue because more frames, even if they do not add new information, are modeled with more parameters and the speaker model may overfit and not generalize sufficiently. Other researchers [39] have demonstrated that the number of Gaussians used to model a speaker is important in the creation of a good segmentation. Their experiments include a consensus based on different models each trained with a different number of Gaussians.
The problem of selecting the number of parameters is also important when mixing acoustic features with delay features in a weighted model [40]. The delay features do not need as many parameters as the spectral features since their dimensionality is usually lower and should not receive the same treatment.
The objective of this paper is to study the complexity of the models in the context of MDM meeting diarization, carry out a thorough analysis of it, propose two parameters and their interrelation for solving the problem, and draw justified conclusions. Such a study has not been done before. Furthermore, we propose a new strategy to prevent overfitting and save computation time without significantly decreasing performance. A preliminary analysis of our methodology applied to single-channel recordings is also presented.
The paper is organized as follows. In Section 2, the baseline system is described. In Section 3, the database used for experiments is explained. In Section 4, we present the analysis of the problem of the transition penalty when segmenting speakers. In Section 5, we introduce the second objective: how to select the right model for a speaker. Section 6 merges the approaches of Sections 4 and 5. Section 7 presents results for the best systems with a publicly available database and a set of comparisons of our results with other published data. Finally, Section 8 presents the discussion and Section 9 our conclusions.
2 Description of the baseline system

Introduction
The architecture of the system is presented in Fig. 1. Every microphone produces a signal that is filtered to suppress some channel noise. After that, a module calculates the time difference of arrival (TDOA) between two signals; in our case, these signals are the outputs of the microphones. The method used is the generalized cross-correlation (GCC) method [41]. The cross-correlations between every pair of channels are calculated as well, and the channel with the highest cross-correlation is used as a reference [42]. The next step is the creation of a beamformed signal by delaying and summing the signals coming from the different microphones (weighted sum).
The Mel-frequency cepstral coefficients (MFCC) are extracted from the beamformed signal every 10 ms using a window width of 30 ms. These coefficients form what we call the mfcc vector. The beamformed signal is also processed by a voice activity detector (VAD) that classifies speech frames versus non-speech frames using a two-model Gaussian mixture model (GMM) and Viterbi resegmentation [5]. The output of the VAD module is fed into the agglomerative clustering module.
The localization feature estimation creates a vector of TDOAs for each 10 ms frame. This vector is obtained by choosing an optimized set of channel pairs and calculating a TDOA for every pair. The concatenation of the TDOAs forms what we call the tdoa vector. Several methods were tried and tested in order to find the optimum representation of the localization features, including principal component analysis (PCA) transformations and the use of cross-correlation as a measure of quality. The method finally used was selection through cross-correlation between channels; see [43].
Both mfcc vectors and tdoa vectors are fed to the next block which is the segmentation and agglomerative clustering of speech regions. This block has several parts (see Fig. 2). There is an initialization module that creates a first set of segments based on a maximum number of clusters (speakers) L (we use the maximum number of expected speakers). The full recording (only the speech part) is divided uniformly into L parts.
Each cluster is modeled by a Gaussian mixture model (GMM). There is a minimum duration per cluster, typically 2.5 s (see Fig. 3) [22]. The minimum duration per cluster is determined empirically. Due to this minimum duration, short interjections such as "yes," "yeah," and "no" will be ignored by the system. However, the scoring mechanism does not ignore such words, so they are counted as errors in our system. The handling of short words or affirmations is one of the drawbacks of our method. The GMM initially consists of a minimum number of components: 5 for the mfcc vector and 1 for the tdoa vector.
The next module is the segmentation and training module. The recording is segmented by the Viterbi algorithm using the original cluster models. After segmentation, the models are retrained, followed by a new segmentation. This process is repeated several times (from 3 to 5).
The next module is "Cluster pair comparison". Every combination of two clusters is compared to determine whether they should be merged. If the stopping criterion is not met, a pair of clusters is selected to be merged; the number of components in the merged cluster is then the sum of the components of the individual clusters. When the stopping criterion is met, the process ends. The number of components of any cluster model therefore depends on the number of times that this cluster has participated in a merging, regardless of the duration of the final cluster once the resegmentation has been carried out. We use the ΔBIC measure to decide if any merging is still possible (see Eq. 1) [27]. Notice that there is no penalty term λ in the BIC score because there is no difference in the number of parameters between the two modeling hypotheses, as shown in [22]. In Eq. 1, X represents the full recording, X_A the part of the recording assigned to speaker A, and X_B the part assigned to speaker B.
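Eq. 1 itself does not survive in this text. A form consistent with the description (no λ penalty term, merged model with the summed number of components) is sketched below. The sign convention, the symbols Θ, Θ_A, and Θ_B for the merged and individual cluster models, and the reading of X as the data covered by the merged model are assumptions, not taken from the source:

```latex
\Delta\mathrm{BIC}(A,B) \;=\; \log p(X \mid \Theta) \;-\; \log p(X_A \mid \Theta_A) \;-\; \log p(X_B \mid \Theta_B)
```

Under this assumed convention, a positive value indicates that a single merged model explains the data at least as well as the two separate models, so the pair is a candidate for merging.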
The combination of the mfcc vector and the tdoa vector is made using the methodology presented in [40].
We apply a weight factor to the mfcc vector of 0.85 as in [43] since we use the same set of localization features.
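The stream weighting can be illustrated with a minimal sketch, assuming that the combination in [40] is a weighted sum of the per-stream log-likelihoods with weights adding up to 1. The function name and the sample values are hypothetical.

```python
def combined_log_likelihood(ll_mfcc, ll_tdoa, w_mfcc=0.85):
    """Weighted sum of the two stream log-likelihoods for one frame and cluster.

    Assumption: the tdoa stream receives the remaining weight, 1 - w_mfcc.
    """
    if not 0.0 <= w_mfcc <= 1.0:
        raise ValueError("w_mfcc must lie in [0, 1]")
    return w_mfcc * ll_mfcc + (1.0 - w_mfcc) * ll_tdoa


# Hypothetical per-frame log-likelihoods, for illustration only.
print(combined_log_likelihood(ll_mfcc=-60.0, ll_tdoa=-4.0))
```

With a weight of 0.85, the acoustic stream dominates the decision while the localization stream acts as a correction.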

Baseline segmentation method
The model of a cluster consists of a series of hidden Markov model (HMM) states that share the same GMM. The number of these states is equal to the minimum number of frames assigned to a speaker turn (250 in the baseline, equivalent to 2.5 s). Following the recommendation in [32,37], both the probability of staying in the last state (alpha) and that of jumping to another cluster (beta) are set to 1. At this point, these values can no longer be considered probabilities, since they do not add up to 1. However, when calculating the accumulated Viterbi probability, alpha and beta then do not add any extra duration model to the last state of a cluster. After the jump to another cluster, the value of beta changes to beta/M, M being the number of remaining active clusters (see Fig. 4).
This value beta/M adds a new penalization factor to a transition. Furthermore, this factor depends on the number of active clusters, which changes after every iteration of the clustering and merging process, starting from the L initial clusters and decreasing by one at each step. The factor beta/M therefore increases at each iteration (M is lower), so transitions are penalized less as clustering proceeds. This change is somewhat artificial and unrelated to the number of speakers in the recording (which is not known). This factor is not usually taken into account in classic diarization systems. Some recent research proposes methods to learn this factor [24,35]. One of the objectives of this paper is to study this factor and propose an alternative that improves the baseline system. Preliminary experiments on this topic have been presented in [44].
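To make the drift of the baseline factor concrete, the following sketch lists log(beta/M) with beta = 1 as the number of active clusters M decreases. The function name and the starting value of 16 clusters are hypothetical; the source does not state the value of L.

```python
import math


def baseline_log_weights(initial_clusters, final_clusters):
    """Return (M, log(1/M)) for M = initial_clusters down to final_clusters."""
    if final_clusters < 1 or initial_clusters < final_clusters:
        raise ValueError("need initial_clusters >= final_clusters >= 1")
    return [(m, math.log(1.0 / m))
            for m in range(initial_clusters, final_clusters - 1, -1)]


# Hypothetical run: 16 initial clusters merged down to 4.
for m, log_weight in baseline_log_weights(16, 4):
    print(f"M={m:2d}  log(beta/M)={log_weight:+.3f}")
```

The log-weight is always negative and moves toward zero as M decreases, which is the iteration-dependent behavior discussed above.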

3 Database and metrics
There has been little work in the literature on speaker diarization of meetings with multiple distant microphones since RT09. There is some new work on RT09, but it uses only one distant microphone and an oracle speech/non-speech detector [12] or assumes that the number of speakers is known a priori [45]. We do not have a training set. Our development set, used to tune hyper-parameters, consists of a subset of 12 meetings extracted from the NIST Rich Transcription 2002-2005 sets (RT02-05). This set was previously used by us in published work (devel06 in [40]). We add the RT06 and RT07 sets to form what will be called DEVELSET from now on; see Table 1. The evaluation set is RT09. The performance of the systems was calculated using the scored speaker time and the segments of the recordings officially selected by NIST for the annual evaluations. The amount of time is 15,484.34 s or 4.3 h (1,548,434 frames) for the DEVELSET and 5932.88 s or 1.64 h (593,288 frames) for the RT09 set. We include overlap regions and a forgiveness collar of 0.25 s, as in the official evaluations. The DER and the speaker error (SER) are calculated using the tools provided by NIST [46]. We focus primarily on the SER since the missed speaker error (MISS) and the false alarm error (FA) are fixed in all our experiments. The DER is also presented for comparison with other published works on the same data sets.
4 Segmentation independent of the number of active clusters

Statement of the problem
We have mentioned above that in the baseline, every change of speaker includes a factor 1/M, where M is the number of current clusters after the previous merging. The factor 1/M (always less than 1) decreases the probability of changing the speaker versus staying with the current speaker. An undesirable side effect is that M varies at each iteration, so the factor 1/M also varies. One might reasonably think that if M is bigger, the probability of a speaker change should be higher, but this probability is not known, nor do we attempt to use it in our system. Thus, in the absence of this information, it is not justified to take the number of remaining clusters M into account in the algorithm. Instead, we use a penalizing or regularizing factor similar to the one that weighs the language model against the acoustic model in speech recognition.
When a Viterbi segmentation is carried out, there is an accumulated log-likelihood associated with the last sub-state of each cluster, which is the accumulated sum of log-likelihoods corresponding to the previous frames. A speaker turn takes place when the left-hand side of formula (2) is lower than the right-hand side, where logL() is the log-likelihood, K the transition weight (1/M in the baseline system), cl_u the candidate cluster, cl_j the current cluster, and fr_i the frame being evaluated. The left-hand side is the sum of the log-likelihoods of the last "minimum duration" frames if they belong to the current cluster. The right-hand side is the same sum if the frames belong to a different cluster, plus the log of the transition weight K. Every increase in the transition weight favors a speaker turn, since the condition in (2) is more easily met. On the other hand, if K is very small (much smaller than one if M is big), the transition becomes more difficult, because log(K) is negative and a quantity is effectively subtracted from the right-hand side. In summary, the current formula includes a factor that has no relation to the current acoustics and is somewhat arbitrary.
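Formula (2) itself is not reproduced in this text. From the description above it can be written as follows, where t (the frame at which the decision is made) and D (the minimum duration in frames) are notation assumed here:

```latex
\sum_{i=t-D+1}^{t} \log L(fr_i \mid cl_j) \;<\; \sum_{i=t-D+1}^{t} \log L(fr_i \mid cl_u) \;+\; \log K \tag{2}
```

With K = 1 the last term vanishes and the decision depends on the acoustics alone; K > 1 adds a positive term that favors the turn, and K < 1 adds a negative term that penalizes it.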
In the baseline system, at the beginning of the agglomerative clustering, M is large, K is small, and log(K) is strongly negative, so the condition in (2) is harder to meet and transitions are penalized; at the end of the iterations, however, M is much lower, so transitions are penalized much less. This undesired effect is the one that we want to eliminate.
In order to do so, we propose a set of experiments with variations of this factor, but independently of the number of current active clusters. We will also experiment with high values of K, thus favoring transitions between speakers. The case of K = 1 (the change of speaker determined only by the acoustics) will also be tested.
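The decision rule can be sketched as follows, assuming that the per-frame log-likelihoods of the last minimum-duration window are already available for the current and the candidate cluster. The function name and the sample values are hypothetical and are not taken from the experiments.

```python
import math


def turn_is_taken(ll_current, ll_candidate, k):
    """Check the turn condition: sum under the current cluster is lower than
    the sum under the candidate cluster plus log(K)."""
    if len(ll_current) != len(ll_candidate):
        raise ValueError("both windows must cover the same frames")
    if k <= 0:
        raise ValueError("the transition weight K must be positive")
    return sum(ll_current) < sum(ll_candidate) + math.log(k)


# Hypothetical 200-frame window in which the candidate fits slightly worse.
current = [-50.0] * 200
candidate = [-50.005] * 200
for k in (1.0 / 16, 1.0, 3.0):
    print(f"K={k:.4f} -> turn taken: {turn_is_taken(current, candidate, k)}")
```

Because K is passed in as a constant, the same threshold applies at every clustering iteration, in contrast to the baseline choice K = 1/M.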

Experiments
As explained before, we focus on the SER. The DER comprises the SER plus MISS, the time in which the system fails to find or identify a speaker (in our system, each overlap region contributes one or more errors, since we deliver just one speaker hypothesis), and FA, the time in which the system proposes a speaker when there is silence (the VAD module is responsible for this error). The VAD module also contributes to the MISS error (there is a true speaker, but the VAD module classifies the region as non-speech). Since we neither change the VAD module nor do any overlap handling, the MISS error plus the FA error is 7.44% in our DEVELSET in all the experiments. In our test set (RT09), the MISS error plus the FA error is 8.70% in all cases. We give these two values so that the DER can be calculated for comparison purposes. We use a no-score collar of 0.25 s at speaker boundaries, as usual in standard Rich Transcription (RT) evaluations. We use a weight factor of 0.85 for the mfcc vector and 0.15 for the tdoa vector.
Figure 5 represents the speaker error versus the transition weight K in formula (2) for different values of the minimum duration of a speaker's turn. The baseline SER is also shown. Analyzing the results, we notice a large dispersion across the K values and across the minimum duration values. For a minimum duration of 200 frames (2 s), the new methodology improves on the baseline for a wide range of values of K and, at the same time, is less dependent on the value of K. It is interesting to note that for K < 1, the results are less stable than for K > 1. We cannot find a good reason to justify this; it depends very much on the acoustics and on the type of meetings and speakers at each meeting. However, we have used many different meetings in different rooms, so the experimental results are solid.
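Since MISS plus FA is constant in these experiments, the DER follows from the SER by adding a fixed offset per data set. A minimal sketch of that arithmetic, with a hypothetical function name and a hypothetical SER value:

```python
# MISS + FA in percent, as reported in the text for each data set.
MISS_PLUS_FA = {"DEVELSET": 7.44, "RT09": 8.70}


def der_from_ser(ser_percent, dataset):
    """DER (percent) = SER (percent) + fixed MISS + FA of the data set."""
    return ser_percent + MISS_PLUS_FA[dataset]


# Hypothetical SER value, for illustration only.
print(f"{der_from_ser(10.0, 'RT09'):.2f}")
```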
The fact that the results are less stable for K < 1 gives us a good reason not to rely on K = 1/M, which is even more unstable since it depends on the iteration of the algorithm. Remember that in the baseline, K varies at each iteration and is less than one. Two conclusions can be drawn from the figure. The first is that K should not follow the previous strategy (changing with the number of active speakers) but should be independent of it. The second is that the parameter "minimum duration" interacts with K, so both parameters should be explored to find an optimum.
In Table 2, the performance on both the DEVELSET and the test set (RT09) is presented for different values of K. Since the minimum duration is 250 in the baseline and 200 in the new methodology, we have also included the baseline with a minimum duration of 200 in the table. In the baseline system, there is no significant difference between minimum durations of 250 and 200. It can be observed that for the development set, any value of K between 1 and 4 with a minimum duration of 200 is better than the baseline with a minimum duration of either 250 or 200, demonstrating the validity of our approach. On the test set, every value of K with the exception of K = 1 improves on the baseline. The best setting on the DEVELSET, K = 3 or 4, also improves the baseline results. We can conclude that the new methodology delivers better results than the baseline methodology.
It is interesting to note that with K > 1, we are favoring speaker changes, while in the baseline, K is always less than 1, thus penalizing speaker changes. Besides finding that favoring speaker changes is better in our experiments, we have eliminated the somewhat arbitrary variations of K that, in the baseline, depend on the iteration of the algorithm and the number of active speakers at each iteration.
One important characteristic of speaker diarization for meetings is that the results across different rooms, microphone locations, numbers of microphones, numbers of speakers, etc., are very unstable. Some are very good, but others are very poor [47]. Thus, the best way to demonstrate technological improvements is to test the system with as many recordings as possible. We have tested the system with 28 meetings for development and 9 for test, so our experimentation is ample. Furthermore, the data that we use belong to a community standard and can be compared with the results of other researchers.

Experiments with a single channel
In order to check whether the previous method works for single-channel recordings, we have selected the mfcc vector coming from the acoustic fusion (see Fig. 1) and discarded the tdoa feature stream. In this case, the diarization is similar to using a single-microphone recording. Figure 6 represents the speaker error across different values of the transition weight and different minimum duration values. It can be seen that there are several values below the baseline of 8.98% SER, which supports the validity of our proposal: the baseline uses a transition weight dependent on the number of remaining clusters, but a constant transition weight improves the SER. However, in this case the minimum SER values are located at different working points. Two minima can be considered: one at a minimum duration of 350 frames and a transition weight of 0.001, with an 8.29% SER (an improvement of 8.3% relative), and another at a minimum duration of 400 frames and a transition weight of 0.01, with an 8.39% SER (an improvement of 7.0% relative). Table 3 presents the results obtained at these working points for the test set. Both points improve on the baseline of the system, by 42% relative SER and 7.0% relative SER, respectively, a noticeable improvement in the first case. The optimum working point with a single channel differs substantially from the optimum working point obtained previously (minimum duration of 200 and transition weight of 3). The conclusion that we draw from this result is that the method is also valid for a single channel, but the parameters should be tuned for each case. The minimum duration and the transition weight interact with each other in the system, and they cannot be determined universally but only through an empirical study.

Introduction to the problem
In the baseline system, when merging two clusters, the ΔBIC distance used to determine whether the clusters should be merged eliminates the need for the adjustable λ parameter by setting the number of Gaussians of the merged cluster to the sum of the Gaussians of the original clusters. In this way, the merged clusters have many more Gaussians regardless of their duration. But the new number of Gaussians may be too small or too big to properly model the new cluster and the remaining clusters after a segmentation step has been carried out.
In the proposal of Anguera [37], an attempt was made to solve this problem. Instead of making the number of Gaussians depend on the number of times that a cluster has been merged with another one (because the total number of parameters is kept constant after merging), the number of Gaussians is always recalculated depending on the duration of the clusters after merging and resegmenting. In this way, a small cluster could be modeled with a single Gaussian. But the proposal by Anguera does not address the problem of using many Gaussians for a long cluster, which expends a lot of resources, or the risk of overfitting the model. In this paper, we show that there is very little improvement from increasing the number of Gaussians beyond a certain limit, because even if more data were available, this data would not add new information to the model.
In Fig. 7, the normalized log-likelihood of a speaker extracted from a session of the development set using the true references is plotted versus the number of Gaussians used to train the model. The speaker has 55,114 frames (551 s). We normalize the log-likelihood by dividing the total log-likelihood by the number of frames. It can be observed that the curve has a long tail and does not improve substantially when the number of Gaussians is over 100, indicating that there is no need to use so many Gaussians to model the speaker. This is better illustrated by the derivative of the normalized log-likelihood (see Fig. 8): after a certain number of Gaussians (i.e., 100), the derivative remains approximately constant. On the one hand, we need a minimum number of Gaussians to adequately model a speaker with a certain number of frames (duration). On the other hand, we do not gain much by increasing the number of Gaussians beyond a certain value, and we could save computation by limiting their maximum number.
Figure 9 illustrates the same concept; this time the number of Gaussians is kept constant at 5 and the normalized log-likelihood is plotted against the number of frames. This figure shows that when few frames are available, 5 Gaussians is not a good choice because it distorts the model. The curve has a minimum of log-likelihood at 3674 frames (36 s), with a value at that point that is comparable to the values in the previous figure for the same number of Gaussians. Thus, 3674 frames generate a model that can be compared with other clusters. With many fewer frames, using 5 Gaussians is not appropriate, and the comparison would be biased in favor of the cluster with the smaller duration (note also that we are using logarithms, so the dynamic range of the arithmetic is lower). On the other hand, adding more frames to the model does not significantly change its log-likelihood.
Our proposal is to modify this strategy and use two new parameters to determine the number of Gaussians per model: one is the number of frames, and the other is the maximum number of Gaussians per cluster, as presented in the next section.
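The analysis of Figs. 7 and 8 can be sketched in a few lines of code. The following is only an illustration under stated assumptions: the function names, the tolerance, and all numeric values in the example curve are hypothetical and are not taken from our experiments (only the frame count of 55,114 comes from the text). It normalizes a total log-likelihood by the number of frames and looks for the number of Gaussians after which the finite-difference slope of the curve stays below a tolerance.

```python
def normalized_log_likelihood(total_log_likelihood, n_frames):
    """Per-frame log-likelihood: total log-likelihood divided by the frame count."""
    return total_log_likelihood / n_frames


def first_flat_point(n_gaussians, norm_loglik, tolerance):
    """Return the smallest number of Gaussians from which every later
    finite-difference slope of the curve is below `tolerance` in magnitude.
    Returns None if the curve never flattens."""
    slopes = []
    for i in range(1, len(n_gaussians)):
        delta_ll = norm_loglik[i] - norm_loglik[i - 1]
        delta_n = n_gaussians[i] - n_gaussians[i - 1]
        slopes.append(delta_ll / delta_n)
    # slopes[i] is the slope between point i and point i + 1.
    for i in range(len(slopes)):
        if all(abs(s) < tolerance for s in slopes[i:]):
            return n_gaussians[i]
    return None


# Hypothetical curve with a long tail (values invented for illustration only).
example_n = [1, 5, 10, 25, 50, 100, 150, 200]
example_ll = [-60.0, -55.0, -53.0, -51.5, -50.5, -50.0, -49.9, -49.8]
flat_from = first_flat_point(example_n, example_ll, tolerance=0.005)
```

With this invented curve, the slope stays below the tolerance from 100 Gaussians onward, which mirrors the qualitative behavior described for Fig. 8.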

Proposed method
We propose a method that addresses the problems mentioned in Section 1. The algorithm is presented in Fig. 10, in which the modules that change with respect to the baseline algorithm are marked with "NEW". In the newly proposed algorithm, the number of Gaussians used to model each cluster is recalculated according to formula (3), both at the initialization step and after any new segmentation. Two parameters are used: A, the minimum duration needed to train any single Gaussian, and B, the maximum number of Gaussians used to model a cluster.
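To make the role of the two parameters concrete, a minimal sketch is given here. Formula (3) is not reproduced in this section, so the rule below (one Gaussian per A seconds of speech, rounded down, at least one Gaussian and at most B) is an assumption made for illustration, as are the function name and the frame rate of 100 frames per second (inferred from 55,114 frames corresponding to 551 s).

```python
FRAMES_PER_SECOND = 100  # assumption: 55,114 frames correspond to about 551 s


def gaussians_for_cluster(n_frames, seconds_per_gaussian, max_gaussians):
    """Hypothetical selection rule: one Gaussian per `seconds_per_gaussian`
    seconds of cluster speech (parameter A), rounded down, never fewer than
    one and never more than `max_gaussians` (parameter B)."""
    duration_s = n_frames / FRAMES_PER_SECOND
    by_duration = int(duration_s // seconds_per_gaussian)
    return max(1, min(by_duration, max_gaussians))


# Recalculated at initialization and after every new segmentation.
short_cluster = gaussians_for_cluster(300, seconds_per_gaussian=7, max_gaussians=100)
medium_cluster = gaussians_for_cluster(3674, seconds_per_gaussian=7, max_gaussians=100)
long_cluster = gaussians_for_cluster(55114, seconds_per_gaussian=7, max_gaussians=50)
```

Under these assumptions, a 3674-frame cluster with A = 7 receives 5 Gaussians, a very short cluster falls back to a single Gaussian, and a long cluster is capped at B.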
In Fig. 11, the DER values for the DEVELSET for different values of the minimum number of seconds per Gaussian (A) and the maximum number of Gaussians (B) are represented together with the average of all of the values (marked "AVE"). We can see that with this algorithm, after using 50 or more Gaussians as the maximum, there are improvements in the DEVELSET. We can also observe that the minimum values are obtained with 7 s per Gaussian, which also corresponds to the minimum found in Fig. 9 for 5 Gaussians (3674 frames, i.e., about 7 s per Gaussian). The absolute minimum is found with 100 Gaussians, which corresponds to the turning point in Fig. 8 beyond which increasing the number of Gaussians does not add information to the log-likelihood. One hundred Gaussians is also the minimum of the average line (AVE) in this picture. In Fig. 12, the DER for different values of parameter "A" is represented for "B" = 100. It can be clearly seen that below 7 s per Gaussian, the results are worse than those above it, although the DER does not decrease monotonically across parameter A. It is important to highlight that the standard DER and SER measures for speaker diarization of meetings are very sensitive to errors in the final number of speakers detected. This occurs because SER is a frame-based measure, and a single error in the detected speakers may change the SER significantly, depending on the duration of that speaker's speech. The best way to obtain good conclusions in this area of research is to experiment with as many diverse meetings as possible, as mentioned before. In Fig. 13, the ratio of the computation time of the proposed method (using parameter A = 7) to the computation time of the baseline for the DEVELSET is presented versus the parameter B (maximum number of Gaussians). The computation cost increases with the maximum number of Gaussians. When there are more Gaussians to train, the algorithm takes longer.
The computation cost saturates at B = 200 because a maximum above 200 Gaussians is rarely reached. By observing Fig. 11, we can see that after 100 Gaussians, the error does not diminish. Thus, a good compromise would be to use a maximum of 100 Gaussians. In fact, to obtain a good working point, we could think of a merit factor that weights the SER by 90% and the ratio of computation time over the baseline by 10%. If we plot this merit factor against the maximum number of Gaussians (Fig. 14), we can observe a minimum at B = 100. There is another minimum at B = 50. By using a limit on the number of Gaussians, we can save 25.38% of the computation time compared to not using the limit.
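The merit factor described above can be sketched as a weighted sum. This is a minimal illustration only: the text does not specify how the SER and the time ratio are scaled before weighting, so the direct weighted sum is an assumption, and the candidate values in the example are invented for illustration and are not the values plotted in Fig. 14.

```python
def merit_factor(ser, time_ratio, ser_weight=0.9):
    """Weighted sum of the SER and the computation-time ratio over the
    baseline (lower is better). Assumes both terms are used unscaled."""
    return ser_weight * ser + (1.0 - ser_weight) * time_ratio


def best_working_point(candidates):
    """candidates maps a value of B to a (ser, time_ratio) pair and the
    function returns the B with the lowest merit factor."""
    return min(candidates, key=lambda b: merit_factor(*candidates[b]))


# Hypothetical (SER in %, time ratio) pairs per maximum number of Gaussians B.
example_candidates = {
    25: (3.5, 1.8),
    50: (2.4, 2.5),
    100: (2.2, 4.0),
    200: (2.3, 5.4),
}
chosen_b = best_working_point(example_candidates)
```

Changing `ser_weight` shifts the balance between accuracy and computation time, so a lower weight on the SER would favor the cheaper working point.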
The relative improvement in SER over the baseline in the development set is 42.09% for the pair of parameters A-B = (7-50) and 46.47% for the pair of parameters A-B = (7-100) (see Table 4). This is a very impressive result. For comparison purposes, we have calculated the SER for a subset of the development set (the RT07 set), obtaining a value of 2.1%, which is outstanding performance (remember that the MISS+FA error for RT07 is 6.82). The meetings of this subset are part of our DEVELSET and have therefore been used for training; still, we report the speaker error of this subset separately only to allow a fast comparison with other works that used this RT07 set. In Table 4, we also present the SER results for the test set RT09. Improvements are obtained for both combinations of parameters. Relative improvements in SER range from 15.36 to 17.54% for the two proposed working points. In Table 5, we present detailed results for the RT09 set meeting by meeting, both for the baseline system and for our proposed method using the optimum values (A, B) = (7-100). Four of the meetings present a decrease in SER, two others keep a similar SER, and one meeting shows an increase in SER. On average, the SER decreases, as we have already mentioned. The results obtained are among the best published to date [10,39] (8.7% of MISS+FA error must be added to obtain the DER).
In Table 6, the number of identified, missed, and false-alarm speakers is presented. The proposed method yields two more correctly identified speakers than the baseline and two fewer missed speakers, although there are three new false-alarm speakers. The influence of the new false-alarm speakers on the SER is small. This is easily explained because the SER is a time-weighted measure: the new false-alarm speakers probably speak for only a short period of time, which is not significant in the overall computation. It can also be seen in Table 5 that the meetings that have a very high SER usually also have a very high overlap error, and since we do not propose any solution for the overlaps, we cannot decrease this error with our method.
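The small influence of a short false-alarm speaker on a time-weighted measure can be illustrated with a simplified sketch. This is not the NIST scoring script: the mapping between reference and hypothesized speakers and other scoring details are omitted, and the function name and the segment durations below are hypothetical.

```python
def speaker_error_rate(segments):
    """Simplified time-weighted speaker error in percent.

    `segments` is a list of (duration_s, reference_speaker, hypothesized_speaker)
    tuples covering scored speech. Speech attributed to a speaker other than
    the reference one counts as error, weighted by its duration."""
    total_time = sum(duration for duration, _, _ in segments)
    wrong_time = sum(duration for duration, ref, hyp in segments if ref != hyp)
    return 100.0 * wrong_time / total_time


# Hypothetical meeting: a spurious speaker "C" takes 5 s out of 1000 s of speech.
meeting_with_false_alarm = [
    (600.0, "A", "A"),
    (395.0, "B", "B"),
    (5.0, "B", "C"),
]
ser_with_false_alarm = speaker_error_rate(meeting_with_false_alarm)
```

In this invented example, the extra speaker adds only 0.5% to the error, whereas a wrongly merged speaker with several minutes of speech would dominate it.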

Preliminary experiments with a single channel
In order to check how the model selection method behaves for a single channel, we have selected the mfcc vector coming from the acoustic fusion and discarded the tdoa vector. Figure 15 represents the SER for a single mfcc stream for several combinations of the parameter A and the minimum duration across different values of the maximum number of Gaussians (B). It can be noticed that the method also improves the baseline results for single-channel recordings, although the minimum SER values are obtained at slightly different parameter values. The first thing to notice is that the optimum minimum duration is now 350 frames compared to 250 frames for the baseline. This change was also noticed in Section 4.3 above in the experiments changing the transition weight. The second change is the number of seconds per Gaussian, which in this case is 11 compared to the optimum of 7 in the previous section. More seconds per Gaussian are needed in order to create good models. This parameter change may be due to the numeric interaction of the mfcc vector with the tdoa vector that only occurs when using both vector streams. The system that uses both streams has better performance and stability, and it is less sensitive to parameter variations. Both streams complement each other. Looking at Fig. 15, we find several values that improve the baseline. We can choose the values 11-350-40 and 12-350-30, which give a SER of 8.3% and 8.14%, respectively, representing 8.1% and 10.3% relative improvement over the baseline. Table 7 presents the results with the RT09 set. It can be seen that for the point 11-350-40, the relative SER improvement over the baseline is 36.75%, which is a very significant improvement. With these results, we demonstrate that the method works not only when using both mfcc and tdoa streams but also for a single mfcc stream.
6 Fusing model selection and speaker segmentation independent of the number of clusters

Experiments with two streams
After analyzing the previous results with a transition weight independent of the number of clusters and the results using a method to select an appropriate number of Gaussians per cluster, the obvious next step is to merge both methods. However, the first method, as seen in formula (2), uses a tuning parameter K to adjust the transition probabilities to penalize or favor speaker changes, in the same manner as in a speech recognition system in which the acoustics are appropriately weighted with the linguistic model probabilities in order to insert more or fewer word hypotheses. If we now consider our new method of selecting the number of Gaussians per cluster, which depends on the duration of each cluster and on a maximum number of Gaussians per cluster, the likelihoods calculated in formula (2) may vary. In fact, if the model is a better fit to the speech, the likelihoods should be greater and the transition weight could change. In the same manner, the optimum value for the minimum duration may also change.
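The effect of the tuning parameter K on the segmentation can be illustrated with a small decoding sketch. Formula (2) is not reproduced in this section, so the following is an assumption made for illustration: a plain Viterbi search in the log domain in which staying with the same speaker costs nothing and every speaker change adds log(K). The minimum-duration constraint of the real system is not modeled, and the function name and the frame log-likelihoods are hypothetical.

```python
import math


def viterbi_with_change_penalty(loglik, k):
    """Best speaker sequence when each speaker change is weighted by `k`.

    loglik[t][s] is the log-likelihood of frame t under speaker model s.
    Staying with a speaker adds 0 and changing speaker adds log(k), so a
    small k penalizes changes. Minimum duration is NOT enforced here."""
    n_speakers = len(loglik[0])
    log_k = math.log(k)
    score = list(loglik[0])
    backpointers = []
    for frame in loglik[1:]:
        new_score, pointers = [], []
        for s in range(n_speakers):
            best_prev, best_val = s, score[s]  # self-transition, no penalty
            for p in range(n_speakers):
                if p != s and score[p] + log_k > best_val:
                    best_prev, best_val = p, score[p] + log_k
            new_score.append(best_val + frame[s])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Backtrack from the best final speaker.
    state = max(range(n_speakers), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical frame scores: speaker 0 dominates except for one noisy frame.
example_loglik = [[-1, -2], [-1, -2], [-2, -1], [-1, -2], [-1, -2], [-1, -2]]
path_no_penalty = viterbi_with_change_penalty(example_loglik, k=1.0)
path_penalized = viterbi_with_change_penalty(example_loglik, k=0.01)
```

In this toy example, K = 1 lets the single noisy frame trigger two spurious speaker changes, whereas K = 0.01 keeps one speaker throughout. It also shows why the best K can shift when the models change: better-fitting models alter the likelihood differences that the penalty has to outweigh.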
In Fig. 16, we present the SER across different transition probabilities for a minimum duration of 250 (we also ran experiments with a minimum duration of 200, with worse results). In Fig. 17, we show the SER for the working point 7-50 compared with the baseline transition probability (1/M). Analyzing Figs. 16 and 17, we notice that the best K is now found at 0.01 with both systems, but the working point 7-50 is now slightly better than the 7-100 (SER 2.17% vs 2.2%, a difference that is not significant), in contrast with our previous results. What is interesting is that if we compare the results of this experiment for the working point 7-50 with those obtained with the baseline transition weight (1/M), the results improve (see Table 8). The first three rows in Table 8 reproduce the results of the previous section.
If we now analyze the results for the RT09 set, we can see that the point 7-50 with K = 0.01 is significantly better than the others and better than those using K = 1/M. The results on the test set confirm that both methods contribute to improvements in the system. The fact that the results on the DEVELSET do not change for the 7-100 system may be due to the already very low SER, which is quite difficult to decrease. In Table 9, we present the results meeting by meeting for the RT09 set. We can see that our proposed method improves all the meetings except one. This single meeting is the one that creates the biggest part of the error (SER 46.97), and it is also the meeting that has the biggest overlap error. The average SER for the rest of the meetings is comparable to or even better than the results for the baseline and better than the state of the art (see Section 7.2 below).

Preliminary experiments with the fusion system and a single stream
In this section, we present experiments fusing the transition weight scheme with the model selection scheme for a single mfcc stream. Figure 18 presents results for the DEVELSET using the fusion scheme. We can find several sets of parameters that have a SER below the baseline. Unfortunately, in our preliminary search, we could not find a minimum better than the one that we found using the model selection scheme alone. The two optimums found are not statistically different (8.22 vs 8.14 SER). The results obtained for the RT09 set for these optimums are presented in Table 10 and are not as good as the results that we get for a different minimum in the DEVELSET (i.e., the 8.29 DEVELSET SER in Table 10). The search for an optimum using the fusion scheme is more complex since there are many parameters involved. In any case, the results significantly improve on the baseline system, both in the DEVELSET and in the RT09 set. Table 11 shows detailed results by meeting, comparing the results when using the tdoa features with the case in which we use a single mfcc stream. A large performance degradation can be noticed when the tdoa information is not present. The exception is the meeting NIST 20080201-1405, whose results are very bad in both cases but are better with a single mfcc stream. We think that this is due to the speakers moving around the room, which corrupts the tdoa information.

Comparison with the AMI dataset
The databases used in this paper (devel06, rt06, rt07, and rt09) are not publicly available in full (they are only available to the institutions that participated in the corresponding competitions). In order to test our proposals with other publicly available databases, we have used a subset of the AMI meeting corpus available from the University of Edinburgh [48] just for testing, without changing the development set. The set of meetings used (specified in Table 12) includes recordings only from the Idiap Research Institute (IDIAP) site and has been used by other authors [49].
Our results with those databases using the optimum parameters obtained in our development set are presented in Table 13. The MISS+FA error is the same for every experiment and equal to 12.64%.
If we analyze the results, we can see that using the first alternative (changing just the transition cost parameter from the baseline) with the cost that was obtained in the DEVELSET database does not improve performance. This result may well be due to the mismatch between the development set and the test set. We think that the transition cost is an important parameter but that it has to be adjusted with a development set similar to the test set. In contrast, the parameters obtained with the DEVELSET database for the second approach alone (model selection) do improve the results in both cases, (7-50) and (7-100), demonstrating that the second method has obtained robust parameter settings. When the model selection mechanism is combined with the transition cost approach, there are also extra improvements for the (7-100) case. The fact that there is no improvement with the (7-50) set may again be due to the use of a different database for which no development set has been prepared; the transition cost may depend on the database and on the maximum number of Gaussians per state. In any event, these results show again that both the transition cost and the model selection are good strategies that may influence the results in a positive manner.

Comparison with other RT multiple streams published results
The best results on RT09 published up to now are the ones by Nwe et al. [39], in their Table IV. Table 14 shows their published results compared to ours. It can be noticed that our results are extremely bad for a single meeting (the NIST 20080201-1405 meeting), possibly because the speakers in the meeting move around the room. Our way of using the tdoa vector requires the speakers to stay in one place; otherwise, the tdoa vector corrupts the decision of the system. If we reported the SER excluding meeting NIST 20080201-1405, our results would be better than the published results.
Another comparison could be made if we consider the RT07 set (8 meetings). We obtained a SER of 1.84% for this subset, although these meetings were part of our 28 development set meetings. The best SER published up to now with RT07 used as a test set is 2.8% (see [39], Table III). In terms of computational complexity, our method is only 2.5 times more expensive than our baseline. The computational demand of our baseline is just that of an iterative segmenting and merging process (AHC). In contrast, the state-of-the-art system in [39] uses several steps one after the other. The first step is by itself an initial clustering process using only tdoa values but with two phases: a previous intra-pair segmentation and clustering, and a subsequent inter-pair clustering fusion. The second step is similar to ours, with a cluster modelling and cluster merging process. But the third step is quite complex since it includes 10 iterations of training and clustering runs with different settings for the number of Gaussians (55 different settings) and MAP adaptation for each run. In total, there are 550 training and clustering runs. Then, there is a process of consensus-based clustering. Although we do not have data to compare the absolute computational cost of both systems, we could say that our model is much simpler and easier to reproduce.

Comparison with an x-vector system for a single channel
With the objective of comparing our system with the recent proposals of neural network-based x-vectors, we have processed our DEVELSET and tested our RT09 set with a publicly available system that was proposed as a baseline for the Second DIHARD Challenge [4,50,51]. We used the same waveform files that we have used in our research for a single-vector stream and the voice activity detector of our system. Table 15 presents our findings. The x-vector approach consists of an x-vector extraction mechanism followed by PLDA scoring and an adaptation to the development database. The system adjusts its thresholds to the development set. While the SER results for the DEVELSET are better than our results, the results with the test database are much worse. It could be said that the x-vector system overfits the development database and has lower predictive power on the test database. Table 16 presents the results of this comparison meeting by meeting. The x-vector system is worse than ours in 5 out of 7 meetings.

DER comparison with information bottleneck [9] for a single channel
In [9], the information bottleneck principle is proposed for a single channel. Our DER for a subset of the DEVELSET, the devel06 set (see Table 17), is 12.24% compared to the weighted published result of 16.53% in [9] (see Table 1). For the RT06 set, our DER is 21.56% compared to 22.8% for their system. However, it is fair to mention that the information bottleneck method is faster than our method.

Discussion
In the first part of the paper, we have analyzed the effect of the transition cost on the SER and demonstrated that the transition cost has a strong influence on the performance of the system, both for two streams and for a single stream, although the results should be interpreted with care. The transition weight K may depend on the quality of the models (as shown in the last part of the paper) and on the minimum duration applied. We have found that the method that we were using previously (1/M) is not supported by any solid theory, as it varies during the diarization process, and that it is better to look for a good match of the transition weight for the problem at hand. In summary, the transition weight K should be adapted to the development data. The adaptation should also explore possible variations in the minimum duration applied to a speaker turn. Both adaptations should be done in tandem. The experiments done with a single stream (mfcc) also demonstrate the validity of our proposal, which improves the relative SER in the test set by 42%.
Published works on speaker diarization [39] showed some evidence that the number of parameters used to model a speaker is a significant topic. This is also known from other areas of pattern recognition such as speaker verification or speech recognition. The solution in [39] uses a consensus method based on many repetitions of the algorithm and it is very computationally demanding.
We have researched and proposed a simple modification to our previous baseline system that consistently and significantly improves the results without dramatically increasing the computation cost. Instead of defining the model only at the initialization step based on empirical data and sticking to it throughout the entire process, the number of Gaussians per cluster is recalculated at initialization and after every new segmentation. We have determined two working points of our parameters and have achieved improvements with this method in all the sets that we have used. Our algorithm has resulted in astonishing improvements, particularly for the development set, using only 2.5 times the computation time of the baseline (a 42.09% reduction in speaker error). The algorithm also provides an optimum at 4 times the computation time of the baseline (a 46.47% reduction in speaker error). The improvement in SER for the test set with the model selection technique is more modest (17.54%) but still relevant, and it demonstrates the validity of the approach.
We have also tested the model selection proposal with a single stream, obtaining improved performance over the baseline in both the development set and the test set.
When both methods are combined, the SER goes down to 2.17% and 6.09% for the development and the test set, respectively (a relative improvement of 47.20% and 22.02%). The SER obtained for a subset of the development set, the RT07 set (1.84%), is outstanding and is achieved without the need for complicated algorithms, using a very simple modification of our baseline.
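For readers who wish to check the arithmetic behind these figures, a short sketch is included here. The function names are hypothetical. The baseline SER values implied by the reported numbers are back-computed, approximate, and not stated in this section. The DER is obtained, as noted earlier for the RT09 set, by adding the MISS+FA error (8.7%) to the SER.

```python
def relative_improvement(baseline, new):
    """Relative reduction of an error rate, in percent."""
    return 100.0 * (baseline - new) / baseline


def implied_baseline(new, relative_improvement_pct):
    """Baseline error implied by a new error and its relative improvement."""
    return new / (1.0 - relative_improvement_pct / 100.0)


def der(ser, miss_plus_fa):
    """DER as the sum of the speaker error and the MISS+FA error (percent)."""
    return ser + miss_plus_fa


# Back-computed from the reported combined results (approximate values).
devel_baseline_ser = implied_baseline(2.17, 47.20)
test_baseline_ser = implied_baseline(6.09, 22.02)
# DER for RT09 with the combined methods: SER plus the 8.7% MISS+FA error.
rt09_der = der(6.09, 8.7)
```

The two helper functions are inverses of each other, so applying `relative_improvement` to a back-computed baseline returns the reported percentage.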
If we had to report on the DER obtained for the RT09 set, we should notice that it is still high, but we should be aware that a large part of it is due to overlap and MISS+FA error (5.58% overlap and a MISS+FA error of 8.7%) and that it heavily depends on one single meeting (14.76% overlap and 19.3% total MISS+FA error). One possible reason is that our method assumes that the speakers do not move from their places. Another possible reason is overlap. Our algorithm does not take overlap into account because there is only one hypothesized output for every frame. The fact that this meeting has 14.76% of overlap surely corrupts our models. The overlap error is still a problem that remains mostly unsolved [49,52,53]. Taking into account that the biggest part of our DER comes from overlap and speech/non-speech detection, our efforts should go in this direction in the future. If we reported the SER results for RT09 without taking this meeting into account, our SER results would be better than the state of the art.
We have extended our approach to a new test database (AMI), demonstrating that the proposed methods consistently improve the performance of the baseline method, although again a tuning of the K parameter is needed. Finally, we made an effort to compare our system with a more recent x-vector diarization system. While the SER of the x-vector system is lower than ours for the development set, its SER for the test set is much higher, indicating that the x-vector system does not work well with unseen data. We have also compared the results of our single-stream system with the information bottleneck system in [9], obtaining superior performance on a subset of the DEVELSET.

Conclusions
In this paper, we have demonstrated that a new transition weight and the minimum duration of a cluster are important parameters that should be explored in diarization algorithms. We have also investigated a method to automatically determine the number of Gaussians needed to model a speaker. We have established a system that takes into account both the duration of the speaker's speech and the maximum number of Gaussians used. We have added it to our current diarization algorithm, tested it, and demonstrated its value. We have obtained improvements in all sets used, both development and test, and reached relative improvements in speaker error of 17.54% for the test set and 46.47% for the development set. When looking for the optimum of these parameters, significant improvements can be made. Our final combined methods obtain 47.2% and 22.02% relative improvements in SER for the development and test set, respectively. The results obtained are particularly good with a subset of the development set, the RT07 set. Most of the remaining SER for the test set concerns a single meeting that has a lot of overlap, which corrupts our speaker models. When our methods are applied to a new publicly available database, they show an improvement in performance of 28.44% relative error against the baseline method. Preliminary experiments with a single stream (mfcc) endorse the validity of our findings. Comparisons with an x-vector system show superior performance of our system when tested on unseen data.

Methods
The aim of this study is to revise, analyze, and improve some algorithms for speaker diarization of meetings with multiple microphone recordings. The meetings are held in different places and different cities, as established in the NIST evaluations. The number of participants in each meeting varies and depends on the meeting place and date of recording. The number of participants is unknown to the algorithms, and one of the objectives of the algorithms is to discover it. The characteristics of the participants are detailed in the NIST documents, although their identity remains anonymous. All of the participants in the meetings have approved the availability of their recordings for research purposes.
The materials obtained after the recording are the files containing the digitized microphone outputs. The recordings are processed by the algorithms proposed in this paper. The statistical analysis tool to present the results is the standard evaluation script provided by NIST and it is available on their web page [46].