Research | Open Access
Latent class model with application to speaker diarization
Liang He^{1},
 Xianhong Chen^{1},
 Can Xu^{1},
 Yi Liu^{1},
 Jia Liu^{1} and
 Michael T. Johnson^{2}
https://doi.org/10.1186/s13636-019-0154-z
© The Author(s) 2019
 Received: 29 August 2018
 Accepted: 28 May 2019
 Published: 9 July 2019
Abstract
In this paper, we apply a latent class model (LCM) to the task of speaker diarization. LCM is similar to Patrick Kenny’s variational Bayes (VB) method in that it uses soft information and avoids premature hard decisions in its iterations. In contrast to the VB method, which is based on a generative model, LCM provides a framework allowing both generative and discriminative models. The discriminative property is realized through the use of the i-vector (Ivec), probabilistic linear discriminant analysis (PLDA), and a support vector machine (SVM) in this work. Systems denoted as LCM-Ivec-PLDA, LCM-Ivec-SVM, and LCM-Ivec-Hybrid are introduced. In addition, three further improvements are applied to enhance performance: (1) adding neighbor windows to extract more speaker information for each short segment, (2) using a hidden Markov model to avoid frequent speaker change points, and (3) using agglomerative hierarchical clustering for initialization, with hard and soft priors, to overcome the problem of initialization sensitivity. Experiments on the National Institute of Standards and Technology Rich Transcription 2009 speaker diarization database, under the condition of a single distant microphone, show that the diarization error rate (DER) of the proposed methods has substantial relative improvements compared with mainstream systems. Compared to the VB method, the relative improvements of the LCM-Ivec-PLDA, LCM-Ivec-SVM, and LCM-Ivec-Hybrid systems are 23.5%, 27.1%, and 43.0%, respectively. Experiments on our collected database, CALLHOME97, CALLHOME00, and the SRE08 short2-summed trial condition also show that the proposed LCM-Ivec-Hybrid system has the best overall performance.
Keywords
 Speaker diarization
 Variational Bayes
 Latent class model
 i-vector
1 Introduction
The speaker diarization task aims to address the problem of “who spoke when” in an audio stream by splitting the audio into homogeneous regions labeled with speaker identities [1]. It has wide application in automatic audio indexing, document retrieval, and speaker-dependent automatic speech recognition.
In the field of speaker diarization, the variational Bayes (VB) approach proposed by Patrick Kenny [2–5] and the VB-hidden Markov model (HMM) introduced by Mireia Diez [6] have become state-of-the-art approaches. This type of system has two characteristics. First, unlike mainstream approaches (i.e., segmentation-and-clustering approaches, discussed in the following section), it uses fixed-length segmentation instead of speaker change point detection for speaker segmentation, dividing an audio recording into uniform, short segments. These segments are short enough that each can be regarded as containing only one speaker. This type of segmentation shifts the difficulty to the clustering stage and requires a better clustering algorithm that includes temporal correlation. Second, the VB approach uses soft clustering, which avoids premature hard decisions. Despite its accuracy, the approach still has some deficiencies. VB is a single-objective method: its goal is to increase the overall likelihood, which is based on a generative model, not to distinguish speakers. Furthermore, because the segments are very short, the probability that an individual segment occurs given a particular speaker is inaccurate and may degrade system performance. In addition, some researchers have noted that the VB system is very sensitive to its initialization [7]. For example, if one speaker dominates the recording, a random prior tends to assign the segments to each speaker evenly, leading to a poor result.
In this paper, to address the drawbacks of VB, we apply a latent class model (LCM) to speaker diarization. LCM was initially introduced by Lazarsfeld and Henry [8]. It is usually used as a way of formulating latent attitudinal variables from dichotomous survey items [9, 10]. This model allows us to compute \(p(\mathcal {X}_{m}, \mathcal {Y}_{s}, i_{ms})\), the likelihood that both the segment representation \(\mathcal {X}_{m}\) and the estimated class representation \(\mathcal {Y}_{s}\) are from the same speaker, in a more flexible and discriminative way. We introduce probabilistic linear discriminant analysis (PLDA) and the support vector machine (SVM) into the computation, and propose the LCM-Ivec-PLDA, LCM-Ivec-SVM, and LCM-Ivec-Hybrid systems. Furthermore, to address the problem caused by the shortness of each segment, and in consideration of speaker temporal relevance, we take \(\mathcal {X}_{m}\)’s neighbors into account at the data and score levels to improve the accuracy of \(p(\mathcal {X}_{m},\mathcal {Y}_{s})\). A hidden Markov model (HMM) is applied to smooth frequent speaker changes. When the speakers are imbalanced, we use agglomerative hierarchical clustering (AHC) [11] to address the system’s sensitivity to initialization.
The parameter selection experiments are mainly carried out on the NIST RT09 SPKD database [12] and our collected speaker-imbalanced database. In practice, the number of speakers in a meeting or telephone call is relatively easy to obtain, so we assume that this number is known in advance. RT09 has two evaluation conditions: single distant microphone (SDM), where only one microphone channel is involved, and multiple distant microphone (MDM), where multiple microphone channels are involved. In this paper, we mainly consider the speaker diarization task under the SDM condition. We also conduct performance comparison experiments on the RT09, CALLHOME97 [13], CALLHOME00 (a subtask of NIST SRE00), and SRE08 short2-summed trial conditions. Experimental results show that the proposed method outperforms mainstream systems.
The remainder of this paper is organized as follows. Section 2 describes mainstream approaches and algorithms. Section 3 introduces the latent class model (LCM), and Section 4 realizes the LCM-Ivec-PLDA, LCM-Ivec-SVM, and LCM-Ivec-Hybrid systems. Further improvements are presented in Section 5. Section 6 discusses the differences between our proposed methods and related works. Experiments are carried out and the results are analyzed in Section 7. Conclusions are drawn in Section 8.
2 Mainstream approaches and algorithms
Speaker diarization is defined as the task of labeling speech with the corresponding speaker. The most common approach consists of speaker segmentation and clustering [1, 14].
The mainstream approach to speaker segmentation is to find speaker change points based on a similarity metric, such as the Bayesian information criterion (BIC) [15], the Kullback-Leibler divergence [16], the generalized likelihood ratio (GLR) [17], and i-vector/PLDA [18]. More recently, there are also metrics based on deep neural networks (DNNs) [19, 20], convolutional neural networks (CNNs) [21, 22], and recurrent neural networks (RNNs) [23, 24]. However, the DNN-related methods need a large amount of labeled data and might suffer from a lack of robustness when working in different acoustic environments.
In speaker clustering, the segments belonging to the same speaker are grouped into a cluster. The problem of measuring segment similarity remains the same as for speaker segmentation, and the metrics described above can also be used for clustering. Clustering strategies based on hard decisions include agglomerative hierarchical clustering (AHC) [11] and divisive hierarchical clustering (DHC) [25]. A soft-decision-based strategy is variational Bayes (VB) [5], combined with eigenvoice modeling [2]. Taking temporal dependency into account, HMMs [6] and hidden distortion models (HDMs) [26, 27] have been successfully applied to speaker diarization. There are also some DNN-based clustering strategies. In [28], a clustering algorithm is introduced by training a speaker separation DNN and adapting the last layer to specific segments. Another paper [29] introduces a DNN-HMM-based clustering method, which uses a discriminative model rather than a generative one, i.e., replacing GMMs with DNNs for the estimation of emission probabilities, achieving better performance.
Some diarization systems, based on i-vectors, VB, or DNNs, are trained in advance, rely on knowledge of the application scenario, and require a large amount of matched training data. They perform well in fixed conditions. Other diarization systems, such as BIC, HMM, or HDM, involve little prior training. They are condition independent and more robust to changes of condition, performing better when conditions such as channels, noise, or languages vary frequently.
2.1 Bottom-up approach
The bottom-up approach, often referred to as agglomerative hierarchical clustering (AHC), is the most popular one in speaker diarization [11]. This approach treats each segment, divided by speaker change points, as an individual cluster and merges a pair of clusters into a new one based on a nearest-neighbor criterion. This merging process is repeated until a stopping criterion is satisfied. To merge clusters, a similarity function is needed. When clusters are represented by a single Gaussian or, sometimes, a Gaussian mixture model (GMM), the Bayesian information criterion (BIC) [30–32] is often adopted. When clusters are represented by i-vectors, the cosine distance [33] or probabilistic linear discriminant analysis (PLDA) [34–37] is usually used. The stopping criterion can be based on a threshold or, alternatively, on a pre-assumed number of speakers [38, 39].
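As an illustration, the merging loop can be sketched as follows. The average-linkage choice and the toy similarity values are our own assumptions; the similarity function itself (BIC, cosine, or PLDA scores) is left abstract:

```python
import numpy as np

def ahc(similarity, num_speakers):
    """Bottom-up (agglomerative) clustering sketch: repeatedly merge the
    most similar pair of clusters until the assumed speaker count is
    reached. `similarity` is a symmetric matrix of pairwise segment
    similarities (e.g., BIC or PLDA scores); higher means more alike."""
    clusters = [[i] for i in range(len(similarity))]
    while len(clusters) > num_speakers:
        best, pair = -np.inf, None
        # find the pair of clusters with the highest average linkage
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = np.mean([similarity[i][j]
                             for i in clusters[a] for j in clusters[b]])
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a].extend(clusters.pop(b))   # merge cluster b into a
    return clusters
```

With a pre-assumed number of speakers as the stopping criterion, the loop ends when that many clusters remain; a threshold on `best` would be the alternative stopping rule mentioned above.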
The bottom-up approach is more sensitive than the top-down approach to nuisance variation, such as speech channel, speech content, or noise [40]. A similarity function that is robust to these nuisance variations is crucial to this approach.
2.2 Top-down approach
The top-down approach is usually referred to as divisive hierarchical clustering (DHC) [25]. In contrast with the bottom-up approach, the top-down approach first treats all segments as unlabeled. Based on a selection criterion, some segments are chosen from these unlabeled segments, attributed to a new cluster, and labeled. This selection procedure is repeated until no unlabeled segments are left or until a stopping criterion, similar to those employed in the bottom-up approach, is reached. The top-down approach was reported to give worse performance on the NIST RT database [25] and has thus received less attention. However, paper [40] makes a thorough comparative study of the two approaches and demonstrates that they have similar performance.
The top-down approach is characterized by its high computational efficiency but is less discriminative than the bottom-up approach. In addition, the top-down approach is not as sensitive to nuisance variation and can be improved through cluster purification [25].
Both approaches share a common pitfall: they make premature hard decisions, which may cause error propagation. Although these errors can be fixed by Viterbi re-segmentation in subsequent iterations [40, 41], a soft decision is still more desirable.
2.3 Hidden distortion model
Different from AHC or DHC, the HMM takes temporal dependencies between samples into account. The hidden distortion model (HDM) [26, 27] can be seen as a generalization of the HMM that overcomes some of its limitations. The HMM is based on the probabilistic paradigm, while the HDM is based on distortion theory. In an HMM, there is no regularization option to adjust the transition probabilities. In an HDM, regularization of the transition cost matrix, used as a replacement for the transition probability matrix, is a natural part of the model. Neither HMM nor HDM suffers from error propagation: they perform re-segmentation via a Viterbi or forward-backward algorithm, and each iteration may fix errors from previous loops.
2.4 Variational Bayes
The equality holds if and only if \(q(\mathcal {Y},I) = p(\mathcal {Y},I \mid \mathcal {X})\). VB assumes a factorization \(q(\mathcal {Y},I) = q(\mathcal {Y})\, q(I)\) to approximate the true posterior \(p(\mathcal {Y},I \mid \mathcal {X})\) [2]. Then, \(q(\mathcal {Y})\) and \(q(I)\) are iteratively refined to increase the lower bound of \(\log p(\mathcal {X})\). The final speaker diarization labels can be assigned according to the segment posteriors [2]. The implementation of the VB approach is shown in Algorithm 1. Compared with the bottom-up or top-down approach, the VB approach uses a soft decision strategy and avoids premature hard decisions.
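The lower bound referred to above is the standard variational decomposition; written out (our reconstruction from the surrounding text, not a formula quoted from [2]):

```latex
\log p(\mathcal{X})
  = \underbrace{\mathbb{E}_{q(\mathcal{Y},I)}\!\left[\log\frac{p(\mathcal{X},\mathcal{Y},I)}{q(\mathcal{Y},I)}\right]}_{\text{lower bound } \mathcal{L}(q)}
  + \underbrace{D_{\mathrm{KL}}\!\big(q(\mathcal{Y},I)\,\big\|\,p(\mathcal{Y},I\mid\mathcal{X})\big)}_{\geq 0}
```

Since the KL term is nonnegative, raising \(\mathcal{L}(q)\) under the factorization \(q(\mathcal{Y},I)=q(\mathcal{Y})\,q(I)\) pushes \(q\) toward the true posterior; the bound is tight exactly when the KL term vanishes, which is the equality condition stated above.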
3 Latent class model
1. The objective function is factorized as
$$ \begin{aligned} \sum\limits_{m=1}^{M} \log \sum\limits_{s=1}^{S} p(\mathcal{X}_{m}, \mathcal{Y}_{s}, i_{ms}) & = \sum\limits_{m=1}^{M} \log \sum\limits_{s=1}^{S} p(\mathcal{X}_{m}, \mathcal{Y}_{s})\, p(i_{ms} \mid \mathcal{X}_{m}, \mathcal{Y}_{s}) \\ & = \sum\limits_{m=1}^{M} \log \sum\limits_{s=1}^{S} p(\mathcal{X}_{m}, \mathcal{Y}_{s})\, q_{ms} \end{aligned} $$(3)
In this step, \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\) is assumed to be known. We use \(q_{ms}\) to denote \(p(i_{ms} \mid \mathcal {X}_{m}, \mathcal {Y}_{s})\) for simplicity. Note that \(q_{ms} \geq 0\) and \({\sum \nolimits }_{s=1}^{S} q_{ms} = 1\). Equation (3) is optimized using Jensen’s inequality and the Lagrange multiplier method. The updated \(q_{ms}^{(u)}\) is
$$ q_{ms}^{(u)} = \frac{q_{ms}\, p(\mathcal{X}_{m}, \mathcal{Y}_{s})} {{\sum\nolimits}_{s'=1}^{S} q_{ms'}\, p(\mathcal{X}_{m}, \mathcal{Y}_{s'})} $$(4)
The explanation for step 1 is that \(q_{ms}\) is updated, given that \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\) is known.
2. The objective function is factorized as
$$ \begin{aligned} \sum\limits_{m=1}^{M} \log \sum\limits_{s=1}^{S} p(\mathcal{X}_{m}, \mathcal{Y}_{s}, i_{ms}) & = \sum\limits_{m=1}^{M} \log \sum\limits_{s=1}^{S} p(i_{ms})\, p(\mathcal{X}_{m}, \mathcal{Y}_{s} \mid i_{ms}) \\ & \approx \sum\limits_{m=1}^{M} \log \sum\limits_{s=1}^{S} q_{ms}\, p(\mathcal{Y}_{s})\, p(\mathcal{X}_{m} \mid \mathcal{Y}_{s}, i_{ms}) \end{aligned} $$(5)
There are two approximations in this step. First, we use the posterior probability \(q_{ms}\) from step 1 as the prior probability \(p(i_{ms})\) in this step. Second, \(p(\mathcal {Y}_{s} \mid i_{ms}) = p(\mathcal {Y}_{s})\) is assumed. In our understanding, \(\mathcal {Y}_{s}\) is the speaker representation and \(i_{ms}\) is the indicator between segment and speaker; since \(\mathcal {X}_{m}\) is not referenced, \(\mathcal {Y}_{s}\) and \(i_{ms}\) are assumed to be independent of each other. A similar explanation is given in Kenny’s work; see (10) in [2]. The goal of this factorization is to put \(\mathcal {Y}_{s}\) in the position of a parameter, which provides a way to optimize it. This step estimates \(\mathcal {Y}_{s}\), given that \(p(i_{ms})\) is known.
3. The objective function is factorized as
$$ \begin{aligned} \sum\limits_{m=1}^{M} \log \sum\limits_{s=1}^{S} p(\mathcal{X}_{m}, \mathcal{Y}_{s}, i_{ms}) & = \sum\limits_{m=1}^{M} \log \sum\limits_{s=1}^{S} p(i_{ms})\, p(\mathcal{X}_{m}, \mathcal{Y}_{s} \mid i_{ms}) \\ & \approx \sum\limits_{m=1}^{M} \log \sum\limits_{s=1}^{S} q_{ms}\, p(\mathcal{X}_{m})\, p(\mathcal{Y}_{s} \mid \mathcal{X}_{m}, i_{ms}) \end{aligned} $$(6)
There are also two approximations in this step. First, we again use the posterior probability \(q_{ms}\) from step 1 as the prior probability \(p(i_{ms})\). Second, \(p(\mathcal {X}_{m} \mid i_{ms}) = p(\mathcal {X}_{m})\) is assumed. In our understanding, \(\mathcal {X}_{m}\) is the segment representation and \(i_{ms}\) is the indicator between segment m and speaker s; since \(\mathcal {Y}_{s}\) is not referenced, \(\mathcal {X}_{m}\) and \(i_{ms}\) are assumed to be independent of each other. The explanation for step 3 is that \(p(\mathcal {X}_{m}, \mathcal {Y}_{s} \mid i_{ms})\) is calculated, given that \(p(i_{ms})\) and \(\mathcal {Y}_{s}\) are known. We compute the posterior probability \(p(\mathcal {Y}_{s} \mid \mathcal {X}_{m}, i_{ms})\) rather than \(p(\mathcal {X}_{m} \mid \mathcal {Y}_{s}, i_{ms})\) to approximate \(p(\mathcal {X}_{m}, \mathcal {Y}_{s} \mid i_{ms})\), because this factorization takes advantage of the S-speaker constraint. In the next loop, \(p(\mathcal {X}_{m}, \mathcal {Y}_{s} \mid i_{ms})\) is used as the approximation of \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\) and the procedure returns to step 1; see Fig. 1.
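The step-1 update (4) is straightforward to sketch numerically. The similarity values below are hypothetical stand-ins for \(p(\mathcal{X}_m, \mathcal{Y}_s)\), not outputs of any model:

```python
import numpy as np

def update_q(q, p_xy):
    """Step-1 update of Eq. (4): reweight each row of q by the current
    joint likelihoods p(X_m, Y_s) and renormalize over the S speakers."""
    unnorm = q * p_xy                              # q_ms * p(X_m, Y_s)
    return unnorm / unnorm.sum(axis=1, keepdims=True)

q = np.full((3, 2), 0.5)                           # M=3 segments, S=2 speakers, flat prior
p_xy = np.array([[0.9, 0.1],
                 [0.8, 0.2],
                 [0.2, 0.8]])                      # hypothetical similarities
q = update_q(q, p_xy)                              # rows now sum to 1
```

Each row of the updated matrix is a valid posterior over speakers, which is exactly the constraint \(q_{ms} \geq 0\), \(\sum_s q_{ms} = 1\) stated above.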

Although the form of the objective function (\(\arg\max_{Q,\mathcal {Y}} \log p(\mathcal {X}, \mathcal {Y}, I)\)) is the same in these three steps, the prior setting, the factorized objective function, and the variables to be optimized are different; see Table 1 and Fig. 1. This will be further verified in the next section.

The connections between step 1 and steps 2 and 3 are \(p(i_{ms})\) and \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\); see the upper left text box in Fig. 1. We use the posterior probabilities (\(p(i_{ms} \mid \mathcal {X}_{m}, \mathcal {Y}_{s})\) and \(p(\mathcal {X}_{m}, \mathcal {Y}_{s} \mid i_{ms})\)) from the previous step or loop as the prior probabilities (\(p(i_{ms})\) and \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\)) in the current step or loop.

The main difference between step 2 and step 3 is whether \(\mathcal {Y}_{s}\) is known; see the lower left text box in Fig. 1. The goal of step 2 is to make a more accurate estimate of the speaker representation, while the goal of step 3 is to compute \(p(\mathcal {X}_{m},\mathcal {Y}_{s} \mid i_{ms})\) in a more accurate way. The explicit functions in step 2 and step 3 can be different as long as \(\mathcal {Y}_{s}\) is the same.

Is a unified objective function necessary? No. Of course, a unified objective function is more rigorous in theory, e.g., VB [2]. In fact, we can use the above model to explain the VB in [2]: Equations (15), (19), and (14) in [2] correspond to steps 1, 2, and 3, respectively^{1}. However, since the prior setting in each step is different, as stated in Table 1, we can take advantage of this to make a better estimation or computation. For example, we have two additional ways to improve \(p(\mathcal {Y}_{s}, \mathcal {X}_{m} \mid i_{ms})\) in step 3, compared with VB. First, Eq. (14) in [2] is the eigenvoice scoring, given that \(\mathcal {X}_{m}\) and \(\mathcal {Y}_{s}\) are known, which can be further improved by a more effective scoring method, e.g., PLDA. Second, there is an S-class constraint, turning the open-set problem into a closed-set problem.

Is the loop guaranteed to converge? No. Since the estimation of \(\mathcal {Y}_{s}\) and the computation of \(p(\mathcal {X}_{m},\mathcal {Y}_{s} \mid i_{ms})\) are choices of the designer, the loop will not converge for some poor implementations. But if \( p^{u}(\mathcal {X}_{m},\mathcal {Y}^{u}_{s^{*}} \mid i_{ms^{*}}=1) > p(\mathcal {X}_{m},\mathcal {Y}_{s^{*}} \mid i_{ms^{*}}=1)\) is satisfied (monotonic increase with an upper bound), the loop will converge to a local or global optimum. The starred notation denotes the ground truth. \(\mathcal {Y}\) with a superscript u means the updated \(\mathcal {Y}\) in step 2, and p with a superscript u means another (or updated) similarity function in step 3. This also implies that we have two ways to optimize the objective function: one is to use a better \(\mathcal {Y}\) (e.g., the updated \(\mathcal {Y}\) in step 2), and the other is to choose a more effective similarity function.

Do the converged results conform to the diarization task? The Kullback-Leibler divergence between Q and I is \(D_{\text {KL}}(I \,\|\, Q) = - {\sum \nolimits }_{m=1}^{M} \log q_{ms^{*}}\). Minimizing the KL divergence between Q and I is equivalent to maximizing \({\sum \nolimits }_{m=1}^{M} \log q_{ms^{*}}\). According to (3), \(q_{ms}\) depends on \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\). If \(p(\mathcal {X}_{m}, \mathcal {Y}_{s^{*}}) > p(\mathcal {X}_{m}, \mathcal {Y}_{s'})\) for \(s^{*} \neq s'\) (where \(\phantom {\dot {i}\!}i_{ms^{*}}=1\) is the ground truth), the converged results will satisfy the diarization task.

In addition to the explicit unknowns Q and \(\mathcal {Y}\), the unknown factors also include implicit functions, e.g., \(p(\mathcal {X}_{m},\mathcal {Y}_{s} \mid i_{ms})\) in steps 2 and 3. These implicit functions are statistical models selected by the designer in the implementation. What we want to emphasize is that we can optimize the parameters of an already-selected function, and we can also optimize by choosing a more effective function based on the known setting, e.g., moving from eigenvoice to PLDA or SVM scoring.
Table 1 Settings for LCM in each step

Step | Prior setting | Factorized objective function | To be updated
1 | \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\) | \({\sum \nolimits }_{m=1}^{M} \log {\sum \nolimits }_{s=1}^{S} p(\mathcal {X}_{m}, \mathcal {Y}_{s})\, q_{ms} \) | \(q_{ms}\)
2 | \(\mathcal {X}_{m}, q_{ms}\) | \({\sum \nolimits }_{m=1}^{M} \log {\sum \nolimits }_{s=1}^{S} q_{ms}\, p(\mathcal {X}_{m} \mid \mathcal {Y}_{s}, i_{ms})\, p(\mathcal {Y}_{s}) \) | \(\mathcal {Y}_{s}\)
3 | \(\mathcal {X}_{m}, q_{ms}, \mathcal {Y}_{s}\) | \({\sum \nolimits }_{m=1}^{M} \log {\sum \nolimits }_{s=1}^{S} q_{ms}\, p(\mathcal {Y}_{s} \mid \mathcal {X}_{m}, i_{ms})\, p(\mathcal {X}_{m}) \) | \(p(\mathcal {X}_{m}, \mathcal {Y}_{s} \mid i_{ms})\)
4 Implementation
1. In VB, \(\mathcal {X}_{m}\) is an acoustic feature, \(\mathcal {Y}_{s}\) is specified as a speaker i-vector, and \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\) is the eigenvoice scoring (Eq. (14) in [2]).
2. In LCM-Ivec-PLDA, \(\mathcal {X}_{m}\) is specified as a segment i-vector, \(\mathcal {Y}_{s}\) is specified as a speaker i-vector, and \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\) is calculated by PLDA.
3. In LCM-Ivec-SVM, \(\mathcal {X}_{m}\) is specified as a segment i-vector, \(\mathcal {Y}_{s}\) is specified as an SVM model trained on speaker i-vectors, and \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\) is calculated by the SVM.
In fact, computing \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\) can be regarded as a speaker verification task on short utterances, which benefits from the large number of previous studies on speaker verification.
1. The segment i-vector \(\mathrm {w}_{m}\) is extracted from \(\mathrm {x}_{m}\) and its neighbors, as further explained in Section 5.
2. The speaker i-vector \(\mathrm {w}_{s}\) is estimated based on Q={q_{ms}} and X={x_{m}}.
3. \(p(\mathcal {X}_{m}, \mathcal {Y}_{s}) = p(\mathrm {w}_{m}, \mathrm {w}_{s})\) is computed through PLDA scoring.
4. \(q_{ms}\) is updated by \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\).
The above process is repeated until the stopping criterion is met. Step 1 is a standard i-vector extraction procedure [42] and step 4 is realized by (4), so we focus on steps 2 and 3 in the following subsections.
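A minimal sketch of this four-step loop, with the i-vector extraction and scoring machinery abstracted into caller-supplied functions (both placeholders of our own, not the paper's implementation):

```python
import numpy as np

def lcm_diarize(score_fn, estimate_speaker_fn, q_init, n_iter=10):
    """Sketch of the iterative loop in Section 4.
    score_fn(speakers) -> (M, S) matrix standing in for p(X_m, Y_s);
    estimate_speaker_fn(q) -> speaker representations given posteriors q.
    q_init may come from a random or AHC prior (Section 5.3)."""
    q = np.array(q_init, dtype=float)
    for _ in range(n_iter):
        speakers = estimate_speaker_fn(q)          # step 2: re-estimate Y_s
        p_xy = score_fn(speakers)                  # step 3: p(X_m, Y_s) via PLDA/SVM
        q = q * p_xy                               # step 1, Eq. (4)
        q /= q.sum(axis=1, keepdims=True)
    return q.argmax(axis=1)                        # hard labels from posteriors
```

A toy usage would model segments as 2-D points, speakers as weighted means, and the score as a decaying function of distance; with the real system these become i-vectors, the estimate of Section 4.1, and PLDA/SVM scoring.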
4.1 Estimate speaker i-vector w_{s}
In the above estimation, T and Σ are assumed to be known. They can be estimated on a large auxiliary database in the traditional i-vector manner.
4.2 Compute \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\)
To compute \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\), we first extract the segment i-vector \(\mathrm {w}_{m}\) from \(\mathrm {x}_{m}\) and its neighbors, and then evaluate the probability that \(\mathrm {w}_{m}\) and \(\mathrm {w}_{s}\) are from the same speaker. We take advantage of PLDA and SVM to improve system performance, and propose the LCM-Ivec-PLDA, LCM-Ivec-SVM, and LCM-Ivec-Hybrid systems.
4.2.1 PLDA
4.2.2 SVM
Another discriminative option is to use a support vector machine (SVM). After the estimation of \(\mathrm {w}_{s}\), we train SVM models for all speakers. When training an SVM model (η_{s},b_{s}) with a linear kernel for speaker s, \(\mathrm {w}_{s}\) is regarded as the positive class and the other speakers \(\phantom {\dot {i}\!}\mathrm {w}_{s'}\) (s^{′}≠s) are regarded as the negative classes; η_{s} and b_{s} are the linear weight vector and bias.
where κ is also a scale factor (κ=10 in the SVM setting). As \(p(\mathcal {X}_{m})\) is the same for all S speakers and \(p(\mathcal {Y}_{s},\mathcal {X}_{m} \mid i_{ms}) = p(\mathcal {X}_{m})\, p(\mathcal {Y}_{s} \mid \mathcal {X}_{m}, i_{ms}) \), \(p(\mathcal {X}_{m})\) cancels in the subsequent computation. The flow chart of LCM-Ivec-SVM is shown in Fig. 3, without the flow path denoted as PLDA.
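A sketch of the SVM branch under stated assumptions: scikit-learn's one-vs-rest `LinearSVC` stands in for the per-speaker linear SVMs, and the softmax mapping from κ-scaled decision values to a probability over the S speakers is our illustrative choice, not the paper's Eq. (16):

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_speaker_posteriors(ivecs, labels, kappa=10.0):
    """One-vs-rest linear SVMs, one per speaker: speaker s's i-vectors form
    the positive class, all other speakers the negative class. Decision
    values are scaled by kappa and softmax-normalized over the S speakers
    (an assumed mapping to probabilities, for illustration only)."""
    svm = LinearSVC().fit(ivecs, labels)            # S one-vs-rest linear models
    d = kappa * svm.decision_function(ivecs)        # (M, S) scaled decision scores
    e = np.exp(d - d.max(axis=1, keepdims=True))    # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)
```

Because each row is normalized over the S speakers, the common factor \(p(\mathcal{X}_m)\) cancels, mirroring the argument above.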
4.2.3 Hybrid
The calculation of \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\) is not explicitly specified in the LCM algorithm, much like the kernel function in an SVM. As long as the kernel matrix satisfies the Mercer criterion [48], different choices may make the algorithm more discriminative and more generalizable, and multiple kernel learning can combine several kernels to boost performance [49]. Likewise, in the LCM algorithm, as long as \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\) satisfies the condition that the more likely \(\mathcal {X}_{m}\) and \(\mathcal {Y}_{s}\) are from the same class s, the larger \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\) is, we can adopt it, embracing more algorithms, e.g., the above-mentioned PLDA and SVM. We combine PLDA with SVM by iteration; see Fig. 3. This iteration takes advantage of both PLDA and SVM and is expected to reach better performance. This hybrid iterative system is denoted the LCM-Ivec-Hybrid system.
5 Further improvements
5.1 Neighbor window
In fixed-length segmentation, each segment is usually very short to ensure its speaker homogeneity. However, this shortness leads to inaccuracy when extracting segment i-vectors and calculating \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\). Intuitively, if a speaker s appears at time m, the speaker will, with high probability, also appear in the vicinity of time m, so neighboring segments can be used to improve the accuracy of \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\). We propose two methods of incorporating neighboring segment information. At the data level, we extract a long-term segment i-vector \(\mathcal {X}_{m}\) that uses the neighbor information. At the score level, we build a homogeneous Poisson point process model to calculate \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\).
5.1.1 Data-level window
5.1.2 Score-level window
5.2 HMM smoothing
After several iterations, speaker diarization results can be obtained according to q_{ms}. However, because sequence information is not considered in the LCM system, there might be a large number of speaker change points within a short duration. To address this frequent speaker change problem, a hidden Markov model (HMM) is applied to smooth the speaker change points. The initial probability of the HMM is \(\pi _{s} = p(\mathcal {Y}_{s})\). The self-loop transition probability is a_{ii} and the other transition probabilities are \(a_{ij} = \frac {1-a_{ii}}{S-1}, i \neq j\). Since the probability that a speaker transitions to itself is much larger than that of changing to a new speaker, the self-loop probability is set to 0.98 in our work. The emission probability is calculated based on PLDA (13) or SVM (16). With these HMM parameters, q_{ms} can be smoothed using the forward-backward algorithm.
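The smoothing step can be sketched with a scaled forward-backward pass; the emission matrix below is a hypothetical example rather than actual PLDA/SVM output:

```python
import numpy as np

def hmm_smooth(emissions, pi, a_ii=0.98):
    """Forward-backward smoothing of per-segment speaker probabilities.
    `emissions` is an (M, S) matrix of emission probabilities (e.g., from
    PLDA or SVM scoring); off-diagonal transitions share (1 - a_ii)/(S - 1)."""
    M, S = emissions.shape
    A = np.full((S, S), (1.0 - a_ii) / (S - 1))
    np.fill_diagonal(A, a_ii)
    alpha = np.zeros((M, S))
    beta = np.ones((M, S))
    alpha[0] = pi * emissions[0]
    alpha[0] /= alpha[0].sum()                     # per-step scaling for stability
    for t in range(1, M):
        alpha[t] = (alpha[t - 1] @ A) * emissions[t]
        alpha[t] /= alpha[t].sum()
    for t in range(M - 2, -1, -1):
        beta[t] = A @ (emissions[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    q = alpha * beta                               # smoothed posteriors, up to scale
    return q / q.sum(axis=1, keepdims=True)
```

With a self-loop of 0.98, an isolated one-segment dip toward another speaker is pulled back toward its temporal context, which is exactly the frequent-change suppression intended here.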
5.3 AHC initialization
Although random initialization works well in most cases, LCM and VB systems tend to assign the segments to each speaker evenly when a single speaker dominates the whole conversation, leading to poor results. From the comparative study [40], we know that the bottom-up approach captures comparatively purer models. Therefore, we recommend an informative AHC initialization method, similar to our previous paper [51]. After using PLDA to compute the log-likelihood ratio between two segment i-vectors [34, 35], AHC is applied to perform clustering. From the AHC results, two prior calculation methods, a hard prior and a soft prior, are proposed [51].
5.3.1 Hard prior
where \(\mathcal {I} (\cdot)\) is the indicator function; \(\mathcal {I} \left ({\mathcal {X}}_{m} \in s \right)\) means that segment m is classified to speaker s.
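A hard prior of this kind can be sketched as follows; the smoothing constant `eps` is our illustrative choice, not a value from the paper:

```python
import numpy as np

def hard_prior(ahc_labels, S, eps=0.1):
    """Turn AHC cluster labels into an initial posterior matrix q.
    Each segment puts mass (1 - eps) on its AHC cluster and spreads the
    remainder evenly over the other S - 1 speakers."""
    M = len(ahc_labels)
    q = np.full((M, S), eps / (S - 1))
    q[np.arange(M), ahc_labels] = 1.0 - eps
    return q
```

Keeping a small residual mass on the other speakers lets the subsequent LCM iterations move a segment away from a wrong AHC assignment rather than freezing the initialization.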
5.3.2 Soft prior
where \(d_{\max,s} = \max_{\mathrm{x}_{m} \in s}(d_{ms})\) and k is a constant. This soft prior probability varies from 0.5 to 1, ensuring that if \(\mathrm {w}_{s}\) is closer to \(\phantom {\dot {i}\!}{\mathrm {\mu }}_{\mathrm {w}_{s}}\), \(q_{ms}\) will be larger. For the other speakers at time m, the prior probability is (1−q_{ms})/(S−1).
6 Related work and discussion
6.1 Core problem of speaker diarization
where \(\mathcal {X}\) is the observed data, and \(\mathcal {Y}\) and Q are the hidden speaker representation and the latent class probability matrix, respectively. Both objective functions can solve the problem of speaker diarization. However, objective function (23) involves segmentation, which introduces a premature hard decision that may degrade system performance. Objective function (24) has difficulty with the speaker overlap problem and depends on an accurate estimate of the number of speakers.
6.2 Compared with VB
In VB, \(\mathcal {Y}_{s} \) is a speaker i-vector and \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\) is the eigenvoice scoring (Eq. (14) in [2]), a generative model. In our paper, we replace eigenvoice scoring with PLDA or SVM scoring to compute \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\), which benefits from the discriminability of PLDA or SVM. Both VB and LCM-Ivec-PLDA/SVM are iterative processes with two important steps: step 1 estimates Q based on \(\mathcal {X}\) and \(\mathcal {Y}\); step 2 estimates \(\mathcal {Y}\) based on \(\mathcal {X}\) and Q.
The two algorithms are almost the same in the second step. However, in step 1, the calculation of Q is made more accurate by introducing PLDA or SVM. In recent speaker recognition evaluations (e.g., the NIST SREs), Ivec-PLDA performed better than the eigenvoice model (or joint factor analysis, JFA) [3], and the SVM is suitable for classification tasks with small sample sizes. This is why we introduce these two methods into LCM. Compared with VB, the main benefit of LCM-Ivec-PLDA/SVM is that it takes advantage of PLDA or SVM to improve the accuracy of \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\). In addition, \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\) is enhanced by its neighbors at both the data and score levels.
6.3 Compared with Ivec-PLDA-AHC
PLDA has many applications in speaker diarization. Similar to the GMM-BIC-AHC method, the Ivec-PLDA-AHC method has become popular in many research works. This way of using the i-vector and PLDA follows the idea of segmentation and clustering: the role of PLDA is to evaluate the similarity of clusters divided by speaker change points, as done in papers [18, 34–37]. Based on the PLDA similarity matrix, AHC is applied to the clustering task. Although performance is improved, the premature hard decision problem remains.
6.4 Compared with PLDA-VB
In paper [7], PLDA is combined with VB, which is similar to our approach. We believe that the probabilistic iterative framework depicted in the LCM, and not just the introduction of PLDA, is the key to solving the speaker diarization problem. Our subsequent experiments also show that using an SVM achieves similar performance, and that the hybrid iteration inspired by the LCM improves performance further. In addition, we also study the use of neighbor information, HMM smoothing, and the initialization method.
7 Experiments
Experiments were conducted on five databases to examine the performance of LCM: NIST RT09 SPKD SDM (RT09), our own speaker-imbalanced TL database (TL), the LDC CALLHOME97 American English speech database (CALLHOME97) [13], the NIST SRE00 subset of the multilingual CALLHOME database (CALLHOME00), and the NIST SRE08 short2-summed database (SRE08). Speaker error (SE) and diarization error rate (DER) are adopted as performance metrics according to the RT09 evaluation plan [12] for the RT09, TL, CALLHOME97, and CALLHOME00 databases. Equal error rate (EER) and minimum detection cost function (MDCF08) are adopted as auxiliary metrics for the SRE08 database.
7.1 Common configuration
Perceptual linear predictive (PLP) features with 19 dimensions are extracted from the audio recordings using a 25-ms Hamming window with a 10-ms stride. The PLP features and log-energy constitute a 20-dimensional basic feature, which is concatenated with its first derivatives to form our acoustic feature vector. VAD is implemented using the frame log-energy and subband spectral entropy. The UBM is composed of 512 diagonal Gaussian components. The rank of the total variability matrix T is 300. For PLDA, the rank of the subspace matrix is 150. For segment neighbors, ΔM_{d}, ΔM_{s}, and λ are 40, 40, and 0.05, respectively.
7.2 Experiment results with RT09
The NIST RT09 SPKD database contains seven English meeting audio recordings, about 3 h in total. The BeamformIt toolkit [52] and the Qualcomm-ICSI-OGI front-end [53] are adopted for acoustic beamforming and speech enhancement. We use Switchboard P1, RT05, and RT06 to train the UBM, T, and PLDA parameters. Three sets of experiments were carried out on the RT09 database to verify, respectively, the performance of our proposed LCM systems, the use of neighbor windows, and HMM smoothing.
7.2.1 Comparison among different methods
Miss and FA of the LCM-Ivec-Hybrid system on RT09

Recording            Miss[%]  FA[%]
EDI_20071128-1000      3.64    4.81
EDI_20071128-1500      8.36    6.68
IDI_20090128-1600      4.09    1.32
IDI_20090129-1000      5.91    7.78
NIST_20080201-1405    20.01    2.54
NIST_20080227-1501     8.86    1.26
NIST_20080307-0955     5.35    2.49
Average                8.03    3.84
Experiment results (DER[%]) of different methods on RT09

Recording           Speaker #    BIC     VB   LCM-Ivec-PLDA  LCM-Ivec-SVM  LCM-Ivec-Hybrid
Given speaker #                  Yes    Yes       Yes            Yes            Yes
EDI_20071128-1000       4      29.32  10.67      9.89           9.91           9.83
EDI_20071128-1500       4      35.61  48.66     19.68          19.87          17.40
IDI_20090128-1600       4      29.12  11.15      7.02           7.14           7.14
IDI_20090129-1000       4      37.27  35.85     31.99          32.37          21.82
NIST_20080201-1405      5      61.54  49.05     44.67          43.05          38.53
NIST_20080227-1501      6      40.32  39.97     24.76          25.66          13.96
NIST_20080307-0955     11      46.62  23.50     22.86          16.44          16.00
Average                        39.97  31.26     22.98          22.06          17.81
Comparison with the performance of other works on RT09. Overlapped speech is scored and included in the error rates
7.2.2 Effect of different neighbor windows
Performance (DER[%]) of the LCM systems with or without neighbor windows

                      LCM-Ivec-PLDA              LCM-Ivec-SVM
Neighbor window      No     Data  Data+score    No     Data  Data+score
EDI_20071128-1000  10.67   10.66    9.89      10.72   10.64    9.91
EDI_20071128-1500  45.14   20.93   19.68      43.02   20.77   19.87
IDI_20090128-1600  11.38    7.04    7.02       8.06    7.61    7.14
IDI_20090129-1000  34.00   32.11   31.99      33.19   32.24   32.37
NIST_20080201-1405 49.17   49.17   44.67      44.43   43.82   43.05
NIST_20080227-1501 58.49   47.11   24.76      27.01   26.18   25.66
NIST_20080307-0955 24.91   23.52   22.86      21.85   20.44   16.44
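Score-level neighbor windowing can be sketched as follows: each segment's speaker scores are augmented by λ-weighted scores of its neighbors, so a short segment borrows speaker evidence from adjacent speech (λ = 0.05 in our configuration; the window width and function name here are illustrative):

```python
import numpy as np

def neighbor_smooth(scores, lam=0.05, width=1):
    """Score-level neighbor windowing.

    scores: (M, S) score of each segment m against each speaker s.
    Adds lam-weighted scores of the +/-`width` neighboring segments
    to each segment's own scores (original scores are used for the
    neighbors, so the result does not depend on processing order).
    """
    out = scores.copy()
    M = len(scores)
    for m in range(M):
        for d in range(1, width + 1):
            if m - d >= 0:
                out[m] += lam * scores[m - d]
            if m + d < M:
                out[m] += lam * scores[m + d]
    return out
```

Data-level windowing works analogously but earlier in the pipeline: the neighboring frames are pooled with the segment's own frames before its i-vector is extracted.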
7.2.3 Effect of HMM smoothing
Experiment results of the LCM-Ivec-PLDA system with or without HMM smoothing

                      SE[%]           DER[%]
                   no HMM   HMM    no HMM    HMM
EDI_20071128-1000    1.5     1.4     9.91    9.89
EDI_20071128-1500   29.5     4.5    44.68   19.68
IDI_20090128-1600   12.1     1.7    18.67    7.02
IDI_20090129-1000   13.4    12.1    33.28   31.99
NIST_20080201-1405  29.2    25.1    48.72   44.67
NIST_20080227-1501  14.8    13.7    26.91   24.76
NIST_20080307-0955  10.1    14.6    16.83   22.86
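HMM smoothing can be sketched as Viterbi decoding under an ergodic HMM whose self-loop probability exceeds the switching probability, which penalizes frequent speaker change points. The 0.95 self-loop value below is an illustrative assumption, not the paper's setting:

```python
import numpy as np

def viterbi_smooth(logpost, stay_logprob=np.log(0.95)):
    """HMM smoothing of per-segment speaker posteriors.

    logpost: (M, S) log posterior of speaker s at segment m.
    One HMM state per speaker; the self-loop probability exceeds the
    switch probability, so an isolated spurious label is overridden
    unless its evidence outweighs two transition penalties.
    Returns the Viterbi (smoothed) label sequence.
    """
    M, S = logpost.shape
    switch = np.log((1 - np.exp(stay_logprob)) / (S - 1))
    trans = np.full((S, S), switch)
    np.fill_diagonal(trans, stay_logprob)
    delta = logpost[0].copy()
    back = np.zeros((M, S), dtype=int)
    for m in range(1, M):
        cand = delta[:, None] + trans          # cand[from, to]
        back[m] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + logpost[m]
    path = [int(delta.argmax())]
    for m in range(M - 1, 0, -1):
        path.append(int(back[m, path[-1]]))
    return path[::-1]
```

For example, a single segment that weakly favors another speaker in the middle of a long same-speaker run is relabeled, because switching twice costs more than the small emission gain.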
7.3 Experiment results with TL
The AHC initialization aims to solve the problem of speaker imbalance. When one speaker dominates the whole conversation (more than 80% of the speech), VB and LCM are sensitive to the initialization, and random initialization results in poor performance. However, if the conversation is not speaker-imbalanced, the initialization method has little influence on performance. All experiments outside this section use random initialization.
Experiment results of AHC initialization

AHC initial   SE[%]   DER[%]
TL 7           3.0      5.9
TL 8           6.4     11.4
TL 9           7.8      9.5
Experiment results with random initialization and AHC initialization

                          SE[%]                          DER[%]
               random  hard prior  soft prior   random  hard prior  soft prior
VB
TL 7             36.9      1.7        1.9         40.1      4.9        5.2
TL 8             24.1      6.1        1.3         28.7     10.8        6.1
TL 9             30.6      6.6        1.1         32.4      8.4        2.9
LCM-Ivec-PLDA
TL 7             38.8      8.5        0.6         42.0     10.5        2.6
TL 8             32.2      2.3        0.8         36.9      7.1        5.5
TL 9             44.7      6.2        1.1         46.5      8.0        2.9
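Converting the AHC labels into the hard and soft priors compared above can be sketched as follows. A hard prior puts all probability on each segment's AHC label, while a soft prior keeps most of the mass there and spreads the remainder over the other speakers, so later iterations can still move a segment that AHC assigned wrongly. The 0.9 retained mass is an illustrative value, not the paper's setting:

```python
import numpy as np

def ahc_priors(labels, n_speakers, soft=0.9):
    """Initial segment-speaker priors from AHC cluster labels.

    labels: AHC label per segment. soft=1.0 gives a hard prior;
    soft<1.0 keeps `soft` mass on the AHC label and spreads the rest
    uniformly over the remaining speakers.
    """
    q = np.full((len(labels), n_speakers), (1 - soft) / (n_speakers - 1))
    q[np.arange(len(labels)), labels] = soft
    return q
```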
7.4 Experiment results with CALLHOME97
The LDC CALLHOME97 American English speech database (CALLHOME97) consists of 120 conversations. Each conversation is about 30 min long, of which about 10 min are transcribed; only the transcribed parts are used. There are 109, 9, and 2 conversations containing 2, 3, and 4 speakers, respectively. Following the practice of [55] and [56], only the conversations with 2 speakers are examined. We use Switchboard P1-3/Cell and SRE04-06 to train the UBM, T, and PLDA parameters.
An interesting pattern emerges. In the low-DER region (< 6%), the performance of the VB and LCM systems is similar. In the middle-to-high region (> 6%), LCM is not better than VB on every test conversation, but it yields a significant improvement on a considerable number of them; see the distribution of blue diamonds and grey triangles in Fig. 9. The same pattern is reflected in Table 3. We believe that VB is trapped in a local optimum for these conversations, whereas LCM avoids this situation by incorporating different models. In addition, the standard deviations of the DER and SE of the LCM are smaller (Table 9), indicating that the performance of the LCM system is more stable.
Table 9 compares the results. Compared with the VB system, the LCM-Ivec-Hybrid system has relative improvements of 26.6% and 17.3% in SE and DER, respectively. Compared with the other listed methods, the LCM-Ivec-Hybrid system also performs best on the CALLHOME97 database. Diarization systems based on the i-vector, VB, or LCM are trained in advance and perform well under fixed conditions, while diarization systems based on HDM require little prior training and can perform better when test conditions vary frequently.
7.5 Experiment results with CALLHOME00
CALLHOME00, a subtask of NIST SRE00, is a multilingual telephone database consisting of 500 recordings. Each recording is about 2–5 min in duration and contains 2–7 speakers. We use oracle speech activity marks and speaker numbers. As in [34, 38, 57–59], overlapping error is not counted, so the DER is identical to the SE in this section. We use Switchboard P1-3/Cell and SRE04-06 to train the UBM, T, and PLDA parameters.
Results (DER[%]) on the CALLHOME00 database
7.6 Experiment results with SRE08
The NIST SRE08 short2-summed channel telephone data consists of 1788 models and 2215 test segments. Each segment is about 5 min in duration (about 200 h in total). There is no official speaker diarization key for the summed data, so neither DER nor SE is adopted for this set of experiments. Paper [2] reports that “We see that there is some correlation between EER and DER, but this is relatively weak.” We therefore measure the effect of diarization indirectly through EER and MDCF08. On the one hand, we use the NIST official trials (short2-summed, short2-summed-eng); on the other hand, we follow the practice of [60] and construct extended trials (ext-short2-summed, ext-short2-summed-eng).
We use Switchboard P1-3/Cell and SRE04-06 to train the UBM, T, and PLDA parameters. Here, our speaker verification system is a traditional GMM-Ivec-PLDA system. The extracted 39-dimensional PLP feature consists of a 13-dimensional static feature together with its Δ and ΔΔ. A gender-independent diagonal GMM with 2048 components is used. The rank of the total variability matrix T is 600. For the PLDA, the rank of the subspace matrix is 150 [44].
Results on NIST SRE08 summed channel telephone data

Case  Trials (Ivec-PLDA)      Diarization        EER[%]   MDCF08
1     Short2-short3           –                   4.47    0.245
2     Short2-summed           –                  16.94    0.686
3     Short2-summed           –                   9.0     –
4     Short2-summed           –                   –       0.493
5     Short2-summed           VB + windows        9.64    0.410
6     Short2-summed           LCM-Ivec-Hybrid     8.71    0.374
7     Ext-short2-summed       –                  10.77    0.438
8     Ext-short2-summed       –                   4.39    0.209
9     Ext-short2-summed       VB + windows        5.48    0.228
10    Ext-short2-summed       LCM-Ivec-Hybrid     4.99    0.201
11    Short2-short3-eng       –                   1.76    0.0895
12    Short2-summed-eng       –                  14.25    0.504
13    Short2-summed-eng       –                   –       0.282
14    Short2-summed-eng       VB + windows        6.33    0.236
15    Short2-summed-eng       LCM-Ivec-Hybrid     5.62    0.245
16    Ext-short2-summed-eng   –                  10.00    0.400
17    Ext-short2-summed-eng   VB + windows        4.13    0.154
18    Ext-short2-summed-eng   LCM-Ivec-Hybrid     3.48    0.133
To the best of our knowledge, few published results report EER and MDCF08 on the short2-summed condition. We list the state-of-the-art diarization-verification systems developed by LPT [61, 62] in 2008 in Table 11, and paper [2] also presents related EERs in its Fig. 4. Compared with these, our system performs better. Part of the reason is the advance of the speaker verification system; the rest is the effectiveness of our proposed methods.
Paper [60] gives results on the extended trials, which are more convincing in our opinion. On the ext-short2-summed trials, although our EER (4.99%) is worse than their reported 4.39%, our MDCF08 (0.201) is better than their reported 0.209. Moreover, paper [60] describes a fusion system, whereas ours is a single system.
8 Conclusion
In this paper, we have applied a latent class model (LCM) to the task of speaker diarization. LCM provides a framework that allows multiple models to compute the probability \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\). Based on this framework, the LCM-Ivec-PLDA, LCM-Ivec-SVM, and LCM-Ivec-Hybrid systems are introduced. These approaches significantly outperform traditional systems.
There are five main reasons for this improvement: (1) introducing a latent class model to speaker diarization and using discriminative models in the computation of \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\), which enhances the system’s ability to distinguish speakers. (2) Incorporating temporal context through neighbor windows, which increases the speaker information extracted from each short segment. This incorporation is used both at the data level, taking \(\mathcal {X}_{m}\) and its neighbors to constitute X_{m} when extracting \(\mathcal {Y}_{m}\), and at the score level, considering the contribution of neighbors when calculating \(p(\mathcal {X}_{m}, \mathcal {Y}_{s})\). (3) Performing HMM smoothing, which takes the audio sequence information into consideration. (4) Using AHC initialization, which is crucial when the conversation is dominated by a single speaker. (5) Employing the hybrid schema, which can prevent the algorithm from falling into a local optimum in some cases.
Finally, our proposed systems achieve the best overall performance on the NIST RT09, CALLHOME97, CALLHOME00, and SRE08 short2-summed databases.
Declarations
Acknowledgements
We would like to thank the editor and anonymous reviewers for their careful work and thoughtful suggestions that have helped improve this paper substantially.
Funding
The work was supported by the National Natural Science Foundation of China under Grant No. 61403224.
Authors’ contributions
The contributions of each author are as follows: LH proposed the LCM methods, score-level windowing, and the AHC initialization for speaker diarization, realized these systems in C++ code (Aurora3 Project), and wrote the paper. XC carried out extensive literature research, wrote the section on mainstream approaches and algorithms, and studied data-level windowing and HMM smoothing. CX ran the experiments on the databases based on the code provided by LH and recorded and analyzed the results. YL provided the comparative systems. JL gave advice. MTJ checked the whole article and polished the English language usage.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 X. Anguera Miro, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, O. Vinyals, Speaker diarization: a review of recent research. IEEE Trans. Audio Speech Lang. Process. 20(2), 356–370 (2012). https://doi.org/10.1109/TASL.2011.2125954
 P. Kenny, D. Reynolds, F. Castaldo, Diarization of telephone conversations using factor analysis. IEEE J. Sel. Top. Signal Process. 4(6), 1059–1070 (2010). https://doi.org/10.1109/JSTSP.2010.2081790
 P. Kenny, Bayesian analysis of speaker diarization with eigenvoice priors. Technical report, CRIM (2008)
 D. Reynolds, P. Kenny, F. Castaldo, in Proceedings of Interspeech 2009. A study of new approaches to speaker diarization (ISCA, Brighton, 2009), pp. 1047–1050
 F. Valente, Variational Bayesian methods for audio indexing. PhD thesis, Eurecom, Sophia-Antipolis, France (2005). http://dx.doi.org/10.5075/epfl-thesis-2092. http://www.eurecom.fr/publication/1739
 M. Diez, L. Burget, P. Matejka, in Odyssey: The Speaker and Language Recognition Workshop. Speaker diarization based on Bayesian HMM with eigenvoice priors (ISCA, Les Sables d'Olonne, 2018), pp. 147–154
 A. E. Bulut, H. Demir, Y. Z. Isik, H. Erdogan, in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). PLDA-based diarization of telephone conversations (IEEE, Brisbane, 2015), pp. 4809–4813. https://doi.org/10.1109/ICASSP.2015.7178884
 P. F. Lazarsfeld, N. W. Henry, Latent Structure Analysis (Houghton Mifflin, Boston, 1968)
 J. Magidson, J. K. Vermunt, Latent class models for clustering: a comparison with K-means. Can. J. Mark. Res. 20, 37–44 (2002)
 L. M. Collins, S. T. Lanza, Latent Class and Latent Transition Analysis with Applications in the Social, Behavioral, and Health Sciences, 1st edn. (Wiley, New York, 2009)
 K. J. Han, S. S. Narayanan, in Proceedings of Interspeech 2007. A robust stopping criterion for agglomerative hierarchical clustering in a speaker diarization system (ISCA, Antwerp, 2007), pp. 1853–1856
 NIST, The 2009 (RT-09) Rich Transcription Meeting Recognition Evaluation Plan (2009). Available: https://www.nist.gov/itl/iad/mig/rich-transcription-evaluation
 Linguistic Data Consortium, LDC97S42, Catalog (1997). Available: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC97S42. Accessed 29 Aug 2018
 M. Zelenák, H. Schulz, J. Hernando, Speaker diarization of broadcast news in Albayzin 2010 evaluation campaign. EURASIP J. Audio Speech Music Process. 2012(1), 19 (2012). https://doi.org/10.1186/1687-4722-2012-19
 C. Wooters, M. Huijbregts, The ICSI RT07s Speaker Diarization System, ed. by R. Stiefelhagen, R. Bowers, J. Fiscus (Springer, Berlin, Heidelberg, 2008)
 V. Gupta, G. Boulianne, P. Kenny, P. Ouellet, P. Dumouchel, in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. Speaker diarization of French broadcast news (IEEE, Las Vegas, 2008), pp. 4365–4368. https://doi.org/10.1109/ICASSP.2008.4518622
 S. Meignier, T. Merlin, in CMU SPUD Workshop. LIUM SpkDiarization: an open source toolkit for diarization (Dallas, 2010)
 B. Desplanques, K. Demuynck, J. P. Martens, in Proceedings of Interspeech 2015. Factor analysis for speaker segmentation and improved speaker diarization (ISCA, Dresden, 2015), pp. 3081–3085
 S. Jothilakshmi, V. Ramalingam, S. Palanivel, Speaker diarization using auto-associative neural networks. Eng. Appl. Artif. Intell. 22(4), 667–675 (2009). https://doi.org/10.1016/j.engappai.2009.01.012
 V. Gupta, in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Speaker change point detection using deep neural nets (IEEE, Queensland, 2015), pp. 4420–4424. https://doi.org/10.1109/ICASSP.2015.7178806
 M. Hruz, Z. Zajic, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Convolutional neural network for speaker change detection in telephone speaker diarization system (IEEE, New Orleans, 2017), pp. 4945–4949
 Z. Zajic, M. Hruz, L. Muller, in Proceedings of Interspeech 2017. Speaker diarization using convolutional neural network for statistics accumulation refinement (ISCA, Stockholm, 2017), pp. 3562–3566
 H. Bredin, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). TristouNet: triplet loss for speaker turn embedding (IEEE, New Orleans, 2017), pp. 5430–5434
 R. Yin, H. Bredin, C. Barras, in Proceedings of Interspeech 2017. Speaker change detection in broadcast TV using bidirectional long short-term memory networks (ISCA, Stockholm, 2017), pp. 3827–3831
 S. Bozonnet, N. W. D. Evans, C. Fredouille, in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). The LIA-EURECOM RT'09 speaker diarization system: enhancements in speaker modelling and cluster purification (IEEE, Dallas, 2010), pp. 4958–4961. https://doi.org/10.1109/ICASSP.2010.5495088
 I. Lapidot, J. F. Bonastre, in Odyssey: The Speaker and Language Recognition Workshop. On the importance of efficient transition modeling for speaker diarization (ISCA, Singapore, 2012), pp. 138–145
 I. Lapidot, J. F. Bonastre, in Proceedings of Interspeech 2016. On the importance of efficient transition modeling for speaker diarization (ISCA, San Francisco, 2016), pp. 2190–2193
 R. Milner, T. Hain, in Proceedings of Interspeech 2016. DNN-based speaker clustering for speaker diarisation (ISCA, San Francisco, 2016), pp. 2185–2189
 M. Najafian, J. H. L. Hansen, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Environment-aware speaker diarization for moving targets using parallel DNN-based recognizers (IEEE, New Orleans, 2017), pp. 5450–5454. https://doi.org/10.1109/ICASSP.2017.7953198
 X. Zhu, C. Barras, S. Meignier, J. L. Gauvain, in Proceedings of Interspeech 2005. Combining speaker identification and BIC for speaker diarization (ISCA, Lisbon, 2005), pp. 2441–2444
 T. L. Nwe, H. Sun, B. Ma, H. Li, in Proceedings of Interspeech 2010. Speaker diarization in meeting audio for single distant microphone (ISCA, Makuhari, 2010), pp. 1505–1508
 G. Friedland, A. Janin, D. Imseng, X. Anguera, L. Gottlieb, M. Huijbregts, M. T. Knox, O. Vinyals, The ICSI RT-09 speaker diarization system. IEEE Trans. Audio Speech Lang. Process. 20(2), 371–381 (2012). https://doi.org/10.1109/TASL.2011.2158419
 S. Shum, N. Dehak, E. Chuangsuwanich, D. Reynolds, J. Glass, in Proceedings of Interspeech 2011. Exploiting intra-conversation variability for speaker diarization (ISCA, Florence, 2011), pp. 945–948
 G. Sell, D. Garcia-Romero, in 2014 IEEE Spoken Language Technology Workshop (SLT). Speaker diarization with PLDA i-vector scoring and unsupervised calibration (IEEE, 2014), pp. 413–417. https://doi.org/10.1109/SLT.2014.7078610
 G. Le Lan, D. Charlet, A. Larcher, S. Meignier, in Proceedings of Interspeech 2016. Iterative PLDA adaptation for speaker diarization (ISCA, San Francisco, 2016), pp. 2175–2179
 A. Sholokhov, T. Pekhovsky, O. Kudashev, A. Shulipa, T. Kinnunen, in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Bayesian analysis of similarity matrices for speaker diarization (IEEE, Florence, 2014), pp. 106–110
 W. Zhu, J. Pelecanos, in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Online speaker diarization using adapted i-vector transforms (IEEE, Shanghai, 2016), pp. 5045–5049. https://doi.org/10.1109/ICASSP.2016.7472638
 S. H. Shum, N. Dehak, R. Dehak, J. R. Glass, Unsupervised methods for speaker diarization: an integrated and iterative approach. IEEE Trans. Audio Speech Lang. Process. 21(10), 2015–2028 (2013). https://doi.org/10.1109/TASL.2013.2264673
 G. Sell, A. McCree, D. Garcia-Romero, in Proceedings of Interspeech 2016. Priors for speaker counting and diarization with AHC (ISCA, San Francisco, 2016), pp. 2194–2198
 N. Evans, S. Bozonnet, D. Wang, C. Fredouille, R. Troncy, A comparative study of bottom-up and top-down approaches to speaker diarization. IEEE/ACM Trans. Audio Speech Lang. Process. 20(2), 382–392 (2012)
 T. L. Nwe, H. Sun, B. Ma, H. Li, Speaker clustering and cluster purification methods for RT07 and RT09 evaluation meeting data. IEEE Trans. Audio Speech Lang. Process. 20(2), 461–473 (2012). https://doi.org/10.1109/TASL.2011.2159203
 N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011). https://doi.org/10.1109/TASL.2010.2064307
 I. S. Gradshteyn, I. M. Ryzhik, Tables of Integrals, Series, and Products, 7th edn. (Academic Press, San Diego, 2000)
 D. Garcia-Romero, C. Espy-Wilson, in Proceedings of Interspeech 2011. Analysis of i-vector length normalization in speaker recognition systems (ISCA, Florence, 2011), pp. 249–252
 N. Brümmer, E. de Villiers, in Odyssey: The Speaker and Language Recognition Workshop. The speaker partitioning problem (ISCA, Brno, 2010), pp. 194–201
 S. J. D. Prince, J. H. Elder, in 2007 IEEE 11th International Conference on Computer Vision. Probabilistic linear discriminant analysis for inferences about identity (IEEE, Rio de Janeiro, 2007), pp. 1–8. https://doi.org/10.1109/ICCV.2007.4409052
 D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, K. Visweswariah, in IEEE International Conference on Acoustics, Speech and Signal Processing. Boosted MMI for model and feature-space discriminative training (IEEE, Las Vegas, 2008), pp. 4057–4060
 J. Mercer, XVI. Functions of positive and negative type, and their connection with the theory of integral equations. Philos. Trans. R. Soc. Lond. A 209(441-458), 415–446 (1909). https://doi.org/10.1098/rsta.1909.0016
 F. R. Bach, G. R. G. Lanckriet, M. I. Jordan, in Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04). Multiple kernel learning, conic duality, and the SMO algorithm (ACM, New York, 2004), p. 6. https://doi.org/10.1145/1015330.1015424
 D. L. Snyder, M. I. Miller, Random Point Processes in Time and Space (Springer, New York, 1991). https://doi.org/10.1007/978-1-4612-3166-0
 L. He, X. Chen, C. Xu, Tianyu, J. Liu, in IEEE Workshop on Signal Processing Systems (SiPS). Ivec-PLDA-AHC priors for VB-HMM speaker diarization system (IEEE, Lorient, 2017)
 X. Anguera, C. Wooters, J. Hernando, Acoustic beamforming for speaker diarization of meetings. IEEE Trans. Audio Speech Lang. Process. 15(7), 2011–2022 (2007). https://doi.org/10.1109/TASL.2007.902460
 A. Adami, L. Burget, S. Dupont, H. Garudadri, F. Grezl, H. Hermansky, P. Jain, S. Kajarekar, N. Morgan, S. Sivadas, in Proc. ICSLP. Qualcomm-ICSI-OGI features for ASR (ISCA, Denver, 2002), pp. 4–7
 S. H. Yella, H. Bourlard, in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Information bottleneck based speaker diarization of meetings using non-speech as side information (IEEE, Florence, 2014), pp. 96–100
 I. Lapidot, A. Shoa, T. Furmanov, L. Aminov, A. Moyal, J. F. Bonastre, Generalized Viterbi-based models for time-series segmentation and clustering applied to speaker diarization. Comput. Speech Lang. 45, 1–20 (2017)
 Z. Zajic, M. Kunesova, V. Radova, Investigation of segmentation in i-vector based speaker diarization of telephone speech. Springer Int. Publ. 45, 411–418 (2016)
 F. Castaldo, D. Colibro, E. Dalmasso, P. Laface, C. Vair, in IEEE International Conference on Acoustics, Speech and Signal Processing. Stream-based speaker segmentation using speaker factors and eigenvoices (IEEE, Las Vegas, 2008), pp. 4133–4136
 M. Senoussaoui, P. Kenny, T. Stafylakis, P. Dumouchel, A study of the cosine distance-based mean shift for telephone speech diarization. IEEE/ACM Trans. Audio Speech Lang. Process. 22(1), 217–227 (2013)
 D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely, in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. The Kaldi speech recognition toolkit (IEEE, Waikoloa, 2011)
 C. Vaquero, A. Ortega, A. Miguel, E. Lleida, Quality assessment for speaker diarization and its application in speaker characterization. IEEE Trans. Audio Speech Lang. Process. 21(4), 816–827 (2013)
 E. Dalmasso, F. Castaldo, P. Laface, D. Colibro, C. Vair, in IEEE International Conference on Acoustics, Speech and Signal Processing. Loquendo - Politecnico di Torino's 2008 NIST speaker recognition evaluation system (IEEE, Taipei, 2009), pp. 4213–4216
 F. Castaldo, D. Colibro, C. Vair, S. Cumani, P. Laface, in IEEE International Conference on Acoustics, Speech and Signal Processing. Loquendo - Politecnico di Torino's 2010 NIST speaker recognition evaluation system (IEEE, Prague, 2011), pp. 4213–4216