Integration of evolutionary computation algorithms and new AUTO-TLBO technique in the speaker clustering stage for speaker diarization of broadcast news
- Karim Dabbabi^{1},
- Salah Hajji^{2} and
- Adnen Cherif^{1}
https://doi.org/10.1186/s13636-017-0117-1
© The Author(s). 2017
Received: 15 March 2017
Accepted: 30 August 2017
Published: 19 September 2017
Abstract
The task of speaker diarization is to answer the question "who spoke when?" In this paper, we present different clustering approaches based on Evolutionary Computation Algorithms (ECAs), namely the Genetic Algorithm (GA), the Particle Swarm Optimization (PSO) algorithm, and the Differential Evolution (DE) algorithm, as well as the Teaching-Learning-Based Optimization (TLBO) technique as a new optimization technique, with the aim of optimizing the number of clusters in the speaker clustering stage, which remains a challenging problem. Clustering validity indexes, such as the Within-Class Distance (WCD) index, the Davies and Bouldin (DB) index, and the Contemporary Document (CD) index, are also used in order to correct each possible grouping of speakers' segments. The proposed algorithms are evaluated on the News Broadcast database (NDTV), and their performance is compared with one another as well as with some well-known clustering algorithms. Results show the superiority of the new AUTO-TLBO technique in terms of comparative results obtained on the NDTV, RT-04F, and ESTER datasets of News Broadcast.
1 Introduction
Nowadays, the fast growth of multimedia sources makes efficient and effective means of searching and indexing voluminous databases of archived audio documents an increasing need. To facilitate access to the recordings in audio databases, searching and tagging based on who is speaking is among the most basic components required for dealing with audio archives, such as recorded meetings or the audio portion of News Broadcast shows.
Earlier approaches to speaker recognition were developed for speaker identification and verification in a speech sample pronounced by one person. However, the basic recognition approach has to be extended to include both speaker detection and tracking in multi-speaker audio. In this work, we focus on speaker indexing and retrieval in audio broadcast news (NDTV) for the speaker diarization task. Indeed, speaker diarization is one of the speaker-based processing techniques in which the feature representation of the acoustic signal aims to represent the speaker information and discriminate between different talkers. It was introduced in the NIST Rich Transcription project in the “who spoke when” evaluations [1]. According to the first definition in the 1999 NIST Speaker Recognition evaluation, identifying the audio regions that belong to a given speaker is a speaker tracking task [2]. The speaker detection task in audio data is performed by diarization and tracking procedures; its objective is to index the audio according to the detected speaker and to ensure good retrieval of speaker-based information in audio recordings. The speaker diarization task aims to structure audio documents into speaker turns and give their true identities so that an automatic transcription can be made.
In the speaker diarization task, there is no prior knowledge about the speakers or their number. The task consists of two main phases. The first is a segmentation phase, in which the speech is split into many smaller segments at the change points detected in a recording. Ideally, each small segment contains speech from just one speaker. The second is a clustering phase, which groups the neighboring segments uttered by the same speaker. Currently, a bottom-up approach known as Hierarchical Agglomerative Clustering (HAC) is the most popular method for clustering [3]. Speaker diarization has been applied in several speech areas [4]. Its main applications are the transcription of telephone and broadcast meetings, auxiliary video segmentation, and dominant speaker detection. An effective tool such as speaker clustering can reduce the amount of speech document management work [5, 6]. It groups similar audio utterances in an audio document and attributes them to the same speaker, using distance measures and clustering schemes in an unsupervised condition [7]. In previous years, spectral clustering has been shown to perform better than hierarchical clustering in speaker clustering [8, 9]. This is due to the greedy search of hierarchical clustering, which has high computational complexity and produces a suboptimal solution. In contrast, spectral clustering has relatively lower computational complexity and can produce a global solution.
Finding a suitable model that can represent short segments and enable similarity and difference measures between neighboring segments for clustering remains an open research topic. Previous approaches have modeled each segment with a single GMM or with I-vectors extracted from a Universal Background Model (UBM), as described in [10]. Indeed, state-of-the-art systems represent segments with an I-vector formed from a Gaussian Mixture Model (GMM) adapted from the UBM. Generally, good results have been reported using UBM/I-vector. In [11], a deep neural network (DNN) was trained to construct the UBM and T-matrix in order to make the extracted I-vectors better models of the underlying speech. This method has shown the capability to construct accurate models of speech, even for short segments. This system also achieved a significant improvement on the NIST 2008 speaker recognition evaluation (SRE) telephone task data compared to state-of-the-art approaches. In this work, we have tried to optimize the clustering of the extracted I-vectors using evolutionary algorithms (EAs) and the teaching-learning-based optimization (TLBO) technique.
Speaker clustering based on feature vector distance employs the distance between samples to measure the similarity of two speech segments. Thus, some researchers have used a two-step clustering based on the Cross Likelihood Ratio (CLR) to measure the similarity between segments [10]. This approach has been shown to be effective in resolving the problem of a single Gaussian model describing the complex distribution of the features. Also, in [12], the Rand index showed good efficiency in reducing the overall clustering errors when it was used to measure the similarity between utterances. During the agglomeration procedure, the Bayesian Information Criterion (BIC) can only make each individual cluster as homogeneous as possible, but it cannot guarantee that the homogeneity summed over all clusters finally reaches a maximum [12]. In this work, we have used EAs to optimize the generated clusters and the required number of clusters by estimating and minimizing the clustering validity indexes (criteria). These metrics reflect the clustering errors that arise when utterances from the same speaker are placed in different clusters or when utterances from different speakers are placed in the same cluster. We approximate the clustering validity index by a function of similarity measures between utterances and then use the EAs to determine the cluster in which each utterance should be located, such that this function is minimized.
For the clustering stage, many techniques are used to regroup an unlabeled dataset into groups of similar objects called clusters. Researchers have integrated evolutionary computation (EC) techniques into object clustering in order to find clusters in complex datasets. EAs are general stochastic search methods that simulate the natural selection and evolution observed in the biological world. They do not keep only one solution to a problem; instead, they maintain a population of potential solutions. Therefore, EAs have many advantages compared to other traditional search and classification techniques: they need less domain-specific information, and they can be applied easily to a set of solutions (the so-called population). EAs are also popular in many fields of application, especially in pattern recognition, and they include many algorithms, such as the genetic algorithm (GA), particle swarm optimization (PSO), evolutionary programming (EP), evolution strategies (ES), and the differential evolution (DE) algorithm. All these algorithms share a common concept: simulating the evolution of the individuals that form the population using a predefined set of operators. Two kinds of operators are commonly used: selection operators and search operators. Mutation and recombination are the most widely used search operators. To determine the optimal number of clusters, the within-class distance (WCD), Davies and Bouldin (DB), and contemporary document (CD) clustering validity indexes have been used in this work, with the aim of obtaining global minima/maxima at the exact number of classes in the dataset. Thus, the quantitative evaluation with a global clustering validity index permits a correction of each possible grouping.
The evolution process starts with the domination of the best solutions in the population and the elimination of the bad ones. The evolution of solutions then converges when the fittest solution, with respect to the employed clustering validity index, represents the near-optimal partitioning of the dataset. In this way, the optimal number of classes along with accurate cluster center coordinates can be located in only one run of the evolutionary optimization algorithm. In fact, the performance of the evolutionary optimization algorithm depends strongly on the choice of the clustering validity index.
The GA was first introduced in 1975 by Holland [13], and it is well known in many application fields as a tool for complex systems optimization. Its main feature is its capability to avoid local minima. Also, the GA is an unsupervised optimization method that can be used freely, without any constraint, to find the best solution. It is the most popular EA and is well known for resolving hard optimization problems. GAs have shown good performance in many application areas, such as pattern recognition, image processing, and machine learning [14]. Compared to GAs, the EP and ES techniques have performed better for real-valued function optimization. In the speaker diarization research area, some works have used the GA, such as [15], where the GA was explored to design the filter bank in a feature extraction method intended for a speaker diarization application. Also, in [16], feature dimension reduction was performed with GAs in order to speed up the speaker recognition task. In this work, we have used both binary and real-coded GA representations, together with different variations of the major control parameters, such as selection and crossover, and different distance measures of the fitness function using the WCD, DB, and CD clustering validity indexes. These indexes have also been explored with the PSO and DE algorithms.
The PSO algorithm is a population-based stochastic optimization technique that was developed in [17, 18]. This algorithm simulates the social behavior of bird flocking or fish schooling. Its first applications were to optimize clustering results in mining tasks. It has also been applied to the clustering task in wireless sensor networks, where it proved more robust than random search (RS) and simulated annealing (SA) [19]. In addition, the PSO algorithm has been tested in document clustering, where hybridizing PSO with k-means generated more compact clusters than the k-means algorithm alone [20]. The combination of the k-means algorithm with the PSO algorithm for data clustering has likewise demonstrated high accuracy and fast convergence to the optimum solution [21]. In the speaker diarization field, the PSO algorithm has had many applications; for example, it has been used with mutual information (MI) in a multi-speaker environment [22]. In 2009, the PSO algorithm was also used with the SGMM algorithm in text-independent speaker verification, and good performance was registered using both algorithms compared to the SGMM algorithm alone [23]. In 2011, the PSO algorithm was exploited to encode possible segmentations of an audio record, with an MI-based measure between the obtained segments and the audio data serving as the PSO fitness function. This algorithm showed good results in all test problems carried out on two datasets that contain up to eight speakers [22]. Moreover, an artificial neural network (ANN) for the speaker recognition task has been optimized using the PSO algorithm, which improved performance compared to the ANN alone [24]. Like other algorithms, the global PSO algorithm has its drawbacks, chiefly its tendency to become trapped in a local optimum under some initialization conditions [25].
More information about the PSO variants and applications can be found in [26, 27]. The DE algorithm needs few or no parameters to be tuned for numerical optimization, and it has shown good performance [5]. This algorithm is also characterized by the small number of parameters to be determined, its high convergence speed, and its resistance to falling into local optima [28]. Previous applications of this approach to real-world and artificial problems have shown that it is superior to the GA and PSO algorithms in single-objective, noise-free numerical optimization. One of the few works carried out using the DE algorithm in speaker recognition applications was aimed at optimizing GMM parameters [29]. In this work, the GA, PSO, and DE algorithms have been applied and compared to the new TLBO optimization technique, which has been used for automatic clustering of large unlabeled datasets. The TLBO technique does not need any prior information about the data to be classified, and it can find the optimal number of data partitions in a few iterations. This algorithm can be defined as a population-based iterative learning algorithm, and it shares many characteristics with other EC algorithms. Indeed, this technique has shown a greater improvement in convergence time for solving optimization problems in real-world, real-time applications compared to the GA, PSO, DE, and artificial bee colony (ABC) algorithms. In [30], the effect on performance of introducing the elitist concept into the TLBO algorithm was investigated, along with the effects of the common controlling parameters (population size and number of generations). Moreover, the TLBO technique was used in [31] to optimize four truss structures.
In [32], the concepts of number of teachers, adaptive teaching factor, tutorial training, and self-motivated learning were introduced with the aim of improving the performance of the TLBO algorithm. In [33], the θ-multi-objective TLBO algorithm was presented for resolving the dynamic economic emission dispatch problem. For global optimization problems, a dynamic group strategy was suggested in [34] to improve the performance of the TLBO algorithm as well. In addition, the ability of the population was explored in the original TLBO technique by introducing a ring neighborhood topology [35]. In [36], the TLBO technique was considered one of the simplest and most efficient techniques, as it has been empirically shown to perform well on many optimization problems. To our knowledge, no work using the TLBO algorithm has been reported in the speaker diarization research area. More details about the basic TLBO concept can be found in [37].
The remaining sections of this paper are organized as follows. In Section 2, we explain the different components of our proposed model. In Section 3, we discuss the experimental results. In Section 4, we conclude the paper with a brief discussion.
2 Overview of the methodology
Our model consists of several phases, and a detailed description of each phase is given in the following subsections.
2.1 Feature extraction (MFCCs)
Only the first 19 Mel Frequency Cepstral Coefficient (MFCC) features have been used in the Speech Activity Detector (SAD) module, the speaker segmentation module, and the speaker clustering module. Besides these features, the short-time energy (STE) and the zero-crossing rate (ZCR), plus the first- and second-order derivatives of the MFCCs, have been employed in the SAD module. For the speaker segmentation, only the 19 MFCCs and the STE have been used, whereas in the speaker clustering stage, the first- and second-order derivatives of the MFCCs have been added. The analysis window size was set to 30 ms with a 20 ms frame overlap. The sampling frequency was set to 16 kHz.
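As a rough illustration of the framing described above, the following Python sketch splits a signal into 30 ms frames with a 20 ms overlap (a 10 ms hop at 16 kHz) and computes the STE and ZCR of each frame. The function names and the use of NumPy are assumptions made for illustration only, and the MFCC extraction itself is not shown.

```python
import numpy as np

SAMPLE_RATE = 16_000                     # 16 kHz, as stated in the text
FRAME_LEN = SAMPLE_RATE * 30 // 1000     # 30 ms window -> 480 samples
OVERLAP = SAMPLE_RATE * 20 // 1000       # 20 ms overlap -> 320 samples
HOP = FRAME_LEN - OVERLAP                # 10 ms hop -> 160 samples


def frame_signal(x, frame_len=FRAME_LEN, hop=HOP):
    """Split a 1-D signal into overlapping frames (a trailing partial frame is dropped)."""
    x = np.asarray(x)
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def short_time_energy(frames):
    """Sum of squared samples in each frame."""
    return np.sum(frames.astype(float) ** 2, axis=1)


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose signs differ."""
    signs = np.sign(frames)
    signs[signs == 0] = 1  # treat exact zeros as positive (an illustrative convention)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
```

With these settings, one second of audio yields 98 frames of 480 samples each.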
2.2 SAD
This subsystem was used for both the silence and music removal modules. In the silence removal module, the silence was suppressed from the whole audio recording using an energy-based bootstrapping algorithm followed by an iterative classification. After silence removal, music and other audible no-speech sounds in the recording were identified using a music vs. speech bootstrap discriminator, which trains a music model from frames that are identified as music with a high confidence level. The music model is then refined iteratively. For both the silence and music removal modules, in order to avoid sporadic no-speech to speech transitions, only segments longer than 1 s have been considered as no-speech.
2.2.1 Silence removal
This phase has been performed by concatenating the 19 MFCC features plus their first and second derivatives with the STE. Each frame has been attributed to the silence or speech class according to a confidence value based on its energy. The 20% of frames with the lowest energies are called high-confidence silence frames, and the 10% of frames with the highest energies are called high-confidence speech frames. A Gaussian mixture of size 4 over the 60-dimensional feature space has been used to train the bootstrap silence model. The same size has also been employed to train the bootstrap speech model using the high-confidence speech frames. An iterative classification is then employed to classify the frames into the speech or silence class. The remaining frames, which lie between the high-confidence frames, have been used to train the silence and speech models at the next iteration. As the number of iterations increases, the number of 60-dimensional Gaussians employed to model the speech and silence GMMs increases up to the maximum. Gaussian Mixture Models (GMMs) with 32 components for speech and 16 components for no-speech gave the best results for silence and pause removal. Also, high-energy no-speech, named audible no-speech, such as music and jingles, has been classified as speech because the MFCCs and frame energies of music are more similar to those of speech than to those of silence.
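The bootstrap step above can be sketched as follows: frames are ranked by energy, the lowest 20% are taken as high-confidence silence, and the highest 10% as high-confidence speech. This is only a minimal sketch; the function name, the rounding of the frame counts, and the NumPy-based ranking are assumptions, and the subsequent GMM training and iterative reclassification are not shown.

```python
import numpy as np


def high_confidence_frames(energy, silence_frac=0.20, speech_frac=0.10):
    """Return (silence_idx, speech_idx): indices of the high-confidence frames.

    energy       : per-frame short-time energy
    silence_frac : fraction of lowest-energy frames used as silence (20% in the text)
    speech_frac  : fraction of highest-energy frames used as speech (10% in the text)
    """
    energy = np.asarray(energy, dtype=float)
    order = np.argsort(energy, kind="stable")          # ascending energy
    n_sil = int(round(silence_frac * len(energy)))     # rounding rule is an assumption
    n_sp = int(round(speech_frac * len(energy)))
    silence_idx = order[:n_sil]
    speech_idx = order[len(energy) - n_sp:]
    return silence_idx, speech_idx
```

The two index sets would then seed the 4-component bootstrap silence and speech models, with the remaining frames left for the iterative classification.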
2.2.2 Music removal
The frames with a high confidence level from the ZCR histogram for music and from the energy histogram for speech are used to train the initial music and speech models. Only the 40% of frames with the highest zero-crossing rates in the ZCR histograms are used as high-confidence music frames to train the music model. After that, the speech and music classes are refined in the iterative classification in order to discard only music segments. This refinement is similar to that performed in the silence removal module. In this stage (music removal), the 19 MFCC features and their first- and second-order derivatives concatenated with the ZCR have been used. The STE has not been exploited within the iterative classification process; by eliminating it, speech with background music that had been classified as music has been changed to the speech class.
2.3 Speaker segmentation
A growing window based on the delta Bayesian Information Criterion (∆BIC) distance has been used as the speaker segmentation algorithm. It searches for a single change point within a window of the audio recording, and the search restarts from the next frame each time a change point is detected. The window size is initialized to 5 s, and the ∆BIC distance is calculated for each frame in that window. A change point is declared at the maximum point if the maximum in the window exceeds a threshold value θ. In contrast, if no change point is detected, the window size is increased by 2 s and the process is repeated until a change point is detected. Note that we deal only with speech frames here, as the no-speech frames are discarded by the SAD module. Thus, the locations in the original audio corresponding to the change points found in these speech frames are declared as change points.
where Σ is the covariance matrix of the merged cluster (c _{1} and c _{2}), Σ _{1} is that of cluster c _{1}, and Σ _{2} is that of cluster c _{2}; N _{1} and N _{2} are, respectively, the numbers of acoustic frames in clusters c _{1} and c _{2}; and λ is a tunable parameter dependent on the data. N = N _{1} + N _{2} denotes the size of the two merged clusters. In this speaker segmentation stage, only the 19 MFCC features have been used with their short-time energies.
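For illustration, the ∆BIC distance between two segments can be sketched in Python using the symbols defined above. The form used here is the commonly cited one, ∆BIC = (N/2) log|Σ| − (N _{1}/2) log|Σ _{1}| − (N _{2}/2) log|Σ _{2}| − λP with P = ½ (d + ½ d(d + 1)) log N, where d is the feature dimension. The penalty term P, the maximum-likelihood covariance estimate, and the function name are assumptions taken from the standard BIC formulation rather than from this text.

```python
import numpy as np


def delta_bic(x1, x2, lam=1.0):
    """Delta-BIC between two segments (rows = frames, columns = features).

    Positive values favour a speaker change between x1 and x2.
    lam is the tunable parameter (lambda); its default here is illustrative.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    x = np.vstack([x1, x2])
    n1, n2 = len(x1), len(x2)
    n = n1 + n2
    d = x.shape[1]

    def logdet_cov(a):
        # Full covariance (ML estimate); slogdet avoids overflow in the determinant.
        cov = np.cov(a, rowvar=False, bias=True).reshape(d, d)
        _, logdet = np.linalg.slogdet(cov)
        return logdet

    # Assumed standard BIC penalty for a full-covariance Gaussian.
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(x)
            - 0.5 * n1 * logdet_cov(x1)
            - 0.5 * n2 * logdet_cov(x2)
            - lam * penalty)
```

In a growing-window search, this value would be evaluated at each candidate split of the current window and its maximum compared against the threshold θ.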
2.4 I-vector extraction
2.4.1 WCCN
The use of the within-class covariance (WCC) matrix to normalize data variances has become widespread in the speaker recognition field [41, 43]. I-vectors represent a wide range of speech variability, which differs from one application to another, so they need to be normalized. Here, within-class covariance normalization (WCCN) accomplishes this task by penalizing the axes that have high intra-class variance, rotating the data using a decomposition of the inverse of the WCC matrix. After the I-vector normalization, the different EAs and the TLBO technique have been applied in order to regroup the extracted I-vectors into an optimal number of clusters.
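A minimal Python sketch of WCCN is given below, assuming the usual formulation in which the WCC matrix W is the average of the per-class covariance matrices and the projection B is obtained from the Cholesky decomposition W^{-1} = B B^{T}. The function name and the availability of class (speaker) labels for estimating W are assumptions made for illustration.

```python
import numpy as np


def wccn_projection(ivectors, labels):
    """Return B such that B.T @ w is the WCCN-normalized version of i-vector w.

    ivectors : array of shape (n_vectors, dim)
    labels   : class (speaker) label of each i-vector, assumed available
    """
    ivectors = np.asarray(ivectors, dtype=float)
    labels = np.asarray(labels)
    dim = ivectors.shape[1]
    classes = np.unique(labels)
    w = np.zeros((dim, dim))
    for c in classes:
        xc = ivectors[labels == c]
        centered = xc - xc.mean(axis=0)
        w += centered.T @ centered / len(xc)   # covariance of class c
    w /= len(classes)                          # average within-class covariance
    # Cholesky factor of the inverse WCC matrix: W^{-1} = B B^T.
    return np.linalg.cholesky(np.linalg.inv(w))
```

With the i-vectors stored as rows of a matrix `X`, the normalized vectors are `X @ B`; after this projection the within-class covariance becomes the identity, so directions of high intra-class variance are scaled down.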
2.5 Speaker clustering
2.5.1 EAs
2.5.2 DE algorithm
More details about the PSO algorithm can be found in [45].
2.5.3 TLBO algorithm
The TLBO method is a population-based method, which relies on a population of solutions to reach the global solution. It has been used for clustering tasks [46]. The main idea behind this optimization method is to exploit the influence of a teacher on the output of the learners in a class [47]. Accordingly, the population in this algorithm is considered as a group of learners. In optimization algorithms, the population is composed of different design variables; in the TLBO approach, the design variables are analogous to the different subjects offered to the learners, and the learners' result is analogous to the “fitness” in other population-based optimization techniques. In the TLBO algorithm, the teacher is considered to be the best solution obtained so far.
The TLBO technique consists of two phases: the “teacher phase” and the “learner phase.” In the teacher phase, the learners learn from the teacher; in the learner phase, learning takes place through the interaction between learners.
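The two phases can be sketched in Python on a simple benchmark (the sphere function). The update rules below follow the commonly published TLBO formulation: in the teacher phase a learner moves by r · (teacher − T _{F} · mean), and in the learner phase it moves towards a better random peer or away from a worse one. These rules, the greedy acceptance, the bound clipping, and all names are assumptions made for illustration; this is not the clustering-specific AUTO-TLBO variant.

```python
import numpy as np


def sphere(x):
    """Sphere benchmark: sum of squared components, minimum 0 at the origin."""
    return float(np.sum(x ** 2))


def tlbo(cost, dim, n_pop=50, max_it=100, lo=-10.0, hi=10.0, tf=1, seed=0):
    """Minimize `cost` with a basic TLBO loop (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(lo, hi, size=(n_pop, dim))
    costs = np.array([cost(p) for p in pop])
    for _ in range(max_it):
        # Teacher phase: move each learner using the gap between teacher and class mean.
        teacher = pop[np.argmin(costs)].copy()
        mean = pop.mean(axis=0)
        for i in range(n_pop):
            cand = np.clip(pop[i] + rng.random(dim) * (teacher - tf * mean), lo, hi)
            c = cost(cand)
            if c < costs[i]:                  # keep the move only if it improves
                pop[i], costs[i] = cand, c
        # Learner phase: each learner interacts with one random peer.
        for i in range(n_pop):
            j = int(rng.integers(n_pop - 1))
            j = j + 1 if j >= i else j        # ensures j != i
            step = pop[i] - pop[j] if costs[i] < costs[j] else pop[j] - pop[i]
            cand = np.clip(pop[i] + rng.random(dim) * step, lo, hi)
            c = cost(cand)
            if c < costs[i]:
                pop[i], costs[i] = cand, c
    best = int(np.argmin(costs))
    return pop[best].copy(), float(costs[best])
```

A call such as `tlbo(sphere, dim=5)` returns the best learner and its cost. For clustering, the cost would be replaced by a clustering validity index evaluated on the partition encoded by each learner.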
Teacher phase
The main idea behind this phase is to consider the teacher as the most knowledgeable person in the society, who transfers knowledge to the learners; this raises the knowledge level of the entire class and allows the learners to get good marks or grades. The mean of the class is thus increased according to the teacher's capability: the teacher T1 moves the class mean M1 towards the teacher's own level, which raises the learners' level to a new mean M2. The student's knowledge also increases according to his quality in the class and to the quality of the teaching given by the teacher T1. The change in the students' quality from M1 to M2 depends on the effort of the teacher T1. Consequently, the students at the new level need a new teacher T2 of higher quality than themselves [45].
Learner phase
2.6 Statistical clustering criteria
where \( {\overline{x}}^{(k)}=\left(\sum_{l=1}^{n_k}{x}_l^{(k)}\right)/{n}_k \) is the centroid vector of cluster C _{ k } [48]. In this work, we have used the trace of the pooled within-groups scatter matrix (W) as a distance measure of the fitness function; it is denoted by WCD. The fitness function has also been computed according to distance measures using the DB and CS indexes as clustering validity indexes.
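As a small illustration, the WCD can be computed in Python as the trace of the pooled within-groups scatter matrix, which equals the sum of squared distances of every point to the centroid of its own cluster. The function name and the label-vector encoding of the partition are assumptions made for this sketch.

```python
import numpy as np


def within_class_distance(x, labels):
    """WCD: trace of the pooled within-groups scatter matrix W.

    x      : data points, shape (n_points, dim)
    labels : cluster label of each point
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        xk = x[labels == k]
        centroid = xk.mean(axis=0)
        # trace(sum (x - c)(x - c)^T) = sum of squared distances to the centroid
        total += np.sum((xk - centroid) ** 2)
    return float(total)
```

A lower value indicates more compact clusters, so this quantity can serve directly as a cost to be minimized.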
2.6.1 DB index
with kk = 1 , … , K and k ≠ kk.
Here, diam denotes the perfect diameter, which is defined as the inter-cluster and intra-cluster distance of clusters C _{ k } and C _{kk}.
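A Python sketch of the DB index is given below under the widely used formulation, which is an assumption here: the diameter of a cluster is taken as the mean distance of its points to the centroid, the separation between two clusters is the Euclidean distance between their centroids, and the index averages over k the maximum over kk ≠ k of (diam(C _{ k }) + diam(C _{kk})) divided by the separation. The function name is illustrative.

```python
import numpy as np


def davies_bouldin(x, labels):
    """Davies-Bouldin index (lower is better); requires at least two clusters."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    ks = np.unique(labels)
    centroids = np.array([x[labels == k].mean(axis=0) for k in ks])
    # Assumed diameter: mean Euclidean distance of the points to their centroid.
    diam = np.array([
        np.mean(np.linalg.norm(x[labels == k] - c, axis=1))
        for k, c in zip(ks, centroids)
    ])
    n_clusters = len(ks)
    total = 0.0
    for k in range(n_clusters):
        ratios = [
            (diam[k] + diam[kk]) / np.linalg.norm(centroids[k] - centroids[kk])
            for kk in range(n_clusters) if kk != k
        ]
        total += max(ratios)   # worst-case similarity of cluster k to any other
    return total / n_clusters
```

Compact, well-separated clusters give small values, so the index can be minimized over candidate partitions.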
2.6.2 CS index
where Z _{ i } denotes the cluster center of C _{ i }, C _{ i } denotes the set whose elements are the data points attributed to the ith cluster, N _{ i } is the number of elements in C _{ i }, and d denotes a distance function.
where CS_{ i } is the CS measure computed for the ith particle, and eps is a very small-valued constant.
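The CS measure and the fitness derived from it can be sketched in Python as follows. The sketch assumes the commonly published form of the index: the numerator sums, over clusters, the average of each point's largest distance to another point of the same cluster, and the denominator sums, over clusters, the distance from each center Z _{ i } to its nearest other center. The fitness is taken as 1/(CS _{ i } + eps). The formula, the Euclidean choice for d, the default eps, and the function names are assumptions.

```python
import numpy as np


def cs_index(x, labels):
    """CS measure (lower is better); requires at least two clusters."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    ks = np.unique(labels)
    centers = np.array([x[labels == k].mean(axis=0) for k in ks])
    num = 0.0
    den = 0.0
    for i, k in enumerate(ks):
        xi = x[labels == k]
        # Pairwise Euclidean distances inside cluster i.
        pair = np.linalg.norm(xi[:, None, :] - xi[None, :, :], axis=2)
        num += pair.max(axis=1).mean()
        # Distance from center i to the nearest other center.
        others = np.delete(centers, i, axis=0)
        den += np.linalg.norm(others - centers[i], axis=1).min()
    return num / den


def cs_fitness(cs, eps=1e-6):
    """Fitness to maximize; eps is a small constant (its value here is assumed)."""
    return 1.0 / (cs + eps)
```

Because the fitness is the reciprocal of the index, maximizing it is equivalent to minimizing the CS measure.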
3 Experiments and analysis
3.1 Evaluation criteria
where n _{ j } is the total number of frames spoken by speaker j and N _{ c } is the total number of clusters.
It is important to mention that the DER values obtained in all experiments of this work are overall diarization error rates, which are calculated as the average of the individual per-episode DERs weighted by the duration of each episode.
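The overall DER can be illustrated with a short Python sketch. It reads the description above as a duration-weighted mean of the per-episode DERs, which is an interpretation; the function name and the example values are hypothetical.

```python
def overall_der(episode_ders, episode_durations):
    """Duration-weighted mean of per-episode DER values.

    episode_ders      : DER of each episode (in %)
    episode_durations : duration of each episode (any consistent unit)
    """
    if len(episode_ders) != len(episode_durations):
        raise ValueError("one duration is needed per episode")
    weighted = sum(d * t for d, t in zip(episode_ders, episode_durations))
    return weighted / sum(episode_durations)
```

For example, two hypothetical episodes with DERs of 10% and 20% and durations of 60 s and 180 s give an overall DER of 17.5%, closer to the longer episode.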
3.2 NDTV evaluation corpus
The speaker diarization experiments presented below were developed in MATLAB and tested on the News database (NDTV). The development database (NDTV) contains 22 episodes of the Hindu news Headlines Now show from the NDTV news channel. It comprises 4 h and 15 min of English news reading with an Indian accent, and it was manually annotated. The dominant speaker in each episode is the anchor, who talks for much more time than the other speakers. The anchors also differ from one episode to another. In all episodes, the announcement of the headlines is accompanied by background music. In addition, each speaker in an episode is labeled by gender, background environment (clean, noise, or music), and identity (ID). The silence segment length varies from 1 to 5 s, and there are no advertisement jingles in the dataset. Silence, noise, speaker pauses, and music are labeled as no-speech, which represents 7% of the total recording. Speaker overlap has been annotated with the most dominant speaker in the overlap.
3.3 Implementation and parameter setting
Parameter settings giving the best cost solution for each of the proposed algorithms
GA parameters | PSO parameters |
---|---|
Maximum number of iterations = 200 | |
Population size(nPop) = 100 | Constriction coefficients |
Crossover percentage = 0.7 | phi1 = 2.05, phi2 = 2.05 phi = phi1 + phi2 |
Number of offsprings (nc) = 2*round(pc*nPop/2) | |
Mutation percentage (pm) = 0.3 | chi = 2/(phi-2 + sqrt(phi^2-4*phi)) |
Number of mutants (nm) = round (pm*nPop) | Inertia weight w = chi |
Mutation rate (mu) = 0.02 | Inertia weight damping ratio (wdamp) = 1 |
Selection pressure (beta) = 8 | Personal learning coefficient (c1) = chi*phi1 |
Gamma = 0.2 | Global learning coefficient (c2) = chi*phi2 |
Velocity maximal = 0.1*(VarMax-VarMin) | |
Velocity minimal = − VelMax | |
VarMin = − 10; VarMax = 10 | |
DE parameters | |
Maximum number of iterations (MaxIt) = 200 | |
Population size (nPop) = 50 | |
Lower bound of scaling factor (beta_min) C _{ r }min = 0.2 | |
Upper bound of scaling factor (beta_max) C _{ r }max = 0.8 | |
Crossover probability (pCR) = 0.2 | |
TLBO parameters | |
MaxIt = 1000; nPop = 50; T _{ F } = 1 |
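The derived quantities in the table above can be checked with a few lines of Python. The sketch evaluates the PSO constriction coefficient chi and the resulting inertia weight and learning coefficients, the velocity bounds, and the GA offspring and mutant counts from the listed formulas. The function names are illustrative, and Python's `round` is assumed to be an acceptable stand-in for the rounding used in the table (it rounds exact halves to the nearest even integer, which does not affect these values).

```python
import math


def constriction_coefficients(phi1=2.05, phi2=2.05):
    """Constriction-based PSO coefficients; requires phi1 + phi2 > 4."""
    phi = phi1 + phi2
    chi = 2.0 / (phi - 2.0 + math.sqrt(phi ** 2 - 4.0 * phi))
    return {"w": chi, "c1": chi * phi1, "c2": chi * phi2}


def velocity_bounds(var_min=-10.0, var_max=10.0):
    """Velocity limits: VelMax = 0.1 * (VarMax - VarMin), VelMin = -VelMax."""
    vel_max = 0.1 * (var_max - var_min)
    return -vel_max, vel_max


def ga_counts(n_pop=100, pc=0.7, pm=0.3):
    """Number of offspring (nc) and mutants (nm) per GA generation."""
    nc = 2 * round(pc * n_pop / 2)
    nm = round(pm * n_pop)
    return nc, nm
```

With phi1 = phi2 = 2.05, the inertia weight is roughly 0.73 and both learning coefficients are roughly 1.50; with a population of 100, the GA produces 70 offspring and 30 mutants per generation.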
3.4 Results
SAD error results obtained using the silence removal module alone, the music removal module alone, and both modules in cascade
Method | MSR | FASR | Total SAD |
---|---|---|---|
Silence removal | 1.37 | 3.31 | 4.68 |
Music removal | 1.42 | 5.62 | 7.04 |
Cascade | 2.79 | 8.93 | 11.72 |
DER results obtained using ILP clustering algorithm
BIC criterion parameters | λ = 1, θ = 0 | λ = 1, θ = 1000 | λ = 1, θ = 2000 | λ = 10, θ = 0 | λ = 10, θ = 1000 | λ = 10, θ = 2000 |
---|---|---|---|---|---|---|
ILP | 31.52 | 23.67 | 12.35 | 33.41 | 16.54 | 16.10 |
Best DER result obtained for both speaker models and clustering algorithms
Speaker model | HAC clustering | ILP clustering |
---|---|---|
GMM | 19.45 | 17.15 |
I-vectors | 16.95 | 16.10 |
DER results for GA algorithm obtained using DB and CD clustering validity indexes, with different selection, and the DER results obtained with “sphere” cost function for both GA-based binary representation and GA-based real-coded representation
Configuration | Selection | Validity index | ACP % | ASP % | DER % |
---|---|---|---|---|---|
GA-based binary representation | Random selection | | 93.54 | 98.13 | 14.62 |
GA-based binary representation | Roulette wheel selection | | 92.72 | 88.14 | 14.8 |
GA-based binary representation | Tournament selection | | 90.37 | 86.16 | 14.4 |
GA-based real-coded representation | Roulette wheel selection | | 91.32 | 87.63 | 14.3 |
GA-based real-coded representation | Tournament selection | | 90.17 | 85.90 | 14.19 |
GA-based real-coded representation | Random selection | | 89.35 | 85.75 | 14.35 |
GA with DB and CD indexes | Roulette wheel selection | DB index | 94.36 | 91.60 | 14.19 |
GA with DB and CD indexes | Roulette wheel selection | CD index | 95.17 | 92.85 | 14.12 |
DER results obtained for GA algorithm with WCD index, using different selection as well as different crossover modes
Crossover/selection type | Roulette wheel selection | Tournament selection | Random selection |
---|---|---|---|
Uniform crossover | ACP 92.44 ASP 88.73 DER 14.52 | ACP 92.41 ASP 88.53 DER 14.5 | ACP 91.62 ASP 88.98 DER 14.67 |
Single-point crossover | ACP 91.87 ASP 88.66 DER 14.25 | ACP 90.65 ASP 87.63 DER 14.38 | ACP 89.52 ASP 87.14 DER 14.43 |
Double-point crossover | ACP 92.27 ASP 90.26 DER 14.22 | ACP 91.35 ASP 88.87 DER 14.35 | ACP 91.83 ASP 89.43 DER 14.32 |
ASP, ACP, and DER results obtained using PSO and DE algorithms with different selection modes and with different clustering validity indexes (DB and CS indexes). Also, ASP, ACP, and DER results are obtained using TLBO algorithm
Configuration | Selection mode or validity index | ACP % | ASP % | DER % |
---|---|---|---|---|
PSO algorithm | Roulette wheel selection | 93.54 | 89.13 | 14.62 |
PSO algorithm | Tournament selection | 92.72 | 88.14 | 14.8 |
PSO algorithm | Random selection | 90.37 | 86.16 | 14.4 |
DE algorithm | Roulette wheel selection | 91.32 | 87.63 | 14.3 |
DE algorithm | Tournament selection | 91.87 | 88.66 | 14.25 |
DE algorithm | Random selection | 90.54 | 88.48 | 14.37 |
Roulette wheel selection | DB index | 95.35 | 91.35 | 13.87 |
Roulette wheel selection | CD index | 96.88 | 93.40 | 13.72 |
TLBO algorithm | | 98.63 | 95.65 | 13.27 |
Segmentation results using NDTV dataset
Number of speakers | File duration | F (GA) | F (PSO) | F (DE) | F (TLBO) |
---|---|---|---|---|---|
3 | 10 min | 96.4 | 97.1 | 97.3 | 97.62 |
3 | 10 min 2 s | 96.75 | 97.12 | 97.42 | 97.84 |
3 | 10 min 33 s | 97.3 | 97.85 | 98.32 | 98.45 |
3 | Average | 96.81 | 97.35 | 97.54 | 97.97 |
5 | 10 min | 95.52 | 97.32 | 97.12 | 97.54 |
5 | 10 min 2 s | 95.95 | 97.32 | 97.26 | 97.69 |
5 | 10 min 33 s | 95.3 | 97.65 | 98.00 | 98.30 |
5 | Average | 95.59 | 97.26 | 97.46 | 97.84 |
Performance results of the TLBO algorithm and of the c-bic, c-sid, and p-asr systems on the RT-04F and ESTER datasets. Scores are given for missed speech (MS), false alarms (FA), speaker errors (SPK), and overall diarization error rate (DER). #Ref and #Sys are the reference and system speaker counts, respectively
| System | Method | #Ref | #Sys | MS | FA | SPK | Overall DER |
|---|---|---|---|---|---|---|---|
| RT-04F dev1 dataset | | | | | | | |
| Dev1 | c-sid | 121 | 161 | 0.4 | 1.3 | 5.4 | 7.1 |
| Dev1 | TLBO algorithm | 121 | 161 | 0.383 | 1.116 | 5.75 | 7.249 |
| Show | ABC | 27 | 35 | 1.4 | 1.1 | 12.2 | 14.7 |
| Show | VOA | 20 | 22 | 0.2 | 1.1 | 2.1 | 3.4 |
| Show | PRI | 27 | 29 | 0.1 | 0.8 | 2.7 | 3.6 |
| Show | NBC | 21 | 30 | 0.1 | 0.9 | 11.5 | 12.5 |
| Show | CNN | 16 | 19 | 0.4 | 1.2 | 5.4 | 7.0 |
| Show | MNB | 10 | 13 | 0.1 | 1.6 | 0.6 | 2.3 |
| Dev2 | c-sid | 90 | 130 | 0.5 | 3.1 | 4.1 | 7.6 |
| Dev2 | TLBO algorithm | 90 | 130 | 0.516 | 3.083 | 4.216 | 7.725 |
| Show | CSPN | 3 | 4 | 0.2 | 2.8 | 0.1 | 3.1 |
| Show | CNN | 17 | 20 | 0.6 | 4.1 | 4.9 | 9.6 |
| Show | PBS | 27 | 28 | 0.1 | 2.6 | 7.2 | 10.0 |
| Show | ABC | 23 | 26 | 2.1 | 6.7 | 12.1 | 20.9 |
| Show | CNNHL | 9 | 15 | 0.0 | 1.4 | 0.3 | 1.7 |
| Show | CNBC | 11 | 16 | 0.1 | 0.9 | 0.7 | 1.7 |
| RT-04F dev2 dataset | | | | | | | |
| | c-bic | – | – | 0.4 | 1.8 | 14.8 | 17.0 |
| | c-sid (δ = 0.1) | – | – | 0.4 | 1.8 | 6.9 | 9.1 |
| | p-asr | – | – | 0.6 | 1.1 | 5.2 | 7.6 |
| | TLBO algorithm | – | – | 0.6 | 1.8 | 7.8 | 10.2 |
| ESTER development dataset | | | | | | | |
| | c-bic | – | – | 0.7 | 1.0 | 12.1 | 13.8 |
| | c-sid (δ = 1.5) | – | – | 0.7 | 1.0 | 9.8 | 11.5 |
| | TLBO algorithm | – | – | 0.6 | 1.0 | 9.7 | 12.3 |
| Post-evaluation result on ESTER dataset | | | | | | | |
| | c-sid (δ = 2.0) | – | – | 0.7 | 1.0 | 7.4 | 9.1 |
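The table above reports the overall DER alongside its three components. As a minimal sketch, assuming the overall DER is simply the sum of the missed speech, false alarm, and speaker error rates (all in percent), the relationship can be checked as follows. The function names `overall_der` and `is_consistent` and the tolerance value are illustrative assumptions, not part of the evaluated systems.

```python
def overall_der(missed_speech, false_alarm, speaker_error):
    """Return the overall DER (%) as the sum of its three components (%)."""
    return missed_speech + false_alarm + speaker_error


def is_consistent(missed_speech, false_alarm, speaker_error, reported, tol=0.05):
    """Check a reported DER against the component sum.

    tol is a hypothetical tolerance that absorbs rounding of the
    published figures to one decimal place.
    """
    computed = overall_der(missed_speech, false_alarm, speaker_error)
    return abs(computed - reported) <= tol + 1e-9


# Two rows taken from the RT-04F dev1 part of the table: (MS, FA, SPK, DER).
rows = {
    "Dev1 c-sid": (0.4, 1.3, 5.4, 7.1),
    "Dev1 TLBO algorithm": (0.383, 1.116, 5.75, 7.249),
}

for name, (ms, fa, spk, reported) in rows.items():
    computed = overall_der(ms, fa, spk)
    print(f"{name}: computed {computed:.3f}, reported {reported}")
```

The same helper can be applied to any other row of the table to see how the error budget splits between detection errors (MS and FA) and clustering errors (SPK).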
4 Conclusions
Benchmark function
Function | Formula | Range | Optima |
---|---|---|---|
Sphere | \( {F}_1(x)=\sum_{i=1}^D{x}_i^2 \) | [− 100, 100] | 0 |
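For illustration, the Sphere benchmark listed above can be written in a few lines. This is only a sketch: the dimension `D = 5` and the helper name `random_point` are assumptions made for the example, while the search range [−100, 100] and the optimum 0 come from the table.

```python
import random


def sphere(x):
    """Sphere benchmark F1(x): the sum of squared coordinates."""
    return sum(xi * xi for xi in x)


def random_point(dim, low=-100.0, high=100.0):
    """Draw a candidate solution uniformly from the search range."""
    return [random.uniform(low, high) for _ in range(dim)]


D = 5  # hypothetical problem dimension, chosen only for this example

origin = [0.0] * D
candidate = random_point(D)

print(sphere(origin))                        # value at the global optimum
print(sphere(candidate) >= sphere(origin))   # no point scores below the optimum
```

Because the function is convex with a single minimum at the origin, it is typically used to check that an optimizer converges at all before moving to harder multimodal benchmarks.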
Declarations
Authors’ contributions
DK and HS designed the speaker diarization model, performed the experimental evaluation, and drafted the manuscript. CA reviewed the paper and provided some advice. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.