Masked multicenter angular margin loss for language recognition
EURASIP Journal on Audio, Speech, and Music Processing volume 2022, Article number: 17 (2022)
Abstract
Language recognition based on embedding aims to maximize interclass variance and minimize intraclass variance. Previous research is limited to the training constraint of a single centroid, which cannot accurately describe the overall geometric characteristics of the embedding space. In this paper, we propose a novel masked multicenter angular margin (MMAM) loss method from the perspective of multiple centroids, resulting in a better overall performance. Specifically, numerous global centers are used to jointly approximate entities of each class. To capture the local neighbor relationship effectively, a small number of centers are adapted to construct the similarity relationship between these centers and each entity. Furthermore, we use a new reverse label propagation algorithm to adjust neighbor relations according to the ground truth labels to learn a discriminative metric space in the classification process. Finally, an additive angular margin is added, which learns more discriminative language embeddings by simultaneously enhancing intraclass compactness and interclass discrepancy. Experiments are conducted on the APSIPA 2017 Oriental Language Recognition (AP17-OLR) corpus. We compare the proposed MMAM method with seven state-of-the-art baselines and verify that our method has 26.2% and 31.3% relative improvements in the equal error rate (EER) and C_{avg}, respectively, in the full-length test (“full-length” means the average duration of the utterances is longer than 5 s). There are also 31.2% and 29.3% relative improvements in the 3-s test and 14% and 14.8% relative improvements in the 1-s test.
1 Introduction
Language recognition (LR) is the task of automatically identifying or verifying a language or languages being spoken in a given speech utterance [1]. It plays an essential role in multilingual speech preprocessing, which is typically followed by speech recognition systems and automatic translation systems [2].
Generally speaking, there are two types of LR tasks: closed-set LR and open-set LR. Most current research focuses on closed-set LR, in which every test utterance corresponds to a target language. In other words, the languages of the training set and the test set are the same. In open-set LR, by contrast, the test utterances are not strictly restricted to the target languages and may also correspond to unknown languages [3]. This paper mainly improves the performance of the former category.
Due to the similarity in research fields, recent advances in automatic speech recognition and speaker recognition based on single-center loss (SCL) have improved language recognition applications. Single-center loss can be divided into two types: classification loss and metric loss.
The pioneering work on classification loss learned speaker embeddings for speaker recognition [4–6]. Since then, popular methods train embeddings using softmax classifiers [7–11]. Although the softmax loss can learn separable embeddings, it is not explicitly designed to optimize embedding similarity, so the embeddings are not discriminative enough. Therefore, the model trained by softmax is usually combined with a PLDA back end [6, 12] to generate a scoring function [13, 14]. Wang et al. [15] proposed angular softmax (Asoftmax), using cosine similarity as the logit input of the softmax layer to solve this problem. Many studies have shown that Asoftmax is superior to softmax in speaker recognition [16–19]. The additive margin variants AMSoftmax [15, 20] and AAMSoftmax [21] were proposed to increase the variance between classes by introducing a margin penalty on the target logit, and they have been widely applied because of their good performance [16–18]. However, training with AMSoftmax and AAMSoftmax has proven challenging because they are sensitive to the scale and margin values of the loss function. To improve the performance of AMSoftmax loss, Zhou et al. [22] proposed to set the margin of each training sample dynamically according to the cosine of that sample's angle. Specifically, the smaller the cosine, the farther the training sample is from its class in the feature space, so a larger margin is applied to enforce intraclass compactness.
The embedding learned from the classification loss is only optimized for the separation between classes. In contrast, the metric loss used to learn speaker embeddings not only expands the interclass variance but also reduces the intraclass variance [22]. Triplet loss [23, 24] and contrastive loss [25] optimize the embedding space by minimizing the distance between feature pairs from the same speaker and maximizing the distance between feature pairs from different speakers. However, these methods require careful selection of pairs and triplets, which is time-consuming and performance-sensitive. The generalized end-to-end (GE2E) loss [26] is an enhanced contrastive loss that directly optimizes the cosine distance between the speaker embedding and the centroid, without the complicated sample selection required by triplet loss [23, 24] and contrastive loss [25]. This metric loss also has the final classification layer, and the extraction embedding also needs to be removed.
To sum up, the classification loss only optimizes the distance between a sample and the center without considering the relationship between any two samples. Conversely, the metric loss only optimizes the distance between two samples without considering the relationship between the sample and the center. In this paper, we combine the advantages of both the classification loss and the metric loss and propose to use the multicenter loss (MCL). More specifically, given C classes, MCL designs K centers for each class, so there are K·C centers. For a training sample, we get K positive centers and K·(C−1) negative centers, where a “positive center” belongs to the same class as the sample and a “negative center” belongs to a different class. Similar methods were studied in [27–29]. Deng et al. [27] optimize the distance between the sample and one of the predefined multicenters without considering the other centers. Although [28] optimizes the distance between the sample and all the predefined centers, it only optimizes a weighted combination of the distances to all centers. Zhu et al. [29] propose a proxy-based deep graph metric learning (ProxyGML) method for graph classification, which uses fewer proxies but achieves better overall performance. As [29] provides a good example of the optimization method, we also use this method for our multicenter loss in this paper. Moreover, Wang et al. [15], Wang et al. [20], and Deng et al. [21] introduce an additional angular penalty between the speaker embeddings and the centroid, which reduces the intraclass angular distance so that the embeddings belonging to the same speaker are gathered closely around the centroid. Inspired by this, we also penalize the cosine distance between the sample and the center by adding a margin. In summary, based on MCL, the optimization method provided by ProxyGML, and the additional angular penalty, we propose the masked multicenter angular margin (MMAM) loss in this paper.
Our contributions are summarized as follows:

1. We propose to use multicenter loss to optimize the cosine distance between the language embedding and the corresponding multicenters and, simultaneously, the cosine distance between the multicenters, taking advantage of both the classification loss and the metric loss.
2. We add an additive angular margin to the multicenter loss, which learns more discriminative language embeddings by explicitly enhancing intraclass compactness and interclass differences at the same time.
3. We add a masking operation to the multicenter loss, so that the network itself adaptively selects the optimization method of samples and multiple centroids.
4. The proposed masked multicenter angular margin loss is evaluated by comparing it with seven state-of-the-art baselines. Both its performance and convergence speed far exceed those of the baselines.
5. The proposed MMAM loss can be readily applied to various similar tasks, such as speaker verification and speaker recognition. To the best of our knowledge, we introduce the multicenter loss into language recognition for the first time.
This paper is arranged as follows. In Section 2, we review the GE2E loss [26], AMCentroid loss [30], Softmax loss [7], AAMSoftmax loss [21], DAMSoftmax loss [22], Subcenter loss [27], and Softtriple loss [28], which are the state-of-the-art types of loss used in LR methods. In Section 3, we describe our MMAM loss. We give the experimental setup and experimental results in Sections 4 and 5, respectively. Finally, in Section 6, we conclude this paper.
2 Baselines
Our MMAM loss is inspired by metric loss (e.g., GE2E [26] and AMCentroid [30]) and classification loss (e.g., Softmax [7], AAMSoftmax [21], and DAMSoftmax [22]), as well as the newly popular MCL (e.g., Subcenter [27] and Softtriple [28]).
2.1 Metric Loss
GE2E and AMCentroid are two types of metric loss, which serve as two baselines for our experiments.
2.1.1 GE2E
Let a batch consist of N languages and M utterances per language. We use x_{ij}(1≤i≤N,1≤j≤M) to denote the language embedding extracted from utterance j of language i. In GE2E training, every utterance in the batch except the query itself is used to form centroids. As a result, the centroid of a class k that differs from the query's class and the centroid of the query's own class are defined as follows:
The similarity matrix is defined as scaled cosine similarity between the embeddings and all centroids:
where w, b are learnable parameters and \(\theta _{x_{ij},c_{k}}\) refers to the angle between x_{ij} and c_{k}. The final GE2E loss [26] is defined as:
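The GE2E computation described above can be sketched in plain Python. This is a minimal illustration, assuming the standard GE2E formulation from [26] (the query is left out of its own centroid, and the loss is the softmax form, minus the target similarity plus the log-sum-exp over all centroids). The function names, the averaging over the batch, and the initial values `w=10.0` and `b=-5.0` are assumptions of this sketch, not values stated in this paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def ge2e_loss(x, w=10.0, b=-5.0):
    """x[i][j] is the embedding of utterance j of language i (needs M >= 2)."""
    N, M = len(x), len(x[0])
    centroids = [mean(x[k]) for k in range(N)]
    total = 0.0
    for i in range(N):
        for j in range(M):
            sims = []
            for k in range(N):
                if k == i:
                    # Same class: leave the query itself out of the centroid.
                    c = mean([x[i][m] for m in range(M) if m != j])
                else:
                    c = centroids[k]
                sims.append(w * cosine(x[i][j], c) + b)  # scaled cosine similarity
            # Softmax form: -S_target + log(sum_k exp(S_k)).
            total += -sims[i] + math.log(sum(math.exp(s) for s in sims))
    return total / (N * M)  # averaged over the batch (an assumption of this sketch)
```

In a real system `w` and `b` would be trainable parameters updated together with the network; here they are fixed so that the arithmetic is easy to follow.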
2.1.2 AMCentroid
Although GE2E loss promotes the embedding of language k to be closer to its centroid c_{k} than other centroids, there is still a sizeable intraclass distance. AMCentroid therefore penalizes a large angle between each language embedding and its centroid. After setting b = 0, replacing w with a scalar value s, and adding the angular margin m to the target angle, we get AMCentroid [30] from Eq. 3 as follows:
2.2 From softmax to angular softmax
This section mainly introduces the development process of classification loss, and we choose Softmax [7], AAMSoftmax [21], and DAMSoftmax [22] as three baselines for subsequent experiments.
2.2.1 Softmax
The softmax loss consists of a softmax function followed by a multiclass cross-entropy loss. Its basic form is defined as:
where N and C are the number of training samples and the number of classes, respectively; x_{i} and y_{i} are the feature representation and the target class of the ith sample, respectively; and W and b are the weight and bias of the last layer of the backbone network. This loss function only penalizes classification errors and does not explicitly enforce intraclass compactness and interclass separation.
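As a small illustration of the softmax loss with the symbols defined above, the following sketch computes the mean cross-entropy over N samples. The function name and the list-based data layout (`W[c]` holds the weight vector of class c) are assumptions made for readability.

```python
import math

def softmax_loss(features, labels, W, b):
    """Mean softmax cross-entropy; W[c] and b[c] are the weight and bias of class c."""
    total = 0.0
    for x, y in zip(features, labels):
        logits = [sum(wi * xi for wi, xi in zip(W[c], x)) + b[c] for c in range(len(W))]
        mx = max(logits)  # subtract the maximum for numerical stability
        log_sum = mx + math.log(sum(math.exp(l - mx) for l in logits))
        total += log_sum - logits[y]  # -log softmax probability of the target class
    return total / len(features)
```

With all weights and biases at zero, every class receives probability 1/C, so the loss equals log C regardless of the input.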
2.2.2 Angular softmax
By normalizing the weight and the input vector, the softmax loss can be reexpressed. The posterior probability only depends on the cosine of the angle between the weight and the input vector. The expression \(W_{y_{i}}^{T}x_{i}+b_{y_{i}}\) in the numerator on the right-hand side of Eq. 6 can be rewritten as:
From Eq. 7, we normalize the weight vector and the input vector to unit norm and discard the bias term by setting \(\lVert W_{y_{i}} \rVert =1\), \(\lVert x_{i} \rVert =1\), and \(b_{y_{i}} =0\), which leads to the so-called angular softmax loss [15], defined as follows:
Equation 8 is just a rewrite of Eq. 6 and therefore has the same advantages and disadvantages. To alleviate this problem, a margin m is added to Eq. 8. The additive angular margin penalty is equal to the geodesic distance margin penalty on the normalized hypersphere. There are two types of additive margin penalty: one is applied to the cosine of the target angle, and the other is applied to the target angle itself. The corresponding AMSoftmax [20] and AAMSoftmax [21] loss formulas are:
where s is a fixed scale factor to prevent the gradient of the training phase from being too small. The cosine margin m is manually tuned and is usually larger than 0.
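The difference between the two margin types can be made concrete with a per-sample sketch. It assumes the usual forms, in which AMSoftmax replaces the target cosine with cos(θ) − m and AAMSoftmax replaces it with cos(θ + m). The function name, the `kind` switch, and the scale `s=30.0` are assumptions of this sketch; `m=0.3` matches the margin used for the baselines in this paper.

```python
import math

def margin_softmax_loss(cosines, target, s=30.0, m=0.3, kind="aam"):
    """cosines[j] = cos(theta_j) between a unit-length sample and unit-length weight j."""
    logits = []
    for j, c in enumerate(cosines):
        if j == target:
            if kind == "am":
                c = c - m  # margin applied to the cosine
            elif kind == "aam":
                theta = math.acos(max(-1.0, min(1.0, c)))
                c = math.cos(theta + m)  # margin applied to the angle
            # kind == "none" leaves the target cosine unchanged (angular softmax)
        logits.append(s * c)
    mx = max(logits)
    return mx + math.log(sum(math.exp(l - mx) for l in logits)) - logits[target]
```

For the same cosines, both margin variants give a larger loss than the plain angular softmax, which is what pushes the embedding closer to its class weight during training.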
2.2.3 DAMSoftmax
In Eq. 9, the cosine margin m is a constant shared by all training samples. However, the penalty scales for different samples should be different. DAMSoftmax [22] is based on the assumption that the smaller the cos(θ), the farther the sample is from the corresponding class in the feature space, and the margin m should be set larger to force compactness within the class, so DAMSoftmax loss changes margin m to:
where m_{i} is the cosine margin value of the ith sample, m is the base margin value, and λ is the control factor that controls the range of the margin value.
2.3 Multiple centroids
Multicentroid loss is first proposed in the field of graphics [27, 28]. Subcenter [27] and Softtriple [28] are two types of the existing multicenter classification loss, which serve as two baselines for subsequent experiments.
2.3.1 Subcenter
Assuming that each class has K centers, then the similarity between sample x_{i} and class c can be defined as:
Subcenter loss [27] is designed to optimize the distance between the sample and one of the predefined multicenters without considering the other centers. The loss is defined as follows:
where \(\theta_{i,j} = \arccos\left(\max_{k}\left(W_{jk}^{T}x_{i}\right)\right)\) and \(k \in \{1,2,\ldots,K-1,K\}\).
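The max-over-sub-centers step can be sketched as follows, assuming the sample and all sub-centers are already normalized to unit length. The function names are hypothetical.

```python
import math

def subcenter_similarity(x, class_centers):
    """Similarity between sample x and one class: the maximum dot product over
    that class's K sub-centers (all vectors are assumed to be unit length)."""
    return max(sum(a * b for a, b in zip(x, w)) for w in class_centers)

def subcenter_angle(x, class_centers):
    """theta_{i,j} = arccos(max_k W_jk^T x_i), clamped to the valid arccos range."""
    sim = subcenter_similarity(x, class_centers)
    return math.acos(max(-1.0, min(1.0, sim)))
```

Only the closest sub-center of each class contributes to the angle, so the remaining sub-centers of that class receive no gradient from this sample.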
2.3.2 Softtriple
Softtriple loss [28] mainly considers the similarity distance between the sample and all centers by weighted summation. Its main formulas are as follows:
where λ is the weighted summation of the similarity between the representative sample and all centers, and δ is similar to the previous parameter m ∈(0,1).
3 The proposed method
3.1 Formulation
Our goal is to design a more discriminative feature embedding by adjusting the network structure parameters. Given a training set with C classes, a mini-batch of B samples is randomly selected from the training set, as in usual batch training. Denoting the embedding vector of the ith data sample by \(x_{i}^{s}\) and the corresponding label by \(y_{i}^{s}\), the embedding output of the mini-batch extracted by the deep neural network can be defined as \(S=\left \{ \left (x_{1}^{s},y_{1}^{s}\right), \left (x_{2}^{s},y_{2}^{s}\right),\ldots, \left (x_{B}^{s},y_{B}^{s}\right) \right \}\). In addition, we assign K trainable centers to each class. The value of K is preset, so its optimal value has to be found experimentally. The total number of centers to be trained is C·K. The center set can be expressed as \(C=\left \{ \left (x_{1}^{c},y_{1}^{c}\right),\left (x_{2}^{c},y_{2}^{c}\right),\ldots, \left (x_{C \cdot K}^{c},y_{C \cdot K}^{c}\right) \right \}\). To constrain the similarity relationship between the sample and the center, we also represent the center labels in the set C as a one-hot label matrix Y^{c}∈{0,1}^{(C·K)×C}, with \(Y_{ij}^{c}=1\) if \(y_{i}^{c}=j\) and \(Y_{ij}^{c}=0\) otherwise.
3.2 Masking of similar relationships
The basic idea behind MMAM loss is to ensure that each data sample is close to its associated positive centers and away from its negative centers. Given the embedding vectors \(\left\{ x_{i}^{s} \right\}_{i=1}^{B}\) and all centers \(\left\{ x_{j}^{c} \right\}_{j=1}^{C \cdot K}\) in the mini-batch, we first construct the similarity matrix S∈ℜ^{B×(C·K)} between the samples and the centers, which is calculated by Eq. 12. Both \(x_{i}^{s}\) and \(x_{j}^{c}\) are normalized to unit length, and thus S_{ij}∈[−1,1].
Figure 1 illustrates the overall architecture of MMAM. By generating a similarity matrix between the sample and all the centers, the local relationship around each sample can be further constructed into a series of similarity matrices that better capture the fine-grained neighborhood structure. That is, the similarity matrix S in Fig. 1 can be transformed into a series of subsimilarity matrices W, and the optimization process changes from coarse-grained to fine-grained. Our approach keeps the p largest values in each row of S to construct a p-nearest neighbor (pNN) matrix. Since all centers are initialized randomly, directly selecting the nearest centers for each sample may miss many positive centers, so the centers of the same category cannot all be updated in each iteration. Therefore, we introduce a positive mask S^{pm} to ensure that all the positive centers of each sample are selected, which can also be regarded as a kind of “soft” constraint on the centers. As shown in Fig. 1, guided by the reverse label propagation in subsection 3.3, the series of subsimilarity matrices W encourages similar centers to be close to their corresponding samples while keeping similar centers close to each other.
The positive mask S^{pm} is based on the label of the sample and the center, which essentially reflects the genuine similarity between them:
We compute the indices of the p largest values in each row of (S+S^{pm}) and store them in the index set Ω={…,(i,j),…}. We then construct the subsimilarity matrix, represented as a sparse neighbor matrix W:
where W∈ℜ^{B×(C·K)}. With the help of the positive mask S^{pm}, even if p is relatively small, all the positive centers of each sample will participate in each subsimilarity matrix. In detail, p is given by p=⌈r·C·K⌉, where a scale factor r∈(0,1] is introduced to easily obtain subsimilarity matrices of different scales.
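The construction of S, the positive mask, and the sparse neighbor matrix W can be sketched as below. The sketch assumes, following ProxyGML [29], that the mask entry is 1 when the sample and the center share a label and 0 otherwise, and that W keeps S_{ij} for the selected pairs in Ω and is 0 elsewhere. The function names and the `mask_weight` argument are assumptions of this sketch; with the default of 1.0 the selection uses (S+S^{pm}) literally.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def build_neighbor_matrix(samples, sample_labels, centers, center_labels, r, mask_weight=1.0):
    """Return (S, W, p): cosine similarities, sparse neighbor matrix, neighbors per row."""
    B, CK = len(samples), len(centers)
    p = math.ceil(round(r * CK, 9))  # p = ceil(r * C * K); round() guards float noise
    xs = [unit(x) for x in samples]
    cs = [unit(c) for c in centers]
    S = [[sum(a * b for a, b in zip(x, c)) for c in cs] for x in xs]
    W = [[0.0] * CK for _ in range(B)]
    for i in range(B):
        # Add the positive mask so that same-class centers rank ahead in the selection.
        boosted = [S[i][j] + (mask_weight if sample_labels[i] == center_labels[j] else 0.0)
                   for j in range(CK)]
        top = sorted(range(CK), key=lambda j: boosted[j], reverse=True)[:p]
        for j in top:
            W[i][j] = S[i][j]  # keep the original similarity for selected pairs
    return S, W, p
```

Because cosine similarities lie in [−1, 1], a mask weight larger than 2 makes every positive center outrank every negative center, which gives a strict guarantee that all K positives are selected whenever p ≥ K.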
3.3 Reverse label propagation
In subsection 3.2, the subsimilarity matrix W between samples and centers has been constructed, and the samples and centers from the same category should be close to each other [31]. In semisupervised learning, the idea behind traditional label propagation (LP) is to infer unknown labels through manifold structure [32]. Zhu et al. [29] utilize the known labels to adjust the manifold structure using the proposed reverse label propagation (RLP) algorithm. Inspired by [29, 31, 32], we use the known sample labels to guide the subsimilarity matrix W for optimization. According to the LP idea, all the subsimilarity matrices W are encoded as the predicted output Z.
where Z∈ℜ^{B×C}. The predefined multicenter label Y guides the subsimilarity matrix W to learn the distance between the positive center and the related sample, reflecting how its neighboring center points guide the sample’s classification information from the same category to be close to each other. Specifically, in optimizing the target through RLP, the positive center is close to the relevant training sample, and the negative center will be far away from the training sample.
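A minimal sketch of this encoding step, assuming (as in ProxyGML [29], and as the matrix shapes suggest) that the prediction is the product Z = W·Y^{c}, so that each entry of Z sums a sample's retained similarities to the centers of one class. The function names are hypothetical.

```python
def one_hot_center_labels(center_labels, num_classes):
    """Build Y^c with shape (C*K) x C from integer center labels."""
    return [[1.0 if y == c else 0.0 for c in range(num_classes)] for y in center_labels]

def reverse_label_propagation(W, Yc):
    """Z = W @ Y^c: rows index samples (B), columns index classes (C)."""
    return [[sum(W[i][k] * Yc[k][j] for k in range(len(Yc)))
             for j in range(len(Yc[0]))]
            for i in range(len(W))]
```

For example, a sample whose row of W is [0.9, 0.8, 0.0, 0.1] with center labels [0, 0, 1, 1] gets class scores of about 1.7 for class 0 and 0.1 for class 1.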
3.4 Margin-based optimization
With the previous two subsections, the subsimilarity matrix W has been constructed and the relationship Z between the positive and negative centers and the relevant samples has been designed. Finally, we analyze the process of classification learning. As with the classification loss, the prediction output Z is first converted into a prediction score P through the softmax operation and then optimized with the ground-truth label. In this way, each value in P that reflects the similarity between the sample and a positive center or a negative center will be increased or decreased, respectively. Because Z contains many masked entries, these entries would inflate the denominator of the softmax function and distort the prediction. Therefore, we use a new mask softmax function to prevent the masked values from affecting the prediction score:
where \(\tilde {y_{i}^{s}}\) indicates the prediction label of the ith sample \(x_{i}^{s}\) in S, and Z_{ij} indicates the jth predictive element of the ith sample, and mask M∈{0,1}^{B×C} is defined as follows:
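As a minimal sketch of the masked softmax, assume (as in ProxyGML [29]) that M_{ij} = 0 where Z_{ij} = 0 and M_{ij} = 1 otherwise, so that classes with no selected center are excluded from both the numerator and the denominator. The function name is hypothetical.

```python
import math

def masked_softmax(Z):
    """Row-wise softmax over Z that ignores entries equal to zero."""
    P = []
    for row in Z:
        mask = [0.0 if z == 0.0 else 1.0 for z in row]
        exps = [m * math.exp(z) for m, z in zip(mask, row)]
        denom = sum(exps)
        # A fully masked row has no valid class; return zeros instead of dividing by 0.
        P.append([e / denom if denom > 0.0 else 0.0 for e in exps])
    return P
```

Without the mask, each zero entry would contribute exp(0) = 1 to the denominator and lower the scores of the valid classes.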
Cross-entropy loss is computed between the predicted score and the ground-truth label for each sample. Its performance can be improved by introducing a cosine margin penalty on the target logit to increase the variance between classes. Since our method for constructing the similarity matrix is similar to that of ProxyGML, the difference between MMAM and ProxyGML lies mainly in how the optimization objective is calculated. Unlike the classification-based calculation in ProxyGML, we introduce an additional cosine margin penalty to increase the between-class variance and reduce the within-class variance, which the later experiments show to be effective. After adding the cosine margin penalty, the calculation is as follows:
where \(\theta _{i,j} = arccos(P(\tilde {y_{i}^{s}}=j\mid x_{i}^{s}))\). We also impose a constraint on the centers to ensure that similar centers are close and dissimilar centers are far apart. Specifically, we regard each center as a second type of sample and the other centers of the same or different classes as its positive and negative centers. Repeating the above method, we first construct the total similarity matrix between the centers as:
where S^{c}∈ℜ^{(C·K)×(C·K)}, and both \(x_{i}^{c}\) and \(x_{j}^{c}\) are normalized to unit length. Since the multicenter is initialized randomly, we no longer construct the pNN submatrix for S^{c}; that is, the scale factor r is set to 1. Then, according to the RLP, the predicted output of the relationship between the centers is as follows:
The output Z^{c} can also turn into a prediction score through softmax:
where \(\tilde {y_{i}^{c}}\) and \(Z_{ij}^{c}\) indicate the prediction label of the ith center \(x_{i}^{c}\) in C and the jth predictive element of the ith center, respectively.
Similar to Eq. 21, we also introduce an additional cosine marginal penalty and get Eq. 25.
Combining Eq. 21 and Eq. 25, which represent the MMAM loss between the sample and the center and the MMAM loss between the centers, respectively, our final loss function is
where λ balances the loss between the sample and the center and between the centers. Equation 26 can lead to more discriminative language embeddings.
The source code of computing all the loss functions in this paper is available at https://github.com/hangxiu/mmam_loss/.
4 Experimental setup
4.1 The dataset
The proposed MMAM loss model is evaluated on the AP17-OLR dataset, which was released for the second Oriental Language Recognition Challenge [33, 34]. The dataset originally comprises data from Speechocean and Multilingual Minor Language Automatic Speech Creation and Recognition (M2ASR). There are 10 languages in the dataset: Kazakh in China (ka-cn), Tibetan in China (ti-cn), Uyghur in China (uy-id), Cantonese in China Mainland and Hong Kong (ct-cn), Mandarin in China (zh-cn), Indonesian in Indonesia (id-id), Japanese in Japan (ja-jp), Russian in Russia (ru-ru), Korean in Korea (ko-kr), and Vietnamese in Vietnam (vi-vn) [35].
The dataset is divided into a train/dev part and a test part. The number of speakers and total volume of each language are shown in Table 1. For male and female speakers, the volume of each speaker is balanced. There is no overlap of speakers in the train/dev and test subsets. All speech utterances are recorded via mobile phones with a sampling rate of 16 kHz and a sample size of 16 bits. In the train/dev subset, there are approximately 10 hours of recordings of each language. This dataset provides a full-length subset, including train-all, dev-all, and test-all. It also provides two short-duration subsets, including train-1s, train-3s, dev-1s, dev-3s, test-1s, and test-3s, which are randomly selected from train-all, dev-all, and test-all according to duration.
Our system is evaluated on test-1s, test-3s, and test-all, which contain 22,051, 19,999, and 22,051 utterances, respectively. We use the train and dev subsets jointly as the training set.
4.2 Data augmentation
It is well known that neural networks benefit from data augmentation that generates additional training samples. Therefore, we generate a total of 4 additional samples for each utterance. Specifically, we study two augmentation methods commonly used in speech processing: additive noise and room impulse response (RIR) simulation [36]. For additive noise, we use the MUSAN corpus [37], which contains 60 hours of speech, 42 hours of music, and 6 hours of noise, such as dial tones or environmental sounds. For the room impulse response, we use the simulated RIR filters provided in [36]. In each training step, the noise and RIR filters are randomly selected [38]. The type of augmentation used is similar to [5, 39]. Each recording is augmented by one of the following four methods:

1. RIR filters: We change the gain of the RIR filter to produce a more diverse reverberation signal.
2. Speech: Three to seven recordings are randomly selected from MUSAN and added to the original signal at a random signal-to-noise ratio (SNR) from 13 to 20 dB. The duration of the additive noise is matched to the sampled period.
3. Music: A single music file is randomly selected from MUSAN and added to the original signal at an SNR ranging from 5 to 15 dB.
4. Noise: Background noise in MUSAN is added to the recordings at 0 to 15 dB SNR.
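Adding noise at a chosen SNR, as in the speech, music, and noise methods above, amounts to rescaling the noise so that the power ratio matches the target. The following is a minimal sketch; the function name is hypothetical, and it assumes the signal and the noise have already been cut to the same length.

```python
import math

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that 10*log10(P_signal / P_noise) equals snr_db, then add it."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Target noise power is p_signal / 10^(snr_db / 10).
    scale = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]
```

Drawing `snr_db` uniformly from the stated range for each training sample, for example 5 to 15 dB for music, would reproduce the random-SNR behavior described in the list.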
4.3 Implementation details
4.3.1 Input features
We extract 80-dimensional log Mel-filterbank energies for each speech frame with a width of 25 ms and a step of 10 ms. The training speech segment is set to 2 s, which generates a spectrogram of size 200 × 80. Two-second random crops of the log Mel-filterbank feature vectors are normalized through cepstral mean subtraction, and no voice activity detection is applied.
4.3.2 Training settings
Our implementation is based on the PyTorch framework [40] and uses the Adam algorithm [41] to optimize the deep neural networks. λ is set to 0.3. We use an initial learning rate of 1e-3, decreasing by 5% every 5 epochs, to train Softmax, AAMSoftmax (m=0.3), DAMSoftmax (m=0.3), GE2E, AMCentroid (m=0.3), Subcenter (m=0.3), Softtriple (m=0.3), ProxyGML, and our MMAM (m=0.3). The models trained by MMAM with m=0.3 are used as pretrained models to train MMAM with m>0.3, with the initial learning rate changed to 1e-4. The mini-batch size for all types of classification loss is set to 64. When training GE2E loss and AMCentroid loss, each batch contains 10 languages, and each language contains 6 speech segments, to roughly match the mini-batch size of the classification loss.
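The learning-rate schedule can be written as a one-line function. This sketch assumes that the initial rate is 1e-3 and that "decreasing by 5% every 5 epochs" means a multiplicative step decay by a factor of 0.95; the function and parameter names are hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-3, decay=0.95, step=5):
    """Step decay: multiply the rate by `decay` once every `step` epochs."""
    return initial_lr * decay ** (epoch // step)
```

Under this reading, epochs 0 to 4 use 1e-3, epochs 5 to 9 use 9.5e-4, and so on. In PyTorch the same behavior would likely be obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.95)`.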
4.3.3 Backend
In the train/dev subset, the average vectors of a language can model the language. The test utterance score in a specific language can be the cosine distance between the vector of the test utterance and the vector of the language model generated by the train/dev subset. The formula is as follows:
where E_{avg} is the enrollment utterance mean, and E_{test} is the test utterance vector.
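The scoring step can be sketched as the cosine between the mean enrollment vector and the test vector. The function name and the list-based inputs are assumptions of this sketch.

```python
import math

def language_score(enroll_vectors, test_vector):
    """Cosine score between E_avg (mean of the enrollment vectors) and E_test."""
    e_avg = [sum(col) / len(enroll_vectors) for col in zip(*enroll_vectors)]
    dot = sum(a * b for a, b in zip(e_avg, test_vector))
    norm_avg = math.sqrt(sum(a * a for a in e_avg))
    norm_test = math.sqrt(sum(b * b for b in test_vector))
    return dot / (norm_avg * norm_test)
```

A test utterance is scored against each of the language models in turn, and a higher score indicates a closer match to that language.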
4.4 Evaluation metrics
As in LRE15 [33, 34], C_{avg}, minimum detection cost function (minDCF), detection error tradeoff (DET) curve, and equal error rate (EER) [11] are used to evaluate the performance of different loss systems. These metrics evaluate the system from different perspectives, thereby providing more reliable analysis and conclusions of experimental results. The pairwise loss that constitutes the miss and false alarm probability of a specific target/nontarget language pair is defined as:
where L_{t} and L_{n} are the target and nontarget languages, respectively; P_{Miss} and P_{FA} are the miss and false alarm probabilities, respectively. P_{target} is the prior probability for the target language, which is set to 0.5 in the evaluation. C_{avg} is defined as the average of the above pairwise performance:
where N is the number of languages.
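As an illustration, C_{avg} can be computed from per-language miss probabilities and per-pair false alarm probabilities. The sketch assumes the commonly used LRE15/OLR form, in which the nontarget prior (1 − P_{target}) is shared equally among the N − 1 nontarget languages; the function name and the input layout are hypothetical.

```python
def c_avg(p_miss, p_fa, p_target=0.5):
    """p_miss[t]: miss probability for target language t.
    p_fa[t][n]: false alarm probability for the pair (target t, nontarget n)."""
    N = len(p_miss)
    p_nontarget = (1.0 - p_target) / (N - 1)  # prior per nontarget language
    total = 0.0
    for t in range(N):
        fa_cost = sum(p_nontarget * p_fa[t][n] for n in range(N) if n != t)
        total += p_target * p_miss[t] + fa_cost
    return total / N
```

If every miss and false alarm probability equals 0.1, the function returns 0.1, since the target and nontarget priors sum to one.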
4.5 The neural network architecture
The DNN architecture used to extract language embeddings is based on [42], with several modifications, as shown in Fig. 2. The frame-level feature extractor first applies an ordinary convolutional layer (CNN). The features then pass through the IMTDNNBlock module, which consists of 3 SERes2Block blocks, each containing a dilated convolution preceded and followed by dense layers with a context of 1 frame. The first dense layer reduces the feature dimension, while the second dense layer restores the number of features to the original size. Next is the SEBlock, which scales each channel. A skip connection covers the entire unit, and these layers are parameterized by the number of filters, the filter size, and the dilation factor. Given the hierarchical nature of TDNN, the deeper features are the most complex and the most closely related to language identity. However, based on the evidence in [43], we believe that shallower feature maps also contribute to more robust language embeddings. For each frame, the network concatenates the output feature maps of all SERes2Block blocks. Two unidirectional LSTM layers with 512 units are used to capture the long-term temporal dependence of the frame-level feature sequence. The pooling layer based on the temporal attention mechanism is further extended to the channel dimension, which allows the network to pay more attention to language characteristics that are not activated at the same or similar times. Using the weighted statistical pooling layer, the utterance-level feature vector is generated by calculating the weighted mean and standard deviation of the input feature sequence and concatenating these statistics. The last two fully connected layers have 512 and 192 nodes, respectively, and project the utterance-level feature vectors into the 192-dimensional language embedding.
We use the GELU [44] activation function or no activation function, and we use batch normalization [45] to accelerate training. When the DNN is trained with a classification loss (e.g., Softmax, AAMSoftmax, DAMSoftmax, Subcenter, Softtriple, ProxyGML, and our MMAM), another fully connected layer with ten nodes, or a multiple of ten, is attached to the last layer of the structure as the classification layer. All LSTM layers use tanh as the activation function between them and the hard sigmoid as the activation function for recurrent steps. Early stopping is also used: training ends when the loss on the test set does not decrease for three consecutive epochs. In our experiments, we do not set aside a separate validation set and perform validation directly on the test set, so there is a risk of overfitting and the overall experimental results may be optimistic. However, all the methods in this paper are optimized in this way, so they are comparable. In particular, we do not use the test set to optimize the models with the loss, but only as the termination condition.
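The weighted statistical pooling step can be sketched as follows. The sketch assumes channel-dependent attention weights that are already normalized to sum to 1 over time for every channel; how the weights are produced by the attention network is not shown, and the function name is hypothetical.

```python
import math

def weighted_stats_pooling(frames, weights):
    """frames[t][d]: frame-level features; weights[t][d]: attention weights that
    sum to 1 over t for every channel d. Returns [means..., stds...]."""
    T, D = len(frames), len(frames[0])
    means, stds = [], []
    for d in range(D):
        mu = sum(weights[t][d] * frames[t][d] for t in range(T))
        second = sum(weights[t][d] * frames[t][d] ** 2 for t in range(T))
        var = max(second - mu * mu, 0.0)  # clamp tiny negative values from rounding
        means.append(mu)
        stds.append(math.sqrt(var))
    return means + stds  # concatenated statistics, length 2 * D
```

With uniform weights this reduces to ordinary mean and standard deviation pooling; the output length is twice the number of input channels.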
5 Experimental results
5.1 Parameter analysis
To find the optimal system of the proposed method, the parameters K and r in Section 3.2 and the parameter m in Section 3.4 will be experimentally and theoretically analyzed.
5.1.1 The number of initialization centers K
As described in Section 3.2, p is the number of optimized similarities selected from the similarity matrix S. The upper bound of p is C·K. From the perspective of a network structure optimization, we hope to choose a smaller p so that the network can adaptively select appropriate optimization parameters. Therefore, the scale factor r is introduced to directly select p at different scales.
In terms of experiments, we use four representative scales, r = 0.3, 0.5, 0.8, and 1.0, to explore the influence of the number of centers K, as shown in Fig. 3. Figure 3a, d, and g illustrate the results of EER on test-all, test-3s, and test-1s, respectively; Fig. 3b, e, and h illustrate the results of C_{avg} on test-all, test-3s, and test-1s, respectively; and Fig. 3c, f, and i illustrate the results of minDCF on test-all, test-3s, and test-1s, respectively. We see from Fig. 3 that for all four scale factors (r), the performance is best when K = 3, confirming that the learned feature embedding can better capture intraclass changes through an appropriate number of local cluster centers. When K is further increased, the performance decreases due to overfitting when the centers are overparameterized.
In order to better illustrate the advantages of MMAM loss, the experimental results with different center numbers (K) under four different scale factors (r) of MMAM and ProxyGML are shown in Table 2. It is obvious that in all cases, MMAM is better than ProxyGML.
5.1.2 Scale factor r
We study the influence of the scale factor r with K fixed at 3 and the margin m fixed at 0.3. Figure 4a, b, and c illustrate the experimental results on testall, test3s, and test1s, respectively. We see that the performance is best when r = 0.4, that is, the values of EER, C_{avg}, and minDCF are the smallest, which shows that our proposed MMAM enables the network to select from the similarity matrix adaptively.
Specifically, as can be seen from Section 3.2, the number of positive centers is fixed at K = 3, and the number of negative centers is p−K, where p=⌈r·C·K⌉. On the one hand, when r<0.4, the set of similarities selected by the network for optimization is too small, which leads to poor performance. On the other hand, when r>0.4, the selected set is too large and too many negative centers are introduced, which also leads to poor performance.
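As a worked illustration of p=⌈r·C·K⌉, the sketch below tabulates p and the resulting number of negative centers for the scale factors studied here. The function name `center_counts` is hypothetical, and C = 10 is an assumption inferred from the ten-node classification layer described in the experimental setup. The sketch also assumes p ≥ K, which holds for every r considered in this section.

```python
import math
from fractions import Fraction


def center_counts(r, num_classes, k):
    """Return (p, positives, negatives) for scale factor r.

    p = ceil(r * C * K). The K centers of the sample's own class are the
    positive centers; the remaining p - K selected centers are negative.
    """
    # Fraction(str(r)) keeps decimals such as 0.4 exact, so the ceiling
    # is not affected by floating-point rounding.
    p = math.ceil(Fraction(str(r)) * num_classes * k)
    return p, k, p - k


# C = 10 is an assumption taken from the ten-node classification layer.
for r in (0.3, 0.4, 0.5, 0.8, 1.0):
    p, positives, negatives = center_counts(r, num_classes=10, k=3)
    print(f"r={r}: p={p}, positive={positives}, negative={negatives}")
```

Under these assumptions, r = 0.4 selects p = 12 similarities, of which 9 belong to negative centers, while r = 1.0 reaches the upper bound C·K = 30.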
To better illustrate the advantages of the MMAM loss, the experimental results of MMAM and ProxyGML under different scale factors (r) are shown in Table 3, with the number of centers K fixed at 3. Again, MMAM outperforms ProxyGML in all cases.
5.1.3 Classification margin m
Since the value of the angular margin m has a great influence on the recognition performance and the convergence of the training process, we study different margins m with K fixed at 3 and the scale factor r fixed at 0.4. As shown in Table 4, m is increased in steps of 0.05, and the performance is best when m = 0.5. When m is less than 0.5, training is relatively stable and the experimental results do not change much. In contrast, when m is greater than 0.5, the training process does not converge well, resulting in poor results, especially when m = 0.8, where the performance of the system is equivalent to that of the best baseline. Overall, the performance is best when K = 3, r = 0.4, and m = 0.5. Different values of the angular margin penalty m have a fluctuating effect on the different evaluation metrics, which verifies that m has a great impact on the recognition performance and the training process.
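To give a feel for what the margin does, the sketch below applies a generic additive angular margin, replacing cos θ with cos(θ + m) for the target class, as in AAMSoftmax. It is not a reproduction of Eq. (26): the function name `margin_cosine` and the example similarity of 0.8 are hypothetical, and the sketch only assumes that m enters the MMAM loss as an additive angle.

```python
import math


def margin_cosine(cos_theta, m):
    """Cosine similarity to the target center after adding angular margin m."""
    # Clamp to [-1, 1] so acos stays defined under small numerical drift.
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return math.cos(theta + m)


cos_theta = 0.8  # hypothetical similarity between a sample and its own center
for m in (0.3, 0.5, 0.8):
    print(f"m={m}: {cos_theta:.2f} -> {margin_cosine(cos_theta, m):.3f}")
```

The larger the margin, the more the target similarity is reduced before the loss is computed, so the network must pull samples closer to their centers to compensate. One plausible reading of the results above is that beyond m = 0.5 this penalty becomes too strong for training to converge well.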
5.2 Comparison with different baselines
We compare the performance of the proposed MMAM loss with seven different baselines from four aspects: the training epochs, the experimental results, the DET curve, and the visual analysis with t-SNE.
5.2.1 The training epochs
The performance of the existing multicenter loss functions, Subcenter (Eq. (13)) and Softtriple (Eq. (15)), and of MMAM (Eq. (4)) under testall during the training process is shown in Fig. 5. For convenience of comparison, we display only these three multicenter loss functions, each with its best value of K. From Fig. 5, we see that the convergence speed of Subcenter is the same as that of Softtriple, and their performances are similar. MMAM has both the fastest convergence speed and the best performance. For MMAM, the EER is almost stable between epochs 90 and 120, which we speculate is caused by the relative stability of the similarity optimization between the selected samples and the multiple centers during this segment.
5.2.2 Comparison of performance
Table 5 shows the experimental results of the seven baselines and our proposed MMAM. The first column, “ID”, gives each experiment’s ID. The experimental results fall into three categories: Exp. 1–5 are the baselines with singlecenter loss; Exp. 6–7 are the existing types of multicenter loss; Exp. 8–9 are our proposed MMAM loss with different m. We use “SCL absolute improvement” and “SCL relative improvement” to denote the absolute and relative improvements, respectively, of our MMAM over the best experimental results for singlecenter loss (Exp. 1–5). Correspondingly, we use “MCL absolute improvement” and “MCL relative improvement” to denote the absolute and relative improvements, respectively, of our MMAM over the best experimental results for multicenter loss (Exp. 6–7).
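The two kinds of improvement can be computed as in the sketch below. The formulas are not stated explicitly in the text, so this assumes the conventional definitions for a lower-is-better metric: the absolute improvement is the difference from the baseline, and the relative improvement is that difference as a percentage of the baseline. The function name `improvements` and the example numbers are hypothetical and are not taken from Table 5.

```python
def improvements(baseline, proposed):
    """Absolute and relative (%) improvement of `proposed` over `baseline`.

    Both arguments are error metrics where lower is better,
    e.g. EER in percent, C_avg, or minDCF.
    """
    absolute = baseline - proposed
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Hypothetical values for illustration only; not taken from Table 5.
abs_imp, rel_imp = improvements(baseline=2.00, proposed=1.50)
print(f"absolute: {abs_imp:.2f}, relative: {rel_imp:.1f}%")
```

With these made-up numbers, an EER that drops from 2.00 to 1.50 is a 0.50 absolute and a 25.0% relative improvement.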
In Fig. 6, MMAM is compared with the existing multicenter losses under different numbers of centers (K) and four different scale factors (r = 0.3, 0.5, 0.8, 1.0). Figure 6a, d, and g illustrate the results of EER on testall, test3s, and test1s, respectively. Figure 6b, e, and h illustrate the results of C_{avg} on testall, test3s, and test1s, respectively. Figure 6c, f, and i illustrate the results of minDCF on testall, test3s, and test1s, respectively. From Fig. 6, we see that MMAM significantly reduces EER, C_{avg}, and minDCF, indicating that it is better optimized than the existing types of multicenter loss.
5.2.3 The DET curve
The DET curve is another popular method for evaluating verification and identification systems. Compared with C_{avg}, EER, and minDCF, the DET curve represents the performance at all operating points, so it evaluates the systems more comprehensively. The DET curves of the various methods under different test conditions are plotted in Fig. 7. Figure 7a, b, and c show the results for testall, test3s, and test1s, respectively. As the figure shows, the proposed MMAM exhibits the best performance at most operating points.
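The relationship between operating points and the EER can be sketched as follows. This is a simplified illustration rather than the evaluation script used for the paper: the function name `equal_error_rate` and the toy scores are hypothetical, and the EER is approximated at the threshold where the false acceptance rate (FAR) and false rejection rate (FRR) are closest, without interpolation.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping the threshold over all observed scores.

    At threshold t a trial is accepted when score >= t.
    FRR = fraction of target trials rejected,
    FAR = fraction of non-target trials accepted.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores for illustration only: one target and one non-target overlap.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))
```

Each threshold in the sweep yields one (FAR, FRR) pair, and the full set of pairs is what a DET curve plots. The EER is the single point on that curve where the two error rates coincide.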
5.2.4 Visual analysis with t-SNE
To understand intuitively how well the models distinguish languages, we extract the embedding representations of the different models directly from the test set. Since the visualization results of test3s and testall are almost the same, we show only the visualization results of test1s and testall on the AP17OLR corpus. We then visualize the embedding representations with t-SNE [46], a nonlinear dimensionality reduction algorithm for visualizing high-dimensional data. To make the differences between our proposed method and the baselines easier to observe, we show only the visualizations obtained with the optimal parameters studied in Sections 5.1.1, 5.1.2, and 5.1.3. The baselines likewise use their optimal parameters.
Language embeddings of the baselines and our proposed method under test1s are plotted in Figs. 8 and 9, respectively. For the singlecenter classification losses, Fig. 8a, b, and c show the baselines of Eq. (6), Eq. (10), and Eq. (11), respectively, and Fig. 8d and e show the baselines of Eq. (4) and Eq. (5), respectively. For the multicenter losses, Fig. 8f and g show the baselines of Eq. (13) and Eq. (15), respectively.
For our MMAM loss, Fig. 9a, b, and c show the visual results of Eq. (26). Language embeddings of the baselines and our proposed method under testall are plotted in Figs. 10 and 11, respectively. It can be seen that the language embeddings produced by most of the singlecenter losses are not visualized as well as those produced by the multicenter losses, and the multicenter losses do show traces of different centers. Although some singlecenter losses appear to separate all languages well compared with the multicenter losses, their EERs are significantly higher than those of the multicenter losses, because the EER is determined by the scores and thresholds across languages.
5.3 Comparison with state-of-the-art systems
Table 6 compares the results of MMAM on the AP17OLR test set with the current state-of-the-art results in terms of EER and C_{avg}. The first two rows are the two baselines released by the organizers of the AP17OLR challenge [47]. LDA_HVS [48] uses a factorized hidden variability subspace (FHVS) learning technique to adapt the BLSTM RNN model structure. AFs_ivector + AFs_xvector + AFs_TDNN [49] refers to a fusion of i-vector, x-vector, and TDNN systems, each built on articulatory features (AFs). Multihead attention [50] introduces a multi-head attention mechanism into the self-attention network, on the assumption that each head can capture different information for distinguishing languages. Wav2vec [51] extends Wav2vec 2.0, a self-supervised framework for speech representation learning, to speaker verification and language recognition.
Compared with the systems above, MMAM shows clear advantages on all test sets.
6 Conclusions and future work
The MMAM loss proposed in this paper achieves state-of-the-art language recognition performance on the AP17OLR corpus. The performance is improved by constraining the relationships between centers and samples and between different centers, and by adding an additional angular margin to the loss. EER, C_{avg}, and minDCF are all greatly reduced. It is worth noting that in this paper, we choose the best-performing hyperparameters for all the methods based on the test set, which may risk overfitting.
Our future work is to design other novel functions different from Eq. (26) that can make the network more capable of identifying different languages. In theory, the MMAM loss can be used for language recognition and other classification tasks.
Abbreviations
LR: Language recognition
OLR: Oriental language recognition
SCL: Singlecenter loss
MCL: Multicenter loss
MMAM: Masked multicenter angular margin
GE2E: Generalized end-to-end
AMCentroid: Angular margin centroid
ASoftmax: Angular softmax
AMSoftmax: Additive margin softmax
Subcenter: Subcenter ArcFace
AAMSoftmax: Additive angular margin softmax
DAMSoftmax: Dynamic-additive-margin softmax
pm: Positive mask
LP: Label propagation
RLP: Reverse label propagation
ProxyGML: Proxy-based deep graph metric learning
References
H. Li, B. Ma, K. A. Lee, Spoken language recognition: from fundamentals to practice. Proc. IEEE. 101(5), 1136–1159 (2013).
A. Waibel, P. Geutner, L. M. Tomokiyo, T. Schultz, M. Woszczyna, Multilinguality in speech and spoken language systems. Proc. IEEE. 88(8), 1297–1313 (2000).
J. Yu, J. Zhang, in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Zero-resource language recognition (IEEE, New York, 2019), pp. 1907–1911.
A. Nagrani, J. S. Chung, A. Zisserman, Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612 (2017).
D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, S. Khudanpur, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). X-vectors: Robust dnn embeddings for speaker recognition (IEEE, New York, 2018), pp. 5329–5333.
S. J. Prince, J. H. Elder, in 2007 IEEE 11th International Conference on Computer Vision. Probabilistic linear discriminant analysis for inferences about identity (IEEE, New York, 2007), pp. 1–8.
M. Ravanelli, Y. Bengio, in 2018 IEEE Spoken Language Technology Workshop (SLT). Speaker recognition from raw waveform with sincnet (IEEE, New York, 2018), pp. 1021–1028.
K. Okabe, T. Koshinaka, K. Shinoda, Attentive statistics pooling for deep speaker embedding. arXiv preprint arXiv:1803.10963 (2018).
D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, S. Khudanpur, in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Speaker recognition for multi-speaker conversations using x-vectors (IEEE, New York, 2019), pp. 5796–5800.
W. Cai, D. Cai, S. Huang, M. Li, in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Utterance-level end-to-end language identification using attention-based cnn-blstm (IEEE, New York, 2019), pp. 5991–5995.
B. Padi, A. Mohan, S. Ganapathy, Towards relevance and sequence modeling in language recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1223–1232 (2020).
S. Ioffe, in European Conference on Computer Vision. Probabilistic linear discriminant analysis (Springer, Berlin, Heidelberg, 2006), pp. 531–542.
W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, L. Song, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Sphereface: Deep hypersphere embedding for face recognition (IEEE, New York, 2017), pp. 212–220.
S. Ramoji, P. Krishnan V, P. Singh, S. Ganapathy, Pairwise discriminative neural plda for speaker verification. arXiv preprint arXiv:2001.07034 (2020).
H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, W. Liu, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cosface: Large margin cosine loss for deep face recognition (IEEE, New York, 2018), pp. 5265–5274.
W. Xie, A. Nagrani, J. S. Chung, A. Zisserman, in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Utterance-level aggregation for speaker recognition in the wild (IEEE, New York, 2019), pp. 5791–5795.
D. Garcia-Romero, D. Snyder, G. Sell, A. McCree, D. Povey, S. Khudanpur, in INTERSPEECH. x-vector dnn refinement with full-length recordings for speaker recognition (IEEE, New York, 2019), pp. 1493–1496.
C. Luu, P. Bell, S. Renals, Dropclass and dropadapt: Dropping classes for deep speaker representation learning. arXiv preprint arXiv:2002.00453 (2020).
D. Snyder, J. Villalba, N. Chen, D. Povey, G. Sell, N. Dehak, S. Khudanpur, in INTERSPEECH. The jhu speaker recognition system for the voices 2019 challenge (IEEE, New York, 2019), pp. 2468–2472.
F. Wang, J. Cheng, W. Liu, H. Liu, Additive margin softmax for face verification. IEEE Signal Proc. Lett. 25(7), 926–930 (2018).
J. Deng, J. Guo, N. Xue, S. Zafeiriou, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Arcface: Additive angular margin loss for deep face recognition (IEEE, New York, 2019), pp. 4690–4699.
D. Zhou, L. Wang, K. A. Lee, Y. Wu, M. Liu, J. Dang, J. Wei, in Proc. Interspeech 2020. Dynamic margin softmax loss for speaker verification (IEEE, New York, 2020), pp. 3800–3804.
C. Zhang, K. Koishida, in Interspeech. End-to-end text-independent speaker verification with triplet loss on short utterances (IEEE, New York, 2017), pp. 1487–1491.
V. Mingote, D. Castan, M. McLaren, M. K. Nandwana, A. O. Giménez, E. Lleida, A. Miguel, in INTERSPEECH. Language recognition using triplet neural networks (IEEE, New York, 2019), pp. 4025–4029.
S. Chopra, R. Hadsell, Y. LeCun, in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), 1. Learning a similarity metric discriminatively, with application to face verification (IEEE, New York, 2005), pp. 539–546.
L. Wan, Q. Wang, A. Papir, I. L. Moreno, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Generalized end-to-end loss for speaker verification (IEEE, New York, 2018), pp. 4879–4883.
J. Deng, J. Guo, T. Liu, M. Gong, S. Zafeiriou, in European Conference on Computer Vision. Sub-center arcface: Boosting face recognition by large-scale noisy web faces (Springer, Heidelberg, 2020), pp. 741–757.
Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, R. Jin, in Proceedings of the IEEE/CVF International Conference on Computer Vision. Softtriple loss: Deep metric learning without triplet sampling (IEEE, New York, 2019), pp. 6450–6458.
Y. Zhu, M. Yang, C. Deng, W. Liu, Fewer is more: A deep graph metric learning perspective using fewer proxies. arXiv preprint arXiv:2010.13636 (2020).
Y. Wei, J. Du, H. Liu, in Proc. Interspeech 2020. Angular margin centroid loss for text-independent speaker recognition (IEEE, New York, 2020), pp. 3820–3824.
A. Iscen, G. Tolias, Y. Avrithis, O. Chum, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Label propagation for deep semi-supervised learning (IEEE, New York, 2019), pp. 5070–5079.
W. Liu, J. Wang, S. F. Chang, Robust and scalable graph-based semi-supervised learning. Proc. IEEE. 100(9), 2624–2638 (2012).
Z. Tang, D. Wang, Y. Chen, Q. Chen, in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Ap17-olr challenge: Data, plan, and baseline (IEEE, New York, 2017), pp. 749–753.
D. Wang, L. Li, D. Tang, Q. Chen, in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). Ap16-ol7: A multilingual database for oriental languages and a language recognition baseline (IEEE, New York, 2016), pp. 1–5.
Z. Ma, H. Yu, Language identification with deep bottleneck features. arXiv preprint arXiv:1809.08909 (2018).
T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, S. Khudanpur, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). A study on data augmentation of reverberant speech for robust speech recognition (IEEE, New York, 2017), pp. 5220–5224.
D. Snyder, G. Chen, D. Povey, Musan: A music, speech, and noise corpus. arXiv preprint arXiv:1510.08484 (2015).
W. Cai, J. Chen, J. Zhang, M. Li, On-the-fly data loader and utterance-level aggregation for speaker and language recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1038–1051 (2020).
Z. Qi, Y. Ma, M. Gu, in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). A study on low-resource language identification (IEEE, New York, 2019), pp. 1897–1902.
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019).
D. P. Kingma, J. Ba, Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
B. Desplanques, J. Thienpondt, K. Demuynck, Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. arXiv preprint arXiv:2005.07143 (2020).
Z. Gao, Y. Song, I. McLoughlin, P. Li, Y. Jiang, L. R. Dai, in INTERSPEECH. Improving aggregation and loss function for better embedding learning in end-to-end speaker verification system (IEEE, New York, 2019), pp. 361–365.
D. Hendrycks, K. Gimpel, Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016).
S. Ioffe, C. Szegedy, in International Conference on Machine Learning. Batch normalization: Accelerating deep network training by reducing internal covariate shift (PMLR, New York, 2015), pp. 448–456.
L. Van der Maaten, G. Hinton, Visualizing data using t-sne. J. Mach. Learn. Res. 9(11) (2008).
Z. Tang, D. Wang, Q. Chen, AP18-OLR challenge: Three tasks and their baselines. CoRR abs/1806.00616 (2018). Accessed 1 Nov 2018.
S. Fernando, V. Sethu, E. Ambikairajah, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Factorized hidden variability learning for adaptation of short duration language identification models (IEEE, New York, 2018), pp. 5204–5208.
J. Yu, M. Guo, Y. Xie, J. Zhang, in 2019 International Conference on Asian Language Processing (IALP). Articulatory features based tdnn model for spoken language recognition (IEEE, New York, 2019), pp. 308–312.
R. K. Vuddagiri, T. Mandava, H. K. Vydana, A. K. Vuppala, in 2019 Twelfth International Conference on Contemporary Computing (IC3). Multi-head self-attention networks for language identification (IEEE, New York, 2019), pp. 1–5.
Z. Fan, M. Li, S. Zhou, B. Xu, Exploring wav2vec 2.0 on speaker verification and language identification. arXiv preprint arXiv:2012.06185 (2020).
Acknowledgements
Not applicable.
Funding
This work was supported by the Fundamental Research Funds for the Central Universities (grant number 2021ZY87).
Authors’ contributions
The first author mainly performed the experiments and wrote the paper, and the other authors reviewed and edited the manuscript. All of the authors discussed the final results. All of the authors read and approved the final manuscript.
Authors’ information
Not applicable.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ju, M., Xu, Y., Ke, D. et al. Masked multicenter angular margin loss for language recognition. J AUDIO SPEECH MUSIC PROC. 2022, 17 (2022). https://doi.org/10.1186/s13636022002494
Keywords
 Spoken language recognition
 Masked multicenter angular margin
 Multicenter loss
 Singlecenter loss
 ECAPA-TDNN neural network