Introducing phonetic information to speaker embedding for speaker verification

Phonetic information is one of the most essential components of a speech signal, playing an important role in many speech processing tasks. However, it is difficult to integrate phonetic information into speaker verification systems since it occurs primarily at the frame level while speaker characteristics typically reside at the segment level. In deep neural network-based speaker verification, existing methods only apply phonetic information to the frame-wise trained speaker embeddings. To address this weakness, this paper proposes phonetic adaptation and hybrid multi-task learning and further combines these into c-vector and simplified c-vector architectures. Experiments on National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2010 show that the four proposed speaker embeddings achieve better performance than the baseline. The c-vector system performs the best, providing over 30% and 15% relative improvements in equal error rate (EER) for the core-extended and 10 s–10 s conditions, respectively. On the NIST SRE 2016, 2018, and VoxCeleb datasets, the proposed c-vector approach improves the performance even when there is a language mismatch within the training sets or between the training and evaluation sets. Extensive experimental results demonstrate the effectiveness and robustness of the proposed methods.


Introduction
Automatic speaker verification (ASV) has made great strides in the last two decades, moving from traditional Gaussian mixture model (GMM) approaches [1] to the i-vector framework [2] and neural network-based speaker embedding [3]. Based on Bayesian factor analysis, the i-vector framework converts a variable-length speech utterance into a fixed-length vector representing speaker characteristics. A variety of backend classifiers can be applied to suppress session variability and increase speaker discrimination. Even though the i-vector approach performed well in previous National Institute of Standards and Technology (NIST) speaker recognition evaluations (SREs), it is known to suffer from many problems in practical applications. The i-vector is a point estimate of the total variability factor, ignoring the covariance [2,4]. The performance of i-vector systems Critics of this approach argue that the frame-wise training is not a good option since speaker information tends to reside within long-term segments [13,14].
To address this problem, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) were then introduced to directly capture segment information [13,15,16]. Network architectures and training strategies used in image and face recognition have been adapted to speaker verification [17][18][19]. By using approaches such as statistics pooling [3], self-attention [14] and learnable dictionary encoding (LDE) [20], neural networks are able to extract meaningful low-dimensional vectors from utterances. More effective loss functions have also been proposed to further encourage discriminative learning of speaker embeddings [21][22][23][24]. Speaker embedding has shown state-of-the-art performance compared to i-vectors in many conditions. Speaker embedding also benefits from its ability to utilize big data [25], which is valuable in commercial applications. Based on these advantages, speaker embedding is quickly replacing the i-vector approach as the next generation of speaker verification technology.
Of the many components of a speech signal, speaker traits and phonetic content, representing who spoke what, are the two predominant factors in human communication. The mixing of speaker traits and phonetic content creates challenges for speaker verification. Although speaker embedding has achieved superior performance, most current systems still do not take phonetic content into account. However, in many cases, networks cannot separate speaker information from the intermingled signal and different techniques should be applied. For example, in automatic speech recognition (ASR), speaker adaptation is used to reduce the impact of the speaker factor to improve accuracy [26]. In a similar way, it should be possible to reduce the impact of phonetic information on the speaker embedding. This is a difficult task, however, because phonetic information is dominant at the frame level while speaker information is typically extracted at the segment level. To overcome this level mismatch problem, we propose several methods in this paper to explicitly introduce frame-level phonetic information into the segment-level speaker embedding extraction. The first of these is phonetic adaptation. Similarly to speaker adaptation in ASR, phonetic adaptation uses phonetically rich vectors to remove the influence of phonetic content. This enables the network to focus on speaker traits which are insensitive to content variation. The second approach uses hybrid multi-task learning to extract the information shared between the speaker and phonetic components. This makes the network more robust against noise and improves the model generalization. Since these two approaches are designed from different perspectives, the phonetic adaptation and the hybrid multi-task learning can be combined into a novel c-vector (phonetic information combined vector). A simplified c-vector approach is further presented to reduce the model size.
This paper is an extension of our previous work presented in [27]. The new contributions of this paper are as follows:
• A new c-vector approach has been proposed, combining phonetic adaptation and hybrid multi-task learning. A simplified c-vector architecture has also been presented.
• Extensive experiments on 8-kHz NIST SREs and 16-kHz VoxCeleb [19,28] have been conducted. Severe language mismatch has been introduced into the experiments to assess the generalization of the proposed approaches.
• Data augmentation has been added and a better baseline than that reported in our previous work [27] has been built. In addition, larger datasets have been used to train the phonetic-related models in this paper. These modifications evaluate the effectiveness of the proposed approaches when more training data is available.
The experiments in this paper demonstrate that our proposed systems significantly outperform conventional speaker embedding. The best results are obtained with the c-vector approach. On the NIST SRE 2010 dataset, the resulting relative improvement in equal error rate (EER) is over 30% for the core-extended condition and 15% for the 10 s-10 s condition. Results on NIST SRE 2016, 2018, and VoxCeleb further validate the effectiveness of our methods in the language mismatched condition.
The outline of the paper is as follows. The existing literature about the use of phonetic information in speaker verification is briefly reviewed in Section 2. Section 3 describes the baseline system, and Section 4 presents our proposed approaches to introduce phonetic information in the speaker embedding neural network. Our experimental setup and results are given in Section 5. The last section concludes the paper.

Phonetic information in speaker verification
From an acoustic perspective, speaker traits, phonetic content and other components are intermingled throughout the speech signal. How to separate the speaker traits from speech content is the key problem in speaker verification.
Gaussian mixture models have been successfully used in speaker verification for several decades. In GMM-based speaker verification, features are required to be softly aligned to the corresponding Gaussian mixtures to compute the sufficient statistics. This frame alignment plays an important role in the GMM framework. To improve alignment accuracy, fine-grained GMMs were first proposed to model individual phoneme groups [29,30]. DNN acoustic models were later introduced to improve the frame alignment in [31]. In the DNN approach, the phonetic content is modeled by senones, which are subphonetic classes in speech recognition. The posteriors on these senones are estimated by DNNs and are then used to compute the statistics for the i-vector modeling. This model was extended in [32] where broader phonetic units were investigated. These works showed that comparison of speakers within the same phonetic category reduces the impact of the phonetic variability. Bottleneck (BN) features extracted from ASR acoustic models, which have rich phonetic information, have also been used in many approaches, with and without DNN-based alignment [33]. Overall, the i-vector approach based on DNN alignment and BN features greatly outperforms conventional systems. The existing work in i-vector-based systems suggests the importance of considering phonetic information in speaker verification.
For neural network-based speaker embedding, the d-vector network in [34] used the concatenation of raw features and the outputs of an ASR network to represent phonetic information. In [35], conventional Mel-frequency cepstral coefficients (MFCCs) were replaced by ASR BN features to train the speaker embedding extractor. A collaborative joint training was presented in [36], in which the speaker and speech recognition networks were interconnected. Using an RNN architecture, the outputs of one task were fed into another at the next time step. This feedback enabled the speaker network to receive the information from the speech recognition task.
Multi-task learning, which has been shown to be useful across many different tasks [37], has also been investigated for speaker verification. Speaker traits and phonetic information, two key components of speech, have been combined through the use of multi-task networks. In [38], phonetically related classification was considered as a parallel task to the speaker classification network. The extracted features were effective for text-dependent speaker verification. The same idea has been used in some other works as well [35]. The rationale is that by exploring the common information shared between the speaker and phonetic components, multi-task learning can prevent overfitting and improve model generalization.
However, to the best of our knowledge, none of these networks involving phonetic information consider the level mismatch problem and can only be trained at the frame level (i.e. in the d-vector style). This is not applicable to state-of-the-art segment-level speaker embeddings. Another problem in current multi-task learning for speaker verification is that most multi-task networks share all hidden layers between the speaker and phonetic-discriminant tasks, which is not ideal since this ignores the fact that speaker traits and phonetic content are quite different and likely need several individual layers to extract their own features. Therefore, it is necessary to propose novel architectures to combine the phonetic information with the speaker embedding.

The x-vector baseline
The baseline speaker embedding used in this paper is the x-vector [3]. The x-vector is popular in the speaker verification community and has been provided as the official system on recent NIST SREs. The architecture is illustrated in Fig. 1.
The x-vector network consists of frame-level and segment-level sub-networks, connected by a statistics pooling layer. The frame-level network can be seen as a speaker feature extractor. Given the input sequence $X^{(k)} = [x^{(k)}_1, x^{(k)}_2, \ldots, x^{(k)}_{T_k}]$ from utterance $k$ with $T_k$ frames, the frame-level network transforms the acoustic features $x^{(k)}_t$ into speaker features $f^{(k)}_t$. A CNN variant, the time-delay neural network (TDNN), in which the input of each layer is the sliced outputs of the previous layer, is used. We omit the index $k$ for brevity and denote the frame-level network as
$$f_t = F(x_t; \theta_f)$$
where $F(\cdot)$ denotes the feed-forward function and $\theta_f$ denotes the parameters of the frame-level network. Next, a statistics pooling layer aggregates the speaker features $f_t$ and concatenates the mean and standard deviation as a segment-level representation $l$.
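The statistics pooling step described above can be sketched in a few lines. This is a minimal illustration rather than the Kaldi implementation: the function name `statistics_pooling` is hypothetical, and the use of the population standard deviation is an assumption of the sketch.

```python
import numpy as np

def statistics_pooling(frame_features):
    """Aggregate frame-level features of shape (T, D) into one vector of shape (2 * D,)."""
    frame_features = np.asarray(frame_features, dtype=float)
    mean = frame_features.mean(axis=0)
    # Population standard deviation over the T frames (assumption of this sketch).
    std = frame_features.std(axis=0)
    # The segment-level representation is the concatenation [mean, std].
    return np.concatenate([mean, std])

# Two frames of toy 2-dimensional speaker features.
frames = np.array([[1.0, 2.0], [3.0, 6.0]])
pooled = statistics_pooling(frames)
```

Whatever the number of frames, the pooled vector has a fixed length of twice the feature dimension, which is what lets the segment-level layers operate on utterances of variable duration.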
Fully-connected layers with parameters $\theta_l$ are then implemented as the segment-level network. The output of the segment-level network is fed into a softmax layer, which gives the posterior $P(i|k)$ of speaker $i$ for utterance $k$. The x-vector network parameters $\theta = \{\theta_f, \theta_l\}$ are trained by minimizing the cross-entropy loss. After training, the pre-activation of a hidden layer in the segment-level network is extracted as the speaker embedding. The x-vector backend processing is similar to that of the i-vector. Mean normalization is first applied. Then, linear discriminant analysis (LDA) can be used to reduce the dimension of the embedding and length normalization is also performed [39]. Finally, probabilistic linear discriminant analysis (PLDA) scoring is introduced to generate the verification scores.

Proposed methods
In this section, we will describe our methods to tackle the level mismatch problem and introduce frame-level phonetic information into the training and extraction of the segment-level speaker embedding. Both phonetic adaptation and hybrid multi-task learning are proposed, which are then further combined into an integrated c-vector network.

Phonetic adaptation
A pooling strategy is used in the speaker embedding neural networks to aggregate frame-level speaker features into segment-level representations. Statistics pooling, which concatenates the first- and second-order statistics (i.e., the mean and standard deviation) as outputs, is applied in the x-vector architecture. In speech signals, speaker features are influenced by phonetic content. To make the pooling more effective, phonetic information should be considered in the frame-level network.
Motivated by speaker adaptation as used in speech recognition, we propose a phonetic adaptation method. In speech recognition, a speaker code (e.g., an i-vector) is used as an auxiliary input to help the network reduce the impact of speaker changes [26]. Similarly, phonetic adaptation can be done by feeding phonetically rich vectors into the x-vector network. With the phonetic vectors, the frame-level network is able to learn phonetic-dependent transforms, which are useful for the pooling layer.
In this paper, BN features extracted from an ASR acoustic model are selected as the phonetic vectors. As shown in Fig. 2, a phonetic-discriminant ASR acoustic model with parameters $\theta_a$ is appended to the original network. The phonetic vector $p_t$ is the activation extracted from a hidden layer of the appended model,
$$p_t = F(x_t; \theta'_a)$$
where $\theta'_a$ denotes the parameters of the sub-network used to extract the phonetic vector. Since this sub-network is a part of the ASR acoustic model, $\theta'_a$ can be derived from $\theta_a$. The frame-level network then becomes
$$f_t = F(x_t, p_t; \theta_f)$$
For initialization, an ASR acoustic model with a BN layer is first pre-trained. The activations of the BN layer are connected to the x-vector frame-level network as phonetic vectors. The phonetic vectors can be extracted before the training of the x-vector network. In this case, the additional acoustic model is only used as a phonetic feature extractor. In our experiments, however, we find that fine-tuning the acoustic model with a small learning rate during the x-vector training improves the performance. The unused layers in the acoustic model are removed and the remaining part is updated with the x-vector network. The fine-tuning makes the phonetic vectors more adapted to the speaker verification task. The procedure is described in Algorithm 1.

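Algorithm 1
Training speaker embedding using phonetic adaptation.
Require: $\alpha$, the learning rate. $c$, the learning rate scaling factor. $m$, the batch size. $n_s$, the number of training steps.
Pre-train an ASR acoustic model $\theta_a$. Derive the sub-network $\theta'_a$ from the pre-trained model.
In each of the $n_s$ training steps, a mini-batch of $m$ examples $(X^{(k)}, y^{(k)})$ is used, where $X^{(k)}$ is the feature sequence of utterance $k$ and $y^{(k)}$ is the speaker label. The x-vector network is updated with the learning rate $\alpha$, while the sub-network $\theta'_a$ is fine-tuned with the scaled learning rate $c\alpha$.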
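The role of the scaling factor c in Algorithm 1 can be illustrated with a single parameter update. The sketch below is not the authors' training code: it uses plain SGD instead of NG-SGD, the gradients are assumed to be computed elsewhere, and the function name `sgd_step` and the group names `frame`, `segment` and `acoustic` are hypothetical.

```python
import numpy as np

def sgd_step(params, grads, alpha, c):
    """One plain-SGD update with a scaled learning rate for the acoustic sub-network.

    params, grads: dicts keyed by the hypothetical group names "frame" and
    "segment" (x-vector network) and "acoustic" (phonetic sub-network).
    """
    # The x-vector groups use alpha; the appended acoustic sub-network uses c * alpha.
    rates = {"frame": alpha, "segment": alpha, "acoustic": c * alpha}
    return {name: params[name] - rates[name] * grads[name] for name in params}

params = {name: np.ones(2) for name in ("frame", "segment", "acoustic")}
grads = {name: np.ones(2) for name in params}
updated = sgd_step(params, grads, alpha=0.5, c=0.1)
```

Setting `c = 0` leaves the acoustic sub-network untouched, which corresponds to using it purely as a fixed phonetic feature extractor.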

Hybrid multi-task learning
Phonetic adaptation filters out the phonetic variability by introducing auxiliary vectors. However, although the speaker and phonetic components are different, they still share some common information. Some factors, such as formants, pitch trajectory, and spectral energy distribution, are essential for both speaker traits and phonetic content. By using phonetic unit classification as a parallel task, multi-task learning can discover more informative features which are less sensitive to nuisance factors. The conventional multi-task learning approach in speaker verification suffers from a level mismatch problem, because it requires all the tasks to operate at the same level. This only works for the frame-wise d-vector and is not suitable for the x-vector training. To address this, a hybrid multi-task learning framework is proposed in this section. In this hybrid framework, only the frame-level hidden layers in the x-vector network are shared with the phonetic-discriminant task. This architecture is able to process the frame-level phonetic information and the segment-level speaker embedding at the same time.
Multi-task networks also often share all the layers between different tasks. In our framework, the number of shared layers is set to be a hyper-parameter. This hyper-parameter controls the trade-off between the common and individual information in the speaker and phonetic tasks. The hybrid multi-task learning network is shown in Fig. 3.
In Fig. 3, the frame-level network of the x-vector architecture is partitioned into shared and non-shared parts whose parameters are $\theta_s$ and $\theta_{ns}$, respectively. The parameters of the remaining layers in the phonetic-discriminant network are denoted as $\theta_p$. The speaker feature of the frame-level network is now
$$f_t = F(x_t; \theta_s, \theta_{ns})$$
The training strategy of our multi-task framework is similar to the multi-lingual acoustic model training [40]. When the speaker examples are selected, the parameters $\{\theta_s, \theta_{ns}, \theta_l\}$ are updated, and $\{\theta_s, \theta_p\}$ are trained when the phonetic examples are used. It is possible to set different learning rates to balance the importance of these tasks. The complete training procedure is described in Algorithm 2.
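Algorithm 2
Training speaker embedding using hybrid multi-task learning.
Require: $\alpha_1$, $m_1$, the learning rate and the batch size for the speaker task. $\alpha_2$, $m_2$, the learning rate and the batch size for the phonetic task. $n_s$, the number of training steps. $N_s$, $N_p$, the total number of available training examples for speaker and phonetic tasks.
In each training step, a mini-batch is sampled from either the speaker examples or the phonetic examples; $y^{(k)}$ and $z^{(k)}$ are the speaker and phonetic unit labels, respectively.
If a speaker mini-batch is sampled, the parameters $\{\theta_s, \theta_{ns}, \theta_l\}$ are updated with the learning rate $\alpha_1$; otherwise, the parameters $\{\theta_s, \theta_p\}$ are updated with the learning rate $\alpha_2$.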
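The alternating update at the core of Algorithm 2 can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: plain SGD stands in for NG-SGD, the group names are hypothetical labels for the parameter sets, and the rule in `choose_task` (drawing a task in proportion to its number of mini-batches) is an assumption, since the sampling rule is not recoverable from the text above.

```python
import random

# Hypothetical names for the parameter groups theta_s, theta_ns, theta_l and theta_p.
SPEAKER_GROUPS = ("shared", "non_shared", "segment")
PHONETIC_GROUPS = ("shared", "phonetic")

def choose_task(n_speaker, n_phonetic, m1, m2, rng):
    """Pick the task for the next step.

    Assumption: a task is drawn with probability proportional to its number of
    mini-batches (examples divided by batch size).
    """
    speaker_batches = n_speaker / m1
    phonetic_batches = n_phonetic / m2
    p_speaker = speaker_batches / (speaker_batches + phonetic_batches)
    return "speaker" if rng.random() < p_speaker else "phonetic"

def multitask_step(params, grads, task, alpha1, alpha2):
    """Update only the parameter groups that belong to the sampled task."""
    if task == "speaker":
        groups, rate = SPEAKER_GROUPS, alpha1
    else:
        groups, rate = PHONETIC_GROUPS, alpha2
    new_params = dict(params)
    for name in groups:
        new_params[name] = params[name] - rate * grads[name]
    return new_params

rng = random.Random(0)
params = {"shared": 1.0, "non_shared": 1.0, "segment": 1.0, "phonetic": 1.0}
grads = {name: 1.0 for name in params}
task = choose_task(n_speaker=1000, n_phonetic=4000, m1=64, m2=256, rng=rng)
params = multitask_step(params, grads, task, alpha1=0.1, alpha2=0.1)
```

Only the shared frame-level group receives updates from both tasks, which is how phonetic supervision reaches the speaker embedding without requiring a single corpus that carries both kinds of labels.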

The c-vector
In our phonetic adaptation approach, the phonetic content of an utterance is considered to have a negative impact on the speaker verification task. In contrast, hybrid multi-task learning exploits the useful phonetic information to improve the model generalization. The different perspectives of these methods create an opportunity to further combine them into a unified architecture. In this section, we propose a c-vector using both techniques to accomplish this goal. Figure 4 shows the c-vector architecture, which is a straightforward combination of Figs. 2 and 3. The phonetic vector extracted from a pre-trained acoustic model is introduced to the multi-task network. The phonetic vector is only used in the speaker task while the phonetic-discriminant network is kept unchanged. The integrated network can be jointly optimized following a similar strategy to that of Algorithm 2 except that we need to fine-tune the acoustic model as in Algorithm 1. Since the phonetic vector is only appended to the speaker task, the parameters of the acoustic model providing phonetic vectors will not be updated when the phonetic examples are selected. This will make the ASR acoustic model only focus on the speaker task. The training procedure of the c-vector is summarized in Algorithm 3.

Algorithm 3
The c-vector.
Require: $\alpha_1$, $m_1$, the learning rate and the batch size for the speaker task. $\alpha_2$, $m_2$, the learning rate and the batch size for the phonetic task. $c$, the learning rate scaling factor. $n_s$, the number of training steps. $N_s$, $N_p$, the total number of available training examples for speaker and phonetic tasks.
Pre-train an ASR acoustic model $\theta_a$. Derive the sub-network $\theta'_a$ from the pre-trained model.
If a speaker mini-batch is sampled, the speaker task parameters are updated with the learning rate $\alpha_1$ and the sub-network $\theta'_a$ is fine-tuned with the scaled learning rate $c\alpha_1$; otherwise, only the parameters of the phonetic task are updated with the learning rate $\alpha_2$.
In the c-vector architecture, two independent phonetic branches are used. This is necessary since these two sub-networks are optimized by different objective functions. However, there is also a need to limit the model size. We notice that, in the multi-task learning, the phonetic-discriminant network also provides frame-wise phonetic information. Based on the c-vector architecture, a simplified model is proposed in Fig. 5. In the new model, the pre-trained acoustic model is first removed. A BN layer is then incorporated in the phonetic-discriminant network and the phonetic vectors are extracted from this layer.
Fig. 5 The simplified c-vector architecture. The additional acoustic model is removed and the phonetic vectors come from the BN layer of the phonetic-discriminant network. The gradient-based training is stopped at the interconnected link between the two sub-networks.
Although the speaker-discriminant network in the simplified c-vector uses the activations of the BN layer in the phonetic-discriminant network in the feed-forward step, the gradient from this sub-network should not be back-propagated through the phonetic-discriminant network. The reason is that the phonetic-discriminant network is optimized for phonetic unit classification in the multi-task learning framework. The speaker information introduced into the phonetic-discriminant network may affect the training procedure and weaken the effectiveness of the multi-task learning. To prevent this impact, when optimizing the speaker-discriminant network, the gradient-based training is stopped at the connection introducing phonetic vectors by setting the gradient to zero. It should be pointed out that, in the simplified version of the c-vector, the phonetic-discriminant network cannot be adapted freely; thus, the phonetic vectors are not optimized for the speaker verification task.
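The effect of stopping the gradient at the link can be seen in a toy example with hand-written back-propagation. Everything here is an illustrative assumption: two linear branches stand in for the real sub-networks, the loss 0.5 * s**2 stands in for the speaker loss, and all names are hypothetical.

```python
import numpy as np

def speaker_gradients(x, w_phonetic, w_speaker_x, w_speaker_b, stop_gradient=True):
    """Gradients of a toy speaker loss 0.5 * s**2 for a single frame.

    b = w_phonetic @ x                     BN activation of the phonetic branch
    s = w_speaker_x @ x + w_speaker_b @ b  scalar output of the speaker branch
    """
    b = w_phonetic @ x
    s = w_speaker_x @ x + w_speaker_b @ b
    grad_w_speaker_x = s * x
    grad_w_speaker_b = s * b
    # Gradient arriving at the link between the two branches.
    grad_b = s * w_speaker_b
    if stop_gradient:
        # Cut the link: the speaker loss sends nothing back to the phonetic branch.
        grad_b = np.zeros_like(grad_b)
    grad_w_phonetic = np.outer(grad_b, x)
    return grad_w_speaker_x, grad_w_speaker_b, grad_w_phonetic

x = np.array([1.0, 2.0])
w_phonetic = np.eye(2)
w_speaker_x = np.array([0.5, 0.5])
w_speaker_b = np.array([1.0, -1.0])
_, grad_link_weights, grad_stopped = speaker_gradients(x, w_phonetic, w_speaker_x, w_speaker_b)
_, _, grad_open = speaker_gradients(x, w_phonetic, w_speaker_x, w_speaker_b, stop_gradient=False)
```

With the link cut, the speaker branch still learns how to use the phonetic vector (`grad_link_weights` is non-zero), but the phonetic branch weights receive a zero gradient from the speaker loss.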
In our proposed methods, the ASR acoustic model and the speaker and phonetic-discriminant networks are trained alternately. This procedure does not require the training data to carry both speaker labels and phonetic transcriptions, which differs from many conventional multi-task networks. This flexibility is quite desirable in practice since we may collect the speaker data from one source and the phonetic data from other sources. Even though many speaker verification datasets do not have phoneme or text transcriptions, we can use other ASR datasets to introduce the phonetic information to our speaker embeddings.

Datasets
The performance of the proposed approaches is presented on NIST SREs and VoxCeleb datasets.
Experiments are first carried out on the NIST SRE 2010 core-extended and 10 s-10 s condition 5 [41]. Both conditions involve English conversational telephone speech. The core-extended condition consists of 2-min enrollment and test utterances while the duration of the utterances in the 10 s-10 s condition ranges from 8 to 12 s. To validate our proposed methods when utterances of different languages are presented, the NIST SRE 2016 [42] and 2018 [43] datasets are then used. The NIST SRE 2016 evaluation set contains trials spoken in Tagalog and Cantonese. Although two sources, namely Call My Net 2 (CMN2) and Video Annotation for Speech Technology (VAST), are included in NIST SRE 2018, only the CMN2 subset is used in this paper. The CMN2 subset is composed of speech spoken in Tunisian Arabic. The performance on the VAST subset is not reported since it exhibits quite different characteristics from telephone speech. Different training data and adaptation technologies need to be investigated to achieve good results on the VAST subset [7].
For NIST SREs, the training data consists of 5 Switchboard datasets (Switchboard-2 Phase 1/2/3, Switchboard Cellular Part 1/2) and NIST SRE 2004-2008 telephone  [25]. Two English corpora, the 318-h Switchboard-1 and 1904-h Fisher English, are used to extract phonetic information. Since data augmentation is shown to improve the performance of a speaker verification system [44], it has become a standard pre-processing step. Two noise and reverberation datasets, MUSAN [45] and RIRs [46], are introduced to augment the training data. The augmentation follows the same recipe described in [25].
We also examine our approaches on an independent dataset outside the NIST SREs. The VoxCeleb dataset is extracted from videos on YouTube [19,28]. In this experiment, the results are evaluated on the VoxCeleb1 test set. The training set includes the dev portion of VoxCeleb1 and the entire VoxCeleb2, comprising 2780 h of data and 7323 speakers. Although the VoxCeleb1 dataset only contains English data, the VoxCeleb2 dataset consists of speech from speakers of different nationalities, accents, and languages, making the training set multilingual. Unlike the NIST SREs, the VoxCeleb dataset is sampled at 16 kHz. The 960-h Librispeech [47] is used to introduce phonetic information. This corpus only contains read English speech. Data augmentation is also applied to the training data.
The statistics of all the test sets are shown in Table 1, and Table 2 summarizes the training data usage in the experiments.

Baseline x-vector system
The x-vector system used in our experiments follows the standard setup in the Kaldi SRE16 V2 recipe [48]. For instance, at time t, MFCCs from time (t−2), (t−1), (t), (t + 1) and (t + 2) are concatenated as the input of the first hidden layer. A statistics pooling layer is then applied, followed by 2 fully connected layers. Each hidden layer consists of a linear transform, followed by a rectified linear unit (ReLU) activation and batch normalization. All the hidden layers have 512 nodes except for the one before the statistics pooling, which has 1500 nodes instead. The output is predicted by a softmax layer and the size is equal to the number of training speakers.
Natural gradient for stochastic gradient descent (NG-SGD) [49] is used to train the network. The batch size is 64 and the number of training epochs is 3. The learning rate starts from 0.001 and linearly decreases to 0.0001 at the end of the training. No dropout is applied. This setup follows the same recipe described in [25]. Unless otherwise specified, all the neural networks in this paper are optimized using this setup.
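The frame splicing at the input of the first hidden layer, mentioned above, can be illustrated with a short sketch. It is not taken from the Kaldi recipe: the function name `splice_frames` is hypothetical, and repeating the edge frames as padding is an assumption of this example.

```python
import numpy as np

def splice_frames(features, left=2, right=2):
    """Concatenate each frame with its neighbors from t - left to t + right.

    features: array of shape (T, D). Edge frames are repeated as padding,
    which is an assumption of this sketch.
    Returns an array of shape (T, (left + right + 1) * D).
    """
    features = np.asarray(features, dtype=float)
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    num_frames = features.shape[0]
    # One shifted copy of the sequence per context offset, stacked along the feature axis.
    windows = [padded[offset:offset + num_frames] for offset in range(left + right + 1)]
    return np.concatenate(windows, axis=1)

# Six frames of toy 1-dimensional features with values 0..5.
mfcc = np.arange(6, dtype=float).reshape(6, 1)
spliced = splice_frames(mfcc)
```

Each output row holds the five frames centered on time t, so the first hidden layer sees a short temporal context instead of a single frame.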
After training, the pre-activation of the first hidden layer at the segment-level network is extracted as the x-vector. Mean normalization is first performed and the dimension of the x-vector is then reduced through LDA. After LDA, the dimension of the x-vector is 150 for NIST SREs and 200 for VoxCeleb. The embedding is unit-length normalized and PLDA scoring is finally applied. For NIST SREs, the SRE 2004-2008 corpora with augmented excerpts are used to train the LDA/PLDA, while for VoxCeleb, the entire training set is used.
To deal with the domain mismatch in NIST SRE 2016 and 2018, an unsupervised PLDA adaptation proposed in Kaldi is applied. Due to domain mismatch, the total covariance estimated in the new domain is different from the covariance indicated in the out-of-domain PLDA. In the experiments, 75% of the excess in-domain covariance is attributed to the within-class covariance of the PLDA model while the remaining 25% is attributed to the between-class covariance. The PLDA parameters are then re-estimated based on the new within-and between-class covariances. The adaptation is performed on the unlabeled data of NIST SRE 2016 and 2018. Refer to the Kaldi source code 1 for more details.
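The covariance split described above can be sketched as follows. This is a simplified illustration and not a transcription of the Kaldi code: the function and parameter names are hypothetical, and measuring the excess only along directions where the in-domain variance exceeds the variance predicted by the out-of-domain PLDA (after whitening by its total covariance) is an assumption of this sketch.

```python
import numpy as np

def adapt_plda(within, between, in_domain_total, within_share=0.75, between_share=0.25):
    """Split the excess in-domain covariance between the two PLDA covariances."""
    total = within + between
    # Whitening transform of the out-of-domain total covariance: P @ total @ P.T = I.
    evals, evecs = np.linalg.eigh(total)
    P = (evecs / np.sqrt(evals)).T
    # In-domain total covariance expressed in the whitened space.
    S = P @ in_domain_total @ P.T
    s_evals, s_evecs = np.linalg.eigh(S)
    # Excess variance: only directions whose in-domain variance exceeds 1 are adapted.
    excess = np.maximum(s_evals - 1.0, 0.0)
    excess_whitened = (s_evecs * excess) @ s_evecs.T
    # Map the excess covariance back to the original space.
    P_inv = np.linalg.inv(P)
    excess_cov = P_inv @ excess_whitened @ P_inv.T
    return within + within_share * excess_cov, between + between_share * excess_cov

rng = np.random.default_rng(0)
a = rng.standard_normal((3, 3))
b = rng.standard_normal((3, 3))
within = a @ a.T + np.eye(3)
between = b @ b.T + np.eye(3)
# Toy in-domain data whose total covariance is twice that of the out-of-domain model.
new_within, new_between = adapt_plda(within, between, 2.0 * (within + between))
```

In this toy case the adapted covariances sum to the in-domain total covariance, with three quarters of the added variance assigned to the within-class term.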

Speaker embedding with phonetic information
In our proposed methods, the x-vector architecture is treated as a speaker-discriminant network and keeps the same setting as the baseline. Phonetic information is explicitly introduced by phonetic-discriminant networks. The senone transcriptions force-aligned by GMM-HMM are used to represent the phonetic contents on Fisher and Switchboard. The number of senones is 3800. The effect of the fine-tuning learning rate scaling factor c is discussed in this section. When optimizing the phonetic-discriminant networks, the batch size is set to 256, which is a value often used in ASR acoustic model training.
1 https://github.com/kaldi-asr/kaldi/blob/master/src/ivector/plda.cc
In hybrid multi-task learning, the phonetic-discriminant network uses the same architecture as the x-vector network except for two differences. The statistics pooling layer is excluded in the phonetic-discriminant network and the number of nodes in the 5th layer is reduced to 512. Although the learning rates α 1 and α 2 for the two tasks can be different, keeping them equal performs well in our experiments. The results of sharing different numbers of layers are investigated below.
The c-vector combines the phonetic adaptation and multi-task learning networks and uses the same parameters. In the simplified c-vector architecture, the multi-task learning network is used, with some modifications. The number of nodes in the last hidden layer of the phonetic-discriminant network is reduced to 128 and the activations of this layer are fed into the 5th layer of the x-vector network.
We use x-vector-pa, x-vector-mt, c-vector and sc-vector to denote the systems using phonetic adaptation, hybrid multi-task learning, the c-vector and the simplified c-vector, respectively.
The conventional i-vector and DNN-based i-vector are also used in some evaluations for complete comparison. All the models trained in this paper are gender-independent. We analyze the influence of the parameters in our models on the male part of the NIST SRE 2010 core-extended condition and VoxCeleb. Results on SRE 2016 and 2018 are also reported to validate our approaches.
EER, the minimum detection cost function of NIST SRE 2008 (minDCF08) and SRE 2010 (minDCF10) [41] are used as the main performance metrics. The primary cost measures for SRE 2016 and 2018, denoted as minDCF16 and minDCF18, are reported for these two evaluations respectively [42,43].  The Kaldi toolkit [48] is used to build all the systems in this paper. The code has been released 2 .
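For readers less familiar with the EER metric used throughout the results, the sketch below shows one simple way to estimate it from trial scores. It is a didactic approximation on a finite threshold grid, not the scoring tool used in the experiments; the function name `equal_error_rate` is hypothetical.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by scanning every observed score as a threshold."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # Miss rate: target trials rejected because their score is below the threshold.
    miss = np.array([(target < t).mean() for t in thresholds])
    # False alarm rate: non-target trials accepted at the same threshold.
    false_alarm = np.array([(nontarget >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest to each other.
    idx = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[idx] + false_alarm[idx])

# Perfectly separated toy scores give an EER of 0.0.
eer = equal_error_rate([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
```

The minDCF metrics differ in that they weight misses and false alarms with evaluation-specific costs and priors instead of requiring the two error rates to be equal.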

Phonetic adaptation
This section presents the results of phonetic adaptation in speaker embedding training. In order to show the effectiveness of the acoustic model fine-tuning, several systems are trained with different learning rate scaling factors. Table 3 shows that all the systems using phonetic adaptation outperform the x-vector baseline on the male part of the NIST SRE 2010 core-extended condition. Since the phonetic information is introduced to this network, phonetic adaptation without fine-tuning (i.e., c = 0) reduces the EER by 14%. The performance can be further improved if we update the appended network during the x-vector training. The best result is obtained when the learning rate scaling factor is 0.1. The small learning rate means that the acoustic model only needs a slight adjustment to achieve good performance. Compared to the baseline, the x-vector using fine-tuned phonetic vectors improves the performance by 26%, 22%, and 23% in EER, minDCF08, and minDCF10, respectively.
Due to the introduction of an additional network, the number of model parameters increases. It is not initially clear whether the improvement comes from the phonetic information or the bigger model. Hence, we train a network combining the x-vector architecture and the acoustic model from scratch, where no phonetic information is considered. This model has the same topology as x-vector-pa. As shown in the last row of Table 3, the larger network does improve the performance, but the gain is smaller and it still performs worse than our proposed systems. The result validates the importance of the phonetic information extracted from the pre-trained acoustic model.
Table 4 presents the results on VoxCeleb. Phonetic adaptation is also effective on this dataset. Since the acoustic model is pre-trained only using English speech, the extracted phonetic vectors are not a good match with the multi-lingual VoxCeleb training set. Fine-tuning alleviates this mismatch to some extent. Compared with NIST SRE 2010, a higher learning rate scaling factor needs to be used. In this case, 0.2 or 0.3 seems to be a good option. The speaker embedding using phonetic adaptation improves the EER of the baseline by a relative 6% and 16%, without and with fine-tuning (c = 0.2), respectively.

Hybrid multi-task learning
Table 5 gives the performance of several systems sharing different numbers of frame-level layers between the speaker and phonetic-discriminant networks on NIST SRE 2010. From Table 5, we find that although the multi-task learning adds benefit in this condition, sharing more layers does not always improve the performance. The best overall result on our development set is obtained when 3 layers are shared. We decrease the EER from 1.96%, when no multi-task learning is used, to 1.52%, when 3 layers are shared, resulting in a 22% relative reduction. The minDCF08 and minDCF10 in this configuration also improve by 33% and 29% compared to the baseline.
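As a check on how the relative reductions quoted here are obtained, the EER figures for 3-layer sharing give the following worked calculation. The formula is the usual definition of relative reduction, stated here as an assumption about how the percentages were computed.

```latex
\[
\text{relative reduction}
= \frac{\mathrm{EER}_{\text{baseline}} - \mathrm{EER}_{\text{proposed}}}{\mathrm{EER}_{\text{baseline}}}
= \frac{1.96 - 1.52}{1.96} \approx 0.22
\]
```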
In contrast to the previous results, the VoxCeleb experiments reported in Table 6 show that multi-task learning improves the results only slightly on this dataset. By sharing 1 layer, the system outperforms the baseline by 4%, 8%, and 13% on EER, minDCF08, and minDCF10, respectively. However, the performance degrades rapidly when more layers are shared. We hypothesize that this is due to the language mismatch between Librispeech and the VoxCeleb training set. In this case, sharing more layers cannot provide more useful information for the speaker-discriminant network. On the contrary, with more layers shared, the number of individual layers in the speaker-discriminant network decreases, making the extraction of speaker characteristics more difficult.

The c-vector
The advantages of phonetic adaptation and multi-task learning are combined in the proposed c-vector. The tunable hyper-parameters now include both the learning rate scaling factor and the number of shared layers. We start our analysis on NIST SRE 2010. Based on the above experiments, the learning rate scaling factor ranges from 0.1 to 0.3, and 2 or 3 layers are shared between the speaker and phonetic-discriminant networks. Table 7 shows that the EER and minDCFs on the male part of the core-extended condition can be greatly reduced if proper parameters are selected. For each configuration, the c-vector performs better than the systems using only phonetic adaptation or multi-task learning. Consistent with the former experiments, better performance is obtained using a smaller learning rate scaling factor. Sharing 2 layers results in better EER, while 3-layer sharing is better for minDCF10. The minimum EER (1.12%) is achieved in the second row, while the best minDCF08 (0.0065) and minDCF10 (0.2449) are obtained in the third row. As a trade-off between these operating points, we set c = 0.1 and share 3 layers in our c-vector.
Table 8 presents the performance of the c-vector approach on VoxCeleb. The best performance is obtained when the learning rate scaling factor is 0.2 and 1 layer is shared, resulting in 19%, 10%, and 36% relative reduction on EER, minDCF08, and minDCF10, respectively. As explained previously, compared with the c-vector used in NIST SRE 2010, the learning rate scaling factor is increased while the number of shared layers is decreased.
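Since the comparison above hinges on the EER as one operating point, a simple way to estimate it from trial scores may be useful. The following is a minimal sketch, assuming that a trial is accepted when its score is at or above the threshold and that the EER is read off at the threshold where the false acceptance and false rejection rates are closest. It is not the scoring tool used in the experiments, and the toy scores are invented for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping a threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: target trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: non-target trials scoring at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Toy scores (illustrative only): one target scores below one non-target.
toy_eer = equal_error_rate([0.9, 0.8, 0.3], [0.7, 0.2, 0.1])
```

On the toy scores, both error rates equal 1/3 at a threshold of 0.7, so the estimated EER is 1/3. The minDCF metrics weight the two error types differently, which is why the best configuration can differ between EER and minDCF10.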

The simplified c-vector
We further evaluate the simplified c-vector (sc-vector) on these two datasets. As shown in Table 9, the sc-vector approach sharing 3 layers achieves the best overall performance on the male part of the NIST SRE 2010 core-extended condition. This is consistent with the results observed for the c-vector. Table 10 shows the results obtained from different sc-vector configurations on VoxCeleb. Unlike the results in Table 9, the sc-vector does not significantly improve the performance on VoxCeleb. As shown in the above experiments, phonetic adaptation with fine-tuning is more helpful than multi-task learning on this dataset, which means that fine-tuning is vital in the language mismatch condition. For the sc-vector, however, the phonetic-discriminant sub-network is optimized only by the out-of-domain phonetic unit classification, which limits the power of the phonetic vectors.

Comparison of systems on NIST SRE 2010 and VoxCeleb
Tables 11 and 12 summarize the results of the x-vector baseline and all our proposed methods with the best system configurations on NIST SRE 2010. Two i-vector systems are also included. The setups of the i-vector systems are the same as those of the Kaldi SRE10 recipe. From the results, we find that, when fine-tuning is used, the speaker embedding with phonetic adaptation achieves better results than the baseline x-vector in almost all conditions. The only exception is the minDCF08 in the female 10 s-10 s condition, which is still very close to the baseline. With multi-task learning, the proposed speaker embedding generally improves the performance, except for the minDCF10 in the male 10 s-10 s condition. The relative improvements are about 20% in the core-extended condition and about 10% in the 10 s-10 s condition. From Tables 11 and 12, it is difficult to conclude which of these two approaches is better because each has advantages in different conditions. Overall, the sc-vector delivers better performance than the previous systems, and the c-vector generally performs the best on NIST SRE 2010. Even though the x-vectors with phonetic adaptation and hybrid multi-task learning reduce the EER and minDCFs compared to the baseline for both genders, the c-vector approach further improves the performance.
The only case where the c-vector approach performs worse than the multi-task learning system is the EER on the male part of the 10 s-10 s condition. In contrast, the minDCF10 is significantly better, leading to a 33% relative reduction. On the male part of the core-extended condition, the c-vector significantly outperforms the original x-vector by 38%, 40%, and 37% in EER, minDCF08, and minDCF10, respectively. In the 10 s-10 s condition, the improvement is 16% in EER and over 20% in the minDCFs. The performance is similar on the female part. For the sc-vector, the model size is reduced by removing the appended acoustic model. The cost is that phonetic vector fine-tuning is unavailable, and as a result the sc-vector performs worse than the c-vector.
As shown in Tables 11 and 12, the i-vector framework can also benefit from the use of phonetic information on NIST SRE 2010. The i-vector system using DNN-based alignments (DNN/i-vector) outperforms the vanilla i-vector, especially when the utterance duration is long. The improvement from incorporating phonetic information in the i-vector system is almost 50% in some conditions. Even so, compared with our proposed c-vector, the DNN-based i-vector only achieves better results on the male part of the core-extended condition and performs much worse in the 10 s-10 s conditions.
Next, we investigate the performance of the different systems on VoxCeleb. The results are shown in Table 13. The i-vector systems are not included due to their inferior results. Unlike in NIST SRE 2010, a language mismatch exists between the speaker and phonetic training sets on VoxCeleb. From Table 13, it is clear that phonetic adaptation performs better than multi-task learning and that a more aggressive learning rate scaling factor should be applied. We see that the sc-vector fails to outperform the x-vectors with phonetic adaptation and hybrid multi-task learning on this dataset.
The likely reason is that the sc-vector cannot use fine-tuning to adapt the acoustic model, so the extracted phonetic vectors are not suitable for the new domain. In fact, the performance of the sc-vector is similar to that of the system in the second row of Table 13. It seems that the introduction of the BN layer in the phonetic-discriminant network training has a negative impact, so that the multi-task learning in the sc-vector does not further improve the results of the speaker embedding using phonetic adaptation without any fine-tuning. Again, the c-vector approach performs the best on VoxCeleb.

Results on NIST SRE 2016 and 2018
Only English speech is used in NIST SRE 2010. In this case, the i-vector and speaker embedding systems both achieve better results when including phonetic information. In recent NIST SREs, language and channel mismatches were introduced between the training and test data [42,43]. Although we have examined the new speaker embeddings on the multi-lingual VoxCeleb dataset, it is still interesting to investigate the performance of our proposed methods on the more challenging NIST SRE 2016 and 2018. According to the experimental results on NIST SRE 2010 and VoxCeleb, a larger learning rate scaling factor and fewer shared layers should be used for our proposed approaches in the language mismatched condition. All the results reported in this section are obtained by setting the learning rate scaling factor to 0.2 in the phonetic adaptation and sharing 1 layer in the multi-task learning.
The results on the two subsets of NIST SRE 2016, Tagalog and Cantonese, are reported in Table 14. We find that, due to the severe language mismatch, the DNN-based i-vector system performs worse than the conventional i-vector. The reason is that the DNN trained on the English corpus cannot accurately compute the senone posteriors in Tagalog and Cantonese. Table 14 also demonstrates that the speaker-embedding systems outperform both i-vector systems on NIST SRE 2016. The x-vector baseline reduces the EER from 15.73% (i-vector) to 9.46%.
Unlike the i-vector systems, our proposed methods still benefit from the added phonetic information even in this language mismatched condition. The first observation is that the x-vector using phonetic adaptation outperforms the baseline and that fine-tuning further improves the performance. The hybrid multi-task learning is also beneficial for the speaker embedding. Compared with phonetic adaptation, hybrid multi-task learning performs worse on the Tagalog subset but better on the Cantonese subset. The sc-vector achieves results similar to those of multi-task learning on the Tagalog subset and a better EER on Cantonese. The last row in Table 14 shows that the c-vector performs the best. Compared to the conventional x-vector, the c-vector improves the EER and minDCF16 on the pooled set by 15% and 10% relative, respectively.
Table 15 summarizes the results of different systems on the NIST SRE 2018 CMN2 development and evaluation sets. The x-vector using fine-tuned phonetic vectors performs much better than multi-task learning. Without fine-tuning, the sc-vector does not significantly improve the performance. The c-vector approach performs similarly to the x-vector with fine-tuned phonetic adaptation (the 3rd row in Table 15) on this dataset. This confirms the important role of fine-tuning in the language mismatched condition. When the c-vector approach is used, the EER and minDCF18 of the baseline are reduced by 13% and 8% on the evaluation set.
The actDCF18 is also reported on the evaluation set of the NIST SRE 2018 CMN2 subset. Logistic regression-based score calibration is used. The calibration parameters are first trained on the development set using the Bosaris toolkit [50] and then applied to the evaluation set. In Table 15, the actDCFs show a trend similar to that of the minDCFs, and the proposed speaker embeddings still perform better than the baseline in actDCF18.
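The calibration step described above (train on the development set, apply to the evaluation set) can be illustrated with a small sketch. This is not the Bosaris toolkit procedure: it assumes a plain, unweighted logistic regression that learns a scale a and an offset b by gradient descent so that a * score + b behaves like a log-odds value. All function names, hyper-parameters, and toy scores are hypothetical.

```python
import math


def cross_entropy(scores, labels, a, b):
    """Mean logistic loss of the calibrated scores a * s + b (labels are 0/1)."""
    total = 0.0
    for s, y in zip(scores, labels):
        z = a * s + b
        total += math.log1p(math.exp(-z if y else z))
    return total / len(scores)


def train_calibration(scores, labels, lr=0.1, epochs=500):
    """Fit scale a and offset b by batch gradient descent on the logistic loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_calibration(scores, a, b):
    """Apply the affine calibration learned on the development set."""
    return [a * s + b for s in scores]


# Toy development trials (1 = target, 0 = non-target); values are illustrative.
dev_scores = [3.0, 3.5, 4.0, 5.0, 0.0, 1.0, 2.0, 3.2]
dev_labels = [1, 1, 1, 1, 0, 0, 0, 0]
a, b = train_calibration(dev_scores, dev_labels)

# Hypothetical evaluation scores calibrated with the development-set parameters.
eval_calibrated = apply_calibration([0.5, 4.5], a, b)
```

Because the parameters are estimated only on the development set, the actual cost on the evaluation set reflects how well that calibration transfers, which is what actDCF measures in contrast to minDCF.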
Although the improvements due to applying the phonetic information in these conditions are smaller than those of NIST SRE 2010, these results show the effectiveness and robustness of our proposed approaches when a language mismatch exists.

Conclusions
Although phonetic information has been reported to be effective in both the i-vector and frame-level d-vector frameworks, it is rarely used in state-of-the-art speaker embeddings. In this paper, we propose several approaches to overcome the level mismatch problem and introduce frame-level phonetic information into segment-level speaker embedding. The first approach applies phonetic adaptation using phonetic vectors. The phonetic vectors, which are extracted from a fine-tuned ASR acoustic model, are used as auxiliary inputs to the x-vector network. The second approach uses hybrid multi-task learning to exploit the shared information between speaker traits and phonetic content, which improves model generalization. Finally, we propose a c-vector architecture combining these two approaches, as well as a simplified c-vector that uses phonetic vectors extracted from the phonetic-discriminant network in the multi-task learning approach. On the NIST SRE 2010 core-extended and 10 s-10 s conditions, the proposed speaker embeddings using phonetic adaptation and hybrid multi-task learning significantly outperform the conventional x-vector, with the best performance achieved by our combined c-vector approach. Moreover, the results on the language mismatched NIST SRE 2016, 2018, and VoxCeleb show that the proposed approaches perform well even when different languages are present. The relationship between the performance and different system configurations has been carefully analyzed across different conditions. These results provide strong support for the benefit of including phonetic information in speaker embedding-based speaker verification systems.