
Introducing phonetic information to speaker embedding for speaker verification

Abstract

Phonetic information is one of the most essential components of a speech signal, playing an important role in many speech processing tasks. However, it is difficult to integrate phonetic information into speaker verification systems since it occurs primarily at the frame level while speaker characteristics typically reside at the segment level. In deep neural network-based speaker verification, existing methods only apply phonetic information to frame-wise trained speaker embeddings. To address this weakness, this paper proposes phonetic adaptation and hybrid multi-task learning and further combines these into c-vector and simplified c-vector architectures. Experiments on National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2010 show that the four proposed speaker embeddings achieve better performance than the baseline. The c-vector system performs the best, providing over 30% and 15% relative improvements in equal error rate (EER) for the core-extended and 10 s–10 s conditions, respectively. On the NIST SRE 2016, 2018, and VoxCeleb datasets, the proposed c-vector approach improves the performance even when there is a language mismatch within the training sets or between the training and evaluation sets. Extensive experimental results demonstrate the effectiveness and robustness of the proposed methods.

1 Introduction

Automatic speaker verification (ASV) has made great strides in the last two decades, moving from traditional Gaussian mixture model (GMM) approaches [1] to the i-vector framework [2] and neural network-based speaker embedding [3]. Based on Bayesian factor analysis, the i-vector framework converts a variable-length speech utterance into a fixed-length vector representing speaker characteristics. A variety of backend classifiers can be applied to suppress session variability and increase speaker discrimination. Even though the i-vector approach performed well in previous National Institute of Standards and Technology (NIST) speaker recognition evaluations (SREs), it is known to suffer from many problems in practical applications. The i-vector is a point estimate of the total variability factor, ignoring the covariance [2, 4]. The performance of i-vector systems deteriorates dramatically when the utterance is short, because this point estimate does not model the uncertainty [5]. I-vectors are also vulnerable to language and channel mismatch as shown in recent NIST SREs [6, 7]. Moreover, the performance of the i-vector model tends to asymptote quickly as the amount of data increases, which means that it is unable to fully exploit the availability of large-scale training data [3].

Deep neural networks (DNNs) have been used for speech processing tasks for a number of years [8–10]. Recently, neural network-based speaker embedding has drawn much attention in the speaker verification community. Motivated by the i-vector concept, speaker embedding encodes the speaker characteristics of an utterance into a fixed-length vector using neural networks. The first such method was the d-vector approach, initially proposed for text-dependent speaker verification [11]. The network was trained frame-by-frame and the d-vector was extracted by averaging all the activations of a selected hidden layer from an utterance. This network architecture was extended to text-independent verification in [12]. Critics of this approach argue that the frame-wise training is not a good option since speaker information tends to reside within long-term segments [13, 14].

To address this problem, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) were then introduced to directly capture segment information [13, 15, 16]. Network architectures and training strategies used in image and face recognition have been adapted to speaker verification [17–19]. By using approaches such as statistics pooling [3], self-attention [14], and learnable dictionary encoding (LDE) [20], neural networks are able to extract meaningful low-dimensional vectors from utterances. More effective loss functions have also been proposed to further encourage discriminative learning of speaker embeddings [21–24]. Speaker embedding has shown state-of-the-art performance comparable to i-vectors in many conditions. Speaker embedding also benefits from its ability to utilize big data [25], which is valuable in commercial applications. Based on these advantages, speaker embedding is quickly replacing the i-vector approach as the next generation of speaker verification technology.

Of the many components of a speech signal, speaker traits and phonetic content, representing who spoke what, are two predominant factors in human communication. The mixing of speaker traits and phonetic content creates challenges for speaker verification. Although speaker embedding has achieved superior performance, most current systems still do not take phonetic content into account. However, networks often cannot separate speaker information from the intermingled signal on their own, so additional techniques are needed. For example, in automatic speech recognition (ASR), speaker adaptation is used to reduce the impact of the speaker factor and improve accuracy [26]. In a similar way, it should be possible to reduce the impact of phonetic information on the speaker embedding.

This is a difficult task, however, because phonetic information is dominant at the frame level while speaker information is typically extracted at the segment level. To overcome this level mismatch problem, we propose several methods in this paper to explicitly introduce frame-level phonetic information into the segment-level speaker embedding extraction. The first of these is phonetic adaptation. Similarly to speaker adaptation in ASR, phonetic adaptation uses phonetically rich vectors to remove the influence of phonetic content. This enables the network to focus on speaker traits which are insensitive to content variation. The second approach uses hybrid multi-task learning to extract the information shared between the speaker and phonetic components. This makes the network more robust against noise and improves the model generalization. Since these two approaches are designed from different perspectives, the phonetic adaptation and the hybrid multi-task learning can be combined into a novel c-vector (phonetic information combined vector). A simplified c-vector approach is further presented to reduce the model size.

This paper is an extension of our previous work presented in [27]. The new contributions of this paper are as follows:

  • A new c-vector approach has been proposed, combining phonetic adaptation and hybrid multi-task learning. A simplified c-vector architecture has also been presented.

  • Extensive experiments on 8-kHz NIST SREs and 16-kHz VoxCeleb [19, 28] have been conducted. Severe language mismatch has been introduced into the experiments to assess the generalization of the proposed approaches.

  • Data augmentation has been added and a better baseline than that reported in our previous work [27] has been built. In addition, larger datasets have been used to train the phonetic-related models in this paper. These modifications evaluate the effectiveness of the proposed approaches when more training data is available.

The experiments in this paper demonstrate that our proposed systems significantly outperform conventional speaker embedding. The best results are obtained with the c-vector approach. On the NIST SRE 2010 dataset, the resulting relative improvement in equal error rate (EER) is over 30% for the core-extended condition and 15% for the 10 s–10 s condition. Results on NIST SRE 2016, 2018, and VoxCeleb further validate the effectiveness of our methods in the language mismatched condition.

The outline of the paper is as follows. The existing literature about the use of phonetic information in speaker verification is briefly reviewed in Section 2. Section 3 describes the baseline system, and Section 4 presents our proposed approaches to introduce phonetic information in the speaker embedding neural network. Our experimental setup and results are given in Section 5. The last section concludes the paper.

2 Phonetic information in speaker verification

From an acoustic perspective, speaker traits, phonetic content and other components are intermingled throughout the speech signal. How to separate the speaker traits from speech content is the key problem in speaker verification.

Gaussian mixture models have been successfully used in speaker verification for several decades. In GMM-based speaker verification, features are required to be softly aligned to the corresponding Gaussian mixtures to compute the sufficient statistics. This frame alignment plays an important role in the GMM framework. To improve alignment accuracy, fine-grained GMMs were first proposed to model individual phoneme groups [29, 30]. DNN acoustic models were later introduced to improve the frame alignment in [31]. In the DNN approach, the phonetic content is modeled by senones, which are sub-phonetic classes in speech recognition. The posteriors on these senones are estimated by DNNs and are then used to compute the statistics for the i-vector modeling. This model was extended in [32] where broader phonetic units were investigated. These works showed that comparison of speakers within the same phonetic category reduces the impact of the phonetic variability. Bottleneck (BN) features extracted from ASR acoustic models, which have rich phonetic information, have also been used in many approaches, with and without DNN-based alignment [33]. Overall, the i-vector approach based on DNN alignment and BN features greatly outperforms conventional systems. The existing work in i-vector-based systems suggests the importance of considering phonetic information in speaker verification.

For neural network-based speaker embedding, the d-vector network in [34] used the concatenation of raw features and the outputs of an ASR network to represent phonetic information. In [35], conventional Mel-frequency cepstral coefficients (MFCCs) were replaced by ASR BN features to train the speaker embedding extractor. A collaborative joint training was presented in [36], in which the speaker and speech recognition networks were interconnected. Using an RNN architecture, the outputs of one task were fed into another at the next time step. This feedback enabled the speaker network to receive the information from the speech recognition task.

Multi-task learning has also been investigated for speaker verification. Multi-task learning has been shown to be useful across many different tasks [37]. Speaker traits and phonetic information, two key components of speech, have been combined through the use of multi-task networks. In [38], phonetically-related classification was considered as a parallel task to the speaker classification network. The extracted features were effective for text-dependent speaker verification. The same idea has been used in some other works as well [35]. The rationale is that by exploring the common information shared between the speaker and phonetic components, multi-task learning can prevent overfitting and improve model generalization.

However, to the best of our knowledge, none of these networks involving phonetic information consider the level mismatch problem and can only be trained at the frame level (i.e. in the d-vector style). This is not applicable to state-of-the-art segment-level speaker embeddings. Another problem in current multi-task learning for speaker verification is that most multi-task networks share all hidden layers between the speaker and phonetic-discriminant tasks, which is not ideal since this ignores the fact that speaker traits and phonetic content are quite different and likely need several individual layers to extract their own features. Therefore, it is necessary to propose novel architectures to combine the phonetic information with the speaker embedding.

3 The x-vector baseline

The baseline speaker embedding used in this paper is x-vector [3]. X-vector is popular in the speaker verification community and has been provided as the official system on recent NIST SREs. The architecture is illustrated in Fig. 1.

Fig. 1 The x-vector architecture. This architecture can be partitioned into frame- and segment-level sub-networks

The x-vector network consists of frame-level and segment-level sub-networks, connected by a statistics pooling layer. The frame-level network can be seen as a speaker feature extractor. Given the input sequence \(\vec{X}^{(k)}=[\vec{x}_{1}^{(k)}, \ldots, \vec{x}_{T_{k}}^{(k)}]\) from utterance k with T_k frames, the frame-level network transforms the acoustic features \(\vec{x}_{t}^{(k)}\) into speaker-discriminant features \(\vec{f}_{t}^{(k)}\). A CNN variant, the time-delay neural network (TDNN), in which the input to each layer is the spliced output of the previous layer, is used. We omit the index k for brevity, denoting the frame-level network as

$$ \vec{f}_{t} = \mathcal{F} \left(\vec{x}_{t} | \Theta_{f}\right) $$
(1)

where \(\mathcal{F}(\cdot)\) denotes the feed-forward function and Θf denotes the parameters of the frame-level network.

Next, a statistics pooling layer aggregates the speaker features \(\vec {f}_{t}\) and concatenates the mean and standard deviation as a segment-level representation \(\vec {l}\).

$$ \vec{l} = \left[ {\vec{m}}^{T}, {\vec{\sigma}}^{T} \right]^{T} $$
(2)
$$ \vec{m} = \frac{1}{T} \sum_{t=1}^{T} \vec{f}_{t} $$
(3)
$$ \vec{\sigma} = \left(\frac{1}{T} \sum_{t=1}^{T} (\vec{f}_{t} - \vec{m})^{2} \right)^{1/2} $$
(4)

Fully-connected layers with parameters Θl are then implemented as the segment-level network. The output of the segment-level network is fed into a softmax layer and the posterior P(i|k) of speaker i is calculated as

$$ P(i|k) = \text{softmax}\left(\mathcal{F} \left(\vec{l} | \Theta_{l} \right)\right) $$
(5)

The x-vector network parameters Θ={Θf,Θl} are trained by minimizing the cross-entropy loss. After training, the pre-activation of a hidden layer in the segment-level network is extracted as the speaker embedding. The x-vector backend processing is similar to that of the i-vector. Mean normalization is applied first. Then, linear discriminant analysis (LDA) can be used to reduce the dimension of the embedding, and length normalization is also performed [39]. Finally, probabilistic linear discriminant analysis (PLDA) scoring is used to generate the verification scores.
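To make the pooling step concrete, the following minimal numpy sketch (our own illustration, not the released recipe code) computes the segment-level representation of Eqs. (2)–(4) from a matrix of frame-level features; the dimensions are chosen to match the 1500-node layer described in Section 5.

```python
import numpy as np

def statistics_pooling(frame_features):
    """Aggregate frame-level features f_1..f_T into a segment-level vector.

    frame_features: array of shape (T, D).
    Returns the concatenation [mean; std] of shape (2 * D,), as in Eqs. (2)-(4).
    """
    mean = frame_features.mean(axis=0)                          # Eq. (3)
    std = np.sqrt(((frame_features - mean) ** 2).mean(axis=0))  # Eq. (4)
    return np.concatenate([mean, std])                          # Eq. (2)

# A 300-frame utterance with 1500-dim frame-level outputs yields a
# 3000-dim segment-level representation fed to the segment-level network.
segment_vector = statistics_pooling(np.random.randn(300, 1500))
print(segment_vector.shape)  # (3000,)
```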

4 Proposed methods

In this section, we will describe our methods to tackle the level mismatch problem and introduce frame-level phonetic information into the training and extraction of the segment-level speaker embedding. Both phonetic adaptation and hybrid multi-task learning are proposed, which are then further combined into an integrated c-vector network.

4.1 Phonetic adaptation

A pooling strategy is used in the speaker embedding neural networks to aggregate frame-level speaker features into segment-level representations. Statistics pooling, which concatenates the first- and second-order statistics (i.e., the mean and standard deviation) as outputs, is applied in the x-vector architecture. In speech signals, speaker features are influenced by phonetic contents. To make the pooling more effective, phonetic information should be considered in the frame-level network.

Motivated by speaker adaptation as used in speech recognition, we propose a phonetic adaptation method. In speech recognition, a speaker code (e.g., an i-vector) is used as an auxiliary input to help the network reduce the impact of speaker changes [26]. Similarly, phonetic adaptation can be performed by feeding phonetically rich vectors into the x-vector network. With the phonetic vectors, the frame-level network is able to learn phonetic-dependent transforms, which are useful for the pooling layer.

In this paper, BN features extracted from an ASR acoustic model are selected as the phonetic vectors. As shown in Fig. 2, a phonetic-discriminant ASR acoustic model with parameters Θa is appended to the original network. The phonetic vector \(\vec{p}_{t}\) is the activation extracted from a hidden layer of the appended model.

$$ \vec{p}_{t} = \mathcal{F} \left(\vec{x}_{t} | \Theta^{\prime}_{a} \right) $$
(6)
where \(\Theta^{\prime}_{a}\) denotes the parameters of the sub-network used to extract the phonetic vector. Since this sub-network is a part of the ASR acoustic model, \(\Theta^{\prime}_{a}\) can be derived from Θa.

Fig. 2 Phonetic adaptation using phonetic vectors. An additional frame-level ASR acoustic model with a BN layer is appended to the conventional x-vector architecture. The phonetic vectors are extracted from the BN layer

The frame-level network then becomes

$$ \vec{f}_{t} = \mathcal{F} \left(\vec{x}_{t}, \vec{p}_{t} | \Theta_{f} \right) $$
(7)

For initialization, an ASR acoustic model with a BN layer is first pre-trained. The activations of the BN layer are connected to the x-vector frame-level network as phonetic vectors. The phonetic vectors can be extracted before the training of the x-vector network. In this case, the additional acoustic model is only used as a phonetic feature extractor. In our experiments, however, we find that fine-tuning the acoustic model with a small learning rate during the x-vector training improves the performance. The unused layers in the acoustic model are removed and the remaining part is updated with the x-vector network. The fine-tuning makes the phonetic vectors more adapted to the speaker verification task. The procedure is described in Algorithm 1.
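The following PyTorch sketch illustrates the two ingredients of this procedure under our own simplifying assumptions: the phonetic (BN) vector is concatenated with the frame-level features of the speaker network (Eq. (7)), and the appended ASR sub-network is placed in a separate optimizer group whose learning rate is scaled by a small factor c. Layer sizes and module names are hypothetical and not taken from the released code.

```python
import torch
import torch.nn as nn

class PhoneticAdaptedFrameNet(nn.Module):
    """Hypothetical frame-level speaker network whose upper layer receives the
    hidden activation concatenated with the phonetic vector p_t (Eq. (7))."""

    def __init__(self, feat_dim=23, bn_dim=128, hidden=512):
        super().__init__()
        self.lower = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.adapted = nn.Sequential(nn.Linear(hidden + bn_dim, hidden), nn.ReLU())

    def forward(self, x, p):
        h = self.lower(x)
        return self.adapted(torch.cat([h, p], dim=-1))

# Stand-in for the pre-trained ASR sub-network up to its 128-dim BN layer.
asr_bn_extractor = nn.Sequential(nn.Linear(23, 650), nn.ReLU(), nn.Linear(650, 128))
frame_net = PhoneticAdaptedFrameNet()

base_lr, c = 1e-3, 0.1  # c is the fine-tuning learning rate scaling factor
optimizer = torch.optim.SGD([
    {"params": frame_net.parameters(), "lr": base_lr},
    {"params": asr_bn_extractor.parameters(), "lr": c * base_lr},  # slow fine-tuning
], lr=base_lr)

x = torch.randn(64, 23)   # a mini-batch of acoustic frames
p = asr_bn_extractor(x)   # phonetic vectors p_t
f = frame_net(x, p)       # speaker-discriminant frame features f_t
```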

4.2 Hybrid multi-task learning

Phonetic adaptation filters out the phonetic variability by introducing auxiliary vectors. However, although the speaker and phonetic components are different, they still share some common information. Some factors, such as formants, pitch trajectory, and spectral energy distribution, are essential for both speaker traits and phonetic content. By using phonetic unit classification as a parallel task, multi-task learning can discover more informative features which are less sensitive to nuisance factors.

The conventional multi-task learning approach in speaker verification suffers from a level mismatch problem, because it requires all the tasks to operate at the same level. This only works for the frame-wise d-vector and is not suitable for the x-vector training. To address this, a hybrid multi-task learning framework is proposed in this section. In this hybrid framework, only the frame-level hidden layers in the x-vector network are shared with the phonetic-discriminant task. This architecture is able to process the frame-level phonetic information and the segment-level speaker embedding at the same time.

Multi-task networks also often share all the layers between different tasks. In our framework, the number of shared layers is set to be a hyper-parameter. This hyper-parameter controls the trade-off between the common and individual information in the speaker and phonetic tasks. The hybrid multi-task learning network is shown in Fig. 3.

Fig. 3 Hybrid multi-task learning. The phonetic-discriminant network shares layers with the frame-level part of the x-vector architecture

In Fig. 3, the frame-level network of the x-vector architecture is partitioned into a shared part and a non-shared part, whose parameters are Θs and Θns, respectively. The parameters of the remaining layers in the phonetic-discriminant network are denoted as Θp. The speaker feature of the frame-level network is now

$$ \vec{f}_{t} = \mathcal{F} \left(\vec{x}_{t} | \Theta_{s}, \Theta_{ns}\right) $$
(8)

The training strategy of our multi-task framework is similar to multi-lingual acoustic model training [40]. The training data consists of speaker and phonetic examples. The speaker examples contain the reference speaker labels for each utterance while the phonetic examples contain the corresponding phonetic units for frames. The transcriptions of the phonetic units are obtained by forced alignment using a hidden Markov model (HMM). The two tasks are trained alternately. At each step, we randomly choose a mini-batch composed of speaker examples from the pooled training data with probability ps=Ns/(Ns+Np) and select a phonetic mini-batch otherwise. Here, Ns and Np are the numbers of remaining speaker and phonetic examples. When a speaker mini-batch is selected, the parameters {Θs,Θns,Θl} are updated; when a phonetic mini-batch is used, {Θs,Θp} are trained. It is possible to set different learning rates to balance the importance of these tasks. The complete training procedure is described in Algorithm 2.
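A minimal sketch of this alternating schedule is given below (our illustration of the sampling rule, with the actual parameter updates abstracted into two callbacks; Algorithm 2 in the paper remains the authoritative description).

```python
import random

def hybrid_multitask_schedule(speaker_batches, phonetic_batches,
                              speaker_step, phonetic_step):
    """Alternate between the two tasks: draw a speaker mini-batch with
    probability p_s = N_s / (N_s + N_p), otherwise draw a phonetic mini-batch.
    speaker_step / phonetic_step stand in for updates of {Theta_s, Theta_ns,
    Theta_l} and {Theta_s, Theta_p}, respectively."""
    n_s, n_p = len(speaker_batches), len(phonetic_batches)
    while n_s + n_p > 0:
        p_s = n_s / (n_s + n_p)
        if random.random() < p_s:      # speaker mini-batch selected
            n_s -= 1
            speaker_step(speaker_batches[n_s])
        else:                          # phonetic mini-batch selected
            n_p -= 1
            phonetic_step(phonetic_batches[n_p])

# Toy usage: batch selection roughly follows the ratio of remaining examples.
hybrid_multitask_schedule(list(range(200)), list(range(100)),
                          lambda b: None, lambda b: None)
```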

4.3 The c-vector

In our phonetic adaptation approach, the phonetic content of an utterance is considered to have a negative impact on the speaker verification task. In contrast, hybrid multi-task learning exploits the useful phonetic information to improve the model generalization. The different perspectives of these methods create an opportunity to further combine them into a unified architecture. In this section, we propose a c-vector using both techniques to accomplish this goal.

Figure 4 shows the c-vector architecture, which is a straightforward combination of Figs. 2 and 3. The phonetic vector extracted from a pre-trained acoustic model is introduced to the multi-task network. The phonetic vector is only used in the speaker task while the phonetic-discriminant network is kept unchanged. The integrated network can be jointly optimized following a similar strategy to that of Algorithm 2, except that we need to fine-tune the acoustic model as in Algorithm 1. Since the phonetic vector is only appended to the speaker task, the parameters of the acoustic model providing phonetic vectors are not updated when the phonetic examples are selected. This makes the ASR acoustic model focus only on the speaker task. The training procedure of the c-vector is summarized in Algorithm 3.
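The gradient routing described above can be summarized by which sub-networks receive updates for each type of mini-batch. The sketch below expresses this rule with PyTorch's requires_grad flags; the sub-network names are our own shorthand, not identifiers from the released implementation.

```python
import torch.nn as nn

def set_trainable(modules, flag):
    """Enable or disable gradient updates for a list of sub-networks."""
    for m in modules:
        for p in m.parameters():
            p.requires_grad = flag

def configure_c_vector_step(batch_type, shared, nonshared, segment,
                            phonetic_head, asr_model):
    """Choose the trainable sub-networks for the current mini-batch."""
    set_trainable([shared, nonshared, segment, phonetic_head, asr_model], False)
    if batch_type == "speaker":
        # Speaker batch: shared layers, the speaker branch, and the appended
        # acoustic model (fine-tuned with a scaled learning rate).
        set_trainable([shared, nonshared, segment, asr_model], True)
    else:
        # Phonetic batch: shared layers and the phonetic classifier only;
        # the appended acoustic model is left untouched.
        set_trainable([shared, phonetic_head], True)

# Toy modules standing in for the parameter groups of Fig. 4.
mods = [nn.Linear(4, 4) for _ in range(5)]
configure_c_vector_step("phonetic", *mods)
```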

Fig. 4 The c-vector architecture. The x-vector networks using phonetic adaptation and multi-task learning are combined into a unified architecture

In the c-vector architecture, two independent phonetic branches are used. This is necessary since these two sub-networks are optimized by different objective functions. However, there is also a need to limit the model size. We notice that, in the multi-task learning, the phonetic-discriminant network also provides frame-wise phonetic information. Based on the c-vector architecture, a simplified model is proposed in Fig. 5. In the new model, the pre-trained acoustic model is first removed. A BN layer is then incorporated in the phonetic-discriminant network and the phonetic vectors are extracted from this layer.

Fig. 5 The simplified c-vector architecture. The additional acoustic model is removed and the phonetic vectors come from the BN layer of the phonetic-discriminant network. The gradient-based training is stopped at the interconnected link between the two sub-networks

Although the speaker-discriminant network in the simplified c-vector uses the activations of the BN layer in the phonetic-discriminant network in the feed-forward step, the gradient from this sub-network should not be back-propagated through the phonetic-discriminant network. The reason is that the phonetic-discriminant network is optimized for phonetic unit classification in the multi-task learning framework. The speaker information introduced into the phonetic-discriminant network may affect the training procedure and weaken the effectiveness of the multi-task learning. To prevent this impact, when optimizing the speaker-discriminant network, the gradient-based training is stopped at the connection introducing phonetic vectors by setting the gradient to zero. It should be pointed out that, in the simplified version of the c-vector, the phonetic-discriminant network cannot be adapted freely; thus, the phonetic vectors are not optimized for the speaker verification task.
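In frameworks with automatic differentiation, this stop-gradient connection amounts to detaching the BN activations before they enter the speaker branch. A minimal PyTorch sketch, with hypothetical layer sizes, is shown below.

```python
import torch
import torch.nn as nn

# Hypothetical sub-networks: the phonetic branch ending in a 128-dim BN layer
# and the speaker-network layer that receives the concatenated input.
phonetic_branch = nn.Sequential(nn.Linear(23, 512), nn.ReLU(), nn.Linear(512, 128))
speaker_layer = nn.Linear(512 + 128, 512)

x = torch.randn(64, 23)    # frame-level acoustic features
h = torch.randn(64, 512)   # outputs of the preceding speaker frame-level layers

# detach() stops the gradient flowing back through this link, so the phonetic
# branch is trained only by its own phonetic-classification loss.
p = phonetic_branch(x).detach()
f = speaker_layer(torch.cat([h, p], dim=-1))
```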

In our proposed methods, the ASR acoustic model and the speaker and phonetic-discriminant networks are trained alternately. This procedure does not require the training data to be both speaker and phonetically transcribed, which differs from many conventional multi-task networks. This flexibility is quite desirable in practice since we may collect the speaker data from one source and the phonetic data from other sources. Even though many speaker verification datasets do not have phoneme or text transcriptions, we can use other ASR datasets to introduce phonetic information to our speaker embeddings.

5 Experiments

5.1 Datasets

The performance of the proposed approaches is presented on NIST SREs and VoxCeleb datasets.

Experiments are first carried out on the NIST SRE 2010 core-extended and 10 s–10 s condition 5 [41]. Both conditions involve English conversational telephone speech. The core-extended condition consists of 2-min enrollment and test utterances, while the duration of the utterances in the 10 s–10 s condition ranges from 8 to 12 s. To validate our proposed methods when utterances in different languages are present, the NIST SRE 2016 [42] and 2018 [43] datasets are then used. The NIST SRE 2016 evaluation set contains trials spoken in Tagalog and Cantonese. Although two sources, namely call my net 2 (CMN2) and video annotation for speech technology (VAST), are included in NIST SRE 2018, only the CMN2 subset is used in this paper. The CMN2 subset is composed of speech spoken in Tunisian Arabic. The performance on the VAST subset is not reported since it exhibits quite different attributes from telephone speech. Different training data and adaptation technologies need to be investigated to achieve good results on the VAST subset [7].

For NIST SREs, the training data consists of 5 Switchboard datasets (Switchboard-2 Phase 1/2/3, Switchboard Cellular Part 1/2) and NIST SRE 2004–2008 telephone excerpts. Unlike [25], Mixer 6 is excluded in our experiments since it was used in the NIST SRE 2010 test sets. This comprises 64,742 utterances from 6394 speakers, resulting in 5524 h in total. The NIST SRE 2016 and 2018 unlabeled data, representing 52 and 72 h of data, is used for domain adaptation [25]. Two English corpora, the 318-h Switchboard-1 and 1904-h Fisher English, are used to extract phonetic information. Since data augmentation is shown to improve the performance of a speaker verification system [44], it has become a standard pre-processing step. Two noise and reverberation datasets, MUSAN [45] and RIRs [46], are introduced to augment the training data. The augmentation follows the same recipe described in [25].

We also examine our approaches on an independent dataset outside the NIST SREs. The VoxCeleb dataset is extracted from YouTube videos [19, 28]. In this experiment, the results are evaluated on the VoxCeleb1 test set. The training set includes the dev portion of VoxCeleb1 and the entire VoxCeleb2, comprising 2780 h of data and 7323 speakers. Although the VoxCeleb1 dataset only contains English data, the VoxCeleb2 dataset consists of speech from speakers of different nationalities, accents, and languages, making the training set multi-lingual. Different from the NIST SREs, the VoxCeleb dataset is sampled at 16 kHz. The 960-h Librispeech corpus [47] is used to introduce phonetic information. This corpus only contains read English speech. Data augmentation is also applied to the training data.

The statistics of all the test sets are shown in Table 1, and Table 2 summarizes the training data usage in the experiments.

Table 1 The number of speakers, target trials, and impostor trials for the test sets
Table 2 Datasets used in the experiments. The adaptation set is only used for NIST SRE 2016 and 2018

5.2 Experimental setup

5.2.1 Baseline x-vector system

The x-vector system used in our experiments follows the standard setup of the Kaldi SRE16 V2 recipe [48]. The input features are 23-dim and 30-dim MFCCs for 8 kHz and 16 kHz audio, respectively. The frame-level network is a 5-layer TDNN with the slicing parameters {−2,−1,0,1,2}, {−2,0,2}, {−3,0,3}, {0}, {0}, which means the input of each layer is the contextually spliced output of the previous layer. For instance, at time t, MFCCs from times (t−2), (t−1), (t), (t+1), and (t+2) are concatenated as the input of the first hidden layer. A statistics pooling layer is then applied, followed by 2 fully connected layers. Each hidden layer consists of a linear transform, followed by a rectified linear unit (ReLU) activation and batch-normalization. All the hidden layers have 512 nodes except for the one before the statistics pooling, which has 1500 nodes instead. The output is predicted by a softmax layer and the size is equal to the number of training speakers.
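One common way to read these slicing parameters is as dilated 1-D convolutions over the frame axis. The sketch below is our own PyTorch rendering of the frame-level stack (the released system is implemented in Kaldi, so this is only an illustration of the context pattern).

```python
import torch
import torch.nn as nn

# {-2,-1,0,1,2} -> kernel 5, dilation 1; {-2,0,2} -> kernel 3, dilation 2;
# {-3,0,3} -> kernel 3, dilation 3; {0} -> kernel 1 (no added context).
frame_level = nn.Sequential(
    nn.Conv1d(23, 512, kernel_size=5, dilation=1), nn.ReLU(), nn.BatchNorm1d(512),
    nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(), nn.BatchNorm1d(512),
    nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(), nn.BatchNorm1d(512),
    nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(), nn.BatchNorm1d(512),
    nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(), nn.BatchNorm1d(1500),
)

mfcc = torch.randn(8, 23, 300)   # (batch, feature dim, frames), 8-kHz setup
out = frame_level(mfcc)
print(out.shape)                 # torch.Size([8, 1500, 286]); the total context
                                 # of +/-7 frames consumes 14 frames
```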

Natural gradient for stochastic gradient descent (NG-SGD) [49] is used to train the network. The batch size is 64 and the number of training epochs is 3. The learning rate starts from 0.001 and linearly decreases to 0.0001 at the end of the training. No dropout is applied. This setup follows the same recipe described in [25]. Unless otherwise specified, all the neural networks in this paper are optimized using this setup.

After training, the pre-activation of the first hidden layer at the segment-level network is extracted as the x-vector. Mean normalization is first performed and the dimension of the x-vector is then reduced through LDA. After LDA, the dimension of the x-vector is 150 for NIST SREs and 200 for VoxCeleb. The embedding is unit-length normalized and PLDA scoring is finally applied. For NIST SREs, the SRE 2004–2008 corpora with augmented excerpts are used to train the LDA/PLDA, while for VoxCeleb, the entire training set is used.

To deal with the domain mismatch in NIST SRE 2016 and 2018, an unsupervised PLDA adaptation proposed in Kaldi is applied. Due to domain mismatch, the total covariance estimated in the new domain is different from the covariance indicated in the out-of-domain PLDA. In the experiments, 75% of the excess in-domain covariance is attributed to the within-class covariance of the PLDA model while the remaining 25% is attributed to the between-class covariance. The PLDA parameters are then re-estimated based on the new within- and between-class covariances. The adaptation is performed on the unlabeled data of NIST SRE 2016 and 2018. Refer to the Kaldi source code (Footnote 1) for more details.
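As a rough numpy illustration of this covariance split (Kaldi's plda.cc is the reference implementation; it works in a simultaneously diagonalized basis and only adds the positive part of the excess, details omitted here), the adaptation can be sketched as follows.

```python
import numpy as np

def adapt_plda_covariances(within, between, in_domain_vectors,
                           within_scale=0.75, between_scale=0.25):
    """Split the excess in-domain total covariance between the PLDA
    within-class (75%) and between-class (25%) covariances.

    within / between: out-of-domain PLDA covariance matrices (D x D).
    in_domain_vectors: unlabeled embeddings from the new domain (N x D).
    """
    total_out = within + between
    total_in = np.cov(in_domain_vectors, rowvar=False)
    excess = total_in - total_out      # covariance unexplained by the old model
    adapted_within = within + within_scale * excess
    adapted_between = between + between_scale * excess
    return adapted_within, adapted_between
```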

5.2.2 Speaker embedding with phonetic information

In our proposed methods, the x-vector architecture is treated as a speaker-discriminant network and keeps the same settings as the baseline. Phonetic information is explicitly introduced by phonetic-discriminant networks. Senone transcriptions, force-aligned by a GMM-HMM, are used to represent the phonetic content of Fisher and Switchboard. The number of senones is 3800.

For phonetic adaptation, a 5-layer TDNN network is trained as the ASR acoustic model. The slicing parameters are {−2,−1,0,1,2}, {−1,0,1}, {−1,0,1}, {−3,0,3}, {−6,−3,0}. The 5-th layer is the BN layer containing 128 nodes and the other layers have 650 nodes. The activations of the BN layer in the acoustic model are connected to the 5-th layer of the x-vector network. The effect of the fine-tuning learning rate scaling factor c is discussed in the experiments below. When optimizing the phonetic-discriminant networks, the batch size is set to 256, which is a value often used in ASR acoustic model training.

In hybrid multi-task learning, the phonetic-discriminant network uses the same architecture as the x-vector network except for two differences: the statistics pooling layer is excluded from the phonetic-discriminant network, and the number of nodes in the 5-th layer is reduced to 512. Although the learning rates α1 and α2 for the two tasks can be different, keeping them equal performs well in our experiments. Results for sharing different numbers of layers are investigated below.

The c-vector combines the phonetic adaptation and multi-task learning networks and uses the same parameters. In the simplified c-vector architecture, the multi-task learning network is used, with some modifications. The number of nodes in the last hidden layer of the phonetic-discriminant network is reduced to 128 and the activations of this layer are fed into the 5-th layer of the x-vector network.

We use x-vector-pa, x-vector-mt, c-vector, and sc-vector to denote the proposed systems, respectively.

The conventional i-vector and DNN-based i-vector are also used in some evaluations for complete comparison. All the models trained in this paper are gender-independent. We analyze the influence of the parameters in our models on the male part of the NIST SRE 2010 core-extended condition and VoxCeleb. Results on SRE 2016 and 2018 are also reported to validate our approaches.

EER, the minimum detection cost function of NIST SRE 2008 (minDCF08) and SRE 2010 (minDCF10) [41] are used as the main performance metrics. The primary cost measures for SRE 2016 and 2018, denoted as minDCF16 and minDCF18, are reported for these two evaluations respectively [42, 43].

The Kaldi toolkit [48] is used to build all the systems in this paper. The code has been released (Footnote 2).

5.3 Results and discussion

5.3.1 Phonetic adaptation

This section presents the results of phonetic adaptation in speaker embedding training. In order to show the effectiveness of the acoustic model fine-tuning, several systems are trained with different learning rate scaling factors. Table 3 shows that all the systems using phonetic adaptation outperform the x-vector baseline on the male part of the NIST SRE 2010 core-extended condition. Simply introducing the phonetic information, i.e., phonetic adaptation without fine-tuning (c=0), already reduces the EER by 14%. The performance can be further improved if we update the appended network during the x-vector training. The best result is obtained when the learning rate scaling factor is 0.1. The small learning rate means that the acoustic model only needs a slight adjustment to achieve good performance. Compared to the baseline, the x-vector using fine-tuned phonetic vectors improves the performance by 26%, 22%, and 23% in EER, minDCF08, and minDCF10, respectively.

Table 3 Phonetic adaptation results on the male part of the NIST SRE 2010 core-extended condition

Due to the introduction of an additional network, the number of model parameters increases. It is not initially clear whether the improvement comes from the phonetic information or the bigger model. Hence, we train a network combining the x-vector architecture and the acoustic model from scratch, with no phonetic information considered. This model has the same topology as x-vector-pa. As shown in the last row of Table 3, the larger network does improve the performance, but the gain is smaller and it still performs worse than our proposed systems. This result validates the importance of the phonetic information extracted from the pre-trained acoustic model.

Table 4 presents the results on VoxCeleb. Phonetic adaptation is also effective on this dataset. Since the acoustic model is pre-trained only on English speech, the extracted phonetic vectors are not a good match for the multi-lingual VoxCeleb training set. Fine-tuning alleviates this mismatch to some extent. Compared with NIST SRE 2010, a higher learning rate scaling factor needs to be used; in this case, 0.2 or 0.3 seems to be a good option. The speaker embedding using phonetic adaptation improves the EER of the baseline by 6% and 16% relative, without and with fine-tuning (c=0.2), respectively.

Table 4 Phonetic adaptation results on VoxCeleb

5.3.2 Multi-task learning

Table 5 gives the performance of several systems sharing different numbers of frame-level layers between the speaker and phonetic-discriminant networks on NIST SRE 2010. From Table 5, we find that although the multi-task learning adds benefit in this condition, sharing more layers does not always improve the performance. The best overall result on our development set is obtained when 3 layers are shared. The EER decreases from 1.96%, when no multi-task learning is used, to 1.52%, when 3 layers are shared, resulting in a 22% relative reduction. The minDCF08 and minDCF10 in this configuration also improve by 33% and 29% compared to the baseline.

Table 5 Results obtained by sharing different numbers of layers in hybrid multi-task learning

In contrast to the previous results, the VoxCeleb experiments reported in Table 6 show that the multi-task learning only improves the results slightly on this dataset. By sharing 1 layer, the system outperforms the baseline by 4%, 8%, and 13% in EER, minDCF08, and minDCF10, respectively. However, the performance degrades rapidly when more layers are shared. We hypothesize that this is due to the language mismatch between Librispeech and the VoxCeleb training set. In this case, sharing more layers cannot provide more useful information for the speaker-discriminant network. On the contrary, with more layers shared, the number of individual layers in the speaker-discriminant network decreases, making the extraction of speaker characteristics more difficult.

Table 6 Results of systems using hybrid multi-task learning with different configurations on VoxCeleb

5.3.3 The c-vector

The advantages of phonetic adaptation and multi-task learning are combined in the proposed c-vector. The tunable hyper-parameters now include both the learning rate scaling factor and the number of shared layers.

We start our analysis on NIST SRE 2010. Based on the above experiments, the learning rate scaling factor ranges from 0.1 to 0.3, and 2 or 3 layers are shared between the speaker and phonetic-discriminant networks. Table 7 shows that the EER and minDCFs on the male part of the core-extended condition can be greatly reduced if proper parameters are selected. For each configuration, the c-vector performs better than the systems using only phonetic adaptation or multi-task learning. Consistent with the former experiments, better performance is obtained using a smaller learning rate scaling factor. Sharing 2 layers results in a better EER, while 3-layer sharing is better for minDCF10. The minimum EER (1.12%) is achieved in the second row while the best minDCF08 (0.0065) and minDCF10 (0.2449) are obtained in the third row. To make a trade-off between these operating points, we set c=0.1 and 3-layer sharing in our c-vector.

Table 7 Results obtained from different c-vector configurations on the male part of the NIST SRE 2010 core-extended condition

Table 8 presents the performance of the c-vector approach on VoxCeleb. The best performance is obtained when the learning rate scaling factor is 0.2 and 1 layer is shared, resulting in 19%, 10%, and 36% relative reductions in EER, minDCF08, and minDCF10, respectively. As explained previously, compared with the c-vector used in NIST SRE 2010, the learning rate scaling factor is increased while the number of shared layers is decreased.

Table 8 Results obtained from different c-vector configurations on VoxCeleb

5.3.4 The simplified c-vector

We further evaluate the simplified c-vector (sc-vector) on these two datasets. As shown in Table 9, the sc-vector approach sharing 3 layers achieves the best overall performance on the male part of the NIST SRE 2010 core-extended condition. This is consistent with the results observed for the c-vector.

Table 9 Comparison of different simplified c-vector systems on the male part of the NIST SRE 2010 core-extended condition

Table 10 shows the results obtained from different sc-vector configurations on VoxCeleb. Unlike the results in Table 9, the sc-vector does not significantly improve the performance on VoxCeleb. As shown in the above experiments, phonetic adaptation with fine-tuning is more helpful than the multi-task learning on this dataset, which means the fine-tuning is vital in the language mismatch condition. However, for the sc-vector, the phonetic-discriminant sub-network is only optimized by the out-of-domain phonetic unit classification, which limits the power of the phonetic vectors.

Table 10 Comparison of different simplified c-vector systems on VoxCeleb

5.3.5 Comparison of systems on NIST SRE 2010 and VoxCeleb

Tables 11 and 12 summarize the results of the x-vector baseline and all our proposed methods with the best system configurations on NIST SRE 2010. Two i-vector systems are also included. The setups of the i-vector systems are the same as those of the Kaldi SRE10 recipe (Footnote 3).

Table 11 Summary of results obtained with different systems on the male part of the NIST SRE 2010 core-extended and 10s-10s conditions
Table 12 Summary of results obtained with different systems on the female part of the NIST SRE 2010 core-extended and 10s-10s conditions

From the results, we find that, when fine-tuning is used, the speaker embedding with phonetic adaptation achieves better results than the baseline x-vector in almost all conditions. The only exception is the minDCF08 in the female 10 s–10 s condition, which is nevertheless very close to the baseline. With multi-task learning, the proposed speaker embedding generally improves the performance, except for the minDCF10 in the male 10 s–10 s condition. The relative improvements are about 20% in the core-extended condition and about 10% in the 10 s–10 s condition. From Table 11 and Table 12, it is difficult to conclude whether phonetic adaptation or multi-task learning is better, because each has advantages in different conditions. Overall, the sc-vector delivers better performance than these two systems, and the c-vector generally performs the best on NIST SRE 2010. Even though the x-vectors with phonetic adaptation and hybrid multi-task learning reduce the EER and minDCFs compared to the baseline for both genders, the c-vector approach further improves the performance. The only case where the c-vector approach performs worse than the multi-task learning system is the EER on the male part of the 10 s–10 s condition; in contrast, its minDCF10 is significantly better, leading to a 33% relative reduction. On the male part of the core-extended condition, the c-vector significantly outperforms the original x-vector by 38%, 40%, and 37% in EER, minDCF08, and minDCF10, respectively. In the 10 s–10 s condition, the improvement is 16% in EER and over 20% in the minDCFs. The performance is similar on the female part. For the sc-vector, the model size is reduced by removing the appended acoustic model. The cost of this is that the phonetic vector fine-tuning is unavailable, and as a result the sc-vector performs worse than the c-vector.

As shown in Tables 11 and 12, the i-vector framework can also benefit from the use of phonetic information on NIST SRE 2010. The i-vector system using DNN-based alignments (DNN/i-vector) outperforms the vanilla i-vector, especially when the utterance duration is long. The improvement from incorporating phonetic information in the i-vector system is almost 50% in some conditions. Even so, compared with our proposed c-vector, the DNN-based i-vector only achieves better results on the male part of the core-extended condition and performs much worse in the 10 s–10 s conditions.

Next, we investigate the performance of the different systems on VoxCeleb. The results are shown in Table 13. The i-vector systems are not included due to their inferior results. Compared with NIST SRE 2010, language mismatch exists between the speaker and phonetic training sets on VoxCeleb. From Table 13, it is clear that the phonetic adaptation performs better than the multi-task learning and that a more aggressive learning rate scaling factor should be applied. We see that the sc-vector fails to outperform the x-vectors with phonetic adaptation and hybrid multi-task learning on this dataset. The likely reason is that the sc-vector cannot use fine-tuning to adapt the acoustic model, so the extracted phonetic vectors are not suitable for the new domain. In fact, the performance of the sc-vector is similar to that of the second row in Table 13. It seems that the introduction of the BN layer in the phonetic-discriminant network training has a negative impact, so that the multi-task learning in the sc-vector does not further improve the results of the speaker embedding using phonetic adaptation without any fine-tuning. Again, the c-vector approach performs the best on VoxCeleb.

Table 13 Summary of results obtained with different systems on VoxCeleb

5.3.6 Results on NIST SRE 2016 and 2018

Only English speech is used in NIST SRE 2010. In this case, the i-vector and speaker embedding systems both achieve better results when including phonetic information. In recent NIST SREs, language and channel mismatch was introduced between the training and test data [42, 43]. Although we have examined the new speaker embeddings on the multi-lingual VoxCeleb dataset, it is still interesting to investigate the performance of our proposed methods on the more challenging NIST SRE 2016 and 2018. According to the experimental results on NIST SRE 2010 and VoxCeleb, in the language mismatched condition, a larger learning rate scaling factor and fewer shared layers should be used in our proposed approaches. All the results reported in this section are obtained by setting the learning rate scaling factor to 0.2 in the phonetic adaptation and sharing 1 layer in the multi-task learning.

The results on the two subsets of NIST SRE 2016, Tagalog and Cantonese, are reported in Table 14. From Table 14, we find that due to the severe language mismatch, the DNN-based i-vector system performs worse than the conventional i-vector. The reason for this is that the DNN trained on the English corpus cannot accurately compute the senone posteriors in Tagalog and Cantonese. Table 14 also demonstrates that the speaker embedding systems outperform both i-vector systems on NIST SRE 2016. The x-vector baseline reduces the EER from 15.73% (i-vector) to 9.46%.

Table 14 Summary of results obtained with i-vector systems and different speaker embeddings in the Tagalog and Cantonese subsets of NIST SRE 2016

Unlike the i-vector systems, our proposed methods still benefit from the added phonetic information even in this language mismatched condition. The first observation is that the x-vector using phonetic adaptation outperforms the baseline and that fine-tuning further improves the performance. The hybrid multi-task learning is also beneficial for the speaker embedding. Compared with the phonetic adaptation, hybrid multi-task learning performs worse on the Tagalog subset, while the results are better on the Cantonese subset. The sc-vector achieves similar results to those of the multi-task learning on the Tagalog subset and a better EER on Cantonese. The last row in Table 14 shows that the c-vector performs the best. Compared to the conventional x-vector, the c-vector improves the EER and minDCF16 on the pooled set by 15% and 10% relative, respectively.

Table 15 summarizes the results of different systems on the NIST SRE 2018 CMN2 development and evaluation sets. The x-vector using fine-tuned phonetic vectors performs much better than the multi-task learning. Without fine-tuning, the sc-vector does not significantly improve the performance. The c-vector approach performs similarly to the x-vector with fine-tuned phonetic adaptation (the 3rd row in Table 15) on this dataset. This confirms the important role of the fine-tuning in the language mismatched condition. When the c-vector approach is used, the EER and minDCF18 of the baseline are reduced by 13% and 8% on the evaluation set.

Table 15 Summary of results on the CMN2 subsets of NIST SRE 2018. The actDCF18 is also reported on the evaluation set

The actDCF18 is also reported on the evaluation set of the NIST SRE 2018 CMN2 subset. Logistic regression-based score calibration is used. The calibration parameters are first trained on the development set using the Bosaris toolkit [50] and then applied to the evaluation set. In Table 15, the actDCFs show a similar trend to the minDCFs, and the proposed speaker embeddings still perform better than the baseline in actDCF18.

Although the improvements due to applying the phonetic information in these conditions are smaller than those of NIST SRE 2010, these results show the effectiveness and robustness of our proposed approaches when a language mismatch exists.

6 Conclusions

Although phonetic information has been reported to be effective in both the i-vector and frame-level d-vector frameworks, it is rarely used in state-of-the-art speaker embeddings. In this paper, we propose several approaches to overcome the level mismatch problem and introduce frame-level phonetic information into segment-level speaker embedding. The first approach applies phonetic adaptation using phonetic vectors. The phonetic vectors, which are extracted from a fine-tuned ASR acoustic model, are used as auxiliary inputs to the x-vector network. The second approach uses hybrid multi-task learning to exploit the shared information between speaker traits and phonetic content, which improves model generalization. We finally propose a c-vector architecture combining these two approaches, as well as a simplified c-vector which uses phonetic vectors extracted from the phonetic-discriminant network in the multi-task learning approach. On the NIST SRE 2010 core-extended and 10 s–10 s condition 5, the proposed speaker embeddings using phonetic adaptation and hybrid multi-task learning significantly outperform the conventional x-vector, with the best performance achieved by our combined c-vector approach. Moreover, the results on the language mismatched NIST SRE 2016, 2018, and VoxCeleb datasets show that the proposed approaches perform well even when different languages are present. The relationship between the performance and different system configurations has been carefully analyzed across different conditions. These results provide strong support for the benefit of including phonetic information in speaker embedding-based speaker verification systems.

Availability of data and materials

The Switchboard, Fisher English and all the NIST SRE datasets are available from Linguistic Data Consortium (LDC, https://www.ldc.upenn.edu/).

The VoxCeleb 1&2 are available from http://www.robots.ox.ac.uk/~vgg/data/voxceleb/.

The Librispeech corpus and the noise and reverberation datasets, MUSAN and RIR, can be downloaded from http://www.openslr.org/resources.php.

The source code has been published in https://github.com/mycrazycracy/speaker-embedding-with-phonetic-information.

Notes

  1. https://github.com/kaldi-asr/kaldi/blob/master/src/ivector/plda.cc

  2. https://github.com/mycrazycracy/speaker-embedding-with-phonetic-information

  3. https://github.com/kaldi-asr/kaldi/tree/master/egs/sre10.

Abbreviations

ASR:

Automatic speech recognition

ASV:

Automatic speaker verification

BN:

Bottleneck

CMN2:

Call my net 2

CNN:

Convolutional neural network

DNN:

Deep neural network

EER:

Equal error rate

GMM:

Gaussian mixture model

HMM:

Hidden Markov model

LDA:

Linear discriminant analysis

LDE:

Learnable dictionary encoding

MFCC:

Mel-frequency cepstral coefficient

NG-SGD:

Natural gradient for stochastic gradient descent

NIST:

National Institute of Standards and Technology

PLDA:

Probabilistic linear discriminant analysis

ReLU:

Rectified linear unit

RNN:

Recurrent neural network

SRE:

Speaker recognition evaluation

TDNN:

Time-delay neural network

VAST:

Video annotation for speech technology

References

  1. D. A. Reynolds, T. F. Quatieri, R. B. Dunn, Speaker verification using adapted Gaussian mixture models. Digit. Sig. Process. 10(1-3), 19–41 (2000).


  2. N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011).


  3. D. Snyder, D. Garcia-Romero, D. Povey, S. Khudanpur, in Proc. INTERSPEECH. Deep neural network embeddings for text-independent speaker verification, (2017), pp. 999–1003. https://doi.org/10.21437/interspeech.2017-620.

  4. P. Kenny, Joint factor analysis of speaker and session variability: Theory and alogorithms. Technical Report, CRIM-06/08-13 (2008).

  5. A. K. Sarkar, D. Matrouf, P. M. Bousquet, J. -F. Bonastre, in Proc. INTERSPEECH. Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification (International Speech Communications Association, 2012), pp. 2662–2665. https://www.iscaspeech.org/archive/interspeech_2012/i12_2662.html.

  6. O. Plchot, P. Matejka, A. Silnova, et al., in Proc. INTERSPEECH. Analysis and description of ABC submission to NIST SRE 2016, (2017), pp. 1348–1352. https://doi.org/10.21437/interspeech.2017-1498.

  7. J. Villalba, N. Chen, D. Snyder, et al., in Proc. INTERSPEECH. State-of-the-art speaker recognition for telephone and video speech: The JHU-MIT submission for NIST SRE18, (2019), pp. 1488–1492. http://dx.doi.org/10.21437/Interspeech.2019-2713.

  8. G. Hinton, L. Deng, D. Yu, et al., Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Sig. Process. Mag. 29(6), 82–97 (2012).


  9. H. Ze, A. Senior, M. Schuster, in IEEE International Conference on Acoustics, Speech and Signal Processing. Statistical parametric speech synthesis using deep neural networks, (2013), pp. 7962–7966. https://doi.org/10.1109/icassp.2013.6639215.

  10. Y. Xu, J. Du, L. -R. Dai, C. -H. Lee, An experimental study on speech enhancement based on deep neural networks. IEEE Sig. Process. Lett. 21(1), 65–68 (2014).


  11. E. Variani, X. Lei, E. McDermott, I. L. Moreno, J. Gonzalez-Dominguez, in IEEE International Conference on Acoustics, Speech and Signal Processing. Deep neural networks for small footprint text-dependent speaker verification, (2014), pp. 4052–4056. https://doi.org/10.1109/icassp.2014.6854363.

  12. L. Li, Y. Chen, Y. Shi, Z. Tang, D. Wang, in Proc. INTERSPEECH. Deep speaker feature learning for text-independent speaker verification, (2017), pp. 1542–1546. https://doi.org/10.21437/interspeech.2017-452.

  13. S. -X. Zhang, Z. Chen, Y. Zhao, J. Li, Y. Gong, in Proc. IEEE Spoken Language Technology Workshop (SLT). End-to-end attention based text-dependent speaker verification, (2016), pp. 171–178. https://doi.org/10.1109/slt.2016.7846261.

  14. Y. Zhu, T. Ko, D. Snyder, B. Mak, D. Povey, in Proc. INTERSPEECH. Self-attentive speaker embeddings for text-independent speaker verification, (2018), pp. 3573–3577. https://doi.org/10.21437/interspeech.2018-1158.

  15. G. Heigold, I. Moreno, S. Bengio, N. Shazeer, in IEEE International Conference on Acoustics, Speech and Signal Processing. End-to-end text-dependent speaker verification, (2016), pp. 5115–5119. https://doi.org/10.1109/icassp.2016.7472652.

  16. C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, Z. Zhu, Deep speaker: an end-to-end neural speaker embedding system (2017). arXiv preprint arXiv:1705.02304.

  17. S. Novoselov, O. Kudashev, V. Shchemelinin, I. Kremnev, G. Lavrentyeva, in IEEE International Conference on Acoustics, Speech and Signal Processing. Deep CNN based feature extractor for text-prompted speaker recognition, (2018), pp. 5334–5338. https://doi.org/10.1109/icassp.2018.8462358.

  18. N. Li, D. Tuo, D. Su, Z. Li, D. Yu, in Proc. INTERSPEECH. Deep discriminative embeddings for duration robust speaker verification, (2018), pp. 2262–2266. https://doi.org/10.21437/interspeech.2018-1769.

  19. J. S. Chung, A. Nagrani, A. Zisserman, in Proc. INTERSPEECH. Voxceleb2: deep speaker recognition, (2018). https://doi.org/10.21437/interspeech.2018-1929.

  20. W. Cai, Z. Cai, X. Zhang, X. Wang, M. Li, in IEEE International Conference on Acoustics, Speech and Signal Processing. A novel learnable dictionary encoding layer for end-to-end language identification, (2018), pp. 5189–5193. https://doi.org/10.1109/icassp.2018.8462025.

  21. C. Zhang, K. Koishida, in Proc. INTERSPEECH. End-to-end text-independent speaker verification with triplet loss on short utterances, (2017), pp. 1487–1491. https://doi.org/10.21437/interspeech.2017-1608.

  22. S. Yadav, A. Rai, in Proc. INTERSPEECH. Learning discriminative features for speaker identification and verification, (2018), pp. 2237–2241. https://doi.org/10.21437/interspeech.2018-1015.

  23. R. Li, N. Li, D. Tuo, M. Yu, D. Su, D. Yu, in IEEE International Conference on Acoustics, Speech and Signal Processing. Boundary discriminative large margin cosine loss for text-independent speaker verification, (2019), pp. 6321–6325. https://doi.org/10.1109/icassp.2019.8682749.

  24. Y. Liu, L. He, J. Liu, Large margin softmax loss for speaker verification (2019). arXiv preprint arXiv:1904.03479.

  25. D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, S. Khudanpur, in IEEE International Conference on Acoustics, Speech and Signal Processing. X-vectors: Robust DNN embeddings for speaker recognition, (2018), pp. 5329ā€“5333. https://doi.org/10.1109/icassp.2018.8461375.

  26. G. Saon, H. Soltau, D. Nahamoo, M. Picheny, in IEEE Workshop on Automatic Speech Recognition and Understanding. Speaker adaptation of neural network acoustic models using i-vectors, (2013), pp. 55–59. https://doi.org/10.1109/asru.2013.6707705.

  27. Y. Liu, L. He, J. Liu, M. T. Johnson, in Proc. INTERSPEECH. Speaker embedding extraction with phonetic information, (2018), pp. 2247–2251. https://doi.org/10.21437/interspeech.2018-1226.

  28. A. Nagrani, J. S. Chung, A. Zisserman, in Proc. INTERSPEECH. VoxCeleb: a large-scale speaker identification dataset, (2017), pp. 2616–2620. https://doi.org/10.21437/interspeech.2017-950.

  29. A. Park, T. J. Hazen, in Proceedings of the 7th International Conference on Spoken Language Processing. ASR dependent techniques for speaker identification (International Speech Communication Association, 2002), pp. 1337–1340. https://www.isca-speech.org/archive/icslp_2002/i02_1337.html.

  30. B. J. Baker, R. J. Vogt, S. Sridharan, in Proc. INTERSPEECH. Gaussian mixture modelling of broad phonetic and syllabic events for text-independent speaker verification (International Speech Communication Association, 2005), pp. 2429–2432. https://www.isca-speech.org/archive/interspeech_2005/i05_2429.html.

  31. Y. Lei, N. Scheffer, L. Ferrer, M. McLaren, in IEEE International Conference on Acoustics, Speech and Signal Processing. A novel scheme for speaker recognition using a phonetically-aware deep neural network, (2014), pp. 1695–1699. https://doi.org/10.21236/ada613971.

  32. Y. Tian, L. He, M. Cai, W. -Q. Zhang, J. Liu, in IEEE International Conference on Acoustics, Speech and Signal Processing. Deep neural networks based speaker modeling at different levels of phonetic granularity, (2017), pp. 5440–5444. https://doi.org/10.1109/icassp.2017.7953196.

  33. M. McLaren, Y. Lei, L. Ferrer, in IEEE International Conference on Acoustics, Speech and Signal Processing. Advances in deep neural network approaches to speaker recognition, (2015), pp. 4814–4818. https://doi.org/10.1109/icassp.2015.7178885.

  34. L. Li, D. Wang, Y. Chen, Y. Shi, Z. Tang, T. F. Zheng, in IEEE International Conference on Acoustics, Speech and Signal Processing. Deep factorization for speech signal, (2018), pp. 5094–5098. https://doi.org/10.1109/icassp.2018.8462169.

  35. M. H. Rahman, I. Himawan, M. McLaren, C. Fookes, S. Sridharan, in Proc. INTERSPEECH. Employing phonetic information in DNN speaker embeddings to improve speaker recognition performance, (2018), pp. 3593–3597. https://doi.org/10.21437/interspeech.2018-1804.

  36. Z. Tang, L. Li, D. Wang, R. Vipperla, Collaborative joint training with multitask recurrent model for speech and speaker recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 25(3), 493–504 (2017).

  37. G. Pironkov, S. Dupont, T. Dutoit, in Signal Processing Conference (EUSIPCO). Speaker-aware long short-term memory multi-task learning for speech recognition, (2016), pp. 1911–1915. https://doi.org/10.1109/eusipco.2016.7760581.

  38. Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, K. Yu, Deep feature for text-dependent speaker verification. Speech Commun. 73, 1–13 (2015). https://doi.org/10.1016/j.specom.2015.07.003.

  39. D. Garcia-Romero, C. Y. Espy-Wilson, in Proc. INTERSPEECH. Analysis of i-vector length normalization in speaker recognition systems (International Speech Communication Association, 2011), pp. 256–259. https://www.isca-speech.org/archive/interspeech_2011/i11_0249.html.

  40. T. Sercu, C. Puhrsch, B. Kingsbury, Y. LeCun, in IEEE International Conference on Acoustics, Speech and Signal Processing. Very deep multilingual convolutional neural networks for LVCSR, (2016), pp. 4955–4959. https://doi.org/10.1109/icassp.2016.7472620.

  41. A. F. Martin, C. S. Greenberg, in Proc. INTERSPEECH. The NIST 2010 speaker recognition evaluation (International Speech Communication Association, 2010), pp. 2726–2729. https://www.isca-speech.org/archive/interspeech_2010/i10_2726.html.

  42. NIST 2016 Speaker Recognition Evaluation Plan. https://www.nist.gov/system/files/documents/2016/10/07/sre16_eval_plan_v1.3.pdf. Accessed 27 Nov 2019.

  43. NIST 2018 Speaker Recognition Evaluation Plan. https://www.nist.gov/system/files/documents/2018/08/17/sre18_eval_plan_2018-05-31_v6.pdf. Accessed 27 Nov 2019.

  44. M. McLaren, D. Castan, M. K. Nandwana, L. Ferrer, E. Yilmaz, in Speaker Odyssey. How to train your speaker embeddings extractor (International Speech Communication Association, 2018). https://www.isca-speech.org/archive/Odyssey_2018/abstracts/34.html.

  45. D. Snyder, G. Chen, D. Povey, MUSAN: a music, speech, and noise corpus (2015). arXiv preprint arXiv:1510.08484.

  46. T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, S. Khudanpur, in IEEE International Conference on Acoustics, Speech and Signal Processing. A study on data augmentation of reverberant speech for robust speech recognition, (2017), pp. 5220–5224. https://doi.org/10.1109/icassp.2017.7953152.

  47. V. Panayotov, G. Chen, D. Povey, S. Khudanpur, in IEEE International Conference on Acoustics, Speech and Signal Processing. Librispeech: an ASR corpus based on public domain audio books, (2015), pp. 5206–5210. https://doi.org/10.1109/icassp.2015.7178964.

  48. D. Povey, A. Ghoshal, G. Boulianne, et al., in IEEE Workshop on Automatic Speech Recognition and Understanding. The Kaldi speech recognition toolkit (IEEE Signal Processing Society, 2011). http://publications.idiap.ch/index.php/publications/show/2265.

  49. D. Povey, X. Zhang, S. Khudanpur, Parallel training of deep neural networks with natural gradient and parameter averaging (2014). arXiv preprint arXiv:1410.7455.

  50. N. Brümmer, E. De Villiers, The BOSARIS toolkit: theory, algorithms and code for surviving the new DCF (2013). arXiv preprint arXiv:1304.2865.

Acknowledgements

We would like to thank the editor and anonymous reviewers for their careful work and thoughtful suggestions, which helped to improve this paper substantially.

Funding

This work was supported by the National Natural Science Foundation of China under grant no. 61403224 and no. U1836219.

Author information

Contributions

YL proposed the three methods for introducing phonetic information into speaker embedding and implemented the systems using Kaldi. He also wrote the draft of this paper. LH and YL discussed in detail the system implementation and the structure of the paper. JL gave advice on the paper. MTJ revised the manuscript and polished the English usage. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Liang He.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Liu, Y., He, L., Liu, J. et al. Introducing phonetic information to speaker embedding for speaker verification. J AUDIO SPEECH MUSIC PROC. 2019, 19 (2019). https://doi.org/10.1186/s13636-019-0166-8

Keywords