
Learning domain-heterogeneous speaker recognition systems with personalized continual federated learning

Abstract

Speaker recognition, the process of automatically identifying a speaker based on individual characteristics in speech signals, presents significant challenges when addressing heterogeneous-domain conditions. Federated learning, a recent development in machine learning methods, has gained traction in privacy-sensitive tasks, such as personal voice assistants in home environments. However, its application in heterogeneous multi-domain scenarios for enhancing system customization remains underexplored. In this paper, we propose the utilization of federated learning in heterogeneous situations to enable adaptation across multiple domains. We also introduce a personalized federated learning algorithm designed to effectively leverage limited domain data, resulting in improved learning outcomes. Furthermore, we present a strategy for implementing the federated learning algorithm in practical, real-world continual learning scenarios, demonstrating promising results. The proposed federated learning method exhibits superior performance across a range of synthesized complex conditions and continual learning settings, compared to conventional training methods.

1 Introduction

Speaker recognition, a critical task in the field of speech processing, involves the automatic identification and verification of speakers based on individual characteristics embedded in speech signals. With the growing ubiquity of voice-controlled devices and systems, speaker recognition has become an essential component for various applications, including security, authentication, and personalized user experiences.

Deep neural networks have become the cornerstone of modern machine learning applications, often requiring large amounts of labeled training data to achieve optimal performance. Traditionally, this data is collected from end-devices, such as smartphones, and sent to a centralized server for model training. However, this approach raises concerns regarding user privacy and the potential burden on communication links due to the transmission of large datasets.

In recent years, researchers in the speaker recognition field have placed increasing focus on learning speaker features that remain robust across multiple conditions [1, 2], including different room acoustics, languages, and channel conditions, all of which contribute to degraded speaker recognition performance. Much of this research relies on domain adaptation methods to improve system performance in these scenarios. However, many of these approaches require gathering both the target-domain and the source-domain data in a central data center, which is not only cost-inefficient but also sometimes impossible.

Federated learning (FL), an emerging machine learning paradigm, has gained significant attention in recent years for its potential to improve privacy and enable collaborative learning among distributed data sources. FL allows multiple clients to jointly train a model without sharing raw data, which can be particularly useful in privacy-sensitive applications. Despite its growing popularity, the application of FL in heterogeneous multi-domain conditions for enhancing system customization in speaker recognition remains relatively unexplored. Federated learning can be broadly categorized into two main types [3]:

  • Cross-device federated learning: This method focuses on jointly learning speech characteristics from numerous mobile or similar devices to train a unified statistical model for speaker recognition. In this typical scenario, data is often limited and may have lower labeling quality.

  • Cross-silo federated learning: In this setting, organizations like universities can be regarded as remote devices containing substantial student data. These organizations must adhere to strict privacy practices and navigate potential legal, administrative, or ethical constraints to ensure data privacy. Federated learning can be employed in this scenario with relatively more abundant data and better labeling quality, facilitating the construction of supervised yet cost-effective training in this situation.

Privacy concerns are regarded as one of the major challenges in speaker recognition applications since they involve the complete sharing of speech data, which can have serious implications for user privacy. Federated learning can mitigate privacy infringement in speaker recognition systems by enabling multiple participants to collaboratively learn a shared model without revealing their local data, as recently examined by Woubie and Bäckström [4]. Interestingly, an emerging trend in the FL domain involves utilizing federated learning for domain adaptation and personalization, leading to a research area known as personalized FL [5]. This approach eliminates the need for centralized data transmission, storage, and training, making adaptation to diverse and complex client conditions more feasible and reasonable. Another challenge in real-world scenarios includes ever-changing data with limited buffer capacity, prompting research into continual learning and online learning [6].

The main contribution of this work is the application of federated learning techniques to train supervised deep neural network-based speaker recognition models, with the goal of customizing speaker information across multiple heterogeneous domains while preserving user privacy. Unlike previous works on FL in combination with speech and speaker recognition, which mainly focus on privacy-preservation scenarios and simple client collaboration within a single data domain, we concentrate on multi-domain client collaboration and heterogeneous domain adaptation using personalized FL. To achieve this, we simulate iconic acoustic conditions using the room acoustic software Pyroom [7] and select multi-lingual datasets to design and compose client datasets. We also evaluate various personalized training strategies to identify a better approach that outperforms centralized training. Finally, we explore ways to combine FL methods with continual learning techniques, enabling them to function effectively in real-world continual learning scenarios.

In summary, this paper explores the effectiveness of personalized federated learning (PFL) and Federated Learning combined with continual learning methods within the scope of speaker identification and speaker verification tasks across multiple heterogeneous domains. Our primary contributions can be outlined as follows:

  • We propose a speaker recognition system based on personalized federated learning (PFL), leveraging supervised speaker data stored across different silos. By learning client-dependent projection modules, our approach enables better adaptation to various scenarios and demonstrates promising performance in both speaker identification and speaker verification tasks.

  • We simulate and evaluate our systems using room acoustics software to assess PFL’s effectiveness in domain adaptation scenarios. We compare PFL’s performance with centralized training and other common baselines, showing that PFL surpasses these alternatives, a finding unreported in other FL-based speech research. Our carefully designed training strategies demonstrate that the proposed PFL methods are particularly suitable for domain-heterogeneous speaker recognition scenarios.

  • We effectively integrate federated learning with continual learning settings, introducing continual personalized federated learning (C-PFL) that delivers robust performance throughout training stages. Our chosen random prototype casting training strategy, employed as an enhancement, proves to be beneficial when combined with C-PFL.

The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 describes the proposed model. Section 4 details the experimental setup and presents the results and analysis. Section 5 discusses our future considerations. Finally, Section 6 concludes the paper.

2 Related work

2.1 Speaker recognition

Automatic speaker recognition has a rich history, with early methods including probabilistic models, deep neural networks combined with probabilistic models, and end-to-end speaker recognition models [8,9,10,11]. Speaker recognition encompasses three sub-tasks: speaker verification, speaker identification [1], and speaker diarization [12]. This paper primarily focuses on speaker verification and speaker identification tasks.

Over the past decade, neural network-based speaker recognition models have achieved superior performance, becoming the dominant approach in the field. The x-vector model, which extracts speaker-related features from acoustic properties using neural networks [9], can be considered a milestone in modern deep neural network speaker recognition. Subsequent years have seen the development of convolutional-based [13], complex 1D temporal neural network-based [10], transformer-based DNN systems [14], and autoML-based systems [15], all of which have contributed to significant progress in speaker recognition.

Following the trend of large-scale training, speaker recognition models that leverage transformer-based pre-training [16, 17] have demonstrated impressive improvements over previous deep learning methods.

2.2 Domain adaptation and robust speaker recognition

Although speaker recognition systems have demonstrated strong performance on numerous benchmark datasets, recognizing speech in complex and diverse domains remains a challenging problem. Current methods for addressing domain adaptation issues can be categorized into several groups. Back-end statistical model adaptation techniques [18, 19] utilize first- and second-order information from the embedding space feature distribution to adapt the backend classification model. These models are generally lightweight [20], resulting in high explainability.

Recently, many works have focused on developing methods that can better learn from different datasets or conditions, employing transfer learning [21] approaches such as domain adversarial learning [22, 23] or discrepancy minimization methods [24]. Other techniques concentrate on using meta-learning methods to construct domain-agnostic pretrained models from various datasets [25] or adapting parts of the base pretrained model to build better target domain representations [26].

In order to enhance the robustness of speaker recognition models, some researchers utilize data augmentation methods. Notable recent works include using text-to-speech techniques to synthesize fake speakers [27, 28], which helps make the speaker model more generalizable. Other approaches involve more sophisticated audio signal processing technologies in the front-end, such as beamforming methods [29], dereverberation techniques [30], and speech separation methods [31], all of which contribute to making speaker recognition systems more robust in varying acoustic conditions.

2.3 Continual learning and its application

Continual learning and its related online learning scenarios and methods are gaining more attention recently [6]. In recent research, several approaches have been proposed to tackle the challenges of continual learning and generalization in various domains, including speaker verification and automatic speech recognition. In [32], the authors propose a continual-learning-based method to incrementally learn new spoofing attacks for speaker verification systems without performance degradation on previous data. Paper [33] presents a dynamically expanding end-to-end model for the speech recognition task, which helps avoid catastrophic forgetting and seamlessly integrate knowledge from new data. Paper [34] focuses on online continual learning for automatic speech recognition and demonstrates the effectiveness of incremental model updates using the online Gradient Episodic Memory (GEM) method.

2.4 Federated learning and its application

Federated learning has emerged as a promising technology in the field of machine learning, with significant potential for preserving user privacy [35]. In the speech community, numerous studies have employed federated learning techniques in various applications such as automatic speech recognition [36,37,38,39,40], keyword spotting [41, 42], and speech emotion detection [43].

A number of works have also applied federated learning methods to speaker recognition tasks, including [4, 44, 45]. These studies primarily focus on utilizing federated learning to enhance data privacy and explore the data class distribution properties of each client in non-IID scenarios within the same domain condition.

3 Learning speaker features with personalized federated learning

The federated learning-based speaker recognition system consists of two learning procedures: client-side fine-tuning and server-side updates. Federated learning allows for distributed training of speaker recognition models across domain-heterogeneous clients. Our proposed personalized federated learning (PFL) system for speaker recognition operates in two primary locations: edge silos (clients) and a central server. The overall architecture of the system, showcasing the server-side, silos’ domain conditions, and individual client learning details, is illustrated in Fig. 1.

$$\begin{aligned} \min _{\theta } G(\theta ), \text{ where } G(\theta ):=\sum _{i=1}^{n} q_{i} G_{i}(\theta ) \end{aligned}$$
(1)
Fig. 1 Architecture of the proposed multi-domain personalized federated speaker system. The proposed architecture is used in major speaker recognition sub-tasks including speaker verification and identification

In Eq. (1), we aim to minimize the global objective function \(G(\theta )\), which is defined as the weighted sum of local objective functions \(G_i(\theta )\) across n clients. Here, \(q_i\) denotes the weights for aggregating the targets, and \(\theta\) represents the model parameters.

This formulation illustrates the process of training a centralized model over a distributed dataset, where a multitude of clients hold variable-sized subsets of the data. During each iteration of training, a local model update is computed at the device level and communicated to a central server. Subsequently, the central server combines a large number of these updates or gradients to compute a global update to the central model. This global update is essentially an average of the local updates, which ensures the preservation of privacy and efficient utilization of distributed data.
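As an illustration of this aggregation step, the following minimal sketch averages client model parameters with weights \(q_i\); the function name and the choice of data-size-proportional weights are our own assumptions for exposition, not details specified in the paper.

```python
from collections import OrderedDict
import torch

def server_aggregate(client_states, client_sizes):
    """Weighted average of client parameter dictionaries, as in Eq. (1).
    client_states: list of state_dicts returned by the clients.
    client_sizes: local dataset sizes used to form the weights q_i."""
    total = float(sum(client_sizes))
    q = [s / total for s in client_sizes]
    global_state = OrderedDict()
    for name in client_states[0]:
        global_state[name] = sum(qi * state[name].float()
                                 for qi, state in zip(q, client_states))
    return global_state
```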

$$\begin{aligned} G(\theta , X, Y):=\text {LossFunc}(\text {Trans}(\text {Base}(X)), Y) \end{aligned}$$
(2)

Equation (2) demonstrates the core concept of personalized FL, where \(\text {Base}(\cdot )\) serves as the model combination, employing a global feature extractor based on x-vectors in our setup. Meanwhile, \(\text {Trans}(\cdot )\) represents a transformer-based personalized feature projector for each domain-specific client before calculating the final loss for each client. This approach adopts the transfer learning-based personalized FL (TL-PFL) strategy, as described in [5].

Given a speaker embedding from \(\text {Base}(\cdot )\), arranged as a sequence \(X = \{x_1, x_2, \dots , x_n\}\), the transformer encoder first applies a positional encoding to the input embeddings, which allows the model to utilize positional information. The encoded sequence \(Z^0 = \{z_1^0, z_2^0, \dots , z_n^0\}\) is then fed into the first layer of the encoder. Each layer l computes the output sequence \(Z^l = \{z_1^l, z_2^l, \dots , z_n^l\}\).

In the transformer encoder, the Multi-Head Attention mechanism is denoted as \(\text {MultiHead}(Q, K, V)\), where Q, K, and V represent the query, key, and value matrices, respectively, serving as the inputs for this mechanism.

For each layer of the transformer encoder:

$$\begin{aligned} Z^{l, \text {att}} &= \text {MultiHead}(Z^{l-1}, Z^{l-1}, Z^{l-1}) \end{aligned}$$
(3)
$$\begin{aligned} Z^{l, \text {ff}} &= \text {FFN}(Z^{l, \text {att}}) \end{aligned}$$
(4)
$$\begin{aligned} Z^l &= \text {LayerNorm}(Z^{l-1} + Z^{l, \text {att}}) \end{aligned}$$
(5)
$$\begin{aligned} Z^{l+1} &= \text {LayerNorm}(Z^l + Z^{l, \text {ff}}) \end{aligned}$$
(6)

where \(\text {FFN}\) represents the position-wise feed-forward network, and \(\text {LayerNorm}\) denotes layer normalization.
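For concreteness, a minimal PyTorch sketch of one such encoder layer is given below, following Eqs. (3)–(6) literally (feed-forward applied to the attention output, then two residual-plus-LayerNorm steps); the embedding dimension, head count, and hidden size are placeholder values rather than the configuration used in our experiments.

```python
import torch
import torch.nn as nn

class ProjectorEncoderLayer(nn.Module):
    """One layer of the personalized transformer projector, per Eqs. (3)-(6)."""
    def __init__(self, d_model=192, n_heads=4, d_ff=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, z_prev):                        # z_prev: (batch, seq, d_model) = Z^{l-1}
        z_att, _ = self.attn(z_prev, z_prev, z_prev)  # Eq. (3): self-attention
        z_ff = self.ffn(z_att)                        # Eq. (4): FFN on the attention output
        z_l = self.norm1(z_prev + z_att)              # Eq. (5): residual + LayerNorm
        return self.norm2(z_l + z_ff)                 # Eq. (6): residual + LayerNorm
```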

As illustrated in Fig. 1, the final output embedding for evaluation is produced by two branches, namely the Base Feature (referred to as Type-A, Discriminator-projection) and Projection Feature (referred to as Type-B, Feature-projection), which can be used individually or in combination. The embedding is used directly for cosine similarity comparison in speaker verification tasks following [10]. For speaker identification tasks, a separate logistic regression-based classifier is learned for each speaker in each evaluation subset:

$$\begin{aligned} P(c_{id}|\mathbf {x_{emb}}; \textbf{w}) = \sigma (\mathbf {W_{id}^{(i)}}^T \mathbf {x_{emb}}) \end{aligned}$$
(7)

Here, \(c_{id}\) represents the identification output class ID, \(\mathbf {{W_{id}^{(i)}}}^T\) denotes the learnable weight for the i-th evaluation subset, and \(\mathbf {x_{emb}}\) is the generated speaker embedding.
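The two evaluation protocols can be sketched as follows; the cosine-scoring function and the per-subset scikit-learn classifier are illustrative stand-ins for the evaluation pipeline, with names chosen for exposition only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cosine_score(emb_a, emb_b):
    """Speaker verification: cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-8))

def fit_subset_identifier(enrol_embs, speaker_ids):
    """Speaker identification, Eq. (7): a logistic-regression classifier
    trained on the enrolment embeddings of one evaluation subset."""
    clf = LogisticRegression(max_iter=1000)
    return clf.fit(enrol_embs, speaker_ids)
```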

Algorithm 1 Server update algorithm for PFL & C-PFL

Algorithm 2 Client update algorithm in C-PFL
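Since the algorithm figures are not reproduced here, the sketch below gives one plausible reading of the client-side step: the shared \(\text {Base}(\cdot )\) extractor is synchronized from the server, both the base and the client-private projector are fine-tuned locally, and only the base parameters are returned for aggregation. Function names, the optimizer, and hyperparameters are our assumptions rather than the exact content of Algorithm 2.

```python
import copy
import torch

def client_update(global_base_state, base_model, projector, loader,
                  loss_fn, local_epochs=1, lr=1e-3):
    """One local round of (C-)PFL on a single silo (illustrative sketch)."""
    base_model.load_state_dict(global_base_state)        # sync the shared Base(.)
    params = list(base_model.parameters()) + list(projector.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(local_epochs):
        for x, y in loader:
            emb = base_model(x)                           # global feature extractor
            logits = projector(emb)                       # client-private Trans(.) + classifier
            loss = loss_fn(logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    # The projector never leaves the silo; only the base weights are uploaded.
    return copy.deepcopy(base_model.state_dict())
```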

3.1 Heterogeneous-domain continual personalized federated learning

We present the heterogeneous-domain across-silo continual personalized federated learning (C-PFL) method, which combines the principles of continual learning and federated learning. Our approach is specifically designed to address the challenges encountered when new data from different stages become available, all while preserving the aforementioned FL mechanism across diverse silos.

The key idea behind our method is to dynamically update the output classifier parameters with new weights when data from new stages arrive. This is achieved by continually adapting the model to incorporate the information from the newly acquired data without causing significant interference with the previously learned knowledge.

In the proposed C-PFL method, given the output of speaker embedding as \(\textbf{x} \in R^n\), the output probability for each class is estimated by the equation:

$$\begin{aligned} P(c|\textbf{x}; \textbf{w}, \textbf{b}) = \sigma (\textbf{W}^T \textbf{x} + \textbf{b}) \end{aligned}$$
(8)

where \(c \in N_0\) is the non-negative label ID for the speaker embedding, \(\sigma (\cdot )\) is the Softmax function, and \(\textbf{W} \in R^{C \times n}\) and \(\textbf{b} \in R^C\). Given that we have T continual learning stages in which each stage has a speaker set \(\mathbf {c_t}\), we configure C as follows:

$$\begin{aligned} \sum _{t=1}^{T}|\mathbf {c_t}| \ll C \end{aligned}$$
(9)
$$\begin{aligned} c_{ti} \sim \text {Uniform}(0, C) \end{aligned}$$
(10)

where \(c_{ti}\) is the assigned label for speaker embedding i in the t-th continual learning stage. As new data arrive at different stages, the output classifier sets new weights specifically for the classes in these new stages with high probability if C is set large enough:

$$\begin{aligned} P_{new}(t) > 1-\frac{\sum _{\tau =1}^{t-1}|\mathbf {c_\tau }|}{C} \end{aligned}$$
(11)
$$\begin{aligned} \lim _{C \rightarrow \infty } P_{new}(t, C) = 1 \end{aligned}$$
(12)

Here, \(P_{new}(t)\) is the probability of setting new classes at the continual learning stage t. Let \(\Omega ^{(t)}\) denote the classifier weights vector sets (also called prototypes) in Eq. (8) used at stage t, which is defined as:

$$\begin{aligned} \Omega ^{(t)} &= \{W_{:, c_{ti}} \mid c_{ti} \in \mathbf {c_t}\} \end{aligned}$$
(13)
$$\begin{aligned} \Omega ^{(t_1)} \cap \Omega ^{(t_2)} &= \emptyset ; \quad (t_1 \ne t_2) \end{aligned}$$
(14)

Equation (14) holds true when the condition in (9) is met. In the initial learning stage, we employ weights \(\Omega ^{(1)}\). As new data become available for subsequent stages, we introduce additional weights \(\Omega ^{(2)}\), and so forth. This method enables the model to learn continually from new data without catastrophic forgetting of previously acquired knowledge. The detailed procedures for PFL and C-PFL are shown in Algorithm 1 and Algorithm 2. We refer to this approach as “C-PFL with enhanced training strategy,” in contrast to a simpler method that employs Eq. (8) without random prototype casting, which we designate as “C-PFL with a standard training strategy.”
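A minimal sketch of the random prototype casting step follows: each speaker of a new stage is assigned a classifier column drawn uniformly from an over-provisioned prototype set of size C, so that, per Eqs. (9)–(14), the prototype sets of different stages are disjoint with high probability. Resampling on the rare collision is an implementation choice for this illustration, not part of the formal analysis.

```python
import random

def cast_stage_prototypes(C, used_ids, stage_speakers, seed=None):
    """Assign each new-stage speaker a random prototype index in {0, ..., C-1}.
    used_ids: set of prototype indices already taken by earlier stages."""
    rng = random.Random(seed)
    mapping = {}
    for spk in stage_speakers:
        cid = rng.randrange(C)            # c_ti ~ Uniform(0, C), Eq. (10)
        while cid in used_ids:            # resample on the (rare) collision
            cid = rng.randrange(C)
        used_ids.add(cid)
        mapping[spk] = cid
    return mapping
```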

3.2 Silo acoustic simulation

To evaluate the PFL and C-PFL methods in scenarios involving multiple domain clients, we simulated room acoustic settings in our experiments using the Pyroom Acoustics library. Six distinct room configurations were employed to thoroughly examine the performance of our models under challenging acoustic conditions. As illustrated in Fig. 2, these room settings encompassed a diverse range of sizes, shapes, and reverberation characteristics, thereby providing a comprehensive evaluation framework. In each room, we tested the models using both single OMNI-directional microphones and OMNI-directional microphone arrays. This approach enabled us to investigate the impact of different microphone configurations on the overall system performance, as well as assess the robustness of our models in coping with diverse real-world scenarios.
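A minimal Pyroomacoustics sketch of such a room client is shown below; the room dimensions, absorption, and microphone positions are placeholder values rather than the exact settings of Table 1.

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
speech = np.random.randn(3 * fs)                       # stand-in for a 3 s utterance

# Shoebox room with frequency-flat absorption (illustrative values).
room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs,
                   materials=pra.Material(0.3), max_order=17)
room.add_source([2.0, 2.5, 1.6], signal=speech)

# A small circular OMNI-directional array at 1.5 m height.
mic_xy = pra.circular_2D_array(center=[4.0, 3.0], M=4, phi0=0.0, radius=0.05)
mic_locs = np.vstack([mic_xy, 1.5 * np.ones(4)])
room.add_microphone_array(pra.MicrophoneArray(mic_locs, fs))

room.simulate()
reverberant = room.mic_array.signals                   # (n_mics, n_samples)
rt60 = room.measure_rt60()                             # per source/mic RT60 estimates
```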

Fig. 2 The room acoustic settings simulated with Pyroomacoustics. The six room settings used in the experiments are shown; each room uses either a single OMNI-directional microphone or an OMNI-directional microphone array

The first three rooms are designed with varying sizes to simulate different room reverberation properties, representing small, medium, and large rooms. The left part of Fig. 3 demonstrates the RT60 metrics of each room, showcasing their distinct reverberation characteristics. The fourth room includes a noise source alongside the sound source to simulate noisy room conditions.

Fig. 3 Detailed room simulation information: the first three panels display the RT60 properties of the small, large, and medium rooms in our simulation; the last two panels illustrate the beamforming settings in Room5 and Room6, as well as the beam patterns across different frequency bands

Our room simulations also incorporate delay-and-sum (DAS) beamforming to emulate the performance of real-world multi-channel microphone array acoustic environments and evaluate how our systems perform under these common conditions. The right part of Fig. 3 depicts the beamforming settings for Room5 and Room6, presenting the beam patterns across different frequency bands. Room6 additionally includes a noise source, as in Room4. The delay-and-sum beamforming algorithm is applied in these two acoustic scenarios. Let \(x_i(t)\) denote the signal received by the \(i^{th}\) microphone in these rooms, where \(i=1,\ldots ,M\), and M is the total number of microphones. The time-delay \(\tau _i\) is calculated for each microphone based on the desired direction of the beam, as expressed below:

$$\begin{aligned} \tau _i &= \frac{d_i}{c} \end{aligned}$$
(15)
$$\begin{aligned} y(t) &= \sum _{i=1}^M x_i(t - \tau _i) \end{aligned}$$
(16)

where \(d_i\) represents the difference in distance between the source and the \(i^{th}\) microphone and the reference microphone, and c is the speed of sound. The output signal y(t) is then computed by summing the delayed signals from all microphones to enhance the preset direction.
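The following NumPy sketch applies Eqs. (15) and (16) directly, rounding each delay to an integer number of samples and using a circular shift for brevity; a production implementation would typically use fractional-delay filtering instead.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, source_position, fs, c=343.0):
    """Delay-and-sum beamforming, Eqs. (15)-(16).
    signals: (M, T) microphone signals; positions are 3-D coordinates in metres."""
    mic_positions = np.asarray(mic_positions, dtype=float)
    source_position = np.asarray(source_position, dtype=float)
    ref_dist = np.linalg.norm(source_position - mic_positions[0])
    output = np.zeros(signals.shape[1])
    for mic, x in zip(mic_positions, signals):
        d_i = np.linalg.norm(source_position - mic) - ref_dist   # extra path length vs. reference mic
        tau_samples = int(round(d_i / c * fs))                   # tau_i = d_i / c, Eq. (15)
        output += np.roll(x, tau_samples)                        # x_i(t - tau_i), Eq. (16)
    return output                                                # summed output y(t); divide by M for an average
```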

4 Experiments

4.1 Experiment settings

We utilize simulated room acoustics scenarios to assess the federated learning system and its associated algorithms in the context of speaker recognition tasks. The evaluation specifically targets both speaker verification and speaker identification tasks across multiple heterogeneous domain groups, and we also investigate the performance in continual learning settings. The configurations can be found in Table 1.

Table 1 Dataset settings for experiments

To build the speaker recognition system with personalized federated learning, we use the VoxCeleb [46] and CnCeleb [47] datasets. We select 100 speakers for each group, ensuring no overlap between groups. Each group simulates domain conditions based on the predefined settings described in the table, enabling a comprehensive assessment of the system’s performance across various acoustic environments and languages. We configure Groups 1 through 6 to use the settings of Rooms 1 through 6 with the English language VoxCeleb dataset, while Groups 7 through 12 employ the same room settings but with the Chinese language CnCeleb dataset.

For system evaluation, speaker IDs are selected according to the table for both verification and identification tasks. The evaluation speakers for each group remain consistent across all groups, with their speech processed by the simulation pipeline according to the predetermined settings.

In the continual learning settings, we additionally allocate Stage 2 and Stage 3 data using distinct speakers from VoxCeleb, while keeping the evaluation set consistent across all stages to evaluate the performance of the proposed systems during each stage.

Table 1 also documents detailed settings for each group, encompassing room dimensions, reverberation time, microphone array configurations, and other relevant parameters.

We utilize the VoxCeleb2 dataset to pre-train the x-vector model for the canonical model, following the same procedure as in [10]. Subsequently, we conduct further experiments with PFL, C-PFL, and other baseline algorithms based on this pre-trained model.

4.2 Performance evaluation of personalized federated learning

In Table 2, we present the evaluation results for various models across evaluation groups 1 to 12, along with their mean values. Our findings indicate that the Canonical model, following the classical training procedure of [10], demonstrates satisfactory performance in certain domain groups that exhibit similar acoustic conditions to the original dataset. However, its performance is limited in many other groups characterized by distinct acoustic conditions.

Table 2 Evaluation performance of PFL and the baseline

The centralized training model, incorporating a transformer block similar to the architecture of [14], exhibits a substantial improvement over the canonical models, as it leverages the domain information collectively within the central server. In contrast, the separated training procedure, employing a strategy similar to [26], does not consistently yield better performance than the canonical model. This highlights the impact of limited data for fine-tuning, which may lead to degraded performance compared to the original models.

The personalized federated learning strategy outperforms all other methods in these domain-agnostic scenarios, achieving the lowest mean EER among all approaches. The improvement is particularly significant in groups with rare acoustic conditions that deviate considerably from the original training data. This finding underscores the effectiveness of the PFL strategy in addressing speaker recognition tasks across diverse acoustic environments.

In Table 3, we evaluate various personalized federated learning (PFL) strategies, comparing the classical direct-discriminator approach, as used in [4], to the proposed personalized projection-based methods. Both methods yield promising results, with PFL achieving a better mean EER using the discriminator-projection strategy (PFL Type-A). Therefore, we further assess these two techniques by examining their convergence capabilities, as illustrated in Fig. 4. Utilizing a personalized discriminator demonstrates significantly better convergence performance, achieving the lowest EER within just 10 rounds, whereas the direct-discriminator approach requires approximately 25 rounds. This showcases the efficiency and effectiveness of the PFL method.

Table 3 Comparing performance of different training strategies of PFL
Fig. 4 Comparing the error rate between the FL-classical and the PFL Type-A

Moreover, we investigate the effects of various FL strategies on overall performance. Utilizing a feature-projection strategy (PFL Type-B) does not lead to better outcomes, indicating that the limited data available for fine-tuning the transformer-based projector presents challenges in creating a more efficient feature space. These insights emphasize the strengths and limitations of different Personalized FL strategies when addressing speaker recognition tasks in diverse acoustic environments.

4.3 Speaker identification task evaluation

We further evaluate the performance of the Personalized-FL (PFL) model by comparing its performance on the speaker identification task. The results are presented in Table 4. We compare various basic classification metrics across the 12 evaluation domain sets, including accuracy, precision, F1-score, and AUC. The corresponding micro-average ROC curves for each domain group are also assessed, as depicted in Fig. 5.

Table 4 Evaluation performance of centralized training and PFL training on the speaker identification task
Fig. 5 Receiver operating characteristic (ROC) curves and area under the curve (AUC) values for multi-class speaker identification using Models A (centralized training) and B (PFL Type-A). The plot shows individual ROC curves for each of the 12 groups, with different shades of blue for Model A and different shades of orange for Model B. The mean ROC curves for Model A and Model B are highlighted with thicker dashed lines. The AUC values for each curve are included in the legend

Our analysis reveals that using our proposed PFL strategy leads to significantly better results than centralized training, with the exception of some domain groups that have relatively more common and less challenging acoustic conditions. In these cases, both methods perform similarly. The mean ROC curve in Fig. 5 further illustrates this trend, providing a comprehensive visualization of the PFL strategy’s effectiveness across diverse domain groups. Overall, the PFL model demonstrates superior performance in the majority of domain groups, showcasing its potential for enhancing speaker identification tasks in various acoustic environments.

4.4 Evaluation of continual personalized federated learning

Table 5 presents the performance of continual personalized federated learning (C-PFL). We evaluate the model across all episodes and stages throughout the three training stages. It is worth noting that we use the same evaluation sets across all training stages, which serves as a good way to assess how different learning methods adapt to ever-changing new data and are influenced by the deletion of previous data from storage. Remarkably, our C-PFL method effectively integrates information from the data while enabling knowledge generalization and preventing catastrophic forgetting.

Table 5 Evaluation performance of C-PFL in continual learning settings

Upon examining the results, we observe a consistent improvement in the mean EER during the training of each stage, indicating enhanced performance. Furthermore, the standard deviation decreases, signifying a more uniform improvement across all evaluation domain groups. This trend underscores the robustness of the C-PFL method in adapting to various domain datasets and demonstrates its potential for real-world continual learning applications. The model effectively generalizes to new knowledge while avoiding catastrophic forgetting.

In Fig. 6, we compare the performance of the standard C-PFL strategy with our enhanced C-PFL strategy across each stage of the learning process. The standard strategy appears to struggle with retaining previously acquired knowledge and generalizing it to future learning stages. In contrast, the enhanced strategy demonstrates superior performance in continual learning scenarios.

Fig. 6 Comparison of error rates between C-PFL with and without enhanced training strategy, and the reference performance of single-stage non-continual PFL training

This improvement can be attributed to the design of the proposed C-PFL method, which effectively balances the retention of prior knowledge with the acquisition of new information by randomly shifting the weight designation in the output classifier of the PFL module. As a result, our enhanced strategy exhibits a more stable learning curve, ensuring consistently better performance across various stages and domain datasets. This finding highlights the advantages of employing the C-PFL with an enhanced strategy approach in real-world continual learning speaker recognition applications, where knowledge retention and generalization are critical for achieving reliable and robust performance.

In Fig. 7, we compare the impact of different training data incoming sequences on the continual learning process to assess the influence that varying data sequences may have on the results. We examine two scenarios: one where we use Stage 2 datasets first, followed by Stage 3 datasets, and another where we reverse the order. Our findings demonstrate that the proposed strategy remains effective regardless of the data sequence used, as evidenced by the EER plots in the figure.

Fig. 7 Comparison of error rates for C-PFL with different training data input sequences

In both cases, the EER on evaluation sets consistently improves at each stage, demonstrating that the system effectively leverages the knowledge acquired from previous stages to enhance subsequent training, regardless of the data incoming sequence. This observation underscores the robustness of the proposed C-PFL method.

4.5 Complementary experiments

We perform supplementary experiments to provide a deeper understanding and analysis of the characteristics of our proposed PFL method.

Figure 8 displays the loss tracking for various selected groups of our training data when implementing our proposed PFL Type-A system. Different groups necessitate varying numbers of training steps within a single combination episode, causing some clients to wait for others to finish their training.

Fig. 8 Tracking the loss learning curve for PFL training. Loss values for selected representative groups, along with their standard deviation range. The plot illustrates the loss values for Groups 1, 2, 4, 7, 8, and 10 during the first 300 steps of training

At the beginning of each episode, the loss initially increases to a local peak but converges rapidly compared to the previous episode, resulting in a lower loss value. This behavior illustrates the successful knowledge sharing and transfer between clients at each episode, enabling effective learning and adaptation across clients from multiple heterogeneous domains. The moving average mean and the standard deviation range shown in the plot indicate that all groups achieve good convergence within several hundred steps. This consistent convergence across various groups highlights the effectiveness of our proposed PFL approach in handling diverse data and domain conditions, ultimately improving the overall performance in speaker recognition tasks.

Figure 9 presents the t-SNE visualization of the embedding space for both personalized federated learning (PFL) and centralized training strategies. Each point represents an embedding from an utterance, with each cluster corresponding to a speaker class, while different colors indicate speakers from distinct domain client groups. It is evident that both PFL and centralized training methods encounter challenges in distinguishing some speakers and forming well-defined clusters for certain speaker IDs. This difficulty arises due to the severe interference caused by the diverse room acoustic conditions experienced by some groups.

Fig. 9 Comparing the embedding plot between the PFL and centralized training

Nevertheless, a significant difference between the two methods can be observed. Centralized training seems to blend domain information with speaker class information, resulting in a decline in performance compared to the PFL training. This confusion is particularly evident in the highlighted area of the t-SNE plot, where the speaker classes and domain groups are evidently entangled. In contrast, the PFL strategy exhibits better separation of speaker classes and domain information, leading to improved overall performance in speaker recognition tasks. This can be attributed to the personalized module constraining domain-specific information within a single client training through feature projection, thus avoiding interference with the global model training. Consequently, PFL inherently emerges as a suitable method to effectively address the challenges posed by complex room acoustics and diverse domain conditions, ultimately enhancing the robustness and generalizability of speaker recognition systems.

Figure 10 presents the results of a common problem in federated learning where, at each combination time point, not all clients participate in the aggregation process. We investigate the effects of varying combination ratios for the proposed PFL system, including 1.0 (normal case), 0.7, and 0.3. As the combination ratio decreases, we observe a decline in performance, yet all scenarios still outperform the centralized training baseline (represented by the grey dashed line). The minimum EER dotted lines for different systems are displayed in the plot to emphasize this trend.

Fig. 10 Comparison of error rates for PFL using various combination probabilities

These findings emphasize the importance of the combination ratio in achieving optimal performance with FL. Despite the performance degradation caused by a lower combination ratio, the proposed federated learning method remains effective, consistently outperforming the centralized training method. Although the performance is reasonably good compared to the baseline, the influence of partial combination, unsynchronized combination, and the waiting time for clients on PFL still remains an open problem, requiring further research.
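The partial-participation setting above can be emulated with a simple per-round sampling rule, sketched here with a hypothetical helper; the combination ratio values match those used in Fig. 10.

```python
import random

def sample_round_participants(client_ids, combination_ratio=0.7, seed=None):
    """Pick the silos that join a given aggregation round when the
    combination ratio is below 1.0 (partial participation)."""
    rng = random.Random(seed)
    k = max(1, round(combination_ratio * len(client_ids)))
    return rng.sample(client_ids, k)
```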

5 Discussion and future considerations

In the present study, we have introduced a novel personalized federated learning (PFL) system, engineered explicitly for speaker recognition tasks across various heterogeneous domains. The system has shown considerable advancements over existing baseline models. However, several aspects necessitate further exploration and discussion.

5.1 Future research and application

Moving forward, we aim to delve deeper into more advanced federated learning algorithms, such as those employing clustering-based FL methods [48]. Additionally, we intend to explore more pragmatic on-device learning algorithms, specifically those utilizing unsupervised streaming audio data. Our future research will also concentrate on crafting specific federated learning algorithms tailor-made for speaker diarization systems [12].

5.2 System design and reliability

The development of FL algorithms is intrinsically linked to system design; therefore, their integration into real-world systems demands careful consideration of numerous factors. Apart from the FL combination ratios discussed earlier, one significant element to consider is the implementation of asynchronous federated systems [49]. These systems cater to variations in client device timing prior to the aggregation in each training round, addressing potential downgrades in model accuracy due to disparities in device resources.

As we contemplate incorporating these features into tangible systems, the emphasis on system reliability intensifies. To mitigate reliability concerns, we can apply insights from Boudi et al.’s 2023 study [50], which showcases a robust and error-free career agent created using Deep Reinforcement Learning in tandem with formal verification. Moreover, the methodology presented by Ait-Ameur et al. (2023) [51] provides an excellent illustration of employing formal methods to handle complex cyber-physical systems. With the application of the Event-B formal specification language to model, verify, and refine these systems, we are well-equipped to design and verify the reliability of our FL-based speaker recognition system, accounting for variables such as device training latency and transmission latency, ensuring the system meets certain criteria.

5.3 Significance to AI, speech, and NLP fields

Our research contributes to the wider fields of Artificial Intelligence (AI), speech, and Natural Language Processing (NLP). In these research domains, one prevailing trend involves the centralized training and pre-training of large-scale models for speech and NLP tasks, such as the Whisper model for speech recognition developed by OpenAI [52]. These models primarily aim to enhance generalization across diverse scenarios. Another emerging trend is the development of models that emphasize customization and personalization, as exemplified by Lin et al. [53], which cater to the unique complexities of Chinese language processing to improve NLP tasks.

We posit that by applying federated learning, starting from a well-pretrained model, we can concurrently enhance the model’s generalization and personalization capabilities. This process is expedited and made seamless by the continual integration of wide-spread, privacy-sensitive data that would normally be unavailable or challenging to manage. Notably, in line with recent trends, the future integration of high-performance and explainable AI, as underscored by Bride et al. [54], presents an intriguing prospect for further enhancing our system’s robustness and transparency. Consequently, this combined approach paves the way for the steady evolution of our model, thereby advancing towards our aspiration of crafting a robust, explainable, and lifelong learning system in the future.

6 Conclusion

This study has demonstrated the effectiveness of personalized federated learning in the field of speaker recognition. By facilitating the training of speaker recognition models using supervised speaker data stored on various heterogeneous domain silos, the proposed system has exhibited promising results in both speaker verification and speaker identification tasks. Moreover, the learning of the client-dependent personalized module has allowed for better adaptation to diverse scenarios.

Our simulations and evaluations, based on room acoustics software, have underscored the advantages of using PFL in heterogeneous domain adaptation scenarios. Comparisons between various federated learning methods, centralized training, and separated training have revealed that PFL outperforms the other methods, a finding not reported in previous research.

Furthermore, PFL can be effectively integrated with continual learning settings, with the continual personalized federated learning method showcasing strong performance as the training stages progress. By employing a random prototype casting training strategy, the C-PFL with enhanced strategy has proven advantageous.

In summary, the findings of this study endorse the adoption of personalized federated learning as a valuable approach to speaker recognition tasks. This approach offers improvements in performance for heterogeneous domain adaptation compared to classical transfer learning methods for speaker recognition. Additionally, it presents a new way of learning in a continual, collaborative manner that brings valuable traits such as data security, training-costs saving, and adaptability to ever-changing data in real-world scenarios.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the authors upon reasonable request. Any additional materials, such as code or supplementary information, can also be provided by the authors upon request. The relevant code and data that support the findings of this study are also available at https://github.com/bicbrv/FedSPK following the date of publication.

References

  1. Z. Bai, X.L. Zhang, Speaker recognition based on deep learning: an overview. Neural Netw. 140, 65–99 (2021)

  2. Y. Tu, W. Lin, M.W. Mak, A Survey on Text-Dependent and Text-Independent Speaker Verification. IEEE Access 10, 99038–99049 (2022)

  3. OpenMined (2022), https://www.openmined.org/. Accessed 01 Apr 2023

  4. A. Woubie, T. Bäckström, in 2021 ISCA Symposium on Security and Privacy in Speech Communication. Federated Learning for Privacy Preserving On-Device Speaker Recognition. ISCA. pp. 1–5 (2021) 

  5. A.Z. Tan, H. Yu, L. Cui, Q. Yang, Towards Personalized Federated Learning. IEEE Trans. Neural Netw. Learn. Syst. 1–17 (2022). https://ieeexplore.ieee.org/document/9743558

  6. L. Wang, X. Zhang, H. Su, J. Zhu. A Comprehensive Survey of Continual Learning: Theory, Method and Application (2023). ArXiv preprint arXiv:2302.00487

  7. R. Scheibler, E. Bezzam, I. Dokmanić, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Pyroomacoustics: A Python package for audio room simulations and array processing algorithms (2018), pp. 351–355

  8. N. Dehak, P.J. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-End Factor Analysis for Speaker Verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011)

  9. D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, S. Khudanpur, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). X-Vectors: Robust DNN Embeddings for Speaker Recognition (2018). pp. 5329–5333. Publisher: IEEE

  10. B. Desplanques, J. Thienpondt, K. Demuynck, in 2020 Interspeech. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification (2020). pp. 3830–3834. Publisher: ISCA

  11. M. Zhao, Y. Ma, M. Liu, M. Xu, The SpeakIn System for VoxCeleb Speaker Recognition Challenge 2021. (2021). ArXiv preprint arXiv:2109.01989

  12. T.J. Park, N. Kanda, D. Dimitriadis, K.J. Han, S. Watanabe, S. Narayanan, A review of speaker diarization: recent advances with deep learning. Comput. Speech Lang. 72, 101317 (2022)

  13. Z. Wang, K. Yao, X. Li, S. Fang, in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Multi-Resolution Multi-Head Attention in Deep Speaker Embedding (2020). pp. 6464–6468. Publisher: IEEE

  14. R. Wang, J. Ao, L. Zhou, S. Liu, Z. Wei, T. Ko, Q. Li, Y. Zhang, in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Multi-View Self-Attention Based Transformer for Speaker Recognition (2022). pp. 6732–6736. Publisher: IEEE

  15. S. Ding, T. Chen, X. Gong, W. Zha, Z. Wang, AutoSpeech: Neural Architecture Search for Speaker Recognition. (2020). ArXiv preprint arXiv:2005.03215

  16. S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, F. Wei, WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 16(6), 1505–1518 (2022)

  17. W.N. Hsu, B. Bolte, Y.H.H. Tsai, K. Lakhotia, R. Salakhutdinov, A. Mohamed, HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 3451–3460 (2021)

  18. K.A. Lee, Q. Wang, T. Koshinaka, in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). The CORAL+ Algorithm for Unsupervised Domain Adaptation of PLDA (2019). pp. 5821–5825. Publisher: IEEE

  19. L. Li, Y. Zhang, J. Kang, T.F. Zheng, D. Wang, in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Squeezing Value of Cross-Domain Labels: A Decoupled Scoring Approach for Speaker Verification (2021). pp. 5829–5833. Publisher: IEEE

  20. Q. Wang, K. Okabe, K.A. Lee, T. Koshinaka, in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). A Generalized Framework for Domain Adaptation of PLDA in Speaker Recognition (2020). pp. 6619–6623. Publisher: IEEE

  21. M. Iman, K. Rasheed, H.R. Arabnia, A Review of Deep Transfer Learning and Recent Advancements. Technologies 11(2), 40 (2023)

  22. G. Bhattacharya, J. Monteiro, J. Alam, P. Kenny, in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Generative Adversarial Speaker Embedding Networks for Domain Robust End-to-End Speaker Verification (2019). pp. 6226–6230. Publisher: IEEE

  23. J. Rohdin, T. Stafylakis, A. Silnova, H. Zeinali, L. Burget, O. Plchot, in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Speaker verification using end-to-end adversarial language adaptation (2019). pp. 6006–6010. Publisher: IEEE

  24. Z. Wang, J.H. Hansen, Multi-source Domain Adaptation for Text-independent Forensic Speaker Verification. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 60–75 (2021)

  25. J. Kang, R. Liu, L. Li, Y. Cai, D. Wang, T.F. Zheng, Domain-Invariant Speaker Vector Projection by Model-Agnostic Meta-Learning. (2020). ArXiv preprint arXiv:2005.11900

  26. S. Sarfjoo, S. Madikeri, P. Motlicek, S. Marcel, in Interspeech 2020. Supervised domain adaptation for text-independent speaker verification using limited data (2020). pp. 3815–3819. Publisher: ISCA

  27. C. Du, B. Han, S. Wang, Y. Qian, K. Yu, in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). SynAug: Synthesis-Based Data Augmentation for Text-Dependent Speaker Verification (2021). pp. 5844–5848

  28. H. Huang, X. Xiang, F. Zhao, S. Wang, Y. Qian, in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Unit Selection Synthesis Based Data Augmentation for Fixed Phrase Speaker Verification (2021). pp. 5849–5853

  29. H. Taherian, Z.Q. Wang, J. Chang, D. Wang, Robust Speaker Recognition Based on Single-Channel and Multi-Channel Speech Enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1293–1302 (2020)

  30. L. Mošner, P. Matějka, O. Novotný, J.H. Černocký, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Dereverberation and Beamforming in Far-Field Speaker Recognition (2018). pp. 5254–5258. Publisher: IEEE

  31. N. Zheng, N. Li, B. Wu, M. Yu, J. Yu, C. Weng, D. Su, X. Liu, H. Meng, in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). A Joint Training Framework of Multi-Look Separator and Speaker Embedding Extractor for Overlapped Speech (2021). pp. 6698–6702. Publisher: IEEE

  32. H. Ma, J. Yi, J. Tao, Y. Bai, Z. Tian, C. Wang, Continual Learning for Fake Audio Detection. (2021). ArXiv preprint arXiv:2104.07286

  33. M. Sustek, S. Sadhu, H. Hermansky, in Interspeech 2022. Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech Recognition (2022). pp. 1046–1050. Publisher: ISCA

  34. M. Yang, I. Lane, S. Watanabe, in Interspeech 2022. Online Continual Learning of End-to-End Speech Recognition Models (2022). pp. 2668–2672. Publisher: ISCA

  35. C. He, A.D. Shah, Z. Tang, D.F.N. Sivashunmugam, K. Bhogaraju, M. Shimpi, L. Shen, X. Chu, M. Soltanolkotabi, S. Avestimehr, FedCV: A Federated Learning Framework for Diverse Computer Vision Tasks. (2021). ArXiv preprint arXiv:2111.11066

  36. H. Zhu, J. Wang, G. Cheng, P. Zhang, Y. Yan, in Interspeech 2022. Decoupled Federated Learning for ASR with Non-IID Data (2022). pp. 2628–2632. Publisher: ISCA

  37. Y. Gao, T. Parcollet, S. Zaiem, J. Fernandez-Marques, P.P.B. de Gusmao, D.J. Beutel, N.D. Lane, in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). End-to-End Speech Recognition from Federated Acoustic Models (2022). pp. 7227–7231. Publisher: IEEE

  38. J. Jia, J. Mahadeokar, W. Zheng, Y. Shangguan, O. Kalinli, F. Seide, in Interspeech 2022. Federated Domain Adaptation for ASR with Full Self-Supervision (2022). pp. 536–540. Publisher: ISCA

  39. Y. Gao, J. Fernandez-Marques, T. Parcollet, A. Mehrotra, N. Lane, in Interspeech 2022. Federated Self-supervised Speech Representations: Are We There Yet? (2022). pp. 3809–3813. Publisher: ISCA

  40. N. Tomashenko, S. Mdhaffar, M. Tommasi, Y. Estéve, J.F. Bonastre, in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Privacy Attacks for Automatic Speech Recognition Acoustic Models in A Federated Learning Framework (2022). pp. 6972–6976. Publisher: IEEE

  41. X.C. Li, J.L. Tang, S. Song, B. Li, Y. Li, Y. Shao, L. Gan, D.C. Zhan, in Interspeech 2022. Avoid Overfitting User Specific Information in Federated Keyword Spotting (2022). pp. 3869–3873. Publisher: ISCA

  42. A. Hard, K. Partridge, N. Chen, S. Augenstein, A. Shah, H.J. Park, A. Park, S. Ng, J. Nguyen, I. Lopez-Moreno, R. Mathews, F. Beaufays, in Interspeech 2022. Production federated keyword spotting via distillation, filtering, and joint federated-centralized training (2022). pp. 76–80. Publisher: ISCA

  43. T. Feng, S. Narayanan, in Interspeech 2022, Semi-FedSER: Semi-supervised Learning for Speech Emotion Recognition On Federated Learning using Multiview Pseudo-Labeling (2022) pp. 5050–5054. Publisher: ISCA

  44. F. Granqvist, M. Seigel, R. van Dalen, A. Cahill, S. Shum, M. Paulik, Improving on-device speaker verification using federated learning with privacy. (2020). ArXiv preprint arXiv:2008.02651

  45. Y. Wang, Y. Song, D. Jiang, Y. Ding, X. Wang, Y. Liu, Q. Liao, in Algorithms and Architectures for Parallel Processing. FedSP: Federated Speaker Verification with Personal Privacy Preservation (Cham, 2021), pp. 462–478. Publisher: Springer International Publishing

  46. J.S. Chung, A. Nagrani, A. Zisserman, in Interspeech 2018. VoxCeleb2: Deep Speaker Recognition (2018). pp. 1086–1090. Publisher: ISCA

  47. Y. Fan, J. Kang, L. Li, K. Li, H. Chen, S. Cheng, P. Zhang, Z. Zhou, Y. Cai, D. Wang, in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). CN-CELEB: a challenging Chinese speaker recognition dataset (2020). pp. 7604–7608. Publisher: IEEE

  48. M. Duan, D. Liu, X. Ji, R. Liu, L. Liang, X. Chen, Y. Tan, in 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom). FedGroup: Efficient federated learning via decomposed similarity-based clustering (2021), pp. 228–237. Publisher: IEEE

  49. C. Xu, Y. Qu, Y. Xiang, L. Gao, Asynchronous federated learning on heterogeneous devices: A survey. (2023). ArXiv preprint arXiv:2109.04269

  50. Z. Boudi, A.A. Wakrime, M. Toub, M. Haloua, A deep reinforcement learning framework with formal verification. Form. Asp. Comput. 35(1), 1–17 (2023)

  51. Y. Aït-Ameur, S. Bogomolov, G. Dupont, A. Iliasov, A. Romanovsky, P. Stankaitis, A refinement-based formal development of cyber-physical railway signalling systems. Form. Asp. Comput. 35(1), 1–1 (2023)

  52. A. Radford, J.W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, in International Conference on Machine Learning (ICML). Robust speech recognition via large-scale weak supervision (2023). pp. 28492–28518. Publisher: PMLR

  53. M. Lin, Y. Xu, C. Cai, D. Ke, K. Su, A lattice-transformer-graph deep learning model for chinese named entity recognition. J. Intell. Syst. 32(1), 20222014 (2023)

  54. H. Bride, C.H. Cai, J. Dong, J.S. Dong, Z. Hóu, S. Mirjalili, J. Sun, Silas: a high-performance machine learning foundation for logical reasoning and verification. Expert Syst. Appl. 176, 114806 (2021)


Acknowledgements

We would like to express our gratitude to the various funding agencies that have supported this work.

Funding

This work was supported in part by the National High-Quality Program grant TC220H07D; the National Natural Science Foundation of China (NSFC) under Grant 61871262, 62071284, and 61901251; the National Key R&D Program of China grant 2022YFB2902000; the Key-Area Research and Development Program of Guangdong Province grant 2020B0101130012; and the Foshan Science and Technology Innovation Team Project grant FS0AAKJ919-4402-0060.

Author information

Authors and Affiliations

Authors

Contributions

ZC (the first author) designed the study, performed the experiments, and contributed to the analysis and interpretation of the data. Prof. SX (the corresponding author) contributed to the conception and overall design of the study, as well as supervised the research project. All authors participated in drafting and revising the manuscript, and all authors have read and approved the final version of the manuscript.

Authors’ information

• Zhiyong Chen received his M.Eng. degree in Communication Engineering from Shanghai University, China, in 2021. He is currently pursuing a Ph.D. in Information and Communication Engineering at the same institution. His research interests include speaker recognition, speech recognition, acoustics, and emerging machine learning paradigms.

• Shugong Xu (Fellow, IEEE) received his master’s degree in pattern recognition and intelligent control and a Ph.D. degree in EE from Huazhong University of Science and Technology, China. He is a Professor at Shanghai University, where he was the Founding Head of the Shanghai Institute for Advanced Communication and Data Science. Before joining Shanghai University, he held positions at Intel Labs, Huawei Technologies, Sharp Laboratories of America, and conducted research at the City College of New York, Michigan State University, and Tsinghua University.

Prof. Xu has published over 160 peer-reviewed research papers, holds more than 60 U.S. and China patents, and has made significant contributions to international standards such as IEEE 802.11, 3GPP LTE, and DLNA. His research interests include 6G wireless communication systems, machine learning, pattern recognition, and AI-enabled embedded systems.

In recognition of his work, Prof. Xu was awarded the “National Innovation Leadership Talent” by the China Government in 2013 and elevated to IEEE Fellow in 2015. He also received the 2017 Award for Advances in Communication from the IEEE Communications Society.

Corresponding author

Correspondence to Shugong Xu.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Chen, Z., Xu, S. Learning domain-heterogeneous speaker recognition systems with personalized continual federated learning. J AUDIO SPEECH MUSIC PROC. 2023, 33 (2023). https://doi.org/10.1186/s13636-023-00299-2


Keywords