Learning domain-heterogeneous speaker recognition systems with personalized continual federated learning

Speaker recognition, the process of automatically identifying a speaker based on individual characteristics in speech signals, presents significant challenges when addressing heterogeneous-domain conditions. Federated learning, a recent development in machine learning methods, has gained traction in privacy-sensitive tasks, such as personal voice assistants in home environments. However, its application in heterogeneous multi-domain scenarios for enhancing system customization remains underexplored. In this paper, we propose the utilization of federated learning in heterogeneous situations to enable adaptation across multiple domains. We also introduce a personalized federated learning algorithm designed to effectively leverage limited domain data, resulting in improved learning outcomes. Furthermore, we present a strategy for implementing the federated learning algorithm in practical, real-world continual learning scenarios, demonstrating promising results. The proposed federated learning method exhibits superior performance across a range of synthesized complex conditions and continual learning settings, compared to conventional training methods.


Introduction
Speaker recognition, a critical task in the field of speech processing, involves the automatic identification and verification of speakers based on individual characteristics embedded in speech signals. With the growing ubiquity of voice-controlled devices and systems, speaker recognition has become an essential component for various applications, including security, authentication, and personalized user experiences.
Deep neural networks have become the cornerstone of modern machine learning applications, often requiring large amounts of labeled training data to achieve optimal performance. Traditionally, this data is collected from end-devices, such as smartphones, and sent to a centralized server for model training. However, this approach raises concerns regarding user privacy and the potential burden on communication links due to the transmission of large datasets.
In recent years, researchers in the speaker recognition field have put more focus on learning robust speaker features across multiple conditions [1,2], including different room acoustics, different languages, and different channel conditions, all of which contribute to degraded speaker recognition performance. Much research focuses on using domain adaptation methods to improve system performance in these scenarios. However, many of these approaches need to gather both the target-domain data and the source-domain data in a central data center, which is not only cost-inefficient but sometimes impossible.
Chen and Xu, EURASIP Journal on Audio, Speech, and Music Processing (2023) 2023:33

Federated learning (FL), an emerging machine learning paradigm, has gained significant attention in recent years for its potential to improve privacy and enable collaborative learning among distributed data sources. FL allows multiple clients to jointly train a model without sharing raw data, which can be particularly useful in privacy-sensitive applications. Despite its growing popularity, the application of FL in heterogeneous multi-domain conditions for enhancing system customization in speaker recognition remains relatively unexplored. Federated learning can be broadly categorized into two main types [3]:
• Cross-device federated learning: This method focuses on jointly learning speech characteristics from numerous mobile or similar devices to train a unified statistical model for speaker recognition. In this typical scenario, data is often limited and may have lower labeling quality.
• Cross-silo federated learning: In this setting, organizations like universities can be regarded as remote devices containing substantial student data. These organizations must adhere to strict privacy practices and navigate potential legal, administrative, or ethical constraints to ensure data privacy. Federated learning can be employed in this scenario with relatively more abundant data and better labeling quality, facilitating supervised yet cost-effective training.
Privacy concerns are regarded as one of the major challenges in speaker recognition applications, since such applications involve the complete sharing of speech data, which can have serious implications for user privacy. Federated learning can mitigate privacy infringement in speaker recognition systems by enabling multiple participants to collaboratively learn a shared model without revealing their local data, as recently examined by Lai et al. [4]. Interestingly, an emerging trend in the FL domain involves utilizing federated learning for domain adaptation and personalization, leading to a research area known as personalized FL [5]. This approach eliminates the need for centralized data transmission, storage, and training, making adaptation to diverse and complex client conditions more feasible and reasonable. Another challenge in real-world scenarios involves ever-changing data with limited buffer capacity, prompting research into continual learning and online learning [6].
The main contribution of this work is the application of federated learning techniques to train supervised deep neural network-based speaker recognition models, with the goal of customizing speaker information across multiple heterogeneous domains while preserving user privacy. Unlike previous works on FL in combination with speech and speaker recognition, which mainly focus on privacy-preservation scenarios and simple client collaboration within a single data domain, we concentrate on multi-domain client collaboration and heterogeneous domain adaptation using personalized FL. To achieve this, we simulate iconic acoustic conditions using the room acoustics software Pyroom [7] and select multi-lingual datasets to design and compose client datasets. We also evaluate various personalized training strategies to identify a better approach that outperforms centralized training. Finally, we explore ways to combine FL methods with continual learning techniques, enabling them to function effectively in real-world continual learning scenarios.
In summary, this paper explores the effectiveness of personalized federated learning (PFL) and federated learning combined with continual learning methods within the scope of speaker identification and speaker verification tasks across multiple heterogeneous domains. Our primary contributions can be outlined as follows:
• We propose a speaker recognition system based on personalized federated learning (PFL), leveraging supervised speaker data stored across different silos. By learning client-dependent projection modules, our approach enables better adaptation to various scenarios and demonstrates promising performance in both speaker identification and speaker verification tasks.

The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 describes the proposed model. Section 4 details the experimental setup and presents the results and analysis. Section 5 discusses our future considerations. Finally, Section 6 concludes the paper.

Speaker recognition
Automatic speaker recognition has a rich history, with early methods including probabilistic models, deep neural networks combined with probabilistic models, and end-to-end speaker recognition models [8-11]. Speaker recognition encompasses three sub-tasks: speaker verification, speaker identification [1], and speaker diarization [12]. This paper primarily focuses on the speaker verification and speaker identification tasks. Over the past decade, neural network-based speaker recognition models have achieved superior performance, becoming the dominant approach in the field. The x-vector model, which extracts speaker-related features from acoustic properties using neural networks [9], can be considered a milestone in modern deep neural network speaker recognition. Subsequent years have seen the development of convolutional-based [13], complex 1D temporal neural network-based [10], transformer-based DNN systems [14], and autoML-based systems [15], all of which have contributed to significant progress in speaker recognition.
Following the trend of large-scale training, speaker recognition models that leverage transformer-based pretraining [16,17] have demonstrated impressive improvements over previous deep learning methods.

Domain adaptation and robust speaker recognition
Although speaker recognition systems have demonstrated strong performance on numerous benchmark datasets, recognizing speakers in complex and diverse domains remains a challenging problem. Current methods for addressing domain adaptation issues can be categorized into several groups. Back-end statistical model adaptation techniques [18,19] utilize first- and second-order information from the embedding-space feature distribution to adapt the back-end classification model. These models are generally lightweight [20], resulting in high explainability.
Recently, many works have focused on developing methods that can better learn from different datasets or conditions, employing transfer learning [21] approaches such as domain adversarial learning [22,23] or discrepancy minimization methods [24].Other techniques concentrate on using meta-learning methods to construct domain-agnostic pretrained models from various datasets [25] or adapting parts of the base pretrained model to build better target domain representations [26].
In order to enhance the robustness of speaker recognition models, some researchers utilize data augmentation methods. Notable recent works include using text-to-speech techniques to synthesize fake speakers [27,28], which helps make the speaker model more generalizable. Other approaches involve more sophisticated audio signal processing technologies in the front-end, such as beamforming methods [29], dereverberation techniques [30], and speech separation methods [31], all of which contribute to making speaker recognition systems more robust in varying acoustic conditions.

Continual learning and its application
Continual learning and its related online learning scenarios and methods have been gaining more attention recently [6]. In recent research, several approaches have been proposed to tackle the challenges of continual learning and generalization in various domains, including speaker verification and automatic speech recognition. In [32], the authors propose a continual-learning-based method to incrementally learn new spoofing attacks for speaker verification systems without performance degradation on previous data. Paper [33] presents a dynamically expanding end-to-end model for the speech recognition task, which helps avoid catastrophic forgetting and seamlessly integrates knowledge from new data. Paper [34] focuses on online continual learning for automatic speech recognition and demonstrates the effectiveness of incremental model updates using the online Gradient Episodic Memory (GEM) method.

Federated learning and its application
Federated learning has emerged as a promising technology in the field of machine learning, with significant potential for preserving user privacy [35]. In the speech community, numerous studies have employed federated learning techniques in various applications such as automatic speech recognition [36-40], keyword spotting [41,42], and speech emotion detection [43].
A number of works have also applied federated learning methods to speaker recognition tasks, including [4,44,45]. These studies primarily focus on utilizing federated learning to enhance data privacy and explore the data class distribution properties of each client in non-IID scenarios within the same domain condition.

Learning speaker features with personalized federated learning
The federated learning-based speaker recognition system consists of two learning procedures: client-side fine-tuning and server-side updates. Federated learning allows for distributed training of speaker recognition models across domain-heterogeneous clients. Our proposed personalized federated learning (PFL) system for speaker recognition operates in two primary locations: edge silos (clients) and a central server. The overall architecture of the system, showcasing the server side, the silos' domain conditions, and individual client learning details, is illustrated in Fig. 1.
In Eq. (1), we aim to minimize the global objective function G(θ), which is defined as the weighted sum of local objective functions G_i(θ) across n clients:

G(θ) = Σ_{i=1}^{n} q_i G_i(θ),   (1)

where q_i denotes the weights for aggregating the targets, and θ represents the model parameters.
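As a concrete illustration, the server-side step behind this objective can be sketched in a few lines of plain Python. This is a minimal sketch, not the paper's code; `aggregate` and the toy parameter vectors are our own illustrative names.

```python
def aggregate(client_params, client_weights):
    """Server-side FL update: weighted average of client parameter vectors,
    mirroring the weighted sum of local objectives G_i with weights q_i."""
    n_params = len(client_params[0])
    global_params = [0.0] * n_params
    for params, q in zip(client_params, client_weights):
        for j, p in enumerate(params):
            global_params[j] += q * p
    return global_params

# Toy example: three clients, weights q_i proportional to local data size.
sizes = [100, 300, 600]
weights = [s / sum(sizes) for s in sizes]
clients = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]
avg = aggregate(clients, weights)  # -> [3.1, 6.2]
```

Weighting by local data size is one common choice for q_i; uniform weights are another.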
This formulation illustrates the process of training a centralized model over a distributed dataset, where a multitude of clients hold variable-sized subsets of the data. During each iteration of training, a local model update is computed at the device level and communicated to a central server. Subsequently, the central server combines a large number of these updates or gradients to compute a global update to the central model. This global update is essentially an average of the local updates, which ensures the preservation of privacy and efficient utilization of distributed data.

Fig. 1 Architecture of the proposed multi-domain personalized federated speaker system. The proposed architecture is used in major speaker recognition sub-tasks including speaker verification and identification

Equation (2) demonstrates the core concept of personalized FL:

f_i(x) = Trans_i(Base(x)),   (2)

where Base(•) serves as the model combination, employing a global feature extractor based on x-vectors in our setup, and Trans_i(•) represents a transformer-based personalized feature projector for each domain-specific client i, applied before calculating the final loss for each client. This approach adopts the transfer learning-based personalized FL (TL-PFL) strategy, as described in [5].
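A minimal sketch of this split shows which part of the model each client shares. The names are hypothetical, and the real Base/Trans modules are neural networks, represented here by plain dicts of parameters.

```python
def client_update(global_base, local_trans, local_step):
    """One TL-PFL client round: the shared Base parameters are refined and
    returned to the server for aggregation, while the personalized Trans
    projector never leaves the client."""
    base = dict(global_base)    # start from the broadcast global model
    trans = dict(local_trans)   # client-private personalized module
    base, trans = local_step(base, trans)
    return base, trans          # only `base` would be sent to the server

# Toy local training step: shrink every parameter by 10%.
def toy_step(base, trans):
    return ({k: 0.9 * v for k, v in base.items()},
            {k: 0.9 * v for k, v in trans.items()})

new_base, new_trans = client_update({"w": 1.0}, {"p": 2.0}, toy_step)
```

The design point is that only the Base parameters participate in the server-side average, while each Trans_i stays specialized to its silo's domain.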
Given a speaker embedding from Base(•), serialized as X = {x_1, x_2, ..., x_n}, the transformer encoder first applies a positional encoding to the input embeddings, which allows the model to utilize positional information. The encoded sequence Z^0 = {z^0_1, z^0_2, ..., z^0_n} is then fed into the first layer of the encoder. Each layer l computes the output sequence Z^l = {z^l_1, z^l_2, ..., z^l_n}. In the transformer encoder, the Multi-Head Attention mechanism is denoted as MultiHead(Q, K, V), where Q, K, and V represent the query, key, and value matrices, respectively, serving as the inputs for this mechanism.
For each layer l of the transformer encoder:

Z'^l = LayerNorm(Z^{l-1} + MultiHead(Z^{l-1}, Z^{l-1}, Z^{l-1})),
Z^l = LayerNorm(Z'^l + FFN(Z'^l)),

where FFN represents the position-wise feed-forward network, and LayerNorm denotes layer normalization.
As illustrated in Fig. 1, the final output embedding for evaluation is produced by two branches, namely the Base Feature (referred to as Type-A, Discriminator-projection) and the Projection Feature (referred to as Type-B, Feature-projection), which can be used individually or in combination. The embedding is used directly for cosine similarity comparison in speaker verification tasks, following [10]. For speaker identification tasks, a separate logistic regression-based classifier is learned for each speaker in each evaluation subset:

c_id = argmax(W_id^{(i)T} x_emb),

where c_id represents the identification output class ID, W_id^{(i)T} denotes the learnable weights for the i-th evaluation subset, and x_emb is the generated speaker embedding.
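Both evaluation paths can be illustrated with a small pure-Python sketch using illustrative names: cosine scoring for verification, and an argmax over per-subset linear classifier scores for identification.

```python
import math

def cosine(u, v):
    """Cosine similarity between two speaker embeddings (verification score)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def identify(W_id, x_emb):
    """Identification: argmax over linear scores, one weight row per speaker."""
    scores = [sum(w * x for w, x in zip(row, x_emb)) for row in W_id]
    return max(range(len(scores)), key=scores.__getitem__)

enroll = [1.0, 0.0]              # enrolled speaker embedding
test_emb = [0.9, 0.1]            # test-utterance embedding
sim = cosine(enroll, test_emb)   # high score -> same-speaker decision
cid = identify([[1.0, 0.0], [0.0, 1.0]], test_emb)
```

Verification thresholds the cosine score; identification simply picks the class with the highest linear score.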

Heterogeneous domain continual personalized federated learning
We present the heterogeneous-domain across-silo continual personalized federated learning (C-PFL) method, which combines the principles of continual learning and federated learning. Our approach is specifically designed to address the challenges encountered when new data from different stages become available, all while preserving the aforementioned FL mechanism across diverse silos.
The key idea behind our method is to dynamically update the output classifier parameters with new weights when data from new stages arrive. This is achieved by continually adapting the model to incorporate the information from the newly acquired data without causing significant interference with the previously learned knowledge.
In the proposed C-PFL method, given a speaker embedding x ∈ R^n, the output probability for each class is estimated as

P(c | x) = σ(Wx + b)_c,

where c ∈ N_0 is the non-negative label ID for the speaker embedding, σ(•) is the Softmax function, W ∈ R^{C×n}, and b ∈ R^C. Given that we have T continual learning stages, each with a speaker set c_t, we configure C to be much larger than the total number of speakers across all stages, where c_ti is the assigned label for speaker embedding i in the t-th continual learning stage. As new data arrive at different stages, the output classifier assigns new weights specifically to the classes in these new stages with high probability, provided C is set large enough; we denote by P_new(t) the probability of assigning previously unused classes at continual learning stage t. Let Φ(t) denote the classifier weight vector sets (also called prototypes) in Eq. (8) used at stage t. Equation (14) holds true when the condition in Eq. (9) is met. In the initial learning stage, we employ weights Φ(1). As new data becomes available for subsequent stages, we introduce additional weights Φ(2), and so forth. This method enables the model to learn continually from new data without experiencing catastrophic forgetting of previously acquired knowledge. The detailed procedures for PFL and C-PFL are shown in Algorithm 1 and Algorithm 2. This approach, named "C-PFL with enhanced training strategy," is contrasted with a simpler method that employs Eq. (8) without random prototype casting, which we designate as "C-PFL with standard training strategy," to emphasize the differences between the two strategies.
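The random prototype casting idea can be sketched as follows. This is our own minimal reading of the strategy, not the paper's implementation: each new stage draws its class slots at random from a deliberately oversized output layer of C prototypes, so collisions with earlier stages are improbable.

```python
import random

def cast_prototypes(n_new, C, rng):
    """Draw n_new distinct prototype indices uniformly from [0, C)."""
    slots = set()
    while len(slots) < n_new:
        slots.add(rng.randrange(C))
    return slots

rng = random.Random(0)
C = 100_000
stage1 = cast_prototypes(100, C, rng)   # stage-1 speaker prototypes
stage2 = cast_prototypes(100, C, rng)   # stage-2 speakers drawn independently

# Approximate probability that stage 2 reuses no stage-1 prototype:
p_new = (1 - len(stage1) / C) ** 100    # ~0.905 for C = 100,000
```

The larger C is relative to the cumulative speaker count, the closer p_new gets to 1, which is why the enhanced strategy can add stages without overwriting old prototypes.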

Silo acoustic simulation
To evaluate the PFL and C-PFL methods in scenarios involving multiple domain clients, we simulated room acoustic settings in our experiments using the Pyroom Acoustics library. Six distinct room configurations were employed to thoroughly examine the performance of our models under challenging acoustic conditions. As illustrated in Fig. 2, these room settings encompassed a diverse range of sizes, shapes, and reverberation characteristics, thereby providing a comprehensive evaluation framework. In each room, we tested the models using both single omni-directional microphones and omni-directional microphone arrays. This approach enabled us to investigate the impact of different microphone configurations on the overall system performance, as well as assess the robustness of our models in coping with diverse real-world scenarios.
The first three rooms are designed with varying sizes to simulate different room reverberation properties, representing small, medium, and large rooms. The left part of Fig. 3 demonstrates the RT60 metrics of each room, showcasing their distinct reverberation characteristics. The fourth room includes a noise source alongside the sound source to simulate noisy room conditions.
Our room simulations also incorporate delay-and-sum (DAS) beamforming to emulate the performance of real-world multi-channel microphone array acoustic environments and evaluate how our systems perform under these common conditions. The right part of Fig. 3 depicts the beamforming settings for Room5 and Room6, presenting the beam patterns across different frequency bands. Room6 additionally includes a noise source, as in Room4. The delay-and-sum beamforming algorithm is applied in these two acoustic scenarios. Let x_i(t) denote the signal received by the i-th microphone in these rooms, where i = 1, ..., M, and M is the total number of microphones. The time delay τ_i is calculated for each microphone based on the desired direction of the beam, as expressed below:

τ_i = d_i / c,

where d_i represents the difference in distance between the source-to-i-th-microphone path and the source-to-reference-microphone path, and c is the speed of sound. The output signal y(t) is then computed by summing the delayed signals from all microphones to enhance the preset direction:

y(t) = Σ_{i=1}^{M} x_i(t + τ_i).
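In discrete time the same operation takes only a few lines. This is a sketch with whole-sample integer delays and an averaged output; real systems use fractional delays and interpolation.

```python
def delay_and_sum(signals, delays):
    """DAS beamformer: advance channel i by its steering delay tau_i
    (here in whole samples) and average across the M microphones."""
    M, L = len(signals), len(signals[0])
    out = []
    for t in range(L):
        acc = 0.0
        for i in range(M):
            s = t + delays[i]               # compensate propagation delay
            acc += signals[i][s] if 0 <= s < L else 0.0
        out.append(acc / M)
    return out

# Two microphones; the impulse reaches mic 1 one sample later than mic 0.
ch0 = [0.0, 1.0, 0.0, 0.0]
ch1 = [0.0, 0.0, 1.0, 0.0]
steered = delay_and_sum([ch0, ch1], [0, 1])   # delays align the impulse
unsteered = delay_and_sum([ch0, ch1], [0, 0])
```

With the correct steering delays the impulse sums coherently to full amplitude; without them the channels average destructively.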

Experiment settings
We utilize simulated room acoustics scenarios to assess the federated learning system and its associated algorithms in the context of speaker recognition tasks. The evaluation specifically targets both speaker verification and speaker identification tasks across multiple heterogeneous domain groups, and we also investigate the performance in continual learning settings. The configurations can be found in Table 1. To build the speaker recognition system with personalized federated learning, we use the VoxCeleb [46] and CnCeleb [47] datasets. We select 100 speakers for each group, ensuring no overlap between groups. Each group simulates domain conditions based on the predefined settings described in the table, enabling a comprehensive assessment of the system's performance across various acoustic environments and languages. We configure Groups 1 through 6 to use the settings of Rooms 1 through 6 with the English-language VoxCeleb dataset, while Groups 7 through 12 employ the same room settings but with the Chinese-language CnCeleb dataset.
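The group design can be summarized programmatically. The field names and the None placeholders below are ours; the authoritative parameters live in Table 1.

```python
# Room traits recoverable from the text; unspecified values left as None.
ROOMS = {
    1: {"size": "small",  "noise": False, "mic": "single"},
    2: {"size": "medium", "noise": False, "mic": "single"},
    3: {"size": "large",  "noise": False, "mic": "single"},
    4: {"size": None,     "noise": True,  "mic": "single"},
    5: {"size": None,     "noise": False, "mic": "array"},
    6: {"size": None,     "noise": True,  "mic": "array"},
}

def make_groups():
    """Groups 1-6: English VoxCeleb; groups 7-12: Chinese CnCeleb;
    both reuse room settings 1-6, with 100 disjoint speakers per group."""
    groups = {}
    for g in range(1, 13):
        groups[g] = {
            "room": (g - 1) % 6 + 1,
            "dataset": "VoxCeleb (English)" if g <= 6 else "CnCeleb (Chinese)",
            "n_speakers": 100,
        }
    return groups

groups = make_groups()
```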
For system evaluation, speaker IDs are selected according to the table for both verification and identification tasks. The evaluation speakers for each group remain consistent across all groups, with their speech processed under the corresponding group's domain conditions. In the continual learning settings, we additionally allocate Stage 2 and Stage 3 data using distinct speakers from VoxCeleb, while keeping the evaluation set consistent across all stages to evaluate the performance of the proposed systems during each stage.
Table 1 also documents detailed settings for each group, encompassing room dimensions, reverberation time, microphone array configurations, and other relevant parameters.
We utilize the VoxCeleb2 dataset to pre-train the x-vector model for the canonical model, following the same procedure as in [10]. Subsequently, we conduct further experiments with PFL, C-PFL, and other baseline algorithms based on this pre-trained model.

Performance evaluation of personalized federated learning
In Table 2, we present the evaluation results for various models across evaluation groups 1 to 12, along with their mean values. Our findings indicate that the Canonical model, following the classical training procedure of [10], demonstrates satisfactory performance in certain domain groups that exhibit similar acoustic conditions to the original dataset. However, its performance is limited in many other groups characterized by distinct acoustic conditions. The centralized training model, incorporating a transformer block similar to the architecture of [14], exhibits a substantial improvement over the canonical models, as it leverages the domain information collectively within the central server. In contrast, the separated training procedure, employing a strategy similar to [26], does not consistently yield better performance than the canonical model. This highlights the impact of limited data for fine-tuning, which may lead to degraded performance compared to the original models.
The personalized federated learning strategy outperforms all other methods in these domain-agnostic scenarios, achieving the lowest mean EER among all approaches. The improvement is particularly significant in groups with rare acoustic conditions that deviate considerably from the original training data. This finding underscores the effectiveness of the PFL strategy in addressing speaker recognition tasks across diverse acoustic environments.

In Table 3, we evaluate various personalized federated learning (PFL) strategies, comparing the classical direct-discriminator approach, as used in [4], to the proposed personalized projection-based methods. Both methods yield promising results, with PFL achieving a better mean EER using the discriminator-projection strategy (PFL Type-A). Therefore, we further assess these two techniques by examining their convergence capabilities, as illustrated in Fig. 4. Utilizing a personalized discriminator demonstrates significantly better convergence performance, achieving the lowest EER within just 10 rounds, whereas the direct-discriminator approach requires approximately 25 rounds. This showcases the efficiency and effectiveness of the PFL method.
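For reference, the EER metric reported throughout can be computed from raw verification scores with a simple threshold sweep. This is a self-contained sketch, not the paper's evaluation code.

```python
def eer(target_scores, nontarget_scores):
    """Equal error rate: sweep the threshold over all observed scores and
    return the operating point where FAR and FRR are closest."""
    best_gap, point = float("inf"), 1.0
    for thr in sorted(target_scores + nontarget_scores):
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, point = abs(far - frr), (far + frr) / 2
    return point

rate = eer([0.9, 0.8, 0.7, 0.6], [0.5, 0.4, 0.65, 0.3])  # -> 0.25
```

Production toolkits interpolate between thresholds; the discrete sweep above is enough to illustrate the metric.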
Moreover, we investigate the effects of various FL strategies on overall performance. Utilizing a feature-projection strategy (PFL Type-B) does not lead to better outcomes, indicating that the limited data available for fine-tuning the transformer-based projector presents challenges in creating a more efficient feature space. These insights emphasize the strengths and limitations of different personalized FL strategies when addressing speaker recognition tasks in diverse acoustic environments.

Speaker identification task evaluation
We further evaluate the Personalized-FL (PFL) model by comparing its performance on the speaker identification task. The results are presented in Table 4. We compare various basic classification metrics across the 12 evaluation domain sets, including accuracy, precision, F1-score, and AUC. The corresponding micro-average ROC curves for each domain group are also assessed, as depicted in Fig. 5.
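The AUC values in such tables can be reproduced from raw scores via the rank-statistic definition; the following is an illustrative sketch, not the paper's evaluation code.

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive example outranks a negative
    one (Mann-Whitney U statistic; ties count as half a win)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

a = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])  # 8 of 9 pairs ranked correctly
```

Micro-averaging, as used for the ROC curves in Fig. 5, simply pools the per-class scores before applying this computation.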
Our analysis reveals that using our proposed PFL strategy leads to significantly better results than centralized training, with the exception of some domain groups that have relatively more common and less challenging acoustic conditions. In these cases, both methods perform similarly. The mean ROC curve in Fig. 5 further illustrates this trend, providing a comprehensive visualization of the PFL strategy's effectiveness across diverse domain groups. Overall, the PFL model demonstrates superior performance in the majority of domain groups, showcasing its potential for enhancing speaker identification tasks in various acoustic environments.

Evaluation of continual personalized federated learning
Table 5 presents the performance of continual personalized federated learning (C-PFL). We evaluate the model across all episodes and stages throughout the three continual learning stages. The mean EER decreases at the end of each stage, indicating enhanced performance. Furthermore, the standard deviation decreases, signifying a more uniform improvement across all evaluation domain groups. This trend underscores the robustness of the C-PFL method in adapting to various domain datasets and demonstrates its potential for real-world continual learning applications. The model effectively generalizes to new knowledge while avoiding catastrophic forgetting.

In Fig. 6, we compare the performance of the standard C-PFL strategy with our enhanced C-PFL strategy across each stage of the learning process. The standard strategy appears to struggle with retaining previously acquired knowledge and generalizing it to future learning stages. In contrast, the enhanced strategy demonstrates superior performance in continual learning scenarios.
This improvement can be attributed to the design of the proposed C-PFL method, which effectively balances the retention of prior knowledge with the acquisition of new information by randomly shifting the weight designation in the output classifier of the PFL module. As a result, our enhanced strategy exhibits a more stable learning curve, ensuring consistently better performance across various stages and domain datasets. This finding highlights the advantages of employing the enhanced C-PFL strategy in real-world continual learning speaker recognition applications, where knowledge retention and generalization are critical for achieving reliable and robust performance.
In Fig. 7, we compare the impact of different training data incoming sequences on the continual learning process to assess the influence that varying data sequences may have on the results. We examine two scenarios: one where we use Stage 2 datasets first, followed by Stage 3 datasets, and another where we reverse the order. Our findings demonstrate that the proposed strategy remains effective regardless of the data sequence used, as evidenced by the EER plots in the figure. In both cases, the EER on evaluation sets consistently improves at each stage, demonstrating that the system effectively leverages the knowledge acquired from previous stages to enhance subsequent training, regardless of the data incoming sequence. This observation underscores the robustness of the proposed C-PFL method.

Complementary experiments
We perform supplementary experiments to provide a deeper understanding and analysis of the characteristics of our proposed PFL method.
Figure 8 displays the loss tracking for various selected groups of our training data when implementing our proposed PFL Type-A system. Different groups necessitate varying numbers of training steps within a single combination episode, causing some clients to wait for others to finish their training.
At the beginning of each episode, the loss initially increases to a local peak but converges rapidly compared to the previous episode, resulting in a lower loss value. This behavior illustrates the successful knowledge sharing and transfer between clients at each episode, enabling effective learning and adaptation across clients from multiple heterogeneous domains. The moving average mean and the standard deviation range shown in the plot indicate that all groups achieve good convergence within several hundred steps. This consistent convergence across various groups highlights the effectiveness of our proposed PFL approach in handling diverse data and domain conditions.

Figure 9 presents the t-SNE visualization of the embedding space for both personalized federated learning (PFL) and centralized training strategies. Each point represents an embedding from an utterance, with each cluster corresponding to a speaker class, while different colors indicate speakers from distinct domain client groups. It is evident that both PFL and centralized training methods encounter challenges in distinguishing some speakers and forming well-defined clusters for certain speaker IDs. This difficulty arises due to the severe interference caused by the diverse room acoustic conditions experienced by some groups. Nevertheless, a significant difference between the two methods can be observed. Centralized training seems to blend domain information with speaker class information, resulting in a decline in performance compared to PFL training. This confusion is particularly evident in the highlighted area of the t-SNE plot, where the speaker classes and domain groups are evidently entangled. In contrast, the PFL strategy exhibits better separation of speaker classes and domain information, leading to improved overall performance in speaker recognition tasks. This can be attributed to the personalized module constraining domain-specific information within a single client's training through feature projection, thus avoiding interference with the global model training.
Consequently, PFL inherently emerges as a suitable method to effectively address the challenges posed by complex room acoustics and diverse domain conditions, ultimately enhancing the robustness and generalizability of speaker recognition systems.
Figure 10 presents the results of a common problem in federated learning where, at each combination time point, not all clients participate in the aggregation process. We investigate the effects of varying combination ratios for the proposed PFL system, including 1.0 (normal case), 0.7, and 0.3. As the combination ratio decreases, we observe a decline in performance, yet all scenarios still outperform the centralized training baseline (represented by the grey dashed line); the minimum EER reached under each ratio is marked by the dotted lines. These findings emphasize the importance of the combination ratio in achieving optimal performance with FL. Despite the performance degradation caused by a lower combination ratio, the proposed federated learning method remains effective, consistently outperforming the centralized training method. Although the performance is reasonably good compared to the baseline, the influence of partial combination, unsynchronized combination, and the waiting time for clients on PFL still remains an open problem, requiring further research.
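Partial participation itself is straightforward to emulate. The following is a sketch; the uniform-random selection policy is our assumption, not necessarily the paper's.

```python
import random

def sample_clients(n_clients, ratio, rng):
    """Choose the subset of silos that joins this aggregation round.
    ratio=1.0 reproduces the normal case; 0.7 and 0.3 mimic the
    partial-combination settings studied above."""
    k = max(1, round(n_clients * ratio))
    return sorted(rng.sample(range(n_clients), k))

rng = random.Random(42)
full = sample_clients(12, 1.0, rng)   # all 12 domain silos participate
part = sample_clients(12, 0.3, rng)   # roughly a third participate
```

In each round, only the sampled clients would upload base-model updates for aggregation, which is what produces the degradation curves at lower ratios.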

Discussion and future considerations
In the present study, we have introduced a novel personalized federated learning (PFL) system, engineered explicitly for speaker recognition tasks across various heterogeneous domains. The system has shown considerable advancements over existing baseline models. However, several aspects necessitate further exploration and discussion.

Future research and application
Moving forward, we aim to delve deeper into more advanced federated learning algorithms, such as those employing clustering-based FL methods [48]. Additionally, we intend to explore more pragmatic on-device learning algorithms, specifically those utilizing unsupervised streaming audio data. Our future research will also concentrate on crafting federated learning algorithms tailor-made for speaker diarization systems [12].

System design and reliability
The development of FL algorithms is intrinsically linked to system design; therefore, their integration into real-world systems demands careful consideration of numerous factors. Apart from the FL combination ratios discussed earlier, one significant element to consider is the implementation of asynchronous federated systems [49]. These systems cater to variations in client device timing prior to the aggregation in each training round, addressing potential degradation in model accuracy due to disparities in device resources.
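One common way such asynchronous systems absorb timing variation is to blend each arriving client update into the global model with a weight that decays with the update's staleness. The sketch below illustrates this general idea only; the exact rule used by the systems in [49] may differ, and the function names and the base rate `alpha` are our assumptions.

```python
def staleness_weight(current_round, client_round, alpha=0.6):
    """Base mixing rate alpha, shrunk by how stale the client's update is."""
    staleness = max(0, current_round - client_round)
    return alpha / (1.0 + staleness)

def async_update(global_w, client_w, current_round, client_round):
    """Blend a (possibly late) client update into the global weights."""
    a = staleness_weight(current_round, client_round)
    return [(1.0 - a) * g + a * c for g, c in zip(global_w, client_w)]

# A fresh update (staleness 0) moves the model further than a stale one
# (staleness 3), so slow devices still contribute without dominating.
fresh = async_update([0.0, 0.0], [1.0, 1.0], current_round=5, client_round=5)
stale = async_update([0.0, 0.0], [1.0, 1.0], current_round=5, client_round=2)
```

Under this scheme the server never waits for stragglers: it applies each update on arrival, discounted by its age, which directly targets the timing disparities mentioned above.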
As we contemplate incorporating these features into tangible systems, the emphasis on system reliability intensifies. To mitigate reliability concerns, we can apply insights from Boudi et al.'s 2023 study [50], which showcases a robust and error-free career agent created using deep reinforcement learning in tandem with formal verification. Moreover, the methodology presented by Ait-Ameur et al. (2023) [51] provides an excellent illustration of employing formal methods to handle complex cyber-physical systems. By applying the Event-B formal specification language to model, verify, and refine such systems, we are well equipped to design and verify the reliability of our FL-based speaker recognition system, accounting for variables such as device training latency and transmission latency and ensuring the system meets its specified criteria.

Significance to AI, speech, and NLP fields
Our research contributes to the wider fields of artificial intelligence (AI), speech, and natural language processing (NLP). In these research domains, one prevailing trend involves the centralized training and pre-training of large-scale models for speech and NLP tasks, such as the Whisper model for speech recognition developed by OpenAI [52]. These models primarily aim to enhance generalization across diverse scenarios. Another emerging trend is the development of models that emphasize customization and personalization, as exemplified by Lin et al. [53], which cater to the unique complexities of Chinese language processing to improve NLP tasks.
We posit that by applying federated learning, starting from a well-pretrained model, we can concurrently enhance the model's generalization and personalization capabilities. This process is expedited and made seamless by the continual integration of widespread, privacy-sensitive data that would normally be unavailable or challenging to manage. Notably, in line with recent trends, the future integration of high-performance and explainable AI, as underscored by Bride et al. [54], presents an intriguing prospect for further enhancing our system's robustness and transparency. Consequently, this combined approach paves the way for the steady evolution of our model, advancing towards our aspiration of crafting a robust, explainable, and lifelong learning system.

Conclusion
This study has demonstrated the effectiveness of personalized federated learning in the field of speaker recognition. By facilitating the training of speaker recognition models using supervised speaker data stored on various heterogeneous domain silos, the proposed system has exhibited promising results in both speaker verification and speaker identification tasks. Moreover, the learning of the client-dependent personalized module has allowed for better adaptation to diverse scenarios.
Our simulations and evaluations, based on room acoustics software, have underscored the advantages of using PFL in heterogeneous domain adaptation scenarios. Comparisons between various federated learning methods, centralized training, and separated training have revealed that PFL outperforms the other methods, a finding not reported in previous research. Furthermore, PFL can be effectively integrated with continual learning settings, with the continual personalized federated learning (C-PFL) method showcasing strong performance as the training stages progress. By employing a random prototype casting training strategy, the C-PFL with the enhanced strategy has proven advantageous.
In summary, the findings of this study endorse the adoption of personalized federated learning as a valuable approach to speaker recognition tasks. This approach offers performance improvements in heterogeneous domain adaptation compared to classical transfer learning methods for speaker recognition. Additionally, it presents a new way of learning in a continual, collaborative manner that brings valuable traits such as data security, training-cost savings, and adaptability to ever-changing data in real-world scenarios.

Fig. 2
Fig. 2 The room acoustic settings simulated with Pyroomacoustics. The six room settings used in the experiments are shown in the figure. Rooms use either a single omnidirectional microphone or omnidirectional microphone arrays

Fig. 3 Table 1
Fig. 3 Detailed room simulation information: the first three figures display the RT60 properties of the small, large, and medium rooms in our simulation. The last two figures illustrate the beamforming settings in Room5 and Room6, as well as the beam patterns across different frequency bands

Fig.
Fig. Comparing the error rate between the FL-classical and the PFL Type-A training stages. It is worth noting that we use the same evaluation sets across all training stages, which serves as a good way to assess how different learning methods adapt to ever-changing new data and are influenced by the deletion of previous data from storage. Remarkably, our C-PFL method effectively integrates information from the data while enabling knowledge generalization and preventing catastrophic forgetting. Upon examining the results, we observe a consistent improvement in the mean EER during the training

Fig. 5
Fig. 5 Receiver operating characteristic (ROC) curves and area under the curve (AUC) values for multi-class speaker identification using Models A (centralized training) and B (PFL Type-A). The plot shows individual ROC curves for each of the 12 groups, with different shades of blue for Model A and different shades of orange for Model B. The mean ROC curves for Models A and B are highlighted with thicker dashed lines. The AUC values for each curve are included in the legend

Fig. 6
Fig. 6 Comparison of error rates between C-PFL with and without enhanced training strategy, and the reference performance of single-stage non-continual PFL training

Fig. 7 Fig. 8
Fig. 7 Comparison of error rates for C-PFL with different training data input sequences

Fig. 9
Fig. 9 Comparing the embedding plot between the PFL and centralized training

Table 2
Evaluation performance of PFL and the baseline

Table 3
Comparing performance of different training strategies of PFL

Table 4
Evaluation performance of centralized training and PFL training on the speaker identification task

Table 5
Evaluation performance of C-PFL in continual learning settings