U2-VC: one-shot voice conversion using two-level nested U-structure

Voice conversion transforms the voice of a source speaker into that of a target speaker while keeping the linguistic content unchanged. Recently, one-shot voice conversion has become a hot topic for its potentially wide range of applications: it can convert the voice of any source speaker to any target speaker even when both are unseen during training. Although great progress has been made in one-shot voice conversion, the naturalness of the converted speech remains a challenge. To further improve naturalness, this paper proposes a two-level nested U-structure (U2-Net) voice conversion algorithm called U2-VC. The U2-Net can extract both local features and multi-scale features from the log-mel spectrogram, which helps to learn the time-frequency structures of the source and target speech. Moreover, we adopt sandwich adaptive instance normalization (SaAdaIN) in the decoder for speaker identity transformation, retaining more content information of the source speech while maintaining the speaker similarity between the converted and target speech. Experiments on the VCTK dataset show that U2-VC outperforms many state-of-the-art (SOTA) approaches, including AGAIN-VC and AdaIN-VC, in terms of both objective and subjective measurements.


Introduction
It is well known that speech is composed of four components: timbre, rhythm, pitch, and content. As stated in [1], the content represents the linguistic part of the speech, and the timbre represents the speaker identity. Voice conversion (VC) aims to convert the timbre while keeping the linguistic content unchanged. This technique can be applied in many fields, such as pronunciation assistance [2,3], personalized speech synthesis [4,5], and even dubbing.
The existing voice conversion systems can be roughly divided into parallel voice conversion [6][7][8][9] and nonparallel voice conversion [10][11][12][13][14][15][16][17], depending on whether the model is trained on paired data.

Different from U-Net [24], U2-Net consists of many residual U-blocks (RSU) with different layers [23]. Each residual U-block can extract both local features and multi-scale features from the input image to preserve more detailed information about the salient object. For voice conversion, if one regards the log-mel spectrogram as a 2D image, the speech spectrogram can be regarded as the salient object of this "2D image," and the residual U-block can be adopted to extract local and multi-scale features from the input speech to preserve more detailed information, which can be expected to improve the naturalness of the converted speech. This motivates us to design a voice conversion algorithm based on U2-Net. Inspired by AGAIN-VC [22], the sigmoid function is adopted as the activation guidance at the end of the encoder to guide the content embedding, so that more content information can be learned. Moreover, sandwich adaptive instance normalization (SaAdaIN) [25], first proposed for neural style transfer to reduce content loss during the transformation process, is adopted for speaker identity transformation in the decoder to retain more content information of the source speech while simultaneously keeping the speaker similarity between the converted speech and the target speech. To the best of our knowledge, this is the first one-shot voice conversion algorithm built on U2-Net and SaAdaIN. Objective evaluations, including mel-cepstral distortion (MCD) [26] and the NISQA model [27], and subjective listening tests with mean opinion score (MOS) show that the proposed approach outperforms many state-of-the-art (SOTA) approaches such as AdaIN-VC [21] and AGAIN-VC [22].
To validate the robustness of the proposed approach, we also perform experiments in cross-lingual scenarios, where the results also verify the better performance of the proposed approach.

One-shot voice conversion
One-shot voice conversion can be achieved by decoupling content and speaker identity with a content encoder and a speaker encoder, respectively. AUTOVC [19] uses a pre-trained speaker encoder with a generalized end-to-end loss [28] and a content encoder with a carefully designed information bottleneck to limit the leakage of the source speaker's identity information. However, the pre-trained speaker encoder might affect the robustness of the system because it is trained only for speaker verification, and the information bottleneck with hard sampling may cause content information loss, which makes the converted speech sound unnatural. VQVC+ [20] jointly uses vector quantization (VQ) [29] and U-Net [24] to extract content information and improve reconstruction simultaneously. Although VQVC+ performs well on speaker conversion, the naturalness still falls short because of content information loss. AdaIN-VC [21] is the first one-shot voice conversion approach to perform speaker transformation through adaptive instance normalization (AdaIN) [30]. Although AdaIN can separate the content information and the speaker information well, the perceptual quality of the converted speech, e.g., its naturalness, remains unsatisfactory. Very recently, AGAIN-VC [22] was proposed with a single encoder for encoding both content and speaker identity, where a sigmoid function was added at the end of the encoder as an information bottleneck to prevent the content embedding from leaking speaker information. Although the quality of its converted speech is better than that of many existing algorithms, naturalness is still a problem because harmonic distortion often occurs in the converted speech.
As mentioned above, both content information loss and harmonic distortion degrade the converted speech. Recently, several methods have been proposed to improve the naturalness of the converted speech [13,[31][32][33]. Kwon et al. [31] use an attention mechanism to modify the information bottleneck structure to preserve more linguistic information, which prevents content loss in the converted speech. CycleGAN-VC3 [13] uses time-frequency adaptive normalization (TFAN) to reduce the harmonic distortion of the converted speech and make it sound more natural. Text-to-speech (TTS) [32,33] and automatic speech recognition (ASR) [33] techniques have also been introduced to overcome mispronunciation in the converted speech. In this paper, the proposed U2-VC improves the naturalness of the converted speech by extracting multi-scale features through the newly designed 1-2-1 residual U-blocks and by using SaAdaIN to retain more content information during speaker identity transformation.

U2-Net
U2-Net is a two-level nested U-structure [23] originally proposed for salient object detection (SOD). Residual U-blocks (RSU) with different numbers of layers (L) are its main components, extracting features at different scales. Each RSU has the same structure: an input convolution layer to extract local features, a U-Net-like encoder-decoder block with L layers to extract multi-scale features from the local features, and a residual connection to fuse the local and multi-scale features by summation. This design allows U2-Net to extract more details from the input features. As demonstrated in [23], U2-Net performs well on SOD. Because preserving more harmonic components of the converted speech is important for improving speech quality, we introduce U2-Net to the one-shot VC task, where we verify that it improves the naturalness of the converted speech.
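The three-part RSU recipe above (input convolution, L-level encoder-decoder, residual summation) can be made concrete with a minimal numpy sketch. The smoothing kernel, average pooling, and nearest-neighbour upsampling below are stand-ins for the real convolution, downsampling, and upsampling layers; only the block structure is meant to be faithful, and all names are ours.

```python
import numpy as np

def toy_conv(x):
    # stand-in for a ReBNConv layer: a 3-tap smoothing "local feature extractor"
    k = np.array([0.25, 0.5, 0.25])
    return np.convolve(x, k, mode="same")

def down(x):
    # 2x average pooling (toy downsampling)
    return x.reshape(-1, 2).mean(axis=1)

def up(x, n):
    # nearest-neighbour upsampling back to length n
    return np.repeat(x, 2)[:n]

def toy_rsu(x, L=3):
    """Minimal residual U-block: local feature + L-level U-shaped
    encoder-decoder, fused by summation (the residual connection)."""
    local = toy_conv(x)                 # input convolution layer
    feats, h = [], local
    for _ in range(L):                  # encoder: progressively downsample
        feats.append(h)
        h = down(toy_conv(h))
    for f in reversed(feats):           # decoder: upsample and merge skip features
        h = toy_conv(up(h, len(f)) + f)
    return local + h                    # residual fusion of local + multi-scale

x = np.random.randn(64)
y = toy_rsu(x)                          # same length as the input
```

Because every decoder level upsamples back to the matching encoder length, the output keeps the input's shape, which is what lets the residual summation work.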

Proposed approach
For voice conversion, the goal is to design a system that converts a source speaker's voice to the target one while keeping the source linguistic content unchanged, which can be represented as

X_{1→2} = C(X_1, X_2),

where X_1 denotes the log-mel spectrogram of the source speech, X_2 denotes that of the target speech, C represents the nonlinear mapping function, and X_{1→2} denotes the log-mel spectrogram of the converted speech. Figure 1 plots the overall structure of the proposed U2-VC, which consists mainly of three parts: an encoder that disentangles content information from speaker identity information through instance normalization (IN), a decoder that performs speaker identity transformation through sandwich adaptive instance normalization (SaAdaIN), and an output module that generates the final log-mel spectrogram of the converted speech. In the encoder, the source log-mel spectrogram X_1 is passed through IN layers following the 1-2-1 residual U-blocks and the sigmoid function to eliminate its speaker information and generate the content embedding. Meanwhile, a skip-connection structure passes the speaker embedding (μ and σ, computed from the target log-mel spectrogram X_2) of each IN layer to the paired SaAdaIN layer in the decoder for speaker identity transformation. The output module serves the same purpose as the saliency map fusion module of U2-Net [23]. First, the side output log-mel spectrograms of the converted speech, X^i_{1→2} with i = 1, ..., 6, are generated by the generation blocks following the 1-2-1 RSU blocks in the decoder. Each generation block consists of two GRU layers and a linear layer. The output module then reshapes the side output log-mel spectrograms from 2D to 1D and fuses them with a concatenation operation followed by a 1×1 convolution layer. Finally, a reshaping operation generates the final log-mel spectrogram of the converted speech.
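The data flow just described can be mimicked in miniature: IN strips the source's channel-wise statistics, a sigmoid bounds the content embedding, and the target's statistics (μ, σ) re-style the content. The numpy sketch below is illustrative only; the function names are ours, and the real model interposes 1-2-1 RSU blocks, SaAdaIN layers, and the output module at each stage.

```python
import numpy as np

def instance_norm(x):
    # channel-wise statistics along the time axis
    mu = x.mean(axis=1, keepdims=True)
    sd = x.std(axis=1, keepdims=True) + 1e-8
    return (x - mu) / sd, mu, sd

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def convert(x_src, x_tgt):
    """Toy U2-VC data flow: IN removes the source speaker statistics,
    sigmoid acts as the activation guidance, and the target's (mu, sigma)
    perform a simplified speaker identity transformation."""
    content, _, _ = instance_norm(x_src)    # speaker info removed
    content = sigmoid(content)              # bounded content embedding
    _, mu_t, sd_t = instance_norm(x_tgt)    # target speaker embedding
    return sd_t * content + mu_t

X1 = np.random.randn(80, 100)               # source log-mel: 80 mels x 100 frames
X2 = np.random.randn(80, 120) * 2.0 + 3.0   # target log-mel; lengths may differ
X12 = convert(X1, X2)                       # X_{1->2}: converted log-mel
```

Note that because the speaker statistics are time-invariant, the source and target utterances need not have the same number of frames.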
The 1-2-1 residual U-blocks and SaAdaIN are the most important components of U2-VC for improving the naturalness of the converted speech, and we describe them in the next two sections. The loss function of U2-VC is then introduced, and the configuration of the proposed U2-VC is summarized in Table 1.

1-2-1 residual U-block
RSU was first developed to extract multi-scale features from images for salient object detection, preserving more detailed information. To make the residual U-block suitable for the voice conversion task, we redesigned it as the 1-2-1 residual U-block (1-2-1 RSU), whose detailed structure is shown in Fig. 2b. In Fig. 2, ReBNConv1D denotes a block consisting of a 1D CNN, a batch normalization layer, and a LeakyReLU activation function; ReBNConv2D has a similar architecture, the only difference being that the 1D CNN is replaced by a 2D CNN. The main differences between the original residual U-block and the proposed 1-2-1 residual U-block are the input convolution layer and the reshaping operation.

Fig. 1 caption: The architecture of U2-VC. The 1-2-1 residual U-blocks (RSU) with different layers (1-2-1 RSU7, ..., 1-2-1 RSU4F) consist of the U-Net-like encoder-decoder structure. "7," "6," "5," and "4" represent the layers (L) of the 1-2-1 residual U-blocks. A greater L means the 1-2-1 residual U-block can capture more large-scale information. In this network, we set L from large to small in order to extract features from the global to the detail level. This process preserves more fine details of the input features, which benefits the naturalness of the converted speech. Inspired by AGAIN-VC, a sigmoid function is used at the end of the encoder. Sandwich adaptive instance normalization (SaAdaIN) is adopted in the decoder for speaker identity transformation.

Table 1 caption: Bold words represent the three parts of U2-VC as noted in Section 3. "I" denotes the input size, "O" the output size, "K" the kernel size, "S" the stride, "D" the dilation, "H" the hidden size, and "L" the layer. As stated before, the 1-2-1 residual U-block has the same structure as the original residual U-block except for the input layer and the reshaping operation.
As mentioned before, a speech signal is a sequence, while its spectrogram can be regarded as a one-channel image. Accordingly, we adopt a 1-2-1D CNN structure in the newly designed residual U-block according to the pros and cons of 1D and 2D CNNs, as pointed out in [12]: a 1D CNN is good at capturing dynamic change, while a 2D CNN is good at converting features while preserving the original structures. Therefore, a 1D CNN with batch normalization (BN) followed by LeakyReLU is used as the input layer to extract the local features, and several layers of 2D CNNs with batch normalization and LeakyReLU are used for downsampling and upsampling in the U-block to extract multi-scale features. The output feature of the U-block has to be reshaped from 2D back to 1D because we use only 1D IN to extract the speaker identity information along the time axis, the main reason being that speaker identity is a time-invariant feature.

Figure 3 compares the 1-2-1 residual U-block (RSU) used in U2-VC and the plain residual block used in AGAIN-VC. The operation of the plain residual block can be described as

F(x) = H_1(x) + H_2(H_1(x)),

where F(x) denotes the output feature of the plain residual block given the input feature x, which is extracted from the log-mel spectrograms of the source and target speech; H_1 represents the operation of ReBNConv1D and H_2 the operation of a plain 1D convolution block. Both operations extract only local features. Instead of using a plain convolution block that extracts local features only, a multi-layer U-block is used in the RSU to extract features at different scales. The output feature of the RSU can be represented as

F_RSU(x) = H_1(x) + H_U(H_1(x)),

where F_RSU(x) denotes the output feature of the RSU; H_1 stands for the convolution operation that extracts the local features, and H_U represents the operation of the U-block that extracts multi-scale features from the local features H_1(x).
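The two residual forms above can be sketched side by side. The helpers below are dummies standing in for ReBNConv1D (H_1), the plain 1D convolution block (H_2), and the 2D U-block (H_U); only the 1D→2D→1D reshaping and the fusion-by-summation structure are meant to be faithful.

```python
import numpy as np

def h1(x):                      # stand-in for ReBNConv1D: local feature extractor
    return 0.9 * x

def h2(x):                      # stand-in for the plain 1D convolution block
    return 0.5 * x

def h_u(local):                 # stand-in for the 2D U-block (multi-scale feature)
    img = local[np.newaxis]     # reshape the 1D feature map into a 1-channel "image"
    ms = img - img.mean(axis=2, keepdims=True)   # dummy multi-scale processing
    return ms[0]                # reshape 2D -> 1D for the following 1D IN layer

def plain_residual(x):
    # plain residual block: F(x) = H1(x) + H2(H1(x))
    local = h1(x)
    return local + h2(local)

def rsu_121(x):
    # 1-2-1 RSU: F_RSU(x) = H1(x) + H_U(H1(x))
    local = h1(x)
    return local + h_u(local)

x = np.random.randn(80, 128)    # e.g. 80 mel channels x 128 frames
y_plain, y_rsu = plain_residual(x), rsu_121(x)
```

Both blocks keep the input shape, so either can be dropped into the same network position; they differ only in whether the second branch sees local or multi-scale features.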
The benefit of using RSU is that it helps U 2 -VC preserve more detailed features when compared to AGAIN-VC, which is helpful to improve the naturalness of the converted speech.

Sandwich adaptive instance normalization
Sandwich adaptive instance normalization (SaAdaIN) is an extension of sandwich batch normalization to style transformation [25]. AdaIN first performs instance normalization on the content feature, and then an affine transformation is applied to the normalized content feature using the statistics of the style feature. AdaIN can be formulated as

AdaIN(C, S) = σ(S) · (C − μ(C)) / σ(C) + μ(S),

where C is the content input and S is the style input. Different from batch normalization, μ(·) and σ(·) represent the channel-wise average and standard deviation of the input, respectively. As discussed in [25], AdaIN can lead to content loss in the output because the style-dependent re-scaling may further amplify the intrinsic data heterogeneity brought by the variety of the input content images. To reduce this content loss, SaAdaIN adds a shared sandwich affine layer after the instance normalization of the content feature to reduce data heterogeneity, which makes the output preserve more content information. SaAdaIN can be formulated as

SaAdaIN(C, S) = σ(S) · (γ_sa · (C − μ(C)) / σ(C) + β_sa) + μ(S),

where γ_sa and β_sa are learnable parameters of the added affine layer, whose shapes both equal the number of channels of the input feature. Experimental results on image style transformation have verified the superiority of SaAdaIN over AdaIN [25], and the ablation results in this paper also show the importance of introducing SaAdaIN for voice conversion in improving the naturalness of the converted speech.

Fig. 3 caption: The constitution of the output feature from each block illustrates why the 1-2-1 residual U-block can preserve more details than the plain residual block.
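The two normalizations can be written out directly. The numpy sketch below is our own rendering, with statistics taken channel-wise along the time axis as in the text; it verifies that SaAdaIN with identity sandwich parameters (γ_sa = 1, β_sa = 0) collapses to plain AdaIN.

```python
import numpy as np

def stats(x):
    # channel-wise mean/std over the time axis (speaker identity is time-invariant)
    return x.mean(axis=1, keepdims=True), x.std(axis=1, keepdims=True) + 1e-8

def adain(c, s):
    # AdaIN(C, S) = sigma(S) * (C - mu(C)) / sigma(C) + mu(S)
    mu_c, sd_c = stats(c)
    mu_s, sd_s = stats(s)
    return sd_s * (c - mu_c) / sd_c + mu_s

def sa_adain(c, s, gamma_sa, beta_sa):
    # SaAdaIN inserts a shared sandwich affine layer between IN and the
    # style-dependent re-scaling, reducing content loss.
    mu_c, sd_c = stats(c)
    mu_s, sd_s = stats(s)
    norm = (c - mu_c) / sd_c
    return sd_s * (gamma_sa * norm + beta_sa) + mu_s

C_feat = np.random.randn(80, 100)            # content feature (channels x time)
S_feat = np.random.randn(80, 120) * 2 + 1    # style feature from the target
gamma = np.ones((80, 1))                     # one parameter per channel
beta = np.zeros((80, 1))

# with identity sandwich parameters, SaAdaIN reduces exactly to AdaIN
assert np.allclose(sa_adain(C_feat, S_feat, gamma, beta), adain(C_feat, S_feat))
```

The identity-parameter check is a useful sanity test when implementing the layer, since a trained γ_sa, β_sa should deviate from it only to the extent that reducing data heterogeneity helps.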

Loss
Note that during the training phase, the log-mel spectrograms of the source and the target are both extracted from speech signals of the same speaker, so that all the side self-reconstruction losses and the final self-reconstruction loss can be calculated. We adopt deep supervision similar to U2-Net, and the loss is defined as

L = Σ_{i=1}^{M} ω_i · l_i + ω_f · l_f,

where X_1 represents the source log-mel spectrogram; X^i_{1→1}, with i = 1, ..., M, represents the ith side output reconstructed log-mel spectrogram, and X_{1→1} represents the final reconstructed log-mel spectrogram. l_i denotes the ith side self-reconstruction loss between the side output reconstructed log-mel spectrogram and the source log-mel spectrogram, with M = 6 as shown in Fig. 4, while l_f is the self-reconstruction loss between the final reconstructed log-mel spectrogram and the source log-mel spectrogram. ω_i and ω_f are the weights of the two loss terms. For all l_i and l_f, we use the L1 loss as the self-reconstruction loss.
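A minimal numpy rendering of this deep-supervision loss follows. The paper does not state the values of ω_i and ω_f, so equal unit weights are assumed for illustration.

```python
import numpy as np

def l1(a, b):
    # L1 self-reconstruction loss between two log-mel spectrograms
    return np.abs(a - b).mean()

def deep_supervision_loss(side_outputs, final_output, target,
                          w_side=1.0, w_final=1.0):
    """L = sum_i w_i * l_i + w_f * l_f (weights assumed equal here)."""
    side = sum(w_side * l1(x, target) for x in side_outputs)
    return side + w_final * l1(final_output, target)

target = np.zeros((80, 100))                          # source log-mel X_1
sides = [np.full((80, 100), 0.1) for _ in range(6)]   # M = 6 side outputs
final = np.full((80, 100), 0.05)                      # final reconstruction
loss = deep_supervision_loss(sides, final, target)    # 6*0.1 + 0.05 = 0.65
```

During training all seven terms share the same target, so the side branches receive gradient directly rather than only through the fusion module.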

Experimental setup
We conduct three experiments, namely an ablation study, a mono-lingual voice conversion evaluation, and a cross-lingual voice conversion evaluation, to verify the effectiveness of the proposed algorithm in improving the naturalness of the converted speech. AdaIN-VC and AGAIN-VC are chosen as the baselines for comparison in both the mono-lingual and cross-lingual voice conversion evaluations. Details are given in the following parts.

Dataset
The VCTK dataset [34] is chosen for training, for the ablation study, and for comparing the proposed approach with other approaches in the mono-lingual scenario. VCTK is an English dataset consisting of 46 h of speech data from 109 speakers. In addition, the dataset for the cross-lingual VC task, taken from the Voice Conversion Challenge (VCC) 2020 [18], is used to evaluate the robustness of the proposed approach in the cross-lingual scenario. This dataset includes 6 speakers, both male and female.

Vocoder
Because the output of the proposed U2-VC is the log-mel spectrogram of the converted speech, we need a vocoder to convert the log-mel spectrogram to a time-domain waveform. The pretrained MelGAN [35] is chosen as the vocoder for its high inference speed and the high quality of its generated speech. Note that MelGAN is used as the vocoder for all the baselines as well as the proposed approach in all evaluations, to ensure a fair comparison.

Training details
During the training phase, we randomly select 80 speakers from the VCTK corpus, and 200 utterances are randomly chosen for each speaker. The remaining speakers are used for evaluation in the unseen-to-unseen conversion scenario. All raw speech signals are downsampled to 22.05 kHz and transformed into log-mel spectrograms with a 1024-point STFT window, a hop length of 256, and 80 mel-frequency bins for both training and evaluation, following the configuration of MelGAN [35]. The AdamW optimizer [36] is used to train our network with β_1 = 0.9, β_2 = 0.999 and an initial learning rate of 5 × 10^-4. The proposed U2-VC is implemented in PyTorch 1.5.1. Training is conducted on an NVIDIA Tesla V100 with 32 GB of memory for 170k steps.
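The feature-extraction and optimizer settings above can be collected into a small configuration sketch. The dictionary names are ours; the values come from the text (MelGAN's feature configuration and the AdamW settings). The last line estimates the number of centered-STFT frames per second of audio as 1 + ⌊samples/hop⌋.

```python
# Hypothetical configuration names; values are those stated in the text.
FEATURE_CONFIG = {
    "sample_rate": 22050,   # Hz, after downsampling
    "n_fft": 1024,          # STFT window length
    "hop_length": 256,      # hop between successive frames
    "n_mels": 80,           # mel-frequency bins
}
OPTIM_CONFIG = {"optimizer": "AdamW", "betas": (0.9, 0.999), "lr": 5e-4}

# Frames produced by a centered STFT for 1 s of audio: 1 + floor(samples / hop).
frames_per_second = 1 + FEATURE_CONFIG["sample_rate"] // FEATURE_CONFIG["hop_length"]
```

At these settings one second of audio yields roughly 87 spectrogram frames, which fixes the time resolution seen by the 1-2-1 RSU blocks.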

Evaluation metrics
We evaluate the proposed U2-VC on the naturalness of the converted speech and on the speaker similarity between the converted speech and the target speech. Both naturalness and speaker similarity require objective and subjective evaluation. Three evaluation metrics are used:

1. Mel-cepstral distortion (MCD): MCD measures the difference between the target and converted spectral features. It is calculated between the converted and target mel-cepstral coefficients (MCEPs) [37]. Lower scores indicate better performance.

2. NISQA model: NISQA [27] is a speech quality prediction model. It can not only predict the overall MOS well but also measure four speech quality dimensions: noisiness, coloration, discontinuity, and loudness. We use the overall MOS predicted by NISQA as an objective measurement of the naturalness of the converted speech. Higher scores indicate better performance.

3. Mean opinion score (MOS): MOS is used for subjective evaluation of both the naturalness of the converted speech and its similarity to the target speaker. For similarity, after listening to the target speech and the converted speech, annotators rated a score from 1 to 5 depending on how confident they were that the two signals were uttered by the same speaker, where 1 represents totally different and 5 represents exactly the same. For naturalness, annotators rated a score from 1 to 5 depending on the naturalness of the converted speech, where 1 represents completely unnatural and 5 represents completely natural. Higher scores indicate better performance.
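As one concrete example, MCD is commonly computed from time-aligned MCEP sequences as a per-frame Euclidean distance with a 10/ln10 · √2 scaling, excluding the 0th (energy) coefficient. The numpy sketch below uses this generic formulation, which is not necessarily the exact variant of [37].

```python
import numpy as np

def mcd(mcep_ref, mcep_conv):
    """Mel-cepstral distortion in dB between time-aligned MCEP sequences
    of shape (frames, coefficients); the 0th (energy) coefficient is
    excluded. Lower is better."""
    diff = mcep_ref[:, 1:] - mcep_conv[:, 1:]
    k = 10.0 / np.log(10.0) * np.sqrt(2.0)           # standard dB scaling
    return k * np.mean(np.sqrt(np.sum(diff ** 2, axis=1)))

ref = np.zeros((50, 25))                             # 50 frames, 25 MCEPs
conv = np.zeros((50, 25))
conv[:, 1:] += 0.1                                   # constant 0.1 offset per coeff
score = mcd(ref, conv)                               # k * sqrt(24 * 0.01) ~ 3.01 dB
```

In practice the two sequences must first be aligned, typically with dynamic time warping, before this frame-wise distance is meaningful.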

Statistical testing method
We use analysis of variance (ANOVA) as the statistical test to verify that the proposed approach outperforms the baselines in a statistically significant manner. In the ANOVA, we set the confidence level to 0.95; if the p-value between two approaches is less than 0.05, there is a significant difference between them.
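One-way ANOVA reduces to comparing between-group and within-group variance. The self-contained numpy sketch below computes the F statistic on hypothetical MOS ratings (the p-value would then be read from the F distribution with (k − 1, n − k) degrees of freedom, e.g. via scipy.stats); the rating values are invented for illustration.

```python
import numpy as np

def one_way_anova_F(*groups):
    """F statistic of a one-way ANOVA over score groups (one per system)."""
    all_x = np.concatenate(groups)
    grand = all_x.mean()
    k, n = len(groups), len(all_x)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

a = np.array([4.0, 4.2, 4.1, 3.9])   # hypothetical MOS ratings, system A
b = np.array([3.1, 3.0, 3.3, 3.2])   # hypothetical MOS ratings, system B
F = one_way_anova_F(a, b)            # large F -> the group means differ
```

With only two groups this F test is equivalent to a two-sample t-test (F = t²), so it answers exactly the pairwise "proposed vs. baseline" question used here.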

Experiment implementation details
As mentioned above, we conduct three experiments: the ablation study, the mono-lingual voice conversion evaluation, and the cross-lingual voice conversion evaluation. The implementation details are summarized as follows:

1. Ablation study: The ablation study is performed on U2-VC to verify the effectiveness of the proposed structure and SaAdaIN. Both seen-to-seen and unseen-to-unseen conversion scenarios are included, where seen-to-seen conversion means both the source and target speakers are included in the training set, while unseen-to-unseen conversion means neither is included in the training set. Both objective and subjective metrics are used for evaluation. For the subjective evaluation, the raters are fluent English speakers.

2. Mono-lingual voice conversion evaluation: A mono-lingual conversion performance comparison is conducted to verify the advantage of the proposed U2-VC, with AGAIN-VC and AdaIN-VC chosen as baselines. Both seen-to-seen and unseen-to-unseen conversion cases are included, and both objective and subjective metrics are used for evaluation. For the subjective evaluation, the raters are fluent English speakers. ANOVA is used to show the advantage of the proposed approach over the two baselines in a statistically significant manner.

3. Cross-lingual voice conversion evaluation: We conduct a cross-lingual conversion performance comparison to evaluate the robustness of the proposed U2-VC in the cross-lingual scenario, inspired by [38]; this also tests the ability of the proposed algorithm to disentangle the content information and speaker identity of the input speech. AGAIN-VC and AdaIN-VC are again chosen as baselines. All models are trained on the VCTK dataset for a fair comparison. We choose speech signals of Mandarin speakers from the VCC dataset as the targets and unseen speakers from the VCTK dataset as the sources, because the raters of the subjective evaluation speak both Mandarin and English fluently. Both objective and subjective metrics are used for this evaluation. For the objective evaluation, we use only the NISQA model to evaluate naturalness, because MCD requires that the converted speech and the target speech have the same content. MOS is chosen as the subjective evaluation metric. ANOVA is again used to confirm the advantage of the proposed approach over the baselines in a statistically significant manner.

Ablation study
In this ablation study, we use "S," "T," "F," and "M" to represent the source speech, the target speech, female, and male, respectively. For example, "SF2TF" represents conversion from a source female speech to a target female speech. Tables 2 and 3 present the objective ablation results, in terms of speaker similarity and naturalness, for the converted speech of AGAIN-VC, voice conversion with only the U2-Net structure, voice conversion with only SaAdaIN, and our U2-VC. Table 2 shows the results in the seen-to-seen scenario. One can see that all of these approaches have nearly the same MCD score. Regarding the predicted MOS, U2-VC performs much better than AGAIN-VC and always achieves the best score; the maximum gain over AGAIN-VC is 0.3, a significant improvement. The same trend can be observed in Table 3, which covers the unseen-to-unseen conversion scenario.

Table 2 caption: Objective evaluation results of the ablation study on architecture in the seen-to-seen conversion scenario. "AGAIN-VC" represents the network with neither the U2-Net structure nor SaAdaIN; "U2-VC" represents the network with both.

Tables 4 and 5 present the subjective evaluation results of the ablation study. From these results, one can see that both the U2-Net structure and SaAdaIN improve the naturalness of the converted speech in both seen-to-seen and unseen-to-unseen conversion scenarios, with the U2-Net structure contributing more. According to the subjective results, the combination of the U2-Net structure and SaAdaIN achieves the largest improvement, because the two components are complementary.
Instead of learning the learnable parameters of the SaAdaIN directly from the local features without U 2 -Net structure, the proposed approach learns these parameters from the multi-scale features generated by the U 2 -Net structure, which can improve the performance of the SaAdaIN. Meanwhile, the proposed approach with the SaAdaIN makes the U 2 -Net structure generate better multi-scale features compared with the conventional approach without the SaAdaIN.
In summary, the ablation study results demonstrate the effectiveness of U 2 -Net structure and SaAdaIN in improving the naturalness of the converted speech in both seen-to-seen and unseen-to-unseen scenarios.

Comparison of mono-lingual conversion performance
Tables 6 and 7 present the objective and subjective evaluation results (with standard deviations) of mono-lingual conversion in the seen-to-seen scenario, respectively, while Tables 8 and 9 present the corresponding results in the unseen-to-unseen scenario.
From Table 6, one can see that the proposed U2-VC and AGAIN-VC have nearly the same MCD scores, while the MCD scores of the proposed U2-VC improve considerably on those of AdaIN-VC. Regarding the predicted MOS, the proposed U2-VC always performs best: its average predicted MOS is 0.2 higher than that of AGAIN-VC and up to 0.92 higher than that of AdaIN-VC, an obvious improvement. As stated in [18], the MCD score does not always correlate with human perception, so the subjective evaluation is more important: its results represent the authentic naturalness and similarity of a voice conversion system. From Table 7, one can see that the proposed U2-VC always performs best compared with AGAIN-VC and AdaIN-VC in both similarity and naturalness, meaning the perceptual quality of its converted speech is much higher. Tables 8 and 9 present the results of the unseen-to-unseen scenario, which show a trend similar to that of the seen-to-seen scenario in Tables 6 and 7.
We perform a statistical significance evaluation of the MOS through ANOVA with a confidence level of 0.95 to further confirm the better perceptual quality of the proposed approach. Tables 10 and 11 present the results in the seen-to-seen and unseen-to-unseen scenarios, respectively. From the similarity test results, one can see that there is statistical significance between the proposed approach and AdaIN-VC in the four separate cases, but no obvious significance between the proposed approach and AGAIN-VC. For all of the four cases, there is statistical significance between the proposed approach and the two baselines, which indicates that the proposed approach performs better on similarity than the baselines. From the naturalness test results, there is statistical significance between the proposed approach and the baselines in both the separate and overall cases. Focusing on the subjective evaluation, the proposed approach indeed improves the naturalness of the converted speech without degrading speaker similarity compared with the baselines in both seen-to-seen and unseen-to-unseen scenarios.

Table 3 caption: Objective evaluation results of the ablation study on architecture in the unseen-to-unseen conversion scenario. "AGAIN-VC" represents the network with neither the U2-Net structure nor SaAdaIN; "U2-VC" represents the network with both.
In summary, the comparison results, especially the subjective evaluation results, show the advantage of the proposed U 2 -VC in improving the quality of the converted speech.

Comparison of cross-lingual conversion performance
For the cross-lingual conversion evaluation, we set up three conversion cases: VCTK2VCC, VCC2VCTK, and VCC2VCC. The language of the source speech is either English or Mandarin, and the target speech can be in either language from these two corpora. The objective and subjective evaluation results are shown in Tables 12 and 13, respectively. From these tables, one can see that the proposed U2-VC always achieves the best performance compared with AdaIN-VC and AGAIN-VC in both the objective and subjective evaluations (with standard deviations). The results demonstrate that even though the proposed approach degrades somewhat in the cross-lingual scenario, it is still much better than the competing approaches. The spectrograms of the converted speech signals in Figs. 5, 6, and 7 show that the proposed approach alleviates the problems of harmonic distortion and content loss, which makes the converted speech sound better in both similarity and naturalness.
A statistical significance evaluation of the MOS results is also performed to verify the advantage of the proposed approach in the cross-lingual scenario. Table 14 presents the results, from which one can see that the proposed approach indeed improves the naturalness of the converted speech without sacrificing speaker similarity, consistent with the conclusion of the mono-lingual conversion.
In summary, the proposed U2-VC improves robustness in the cross-lingual scenario, as shown by the evaluation results against AdaIN-VC and AGAIN-VC.

Conclusion
In this paper, we propose U2-VC, a new one-shot voice conversion system with U2-Net and SaAdaIN. The proposed approach has the capability of extracting