Fig. 1 | EURASIP Journal on Audio, Speech, and Music Processing

Fig. 1

From: U2-VC: one-shot voice conversion using two-level nested U-structure


The architecture of U2-VC. The 1-2-1 residual U-blocks (RSU) with different numbers of layers (1-2-1RSU7, ..., 1-2-1RSU4F) each consist of a U-Net-like encoder-decoder structure. “7”, “6”, “5”, and “4” denote the number of layers (L) in each 1-2-1 residual U-block. A greater L means the 1-2-1 residual U-block can capture more large-scale information. In this network, L is set from large to small in order to extract features from the global level down to the detail level. This process preserves more fine details of the input features, which could benefit the naturalness of the converted speech. Inspired by AGAIN-VC, a sigmoid function is used at the end of the encoder. Sandwich adaptive instance normalization (SaAdaIN) is adopted in the decoder for speaker identity transformation.
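As a rough illustration of the two operations named in the caption, the sketch below implements a sigmoid activation and plain adaptive instance normalization (AdaIN) on a feature map stored as a list of channels over time. It is not the sandwich variant (SaAdaIN), whose details are not given in this caption, and the function names `channel_stats`, `adain`, and `sigmoid` are hypothetical, chosen only for this example.

```python
import math


def channel_stats(feat, eps=1e-5):
    """Return per-channel (mean, std) computed over the time axis.

    `feat` is a list of channels; each channel is a list of floats over time.
    `eps` is added to the variance to avoid division by zero.
    """
    stats = []
    for channel in feat:
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        stats.append((mean, math.sqrt(var + eps)))
    return stats


def adain(content, target, eps=1e-5):
    """Plain AdaIN (assumption: not the paper's SaAdaIN).

    Each content channel is normalized with its own statistics, then
    rescaled and shifted with the statistics of the matching target channel.
    """
    out = []
    for channel, (c_mean, c_std), (t_mean, t_std) in zip(
        content, channel_stats(content, eps), channel_stats(target, eps)
    ):
        out.append([(x - c_mean) / c_std * t_std + t_mean for x in channel])
    return out


def sigmoid(x):
    """Numerically stable logistic function, mapping any float into [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Toy features: 2 channels, 3 time steps each (made-up numbers).
content = [[1.0, 2.0, 3.0], [0.0, 0.0, 4.0]]
target = [[10.0, 20.0, 30.0], [5.0, 5.0, 5.0]]
converted = adain(content, target)
squashed = [[sigmoid(x) for x in channel] for channel in content]
```

After `adain`, each output channel carries the per-channel mean of the target features while keeping the temporal shape of the content features, which is the general idea behind using instance-normalization statistics for speaker identity transformation.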
