Fig. 2. From: Components loss for neural networks in mask-based speech enhancement. Topology details of the employed CNN in Fig. 1 (adopted from [48, Fig. 6]). The operation Conv(f, h×w) denotes a convolution, where f (either F or 2F) is the number of filter kernels in the layer and (h×w) is the kernel size. The maxpooling and upsampling layers have a kernel size of (2×1), and the stride of the maxpooling layers is set to 2. The gray areas contain two symmetric procedures. All possible forward residual skip connections are added between layers with matched dimensions.
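The role of "matched dimensions" in the skip connections can be illustrated with a small shape calculation. The sketch below is not the authors' implementation: it tracks only the feature-map size along the axis affected by the (2×1) maxpooling and upsampling, assumes no padding in the pooling layers, and uses hypothetical values for the input height (132 and 129) and for the number of pooling stages (2). The function names are likewise illustrative. Channel counts, which would also have to match, are ignored.

```python
def pool_out(height, kernel=2, stride=2):
    """Output length of unpadded max-pooling along one axis."""
    return (height - kernel) // stride + 1


def upsample_out(height, factor=2):
    """Output length of upsampling by an integer factor along one axis."""
    return height * factor


def trace_heights(height, depth=2):
    """Heights after `depth` poolings followed by `depth` upsamplings.

    `depth` is a hypothetical number of stages, not taken from the figure.
    """
    heights = [height]
    for _ in range(depth):
        height = pool_out(height)
        heights.append(height)
    for _ in range(depth):
        height = upsample_out(height)
        heights.append(height)
    return heights


def skip_pairs(heights):
    """Index pairs (i, j), i < j, with equal height: skip-connection candidates."""
    return [
        (i, j)
        for i in range(len(heights))
        for j in range(i + 1, len(heights))
        if heights[i] == heights[j]
    ]


if __name__ == "__main__":
    for h in (132, 129):  # hypothetical input heights
        heights = trace_heights(h)
        print(h, heights, skip_pairs(heights))
```

With an input height divisible by 4, each pooling stage has a decoder-side counterpart of the same size, so a skip connection is possible at every level. With an odd height such as 129, the pooling discards a sample and the outermost pair no longer matches, which is one reason such connections are restricted to layers whose dimensions agree.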