
Table 1 Tensor sizes at the intermediate layers of the model for an input speech signal of length 4 s sampled at 8 kHz

From: Time-domain adaptive attention network for single-channel speech separation

| Module    | Layers           | Input size        | Output size       |
|-----------|------------------|-------------------|-------------------|
| Encoder   | Conv1d           | [b, 1, 32000]     | [b, 256, 15999]   |
|           | GroupNorm        | [b, 256, 15999]   | [b, 256, 15999]   |
|           | Conv1d           | [b, 256, 15999]   | [b, 64, 15999]    |
|           | Segmentation     | [b, 64, 15999]    | [64, 200, b*162]  |
| Separator | LocalAttention   | [64, 200, b*162]  | [64, 200, b*162]  |
|           | GlobalAttention  | [64, 200, b*162]  | [64, 200, b*162]  |
| Decoder   | OverlapAdd       | [64, 200, b*162]  | [b*C, 256, 15999] |
|           | Conv1d-Transpose | [b*C, 256, 15999] | [b, C, 32000]     |
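
The sizes in Table 1 can be checked with simple length arithmetic. The sketch below is not taken from the paper: it assumes an unpadded encoder convolution with kernel size 4 and stride 2, chunks of length 200 with 50% overlap (hop 100), and the padding scheme commonly used in dual-path models. These values are inferred because they reproduce the 15999 frames, 162 chunks, and 32000 output samples listed in the table; the function names are hypothetical.

```python
def conv1d_out_len(length, kernel, stride):
    """Output length of an unpadded 1-D convolution."""
    return (length - kernel) // stride + 1


def conv_transpose1d_out_len(length, kernel, stride):
    """Output length of an unpadded 1-D transposed convolution."""
    return (length - 1) * stride + kernel


def num_chunks(length, chunk, hop):
    """Number of overlapping chunks after dual-path style padding.

    Assumption: hop == chunk // 2. The tail is padded so the sequence
    splits evenly, then `hop` zeros are added on both ends.
    """
    gap = chunk - (hop + length % chunk) % chunk
    padded = length + gap + 2 * hop
    return (padded - chunk) // hop + 1


# 4 s of audio at 8 kHz, as in the table caption.
samples = 4 * 8000

# Assumed encoder hyperparameters: kernel 4, stride 2.
frames = conv1d_out_len(samples, kernel=4, stride=2)

# Assumed segmentation hyperparameters: chunk 200, hop 100.
chunks = num_chunks(frames, chunk=200, hop=100)

# The decoder's transposed convolution mirrors the encoder.
restored = conv_transpose1d_out_len(frames, kernel=4, stride=2)

print(frames, chunks, restored)
```

Under these assumptions the transposed convolution restores the original 32000 samples exactly, which is why kernel size 4 fits the table better than kernel size 3 (the latter also yields 15999 frames but would restore only 31999 samples).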