Time-domain adaptive attention network for single-channel speech separation

EURASIP Journal on Audio, Speech, and Music Processing

Table 1 Tensor sizes at the intermediate layers of the model for a 4 s input utterance sampled at 8 kHz

Module	Layers	Input size	Output size
Encoder	Conv1d	[b, 1, 32000]	[b, 256, 15999]
Encoder	GroupNorm	[b, 256, 15999]	[b, 256, 15999]
Encoder	Conv1d	[b, 256, 15999]	[b, 64, 15999]
Encoder	Segmentation	[b, 64, 15999]	[64, 200, b*162]
Separator	LocalAttention	[64, 200, b*162]	[64, 200, b*162]
Separator	GlobalAttention	[64, 200, b*162]	[64, 200, b*162]
Decoder	OverlapAdd	[64, 200, b*162]	[b*C, 256, 15999]
Decoder	Conv1d-Transpose	[b*C, 256, 15999]	[b, C, 32000]
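
The sizes in Table 1 can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' code: the table does not state the encoder kernel size, the stride, or the segmentation scheme, so a kernel of 4 with stride 2, a chunk length of 200 with a hop of 100 (50% overlap), and a DPRNN-style padding rule are all assumptions, chosen because they reproduce the 15999 frames and 162 chunks listed. The function and constant names are hypothetical.

```python
# Sanity check of the tensor lengths in Table 1 (pure Python, no dependencies).
# ASSUMPTIONS (not stated in the table): encoder kernel = 4, stride = 2,
# chunk length = 200, hop = 100, and DPRNN-style padding before chunking.

def conv1d_out_len(n, kernel, stride, padding=0, dilation=1):
    """Output length of a 1-D convolution (standard formula)."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

def conv_transpose1d_out_len(n, kernel, stride, padding=0):
    """Output length of a 1-D transposed convolution (no output padding)."""
    return (n - 1) * stride - 2 * padding + kernel

def num_chunks(length, chunk, hop):
    """Number of overlapping chunks after padding the sequence.

    Assumed scheme: pad the tail so the chunks tile the sequence,
    then pad `hop` frames on both ends.
    """
    gap = chunk - (hop + length % chunk) % chunk
    padded = length + gap + 2 * hop
    return (padded - chunk) // hop + 1

SAMPLES = 4 * 8000           # 4 s at 8 kHz -> 32000 samples
KERNEL, STRIDE = 4, 2        # assumed encoder hyperparameters
CHUNK, HOP = 200, 100        # chunk length from Table 1; hop is assumed

frames = conv1d_out_len(SAMPLES, KERNEL, STRIDE)               # encoder frames
chunks = num_chunks(frames, CHUNK, HOP)                        # chunks per utterance
restored = conv_transpose1d_out_len(frames, KERNEL, STRIDE)    # decoder output length

print(frames, chunks, restored)
```

Under these assumptions the encoder yields 15999 frames, segmentation yields 162 chunks of 200 frames per utterance, and the transposed convolution restores the original 32000 samples, in agreement with the table.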