Points2Sound: from mono to binaural audio using 3D point cloud scenes

EURASIP Journal on Audio, Speech, and Music Processing

Table 1 Quantitative results of baselines and Points2Sound considering different mono input audio. For each method, we use RGB-depth 3D point cloud attributes and report the performance depending on the number of sources (\(N = 1,2,3\)), based on the average of the evaluation metrics. Average values for any number of sources are given by \(\overline{\mathrm {d}_{\mathrm {ENV}}}\) and \(\overline{\mathrm {d}_{\mathrm {STFT}}}\).

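Method	\(\mathrm {d}_{\mathrm {ENV}} \downarrow\) (\(N=1\))	\(\mathrm {d}_{\mathrm {ENV}} \downarrow\) (\(N=2\))	\(\mathrm {d}_{\mathrm {ENV}} \downarrow\) (\(N=3\))	\(\mathrm {d}_{\mathrm {STFT}} \downarrow\) (\(N=1\))	\(\mathrm {d}_{\mathrm {STFT}} \downarrow\) (\(N=2\))	\(\mathrm {d}_{\mathrm {STFT}} \downarrow\) (\(N=3\))	\(\overline{\mathrm {d}_{\mathrm {ENV}}} \downarrow\)	\(\overline{\mathrm {d}_{\mathrm {STFT}}} \downarrow\)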
\(s_{m}\) (true mono)
Mono-Mono	0.387	0.403	0.388	26.719	26.414	26.747	0.392	26.626
Rotated-Visual	0.232	0.285	0.305	9.002	10.588	12.016	0.274	10.535
Points2Sound (\(\mathcal {L}_{\mathrm {full}}\))	0.173	0.248	0.280	3.297	6.645	9.080	0.233	6.340
\(s_{m} = s_{b}^L\)
Mono-Mono	0.148	0.155	0.159	7.472	6.997	6.951	0.154	7.140
Rotated-Visual	0.165	0.166	0.165	7.610	6.808	6.345	0.165	6.921
Points2Sound (\(\mathcal {L}_{\mathrm {full}}\))	0.054	0.103	0.130	0.636	1.820	2.604	0.095	1.686
\(s_{m} = s_{b}^L+s_{b}^R\)
Mono-Mono	0.142	0.166	0.178	4.046	4.112	4.058	0.162	4.072
Rotated-Visual	0.166	0.192	0.209	5.663	5.918	6.031	0.189	5.870
Points2Sound (\(\mathcal {L}_{\mathrm {full}}\))	0.015	0.073	0.114	0.099	0.762	1.521	0.067	0.794
Points2Sound (\(\mathcal {L}_{\mathrm {full}}\)) (only-depth)	0.016	0.080	0.122	0.082	0.885	1.736	0.072	0.901
Points2Sound (\(\mathcal {L}_{\mathrm {diff}}\))	0.015	0.090	0.125	0.153	1.205	1.832	0.076	1.063
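
The caption states that the last two columns are averages over any number of sources. As a minimal sketch, the snippet below checks that reading for the \(s_{m} = s_{b}^L+s_{b}^R\) block, under the assumption that each average is an unweighted mean of the \(N = 1,2,3\) columns; the table does not state the weighting. The names `TABLE_ROWS`, `unweighted_mean`, and `deviations` are hypothetical and introduced only for illustration.

```python
# Values transcribed from the s_m = s_b^L + s_b^R block of Table 1.
# Each entry holds: (d_ENV for N=1,2,3), (d_STFT for N=1,2,3),
# and the reported (mean d_ENV, mean d_STFT).
TABLE_ROWS = {
    "Mono-Mono": ((0.142, 0.166, 0.178), (4.046, 4.112, 4.058), (0.162, 4.072)),
    "Rotated-Visual": ((0.166, 0.192, 0.209), (5.663, 5.918, 6.031), (0.189, 5.870)),
    "Points2Sound (L_full)": ((0.015, 0.073, 0.114), (0.099, 0.762, 1.521), (0.067, 0.794)),
    "Points2Sound (L_full, only-depth)": ((0.016, 0.080, 0.122), (0.082, 0.885, 1.736), (0.072, 0.901)),
    "Points2Sound (L_diff)": ((0.015, 0.090, 0.125), (0.153, 1.205, 1.832), (0.076, 1.063)),
}


def unweighted_mean(values):
    """Arithmetic mean, assuming every source count N carries equal weight."""
    return sum(values) / len(values)


def deviations(rows):
    """Absolute gap between the recomputed mean and the reported average, per method."""
    result = {}
    for method, (env, stft, (env_avg, stft_avg)) in rows.items():
        result[method] = (
            abs(unweighted_mean(env) - env_avg),
            abs(unweighted_mean(stft) - stft_avg),
        )
    return result


if __name__ == "__main__":
    for method, (env_dev, stft_dev) in deviations(TABLE_ROWS).items():
        print(f"{method}: d_ENV gap = {env_dev:.4f}, d_STFT gap = {stft_dev:.4f}")
```

For these rows the gaps stay below 0.001, which is consistent with an unweighted mean computed from unrounded per-\(N\) values, though it does not rule out other weightings.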