
Table 1 Quantitative results of baselines and Points2Sound for different mono input audio. For each method, we use RGB-depth 3D point cloud attributes and report performance by number of sources (\(N = 1,2,3\)), averaging each evaluation metric. Average values across all numbers of sources are given by \(\overline{\mathrm {d}_{\mathrm {ENV}}}\) and \(\overline{\mathrm {d}_{\mathrm {STFT}}}\).

From: Points2Sound: from mono to binaural audio using 3D point cloud scenes

 

| Method | \(\mathrm {d}_{\mathrm {ENV}} \downarrow\) (\(N=1\)) | \(\mathrm {d}_{\mathrm {ENV}} \downarrow\) (\(N=2\)) | \(\mathrm {d}_{\mathrm {ENV}} \downarrow\) (\(N=3\)) | \(\mathrm {d}_{\mathrm {STFT}} \downarrow\) (\(N=1\)) | \(\mathrm {d}_{\mathrm {STFT}} \downarrow\) (\(N=2\)) | \(\mathrm {d}_{\mathrm {STFT}} \downarrow\) (\(N=3\)) | \(\overline{\mathrm {d}_{\mathrm {ENV}}} \downarrow\) | \(\overline{\mathrm {d}_{\mathrm {STFT}}} \downarrow\) |
|---|---|---|---|---|---|---|---|---|
| \(s_{m}\) (true mono) | | | | | | | | |
| Mono-Mono | 0.387 | 0.403 | 0.388 | 26.719 | 26.414 | 26.747 | 0.392 | 26.626 |
| Rotated-Visual | 0.232 | 0.285 | 0.305 | 9.002 | 10.588 | 12.016 | 0.274 | 10.535 |
| Points2Sound (\(\mathcal {L}_{\mathrm {full}}\)) | 0.173 | 0.248 | 0.280 | 3.297 | 6.645 | 9.080 | 0.233 | 6.340 |
| \(s_{m} = s_{b}^L\) | | | | | | | | |
| Mono-Mono | 0.148 | 0.155 | 0.159 | 7.472 | 6.997 | 6.951 | 0.154 | 7.14 |
| Rotated-Visual | 0.165 | 0.166 | 0.165 | 7.610 | 6.808 | 6.345 | 0.165 | 6.921 |
| Points2Sound (\(\mathcal {L}_{\mathrm {full}}\)) | 0.054 | 0.103 | 0.130 | 0.636 | 1.820 | 2.604 | 0.095 | 1.686 |
| \(s_{m} = s_{b}^L+s_{b}^R\) | | | | | | | | |
| Mono-Mono | 0.142 | 0.166 | 0.178 | 4.046 | 4.112 | 4.058 | 0.162 | 4.072 |
| Rotated-Visual | 0.166 | 0.192 | 0.209 | 5.663 | 5.918 | 6.031 | 0.189 | 5.870 |
| Points2Sound (\(\mathcal {L}_{\mathrm {full}}\)) | 0.015 | 0.073 | 0.114 | 0.099 | 0.762 | 1.521 | 0.067 | 0.794 |
| Points2Sound (\(\mathcal {L}_{\mathrm {full}}\)) (only-depth) | 0.016 | 0.080 | 0.122 | 0.082 | 0.885 | 1.736 | 0.072 | 0.901 |
| Points2Sound (\(\mathcal {L}_{\mathrm {diff}}\)) | 0.015 | 0.090 | 0.125 | 0.153 | 1.205 | 1.832 | 0.076 | 1.063 |

  1. The lowest errors are highlighted using bold font
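
The caption states that the barred columns are averages over the number of sources. As a minimal sketch of how those columns relate to the per-\(N\) columns, the snippet below recomputes them for the Rotated-Visual row with true mono input. It assumes an unweighted mean over \(N = 1,2,3\); the source does not say whether the averaging is weighted by the number of test samples per \(N\), and the helper name `mean_over_sources` is illustrative only.

```python
# Per-N errors copied from the Rotated-Visual row (true mono input) of Table 1.
d_env_by_n = {1: 0.232, 2: 0.285, 3: 0.305}
d_stft_by_n = {1: 9.002, 2: 10.588, 3: 12.016}


def mean_over_sources(errors_by_n):
    """Unweighted mean over the number of sources N.

    Assumption: every N contributes equally; this is not confirmed by the source.
    """
    return sum(errors_by_n.values()) / len(errors_by_n)


avg_env = mean_over_sources(d_env_by_n)
avg_stft = mean_over_sources(d_stft_by_n)

# Compare against the barred columns reported in the table (0.274 and 10.535).
print(f"mean d_ENV  = {avg_env:.3f}")
print(f"mean d_STFT = {avg_stft:.3f}")
```

For this row the unweighted mean reproduces the reported averages. Some other rows differ in the last digit, which is consistent with the averages having been computed from unrounded values, or with a weighting that this sketch does not model.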