We propose Points2Sound, a deep learning model that generates a binaural version of a mono audio signal using the 3D point cloud scene of the sound sources in the environment.

### 3.1 Problem formulation

Consider a mono audio signal \(s_{m}\in \mathbb {R}^{1\times T}\) of length *T* samples generated by *N* sources \(s_i\), with

$$\begin{aligned} s_{m}(t)=\sum\limits_{i=1}^{N}s_i(t) \end{aligned}$$

(1)

along with the 3D scene of the sound sources in the 3D space, represented by a set of *I* points \(\mathcal {P}=\{\mathcal {P}_i\}_{i=1}^I\), and the corresponding binaural signal \(s_{b}\in \mathbb {R}^{2\times T}\). We aim to find a model *f* with the structure of a neural network such that \(s_b(t)=f(s_{m}(t), \mathcal {P})\). The binaural version \(s_b(t)\) generated by the *N* sources \(s_i\) is defined as:

$$\begin{aligned} s_{b}^{L,R}(t)=\sum\limits_{i=1}^{N}s_i(t)\circledast \mathrm {HRTF}(\varphi _i, \theta _i, \mathrm {d}_i)|_{\mathrm {left,right}} \end{aligned}$$

(2)

where \(\mathrm {HRTF}(\varphi _i, \theta _i, \mathrm {d}_i)|_{\mathrm {left,right}}\) is the head-related transfer function of the sound incidence at the orientation \((\varphi _i, \theta _i)\) and distance \(\mathrm {d}_i\) of the *i*-th source, for both the left (*L*) and right (*R*) ears. Throughout this work, orientation and distance are defined in a head-related coordinate system, i.e., the coordinate origin is placed at the listener's head. During the training of the model, we consider generic HRTFs measured in an anechoic environment. However, during testing, we also evaluate the model performance under reverberant room conditions by using two binaural room impulse responses. Also, note that in this work only the sound sources' contribution, through the 3D point cloud, is explicitly considered for binauralization. The contributions of the listener and the environment are implicitly considered via the choice of HRTF.
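The rendering in Eq. 2 can be sketched numerically by convolving each source with the head-related impulse responses (HRIRs, the time-domain counterpart of the HRTFs) selected for its position. A minimal sketch, where `sources` and `hrirs` are hypothetical pre-selected arrays:

```python
import numpy as np
from scipy.signal import fftconvolve

def binaural_mix(sources, hrirs):
    """Render Eq. 2: convolve each source with its left/right HRIR and sum.

    sources: list of (T,) mono source signals s_i.
    hrirs:   list of (2, K) head-related impulse responses, one pair per
             source, already selected for (phi_i, theta_i, d_i).
    """
    T = sources[0].shape[0]
    s_b = np.zeros((2, T))
    for s_i, hrir in zip(sources, hrirs):
        for ch in range(2):  # 0 = left, 1 = right
            s_b[ch] += fftconvolve(s_i, hrir[ch])[:T]  # truncate the conv tail
    return s_b
```

Summing the rendered sources, rather than binauralizing the mixture directly, is exactly what makes the task source-separation-like from the model's point of view.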

### 3.2 Points2Sound model architecture

We propose a multi-modal neural network architecture capable of synthesizing binaural audio in an end-to-end fashion: the architecture takes as inputs the 3D point cloud scene along with the raw mono waveform and outputs the corresponding binaural waveform. The architecture comprises a vision network and an audio network. Broadly, the vision network extracts a visual feature from the 3D point cloud scene that serves to condition the audio network for binaural synthesis. Figure 2 shows a schematic diagram of the proposed model.

#### 3.2.1 Sparse tensor for 3D point cloud representation

3D scenes captured by rgb-depth cameras or LiDAR scanners can be represented by a 3D point cloud. A 3D point cloud is a low-level representation of a 3D scene that consists of a collection of *I* points \(\{\mathcal {P}_i\}_{i=1}^I\). The essential information associated with each point \(\mathcal {P}_i\) is its location in space. For example, in a Cartesian coordinate system, each point \(\mathcal {P}_i\) is associated with a triple of coordinates \(\mathbf {c}_i = (x_i, y_i, z_i)\in \mathbb {R}^{3}\) in the *x*, *y*, and *z*-axes. In addition, each point can have associated features \(\mathbf {f}_i\in \mathbb {R}^{n}\), such as its color.

Extracting information from a 3D point cloud using neural networks requires non-standard operations that can handle the sparsity of the 3D data. It is common to represent the 3D point cloud using a sparse tensor [32,33,34] and define operations on that sparse tensor, such as the 3D sparse convolution operation. Sparse tensors require a discretization step so that point cloud coordinates can be defined on the integer grid of the tensor. In this work, the 3D point cloud is represented with a third-order sparse tensor by first discretizing its coordinates using a voxel size \(v_s\), which denotes the discretization step size. The discretized coordinates of each point are given by \(\mathbf {c}_i^\prime = \lfloor \frac{\mathbf {c}_i}{v_s}\rfloor = (\lfloor \frac{x_i}{v_s}\rfloor , \lfloor \frac{y_i}{v_s}\rfloor , \lfloor \frac{z_i}{v_s}\rfloor )\). The resultant tensor representing the point cloud is then given by

$$\begin{aligned} \mathbf {T}[\mathbf {c}_i^\prime ]= \left\{ \begin{array}{ll} \mathbf {f}_i & \mathrm {if}\ \mathbf {c}_{i}^{\prime }\in {C}^{\prime }\\ 0 & \text {otherwise}, \end{array}\right. \end{aligned}$$

(3)

where \({C}^\prime\) is the set of discretized coordinates of the point cloud and \(\mathbf {f}_i\) is the feature associated with the point \(\mathcal {P}_i\). Note that in the following we will evaluate how 3D point cloud attributes affect the binaural audio synthesis. Accordingly, we consider two types of 3D point cloud scenes: scenes consisting of depth-only data and scenes consisting of rgb-depth data. When only depth information is available, we use the non-discretized coordinates as the feature vector associated with each point, i.e., \(\mathbf {f}_i=\mathbf {c}_i\). When rgb-depth information is available, we use the rgb values as the feature vector associated with each point, i.e., \(\mathbf {f}_i=(\text {r}_i, \text {g}_i, \text {b}_i)\).
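Under these definitions, the discretization of Eq. 3 amounts to flooring the coordinates and keeping the features of the occupied voxels. A minimal numpy sketch (the collision policy of keeping the first point per voxel is an assumption; sparse tensor libraries offer other reduction modes):

```python
import numpy as np

def to_sparse(coords, feats, v_s=0.02):
    """Discretize coordinates (Eq. 3) and keep one feature per occupied voxel.

    coords: (I, 3) float xyz positions; feats: (I, n) per-point features
    (the xyz coordinates themselves for depth-only clouds, rgb values for
    rgb-depth clouds).  Returns the integer coordinates C' and their
    features, i.e., the non-zero entries of the sparse tensor T.
    """
    c_prime = np.floor(coords / v_s).astype(np.int64)
    # keep the first point falling into each voxel
    _, idx = np.unique(c_prime, axis=0, return_index=True)
    return c_prime[idx], feats[idx]
```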

#### 3.2.2 3D sparse convolution on a 3D sparse tensor

The 3D sparse convolution is a generalized version of the conventional dense convolution designed to operate on a 3D sparse tensor [35].

The 3D sparse convolution on a 3D sparse tensor is defined as follows:

$$\begin{aligned} \mathbf {T}^{\text {out}}[x, y, z]= \sum\limits_{p,j,k\in \mathcal {N}(x,y,z)} \mathbf {W}[p,j,k]\mathbf {T}^{\text {in}}[x+p,y+j,z+k] \end{aligned}$$

(4)

for \((x,y,z)\in {C}^\prime _\mathrm {out}\), where \(\mathcal {N}(x,y,z)=\{(p,j,k)\,|\,|p|\le \Gamma , |j|\le \Gamma , |k|\le \Gamma , (p+x,j+y,k+z)\in {C}^\prime _\mathrm {in}\}\), \(\mathbf {W}\) denotes the weights of the 3D convolutional kernel, and \(2\Gamma +1\) is the convolution kernel size. \({C}^\prime _\mathrm {in}\) and \({C}^\prime _\mathrm {out}\) are predefined input and output discretized coordinates of the sparse tensors [35].
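For illustration, Eq. 4 can be implemented naively with dictionaries mapping occupied coordinates to feature vectors; only kernel offsets that land on occupied input coordinates contribute, which is what makes the operation sparse. This is a sketch of the definition, not the optimized implementation used in practice:

```python
import numpy as np

def sparse_conv3d(T_in, W, out_coords):
    """Naive 3D sparse convolution (Eq. 4) on dict-based sparse tensors.

    T_in:       {(x, y, z): (n_in,) feature vector} for coordinates C'_in.
    W:          {(p, j, k): (n_out, n_in) weight matrix}, with offsets
                |p|, |j|, |k| <= Gamma (kernel size 2*Gamma + 1).
    out_coords: iterable of output coordinates C'_out.
    """
    T_out = {}
    for (x, y, z) in out_coords:
        acc = None
        for (p, j, k), w in W.items():
            nbr = T_in.get((x + p, y + j, z + k))
            if nbr is None:  # offset not in C'_in: excluded from N(x, y, z)
                continue
            acc = w @ nbr if acc is None else acc + w @ nbr
        if acc is not None:
            T_out[(x, y, z)] = acc
    return T_out
```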

#### 3.2.3 Vision network

The vision network consists of a Resnet18 [36] architecture with 3D sparse convolutions [35] that extracts a visual feature from the 3D point cloud scene. Resnet18 with 3D sparse convolutions has been successfully used in several tasks such as 3D semantic segmentation [35] or 3D single-shot object detection [34]. Thus, we consider sparse Resnet18 suitable for our scenario, where extracting information about the position of the sources while recognizing the type of source is critical for reliable binaural synthesis. Sparse Resnet18 learns at different scales by halving the feature space after every two residual blocks and doubling the receptive field by using a stride of 2. A key characteristic of residual blocks is their residual connections, which propagate the input through later parts of the block by skipping some layers. Throughout the network, ReLU is used as the activation function and batch normalization is applied after the sparse convolutions. On top of the fourth Resnet18 block, we add a 3×3×3 sparse convolution with \(K=16\) output channels and apply a max-pooling operation to adjust the dimensions of the extracted visual feature \(\mathbf {h}\).

#### 3.2.4 Audio network

We adapt the Demucs architecture [9] to synthesize binaural versions \(\hat{s}_b\) from mono audio signals \(s_m\) given the 3D scene of the sound sources \(\mathcal {P}\). Although Demucs was originally designed for source separation, we find it appropriate for binaural synthesis because the model needs to intrinsically learn to separate the sound of the sources in the mono mixture for further rendering (see Eq. 2). Demucs works in the waveform domain and has a U-Net-like structure [37] (see Fig. 2). The encoder-decoder structure learns multi-resolution features from the raw waveform while skip connections allow low-level information to be propagated through the network. In the current case, skip connections allow later decoder blocks of the network to access information related to the phase of the input signal, which otherwise may be lost when propagated through the network. In this work, we keep the original six convolution blocks for both the encoder and decoder but extend the architecture so that the input and output channels match our mono and binaural signals.

#### 3.2.5 Conditioning

We use a global conditioning approach on the audio network to guide the binaural synthesis according to the 3D scene. Global conditioning was introduced in Wavenet [38] and has recently been used in the Demucs architecture for separation purposes using one-hot vectors [39]. In a similar way, we take the visual feature \(\mathbf {h}\) extracted by the vision network and insert it into each encoder and decoder block of the audio network. Specifically, the visual feature is inserted after being multiplied by a learnable linear projection \(\mathbf {V}_{\cdot , q, \cdot }\). As in [39], the Demucs encoder and decoder take the following expression:

$$\begin{aligned} \texttt {Encoder}_{q+1}&= \text {GLU}(\mathbf {W}_{\text {encoder},q,2} *\text {ReLU}(\mathbf {W}_{\text {encoder},q,1} *\nonumber \\ {}&\texttt {Encoder}_{q} + \mathbf {V}_{\text {encoder},q,1} \mathbf {h}) + \mathbf {V}_{\text {encoder},q,2} \mathbf {h}),\end{aligned}$$

(5)

$$\begin{aligned} \texttt {Decoder}_{q-1}&= \text {ReLU}(\mathbf {W}_{\text {decoder},q,2} *^\top \text {GLU}(\mathbf {W}_{\text {decoder},q,1} *\nonumber \\&(\texttt {Encoder}_{q} + \texttt {Decoder}_{q}) + \mathbf {V}_{\text {decoder},q,1} \mathbf {h})\nonumber \\&+ \mathbf {V}_{\text {decoder},q,2} \mathbf {h}). \end{aligned}$$

(6)

where \(\texttt {Encoder}_{q+1}\) and \(\texttt {Decoder}_{q-1}\) are the outputs of the \(q\text {-th}\) encoder and decoder blocks, respectively, and \(\mathbf {W}_{\cdot , q, \cdot }\) are the 1-D kernel weights at the \(q\text {-th}\) block. Rectified linear unit (ReLU) and gated linear unit [40] (GLU) are the corresponding activation functions. The operator \(*\) denotes the 1-D convolution while \(*^\top\) corresponds to a transposed convolution operation, as commonly defined in deep learning frameworks [41].
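The encoder update of Eq. 5 can be sketched in PyTorch as a convolution block whose activations receive the projected visual feature as a per-channel bias. Kernel size, stride, and channel counts below are illustrative placeholders, not the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondEncoderBlock(nn.Module):
    """One Demucs-style encoder block with global conditioning (Eq. 5).

    The global visual feature h is projected by learnable matrices
    (V_1, V_2) and added as a per-channel bias after each convolution.
    """
    def __init__(self, ch_in, ch_out, h_dim, kernel=8, stride=4):
        super().__init__()
        self.conv1 = nn.Conv1d(ch_in, ch_out, kernel, stride)   # W_1
        self.conv2 = nn.Conv1d(ch_out, 2 * ch_out, 1)           # W_2 (GLU halves channels)
        self.v1 = nn.Linear(h_dim, ch_out, bias=False)          # V_1
        self.v2 = nn.Linear(h_dim, 2 * ch_out, bias=False)      # V_2

    def forward(self, x, h):
        # x: (batch, ch_in, T) waveform features; h: (batch, h_dim) visual feature
        y = F.relu(self.conv1(x) + self.v1(h).unsqueeze(-1))
        y = self.conv2(y) + self.v2(h).unsqueeze(-1)
        return F.glu(y, dim=1)
```

The decoder block of Eq. 6 mirrors this structure with a transposed convolution and the skip connection added to its input.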

#### 3.2.6 Learning objective

During the training of Points2Sound, the parameters of both the vision and audio networks are optimized to minimize the L1 loss between the estimated binaural signal \(\hat{s}_b^{L,R}\) and the ground truth binaural signal \(s_b^{L,R}\). The L1 loss computes the absolute error between the estimated and the ground truth waveform samples. We refer to this learning objective as

$$\begin{aligned} \mathcal {L}_{\mathrm {full}} = \Vert s_{b}^{L,R} - \hat{s}_{b}^{L,R}\Vert _1. \end{aligned}$$

(7)

Note that in Section 4.3 we will investigate the effect of another learning objective on the performance of Points2Sound.

### 3.3 Data

While audio datasets are abundant, 3D point cloud videos of performing musicians are scarce. For the purposes of this work, we capture 3D videos of twelve performers playing different instruments: cello, double bass, guitar, saxophone, and violin. In addition, we separately collect audio recordings of these instruments from existing audio datasets. These data later serve to generate 3D audio-visual scenes for supervised learning.

#### 3.3.1 Point clouds

Recordings were conducted using an Azure Kinect DK (by Microsoft) placed 1 m above the floor and capturing a frontal view of the musician at a distance of 2 m. The Azure Kinect DK comprises a depth camera and a color camera. The depth camera captured a \(75^{\circ } \times 65^{\circ }\) field of view at a \(640\times 576\) resolution, while the color camera captured at a \(1920\times 1080\) resolution. Both cameras recorded at 15 fps, and the Open3D library [42] was then used to align the depth and color streams and generate a point cloud for each frame. The full 3D video recordings span 1 h in total, with an average of 12 performers per instrument.

We augment our 3D point cloud video dataset by collecting 3D videos from the *small ensemble 3D-video database* [43] and *Panoptic Studio* [44]. In the *small ensemble 3D-video database*, recordings were carried out using three RGB-Depth Kinect v2 sensors; the LiveScan3D [45] and OpenCV libraries were then used to align the sensor data from each camera viewpoint and generate point clouds for each frame. The recordings average 5 min per instrument, with a single performer per instrument. In *Panoptic Studio*, recordings were carried out using ten Kinect sensors and span two instrument categories: cello and guitar, averaging 2 min per instrument with a single performer per instrument. As we gather 3D point cloud videos from different sources, we align the axes of all collected 3D videos to a common convention: the body-facing direction is the *z*-axis, the stature direction is the *y*-axis, and the side direction is the *x*-axis. Then, we split 75% of the data for training, 15% for validation, and the remaining 10% for testing, ensuring that there is no overlap in performer identities between sets.

#### 3.3.2 Audio

We collect 30 h of mono audio recordings at 44.1 kHz from *Solos* [46] and *Music* [25]. Both datasets gather music from YouTube which ensures a variety of acoustic conditions. In total, we gather 72 recordings per instrument with an average of 5 min per recording. We split 75% of the recordings for training, 15% for validation, and the remaining 10% for testing.

For further binaural auditory scene generation, we also create multiple binaural versions of each recording using the *Two!Ears* Binaural Simulator [47]. Specifically, for each audio recording, we simulate the binaural version at a discrete set of angular positions in the horizontal plane with no elevation, i.e., \((\varphi _k, \theta ):= (\frac{k\pi }{4} ,0)\) for \(k=0,\ldots ,7\). For binaural auditory modeling, we use HRTFs at a source-receiver distance of 1 m, measured with a KEMAR manikin (type 45BA) in the anechoic chamber of the TU Berlin [48].

### 3.4 Audio-visual 3D scene generation

We synthetically create mono mixtures, 3D scenes, and the corresponding binaural version to train the model in a supervised fashion.

For each instance, we randomly select *N* sources and *N* angular positions, with *N* chosen uniformly between 1 and 3. The binaural mixture, which serves as supervision, is created following Eq. 2: first, we select a 3-s-long binaural signal for each sound source in the mix based on its angular position, and then we sum all selected binaural signals to create the binaural mixture.

For the 3D scene, we first select individual musician point clouds corresponding to these sources. Then, the musician point clouds are located at their corresponding angular positions, at a random distance ranging from 1 to 3 m from the listener's head in the 3D space. Finally, all *N* musician point clouds are merged to create a single 3D point cloud scene. Note that we generate binaural versions using HRTFs measured at a 1-m distance from the listener's head but locate the sources at distances ranging from 1 to 3 m. This simplification relies on the fact that distance has little influence on the shape of HRTFs for source-receiver distances greater than 1 m [49].
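The scene composition step can be sketched as a rotation and translation of each musician cloud. The exact rotation convention below is an assumption (the text only fixes the axis meanings: *x* side, *y* stature, *z* body-facing); the listener's head is taken as the origin:

```python
import numpy as np

def place_musician(coords, phi, d):
    """Place a musician point cloud at azimuth phi (rad) and distance d (m)
    from the listener's head at the origin.  Sketch only: rotating about
    the stature (y) axis keeps the body facing the listener."""
    c, s = np.cos(phi), np.sin(phi)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    rotated = coords @ R.T
    return rotated + np.array([d * np.sin(phi), 0.0, d * np.cos(phi)])

def make_scene(musicians, phis, dists):
    """Merge N placed musician clouds into a single 3D point cloud scene."""
    return np.concatenate([place_musician(m, p, d)
                           for m, p, d in zip(musicians, phis, dists)])
```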

During the training of the model, each individual musician point cloud is independently augmented in both coordinates and color. We randomly shear and translate the coordinates of each musician in the scene. Shearing is applied along all axes and the shear elements are sampled from a Normal distribution \(\mathcal {N}(0, 0.1^2)\). Translation is applied along the stature direction and the translation offset is sampled from \(\mathcal {N}(0, 0.2^2)\).

Regarding color, we distort the brightness and intensity of each sound source in the scene. Specifically, we apply color distortion to each point by adding Gaussian noise sampled from \(\mathcal {N}(0, 0.05^2)\) to each rgb color channel. We also alter the color value and color saturation by random amounts uniformly sampled from −0.2 to 0.2 and from −0.15 to 0.15, respectively. Figure 3 illustrates the different augmentation operations applied. After augmentation, each scene is represented as a sparse tensor by discretizing the point cloud coordinates using a voxel size of 0.02 m. We select a small voxel size as it has been shown to work better than larger ones for several 3D visual tasks [35].
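The coordinate and color augmentations above can be sketched as follows, using the stated sampling ranges (value/saturation jitter is omitted for brevity; applying the shear to all off-diagonal elements and clipping colors to [0, 1] are assumptions):

```python
import numpy as np

def augment(coords, colors, rng):
    """Per-musician augmentation sketch: random shear N(0, 0.1^2) on all
    axes, translation N(0, 0.2^2) along the stature (y) axis, and additive
    rgb noise N(0, 0.05^2) per channel."""
    # shear matrix: identity with off-diagonal elements ~ N(0, 0.1^2)
    S = np.eye(3)
    S[~np.eye(3, dtype=bool)] = rng.normal(0.0, 0.1, size=6)
    coords = coords @ S.T
    # translation offset along the stature direction
    coords[:, 1] += rng.normal(0.0, 0.2)
    # Gaussian color noise, clipped back to the valid rgb range
    colors = np.clip(colors + rng.normal(0.0, 0.05, size=colors.shape), 0.0, 1.0)
    return coords, colors
```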

### 3.5 Implementation details

Initially, we pretrain the vision network to facilitate the subsequent learning process. Pretraining is done on the ModelNet40 [50] 3D object classification task. Since ModelNet40 consists of 3D CAD models, we sample point clouds from the mesh surfaces of the object shapes. For the pretraining, we also discretize the coordinates using a voxel size of 0.02 m. Then, the Points2Sound vision and audio networks are jointly trained for 120k iterations using the Adam [51] optimizer. We use a batch size of 40 samples and set the learning rate to \(1\times 10^{-4}\). We select the weights corresponding to the lowest validation loss after the training process. Training and testing are conducted on a single Titan RTX GPU. The training stage takes about 72 h, while inference takes 0.115 s to binauralize 10 s of mono audio (averaged over 300 samples). We use the Minkowski Engine [35] for the sparse tensor operations and PyTorch [41] for all other operations.
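A minimal sketch of one joint training iteration under these settings (Adam, L1 waveform loss of Eq. 7); `vision_net` and `audio_net` are hypothetical stand-ins for the sparse Resnet18 and the conditioned Demucs:

```python
import torch
import torch.nn.functional as F

def train_step(vision_net, audio_net, optimizer, points, s_m, s_b):
    """One joint optimization step over both networks.

    points: 3D point cloud scene; s_m: mono input (batch, 1, T);
    s_b: ground truth binaural signal (batch, 2, T).
    """
    optimizer.zero_grad()
    h = vision_net(points)           # global visual feature
    s_b_hat = audio_net(s_m, h)      # estimated binaural signal
    loss = F.l1_loss(s_b_hat, s_b)   # Eq. 7
    loss.backward()
    optimizer.step()
    return loss.item()
```

The optimizer is built over the parameters of both networks, e.g. `torch.optim.Adam(list(vision_net.parameters()) + list(audio_net.parameters()), lr=1e-4)`.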