Points2Sound: From mono to binaural audio using 3D point cloud scenes

For immersive applications, the generation of binaural sound that matches its visual counterpart is crucial to bring meaningful experiences to people in a virtual environment. Recent studies have shown the possibility of using neural networks for synthesizing binaural audio from mono audio by using 2D visual information as guidance. Extending this approach by guiding the audio with 3D visual information and operating in the waveform domain may allow for a more accurate auralization of a virtual audio scene. We propose Points2Sound, a multi-modal deep learning model which generates a binaural version from mono audio using 3D point cloud scenes. Specifically, Points2Sound consists of a vision network and an audio network. The vision network uses 3D sparse convolutions to extract a visual feature from the point cloud scene. Then, the visual feature conditions the audio network, which operates in the waveform domain, to synthesize the binaural version. Results show that 3D visual information can successfully guide multi-modal deep learning models for the task of binaural synthesis. We also investigate how 3D point cloud attributes, learning objectives, different reverberant conditions, and several types of mono mixture signals affect the binaural audio synthesis performance of Points2Sound for the different numbers of sound sources present in the scene.


Introduction
People perceive the world through multiple senses that jointly collaborate to understand the environment. While visual stimuli are important for spatial cognition, auditory stimuli are particularly critical: being able to hear instantly from all directions helps people orient themselves in space and influences their visual attention [5,35]. As auditory stimuli are received by both ears, our brain locates sound sources in space by comparing the signals received at each ear. This process, known as binaural hearing, relies mainly on two acoustic cues: the interaural time difference (ITD) and the interaural level difference (ILD). The ITD is the difference in the arrival time of a sound between the ears, and the ILD is the difference in its sound intensity. In the median plane, i.e. the vertical plane between the ears, ITD and ILD are both small, and we rely on spectral cues to locate sources [1]. All such acoustic cues can be described by the head-related transfer function (HRTF), which encodes the sound distortions caused by the geometry of the head and torso [38].
In immersive applications, the generation of accurate binaural acoustic cues that match the visual counterpart is key to providing people with meaningful experiences in the virtual environment. These acoustic cues, ITD and ILD, strongly depend on the 3D positions of the sound sources relative to the receiver. Recently, several methods using neural networks have been proposed for generating binaural audio from mono audio, using 2D visual information as guidance [11,24,48,52]. However, using 2D visual information inherently restricts the neural network's ability to extract information about the 3D positions of the sound sources relative to the receiver. Not having access to this potentially useful information entails the risk of finding a sub-optimal solution for the task of binaural generation. In addition, these recent methods extract visual information by applying 2D dense convolutions to planar projections of the scene. This process forces the 2D convolutional filters to attend to local planar-projection regions with no relationship to physical space, possibly hindering the audio-visual learning necessary for the binaural synthesis task.
In this paper we introduce Points2Sound, a multi-modal neural network that synthesizes binaural audio from mono audio using 3D visual information as guidance (see Fig. 1). For the visual learning, we propose using 3D point clouds as visual information and 3D sparse convolutions for extracting information from them. This approach enables the model to extract information about the 3D positions of the sound sources while the convolutional filters attend to 3D data structures in local regions of 3D space. For the audio learning, Points2Sound builds on advances in neural audio modeling in the waveform domain. We extend the Demucs architecture [7] and show that this model can be effectively conditioned using visual information. Although Demucs was originally designed for source separation, we find it appropriate for binaural synthesis given that the model needs to intrinsically separate the sound of the sources in the mono mixture for further binaural rendering. This study thus analyzes the performance of Points2Sound for different 3D point cloud attributes, learning objectives, reverberant conditions, and types of mono mixtures.
The main contributions of this work are the following:
• We introduce the use of 3D point clouds to condition audio signals. By using 3D visual information and 3D sparse convolutions, the neural network can learn the correspondence between audio characteristics (e.g. spatial cues or timbre) and 3D structures found in local regions of the 3D space, a correspondence relevant for binaural audio synthesis.
• We tackle visually-informed binaural audio generation directly in the waveform domain, thereby optimizing our model in an end-to-end fashion and allowing it to learn audio features that are not limited by a fixed resolution of a spectrogram representation.
• We evaluate how 3D point cloud attributes, i.e. depth or rgb-depth, several types of mono mixture signals, the effect of the room, and different learning objectives affect binaural audio synthesis performance for different numbers of sound sources in the scene.
• We provide a dataset of the captured 3D-video point clouds, videos with listening examples, and the source code for reproducibility purposes.
This paper is organized as follows: Section 2 provides a brief overview of the related work. Section 3 details the neural network architecture, the training procedure, and the data used. Section 4 presents the evaluation metrics and the obtained results. Section 5 discusses the results, and Section 6 concludes.

Related work
We provide a brief overview of related work on audio-visual source separation and audio-visual spatial audio generation.

Audio-visual Source Separation:
Source separation has traditionally been approached using only audio signals [2,15], with methods such as independent component analysis [17], sparse coding [29], or non-negative matrix factorization [22]. Recently, audio source separation has experienced significant progress due to the application of deep learning methods [25,37,40,49]. Current trends include performing source separation in the waveform domain [7,23,39] or preserving binaural cues during the separation process [14,41]. In addition, deep learning methods have facilitated the inclusion of visual information to guide the audio separation [8,32]. In the case of music source separation using visual information, learning methods mainly use appearance cues from 2D visual representations [10,51], but have also been enhanced with motion information [9,50]. Interestingly, audio-visual source separation models intrinsically learn to map sounds to their corresponding positions in the visual representation. This has encouraged the use of visual information for spatial audio generation [52].

Audio-visual Spatial Audio Generation:
With the recent advances in audio modeling using neural networks, end-to-end deep learning approaches using explicit information about the position and orientation of the sources in 3D space have been proposed for binaural audio synthesis [12,34]. These approaches require head-tracking equipment to know the pose of the receiver and the sound source in the environment. Concurrently, audio-visual learning for spatial audio generation has gained interest, and several methods have been proposed to infer spatial acoustic cues for mono audio by leveraging visual information. Morgado et al. [28] proposed a learning method to generate the spatial version of mono audio guided by 360° videos. Their approach predicts the spatial audio in ambisonics format, which can later be decoded to binaural audio for reproduction through headphones. Gao et al. [11] show that directly predicting the binaural audio creates better 3D sound sensations. They propose a U-Net-like framework for mono-to-binaural conversion using normal field-of-view videos. Since then, binauralization models using 2D visual information have been enhanced with different approaches, such as adding an auxiliary classifier [24] or integrating the source separation task into the overall binaural generation framework [52]. In addition, it has been shown that features from models pretrained on audio-visual spatial alignment tasks are beneficial for audio binauralization [48]. Note that many of these approaches use audio spectrogram representations and train their networks with mono mixture signals represented as the sum of the two spatial channels [11,24,48,52]. It remains unclear how operating in the waveform domain and using other mono representations may affect the binaural synthesis performance.

Approach
We propose Points2Sound, a deep learning algorithm capable of generating a binaural version from mono audio using the 3D point cloud scene of the sound sources in the environment.

Problem Formulation
Consider a mono audio signal s_m ∈ ℝ^{1×T} of length T samples generated by N sources s_i, together with the 3D scene of the sound sources represented by a set of I points P = {P_i}_{i=1}^{I}, and the corresponding binaural signal s_b ∈ ℝ^{2×T}. We aim at finding a model f with the structure of a neural network such that s_b(t) = f(s_m(t), P). The binaural version s_b(t) generated by the N sources s_i is defined as

s_b^{L,R}(t) = Σ_{i=1}^{N} HRTF(φ_i, θ_i, d_i)|_{L,R} ∗ s_i(t),    (2)

where HRTF(φ_i, θ_i, d_i)|_{L,R} is the head-related transfer function of the sound incidence at the i-th source orientation (φ_i, θ_i) and distance d_i for both the left (L) and right (R) ears, and ∗ denotes time-domain convolution. Throughout this work, orientation and distance are defined in a head-related coordinate system, i.e. the center of coordinates is the head of the listener. During the training of the model we consider generic HRTFs measured in an anechoic environment.
However, during testing we also evaluate the model performance under reverberant room conditions by using two binaural room impulse responses. Also note that in this work only the sound sources are explicitly represented, through the 3D point cloud, for binauralization; the contributions of the listener and the environment are implicitly considered via the choice of HRTF.
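For illustration, the rendering in Eq. (2) can be sketched as a per-source convolution with head-related impulse responses followed by a sum over sources. The helper load_hrir below is hypothetical (it stands for a lookup into a measured HRTF set such as [44]), and the sketch assumes anechoic conditions:

```python
# Minimal sketch of Eq. (2): each mono source is convolved with the left/right
# head-related impulse responses for its incidence angle and the results are summed.
# `load_hrir` is a hypothetical lookup, not part of the paper's code.
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(sources, orientations, load_hrir):
    """sources: list of (T,) mono signals; orientations: list of (azimuth, elevation)."""
    T = len(sources[0])
    s_b = np.zeros((2, T))                      # rows: left, right
    for s_i, (az, el) in zip(sources, orientations):
        h_l, h_r = load_hrir(az, el)            # HRIRs for this source direction
        s_b[0] += fftconvolve(s_i, h_l)[:T]     # left-ear contribution
        s_b[1] += fftconvolve(s_i, h_r)[:T]     # right-ear contribution
    return s_b
```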

Points2Sound Model Architecture
We propose a multi-modal neural network architecture capable of synthesizing binaural audio in an end-to-end fashion: the architecture takes as inputs the 3D point cloud scene along with the raw mono waveform and outputs the corresponding binaural waveform. The architecture comprises a vision network and an audio network. Broadly, the vision network extracts a visual feature from the 3D point cloud scene that serves to condition the audio network for binaural synthesis. Figure 2 shows a schematic diagram of the proposed model.

Sparse Tensor for 3D Point Cloud Representation
3D scenes captured by rgb-depth cameras or LiDAR scanners can be represented by a 3D point cloud. A 3D point cloud is a low-level representation of a 3D scene that consists of a collection of I points P = {P_i}_{i=1}^{I}. The essential information associated with each point P_i is its location in space. For example, in a Cartesian coordinate system each point P_i is associated with a triplet of coordinates c_i = (x_i, y_i, z_i) ∈ ℝ³ along the x-, y-, and z-axes. In addition, each point can have associated features f_i ∈ ℝⁿ, such as its color.
Extracting information from a 3D point cloud using neural networks requires non-standard operations that can handle the sparsity of the 3D data. It is common to represent the 3D point cloud information using a sparse tensor [4,13,47] and to define operations on that sparse tensor, such as the 3D sparse convolution operation. Note that sparse tensors require a discretization step that enables point cloud coordinates to be defined in the integer grid of the sparse tensor. In this work, the 3D point cloud is represented with a third-order tensor by first discretizing its coordinates using a voxel size v_s. The voxel size v_s denotes the discretization step size and allows point cloud coordinates to be defined in the integer grid of the tensor. The discretized coordinates of each point are given by

ĉ_i = ⌊c_i / v_s⌋ = (⌊x_i / v_s⌋, ⌊y_i / v_s⌋, ⌊z_i / v_s⌋).

Then, the resulting tensor representing the point cloud is given by

T[ĉ_i] = f_i if ĉ_i ∈ C, and 0 otherwise,

where C is the set of discretized coordinates of the point cloud and f_i is the feature associated with the point P_i. Note that in the following we will evaluate how 3D point cloud attributes affect the binaural audio synthesis. Accordingly, we will consider two types of 3D point cloud scenes: scenes consisting of depth-only data and scenes consisting of rgb-depth data. When only depth information is available, we use the non-discretized coordinates as the feature vectors associated with each point, i.e. f_i = c_i. When rgb-depth information is available, we use the rgb values as the feature vectors associated with each point, i.e. f_i = (r_i, g_i, b_i).
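As a concrete illustration of the discretization step, the following sketch voxelizes a point cloud and keeps one feature per occupied voxel; the deduplication strategy shown here is an assumption, and libraries such as the Minkowski Engine provide equivalent quantization utilities:

```python
# Sketch of the voxelization described above: metric coordinates are divided by the
# voxel size and floored to integers; duplicate voxels keep a single feature vector.
import numpy as np

def to_sparse_tensor(coords, feats, voxel_size=0.02):
    """coords: (I, 3) float coordinates in meters; feats: (I, n) per-point features."""
    disc = np.floor(coords / voxel_size).astype(np.int32)   # integer grid coordinates
    _, keep = np.unique(disc, axis=0, return_index=True)    # one point per occupied voxel
    return disc[keep], feats[keep]

# depth-only attributes: the non-discretized coordinates serve as features, f_i = c_i
# coords_hat, feats = to_sparse_tensor(coords, coords)
# rgb-depth attributes: the colors serve as features, f_i = (r_i, g_i, b_i)
# coords_hat, feats = to_sparse_tensor(coords, rgb)
```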

3D Sparse Convolution on a 3D Sparse Tensor
The 3D sparse convolution is a generalized version of the conventional dense convolution designed to operate on a 3D sparse tensor [3]. The 3D sparse convolution on a 3D sparse tensor is defined as follows:

x_u^out = Σ_{i ∈ N³(u, C^in)} W_i x_{u+i}^in   for u ∈ C^out,

where W are the weights of the 3D convolutional kernel, N³(u, C^in) = {i | u + i ∈ C^in, i ∈ [−Γ, Γ]³} is the set of kernel offsets around u that fall on occupied input coordinates, and 2Γ + 1 is the convolution kernel size. C^in and C^out are predefined input and output discretized coordinates of the sparse tensors [3].
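The following toy implementation illustrates this definition for intuition only; it gathers inputs over the kernel neighborhood that actually exists in the sparse tensor, rather than using the optimized gather/scatter kernels of sparse convolution libraries:

```python
# Reference (unoptimized) generalized sparse convolution over integer coordinates.
# W maps each kernel offset (a tuple in [-gamma, gamma]^3) to a (C_in, C_out) matrix.
import itertools
import numpy as np

def sparse_conv3d(coords_in, feats_in, coords_out, W, gamma=1):
    """coords_*: (N, 3) int arrays; feats_in: (N, C_in) array."""
    table = {tuple(c): f for c, f in zip(coords_in.tolist(), feats_in)}
    c_out = next(iter(W.values())).shape[1]
    out = np.zeros((len(coords_out), c_out))
    offsets = list(itertools.product(range(-gamma, gamma + 1), repeat=3))
    for n, u in enumerate(coords_out.tolist()):
        for off in offsets:
            key = (u[0] + off[0], u[1] + off[1], u[2] + off[2])
            if key in table:                      # skip empty voxels
                out[n] += table[key] @ W[off]
    return out
```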

Vision Network
The vision network consists of a Resnet18 [16] architecture with 3D sparse convolutions [3] that extracts a visual feature from the 3D point cloud scene. Resnet18 with 3D sparse convolutions has been successfully used in several tasks such as 3D semantic segmentation [3] or 3D single-shot object detection [13]. Thus, we consider sparse Resnet18 suitable for our scenario, where extracting information about the position of the sources while recognizing the type of source is critical for reliable binaural synthesis. Sparse Resnet18 learns at different scales by halving the feature space after every two residual blocks and doubling the receptive field by using a stride of 2. A key characteristic of residual blocks is their residual connections, which allow the input data to be propagated to later parts of the block by skipping some layers. Throughout the network, ReLU is used as the activation function and batch normalization is applied after the sparse convolutions. On top of the fourth Resnet18 block, we add a 3×3×3 sparse convolution with K = 16 output channels and apply a max-pooling operation to adjust the dimensions of the extracted visual feature h.
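A sketch of the head added on top of the sparse Resnet18 backbone is given below, assuming the Minkowski Engine API (class and argument names should be checked against the installed version); the 512-channel backbone output is an assumption based on the standard Resnet18 layout, while K = 16 follows the text:

```python
# Sketch of the vision-network head: a 3x3x3 sparse convolution with K = 16 output
# channels followed by global max pooling, producing one visual feature h per scene.
# The sparse Resnet18 backbone itself is omitted.
import torch.nn as nn
import MinkowskiEngine as ME

class VisionHead(nn.Module):
    def __init__(self, backbone_channels=512, K=16):   # 512 is an assumed backbone width
        super().__init__()
        self.conv = ME.MinkowskiConvolution(backbone_channels, K,
                                            kernel_size=3, dimension=3)
        self.pool = ME.MinkowskiGlobalMaxPooling()

    def forward(self, x):             # x: ME.SparseTensor from the backbone
        h = self.pool(self.conv(x))   # sparse tensor with one row per scene in the batch
        return h.F                    # dense (batch, K) visual feature
```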

Audio Network
We adapt the Demucs architecture [7] to synthesize binaural versions ŝ_b from mono audio signals s_m given the 3D scene of the sound sources P. Although Demucs was originally designed for source separation, we find it appropriate for binaural synthesis because the model needs to intrinsically learn to separate the sound of the sources in the mono mixture for further rendering (see Eq. 2). Demucs works in the waveform domain and has a U-Net-like structure [36] (see Fig. 2). The encoder-decoder structure learns multi-resolution features from the raw waveform, while skip connections allow low-level information to be propagated through the network. In the current case, skip connections allow later decoder blocks of the network to access information related to the phase of the input signal, which might otherwise be lost when propagated through the network. In this work, we keep the original six convolution blocks for both the encoder and the decoder but extend the architecture so that the input and output channels match our mono and binaural signals.

Conditioning
We use a global conditioning approach on the audio network to guide the binaural synthesis according to the 3D scene. Global conditioning was introduced in Wavenet [30] and has recently been used in the Demucs architecture for separation purposes using one-hot vectors [18]. In a similar way, we use the extracted visual feature h from the vision network and insert it into each encoder and decoder block of the audio network. Specifically, the visual feature is inserted after being multiplied by a learnable linear projection V_{·,q,·}. As in [18], the Demucs encoder and decoder take the following expressions:

Encoder_{q+1} = GLU(W_{encoder,q,2} ∗ ReLU(W_{encoder,q,1} ∗ Encoder_q + V_{encoder,q,1} h) + V_{encoder,q,2} h),

Decoder_{q−1} = ReLU(W_{decoder,q,2} ∗̄ GLU(W_{decoder,q,1} ∗ (Encoder_q + Decoder_q) + V_{decoder,q,1} h) + V_{decoder,q,2} h),

where Encoder_{q+1} and Decoder_{q−1} are the outputs of the q-th level encoder and decoder blocks, respectively, and W_{·,q,·} are the 1-D kernel weights at the q-th block. Rectified Linear Unit (ReLU) and Gated Linear Unit (GLU) [6] are the corresponding activation functions. The operator ∗ denotes the 1-D convolution, while ∗̄ corresponds to the transposed convolution operation, as commonly defined in deep learning frameworks [33].
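To make the conditioning concrete, the following PyTorch sketch shows one globally-conditioned encoder level following the expression above; channel sizes, kernel width, and stride are placeholders rather than the exact Demucs hyperparameters, and a decoder level mirrors this structure with a transposed convolution (nn.ConvTranspose1d):

```python
# Schematic sketch of one conditioned encoder level: the visual feature h is linearly
# projected and added (broadcast over time) after each 1-D convolution.
import torch.nn as nn
import torch.nn.functional as F

class ConditionedEncoderBlock(nn.Module):
    def __init__(self, c_in, c_out, vis_dim=16, kernel=8, stride=4):
        super().__init__()
        self.conv1 = nn.Conv1d(c_in, c_out, kernel, stride)
        self.conv2 = nn.Conv1d(c_out, 2 * c_out, 1)   # channel doubling for the GLU split
        self.v1 = nn.Linear(vis_dim, c_out)           # learnable projection V_{encoder,q,1}
        self.v2 = nn.Linear(vis_dim, 2 * c_out)       # learnable projection V_{encoder,q,2}

    def forward(self, x, h):
        # x: (batch, c_in, T) waveform features; h: (batch, vis_dim) visual feature
        y = F.relu(self.conv1(x) + self.v1(h).unsqueeze(-1))
        y = F.glu(self.conv2(y) + self.v2(h).unsqueeze(-1), dim=1)
        return y
```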

Learning Objective
During the training of Points2Sound, the parameters of both the vision and audio networks are optimized to reduce the L1 loss between the estimated binaural signal ŝ_b^{L,R} and the ground truth binaural signal s_b^{L,R}. The L1 loss computes the absolute error between the estimated and the ground truth waveform samples. We refer to this learning objective as

L_full = ‖ŝ_b^{L,R} − s_b^{L,R}‖_1.

Note that in Section 4.3 we will investigate the effect of another learning objective on the performance of Points2Sound.

Data
While many audio datasets exist, 3D point cloud videos of performing musicians are scarce. For the purposes of this work, we capture 3D videos of twelve performers playing different instruments: cello, doublebass, guitar, saxophone, and violin. In addition, we separately collect audio recordings of these instruments from existing audio datasets. These data later serve to generate 3D audio-visual scenes for supervised learning.

Point Clouds
Recordings were conducted using an Azure Kinect DK (by Microsoft) placed one meter above the floor and capturing a frontal view of the musician at a distance of two meters. The Azure Kinect DK comprises a depth camera and a color camera. The depth camera captured a 75° × 65° field of view at a 640 × 576 resolution, while the color camera captured at a 1920 × 1080 resolution. Both cameras recorded at 15 fps, and the Open3D library [53] was then used to align the depth and color streams and generate a point cloud for each frame. The full 3D video recordings span 1 hour of duration, with an average of 12 minutes for each instrument.
We extend our 3D point cloud video dataset by collecting 3D videos from the small ensemble 3D-video database [42] and Panoptic Studio [19]. In the small ensemble 3D-video database, recordings are carried out using three RGB-Depth Kinect v2 sensors. The LiveScan3D [21] and OpenCV libraries are then used to align the sensor data and generate point clouds for each frame from each camera's point of view. The average recording length is 5 minutes per instrument, with a single performer per instrument. In Panoptic Studio, recordings are carried out using ten Kinect sensors. In this case, recordings span two instrument categories: cello and guitar. The average recording length is 2 minutes per instrument, with a single performer per instrument. As we gather 3D point cloud videos from different sources, we set the axes representing the point clouds to have the same meaning for all the collected 3D videos, that is: the body facing direction is the z-axis, the stature direction is the y-axis, and the side direction is the x-axis. We then split 75% of the data for training, 15% for validation, and the remaining 10% for testing. The split is made ensuring that there is no overlap of identities between sets.

Audio
We collect 30 hours of mono audio recordings at 44.1 kHz from Solos [27] and Music [51]. Both datasets gather music from YouTube which ensures a variety of acoustic conditions. In total, we gather 72 recordings per instrument with an average of 5 minutes per recording. We split 75% of the recordings for training, 15% for validation, and the remaining 10% for testing.
For further binaural auditory scene generation, we also create multiple binaural versions of each recording using the Two!Ears Binaural Simulator [45]. Specifically, for each audio recording we simulate the binaural version at a discrete set of angular positions in the horizontal plane with no elevation, i.e. (φ_k, θ) = (kπ/4, 0) for k = 0, ..., 7. For binaural auditory modeling, we use HRTFs at a source-receiver distance of 1 m, measured with a KEMAR manikin (type 45BA) in the anechoic chamber of TU Berlin [44].

Audio-visual 3D Scene Generation
We synthetically create mono mixtures, 3D scenes, and the corresponding binaural version to train the model in a supervised fashion.
For each instance, we randomly select N sources and N angular positions, with N chosen uniformly between 1 and 3. The binaural mixture, which serves as supervision, is created following Eq. (2): we first select a 3-second binaural signal for each sound source in the mix according to its angular position, and then sum all selected binaural signals to create the binaural mixture.
For the 3D scene, we first select individual musician point clouds corresponding to these sources. The musician point clouds are then placed at their corresponding angular positions, at a random distance ranging from 1 to 3 meters from the listener's head in the 3D space. Finally, all N musician point clouds are merged to create a single 3D point cloud scene. Note that we generate binaural versions using HRTFs measured at 1 meter from the listener's head but locate the sources at distances ranging from 1 to 3 meters. This simplification is based on the fact that distance has only a minor influence on the shape of HRTFs for source-receiver distances greater than 1 m [31].
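The following sketch summarizes how one training instance is assembled; get_binaural_clip and get_musician_cloud are hypothetical helpers standing in for the simulated binaural clips and the captured point clouds, and the exact axis convention used to place the musicians is an assumption:

```python
# Sketch of one supervised training instance: N sources, N angular positions,
# a summed binaural mixture, and a merged point cloud scene.
import numpy as np

def make_instance(get_binaural_clip, get_musician_cloud, rng=np.random.default_rng()):
    n_sources = rng.integers(1, 4)                        # N in {1, 2, 3}
    ks = rng.choice(8, size=n_sources, replace=False)     # angular positions k*pi/4
    binaural = 0.0
    clouds = []
    for k in ks:
        binaural = binaural + get_binaural_clip(k)        # (2, T) 3-second clip at angle k*pi/4
        coords, rgb = get_musician_cloud()
        dist = rng.uniform(1.0, 3.0)                      # 1-3 m from the listener's head
        angle = k * np.pi / 4
        # place the musician at (distance, angle) in the horizontal plane (axis choice assumed)
        offset = dist * np.array([np.sin(angle), 0.0, np.cos(angle)])
        clouds.append((coords + offset, rgb))
    scene_coords = np.concatenate([c for c, _ in clouds])
    scene_feats = np.concatenate([f for _, f in clouds])
    mono = binaural[0] + binaural[1]                      # s_m = s_b^L + s_b^R
    return mono, (scene_coords, scene_feats), binaural
```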
During the training of the model, each individual musician point cloud is independently augmented in both coordinates and color. We randomly shear and translate the coordinates of each musician in the scene; shearing is applied along all axes, with the shear elements sampled from a zero-mean normal distribution. Regarding color, we distort the brightness and intensity of each sound source in the scene. Specifically, we apply color distortion to each point by adding Gaussian noise sampled from N(0, 0.05²) to each rgb color channel. We also alter the color value and color saturation by random amounts uniformly sampled from -0.2 to 0.2 and from -0.15 to 0.15, respectively. Figure 3 illustrates the different augmentation operations applied. After augmentation, each scene is represented as a sparse tensor by discretizing the point cloud coordinates using a voxel size of 0.02 meters. We select a small voxel size as it has been shown to work better than larger ones for several 3D visual tasks [3].
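A sketch of the per-musician augmentation is given below; the standard deviation of the shear elements is a placeholder since its value is not stated here, whereas the color-noise, value, and saturation ranges follow the text:

```python
# Sketch of the coordinate and color augmentations applied to each musician cloud.
import numpy as np
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

def augment_musician(coords, rgb, rng=np.random.default_rng()):
    # random shear along all axes; the 0.1 standard deviation is an assumed placeholder
    shear = np.eye(3) + rng.normal(0.0, 0.1, size=(3, 3)) * (1 - np.eye(3))
    coords = coords @ shear.T
    # additive Gaussian noise N(0, 0.05^2) on each rgb channel
    rgb = np.clip(rgb + rng.normal(0.0, 0.05, size=rgb.shape), 0.0, 1.0)
    # random value and saturation shifts within the stated ranges
    hsv = rgb_to_hsv(rgb)
    hsv[:, 2] = np.clip(hsv[:, 2] + rng.uniform(-0.2, 0.2), 0.0, 1.0)
    hsv[:, 1] = np.clip(hsv[:, 1] + rng.uniform(-0.15, 0.15), 0.0, 1.0)
    return coords, hsv_to_rgb(hsv)
```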

Implementation Details
Initially, we pretrain the vision network to facilitate the subsequent learning process. Pretraining is done on the ModelNet40 3D object classification task [46]. Since ModelNet40 consists of 3D CAD models, we sample point clouds from the mesh surfaces of the object shapes. For the pretraining, we also discretize the coordinates, setting the voxel size to 0.02 meters. Then, the Points2Sound vision and audio networks are jointly trained for 120k iterations using the Adam optimizer [20]. We use a batch size of 40 samples and set the learning rate to 1 × 10⁻⁴. We select the weights corresponding to the lowest validation loss after the training process. Training and testing are conducted on a single Titan RTX GPU. The training stage takes about 72 hours, while inference takes 0.115 seconds to binauralize 10 seconds of mono audio (value averaged over 300 samples). We use the Minkowski Engine [3] for the sparse tensor operations and PyTorch [33] for the other required operations.

Evaluation Metrics
As in previous work [28], we measure the quality of the predicted binaural audio by assessing the short-time Fourier transform (STFT) distance. Using the STFT distance we assess how similar the frequency components of each predicted binaural channel are to the ground truth. The STFT distance (d_STFT) between a binaural signal s_b and its estimate ŝ_b is defined as

d_STFT = ‖STFT(s_b^L) − STFT(ŝ_b^L)‖_2 + ‖STFT(s_b^R) − STFT(ŝ_b^R)‖_2,

where ‖·‖_2 is the L2 norm and STFT(·) is the short-time Fourier transform. The STFT is computed using a Hann window of 23 milliseconds and a hop length of 10 milliseconds.
We also assess the quality of the predicted binaural audio using the envelope distance. The envelope distance operates in the time domain and is intended to capture the perceptual similarity between two binaural signals better than directly computing the loss between their waveform samples [28]. The envelope distance (d_ENV) between a binaural signal s_b and its estimate ŝ_b is defined as

d_ENV = ‖E[s_b^L] − E[ŝ_b^L]‖_2 + ‖E[s_b^R] − E[ŝ_b^R]‖_2,

where E[s(t)] corresponds to the envelope of the signal s(t). The envelope is given by the magnitude of the analytic signal computed using the Hilbert transform.
Note that we report the performance depending on the number of sources (N = 1, 2, 3), based on the average of the evaluation metrics. Average values over any number of sources are denoted d̄_ENV and d̄_STFT. Also, before computing the distances, the predicted and the ground truth signals are normalized according to their maximum absolute value.
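For reference, a direct implementation of both metrics could look as follows, with signals of shape (2, T), peak normalization applied beforehand, and the STFT parameters stated above:

```python
# Sketch of the STFT distance and envelope distance for 44.1 kHz binaural signals.
import numpy as np
from scipy.signal import stft, hilbert

def stft_distance(s, s_hat, fs=44100):
    win = int(0.023 * fs)          # 23 ms Hann window
    hop = int(0.010 * fs)          # 10 ms hop length
    d = 0.0
    for ch in range(2):            # left and right channels
        _, _, S = stft(s[ch], fs, window="hann", nperseg=win, noverlap=win - hop)
        _, _, S_hat = stft(s_hat[ch], fs, window="hann", nperseg=win, noverlap=win - hop)
        d += np.linalg.norm(S - S_hat)
    return d

def envelope_distance(s, s_hat):
    d = 0.0
    for ch in range(2):
        env = np.abs(hilbert(s[ch]))        # envelope via the analytic signal
        env_hat = np.abs(hilbert(s_hat[ch]))
        d += np.linalg.norm(env - env_hat)
    return d

# peak normalization before computing the distances:
# s, s_hat = s / np.max(np.abs(s)), s_hat / np.max(np.abs(s_hat))
```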

Baselines
We use two baselines to assess the quality of the predicted binaural versions:
Rotated-Visual. The Rotated-Visual baseline assesses the performance of Points2Sound (L_full) when wrong visual information is provided. To this end, during testing we rotate the 3D scene by π/2 in the horizontal plane of the listener's head.
Mono-Mono. The Mono-Mono baseline simply copies the mono input audio to both predicted binaural channels.
A quantitative comparison with a similar method for mono to binaural synthesis using visual information is provided in the Appendix.

Evaluation
For evaluation we use 504 audio-visual 3D scenes with N = 1, 2, 3 sound sources. The audio-visual 3D scenes are generated using the test set and following the procedure explained in Section 3.4. During evaluation, augmentation operations are not applied and 10-second audio clips are selected.
Points2Sound input audio. We consider three different types of mono mixture input signals for Points2Sound. Table 1 shows quantitative results of Points2Sound for the different types of mono mixture signals and numbers of sources.
First, we consider true mono mixture signals, which come from the audio dataset detailed in Section 3.3.2 and to which no HRTFs have been applied. In this case, the improvement of Points2Sound over the baselines is notable. This is especially observed in the d_STFT metric, where the Mono-Mono baseline achieves 26.626 while Points2Sound achieves 6.340. Despite the improvement, the binaural predictions are degraded and contain time-frequency artifacts that change the timbre characteristics of the original sound.
Second, we consider mono mixture signals which contain only the left binaural channel, i.e. s_m = s_b^L. Note that in this case ITDs are not preserved in the mono mixture, and the network has to apply different shifts to the left and right channels of the binaural prediction. Results show that Points2Sound improves the evaluation metrics over the baselines for all numbers of sources.

Table 1: Quantitative results of baselines and Points2Sound considering different mono input audio. For each method, we use rgb-depth 3D point cloud attributes and report the performance depending on the number of sources (N = 1, 2, 3), based on the average of the evaluation metrics. Average values over any number of sources are given by d̄_ENV and d̄_STFT.

Third, following previous work [11,24,48,52], we consider mono mixture signals represented as the sum of the two spatial channels, i.e. s_m = s_b^L + s_b^R. In this case the mixing of the channels creates a mono signal that loses spatial properties, but the resulting mono signal preserves the correct ITDs from the binaural version. Results show that Points2Sound achieves the best results using this mono representation, with a d_ENV of 0.095 and a d_STFT of 1.686. Note that the obtained quantitative results are similar to the ones achieved using the previous approach, where mono mixture signals are represented as s_m = s_b^L. In the following, when we refer to Points2Sound we assume it has been trained using mono mixture signals represented as s_m = s_b^L + s_b^R.
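The three mono-input variants compared here differ only in how the input signal is formed from the simulated binaural pair and the dry recording, e.g.:

```python
# The three mono-input variants, given a simulated binaural pair (s_L, s_R) and the
# corresponding dry recording s_true to which no HRTF has been applied.
def mono_variants(s_true, s_L, s_R):
    return {
        "true mono": s_true,           # dry mixture from the audio dataset
        "left channel": s_L,           # s_m = s_b^L (ITDs not preserved in the input)
        "sum of channels": s_L + s_R,  # s_m = s_b^L + s_b^R (preserves the correct ITDs)
    }
```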
3D point cloud attributes. We evaluate how 3D point cloud attributes affect the binaural audio synthesis. Accordingly, we consider two types of 3D point cloud scenes: when 3D point cloud scenes consist of depth-only data and when 3D point cloud scenes consist of rgb-depth data. We use the term "only-depth" in Table 1 to report the performance when 3D point cloud scenes consist of depth-only data. Otherwise, we report the performance when 3D point cloud scenes consist of rgb-depth data.
We observe that Points2Sound benefits from the rgb information, especially when multiple sound sources are present. For example, with N = 3 sources, Points2Sound using rgb-depth features achieves a d_ENV of 0.114 and a d_STFT of 1.521, as opposed to 0.122 and 1.736 achieved by depth-only features. But with N = 1 source, Points2Sound using depth features slightly outperforms rgb-depth features, providing a d_STFT of 0.082 and a d_ENV of 0.016, as opposed to 0.099 and 0.015. We suspect that with a single source, depth features already provide direct information about the position of the source for accurate binaural synthesis. However, when multiple sources are present, rgb-depth features better distinguish each source, which facilitates the binaural synthesis of the audio network. Informal listening corroborates that rgb-depth features help the model to recognize and locate multiple sources, as they provide more stable auditory images of the sound sources over the whole 10-second clips.
Number of sources in the 3D scene. We are also interested in evaluating the quality of the predicted binaural audio depending on the number of sound sources present in the 3D scene. We observe that when N = 1 source, Points2Sound provides perceptually convincing binaural predictions with a consistent performance across almost all examples.
When N = 2 sources, informal listening reveals that the binaural audio predictions are convincing, especially when the sources are located on the same side of the listener's head. Quantitative results show that Points2Sound using rgb-depth features achieves a d_ENV of 0.044 and a d_STFT of 0.284 when both sources are on the same side, and a d_ENV of 0.084 and a d_STFT of 0.944 for other source position configurations. We also observe that in some cases Points2Sound has difficulties when one of the two sources is located in front of or behind the listener's head. This results in binaural predictions where the auditory image of the front/back source is not stable over the whole 10-second clip.
When N = 3 sources, Points2Sound has difficulties providing stable auditory images for every sound source over the whole 10-second clip.
Effect of the room. We evaluate how Points2Sound performs under reverberant room conditions. To this end, we use binaural room impulse responses (BRIRs) to generate two test sets. Note that Points2Sound is trained using HRTFs measured in an anechoic room. The first test set is generated using the BRIRs measured in the studio room Calypso at TU Berlin [43]. The Calypso room has a volume of 83 m³ and a reverberation time RT60 of 0.17 seconds at a frequency of 1 kHz. The second test set is generated using the BRIRs measured in the meeting room Spirit at TU Berlin [26]. The Spirit room has a rectangular shape with an estimated reverberation time RT60 of 0.5 seconds. In both rooms, BRIRs were measured using a KEMAR manikin (type 45BA) and a loudspeaker (Genelec 8250A) placed in front of the manikin at a distance of two meters. As the head orientations of the manikin were measured from −π/2 to π/2, we simulate the binaural versions at the following discrete set of angular positions in the horizontal plane with no elevation: (φ_k, θ) = (kπ/4, 0) for k = −2, ..., 2. Finally, 504 audio-visual 3D scenes with N = 1, 2, 3 sources are generated using the same procedure explained in Section 3.4. Figure 4 shows a visual comparison of the performance of Points2Sound for different room acoustic conditions. The results are reported for different numbers of sources using rgb-depth point cloud features. We observe that Points2Sound performance degrades as the acoustic conditions of the room diverge from the anechoic training conditions.

Points2Sound loss function. We analyze Points2Sound performance depending on the learning objective. The initially proposed learning objective, referred to as L_full, optimizes the parameters of the model to reduce the L1 loss between the estimated binaural ŝ_b^{L,R} and the ground truth binaural s_b^{L,R}, i.e.

L_full = ‖ŝ_b^{L,R} − s_b^{L,R}‖_1.
Note that in this case, Points2Sound predicts the full binaural signal, i.e. it predicts both the left and right binaural channels. Several methods in the literature propose to optimize the models by predicting the difference of the two binaural channels [11,52]. To this end, we consider another loss function for Points2Sound, L_diff, which optimizes the parameters to reduce the L1 loss between the estimated binaural difference channel ŝ_b^diff and the ground truth binaural difference channel s_b^diff = s_b^L − s_b^R, i.e.

L_diff = ‖ŝ_b^diff − s_b^diff‖_1.
Note that when using L_diff, Points2Sound is forced to learn the differences between the left and right binaural channels and predicts a one-channel signal ŝ_b^diff. Then, considering the mono signal represented as s_m = s_b^L + s_b^R, both predicted binaural channels are recovered as follows:

ŝ_b^L = (s_m + ŝ_b^diff) / 2,   ŝ_b^R = (s_m − ŝ_b^diff) / 2.

Results in Table 1 show that Points2Sound benefits from directly predicting the binaural signal using the L_full loss function, as opposed to predicting the difference between binaural channels with the L_diff loss function. Using rgb-depth point cloud features, Points2Sound (L_full) achieves a d_ENV of 0.067 and a d_STFT of 0.794, while Points2Sound (L_diff) achieves 0.076 and 1.063, respectively. The poor performance obtained with the Rotated-Visual baseline indicates that Points2Sound strongly relies on the 3D scene to synthesize binaural audio, and incorrect predictions are expected when wrong visual information is provided. In the following, we refer to Points2Sound assuming it has been trained using the L_full loss function.
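The recovery step used with L_diff is simple arithmetic given s_m = s_b^L + s_b^R:

```python
# Recovering both binaural channels from the mono input and the predicted channel
# difference: adding/subtracting the difference and halving isolates each channel.
def recover_channels(s_m, s_diff_hat):
    s_left_hat = 0.5 * (s_m + s_diff_hat)
    s_right_hat = 0.5 * (s_m - s_diff_hat)
    return s_left_hat, s_right_hat
```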

Listening Examples
We provide a supplementary video with four listening examples where Points2Sound is applied to real-world data recorded from expert musicians. We consider four challenging audio-visual scenes of N = 2 sources performing simultaneously in the same room. Specifically, two audio-visual scenes contain guitar and violin as sound sources, while the other two contain doublebass and violin. The recorded audio fragments cover a variety of music styles (classical and jazz), tempi (vivace and lento), and dynamics (forte and piano). The 3D scenes of the musicians are captured using Azure Kinect DK cameras, while the mono audio is captured using a Google Pixel 4 smartphone at a static position in the middle of the room. For each scene, the video shows the raw data first and then the binaural predictions of Points2Sound. Despite the discrepancy between the training data and real-world scenarios, the binaural predictions of Points2Sound show promising extrapolation ability.

Discussion
The work presented in this paper indicates the potential of using 3D visual information to guide multi-modal deep learning models for the synthesis of binaural audio from mono audio. By using 3D point clouds as visual information, the vision network has the ability to extract information about the 3D positions between the receiver and the sound sources in a scene to guide the binaural synthesis. By using 3D sparse convolutions, the network learns the correspondence between 3D structures found in local regions of the 3D space and audio characteristics.
When Points2Sound is trained using true mono signals that do not contain HRTF information, our proposed method introduces time-frequency artifacts that lead to degraded binaural predictions. This suggests that a significant amount of Points2Sound's capacity is needed to model the HRTF information, and as a result the model has more difficulty synthesizing accurate binaural sound. Considering one of the two binaural channels as the mono input, i.e. s_m = s_b^L, Points2Sound achieves quantitative results similar to those obtained when considering the mono audio as the sum of the two binaural channels, i.e. s_m = s_b^L + s_b^R. Interestingly, the model trained using the sum of the two binaural channels as mono input provides encouraging extrapolation results when applied to real mono recordings, as demonstrated by the provided sound examples.
Results suggest that waveform-based approaches can provide convincing performance for the task of visually-informed spatial audio generation without the need to rely on hand-crafted spectrograms as input. In addition, by operating in the waveform domain, our model synthesizes the signal directly. This is in contrast to spectrogram-based models, which predict a mask to overcome the difficulties of directly predicting the spectrum due to the large dynamic range of STFTs [11,24,48,52].
Our proposed model benefits from predicting the full binaural signal as opposed to the difference between binaural channels. This might be of relevance for other applications where visually-informed models operating in the waveform domain are used to generate spatial audio.
Our proposed model benefits from using visual features extracted from rgb-depth point clouds to improve the binaural synthesis when multiple sources are present, compared with features extracted from depth-only point clouds. However, the fact that Points2Sound can work with depth-only information may be beneficial in cases of low ambient light, where RGB sensors would fail to capture the scene, in contrast to LiDAR sensors, which are still able to capture depth information. As mentioned above, we observe that in some cases Points2Sound predicts binaural versions where the auditory image of the sources is not stable. As this effect is mainly observed for cases with more than one source (N > 1), we suspect that this problem is related to the source separation capability of the audio network. To further investigate this phenomenon, a separate study on channel bleeding in source separation for different types of musical instruments would be required.
After analyzing the performance of Points2Sound under reverberant conditions, the results indicate that the method can be applied to dry rooms that resemble the anechoic training conditions. However, decreased performance is expected when the room acoustic conditions diverge from the anechoic training conditions. The performance of Points2Sound in highly reverberant rooms, after retraining or fine-tuning the model using binaural room impulse responses that contain the influence of the room, remains to be studied.

Conclusion and future work
This work introduced Points2Sound, a multi-modal deep learning model capable of generating a binaural version from mono audio using a 3D point cloud scene as guidance. Points2Sound shows that 3D visual information can successfully guide the binaural synthesis while demonstrating that waveform-based approaches can provide convincing performance for the task of visually-informed spatial audio generation.
Such models are of increasing interest for the generation of spatial audio in immersive applications. Recent portable devices, like smartphones, can capture 3D visual data from the environment using LiDAR or rgb-depth cameras. However, such devices have limited capabilities to record spatial audio from the sound sources. Given a recorded rgb-depth environment and its corresponding mono audio, our approach is a step towards synthesizing proper acoustic stimuli for users navigating the virtual environment depending on their location and head position.
Future work could involve adding loudness into the learning process by predicting a reference sound level for each source. This would also allow sound attenuation to be inferred in dynamic 3D scenes.

Appendix

To make the adapted vision network resemble its original model, the last Resnet18 3×3×3 sparse convolution is implemented with K = 512 channels. Then, as in Mono2Binaural, the visual feature vector is replicated to match the spatial feature dimensions of the U-Net bottleneck and concatenated along the channel dimension. During training, we select 0.63-second audio clips and compute the STFT using a Hann window of 23 milliseconds and a hop length of 10 milliseconds. Mono2Binaural considers mono inputs represented as s_m = s_b^L + s_b^R, and the learning objective is to predict the complex-valued spectrogram of the difference of the two binaural channels. Both predicted binaural channels are then recovered from the mono input and the predicted difference signal as ŝ_b^L = (s_m + ŝ_b^diff)/2 and ŝ_b^R = (s_m − ŝ_b^diff)/2. We use the Adam optimizer and minimize the mean squared error loss function. During testing, Mono2Binaural uses a sliding window with a hop size of 50 milliseconds to binauralize the 10-second audio clips. Figure 5 shows a visual comparison of the performance of Points2Sound and Mono2Binaural for different numbers of sources when rgb-depth point cloud features are used. Table 2 shows quantitative results of both learning methods for different types of 3D point cloud attributes and numbers of sources.
In addition, we provide a second supplementary video with listening examples containing three audio-visual scenes from the test set with N = 2 sources. For each listening example, we first show the 3D point cloud scene and then provide the input mono audio, the binaural predictions of Points2Sound and Mono2Binaural, and the ground truth binaural audio. The audio-visual scenes are selected to contain sound sources that are not located on the same side of the listener's head. The scenes also contain a variety of sound sources that play in the same frequency range in some fragments.