An overview of machine learning and other data-based methods for spatial audio capture, processing, and reproduction

The domain of spatial audio comprises methods for capturing, processing, and reproducing audio content that contains spatial information. Data-based methods are those that operate directly on the spatial information carried by audio signals. This is in contrast to model-based methods, which impose spatial information from, for example, metadata like the intended position of a source onto signals that are otherwise free of spatial information. Signal processing has traditionally been at the core of spatial audio systems, and it continues to play a very important role. The irruption of deep learning in many closely related fields has put the focus on the potential of learning-based approaches for the development of data-based spatial audio applications. This article reviews the most important application domains of data-based spatial audio, including well-established methods that employ conventional signal processing, while paying special attention to the most recent achievements that make use of machine learning. Our review is organized around the topology of the spatial audio pipeline, which consists of capture, processing/manipulation, and reproduction. The literature on the three stages of the pipeline is discussed, as is the literature on the spatial audio representations used to transmit the content between them, highlighting the key references and elaborating on the underlying concepts. We reflect on the literature by juxtaposing the prerequisites that made machine learning successful in domains other than spatial audio with those found in the domain of spatial audio today. Based on this, we identify routes that may facilitate future advancement.


Introduction
Interest in immersive communication technologies has been growing over the last two decades due to the emergence of today's multimedia applications. Gaming, virtual and augmented reality (VR and AR), teleconferencing, and similar applications all benefit from spatial audio, which conveys not only the audio content itself but also the spatial attributes of the sound scene resulting from the actual locations of the sound sources and the properties of the acoustic environment [3][4][5].
In general, spatial audio methods and techniques may be broadly described as illustrated in Fig. 1. A complete spatial audio pipeline generally comprises a capture stage, a processing stage in which the spatial information in the captured sound scene is modified or in which information that cannot be measured directly is extracted from the sound scene, and finally a reproduction stage in which the (potentially manipulated) sound scene is auralized. A more detailed overview of these stages, which are central to the organization of this overview, is provided in Section 1.1.
While a hard classification of spatial audio methods can be difficult to establish, many of them can be broadly categorized as model-based or data-based [6]. Traditionally, model-based methods compose sound scenes from individual virtual sound sources that are described analytically by mathematical or physical models and driven by a set of audio input signals. Wave field synthesis (WFS) [8], stereophonic amplitude panning, and vector base amplitude panning (VBAP) [9] are all model-based methods.
In contrast, data-based spatial audio methods employ sound scene representations in which the spatial information is encoded in the audio signals. The spatial information can originate from array recordings, acoustic measurements, or simulations. Section 1.2 elaborates on the fundamental differences between model-based and data-based spatial audio.
We also mention the concept of object-based audio representation here [12]. This concept is very similar to model-based representation in that a spatial scene is represented by its components. Audio objects can be more abstract than the objects in a model-based representation. Data-based reverberation, for example, can be an audio object in an otherwise model-based scene.
Acoustics and signal processing have traditionally been highly intertwined in the development of spatial audio techniques [13]. Signal processing algorithms for ambience extraction, personalization of head-related transfer functions (HRTFs), audio up-mixing, and sound field rendering have been available for several decades and are still finding application in current multimedia systems. Most traditional methods in spatial audio have been designed from a pure signal processing perspective. The irruption of deep learning (DL) [14] in recent years is creating a turning point in many areas of digital signal processing, and consequently, spatial audio is also starting to feel the impact of machine learning (ML) in general, and deep neural networks (DNNs) in particular.
ML comprises a learning process that enables ML models to recognize patterns of interest in the data on which the systems are trained and to apply that knowledge to detect or generate similar patterns in new, unseen data. We highlight at this point that the term data, when used in an ML context, can refer to any type of data, be it audio signals, digital images, financial transactions, or others. The data in data-based spatial audio, on the other hand, always refers to the spatial information encoded in the signals. Throughout this article, the term will primarily refer to multichannel audio data with spatial cues encoded as inter-channel dependencies. As will be shown by numerous examples of audio applications, such data are well suited to ML.
Undoubtedly, the popularity of DL in image processing, computer vision, and natural language processing has had a significant impact on fields closely related to spatial audio, including speech enhancement and music information retrieval [15,16]. While ML algorithms have already positioned themselves at the top of the state of the art within the aforementioned fields, their use in immersive spatial audio is only emerging, as will be illustrated throughout this review.
This article provides an overview of data-based spatial audio methods and establishes a topology of the concepts that have been employed. The scope in which data-based methods have been utilized in spatial audio capture, processing, and reproduction is broad, and the potential of DL has been a particular focus in the last couple of years. We complement our review with the relevant works on signal-processing-based (i.e., non-ML-based) methods, which are sometimes alternatives to the ML-based methods and complementary at other times.
With the aim of better elaborating on the role of data-based methods within the general spatial audio pipeline, the remaining sections of this introduction are devoted to introducing these two important aspects, and the introduction wraps up by formalizing the article's scope.

The spatial audio pipeline
This overview is organized based on the topology of the spatial audio pipeline. The spatial audio pipeline (Fig. 1) starts with a given representation of a sound field including spatial information. Usually, spatial sound scenes are captured using an array of microphones with a given geometry (cf. Fig. 2 for examples). The microphone output signals together with the microphones' positions and their directivities already constitute a representation of the sound scene. In some setups, the microphone signals are combined by suitable mathematical operations to obtain an abstract representation of the physical structure of the captured sound field. Examples of this are the plane wave decomposition and the spherical harmonic decomposition. The coefficients of the decomposition serve as the output format of the capture stage. The output of the capture stage may be piped directly to the reproduction stage, or it may be processed. The processing stage uses a suitable representation of the sound scene as input to, for example, extract information on the sound scene such as the number of sound sources and their locations or the instantaneous directions of incidence of the wave fronts. The processing stage may also manipulate the sound scene, for example, by separating direct sound components from diffuse reverberation and recombining them such that this results in a change of the characteristics of the reverberation or in a change of the apparent location of a source. The ultimate goal could be to decompose a sound scene into all its independent conceptual components, i.e., the individual source signals and all components of the reverberation that each source produces. This would allow for unrestricted manipulation. This goal still lies in the considerably distant future, so the available methods instead target different subsets of the sound scene components.
The reproduction stage renders the sound scene and produces the input signals to the loudspeakers that are available. These loudspeakers can either be mounted in a pair of headphones (one speaks of head-related reproduction) or mounted in the space around the listener(s) (one speaks of room-related reproduction) [17]. Figure 3 presents some examples. A plethora of methods have been proposed for room-related reproduction depending on the number of loudspeakers that are available, the size of the listening area, and the number of simultaneous listeners [18]. Head-related reproduction injects the signals directly into the listener's ear canals and uses an acoustical model of the human head to convey the spatial information [19]. This acoustical model is represented by the user's HRTFs. As HRTFs are individual to a person, HRTF individualization, also by means of ML, has become a topic of considerable activity and is covered by this article.
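As an illustration, head-related reproduction at a fixed source direction amounts to convolving the source signal with the head-related impulse response (HRIR) pair for that direction. The following minimal sketch uses toy HRIRs (a pure delay and attenuation standing in for measured responses) rather than real measured data:

```python
import numpy as np

def binaural_render(mono, hrir_left, hrir_right):
    """Render a mono signal at a fixed direction by convolving it
    with the left- and right-ear HRIRs for that direction."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])  # shape: (2, len(mono) + len(hrir) - 1)

# Toy HRIRs: the right ear receives the sound 3 samples later and
# attenuated, mimicking interaural time and level differences.
hrir_l = np.array([1.0, 0.0, 0.0, 0.0])
hrir_r = np.array([0.0, 0.0, 0.0, 0.6])
out = binaural_render(np.array([1.0, 0.5]), hrir_l, hrir_r)
```

In a real system, the HRIR pair would be selected (or interpolated) from a measured dataset according to the desired source direction, and time-varying convolution would track head rotation.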
Ideally, one would like to have available a universal representation of the sound scene based on which all conceivable methods of the processing stage can operate and that can serve as the input to the reproduction stage.
As of now, no such universal representation exists. Rather, a set of representations have become popular that are partly compatible and partly incompatible with each other, so that, many times, the employment of a given method in the processing stage poses certain requirements on the capture and/or the reproduction stage. Some methods were formulated for processing exclusively spatial room impulse responses (SRIRs), i.e., room impulse responses (RIRs) that retain spatial information, such as an array room impulse response, whereas other methods are formulated for running signals. We will not explicitly differentiate between these, as the underlying concepts are identical.

Data-based and model-based methods
As already introduced, the use of the term "data" can relatively easily lead to confusion within a conceptual framework merging traditional spatial audio concepts with ML. Note that ML-based algorithms are usually said to be data-driven approaches to emphasize the fact that they are designed to perform a given task by learning from data. In contrast, the term "data" in data-based spatial audio has traditionally been used to describe approaches that process signals in which the spatial information is encoded in the audio signals, even if the algorithms are not necessarily "data-driven." We illustrate the difference between data-based and model-based methods in spatial audio with a few examples in the following. Stereophony uses differences between the two channels of a loudspeaker pair to encode spatial information [20,21]. Typically, this is done using level or timing differences and also differences in the amount of signal correlation. Traditionally, the interchannel differences were produced by capturing a given scene with two microphones that are either coincident with suitable directivities or spaced apart to exploit differences in the arrival times of a given wave front. This may be considered a data-based representation, as the spatial information of the sound scene is captured.
It is equally possible to create such interchannel differences manually by means of analog or digital signal processing so that spatial information can be imposed onto an otherwise non-spatial single-channel signal [21]. This may then be considered a model-based method. The underlying physical source model is not very explicit in stereophony. More recent spatial audio presentation methods like WFS or the ambisonics family can employ explicit physical source field models like spherical and plane waves [22]. Acoustic environments can, of course, also be represented using models [23]. Examples of modern, purely data-based methods are the rendering of signals obtained from spherical microphone arrays (SMAs), which can be binaural [24] or use a method from the ambisonics family [22,25]. Rendering of SMA signals has also been achieved in WFS [26].
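As a minimal illustration of imposing such interchannel level differences by signal processing, the sketch below implements a constant-power stereo panner based on the tangent panning law. The ±30° loudspeaker base angle and the sign convention (positive angles pan towards the left loudspeaker) are assumptions of this example, not a prescription:

```python
import numpy as np

def tangent_pan(theta_deg, base_deg=30.0):
    """Stereo gains from the tangent panning law for a virtual source
    at theta_deg within a loudspeaker pair at +-base_deg. Positive
    angles pan towards the left loudspeaker (a convention choice)."""
    r = np.tan(np.radians(theta_deg)) / np.tan(np.radians(base_deg))
    g = np.array([1.0 + r, 1.0 - r])   # un-normalized left/right gains
    return g / np.linalg.norm(g)       # constant-power normalization

gl, gr = tangent_pan(30.0)             # hard left: all signal to one channel
```

Applying these gains to a mono signal yields a two-channel signal whose level differences place a phantom source between the loudspeakers, i.e., spatial information imposed by a model rather than captured.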
Data-based and model-based rendering can, of course, also be combined, for example, by augmenting a data-based scene representation with additional model-based objects. Stereophonic recordings of large orchestras are simple examples where the spatial information from the main microphone is augmented with the signals from support microphones that are distributed across the orchestra and whose signals are panned manually to the desired spatial locations. A more modern example is virtual panning spots, which constitute virtual stereo loudspeakers that are embedded in a model-based scene and contribute data-based information [27].
The named data-based methods all aim at maintaining the spatial information the way it was captured. The present article focuses on methods that go one step further in that they employ an enhancement of the original spatial information, such as sharpening, up-mixing, or manipulation of the spatial information. In some cases, this is performed on tangible representations of the data, such as a plane wave expansion. In other cases, the data representations are abstract.

Article scope
Spatial audio covers a broad range of techniques and applications that encompasses a very large area of research. This overview intends to provide the reader with a comprehensive compilation of representative data-based methods with specific applications in spatial audio. While this is by no means an exhaustive description of all the existing approaches, the intention is to portray the current state of this field, elaborating on the impact that ML algorithms are having on different aspects of spatial audio research in the DL era.
Section 2 of the article focuses on the first stage of the pipeline, spatial audio capture. The two major sound field representations (as opposed to array-specific representations) that are most commonly encountered in research, plane wave and spherical harmonic basis decompositions, are discussed in Sections 2.1 and 2.2, respectively. Some additional, less common transform-based representations for the whole scene are discussed in Section 2.3, while Section 2.4 introduces the emergent topic of abstract acoustic and audio representations learned directly from sound data with DL methods trained with a suitable task objective.
An overview of data-based spatial audio processing is provided in Section 3. The research topics involved span a very wide range of concepts, approaches, and applications; the more mature methods that deliver spatial audio for reproduction are covered more extensively in Sections 3.2.2-3.2.5. In addition, advances in DL and other data-based methods that have strong potential in spatial audio applications are mentioned on the topics of acoustical analysis and parameter estimation in Section 3.1, signal decompositions and semantic descriptions of the spatial scene in Section 3.2.1, and joint audiovisual processing in Section 3.3.
In the subsequent Section 4, the final stage of the pipeline is briefly discussed. Tools for reproduction on loudspeakers or headphones of transform-based scene representations, spatial objects, or decomposed spatial components are technologically mature and are covered mainly by linear rendering techniques. Those are mentioned only briefly in Section 4.1, since they have been covered extensively in past literature. Recent DL-driven developments on personalization of spatial audio for headphone playback are reviewed more extensively in Section 4.2. Finally, in Section 5, we discuss the associations between the two major paradigms under review, signal-processing techniques and the emerging DL methods, with regard to their potential in spatial audio, and we identify some open questions and possibilities for the future.

Spatial audio capture
Most methods for spatial audio processing, be they ML-based or not, use a scene representation as input that is either composed of "raw" microphone signals or originates from conventional linear processing applied to the signals from a microphone array. The goal of the capture stage is to provide a representation that facilitates the application of a perceptual or physical model in the subsequent spatial audio processing stage.
There exist a number of general representations of SRIRs and measurement procedures for obtaining them that were proposed independently of a spatial audio context. General requirements for representations of room responses are identified in [28]. Measurement and extrapolation procedures based on sparse representations were proposed in [29][30][31][32][33][34] and based on non-sparse representations in [35]. The use of the above methods in spatial audio is conceivable but has not yet happened on a large scale.
The main difference between these methods and those used in spatial audio applications is that the latter usually require information on the local propagation direction of a given sound field, which the referenced works do not comprise in an explicit manner. The remainder of the section presents capture methods and the resulting scene representations that have been used explicitly in the field of spatial audio.

Plane wave decomposition
The plane wave decomposition (PWD) is primarily a fundamental mathematical representation of a general wave field. The circumstance that the basis functions are intuitive and even constitute useful conceptual elementary components of a sound scene has made them popular [36]. An inconvenience in practice is that the PWD comprises parameters that are continuous with respect to space, so that only a sampled (i.e., discretized) version can be stored and transmitted. An early example of its use in spatial audio is [37], which converts an SRIR that was measured with a circular microphone array into a two-dimensional (2D) PWD in order to auralize the space using WFS. The authors exploit the fact that the data are known to represent a room impulse response by interpreting strong plane wave components as room reflections and rendering them in a dedicated manner per frequency band.
The microphone arrays employed in [37] are relatively large in size, with a radius on the order of 1 m. In [38], a similar method is proposed that employs compact microphone arrays, which have a lower physical accuracy. The authors compensate for this limitation by using a higher degree of parameterization of the room response.
In [39], a method is proposed for manual manipulation of SRIRs that are parameterized in a manner similar to the methods described above.
A variant of the classical PWD is the spatial decomposition method (SDM), which parameterizes an SRIR into a single-channel pressure signal that encodes all temporal and spectral information, together with a direction of arrival (DOA) for each individual digital sample of the pressure signal. SDM was originally proposed for the visualization of spatial room impulse responses [40]. An auralization of SDM data both for binaural and for loudspeaker-based playback was proposed in [41]. Improvements to the loudspeaker-based variant were presented in [42], and to the binaural variant in [43,44].
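A much simplified, illustrative sketch of the SDM principle can be given for a first-order (B-format) SRIR: the pseudo-intensity vector formed from the omnidirectional channel W and the dipole channels X, Y, Z, smoothed over a short window, yields a DOA estimate per sample, while W itself serves as the pressure signal. This is not the original SDM formulation, which operates on arbitrary compact arrays via time differences of arrival; the window length here is an arbitrary example value:

```python
import numpy as np

def sdm_doa(w, x, y, z, win=8):
    """Per-sample DOA estimate from a B-format SRIR (W, X, Y, Z).
    The pseudo-intensity W*[X, Y, Z], smoothed over a short window,
    points towards the arriving wavefront under the usual ambisonic
    sign convention; the W channel keeps the pressure signal."""
    kernel = np.ones(win) / win
    ix = np.convolve(w * x, kernel, mode="same")
    iy = np.convolve(w * y, kernel, mode="same")
    iz = np.convolve(w * z, kernel, mode="same")
    azimuth = np.arctan2(iy, ix)
    elevation = np.arctan2(iz, np.hypot(ix, iy))
    return azimuth, elevation

# Toy B-format SRIR: a single wavefront arriving from azimuth 90 deg
w = np.zeros(32)
w[10] = 1.0
az, el = sdm_doa(w, x=np.zeros(32), y=w.copy(), z=np.zeros(32))
```

The resulting (pressure, azimuth, elevation) triplets per sample are exactly the kind of parameterization that SDM renderers consume for visualization or auralization.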

Spherical harmonics-based representations
Another popular sound field representation is the spherical harmonic (SH) decomposition [45]. SHs are the angular solutions to the wave equation in spherical coordinates and are used in many fields of mathematics and physical science. Acoustic fields can theoretically be perfectly represented by a superposition of an infinite set of SHs. In practice, a finite set has to be used, which limits the accuracy of the representation in different respects that are often abstract and intangible. Contrary to plane waves, SHs are a discrete representation, which means that a finite set of audio channels represents continuous spatial information. This aspect has contributed significantly to their popularity.
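For concreteness, the sketch below encodes a mono signal as a plane wave into the four first-order SH (ambisonic B-format) channels. Channel ordering and normalization conventions (e.g., ACN/SN3D vs. N3D) are deliberately glossed over, so this is an illustrative rather than standard-compliant encoder:

```python
import numpy as np

def foa_encode(signal, az, el):
    """Encode a mono signal as a plane wave from (az, el), in radians,
    into first-order ambisonic channels (W, X, Y, Z). Normalization
    conventions are ignored here; real systems must agree on one."""
    gains = np.array([1.0,                       # W: omnidirectional
                      np.cos(az) * np.cos(el),   # X: front-back dipole
                      np.sin(az) * np.cos(el),   # Y: left-right dipole
                      np.sin(el)])               # Z: up-down dipole
    return gains[:, None] * signal[None, :]      # shape: (4, n_samples)

sig = np.ones(4)
bfmt = foa_encode(sig, az=0.0, el=0.0)  # frontal source: X carries the signal
```

Higher orders follow the same pattern with more SH channels, at the cost of a quadratically growing channel count.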
SHs found their way into the field of spatial audio through the visionary works of Michael Gerzon [46] and are often referred to as the ambisonics representation of a sound field. Later works, particularly on spherical microphone arrays, many of which were performed by researchers from outside the ambisonics community, highlighted the convenient properties of SH representations without necessarily referring to the concept of ambisonics [47,48]. Even nowadays, the terminology is inconsistent in that many researchers do not necessarily employ the term ambisonics when dealing with SH representations of sound fields in a spatial audio context.
Particularly, the methods that will be discussed in Section 3.2.2 often employ an SH representation. In fact, while both PWD and SH representations may also be understood as acoustic analysis methods by themselves, this overview treats them as initial features that enable other spatial audio processing tasks.

Other transformation-based representations
Other representations based on linear spatial filtering techniques have also been proposed for a variety of spatial audio applications using space-time processing. Examples of particular interest are those that do not rely strongly on far-field assumptions, but which approximate fields produced by nearby sources with far-field components. In this context, the ray space transform (RST) was proposed in [49] as a framework to formalize the plenacoustic analysis of [50,51] through the adoption of Gabor frames. For its computation, the RST considers the output of a uniform linear array of microphones and applies a spatial sliding window to perform a "local" PWD of the recorded sound field. As a result, the RST is able to map the directional components of the sound field onto a "ray space" that provides some benefits in terms of invertibility and parameterization. For example, point-like sources are mapped onto linear patterns in the RST domain, and spatial audio processing tasks such as source localization [52] or separation [53] can be performed directly on such a representation. The projective form of the RST also allows processing the signals captured by a set of compact microphone arrays, enabling applications such as sound field reconstruction [54].

Feature-based representations
The audio representations described above are all derived from mathematical manipulations that are both data-independent and motivated by already well-understood physical processes. Such representations have the advantage of being "general" and applicable to a wide range of problems, while also providing valuable intuition on the underlying acoustic phenomena. However, one of the most celebrated advantages of DL-based approaches is their capability to learn hierarchical representations of the input data automatically during training. Feature learning or representation learning is understood as a set of techniques that allow DL algorithms to automatically discover good representations of the input data that efficiently encode the information needed for performing a given task. The feature learning process may be supervised (when labeled data is used) or unsupervised (when no labeled data is needed). In DL-based approaches, convolutional neural networks (CNNs) are typically applied to extract such abstract representations, which in the case of spatial audio should jointly capture spatial and spectro-temporal information about the sound scene. Since the majority of spatial audio processing techniques operate in the time-frequency domain, the most straightforward approach is to feed the network with the magnitude and phase of the available audio signals at the desired time-frequency resolution and let the CNNs extract the relevant information needed for accomplishing the task, as identified by their internal activations. Classical representations such as SHs can also conveniently be used as input features.
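The straightforward magnitude-and-phase input described above can be sketched as follows; the frame length, hop size, and window choice are arbitrary example values:

```python
import numpy as np

def stft_features(x, n_fft=256, hop=128):
    """Stack per-channel STFT magnitude and phase into a feature tensor
    of shape (2 * n_channels, n_frames, n_fft // 2 + 1), the kind of
    input commonly fed to CNNs for spatial audio tasks."""
    chans = []
    for ch in x:                                  # x: (n_channels, n_samples)
        n_frames = 1 + (len(ch) - n_fft) // hop
        frames = np.stack([ch[i * hop:i * hop + n_fft] * np.hanning(n_fft)
                           for i in range(n_frames)])
        spec = np.fft.rfft(frames, axis=-1)
        chans += [np.abs(spec), np.angle(spec)]   # magnitude and phase maps
    return np.stack(chans)

x = np.random.randn(4, 1024)                      # e.g., a 4-microphone recording
feat = stft_features(x)                           # shape: (8, 7, 129)
```

The phase maps carry the inter-channel dependencies that encode spatial cues, while the magnitude maps carry the spectro-temporal content; the network is left to combine both.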
Another approach is to provide "hand-crafted" features that already represent meaningful information for the intended task. For instance, spatial information can be conveniently represented by sines and cosines of interchannel phase differences [55,56] or by generalized cross-correlations between the audio signals [57][58][59], which helps to avoid phase wrapping problems and thereby eases the network training. Sound intensity computed from the SH representation, which has been successfully applied in DL-based sound localization in 3D [60], is another example of such a "hand-crafted" feature obtained in a pre-processing step. Another recent example is that of [61], where a rotation-invariant DNN architecture that performs sound event localization and detection was proposed for SH signals.
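As an example of such a hand-crafted feature, the following sketch computes the generalized cross-correlation with phase transform (GCC-PHAT) between two channels; its peak location indicates the interchannel time delay, and the correlation vector (or a slice around zero lag) can be fed to a network as an input feature:

```python
import numpy as np

def gcc_phat(a, b):
    """Generalized cross-correlation with phase transform between two
    channels. The peak index gives the inter-channel delay in samples
    (circular, so negative lags wrap around to the end)."""
    n = len(a) + len(b)
    spec = np.fft.rfft(a, n) * np.conj(np.fft.rfft(b, n))
    spec /= np.abs(spec) + 1e-12       # PHAT weighting: keep phase only
    return np.fft.irfft(spec, n)

rng = np.random.default_rng(0)
s = rng.standard_normal(256)
delayed = np.concatenate([np.zeros(5), s])[:256]  # channel lagging by 5 samples
cc = gcc_phat(delayed, s)
lag = np.argmax(cc)                               # expected peak at lag 5
```

The PHAT weighting whitens the cross-spectrum, which is what makes the time-domain peak sharp and robust against the signals' own spectral coloration.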
Spatial audio methods and systems rely on well-known perceptual mechanisms used by the auditory system [19]. Spatial hearing cues result from the acoustical interaction of an impinging sound and a listener's anthropometric features, which leads to filtering effects caused by the head, shoulders, torso, and pinnae. Interaural differences also have a strong influence. The above cues are typically encoded into HRTFs in the frequency domain or, equivalently, into head-related impulse responses (HRIRs) in the time domain. Datasets of HRIRs measured over a grid of spatial locations are important not only for realistic reproduction purposes but also for extracting general or universal patterns useful for understanding the relative influence of certain spectral features in the perceived sound. Traditionally, studies aimed at analyzing spatial audio perception have relied on listening experiments, which usually require a carefully designed and time-consuming experimental setup.
The data extracted from HRIR measurements have been computationally analyzed in the past by ML algorithms to gain insight into the auditory localization process. One of the earlier attempts was [62,63], where a biologically inspired model of the source localization process was conceived by combining a cochlear model and a time-delay neural network. Similar ideas have been exploited more recently with the advent of DNNs. In [64], a spiking neural network is used to extract features from binaural recordings, and a three-hidden-layer feedforward network is trained on such features to perform both single-source and multi-source localization over a range of noise conditions.
The learning capabilities of CNNs were recently exploited in [65] to identify primary elevation cues encoded in HRTFs shared across a population of subjects. A CNN was trained on multiple HRTF datasets to estimate the elevation angle of a virtual sound source, and salient audio features were extracted by using layer-wise relevance propagation. The results indicated that the discovered features were in line with those obtained from the psychoacoustic literature.
The spatial information comprised by binaural signals has also been exploited by DNN-based approaches to understand the spatial arrangement of musical acoustic scenes. In [66], a CNN was trained to classify binaural music recordings into foreground-background, background-foreground, and foreground-foreground scenes, indicating the relative position of the listener with respect to ensembles of musical sources (foreground) and room reflections (background). The authors compared the performance of automatic algorithms to that of human listeners in this task [67], with results suggesting that ML algorithms can significantly outperform human listeners under matching binaural room impulse response (BRIR) conditions (test scenes rendered using the same set of BRIRs as the training scenes) while exhibiting similar performance in the mismatched case. Although the addressed task is not particularly aimed at feature discovery, meaningful internal representations might be obtained as a byproduct.
Techniques such as the ones described above, and others that also exploit visual information (cf. Section 3.3), may open the door to learning-based methods capable of leading to alternative signal representations for spatial audio.

Spatial audio processing
The processing stage typically either extracts desired information from the output of the capture stage or manipulates the spatial information. Indeed, most methods for spatial audio processing are data-based in nature, as their ultimate goal is usually to analyze or modify the spatial information present in their input signals or, alternatively, to exploit such spatial information to extract meaningful constituent signals of the sound scene. Signal enhancement by modifying the statistical relation between signal channels may also be considered part of the processing [68]. Some of the methods presented in this section can be used sequentially in a processing pipeline. Many reference and commercial methods for spatial audio processing today are still based on classical signal processing, as in the case of the family of parametric spatial audio techniques (Section 3.2.2) or the methods employed within spatial audio coding standardization frameworks (Section 3.2.3). This is also the case for the vast majority of viewpoint translation methods (Section 3.2.5). On the other hand, many DL-driven approaches have recently appeared in the context of acoustic analysis (Section 3.1.1), sound scene decomposition (Section 3.2.1), and audio-oriented audiovisual processing (Section 3.3), and these are rapidly displacing traditional methods. For example, DL approaches are now a reference in fields like source separation and enhancement. In between, while classical methods remain a reference for audio up-mixing due to their inherent relation to parametric spatial audio approaches (Section 3.2.4), there is a clear trend towards the use of DL for this task. As a result, we will cover all the above spatial audio processing systems while emphasizing this diversity and coexistence of traditional and ML-based techniques. Spatial audio processing techniques can be of a very different nature and oriented towards significantly different objectives.
For the sake of clarity in the presentation, we broadly divide processing techniques into three blocks (cf. Fig. 1). The first one covers techniques aimed at acoustically analyzing and describing the sound field (Section 3.1). The second block describes techniques for sound scene processing, with a special emphasis on methods oriented towards the analysis and modification of the spatial information in the recorded scene for subsequent re-synthesis (Section 3.2). Finally, in the third block, we discuss recent approaches making use of audiovisual data to address several tasks related to spatial audio (Section 3.3).

Acoustic analysis
This section covers data-based spatial audio methods for acoustic analysis. We discuss separately methods aimed at estimating acoustic parameters for sound field rendering and those for acoustic imaging and sound field reconstruction. While a large body of literature exists on these topics, we limit our discussion to the most recent approaches that make use of DL. Note, however, that in order to provide a comprehensive overview of such recent approaches, we also cover works in which visual data are considered as input, even though the final objective is on the acoustics side.

DL-driven acoustical parameter estimation for spatial audio rendering
The capability of DL to model complex relationships between different representations and their effects in a certain domain has found recent use in acoustical modeling problems. One interesting such application is acoustical parameter estimation for virtual acoustics and spatial auralization. More specifically, a DNN is trained in [69] to map a rectangular plate geometry that occludes a source from a receiver to filter parameters modeling the perceived effect of the diffracted sound at the listener. Going even further, fast computation of the 2D scattered field around an acoustically hard 2D object is approached in [70] as an image-to-image learning task for a CNN, trained on images generated with wave-based acoustical simulations. The principle is inverted in [71] to estimate the 2D shape of the object from its scattered field. Finally, a similar training strategy is followed in [72], but instead of mapping between 2D objects and field images, 3D geometries are mapped to far-field spherical harmonic coefficients of the scattered field. Note that training NNs to model acoustical scattering [73] or infer geometry from scattering measurements [74] had been attempted much earlier than those works; however, their considerations are different. Acoustical parameter estimation using DL has also been used to extract room acoustic parameters from geometry or image data for fast spatial rendering in audio VR. In [75], energy decay relief curves are estimated directly from images of acoustical spaces using CNNs. Training pairs of features of room geometrical configurations and spatial impulse responses captured for those configurations are used in [76]. Alternatively, simplified room geometries are reconstructed from 360° camera images in [77,78], which are used to drive virtual acoustic simulators for AR/VR applications.
In the same spirit, [79] uses a DNN to classify materials from textures in a 3D room geometry in order to deduce and optimize absorption coefficients to be used in conjunction with the geometry for interactive geometrical acoustics simulations. Finally, [80] combines measurements with geometric generation of reverberation by estimating a simplified geometry from a moving 360° camera recording with structure-from-motion, which is then used to synthesize early reflections at any position. Additionally, a single monophonic RIR is captured in the room and used as a guide for obtaining absorption filters for the inferred geometry, low-frequency modal filters, and also the late reverberation tail to append to the generated early part in a position-independent manner. The method is used to synthesize ambisonic SRIRs. The work is extended in [81] by replacing the RIR measurement signal with a general audio signal such as speech and using commodity hardware (i.e., a mobile phone). Since the RIR parameters are not readily available in this case, they are estimated from the source recording using DNNs.

DL-driven sound field reconstruction
Recently, DL techniques have also been exploited for sound field reconstruction from a small number of irregularly distributed microphones in a room. The work in [82] proposed the use of a U-net neural network [83] with partial convolutions trained on simulated data for sound field reconstruction in rectangular rooms. The proposed data-driven method reconstructs the magnitude of the sound pressure on a plane, jointly performing inpainting and super-resolution from irregular discrete measurements in the frequency range 30–300 Hz. The same method was recently extended to reconstruct both magnitude and phase, with testing over real-world sound fields in [84] using a publicly available dataset, also showing the potential application of DL-driven sound field reconstruction in sound zone control. Other interesting and novel architectures are appearing in the context of sound field analysis for acoustic imaging. In [85], a recurrent neural network with a fully customized architecture is proposed, taking into account relevant aspects of acoustic imaging problems and inputs from a spherical microphone array. In this context, other DL-driven approaches for high-accuracy acoustic camera solutions have recently appeared, which also make use of additional modalities such as stereo vision technology [86].

Sound scene processing
We refer to sound scene processing techniques as those that analyze and manipulate audio signals with the aim of extracting and modifying the spatial information in the captured scene by decomposing the scene into perceptually meaningful elements. These elements may be either in the form of spatially relevant components (e.g., directional vs. diffuse sound) or related to the sources making up the scene. Despite its proximity to the term "auditory scene analysis" [87,88], coined by psychologist Albert Bregman and usually linked to the field of source separation, we use the term sound scene processing in a more general way that encompasses not only the extraction and decomposition of sound into source objects, but also the analysis and manipulation of spatial features.

Sound scene decomposition
With the term scene decomposition, we refer to the wide variety of methods that aim to break down the sound scene into its constituent components. One prominent example is decomposing the scene into constituent signals based on their spatial properties, such as foreground-background, primary-ambience, or directional and non-directional separation. More elaborate spatial decompositions can detect and separate distinct localized sources from different directions. Such decompositions rely heavily on source detection, source localization, and spatial filtering techniques. These research topics constitute core problems in microphone array processing, with decades of intensive research history and applications spanning a much wider range than the scope of this article. Such techniques are of course employed by, e.g., the parametric spatial audio processing methods reviewed in the next section, but they are not analyzed separately here; for a comprehensive overview of them, the reader is referred to [89,90]. Recently, there has also been intensive research on DL variants of source localization, e.g., [91][92][93], and spatial filtering [94]. Another family of methods aiming to decompose the spatial scene into its constituent signals is termed multichannel source separation. Many examples are closely related to adaptive and informed spatial filtering, as reviewed in [90]. However, while localization and spatial filtering are used extensively in spatial audio methods, source separation research has generally focused on maximum separation of the source signals rather than on re-rendering or spatially modifying the scene. Nevertheless, a stronger separation component has obvious applications in spatial audio, such as spatial remixing of the scene and other source-dependent modifications. Works that follow a source separation perspective for spatial audio fall into two categories.
The first aims to recover a demixing matrix or separation masks using mainly spatial features and a mixing model in the time-frequency domain, such as the works in [95][96][97]. The second category attempts an even higher-level decomposition of the scene, integrating, apart from spatial mixture models, also spectro-temporal models that distinguish one source from another. Separation in this case can be performed blindly or in a supervised manner, using some prior information on the spectral templates of the sources in the scene. Only a few works have attempted applications of those models to spatial audio rendering [98,99]. In general, multichannel source separation is transitioning very quickly to DL-driven solutions, which, however, are currently focused on multi-speaker separation and enhancement rather than scene modification or re-synthesis [55,100,101].
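The masking principle underlying the first category can be sketched compactly. The toy example below (plain Python; the panning-based spatial feature, the mask shape, and all names are illustrative choices for a two-channel mixture, not the method of any specific cited work) derives a soft time-frequency mask from a simple inter-channel level cue and applies it to the complex bins of a stereo STFT frame:

```python
def panning_mask(left_bin, right_bin, target_pan, width=0.2):
    """Soft time-frequency mask from a simple spatial feature: the
    left/right magnitude ratio mapped to a panning index in [0, 1].
    Bins whose panning index lies close to `target_pan` get weight ~1,
    others are attenuated towards 0."""
    l, r = abs(left_bin), abs(right_bin)
    pan = r / (l + r + 1e-12)          # 0 = hard left, 1 = hard right
    return max(0.0, 1.0 - abs(pan - target_pan) / width)

def separate(left_stft, right_stft, target_pan):
    """Apply the soft mask to each complex bin of a (toy) stereo STFT frame."""
    out = []
    for l_bin, r_bin in zip(left_stft, right_stft):
        m = panning_mask(l_bin, r_bin, target_pan)
        out.append((m * l_bin, m * r_bin))
    return out

# Toy frame: first bin panned hard left, second bin panned centre
left  = [1.0 + 0j, 0.5 + 0j]
right = [0.1 + 0j, 0.5 + 0j]
kept = separate(left, right, target_pan=0.5)   # keep centre-panned content
```

In practice such masks are estimated per sub-band over many frames, and the second category replaces or augments the purely spatial cue with spectro-temporal source models.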
Apart from signal decompositions, higher-level audio scene analysis with semantic information is an extremely active field of research that is completely dominated by data-based DL approaches [102]. Examples include acoustic scene classification (ASC) [103,104] and the simultaneous temporal detection and sound-type classification of multiple concurrent sound events in the scene, commonly known as sound event detection (SED) [105,106][103,105,107]. Interestingly, this research community had not involved spatial information until recently, with a few exceptions such as [104] in ASC or [106] in SED. Currently, however, there is increased interest in advanced spatiotemporal semantic descriptions of the sound scene, with the common task of jointly performing sound event localization and detection (SELD) using multichannel signals [91,107]. Semantic descriptions and decompositions of sound scenes are of course of interest also in spatial audio analysis and synthesis, and stronger cross-pollination between those research directions and spatial audio is expected to take place in the coming years.

Sound field parameterization for analysis, modification, and resynthesis
In this section, we present an overview of parametric spatial audio techniques that analyze the captured sound scene to obtain a compact yet perceptually relevant parametric representation and subsequently re-synthesize it for spatial reproduction. In these techniques, the estimated signals together with supplementary parametric information serve as a basis for modifications (processing) as desired by the target application. A description of many state-of-the-art approaches to parametric spatial audio processing can be found in [108]. The stepping stone in the development of these methods was the proposal of spatial impulse response rendering (SIRR) [109,110], which processes SRIRs in ambisonic B-format (i.e., a 1st-order SH representation) to decompose them into one direct-sound component and a diffuse residual for each time-frequency bin. The underlying assumption is that source signals tend to be sparse in the time-frequency domain (for example, the energy of a periodic signal concentrates at the harmonic oscillations), so that a single wave front sufficiently represents the direct-sound component at a given frequency bin [111]. This concept was extended in [112] to running signals and was termed directional audio coding (DirAC). In DirAC, a zeroth-order (omnidirectional) signal is supplemented by two parameters, namely the diffuseness and the DOA of the direct signal. The former is estimated from the temporal variation of the intensity vector [112,113], and it is used to extract the direct and diffuse signal components from a mono signal using a single-channel filter [112]. In later work [114], the parametric approach was extended to arrays of any type, whereby the extraction of the direct and diffuse signals is typically performed using signal-dependent spatial filters, many of which are well known from speech enhancement [89,90,115].
The extracted signals are supplemented by parametric information on the DOAs or the positions from which the direct sound components originate.
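The DirAC-style parameter estimation described above can be made concrete with a short sketch. The toy example below (plain Python; physical constants and the time-frequency transform are omitted, units are normalized, and the horizontal-only B-format convention X = cos φ · W, Y = sin φ · W for a plane wave from azimuth φ is assumed) estimates the DOA from the active intensity vector and the diffuseness from how much the short-time intensity averages out:

```python
import math

def dirac_parameters(frames):
    """Estimate a DirAC-style DOA (azimuth) and diffuseness for one
    frequency bin from short-time B-format values (W, X, Y) per frame.
    Physical constants are normalized away for clarity."""
    # Components of the time-averaged active intensity vector
    ix = sum((w.conjugate() * x).real for w, x, y in frames)
    iy = sum((w.conjugate() * y).real for w, x, y in frames)
    # Total energy: pressure plus particle-velocity contributions
    energy = 0.5 * sum(abs(w)**2 + abs(x)**2 + abs(y)**2 for w, x, y in frames)
    azimuth = math.atan2(iy, ix)
    # Diffuseness -> 0 for a single plane wave, -> 1 when the net
    # intensity flow cancels out over the averaging window
    diffuseness = 1.0 - math.hypot(ix, iy) / (energy + 1e-12)
    return azimuth, diffuseness

# Single plane wave from ~30°: low diffuseness, DOA recovered
az, psi = dirac_parameters([(1.0, math.cos(0.5236), math.sin(0.5236))])
```

Two counter-propagating waves of equal strength drive the net intensity to zero and the diffuseness estimate to one, which is the cue used to route energy to the diffuse stream.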
Examples of parametric modifications include rotations of the entire recorded sound scene or manipulations of the locations of individual directional sounds [25,116,117]. For reproduction of musical recordings, the diffuse signal is usually subject to decorrelation before it is fed to the loudspeakers in order to increase the feeling of spaciousness and plausible listener envelopment [112,118]. Another example is improving quality by reducing coloration while providing stable localization cues when using recordings of spaced microphone arrays [119]. Furthermore, when capturing the acoustic scene using distributed microphone arrays, the signals to be reproduced can be synthesized for an arbitrary listening position in space. This can be achieved by synthesizing the signals of virtual microphones of arbitrary spatial patterns at locations that are not populated with real microphones [120,121]. Similarly, binaural signals for different virtual listening positions can be synthesized [114], which we discuss in some detail in Section 3.2.5. In teleconference applications, the preservation of spatial cues combined with the flexible spatial selectivity offered by parametric approaches can help the auditory system to naturally focus on a desired speaker [114,122], which may lead to better speech intelligibility. By adjusting the output parametric gains for the direct and diffuse signal components, it is also possible to align the visual and acoustical images in digital camera recordings, including the effects of an acoustical zoom that is consistent with the visual cues and a blurred spatial audio image for sources located off the focal plane [123,124]. Another approach to generate binaural or multichannel audio that follows the moving picture of a visual scene is to perform adaptive equalization of the direct signals [125].
Considerable research has also been carried out to extend and improve the parametric representations. Early attempts include the higher angular resolution plane wave decomposition (HARPEX) [126], which decomposes B-format signals into two plane waves per time-frequency bin, and a method in [127] in which the higher-order signals are decomposed into several plane waves; in both cases, the diffuse residual is ignored. Higher SH or microphone orders (HO) enable higher spatial resolution, which allows differentiation of more than one simultaneously impinging wave front. SIRR was extended to HO-SIRR in [128,129]. DirAC was extended to HO-DirAC in [130], whereby the standard parameterization is performed separately for a set of angular sectors to support several directional and diffuse sounds arriving simultaneously from spatially separated directions. More recently, coding and multidirectional parameterization of ambisonic sound scenes (COMPASS) [131] extends the parametric model to several overlapping directional sounds and a diffuse residual per time-frequency bin, and it also provides a convenient method to combine the parametric processing with standard ambisonic reproduction.

Spatial audio coding
The delivery of spatial audio content to the masses and the transfer of academic research to the consumer and media industries involve making such delivery flexible and efficient. The adoption of spatial audio formats and processing schemes within standardization activities should not be ignored in this overview, as they can have a very significant impact on how spatial audio will be consumed at scale and on the momentum that spatial audio technology may experience in the coming years. The development of audio applications for consumer electronics and multimedia streaming services brought about a demand for representing spatial audio content in formats that support efficient transmission at limited bandwidth and require scarce storage capacity. Over the last 20 years, a number of spatial audio coding techniques have emerged for 2D and 3D reproduction that to a large degree maintain the fidelity of the rendered spatial scene. Although ML has not yet been incorporated into popular spatial audio coding formats, we can already observe an interest in applying ML as part of the processing chain, especially for data compression at low bitrates. Early coding standards include Moving Picture Experts Group (MPEG) parametric stereo [132,133] and Dolby Pro Logic, in which spatial information is encoded by manipulating the phase differences between the stereo channels. A notable successor has been MPEG Surround [134], which exploits three major spatial cues attributed to the perception of the 2D spatial sound image, namely inter-aural level differences, inter-aural time differences, and inter-aural coherence [135][136][137]. These perceptual spatial cues had formerly been applied in binaural cue coding [136,137]. The encoder of MPEG Surround extracts spatial information such as channel level differences and inter-channel correlations from pairs of input audio channels using a complex-exponential modulated quadrature mirror filter (QMF) filterbank.
Together with additionally estimated channel prediction coefficients and prediction residual signals, this side information is transmitted along with the down-mixed mono or stereo signals to a decoder that uses it for up-mixing to multichannel audio. A stepping stone in developing high-fidelity spatial audio codecs has been the MPEG-H 3D Audio [138,139] standard, which supports spatial coding of multichannel audio signals, sound objects, and HO ambisonic signals. Similarly to the parametric methods discussed in Section 3.2.2, the scene can be decomposed into sound objects that are either static or whose positions and gains vary over time, as in MPEG-D spatial audio object coding (SAOC) [140]. The remaining ambience sound field components can be conveniently coded in an HO ambisonic representation. Until recently, attempts to incorporate ML into spatial audio coding have concentrated predominantly on the inference and compression of the associated parametric side information. In [141], ML is employed in visual and audio-visual tracking of sound objects in an end-to-end audio-video approach. More recently, DL has been applied to spatial audio object coding [142], in which a mixture network of deep convolutional architectures enables effective compression of the spatial parameters of audio objects at low bitrates.
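The kind of side information such parametric encoders extract can be illustrated with a small sketch. The example below (plain Python; real-valued sub-band samples and no QMF filterbank, so it is a simplification of the actual MPEG Surround processing, and the function name is illustrative) computes a channel level difference in dB and an inter-channel correlation for one sub-band frame of a channel pair:

```python
import math

def stereo_cues(left, right):
    """Channel level difference (dB) and inter-channel correlation for one
    sub-band frame of a channel pair, in the spirit of the side information
    extracted by parametric codecs (simplified: real-valued samples)."""
    pl = sum(s * s for s in left)       # power of the left channel
    pr = sum(s * s for s in right)      # power of the right channel
    cross = sum(a * b for a, b in zip(left, right))
    cld_db = 10.0 * math.log10((pl + 1e-12) / (pr + 1e-12))
    icc = cross / math.sqrt((pl + 1e-12) * (pr + 1e-12))
    return cld_db, icc

# Identical channels: 0 dB level difference, correlation of 1
cld, icc = stereo_cues([1.0, -0.5, 0.25], [1.0, -0.5, 0.25])
```

A real encoder computes such cues per QMF sub-band and time frame, quantizes them, and transmits them alongside the down-mix; the decoder inverts the process during up-mixing.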
Note that apart from the coding of spatial information, a significant part of the discussed codecs concerns audio signal compression. For instance, compression in MPEG-H 3D Audio is performed based on MPEG-D unified speech and audio coding (USAC) [143], while MPEG-4 high efficiency advanced audio coding (HE-AAC) [144] is typically used in MPEG Surround. The virtue of using ML in audio compression has been demonstrated in [145] for extending the frequency bandwidth and in [146] for reducing lossy coding artifacts. However, a full end-to-end neural audio codec was only very recently proposed in [147]. This DL-based SoundStream model, which is composed of a fully convolutional encoder-decoder structure and a residual vector quantizer, has been designed to provide good quality at extremely low bitrates. Extensions of ML-based audio compression to ML-based spatial audio coding are widely expected, yet they are still to come.

Up-mixing
Up-mixing techniques are those aimed at generating a higher number of audio signals from a smaller set of audio channels while preserving important aspects such as the locations of the main sound sources or the ambience components contained in an original sound recording. This usually comes with an apparent increase of the spatial resolution. As a result of the up-mixing process, recordings coming from a down-mixing process can be automatically extended to multi-channel arrangements with the objective of conveying a more natural and enveloping experience.
The sound field parameterization methods presented in Section 3.2.2 exhibit such functionality inherently, for instance, by reintroducing the extracted direct sound into the scene representation at a higher SH order, such as in [42,43]. Based on the compact parametric representation, these methods can synthesize the loudspeaker channels of arbitrarily high orders by means of panning the directional signals and decorrelating the diffuse residual signals to be played back over all loudspeakers. However, the most popular application of up-mixing is in spatial audio coding, covered in Section 3.2.3, where at the decoding stage multichannel loudspeaker signals are synthesized from down-mixed representations. Several data-based methods have been proposed for up-mixing from mono or two-channel stereo to multichannel [148]. For up-mixing to a 5.1 format in MPEG Surround, the powers and cross-correlations of stereo signals are analyzed in perceptually motivated sub-bands to extract coherent signals to be played back using the pair of front loudspeakers that encloses the estimated direction (i.e., left and center, or right and center) as well as to extract the lateral ambience signals to be emitted from the side or even all loudspeakers for improved listener envelopment. In [149], ambience components are first identified and extracted based on inter-channel coherence. Then, they undergo decorrelation using all-pass filters in order to avoid undesired phantom images to the sides of the listener. The panning gains of individual directional signals are found based on the so-called inter-channel similarity measure, and a nonlinear mapping is applied to re-pan the sources from 2 to 3 front channels for improved stability of the spatial image for off-center sources. By introducing more than three front loudspeakers, the width of the sound scene can be further increased beyond the standard ±30°, and the listening sweet-spot region can be enlarged.
In [68], an adaptive mixing solution was proposed to reach the target covariance matrix by exploitation of the independent components in the input channels, while minimizing the usage of decorrelated ambient signals when the target covariance cannot be reached without the application of decorrelation.
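The decorrelation of extracted ambience signals mentioned above is commonly realized with all-pass filters. A minimal sketch of a single Schroeder all-pass section (plain Python; the delay and gain values are illustrative, and practical decorrelators cascade several such sections with frequency-dependent parameters) shows the principle: the magnitude spectrum is preserved while the phase, and hence the inter-channel correlation, is altered:

```python
def allpass_decorrelate(x, delay=7, g=0.5):
    """First-order Schroeder all-pass: y[n] = -g*x[n] + x[n-d] + g*y[n-d].
    Used as a toy decorrelator: it preserves the magnitude spectrum
    while scrambling the phase of the input signal."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y

# Impulse response of the section: -g, then (1 - g^2) * g^k at lags (k+1)*d
x = [1.0] + [0.0] * 20
y = allpass_decorrelate(x)
```

Feeding the same ambience signal through sections with different delays for each output channel yields mutually decorrelated copies, avoiding the phantom-image artifacts discussed above.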
Recently, DNNs have also found application in the development of audio up-mixing and surround decoding systems. One of the earlier attempts at ambience extraction from mono signals using neural networks was proposed in [150], where a shallow architecture with one hidden layer was used to estimate spectral weights relating the ratio between the ambience and direct signal components in the time-frequency domain. The inputs to the network were well-known hand-crafted audio features such as spectral centroid, spectral flatness, or spectral flux. More recently, a DNN-based method to process stereo tracks was proposed in [151] that is aimed at classifying and separating the primary (direct) and ambient (diffuse) components in each time-frequency bin of the input mixture. In this case, a feedforward network with three hidden layers and a sigmoid-activated output layer was used for classification, building a time-frequency mask for the subsequent separation. Another work exploiting DNNs for stereo-to-5.1 up-mixing was proposed in [152], considering the MPEG-H 3D Audio framework. In this approach, DNN models for the center and surround channels are trained using log-spectral magnitudes of QMF sub-bands. The input stereo signals are converted into rear and center channels using the trained models, and the generated sub-band signals are transformed back to audio signals using QMF synthesis. Following a similar sub-band approach, the authors proposed in [153] a method for converting mono signals to stereo by training multiple DNNs, one for each sub-band, with the objective of modeling the band-wise nonlinearity between the mid and side signals.
The system proposed in [154] uses two networks for two-to-five-channel up-mixing, where one network is used for primary and ambient signal separation and the other for ambience rendering. The networks are jointly trained by minimizing the mean-squared error between the magnitude spectra of the original and the decoded five-channel signals as well as the inter-channel level differences of the target signals. The obtained spectral weights are applied to each frequency bin of the input stereo signal, allowing for the separation of primary and ambience signals and the generation of diffuse sound, respectively.

Viewpoint translation
The advent of VR and AR goggles has boosted the interest of both academia and industry in binaural rendering technologies. The two most dominant fields of activity in this regard are HRTF personalization (cf. Section 4.2) and 6-degree-of-freedom (6-DoF) binaural rendering. This section focuses on the latter. The 6 DoF in this case are the 3 angles of head orientation as well as head translations along the 3 Cartesian dimensions. 6-DoF binaural rendering obviously requires tracking of the user's orientation and position in real time on the rendering side. The head tracking that VR and AR goggles perform for visual rendering is subject to much stricter requirements, particularly regarding signal-to-noise ratio and latency, than what audio rendering demands, so that suitable tracking data is readily available.
Most methods employ an SH representation of the sound field that is to be rendered. Rotation of this representation relative to the HRTFs that are used for the rendering is straightforward. 6-DoF rendering is achieved in [155] via the application of blind source separation to the captured sound field. A method based on DirAC is presented in [156,157]. Translatory head movements based on a plane wave expansion of the sound field to be rendered were presented in [158,159], which demonstrated fundamental limitations of this framework. As a consequence, 6-DoF binaural rendering methods typically use a more application-oriented sound field representation and come in four flavors: (1) methods that perform a mathematical translation of the orthogonal sound field decomposition or a re-expansion around a different center [160][161][162][163], (2) parameterization and adaptation of a single-viewpoint recording (or RIR measurement) with a microphone array [116,117][164][165][166], (3) interpolation between microphone array recordings performed at different locations [167][168][169][170], and (4) interpolation between parameterizations of recordings performed at different locations [171][172][173][174][175].
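The scene rotation that SH-based methods rely on is particularly simple at first order. The sketch below (plain Python; first-order horizontal channels only, with channel scaling ignored since a pure rotation is independent of the normalization convention) rotates a B-format frame about the vertical axis, which is the operation needed to compensate the yaw component of the tracked head orientation:

```python
import math

def rotate_foa_yaw(w, x, y, z, yaw):
    """Rotate a first-order ambisonic (B-format) frame by `yaw` radians
    about the vertical axis: W and Z are invariant, while X/Y transform
    like a 2D vector. Compensating head rotation amounts to applying the
    inverse head rotation to the scene before binaural decoding."""
    c, s = math.cos(yaw), math.sin(yaw)
    return w, c * x - s * y, s * x + c * y, z

# A frontal plane wave (x=1, y=0) rotated by 90° ends up at the left (y=1)
w2, x2, y2, z2 = rotate_foa_yaw(1.0, 1.0, 0.0, 0.0, math.pi / 2)
```

Pitch and roll mix the Z channel into X/Y via analogous rotation matrices; at higher SH orders, the rotation becomes a block-diagonal matrix with one block per order.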

Audio-visual processing
Some of the methods described in Section 3.1 make use of visual data to gather supplementary environmental or geometric information, which assists the estimation of acoustic-related parameters. Combined audio and video analysis has been performed extensively by the computer vision community, providing stronger cues for, e.g., activity detection or speaker recognition than processing video or audio separately. Some of the studied tasks overlap with audio-only tasks, such as on-video speaker and sound source localization [176] and video-guided monophonic source separation [177]. In any case, there is no doubt that the use of video recordings for addressing spatial audio tasks has been receiving increasing attention in recent years.
In this context, the availability of very large audiovisual datasets has contributed significantly to promoting the use of audiovisual information to jointly exploit auditory and visual spatial cues. In [178], a dataset of 360° videos from YouTube containing first-order ambisonics audio was collected to train a self-supervised audiovisual model aimed at spatially aligning video and audio clips extracted from different viewing angles. The approach was shown to yield better representations, outperforming prior work on audio-visual self-supervision for downstream tasks like audio-visual correspondence, action recognition, and semantic video segmentation. Similarly, the work in [179] proposed another self-supervised approach to understanding audio-visual spatial correlation by training a DNN over a large dataset of ASMR (autonomous sensory meridian response) videos to classify whether a video's left-right audio channels had been flipped. The learnt audio-visual representations were proven to be useful for carrying out downstream tasks including source localization, mono-to-binaural up-mixing, and sound source separation. A similar application is addressed in [180], where a U-net network is proposed to infer binaural audio from videos and their respective monophonic audio recordings using a database of binaural music recordings as training targets. Similar to mid-side stereo, a mid signal corresponds to the mono mix-down of the binaural audio, and the network learns to predict the side signal only. A similar approach was followed in [181]. In [178], a network is also taught to upscale a monophonic signal to first-order ambisonics by self-supervised learning from 360° videos. The network is taught to produce time-frequency masks to separate the mono input into directional components, along with a set of directional weights encoding those components into first-order ambisonics signals.
The work in [182] shares some similarities but assumes static source positions and does not perform source separation. Finally, an attempt at a completely synthetic approach to the generation of an audible scene from a 360° image is presented in [183], mixing background ambience selected by scene classification with object, people, or action sounds obtained through visual object recognition and spatialized at their respective detected image locations. Of course, there is no temporal information on the arrangement of events in this scenario, which brings the work closer to a sonification of the immersive image.
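The mid-side formulation used by the binauralization networks discussed above can be made concrete with a short sketch (plain Python; function names are illustrative): the mid signal is the mono mix-down that the network receives as input, and only the side signal needs to be predicted in order to recover a two-channel output.

```python
def to_mid_side(l_ch, r_ch):
    """Mid/side encoding of a binaural (or stereo) pair: the mid is the
    mono mix-down, the side carries the inter-channel difference."""
    mid = [(l + r) * 0.5 for l, r in zip(l_ch, r_ch)]
    side = [(l - r) * 0.5 for l, r in zip(l_ch, r_ch)]
    return mid, side

def from_mid_side(mid, side):
    """Exact inverse: recover left/right from mid and side."""
    l_ch = [m + s for m, s in zip(mid, side)]
    r_ch = [m - s for m, s in zip(mid, side)]
    return l_ch, r_ch

# Round trip: encoding followed by decoding recovers the original pair
l_ch, r_ch = [0.3, -0.1], [0.2, 0.4]
mid, side = to_mid_side(l_ch, r_ch)
rec_l, rec_r = from_mid_side(mid, side)
```

Formulating the learning target as the side signal alone removes the trivial part of the mapping (the mid is already given) and lets the network concentrate its capacity on the spatial difference between the ears.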

Spatial audio reproduction
The final stage in the spatial audio pipeline is the reproduction of the multichannel signals that result from the preceding capture and processing stages. In general, the theory underlying spatial audio reproduction is well established, and no disruptive methods or discoveries have emerged in recent years, especially in the context of loudspeaker-based reproduction. That said, the power of DL has attracted great interest in the context of binaural reproduction, where some interesting DNN-based approaches have recently emerged for selecting or synthesizing binaural signals adapted to a given listener. Therefore, while this section briefly outlines conventional linear spatial audio reproduction methods for the sake of completeness, only the aforementioned DL-driven attempts are reviewed in detail.

Linear spatial audio reproduction
Any audio reproduction method converts a scene representation into loudspeaker input signals that produce a sound field with a given desired physical structure, or into binaural signals with given desired properties. Traditional audio reproduction concepts like stereophony feed the signals from the microphones of the capture stage directly into the loudspeakers of the reproduction stage. More advanced concepts like modern ambisonics formulations perform linear filtering operations to compute the loudspeaker signals from the entire set of microphone signals, whereby one of the scene representations discussed in Section 2 can serve as an intermediate format. Many such methods are linear.
We refer the reader to the literature such as [184,185] on binaural rendering, [122, Ch. 14] on loudspeaker panning, [25] on ambisonics, and [186] on wave field synthesis for more detailed discussions. Further overviews are provided in [2,17,18,187]. We will focus on certain aspects of binaural reproduction in the subsequent Section 4.2. Binaural rendering in its most common form is a linear method that is straightforward and comprises filtering the given scene representation with a suitable representation of the user's HRTFs. What is of interest in the scope of the present article is the computation of HRTFs for this purpose that are personalized to the user by means of data-based processing.

DL-driven HRTF personalization and generalization
HRTFs are highly dependent on the individual anthropometric features of a given listener. A major challenge for personalized binaural audio reproduction is the measurement procedure of HRTFs, which is tedious and expensive. Research efforts have therefore addressed the problem of HRTF customization with the objective of estimating individual HRTFs based only on geometric information or user feedback, without the need for any measurement process. Traditionally, principal component analysis (PCA) has been the technique of choice for dimensionality reduction in HRTF datasets, leading to many interesting observations and experiments that evaluate the impact of different eigenmodes on spatial audio perception [188]. Further improvements in PCA-based HRTF modeling and customization were presented in [189,190]. HRTFs were synthesized in [191] using a sparse combination of a subject's anthropometric features. The use of DNNs on this subject is rapidly gaining momentum. An HRTF selection method based on a multi-layer perceptron neural network was proposed in [192]. The system was trained using as input a set of measured anthropometric parameters (shapes and sizes of the listeners' heads and pinnae) extracted from photographs. To train the network, 30 subjects listened to music rendered using different HRTFs to assess their fitness and obtain a score used as the target output. Such an approach was shown to be more effective in selecting the best-matching HRTF for a given listener than selecting the one with the smallest sum of squared errors between the listener's measurements and each of the database members. In a similar spirit, but with the aim of synthesizing a personalized HRTF, [193] proposed a method consisting of three sub-networks: a feedforward network taking anthropometric measurements as input, a CNN using ear images, and another feedforward network that estimates a personalized HRTF from the outputs of the other two sub-networks.
Autoencoder (AE) architectures have been adopted in several works to capture relevant patterns across HRTF datasets. An AE is an unsupervised DNN that learns to compress and encode data efficiently by learning to reconstruct the data from a reduced encoded representation, usually referred to as an embedding. In [194], a sparse AE is used to create embeddings from the captured HRTFs, which are used to train a generalized regression neural network (GRNN) to approximate equivalent latent representations of the corresponding HRTFs at arbitrary angles. The sparse AE is then able to decode the GRNN output to reconstruct the desired HRTF. Such an approach provides an efficient way to jointly address the creation of generalized HRTF models and angle interpolation from large datasets. Similarly, in [195], a set of independent AEs, one per elevation angle, was used to reconstruct HRTFs across azimuth angles, and the resulting latent representations in the bottleneck served as targets for a feedforward network using anthropometric features as input. A personalized HRTF can then be synthesized by estimating the latent representation given the features of a new subject and feeding the result into the decoder part.
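The encode-regress-decode pipeline of [194] can be illustrated with a deliberately simplified sketch: a PCA projection stands in for the sparse AE, and linear interpolation of embeddings stands in for the GRNN. The synthetic "HRTF" data and all dimensions are made up for illustration:

```python
import numpy as np

# Synthetic low-rank stand-in for an HRTF magnitude dataset: one 64-bin
# response per azimuth, built from a few smooth angular weight functions.
azimuths = np.linspace(0.0, 2.0 * np.pi, 72, endpoint=False)
f = np.linspace(0.0, 1.0, 64)
shapes = np.stack([np.exp(-f), np.sin(3.0 * f), f, np.ones_like(f)])

def hrtf(theta):
    weights = np.array([np.cos(theta), np.sin(theta), np.cos(2.0 * theta), 1.0])
    return weights @ shapes

X = np.stack([hrtf(t) for t in azimuths])  # (72 angles, 64 bins)

# "Encoder"/"decoder": projection onto the top-k principal components,
# a linear stand-in for the sparse AE of [194].
k = 3
X_mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - X_mean, full_matrices=False)

def encode(H):
    return (H - X_mean) @ Vt[:k].T   # HRTF -> embedding

def decode(z):
    return z @ Vt[:k] + X_mean       # embedding -> HRTF

Z = encode(X)  # embeddings of all measured directions

# "Regressor" stand-in for the GRNN: predict the embedding at an
# unmeasured angle by interpolating neighboring embeddings, then decode.
theta_new = 0.5 * (azimuths[10] + azimuths[11])
z_new = 0.5 * (Z[10] + Z[11])
hrtf_est = decode(z_new)
err = np.max(np.abs(hrtf_est - hrtf(theta_new)))
print(err)  # small: embedding interpolation approximates the true HRTF
```

A real AE replaces the linear projection with nonlinear encoder/decoder networks, and a GRNN replaces the nearest-neighbor interpolation, but the data flow is the same.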
In [196], a training and calibration procedure based on a variational AE structure was proposed. Variational AEs are deep generative models that describe an observation probabilistically in latent space by modeling the probability distribution of the input data. In the training step, an HRTF generator is created by learning the individual and non-individual features from an HRTF dataset. The generator is based on an extended variational AE that uses a set of adaptive layers to separate the individual and non-individual factors of the users in a nonlinear space. The learned latent variables, together with personalization weights optimized by user feedback, are then used in the calibration step to generate a personalized HRTF for a specified direction.
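At the core of any variational AE is the reparameterization trick, which keeps the sampling of the latent variable differentiable during training; a minimal sketch (the latent dimensions and values are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_latent(mu, log_var, eps=None):
    """Reparameterization trick: instead of sampling z directly from
    N(mu, sigma^2), sample eps ~ N(0, I) and set z = mu + sigma * eps,
    so gradients can flow through mu and log_var."""
    sigma = np.exp(0.5 * log_var)
    if eps is None:
        eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

mu = np.array([0.2, -1.0, 0.5])       # encoder output: latent means
log_var = np.array([0.0, -2.0, 1.0])  # encoder output: latent log-variances

z = sample_latent(mu, log_var)

# The KL term of the VAE loss pushes the latent distribution towards N(0, I):
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
print(z, kl)
```

In [196], it is such learned latent variables that are adjusted during the calibration step by user feedback.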
A study of several aspects of HRTF individualization that provides further insight into the research lines discussed above is presented in [197]. As expected, models seem to generalize better when trained on larger datasets. This may be achieved by a proper and careful merging of existing datasets [198] or by synthetically creating new ones [199].

Discussion
Many of the paradigms behind spatial audio are deeply rooted in physics, whereas the ultimate goal of spatial audio is evoking a given sensation in the listener. This sensation may be evoked through pure physical accuracy, i.e., if one intends to reproduce a performance in a concert hall, then the perfect reconstruction of the sound field of the concert hall at the listening location is guaranteed to provide the best possible perceptual result. But it has long been clear that such physical accuracy would require immense resources, if it is achievable at all [22,200]. It has been shown in different situations that authenticity, i.e., a reproduced scene being indistinguishable from the captured original scene, can be achieved even when physical accuracy is relatively low [201][202][203]. The interpretation of the ear signals by the human auditory system seems to lead to the same perceptual result for different input signals in certain situations. This psychoacoustic route is the one that virtually all methods in the processing stage have taken. Some concepts applied in the capture and reproduction stages also rely, explicitly or implicitly, on given psychoacoustic properties of the human hearing system [187].
The capture and reproduction stages rely heavily on the underlying physics of the problem, such as the relation between the signals impinging on a set of microphones or the spatial structure of a sound field created by an array of loudspeakers. These physical mechanisms are well understood, and powerful mathematical tools exist that describe them in a compact and accurate way. The non-ML data-based approaches in the processing stage use perceptually motivated representations of acoustic scenes to accurately reproduce the relevant spatial cues while preserving the highest audio signal fidelity. Many of these data-based techniques, described, e.g., in Section 3.2, including parametric processing, up-mixing, spatial audio coding, and viewpoint translation, successfully achieve their designated goals. As a result, these methods continue to play a dominant role in present audio and multimedia applications.
The recent irruption of DL has opened new opportunities for spatial audio, offering enticing alternatives to classical signal processing. Major breakthroughs in DL have taken place in the context of image processing, and these advances have been quickly adopted in closely related speech and audio tasks. For instance, image segmentation partitions a digital image into meaningful segments such as objects or contours by assigning a corresponding label to every pixel. Acoustic source separation can be considered an analogous task in which source labels are assigned to each time-frequency bin in a 2D time-frequency representation of an audio channel. Similarly, image classification in vision can be considered analogous to acoustic scene classification in audio, since both tasks assign a class label to the input signal representation; the analogy is clearly evident when a 2D time-frequency audio spectrogram is treated as an input image.
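The analogy between image segmentation and source separation can be made concrete with an ideal binary mask over a time-frequency representation. In the toy example below (synthetic, non-overlapping "spectrograms"), labeling each bin by its dominant source suffices to recover both sources exactly; a DNN would have to predict such a mask from the mixture alone:

```python
import numpy as np

# Two synthetic magnitude "spectrograms" (frequency x time), each source
# occupying different time-frequency bins.
S1 = np.zeros((8, 10))
S1[1:3, :] = 1.0   # low-frequency source
S2 = np.zeros((8, 10))
S2[5:7, :] = 0.8   # high-frequency source
mixture = S1 + S2

# "Segmentation" of the T-F plane: label each bin by its dominant source
# (an ideal binary mask; in practice a DNN predicts an approximation of
# this mask from the mixture).
mask = (S1 > S2).astype(float)

est1 = mask * mixture
est2 = (1.0 - mask) * mixture
print(np.allclose(est1, S1), np.allclose(est2, S2))  # True True
```

Real sources overlap in time and frequency, so soft (ratio) masks are typically used, but the pixel-labeling analogy is the same.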
Due to these apparent similarities, DL approaches have quickly become very successful in such audio tasks, often outperforming classical signal processing methods. ML has also swiftly found its way into the classification of audio content, e.g., in background-foreground or sound event classification discussed in Section 3.2.1, as well as into other tasks well suited to ML such as the HRTF personalization described in Section 4.2. In DL-driven personalized binaural audio reproduction, a listener can benefit from individualized HRTFs that are synthesized based on the user's anthropometric features.
We can also observe a rise in the popularity of ML for modeling psychoacoustic phenomena. The nonlinear nature of the physiology and psychology underlying human hearing mechanisms makes ML highly suitable for such tasks. Notable progress has been made, for example, in designing learnable low-level audio features as alternatives to the well-known filterbanks [204][205][206]. Given the availability of large amounts of audio content and the scarcity of explainable models of the complex human auditory system, we can expect DL to play an ever-increasing role in psychoacoustics in the near future.
Furthermore, multi-modal DL enables spatial audio processing that was either very difficult or outright impossible for non-ML-based approaches. One good example is the audio-visual processing covered in Section 3.3, in which multichannel audio is generated from a mono audio signal based on spatial information drawn from the visual content. Since audio-visual dependencies are not straightforward to model mathematically, DL unfolds its potential in finding such complex interdependencies, leading to the estimation of, often abstract, representations of spatial audio-visual information.
ML-based approaches have not yet reached a point at which they indisputably surpass classical methods in all types of spatial audio processing. One likely reason for the scarcity of powerful DL-based end-to-end models for spatial audio processing is the difficulty of jointly controlling the timbre, perceptual spatial cues, and audio signal fidelity, as required in high-end applications. Physical accuracy is a criterion that is straightforward to define on a signal level, for example, by means of the squared error. Criteria for achieving a given psychoacoustic result are incomparably more difficult to define because the relation between the signals at the ears of a listener and the resulting perception is known mostly only for relatively simple scenarios [135,207]. Models for a large range of hearing mechanisms were formulated in [208] from a machine perspective to facilitate their integration into a machine learning framework. So far, only applications outside the domain of spatial audio, such as music information retrieval, have been realized. A proof-of-concept for ML-based assessment of the quality of general (non-spatial) audio was presented in [209], but it cannot be generalized to spatial audio. An initial attempt at predicting the perceptual impairment due to system errors in spherical microphone array auralizations using ML, for a narrow scope of signals, was presented in [210]. Consequently, it remains a challenge to formulate a single differentiable loss function for neural network training that guarantees that all critically relevant aspects of spatial hearing are represented with the right balance.
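To make the difficulty concrete, a composite loss would have to weigh signal fidelity against spatial cues, among other aspects. The sketch below combines a magnitude-spectrum term with an interaural level difference (ILD) term; the terms, the weights alpha and beta, and the omission of further cues (e.g., interaural time differences) are illustrative assumptions, not an established loss:

```python
import numpy as np

def spectral_loss(est, ref):
    """MSE between magnitude spectra (timbre/fidelity term)."""
    return float(np.mean((np.abs(est) - np.abs(ref)) ** 2))

def ild_loss(est_l, est_r, ref_l, ref_r, eps=1e-8):
    """Mismatch of the interaural level difference (a spatial-cue term)."""
    ild_est = 10 * np.log10((np.abs(est_l)**2 + eps) / (np.abs(est_r)**2 + eps))
    ild_ref = 10 * np.log10((np.abs(ref_l)**2 + eps) / (np.abs(ref_r)**2 + eps))
    return float(np.mean((ild_est - ild_ref) ** 2))

def combined_loss(est_l, est_r, ref_l, ref_r, alpha=1.0, beta=0.1):
    """Illustrative composite loss: fidelity plus one spatial cue."""
    return (alpha * (spectral_loss(est_l, ref_l) + spectral_loss(est_r, ref_r))
            + beta * ild_loss(est_l, est_r, ref_l, ref_r))

# Identical binaural spectra give zero loss; any deviation is penalized.
ref_l = np.array([1.0, 0.5, 0.2])
ref_r = np.array([0.8, 0.4, 0.1])
print(combined_loss(ref_l, ref_r, ref_l, ref_r))  # 0.0
```

The open question raised above is precisely how to choose such terms and weights so that the loss tracks perceived quality rather than an arbitrary trade-off.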
Another limiting factor in developing DL-driven spatial audio methods is the lack of datasets with multichannel audio content in amounts sufficiently large to facilitate the training of DL models. In addition, just as it is currently unclear how proper loss functions should be defined, the ideal form of ground-truth data is also unsettled, which makes data annotation difficult. Until recently, the spatial audio community has not striven to collect and make available multichannel recordings of high fidelity and in large quantity. Note that, in general, audio datasets are significantly larger in data volume than, for example, datasets for image processing. For instance, the large-scale audio-visual dataset of human speech known as VoxCeleb2 [211], which is popular in speaker recognition, contains over one million speech utterances of 3-20 s in length, with an overall duration of over 2000 h of 16-bit single-channel audio recorded at a sampling frequency of 16 kHz. This amounts to around 78 GB of data, which is straightforward to handle in practice. In DCASE 2021, the development dataset for the SELD task mentioned in Section 3.2.1 contains around 20 hours of 4-channel audio recordings sampled at 24 kHz, which already amounts to around 13.8 GB of data [107]. In the case of spatial audio in particular, multiple channels need to be stored for each recording, which poses huge requirements on storage capacity as well as on cache memory and computational power during training. This is particularly true for the more difficult tasks that produce multichannel output data rather than a label, for which training can easily last many days even on supercomputers with several GPUs. Most likely for these reasons, only a limited number of datasets with spatial audio, or audio-video, content have been made available.
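The multichannel storage figures above follow from simple PCM arithmetic; for instance, the DCASE 2021 SELD development set figure can be reproduced as:

```python
def raw_audio_bytes(hours, sample_rate, channels, bits_per_sample):
    """Uncompressed PCM storage footprint of an audio dataset in bytes."""
    return hours * 3600 * sample_rate * channels * bits_per_sample // 8

# DCASE 2021 SELD development set: ~20 h, 24 kHz, 4 channels, 16-bit PCM.
dcase_gb = raw_audio_bytes(20, 24000, 4, 16) / 1e9
print(round(dcase_gb, 1))  # -> 13.8 (GB), matching the figure quoted above
```

Each additional channel scales the footprint linearly, which is why spatial (multichannel) datasets grow so quickly compared to mono speech corpora.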
For instance, binaural audio-visual content is collected in [179,180,212], distributed multichannel recordings of multi-speaker conversations are available in [213], and a large number of B-format recordings from YouTube are collected in [178]. More such efforts and data are required to train generic DL spatial audio models. Furthermore, many available datasets are specific to the input and output setups, i.e., they contain multichannel recordings made using a microphone array with a particular geometry or are synthesized for a particular loudspeaker setup. One remedy to this setup-specific limitation could be to store the captured signals and the output signals to be reproduced in one of the commonly accepted representations described in Section 2, such as the ambisonic format. This way, end-to-end models could be trained irrespective of the capture and reproduction setups. The requirement for standardized capture formats was also identified as an important prerequisite for databases for training ML systems on the quality of general (non-spatial) audio content [214].
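As a sketch of such a setup-agnostic representation, a mono source can be encoded into first-order ambisonics given only its direction; the snippet below assumes the ACN channel ordering (W, Y, Z, X) with SN3D normalization:

```python
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """Encode a mono signal into first-order ambisonics (B-format),
    assuming ACN channel order (W, Y, Z, X) and SN3D normalization.
    Angles in radians; azimuth 0 is straight ahead, positive to the left."""
    w = mono                                        # omnidirectional component
    y = mono * np.sin(azimuth) * np.cos(elevation)  # left-right dipole
    z = mono * np.sin(elevation)                    # up-down dipole
    x = mono * np.cos(azimuth) * np.cos(elevation)  # front-back dipole
    return np.stack([w, y, z, x])

# A source directly in front (azimuth 0, elevation 0) excites only W and X.
s = np.sin(2 * np.pi * 440 * np.arange(480) / 48000)  # 10 ms of a 440 Hz tone
foa = encode_foa(s, azimuth=0.0, elevation=0.0)
print(foa.shape)  # (4, 480)
```

Because the encoded channels depend only on the source direction, the same four-channel stream can later be decoded to headphones or to any loudspeaker layout, which is the setup-independence argued for above.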
The degree to which ML-based spatial audio methods will be successful in the future will largely depend on how the audio research community addresses the issues discussed above and on the emergence of new applications possible only with DL. For the time being, we are still waiting for further disruptive DL-based methods that deliver spatial audio processing that is out of reach today.

Conclusions
We presented an overview of data-based methods in the domain of spatial audio with a special focus on recent approaches that make use of ML. We categorized the methods based on their function in the general signal processing pipeline, which consists of capture, processing, and reproduction.
The capture stage is dominated by solutions that do not employ ML. The same holds for the reproduction stage, in which linear methods are most common, apart from the task of individualizing the user's head-related transfer functions.
The processing stage is where most of the ML-based solutions are found. For many tasks in this stage, both ML-based and non-ML-based methods are available. Unlike other domains of data-based processing, such as visual object recognition, where the performance of ML-based solutions is significantly superior to that of non-ML-based ones, such trends have not crystallized in the field of spatial audio. Tasks like source separation, sound event detection, or the extraction of spatial information from accompanying video have been highly impacted by DL, leading to outstanding results even with single-channel recordings. Other tasks with a stronger focus on extracting or analyzing the spatial properties of sound have been disrupted much less by ML methods.
Possible causes for this are the lack of robust success criteria (i.e., ways to measure that the processing was useful) and, partly as a consequence, the amount of available training and test data, which is lower by orders of magnitude than in the classical application domains of ML.