Three-dimensional (3D) audio technologies are booming with the success of 3D video technology. The surge in the number of audio channels produces data volumes that exceed what transmission bandwidth and storage media can accommodate, so signal compression for 3D audio systems has become an important task. This paper investigates the conventional mid/side (M/S) coding method and discusses the signal correlation property of three-dimensional multichannel systems. Then, based on the channel triple, a three-channel dependent M/S coding (3D-M/S) method is proposed to reduce interchannel redundancy, and the corresponding transform matrices are presented. Furthermore, a framework is proposed to enable 3D-M/S to compress any number of audio channels. Finally, the masking threshold of the perceptual audio core codec is modified, which guarantees that the final coding noise meets the perceptual threshold constraint of the original channel signals. Objective and subjective tests with panning signals indicate an increase in coding efficiency compared to independent channel coding and a more moderate complexity increase than a PCA method.

Introduction

Recently, 3D audio has attracted growing attention and developed quickly, following the booming market for 3D movies. Many 3D audio technologies are now being introduced into audio applications to replace surround sound systems and provide superior sound localization and an immersive feeling. Wave field synthesis (WFS), Ambisonics and vector-based amplitude panning (VBAP) are the three most well-developed technologies. WFS generally follows the Huygens principle to reconstruct the original sound field [1]. Research institutions such as Fraunhofer IDMT and IRCAM in France have studied WFS intensively and are attempting to bring it into theaters and live concert transmission. Ambisonics uses spherical harmonic functions to record the sound field and drive the loudspeakers; it requires a rigorous loudspeaker configuration and gives good sound field reconstruction at the center [2]. VBAP follows the tangent law in three-dimensional space, using three adjacent loudspeakers to form a sound vector. Because of its simplicity, VBAP is the most common algorithm for 3D signal panning [3]. A 3D system such as the 22.2 multichannel system proposed by NHK in Japan uses VBAP to generate 3D sound images [4]. The 22.2 multichannel system is also included in the developing MPEG-H standard for rendering 3D audio scenes.

There is a clear trend that 3D audio technology will gradually mature and replace stereo and surround sound [5]. However, a common feature of 3D audio technologies is the large number of sound channels. For instance, a WFS system typically contains dozens or even hundreds of audio channels. The 22.2 system has three layers and 24 audio channels. Although an Ambisonics system can have a flexible order and channel number, it usually uses dozens of channels because fewer channels cause quality deterioration. Compared with two-channel stereo and 5.1 surround sound, the increase in audio channels causes a dramatic increase in 3D audio data. A report from Fraunhofer shows that 37 Mbps is needed for live transmission of WFS [6]. For the 22.2 multichannel system, the uncompressed data rate also reaches 28 Mbps [7]. Currently, storage media and transmission bandwidth can hardly accommodate such data sizes, so the compression of 3D multichannel audio signals has become an important subject.

The well-known Spatial Audio Coding (SAC) models the signals as virtual sound sources in the frequency domain, extracts the interchannel level difference (ICLD), interchannel time difference (ICTD) and interchannel coherence (IC) to represent the direction and width of the virtual sound source, and downmixes the channels to reduce redundancy [8–11]. The idea of using downmixed sources with spatial parameters was later developed into Spatial Audio Object Coding (SAOC) for efficiently coding multiple input spatial audio objects with interactive and personalized rendering ability [12]. Recently, other investigations have been published to increase the compression efficiency for multichannel 3D audio signals. In 2007, Goodwin and Jot proposed a PCA-based multichannel compression framework for parametric coding [13], which can enhance specific audio scenarios and provide robust spatial audio coding. In 2008, Cheng et al. proposed the Spatially Squeezed Surround Audio Coding (S^{3}AC) for parametrically compressing the Ambisonics signal [14]. In 2009, Hellerud used an interchannel prediction-based coding method to remove the redundancy between Ambisonics channels [15], which has low algorithmic delay but high computational complexity. Tzagkarakis used a sinusoidal model and linear prediction to parameterize the separate spot microphone channels and then downmixed the residual signals. This coding scheme is more suitable for multichannel signals with weak correlation, and such scenarios require independent channel decoding [16]. In 2010, Pinto et al. utilized a space/time-frequency transform to decompose the WFS signals into plane waves and evanescent waves. By discarding the evanescent waves and perceptually coding the plane wave signals, a coding gain is obtained. Coding efficiency increases with the number of audio channels, because the transform decomposition accuracy depends on the spatial resolution, which is the number of WFS channels [17, 18]. In 2013, Cheng further proposed a Spatial Localization Quantization Point (SLQP) codec using localization cues to compress 3D audio signals [19, 20]. Since SLQP extracts the spatial cues and downmixes the channels, it achieved a high compression ratio for SLQP signals and other 3D audio systems.

To increase the coding efficiency at high bitrates, some non-parametric coding schemes were developed. Yang proposed a scalable multichannel codec that uses the Karhunen-Loeve Transform (KLT) to remove the interchannel redundancy [21]. Mid/side (M/S) coding was introduced by J.D. Johnston [22] and adopted by many audio codecs such as MPEG2-Layer III and MPEG4-AAC. In 2003, Liu et al. proposed a bit allocation method for M/S coding based on allocation entropy, which increases the objective quality by allocating more bits to the high-energy channel in M/S coding [23]. In 2008, Derrien et al. proposed an error model for M/S coding. The error model enables tuning of the quantizer used for channels M and S at the encoder with respect to the distortion of L and R at the decoder side, which increased the coding efficiency of M/S without much added complexity [24]. Since M/S coding works as the simplest interchannel prediction, Krueger generalized it by using linear prediction instead of the M/S transformation and a residual signal instead of the difference signal [25]. In 2012, Schafer extended Krueger's method to the multichannel case with low algorithmic delay [26]. Recently, M/S coding was combined with parametric stereo coding at low bitrates in the MPEG-USAC standard [27] by predicting the residual channel using spatial cue-based parameters, which aimed to bridge the stereo quality gap between low bitrates and high bitrates [28]. M/S coding also works alone at high bitrates, using a novel complex prediction to achieve better performance [29].

The above model-based and parametric codecs can offer a considerable compression ratio. However, those methods need to know the direction of the real audio source to do object-oriented coding, or need to estimate a virtual source direction to do downmixing and parametric coding. In practice, such as in live recording, it is very difficult to obtain the real audio source direction. Downmixing and parametric coding will cause interchannel interference such as 'tone leakage' artifacts when channel signals differ greatly [30]. Furthermore, the computational complexity of an audio codec should be acceptable while maintaining enough coding efficiency, and parametric coding can only achieve a performance gain at low bitrates. This paper focuses on the situation in which only the multichannel signals of the audio sources are recorded, not their directions. We consider high-quality/high-bitrate applications and focus on the non-parametric coding method. Section 'M/S coding in 3D space' describes the conventional M/S coding process and presents a three-channel dependent M/S coding (3D-M/S) method. The main idea is to extend M/S coding to three-dimensional audio by designing a new transform matrix, which removes the redundancy of three channels in 3D space rather than just two channels in the horizontal plane. Section '3D-M/S psychoacoustic model' discusses the psychoacoustic model for transformed 3D-M/S signals. Section 'Framework for general channel configuration' specifies a new framework that enables 3D-M/S to be applied to a more general channel configuration. Section 'Experiment' compares 3D-M/S coding with PCA coding and independent channel coding in terms of compression ratio and computational complexity. Section 'Conclusion' summarizes and concludes this paper.

M/S coding in 3D space

Conventional M/S coding

M/S coding is based on the fact that most stereo channels are strongly correlated. By simply transforming the stereo channel pair into the M/S domain, the core codec encodes a summation channel and a difference channel instead of the original channels. The difference channel has much lower energy than the original channels, so more frequency bins are quantized to similar and smaller values. The entropy of the resulting quantized time-frequency samples is therefore lower, and lossless Huffman coding achieves a higher compression rate. To illustrate how M/S coding works, a generalized sine stereo model is used. Here, a stereo pair is denoted as a vector V_{0}= (C_{L},C_{R}) where

S is the virtual audio source, θ is the stereo panning angle and \theta \in \left[0,\frac{\pi}{2}\right]. The M/S coding can be denoted as two transform matrices M_{0} and M_{1}; the summation vector of M_{1} is denoted as {\mathbf{V}}_{\mathbf{1}}=\left(\frac{\sqrt{2}}{2},\frac{\sqrt{2}}{2}\right)

In practice, subband energy or masking threshold will be used instead of C_{L},C_{R} in V_{0}. The M/S mode is used only when the two channels are sufficiently correlated, for example when their energy difference is less than a threshold Thr=2 dB as shown in (4). This avoids frequent transform switching and recalculation of the masking threshold [22].
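The transform and the energy-based switch described above can be illustrated with a minimal sketch. It assumes a sine panning law (C_L = S sin θ, C_R = S cos θ, consistent with C_L/C_R = tan θ), an orthonormal sum/difference matrix standing in for M_{1}, and full-band energies in place of subband energies. The function names are hypothetical and not taken from any codec.

```python
import numpy as np

# Orthonormal sum/difference matrix; its first row is the summation vector V_1.
M1 = np.array([[1.0, 1.0],
               [1.0, -1.0]]) / np.sqrt(2.0)

def pan_stereo(s, theta):
    """Assumed sine panning: C_L = s*sin(theta), C_R = s*cos(theta)."""
    return np.vstack([s * np.sin(theta), s * np.cos(theta)])

def level_difference_db(ch_a, ch_b, eps=1e-12):
    """Absolute energy difference between two channels in dB."""
    e_a = np.sum(ch_a ** 2) + eps
    e_b = np.sum(ch_b ** 2) + eps
    return abs(10.0 * np.log10(e_a / e_b))

def ms_encode(stereo, thr_db=2.0):
    """Return (use_ms, channels); transform only when the pair is similar enough."""
    use_ms = level_difference_db(stereo[0], stereo[1]) < thr_db
    return use_ms, (M1 @ stereo if use_ms else stereo)

rng = np.random.default_rng(0)
source = rng.standard_normal(1024)

centered = pan_stereo(source, np.pi / 4)    # source midway between the channels
off_center = pan_stereo(source, np.pi / 12) # source close to one channel

use_ms_center, coded_center = ms_encode(centered)
use_ms_side, coded_side = ms_encode(off_center)
side_energy = float(np.sum(coded_center[1] ** 2))
```

For the centered source the difference channel is essentially empty, while for the off-center source the level difference exceeds 2 dB and the channels are passed through unchanged.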

To discuss the switching condition more conveniently, the M/S switching condition will be expressed using the distance between vectors V_{0} and V_{1}. Given \frac{{C}_{L}}{{C}_{R}}=\text{tan}\theta, (4) can be denoted as

The left side of Figure 1 shows the M/S switching condition (5) in vector space, and Figure 2 illustrates a stereo signal and the corresponding transformed sum and difference signals. Equation (3) and the two figures show that when the input signal vector V_{0} is close to V_{1}, where \text{cos}\theta \approx \frac{1}{\sqrt{2}} and \theta \approx \frac{\pi}{4}, the switching condition (4) will be satisfied. The difference signal has a smaller amplitude than the original signals, and M/S coding will be used. Since θ is the angle of V_{0}, switching condition (4) can be represented by the inner product between signal vector V_{0} and summation vector V_{1}

This is an equivalent expression to the energy condition. It indicates that an M/S transform matrix will be used only when the input signal vector is close enough to the summation vector of that matrix. This idea will be helpful when later discussing 3D-M/S coding, where more than one transform matrix exists. Here, Thr_{v} is the corresponding switching threshold of Thr in vectorial distance and

For diffuse audio sources like ambient sound, the multichannel signals are weakly correlated. For directional audio sources, the signals are highly correlated, which makes it possible to reduce the interchannel redundancy. To maintain the stability of the sound image in stereo and surround audio systems, the virtual source is always panned or recorded by the two most adjacent channels. So two adjacent channels always have the maximum similarity, and M/S coding and parametric coding are performed based on a two-channel unit. But for Ambisonics, VBAP and 22.2 multichannel systems, sound channels are spherically configured in 3D space as shown in Figure 3. In a VBAP system, three adjacent channels form a directional sound image and have the maximum correlation. In other 3D systems such as Ambisonics, three channels cover a basic area of 3D space, and three adjacent channels also have the most similar signals [31]. If the multichannel signals are still grouped and compressed on a channel pair basis, as in the recent work of Ando et al. on coding 22.2 multichannel signals with the AAC codec [32], there will be more than one adjacent channel for each channel. Selecting the channel pair requires extra computation and may still lead to a mismatch. For example, in Figure 3, channels C_{1}, C_{2} and C_{3} form a virtual source. If C_{1} and C_{2} were grouped for parametric or M/S coding, C_{3} would be grouped with another, less correlated channel, which decreases the coding performance. And if the signals are grouped dynamically, a complex correlation analysis algorithm is needed to analyze the six adjacent channels of each channel. Because the channel pair grouping is based on frequency subbands, it will not only increase the codec complexity dramatically but will also be unable to reduce the overall redundancy that exists in more than two channels. In brief, the conventional channel pair unit should be redesigned for 3D audio systems.

Considering that the basic unit of any 3D surface is a triangle, a spherical multichannel configuration can be easily unfolded into a plane triangle structure as shown in Figure 3. The above 3D systems also use three or more channels to produce 3D audio effects, so interchannel signal redundancy exists across three or more channels. Hence, the channel triple, rather than the channel pair of conventional coding schemes, should be the basic unit for removing interchannel redundancy. More specifically, in a VBAP system, the input signal V_{0}= (C_{1},C_{2},C_{3}) is calculated following the tangent model in 3D space as

where \theta ,\phi \in \left[0,\frac{\pi}{2}\right], which determine the gain factor of the three channels.

There are infinitely many positions where a virtual source can be located. But for the VBAP model, those possibilities can be reduced to three basic situations. First, the virtual source is located near the position of one channel. This situation corresponds to the source being panned mainly using one channel, or to one channel forming a virtual source with two other channels outside the current three loudspeakers. This situation is similar to stereo audio with only one active channel, so no transform is performed and M_{0} will be used. Second, the virtual source is located between two channels. This situation corresponds to the source being panned mainly using two channels, or to two channels forming a virtual source with one other channel outside the current three channels. This situation is similar to conventional stereo audio, and M/S coding can be applied. However, the M/S transform matrix must be modified to adapt to the three-channel condition, as expressed in Equation 9. Third, the source is panned using all three channels. This is a new situation that never occurs in stereo audio. To remove the interchannel redundancy, a new transform matrix M_{4} is designed following the rule of conventional M/S coding. The first vector is the summation of the three channels, and the remaining vectors are orthogonal to the first vector. To guarantee the conservation of energy after transformation, unit vectors are used. This matrix realizes the sum-difference processing for 3D channels and guarantees that when the three channel signals are nearly the same, two channels primarily contain the difference signal.
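The construction rule above (a unit summation row plus unit rows orthogonal to it) does not determine the matrices uniquely. The sketch below shows one possible choice that satisfies the rule, together with one possible three-channel version of the stereo M/S matrix for the C_{1}, C_{2} pair. Both matrices and their names are assumptions for illustration and are not necessarily the matrices used in the paper.

```python
import numpy as np

# Hypothetical orthonormal sum/difference matrix for three similar channels.
M4_SKETCH = np.array([
    [1.0, 1.0, 1.0],    # summation row, parallel to V_4
    [1.0, -1.0, 0.0],   # difference between C_1 and C_2
    [1.0, 1.0, -2.0],   # difference between C_3 and the C_1, C_2 pair
])
M4_SKETCH /= np.linalg.norm(M4_SKETCH, axis=1, keepdims=True)

# Hypothetical three-channel version of the stereo M/S matrix (C_1, C_2 pair);
# the third channel is passed through unchanged.
M3_SKETCH = np.array([
    [1.0, 1.0, 0.0],
    [1.0, -1.0, 0.0],
    [0.0, 0.0, np.sqrt(2.0)],
]) / np.sqrt(2.0)

def is_orthonormal(matrix):
    """True when the rows are mutually orthogonal unit vectors."""
    return np.allclose(matrix @ matrix.T, np.eye(matrix.shape[0]))

# Three identical channel samples: all energy should move to the sum channel.
frame = np.array([0.5, 0.5, 0.5])
transformed = M4_SKETCH @ frame
```

Because the rows are orthonormal, the transform preserves energy and its inverse is simply the transpose, which keeps the decoder cheap.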

An example is shown in Figure 2. It can be observed that when the source is close to the center of two or three channels, the corresponding matrix produces difference signals with a lower dynamic range than the original channel signals. Under a given masking threshold, far fewer bits are required for quantizing the difference signals, which yields the coding gain.

Next, we discuss the switching condition for 3D-M/S as shown on the right side of Figure 1; the transformed mid/second/third channels are shown in Table 1. First, if all three channel signals are almost the same, which means the input vector is close to the spherical triangle center V_{4} (\text{cot}\theta \approx \frac{\sqrt{2}}{2} and \phi \approx \frac{\pi}{4}), matrix M_{4} will be chosen to give two difference channels. Second, if only two channels satisfy the conventional M/S switching condition and the projection of the input vector is close to V_{1} (cot θ ≈ sin φ), V_{2} (cot θ ≈ cos φ) or V_{3} (\phi \approx \frac{\pi}{4}), 3D-M/S will select the matrix whose summation vector is nearest to the input vector. The distance is measured by the vector distance following expression (6). By comparison, conventional M/S coding works in two-dimensional space and has two switching areas. The 3D-M/S switching condition is an extension of M/S coding in which the input vector lies in three-dimensional space and there are five switching areas. Following the vector distance switching condition, the switching rule of 3D-M/S can be denoted as

where i,j∈{1,2,3}, V_{01}= (0,C_{2},C_{3}), V_{02}= (C_{1},0,C_{3}), V_{03}= (C_{1},C_{2},0) are the two-channel projections of input vector V_{0}. {\mathbf{V}}_{\mathbf{1}}=\left(0,\frac{\sqrt{2}}{2},\frac{\sqrt{2}}{2}\right), {\mathbf{V}}_{\mathbf{2}}=\left(\frac{\sqrt{2}}{2},0,\frac{\sqrt{2}}{2}\right), {\mathbf{V}}_{\mathbf{3}}=\left(\frac{\sqrt{2}}{2},\frac{\sqrt{2}}{2},0\right), {\mathbf{V}}_{\mathbf{4}}=\left(\frac{\sqrt{3}}{3},\frac{\sqrt{3}}{3},\frac{\sqrt{3}}{3}\right) are the summation vectors of each transform matrix.
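A rough sketch of a nearest-summation-vector switch, in the spirit of the rule described above, is given below. It is not the paper's exact rule. The threshold value `thr_v=0.99`, the order of the tests and the helper names are assumptions. The gain model in `vbap_gains` (C_1 = sin θ cos φ, C_2 = sin θ sin φ, C_3 = cos θ) is also an assumption, chosen because it reproduces the conditions quoted in the text, such as cot θ ≈ sin φ for V_{1}.

```python
import numpy as np

SQ2, SQ3 = np.sqrt(2.0) / 2.0, np.sqrt(3.0) / 3.0
SUM_VECTORS = {
    1: np.array([0.0, SQ2, SQ2]),
    2: np.array([SQ2, 0.0, SQ2]),
    3: np.array([SQ2, SQ2, 0.0]),
    4: np.array([SQ3, SQ3, SQ3]),
}

def closeness(vector, reference):
    """Cosine of the angle between a vector and a unit reference vector."""
    norm = np.linalg.norm(vector)
    return 0.0 if norm == 0.0 else float(vector @ reference / norm)

def select_matrix(v0, thr_v=0.99):
    """Return the index (0..4) of the matrix chosen for input vector v0."""
    v0 = np.asarray(v0, dtype=float)
    # All three channels similar: use the three-channel matrix M_4.
    if closeness(v0, SUM_VECTORS[4]) >= thr_v:
        return 4
    # Otherwise test the two-channel projections V_01, V_02, V_03.
    best, best_score = 0, thr_v
    for i in (1, 2, 3):
        projection = v0.copy()
        projection[i - 1] = 0.0
        score = closeness(projection, SUM_VECTORS[i])
        if score >= best_score:
            best, best_score = i, score
    return best  # 0 means no transform (M_0)

def vbap_gains(theta, phi):
    """Assumed gain model consistent with the quoted switching conditions."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

center_choice = select_matrix(vbap_gains(np.arctan(np.sqrt(2.0)), np.pi / 4))
edge_choice = select_matrix(vbap_gains(13 * np.pi / 36, 5 * np.pi / 32))
single_choice = select_matrix([1.0, 0.0, 0.0])
```

In a codec the input vector would be built from subband energies or masking thresholds rather than from panning gains, as noted for the stereo case.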

3D-M/S psychoacoustic model

The perceptual threshold is the maximum amount of noise that can be added to an audio signal without degrading its subjective quality. 3D-M/S coding transforms the original signals into summation and difference signals and then sends these to the core codec. The perceptual thresholds calculated by the core codec apply to the transformed signals, which cannot guarantee that the noise in the original signals is imperceptible. So after transformation, the perceptual threshold calculation in the core encoder needs to be redesigned. An example is conventional M/S stereo coding, where the masking threshold model for the main channel and side channel is revised to avoid perceptible noise in the reconstructed signals [23]. To derive the masking threshold for 3D-M/S signals, we follow the derivation in [23] and extend it to the channel triple case. Consider the transformed signals ({C}_{M},{C}_{S},{C}_{T})={\mathbf{M}}_{\mathbf{4}}{\mathbf{V}}_{\mathbf{0}}^{T} as

where C_{M} is the first signal after transformation and C_{S} and C_{T} are the second and third signals, respectively. After core codec quantization, independent noise is introduced into the three signals, denoted as N_{M}, N_{S} and N_{T}. So at the decoder side:

where \hat{C}_{1}, \hat{C}_{2} and \hat{C}_{3} are the reconstructed signals. Compared with the original signals shown in (11), the noise energy for the original signals can be obtained

where {\sigma}_{{C}_{1}-\hat{C}_{1}}^{2}, {\sigma}_{{C}_{2}-\hat{C}_{2}}^{2} and {\sigma}_{{C}_{3}-\hat{C}_{3}}^{2} are the expected noise energies for the original signals. {T}_{{C}_{1}}, {T}_{{C}_{2}} and {T}_{{C}_{3}} are the masking thresholds calculated for the original signals. The sufficient conditions that guarantee the expected noise energy will not exceed the masking threshold are {N}_{M}^{2},{N}_{S}^{2},{N}_{T}^{2}\le \text{min}({T}_{{C}_{1}},{T}_{{C}_{2}},{T}_{{C}_{3}}). This means the noise energy for any transformed signal must be less than the minimum threshold of the original signals. So the masking thresholds for the transformed signals can be derived from the masking thresholds of the original signals as

The same results can be deduced for other matrices.
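The sufficient condition above can be checked numerically. The sketch below assumes an orthonormal transform (the matrix shown is a hypothetical stand-in for M_{4}), independent quantization noise in the three transformed channels, and made-up threshold values for two subbands.

```python
import numpy as np

# Hypothetical orthonormal stand-in for M_4 (rows: sum, difference, difference).
M4_SKETCH = np.array([
    [1.0, 1.0, 1.0],
    [1.0, -1.0, 0.0],
    [1.0, 1.0, -2.0],
])
M4_SKETCH /= np.linalg.norm(M4_SKETCH, axis=1, keepdims=True)

def transformed_thresholds(t_c1, t_c2, t_c3):
    """Per-subband threshold for C_M, C_S and C_T: the minimum of the originals."""
    t_min = np.minimum(np.minimum(t_c1, t_c2), t_c3)
    return np.vstack([t_min, t_min, t_min])

def reconstructed_noise_energy(matrix, noise_energy):
    """Expected noise energy in each reconstructed channel.

    The inverse of an orthonormal matrix is its transpose, so the error in
    channel i is sum_j matrix[j, i] * N_j; for independent noise the energies
    add with squared weights.
    """
    return (matrix.T ** 2) @ noise_energy

# Made-up masking thresholds of the original channels for two subbands.
thresholds = np.array([[4.0, 1.0],
                       [2.5, 3.0],
                       [3.0, 2.0]])

t_mst = transformed_thresholds(*thresholds)
# Worst case: each transformed channel is quantized exactly at its threshold.
noise_out = reconstructed_noise_energy(M4_SKETCH, t_mst)
```

Since the squared weights in each column of an orthonormal matrix sum to one, the reconstructed noise energy is a weighted average of the transformed-channel noise energies and cannot exceed their common bound.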

Framework for general channel configuration

3D-M/S only works on three channels, whereas practical 3D audio systems contain dozens of channels. A general channel configuration is shown in Figure 4, where each channel C corresponds to a loudspeaker. Here, a framework based on 3D-M/S coding is proposed as shown in Figure 5. Because all spatially placed loudspeakers can be decomposed into basic triangle units, this structure enables 3D-M/S coding to work for arbitrary channel configurations. The framework processes the audio channels triangle by triangle until all channels are coded. C_{M} is the summation channel and C_{S} and C_{T} are the second and third channels, respectively. Every 3D-M/S unit shares two channels with the previous unit, and only one new channel is added. So it only needs to compress the output channel that contains the signal of the new channel. For all matrices M_{0}, M_{1}, M_{2}, M_{3} and M_{4}, C_{T} is the third channel after the 3D-M/S transform. Because every unit outputs only the one channel that contains a new input channel, the whole coding framework keeps the number of channels exactly the same as in the original input signals. And because the output channel contains either the difference signal or the original signal, a coding gain can be obtained. The original signals can be recovered at the decoder side by multiplying by the 3D-M/S inverse transform matrix subband by subband. This framework is also suitable for other methods. For example, when 3D-M/S is replaced with PCA, the codec can achieve better interchannel redundancy removal.
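One way to picture the triangle-by-triangle processing is the chain below. It is a simplified sketch under several assumptions: a single fixed orthonormal matrix (a hypothetical stand-in for M_{4}) is used for every unit, mode switching and quantization are omitted, and the channels are assumed to be ordered so that every three consecutive channels form an adjacent triangle.

```python
import numpy as np

# Hypothetical orthonormal stand-in for M_4; its third row involves the new channel.
M4_SKETCH = np.array([
    [1.0, 1.0, 1.0],
    [1.0, -1.0, 0.0],
    [1.0, 1.0, -2.0],
])
M4_SKETCH /= np.linalg.norm(M4_SKETCH, axis=1, keepdims=True)

def encode_chain(channels):
    """Code the first triple fully, then one output (C_T) per new channel."""
    first = M4_SKETCH @ channels[:3]
    outputs = [first[0], first[1], first[2]]
    for k in range(3, len(channels)):
        unit = M4_SKETCH @ channels[k - 2:k + 1]  # shares two channels with the previous unit
        outputs.append(unit[2])                   # C_T carries the new channel
    return np.vstack(outputs)

def decode_chain(coded):
    """Invert the chain using the two already decoded channels of each unit."""
    decoded = list(M4_SKETCH.T @ coded[:3])
    for k in range(3, len(coded)):
        a, b = decoded[k - 2], decoded[k - 1]
        # C_T = (a + b - 2*c) / sqrt(6), solved for the new channel c.
        decoded.append((a + b - np.sqrt(6.0) * coded[k]) / 2.0)
    return np.vstack(decoded)

rng = np.random.default_rng(1)
five_channels = rng.standard_normal((5, 256))
coded = encode_chain(five_channels)
restored = decode_chain(coded)
```

The coded stream has as many channels as the input, which matches the property described above. With quantization in the loop, the handling of the shared channels would need more care than this sketch shows.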

Experiment

The experiment used five channels (C_{1}, C_{2}, C_{3}, C_{4}, C_{5}) in the spherical 22.2 multichannel configuration as shown in Figure 4. Considering that PCA is theoretically the best decorrelation transform and that independent channel coding is widely used for 22.2 multichannel compression, the experiment compared the proposed 3D-M/S method with PCA and independent channel coding in bitrate, complexity and objective quality. Three MPEG test sequences (es01 voice signal, sc03 symphony music signal, si02 castanets transient signal, mono, 48-kHz sampling) were used as the moving virtual sources following the VBAP rule, and four sequences (si03, si01, sc01, es02) were used as the discrete fixed-position virtual sources. The virtual sources and their respective azimuth and altitude panning angles were generated on a per-frame basis. Here, only point virtual sources were used to test the best performance of the three methods, as subband signals can be regarded as point sources in subband coding when the bandwidths are small enough. Signals with decorrelated elements are beyond the scope of the VBAP model and will decrease the coding performance, because their difference signals retain high energy, which depends on the correlation and the energy of the decorrelated elements. Uncorrelated signals with independent audio content are tested at the end.

Three basic virtual sound movements were used to cover the basic possible virtual source locations. The three movements in a triangle are movement 1 from point to point (virtual source es01), movement 2 from point to edge (virtual source sc03) and movement 3 from edge to edge (virtual source si02), as shown in Figure 4. Different sources were used for different panning patterns because the experiments were designed to compare the three methods under the same condition and then change the condition for all of them simultaneously, to see how the three methods work on different kinds of sources and locations, rather than to compare the performance of one method at different virtual source locations with the same source. Considering the symmetry of the triangle, 10 discrete virtual source positions that equally divide \frac{1}{6} of the triangle were used, as shown in Table 2 and Figure 4: si03 was panned to \left(\theta =\frac{11\pi}{36},\phi =\frac{7\pi}{32}\right); si01 was divided into two sequences and panned to \left(\theta =\frac{13\pi}{36},\phi =\frac{7\pi}{32}\right), \left(\theta =\frac{13\pi}{36},\phi =\frac{5\pi}{32}\right); sc01 was divided into three sequences and panned to \left(\theta =\frac{15\pi}{36},\phi =\frac{7\pi}{32}\right), \left(\theta =\frac{15\pi}{36},\phi =\frac{5\pi}{32}\right) and \left(\theta =\frac{15\pi}{36},\phi =\frac{3\pi}{32}\right); es02 was divided into four sequences and panned to \left(\theta =\frac{17\pi}{36},\phi =\frac{7\pi}{32}\right), \left(\theta =\frac{17\pi}{36},\phi =\frac{5\pi}{32}\right), \left(\theta =\frac{17\pi}{36},\phi =\frac{3\pi}{32}\right) and \left(\theta =\frac{17\pi}{36},\phi =\frac{\pi}{32}\right).

3D-M/S and PCA were applied in each subband in the frequency domain. The three encoders were implemented based on FAAC-1.28, and the decoders were based on FAAD2-2.7. AAC-LC was used as the core codec, and only the long window was enabled for simplicity. To avoid the influence of the dynamic bandwidth setting of FAAC, the experiment fixed the bandwidth at 12 kHz with 35 subbands.

Independent channel coding: Audio signals were sent into the core codec and compressed directly.

3D-M/S: The vector was calculated using the subband energy of three channels from AAC psychoacoustic module with no extra energy computation. Then 3D-M/S matrix switching was performed and 3 bits were used per mode parameter. The transformed signals were sent into the core codec, and the masking threshold was modified accordingly.

PCA: The eigenvectors were calculated for each subband. Subband signals were transformed using eigenvector matrix and then sent into core codec. The covariance matrix was quantized and transmitted to the decoder following a previous KLT-based multichannel audio coding scheme [21], with 4 bits per non-redundant element.
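For comparison, the per-subband PCA step can be sketched as follows. This is a generic eigen-decomposition example assuming zero-mean subband signals; quantization and transmission of the covariance matrix, which the experiment performs, are omitted, and the function name is hypothetical.

```python
import numpy as np

def pca_transform(subband):
    """Decorrelate a (3, n_bins) subband block with its own eigenvectors.

    Returns the basis (rows are principal directions, strongest first)
    and the transformed signals.
    """
    covariance = subband @ subband.T / subband.shape[1]
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # ascending order
    order = np.argsort(eigenvalues)[::-1]
    basis = eigenvectors[:, order].T
    return basis, basis @ subband

# A point source panned to three channels with fixed (made-up) gains.
rng = np.random.default_rng(2)
source = rng.standard_normal(64)
gains = np.array([[0.8], [0.43], [0.42]])
subband = gains * source

basis, components = pca_transform(subband)
energies = np.sum(components ** 2, axis=1)
restored = basis.T @ components
```

For a perfectly correlated point source, all energy collapses into the first component. The decoder needs the basis (or the covariance matrix it is derived from), which is the side information whose bitrate is discussed in the objective evaluation.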

Objective evaluation

Complexity was measured by the running time of each method on a PC (CPU: Intel Core2 Duo P8600 2.53 GHz, RAM: 8 GB). The main application scenario of the proposed method is medium and high bitrates, and the sound pressure at the center listening point is the most important quantity in multichannel audio and the VBAP panning method. Here we compared the sound pressure SNR (between the original and reconstructed signal sound pressure at the center listening point) to measure the objective quality, following the sound pressure equation in [33]

where p (ω) is the sound pressure, G is a proportionality coefficient, k=\frac{\omega}{c} is the wave number and c is the wave speed. r is the distance from the loudspeaker to the listening point, and in the spherical 22.2 multichannel, all loudspeakers have the same distance. And the SNR is calculated by

First, the bitrates of all three methods were set to be nearly identical to compare the performance shown in Table 3. The bitrates were adjusted by modifying the masking threshold in the core codec and iterating two nested loops until the rate constraint was satisfied. Figure 6 shows the SNR curves for each method, where each SNR curve was smoothed to omit the details. The smoothing method is a typical average filter over a length of 20 frames. An overall SNR, obtained by averaging the SNR over all frames, is presented in Table 3 to give an overview. The three SNR curves have a downward trend because the transient signal and the symphony signal have a wider bandwidth than the voice signal and require more bits to achieve a similar SNR. When the virtual source came close to the middle of two channels (between C_{1}, C_{2} around the 200th frame in Mov. 1, between C_{3}, C_{4} for all frames in Mov. 2, between C_{3}, C_{4} around the start frames, between C_{3}, C_{5} around the 100th frame and between C_{4}, C_{5} around the end frames in Mov. 3), 3D-M/S achieves a higher SNR than independent channel coding and comes close to PCA. Moreover, around the 200th frame in Mov. 1 and Mov. 2, where two or all three channels are nearly the same, M_{3} and M_{4} can remove redundancy to the largest extent and outperform the PCA method. This is because some transformed subband signals fell below the masking threshold and more bits were reserved for the summation channel. The same results can be seen for the discrete virtual sources in Table 2, where \phi =\frac{7\pi}{32} (\phi \approx \frac{\pi}{4}) and (\theta =\frac{11\pi}{36},\phi =\frac{7\pi}{32}) (\theta =\frac{13\pi}{36},\phi =\frac{5\pi}{32}) (\theta =\frac{15\pi}{36},\phi =\frac{3\pi}{32}) (\theta =\frac{17\pi}{36},\phi =\frac{\pi}{32}) (cot θ ≈ sin φ). But when the virtual source is located away from the middle of two channels, such as at (\theta =\frac{15\pi}{36},\phi =\frac{5\pi}{32}), M/S coding cannot bring a coding gain. In conclusion, if the input signals are located in one of the five switching areas, a coding gain can be obtained by transforming them into summation and difference signals.
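The objective measure used in this comparison, a per-frame SNR of the center-point sound pressure followed by a 20-frame average filter, can be sketched as below. It assumes that, with all loudspeakers at the same distance, the pressure at the center is proportional to the sum of the channel signals, so the common factor cancels in the ratio. The frame length of 1024 samples and the function names are assumptions.

```python
import numpy as np

def center_pressure(channels):
    """Pressure at the center point up to a common factor (equal distances assumed)."""
    return np.sum(channels, axis=0)

def frame_snr_db(original, decoded, frame_len=1024, eps=1e-12):
    """Per-frame SNR in dB between original and decoded center pressure."""
    p_ref = center_pressure(original)
    p_err = p_ref - center_pressure(decoded)
    n_frames = p_ref.size // frame_len
    ref = p_ref[:n_frames * frame_len].reshape(n_frames, frame_len)
    err = p_err[:n_frames * frame_len].reshape(n_frames, frame_len)
    return 10.0 * np.log10((np.sum(ref ** 2, axis=1) + eps) /
                           (np.sum(err ** 2, axis=1) + eps))

def smooth(values, width=20):
    """Moving average over `width` frames (only fully covered positions)."""
    kernel = np.ones(width) / width
    return np.convolve(values, kernel, mode="valid")

rng = np.random.default_rng(3)
original = rng.standard_normal((3, 1024 * 40))
decoded = 0.9 * original          # toy "codec": 10% amplitude error
snr = frame_snr_db(original, decoded)
snr_smoothed = smooth(snr)
overall_snr = float(np.mean(snr))
```

The toy decoder here only scales the signal, so every frame has the same SNR; a real codec would produce a frame-dependent curve like those in Figure 6.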

Second, the PCA parameter bitrate of 39.3 kbps/channel is considerably higher than that of 3D-M/S. If the three channels have little correlation (e.g. channels with different contents or ambient sound), the transformed signals will not save any bits, and coding efficiency will decrease. To test the three methods under this condition, virtual sources of three different signals were fixed at three channels and all coded at 64 kbps. The experimental result is shown in Figure 6. Independent channel coding achieves the best performance in this case; meanwhile, 3D-M/S degrades by about 1 dB and PCA degrades by nearly 7 dB. This is because PCA spends too many bits on parameters, which in this case cannot bring any coding gain. For 3D-M/S, the parameter bits for the modes amount to only 4.9 kbps/channel. This will not reduce the coding efficiency much under medium and high bitrate conditions, which are the main application scenario of M/S coding. The high parameter bitrate of PCA can be alleviated by reducing the refresh rate of the PCA parameters, but this will also decrease the coding performance on VBAP signals.
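The quoted parameter bitrates can be cross-checked with simple arithmetic. The sketch below assumes 1024-sample AAC long-window frames at 48 kHz, 35 subbands, 3 bits per 3D-M/S mode, and 4 bits for each of the 6 non-redundant elements of a symmetric 3×3 covariance matrix. Only the frame length is not stated in the text.

```python
SAMPLE_RATE_HZ = 48_000
FRAME_LENGTH = 1024   # assumed AAC-LC long-window frame length
SUBBANDS = 35

def side_info_kbps(bits_per_subband):
    """Side-information rate when every subband is refreshed in every frame."""
    frames_per_second = SAMPLE_RATE_HZ / FRAME_LENGTH
    return bits_per_subband * SUBBANDS * frames_per_second / 1000.0

ms_kbps = side_info_kbps(3)        # one 3-bit mode parameter per subband
pca_kbps = side_info_kbps(4 * 6)   # six covariance elements at 4 bits each
```

Under these assumptions the rates come out at about 4.92 and 39.4 kbps, close to the 4.9 and 39.3 kbps figures reported above.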

Finally, Table 4 shows the complexity of each method. Due to the computation of the covariance matrix and the eigenvector matrix, PCA increases complexity by about 30% compared with the original AAC codec. For 3D-M/S, the matrix switching and signal transformation increase complexity by about 11%.

Subjective evaluation

Eight subjects who are actively working in the domain of audio compression participated in the subjective test, which was based on the ITU MUSHRA [34] method and used the 3.5-kHz low-pass filtered original channels as the anchor. The test was carried out in a quiet room with a five-channel spherical configuration as shown in Figure 7. Subjects were required to evaluate the sound quality and sound orientation separately, and could draw the perceived sound position and movement to help their rating.

The MUSHRA test results are shown in Figure 8. For the sound quality, the subjective result generally matches the SNR result, with PCA and 3D-M/S achieving a subjective quality improvement over independent channel coding. For the sound orientation, subjects reported that the voice and castanet signals were easier to locate. Although the 3.5-kHz anchor signal scored poorly in the sound quality test, some subjects indicated that its virtual source position was perceived better for the moving sources than for the discrete virtual sources. This may be because it is easier to distinguish a direction change for a fixed sound source than for a moving sound source. For 3D-M/S, subjects felt that the virtual source had a sharper position in the center of two loudspeakers compared with independent channel coding. On the whole, the subjective improvement in sound orientation exists but is not as obvious as that in sound quality.

From the above results on three point sources and uncorrelated signals, it can be observed that both the PCA and 3D-M/S methods obtain about 13% SNR improvement for each channel. However, 3D-M/S achieves this similar performance at a much lower complexity than PCA. This can be explained by regarding the fixed matrix transforms as special vectors in PCA. The special vectors are chosen based on the assumption that channel signals are either quite similar or quite different. This assumption may not always hold given the diversity of subband signals, but it offers a good compromise between coding efficiency and complexity.
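The relation between a fixed transform and PCA can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it uses a generic normalized sum vector (1, 1, 1)/√3 rather than the transform matrices of this paper, the panning gains are hypothetical, and the principal eigenvector is found by plain power iteration. For three channels carrying the same source with similar gains, the principal PCA vector nearly coincides with the fixed sum vector.

```python
import math
import random


def covariance(channels):
    """3x3 covariance matrix of three equal-length channel signals."""
    n = len(channels[0])
    means = [sum(c) / n for c in channels]
    return [[sum((a[k] - means[i]) * (b[k] - means[j]) for k in range(n)) / n
             for j, b in enumerate(channels)]
            for i, a in enumerate(channels)]


def dominant_eigenvector(m, iters=200):
    """Power iteration for the principal eigenvector of a symmetric matrix."""
    v = [1.0, 0.5, 0.25]  # arbitrary starting vector
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def cosine(u, v):
    """Absolute cosine similarity (the sign of an eigenvector is arbitrary)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return abs(dot) / (nu * nv)


rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Hypothetical panning gains: one source placed near the centre of a triple.
gains = (0.6, 0.6, 0.53)
panned = [[g * s for s in source] for g in gains]

pc1 = dominant_eigenvector(covariance(panned))
sum_vector = [1.0 / math.sqrt(3.0)] * 3  # generic fixed "sum" direction
similarity = cosine(pc1, sum_vector)

print("principal PCA vector:", [round(x, 3) for x in pc1])
print("cosine similarity to fixed sum vector:", round(similarity, 4))
```

When the gains differ strongly, or the channels carry unrelated content, the principal vector moves away from the sum direction, which is the situation where a fixed transform must switch to another mode or fall back to independent coding.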

Conclusion

This paper proposed a 3D-M/S coding method, which inherits the low complexity of conventional M/S coding. Moreover, 3D-M/S performs the sum and difference coding triple by triple, rather than pair by pair as in the conventional method. This structure is more suitable for a 3D multichannel audio configuration, because three adjacent channels form a triangle and have the maximum redundancy among spatially configured 3D audio channels. It is also convenient to unfold a 3D multichannel audio structure into plane triangles. Combined with the proposed framework, the 3D-M/S and PCA methods can be applied to more than three channels. Experiments on VBAP signals indicate that the proposed method improves coding efficiency over independent channel coding while keeping the complexity relatively low compared with PCA. Considering the development of 3D audio technology and its requirement for compression efficiency, a low-complexity 3D audio codec will be promising and preferable for practical applications.

References

Berkhout AJ, de Vries D, Vogel P: Acoustic control by wave field synthesis. J. Acoust. Soc. Am 1993, 93(5):2764-2778. 10.1121/1.405852

Sakaida S, Iguchi K, Nakajima N, Nishida Y, Ichigaya A, Nakasu E, Kurozumi M, Gohshi S: The super hi-vision codec. IEEE International Conference on Image Processing, 2007. ICIP 2007, Volume 1 2007, I-21–I-24.

Herre J, Disch S: New concepts in parametric coding of spatial audio: from SAC to SAOC. 2007 IEEE International Conference on Multimedia and Expo 2007, 1894-1897.

Goodwin M, Jot J: Primary-ambient signal decomposition and vector-based localization for spatial audio coding and enhancement. IEEE International Conference on Acoustics, Speech and Signal Processing, 2007. ICASSP 2007, Volume 1 2007, I-9–I-12.

Cheng B, Ritz C, Burnett I: A spatial squeezing approach to ambisonic audio compression. IEEE International Conference on Acoustics, Speech and Signal Processing, 2008. ICASSP 2008 2008, 369-372.

Hellerud E, Solvang A, Svensson U: Spatial redundancy in Higher Order Ambisonics and its use for lowdelay lossless compression. IEEE International Conference on Acoustics, Speech and Signal Processing, 2009. ICASSP 2009 2009, 269-272.

Tzagkarakis C, Mouchtaris A, Tsakalides P: A multichannel sinusoidal model applied to spot microphone signals for immersive audio. IEEE Trans. Audio Speech Lang. Process 2009, 17(8):1483-1497.

Pinto F, Vetterli M: Wave field coding in the spacetime frequency domain. IEEE International Conference on Acoustics, Speech and Signal Processing, 2008. ICASSP 2008 2008, 365-368.

Pinto F, Vetterli M: Space-time-frequency processing of acoustic wave fields: theory, algorithms, and applications. IEEE Trans. Signal Process 2010, 58(9):4608-4620.

Cheng B, Ritz C, Burnett I, Zheng X: A general compression approach to multi-channel three-dimensional audio. IEEE Trans. Audio Speech Lang. Process 2013, 21(8):1676-1688.

Yang D, Ai H, Kyriakakis C, Kuo CC: High-fidelity multichannel audio coding with Karhunen-Loeve transform. IEEE Trans. Speech Audio Process 2003, 11(4):365-380. 10.1109/TSA.2003.814375

Liu CM, Lee WC, Hsiao YH: M/S coding based on allocation entropy. Proceedings of the 6th International Conference on Digital Audio Effects (DAFx-03) 2003.

Derrien O, Richard G: A new model-based algorithm for optimizing the MPEG-AAC in MS-Stereo. IEEE Trans. Audio Speech Lang. Process 2008, 16(8):1373-1382.

Schafer M, Vary P: Hierarchical multi-channel audio coding based on time-domain linear prediction. 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO) 2012, 2148-2152.

Helmrich C, Carlsson P, Disch S, Edler B, Hilpert J, Neusinger M, Purnhagen H, Robilliard J, Villemoes L, Rettelbach N: Efficient transform coding of two-channel audio signals by means of complex-valued stereo prediction. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2011, 497-500.

Ando A: Conversion of multichannel sound signal maintaining physical properties of sound in reproduced sound field. IEEE Trans. Audio Speech Lang. Process 2011, 19(6):1467-1475.

This work was supported by the National Natural Science Foundation of China (nos. 61231015, 61102127, 61201340, 61201169) and Natural Science Foundation of Hubei (nos. 2011CDB451, 2012FFB04205).

Author information

Authors and Affiliations

National Engineering Research Center for Multimedia Software, Computer School, Wuhan University, Wuhan, China

Shi Dong, Ruimin Hu, Xiaochen Wang, Yuhong Yang & Weiping Tu

Open Access
This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.