An Overview of the Coding Standard MPEG-4 Audio Amendments 1 and 2: HE-AAC, SSC, and HE-AAC v2
© The Author(s) 2009
Received: 29 September 2008
Accepted: 24 February 2009
Published: 3 June 2009
In 2003 and 2004, the ISO/IEC MPEG standardization committee added two amendments to their MPEG-4 audio coding standard. These amendments concern parametric coding techniques and encompass Spectral Band Replication (SBR), Sinusoidal Coding (SSC), and Parametric Stereo (PS). In this paper, we will give an overview of the basic ideas behind these techniques and references to more detailed information. Furthermore, the results of listening tests as performed during the final stages of the MPEG-4 standardization process are presented in order to illustrate the performance of these techniques.
The MPEG-2 Audio coding standard was released in 1997 and has successfully found its way into the market. Later, MPEG-4 Audio Version 1 and Version 2 were issued in mid 1999 and early 2000, respectively. These versions have adopted the MPEG-2 AAC coder including several extensions to it. In addition several other components like speech coders (HVXC and CELP) and a Text-To-Speech Interface are specified.
In 2001, MPEG identified two areas for improved audio coding technology and issued a Call for Proposals (CfP). These two areas were
(i)improved compression efficiency of audio signals or speech signals by means of bandwidth extension which is forward and backward compatible with existing MPEG-4 technology;
(ii)improved compression efficiency of high-quality audio signals by means of parametric coding.
This started a new cycle in the standardization process, consisting of a competitive phase leading to the selection of a reference model, a collaborative phase for improving the technology and, finally, the definition of a new standard. Close to the finalization of the work on parametric coding, it was demonstrated that the parametric stereo (PS) module developed in the course of this work item could also be combined with the bandwidth extension technology, thereby providing a significant additional boost in coding efficiency. This particular combination was subsequently added to the parametric coding amendment. The work on bandwidth extension and parametric coding reached the final stage of Amendments 1 and 2 to MPEG-4 Audio in mid-2003 and mid-2004, respectively.
This paper intends to outline the ingredients of the MPEG-4 Audio Amendments in a comprehensive way. Part of this material is available in the literature, but it is mostly scattered. Therefore, this paper sets out to give an overview of the three components that make up these Amendments, with references to more detailed information where necessary.
The outline of the paper is as follows. The basic technology and subjective test results for the bandwidth extension and parametric coding are discussed in Sections 2 and 3, respectively. In Section 4, the combination of AAC, SBR and PS is outlined, including subjective test results. Finally, in Section 5, the conclusions are presented.
2. MPEG-4 SBR
High-Frequency Reconstruction/Regeneration (HFR), or BandWidth Extension (BWE), techniques have been researched in the speech coding community for decades [2, 3]. The underlying hypothesis stipulates that it should be possible to reconstruct the higher frequencies of a signal given only the corresponding low-frequency content. In the speech coding community, this research was done with the goal of accurately reconstructing the high-band of a speech signal given only the low-pass filtered low-band signal and no other a priori information about the high-band of the original signal. Typically, the high-band was recreated by upsampling of the low-band signal without subsequent low-pass filtering (aliasing), or by means of broad-band frequency translation (single-sideband modulation) of the low-band signal [2, 3]. The spectral envelope of the recreated high-band was either simply whitened and tilted with a suitable roll-off at higher frequencies, or, in more elaborate versions, estimated by means of statistical models. To date, this research has not led to any wide adoption of HFR-based speech enhancement in the market.
(i)The primary means for extending the bandwidth is transposition, which ensures that the correct harmonic structure is maintained for single- and multipitched signals alike.
(ii)Spectral envelope information is always sent from the encoder to the decoder, making sure that the spectral envelope of the reconstructed high-band is correct.
(iii)Additional means such as inverse filtering, noise, and sinusoidal addition, guided by transmitted information, compensate for shortcomings of any bandwidth extension method originating from occasional fundamental dissimilarities between low-band and high-band [8, 9].
Since the HFR method enables a reduction of the core coder bandwidth, and the HFR technique requires a significantly lower bit rate to code the high-frequency range than a waveform coder would, a coding gain can be achieved by reducing the bit rate allocated to the waveform core coder while maintaining full audio bandwidth. Naturally, this gives the possibility to decrease the total data rate by lowering the crossover frequency between the core coder and the HFR part. However, since the audio quality of the HFR part cannot scale towards transparency, this crossover frequency is always a delicate tradeoff between core coder and HFR-related artifacts.
This paper only covers SBR in the MPEG context, where it is standardized for use together with AAC, forming the High Efficiency AAC (HE-AAC) Profile. However, the algorithm and bit stream are essentially core-codec agnostic: SBR has successfully been applied to other codecs such as MPEG Layer-2 and MPEG Layer-3 (the latter combination is known as mp3PRO); it is included in the High Definition Codec (HDC), that is, the proprietary codec used by iBiquity; and it is standardized within Digital Radio Mondiale (DRM) for use together with the CELP and HVXC speech codecs. Furthermore, it is worth noting that the transposition method included in the MPEG-4 standard is a carefully selected tradeoff between implementation cost and quality, relaxing the strict requirements on harmonic continuation that are met by more advanced transposition methods.
2.2. System Overview
2.2.1. SBR Encoding Process
Following the general process of MPEG to standardize transmission formats and decoder operation (and hence allowing future encoder-side improvements), the SBR amendment contains an informative (as opposed to normative) encoder description. Hence, this section gives a generic overview of the various elements of an encoder; the exact design of these elements is left up to the implementer. However, for detailed information on a realization of the encoder capable of high perceptual performance, the 3GPP specification of the SBR encoder is a good source.
The original time-domain input signal is first filtered in a 64-channel analysis QMF bank. The filter bank splits the time-domain signal into complex-valued subband signals and is thus oversampled by a factor of two compared to a regular real-valued QMF bank. For every 64 time-domain input samples, the filter bank produces 64 subband samples. At a 44.1 kHz sample rate, this corresponds to a nominal bandwidth of 344 Hz and a time resolution of 1.4 ms. All the subsequent modules in the encoder operate on the complex-valued subband samples.
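As a quick sanity check, the quoted resolution figures follow directly from the filter bank parameters; a short sketch of the arithmetic (no SBR-specific code):

```python
# Nominal resolution of the 64-channel analysis QMF bank at 44.1 kHz,
# reproducing the (rounded) figures quoted in the text.
fs = 44100        # input sample rate in Hz
bands = 64        # number of QMF channels

bandwidth_hz = fs / 2 / bands          # each band covers 1/64 of the Nyquist range
time_resolution_ms = bands / fs * 1e3  # one subband sample per 64 input samples

print(round(bandwidth_hz, 1), round(time_resolution_ms, 2))
```

The exact values, 344.5 Hz and 1.45 ms, round to the 344 Hz and 1.4 ms stated above.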
A transient detector (part of the "Control parameter extraction" in Figure 2) operates on the complex-valued subband signals in order to assist the envelope estimator in the time/frequency (T/F) grid selection. Generally, longer time segments of higher frequency resolution are produced by the envelope estimator during quasistationary passages, while shorter time segments of lower frequency resolution are used for dynamic passages. The transient detection is, for example, accomplished by calculating running short-term energies and detecting significant changes.
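A minimal sketch of such an energy-based transient detector; the window length and threshold here are illustrative assumptions, not values from the standard:

```python
import numpy as np

def detect_transients(x, win=64, threshold=4.0):
    """Flag analysis windows whose short-term energy jumps by more than
    `threshold` relative to the previous window."""
    n = len(x) // win
    energies = np.array([np.sum(x[i * win:(i + 1) * win] ** 2) for i in range(n)])
    flags = np.zeros(n, dtype=bool)
    eps = 1e-12  # guard against division by zero in silent passages
    for i in range(1, n):
        if energies[i] / (energies[i - 1] + eps) > threshold:
            flags[i] = True
    return flags

# A quiet signal with a sudden burst: the window containing the onset is flagged.
rng = np.random.default_rng(0)
x = 0.01 * rng.standard_normal(1024)
x[512:640] += rng.standard_normal(128)
print(np.flatnonzero(detect_transients(x)))
```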
T/F Grid Selection and Envelope Estimation
The estimated envelope data are obtained by averaging of subband sample energies within segments in time and frequency. The time borders of these segments are determined mainly by the output from the transient detector, and are subsequently signaled to the decoder. When the transient detector signals a transient to the envelope estimator, segments of shorter duration in time are defined by the envelope estimator, starting with a minimal segment, the leading border of which is placed at the onset of the transient. Subsequent to the short-time segment by the transient, somewhat longer segments are used to correctly track a potential decay of the transient, and finally long segments are used for the stationary part of the signal.
The main objective is to avoid pre- and postechoes that otherwise would be induced by the envelope adjustment process in the decoder for transient input signals.
The envelope estimator also decides on the frequency resolution to use within each time segment. The variable frequency resolution is achieved by employing two different schemes for grouping QMF samples in frequency: high resolution and low resolution, where the number of estimates differs by a factor of two. In order to reduce instantaneous peaks in the SBR bit rate, the envelope estimator typically trades one high-resolution envelope for two low-resolution ones. The grouping in frequency can be either linearly spaced or (approximately) logarithmically spaced, where the number of bands to use per octave is variable. An example of a T/F grid selection is given in Figure 3, where the grid is superimposed on a spectrogram of the input signal. As is clear from the figure, the time resolution is higher around the transient events, albeit with lower frequency resolution, and vice versa for the more stationary parts of the signal.
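The envelope estimation over such a T/F grid amounts to averaging subband energies per grid cell; a sketch with arbitrary grid borders and an arbitrary test signal:

```python
import numpy as np

# Sketch of envelope estimation: subband energies are averaged within each
# (frequency segment x time segment) cell of the T/F grid.
def estimate_envelope(subband, time_borders, freq_borders):
    """subband: complex array of shape (bands, time slots).
    Returns the mean energy per (frequency segment, time segment) cell."""
    energy = np.abs(subband) ** 2
    env = np.empty((len(freq_borders) - 1, len(time_borders) - 1))
    for fi in range(len(freq_borders) - 1):
        for ti in range(len(time_borders) - 1):
            cell = energy[freq_borders[fi]:freq_borders[fi + 1],
                          time_borders[ti]:time_borders[ti + 1]]
            env[fi, ti] = cell.mean()
    return env

rng = np.random.default_rng(1)
qmf = rng.standard_normal((32, 16)) + 1j * rng.standard_normal((32, 16))
# two time segments (a short one around a transient, then a long one)
# and three frequency segments of unequal width
env = estimate_envelope(qmf, time_borders=[0, 4, 16], freq_borders=[0, 8, 16, 32])
print(env.shape)  # (3, 2)
```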
Noise Floor Estimation
Missing Harmonics Detection
Quantization and Encoding
The SBR envelope data, tonal component data, and noise-floor data are quantized and differentially coded in either the time or frequency direction in order to minimize the bit rate. All data is entropy coded using Huffman tables. Details about SBR data coding are given in the next section.
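The choice between time- and frequency-direction differential coding can be illustrated with a toy example; the sum of absolute deltas stands in for the real Huffman table cost, and the envelope values are made up:

```python
import numpy as np

def diff_in_freq(cur):
    # first value sent raw, the remaining values as deltas along frequency
    return np.concatenate(([cur[0]], np.diff(cur)))

def diff_in_time(prev, cur):
    # deltas against the corresponding values of the previous envelope
    return cur - prev

def choose_direction(prev, cur):
    """Pick the differential-coding direction with the smaller (proxy) cost."""
    df = diff_in_freq(cur)
    dt = diff_in_time(prev, cur)
    cost_f = np.abs(df).sum()  # crude stand-in for entropy-coded size
    cost_t = np.abs(dt).sum()
    return ('freq', df) if cost_f <= cost_t else ('time', dt)

prev = np.array([10.0, 11.0, 12.0, 12.0])  # previous (quantized) envelope
cur = np.array([10.0, 11.0, 13.0, 12.0])   # current envelope, nearly unchanged
direction, deltas = choose_direction(prev, cur)
print(direction)
```

For a quasi-stationary signal like this one, time-direction deltas are almost all zero, so the time direction wins.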
2.2.2. SBR Bit Stream
To ensure consistent coding of transients regardless of localization within codec frames, the SBR frames have variable time boundaries, that is, the exact duration in time covered by one SBR frame may vary from frame to frame. The bit stream is designed for maximum flexibility such that it scales well from the lowest bit rate applications up to medium and high bit rate use cases, and is easy to adapt for different core codec frame lengths. Furthermore, it is possible to trade bit-error robustness against increased coding efficiency by selecting the degree of interframe dependencies, and the signaling scheme offers error detection capabilities in addition to a Cyclic Redundancy Check (CRC).
2.2.3. SBR Decoding Process
The HF Generator transposes parts of the low-band frequency range to the high-band frequency range covered by SBR, as indicated in the bit stream. In the bottom left panel of Figure 7, the spectrum of the transposed intermediate signal in combination with the low-band signal is displayed. This is how the output would look if no envelope adjustment of the recreated high-band were performed.
The envelope adjuster adjusts the spectral envelope of the recreated high-band signal according to the envelope data and time/frequency grid that was transmitted in the bit stream. Additionally, noise and sinusoid components are added as signaled in the bit stream. The output from the SBR decoder after envelope adjustment is depicted in the bottom right panel of Figure 7. In the following the decoding steps are examined in more detail.
The complex-valued subband signals obtained from the filter bank are processed in the high-frequency generation unit to obtain a set of high-band subband signals. The generation is performed by selecting low-band subband signals, according to specific rules, which are mirrored or copied to the high-band subband channels. The patches of QMF subbands to be copied, their source range and target range, are derived from information on the borders of the SBR range, as indicated by the bit stream. The algorithm generating the patch structure has the following objectives.
(i)The patches should cover the frequency range up to 16 kHz with as few patches as possible, without using the QMF subband lowest in frequency (i.e., the subband including DC) in any patch.
(ii)If several patches constitute the high-band, a patch covering a lower frequency range should have a wider or equal bandwidth compared to a patch covering a higher frequency range. The motivation is that for lower frequencies the human hearing is more sensitive, and therefore patches with wide bandwidth are preferred for lower frequencies in order to move any potential discontinuity between the first and the second patch as high up in frequency as possible.
(iii)The source frequency range for the patches should be as high up in frequency as possible.
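A simplified, non-normative illustration of these objectives (the function name and the greedy strategy are assumptions for illustration; the real MPEG-4 patch construction differs in detail):

```python
# Greedy sketch of the patching objectives: cover the SBR range [k0, k2)
# with as few patches as possible, never use subband 0 (the DC band) as a
# source, keep lower patches at least as wide as higher ones, and take
# source bands as high in frequency as possible.
def build_patches(k0, k2):
    """k0: first SBR band (also one above the top usable source band),
    k2: first band above the SBR range (e.g., around 16 kHz).
    Returns a list of (source_start_band, width) tuples. Requires k0 >= 2."""
    patches = []
    target = k0
    max_width = k0 - 1  # usable source range is bands 1 .. k0-1
    while target < k2:
        width = min(max_width, k2 - target)
        src_start = k0 - width  # source as high in frequency as possible
        patches.append((src_start, width))
        target += width
        max_width = width       # later (higher) patches may not be wider
    return patches

print(build_patches(k0=16, k2=48))
```

With these example borders the high-band is covered by three patches whose widths are non-increasing with frequency, as objective (ii) requires.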
Creating the high-band in this way has several advantages and is the reason why SBR can be referred to as a semi-, or quasi-, parametric method. Although the high-band is synthetically generated and shaped by the SBR bit-stream data, the characteristics of the high-band are inherited from the low-band and, most importantly, so is the temporal structure of the high-band. This makes the corrections of the high-band, in order to resemble the original, much more likely to succeed in the subsequent processing steps.
With the above in mind, the characteristics of the low-band and the high-band still differ for different audio signals. For example, the tonality is usually more pronounced in the low-band than in the high-band. Therefore, inverse filtering is applied to the generated high-band subband signals. The filtering is accomplished by in-band filtering of the complex-valued signals using adaptive low-order complex-valued FIR filters. The filter coefficients are determined through an analysis of the low-band in combination with control signals extracted from the SBR data stream. A second-order linear predictor is used to estimate the spectral whitening filter using the covariance method. The amount of inverse filtering is controlled by a chirp factor given by the bit stream. Hence, the HF-generated signal for a given QMF subband and time slot in the high-band is obtained by this chirped inverse filtering of the corresponding patched low-band subband signal.
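The inverse-filtering idea can be sketched as follows, with illustrative function names: a second-order complex predictor is estimated with the covariance method, and the resulting prediction error filter, with its coefficients scaled by the chirp factor, whitens the subband signal:

```python
import numpy as np

def covariance_lpc2(x):
    """Estimate a1, a2 such that x[n] ~ a1*x[n-1] + a2*x[n-2]
    (covariance method, complex-valued, summation over n = 2..N-1)."""
    def phi(i, k):
        return np.sum(x[2 - k:len(x) - k] * np.conj(x[2 - i:len(x) - i]))
    A = np.array([[phi(1, 1), phi(1, 2)],
                  [phi(2, 1), phi(2, 2)]])
    b = np.array([phi(1, 0), phi(2, 0)])
    return np.linalg.solve(A, b)

def inverse_filter(x, a, chirp):
    """Prediction error filter with chirped coefficients:
    y[n] = x[n] - chirp*a1*x[n-1] - chirp^2*a2*x[n-2]."""
    y = x.astype(complex).copy()
    y[1:] -= chirp * a[0] * x[:-1]
    y[2:] -= chirp ** 2 * a[1] * x[:-2]
    return y

# A strongly tonal subband signal (complex exponential plus a little noise)
# is largely whitened: its energy drops markedly after inverse filtering.
rng = np.random.default_rng(2)
n = np.arange(128)
x = np.exp(1j * 0.3 * n) + 0.01 * (rng.standard_normal(128)
                                   + 1j * rng.standard_normal(128))
a = covariance_lpc2(x)
y = inverse_filter(x, a, chirp=0.9)
print(np.sum(np.abs(y) ** 2) < 0.5 * np.sum(np.abs(x) ** 2))
```

A chirp factor of 0 leaves the signal untouched; a factor near 1 applies (almost) the full whitening filter.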
Given that the subband signals are patched from the low-frequency region to regions A and B in Figure 10, so are the prediction error filter coefficients for the low-frequency region. Thus, suitable prediction error filter coefficients are available for all subbands within regions A and B. Hence, for all QMF subbands within these regions in Figure 10, inverse filtering is done within each subband, given the corresponding prediction error filter estimated on the corresponding low-band subband samples and the chirp factor signaled in the bit stream for the specific region.
It should be noted that all the processing in the HF Generation module is performed frame by frame, on the time segment indicated by the outer borders of the SBR frame.
The generated high-band signals are subsequently fed to the envelope adjusting unit.
The most important, and also the largest, part of the SBR data stream is the spectrotemporal envelope representation of the high-band. This envelope representation is used to adjust the energy of the generated high-band subband signals. The envelope adjusting unit first performs an energy estimate of the high-band signals. An accurate estimate is possible because of the complex-valued subband signal representation. The resulting energy samples are subsequently averaged within segments according to control signals from the data stream. This averaging produces the estimated envelope samples. Based on the estimated envelope and the envelope representation extracted from the data stream, the energy of the high-band subband samples in the respective segments is adjusted.
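A sketch of the per-segment energy adjustment, with illustrative names: the gain is chosen so that the measured segment energy matches the transmitted reference energy:

```python
import numpy as np

def adjust_segment(hf, ref_energy, eps=1e-12):
    """hf: complex subband samples of one T/F segment.
    ref_energy: transmitted mean energy for that segment.
    Returns the gain-adjusted segment."""
    est_energy = np.mean(np.abs(hf) ** 2)          # estimated envelope sample
    gain = np.sqrt(ref_energy / (est_energy + eps))
    return gain * hf

rng = np.random.default_rng(3)
seg = rng.standard_normal(64) + 1j * rng.standard_normal(64)
out = adjust_segment(seg, ref_energy=0.5)
print(round(float(np.mean(np.abs(out) ** 2)), 3))  # matches ref_energy: 0.5
```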
As previously outlined sinusoids present in the original high-band signal that have no corresponding sinusoid in the generated high-band are synthesized in the decoder, and random white noise is added to the high-band signal to compensate for diverging tonal-to-noise ratios of the high-band and low-band.
A noise floor level is used to derive the level of noise to be added to the recreated high-band signal. It is defined as the energy ratio between the HF-generated signal energy (obtained by means of patching in the HF generator) and the noise signal energy of the final output signal.
Given the calculated gain values, a limiting procedure is applied. This is designed to avoid excessively high gain values that would otherwise result from large differences between the transposed signal energy and the reference energy given by the original input signal. The limiter limits high narrow-band gain values while ensuring that the correct wide-band energy is maintained.
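A possible (simplified) limiter along these lines; the gain ceiling of 1.5 times the RMS gain is an illustrative assumption, and the energy compensation is cruder than in the actual decoder:

```python
import numpy as np

def limit_gains(gains, energies, max_rel=1.5):
    """gains: per-subband gain values; energies: per-subband signal energies.
    Clip gains above a ceiling relative to the RMS gain, then rescale so the
    wide-band output energy equals that of the unlimited gains."""
    gains = np.asarray(gains, dtype=float)
    energies = np.asarray(energies, dtype=float)
    target = np.sum(gains ** 2 * energies)            # wide-band energy to keep
    ceiling = max_rel * np.sqrt(np.mean(gains ** 2))  # gain ceiling
    limited = np.minimum(gains, ceiling)
    scale = np.sqrt(target / np.sum(limited ** 2 * energies))
    return limited * scale

g = np.array([1.0, 1.0, 1.0, 8.0])  # one narrow-band gain much too high
e = np.ones(4)
out = limit_gains(g, e)
print(np.allclose(np.sum(out ** 2 * e), np.sum(g ** 2 * e)))  # True
```

The outlier gain is reduced while the total (wide-band) energy of the adjusted signal is preserved exactly.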
The generated high-band signals and the delay-compensated (as a result of the HF generation process) low-band signals are finally supplied to the 64-channel synthesis filter bank, which usually operates at the sampling frequency of the original signal. Like the analysis filter bank, the synthesis filter bank is complex-valued; however, the imaginary part of the output signal is discarded. Thus, the filter bank generates a real-valued full-bandwidth output signal having twice the sampling frequency of the core coder signal.
2.2.4. Other Aspects
Low Power SBR
The SBR tool as outlined in the previous sections is defined in two versions: a High-Quality version and a Low-Power version. The main difference is that the Low-Power version utilizes real-valued QMF filter banks, while the High-Quality version utilizes complex-valued filter banks. In order to make the SBR tool work in the real-valued domain, additional tools are included that strive to minimize the introduction of aliasing in the SBR processing. The main feature is an aliasing detection algorithm that identifies adjacent QMF subbands with strong tonal components in the overlapping range. The detection is done by studying the reflection coefficient of a first-order in-band linear predictor. By observing the signs of the reflection coefficients for adjacent subbands, the subbands prone to introduce aliasing can be identified. For the identified subbands, restrictions are put on how much the gain adjustment is allowed to vary between the two subbands.
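The sign-based detection can be illustrated as follows; the decision rule below is a simplified stand-in for the normative one, and all names are illustrative:

```python
import numpy as np

def reflection_coeff(x):
    """First-order reflection coefficient of a real-valued subband signal:
    positive for low-frequency content, negative for content near the
    subband's upper edge (near Nyquist of the downsampled subband)."""
    return float(np.dot(x[1:], x[:-1]) / (np.dot(x, x) + 1e-12))

def aliasing_prone_pairs(subbands):
    """subbands: list of real-valued subband signals, adjacent in frequency.
    Returns indices k where subbands k and k+1 look aliasing-prone: a strong
    tone sits near their shared band edge, i.e., high-frequency content in
    the lower subband meets low-frequency content in the upper one."""
    r = [reflection_coeff(s) for s in subbands]
    return [k for k in range(len(r) - 1) if r[k] < 0 < r[k + 1]]

n = np.arange(256)
low = np.cos(2.9 * n)   # near-Nyquist tone: negative reflection coefficient
high = np.cos(0.2 * n)  # low-frequency tone: positive reflection coefficient
print(aliasing_prone_pairs([low, high]))  # [0]
```

For such a pair, the Low-Power decoder restricts how much the gains of the two subbands may differ, so that the gain step does not expose the aliased component.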
The upper panel of Figure 11 illustrates a high-resolution frequency analysis of the input signal superimposed on a stylized visualization of the QMF frequency response. In the middle panel, the gain values to be applied to every subband are displayed. As can be seen, these vary from subband to subband. In the bottom panel, the high-resolution frequency analysis is displayed again, this time after application of the gain values. As can be observed from the figure, aliasing is introduced.
Apart from the application where a low sampling rate output is desired due to complexity constraints, the downsampled SBR mode also serves another purpose. When scaling towards higher bit rates it may be desirable to run the AAC core coder at a higher sampling frequency, for example, 44.1 kHz. Hence, an SBR encoder can operate on a 44.1 kHz input signal, and upsample the signal in the encoder to 88.2 kHz, thus enabling the dual-rate mode. The SBR decoder subsequently operates on the 44.1/88.2 kHz dual-rate signal, but does so in a downsampled mode, ensuring that the output signal has the 44.1 kHz sampling rate equal to that of the original input signal. More information on sampling rate modes in High Efficiency AAC is given in .
In the top left panel of Figure 15, a spectrum of the two AAC layers (the core layer and the enhancement layer) is given. In the top right of the figure, the frequency range that can be recreated using the SBR data stored in the core layer is displayed, and a spectrum of the SBR signal available for this range is shown. It is clear that the SBR information covers the widest frequency range required for any combination of layers. In the bottom left figure, the bandwidth relation of the core coder and the SBR tool is illustrated for the scenario where only the core layer is available. In the bottom right figure, the same relation is illustrated for the scenario where the core layer and the first enhancement layer are available. As can be seen from the bottom right picture, the lowest part of the SBR range has been replaced by the core coder.
Apart from supporting bandwidth-scalable core coders, the SBR tool can also work in conjunction with mono-to-stereo scalability. This means that the SBR data can be divided into two groups, one group representing the general SBR data and level information of the one or two channels, and the other group representing the stereo information. If the core coder employs mono/stereo scalability, that is, the base layer contains the mono signal and the enhancement layer contains the stereo information, the SBR decoder can apply only the mono-relevant SBR data and omit the stereo-specific parts if only a mono core-coder signal is available. If the enhancement layer is decoded and the core coder outputs a stereo signal, the SBR tool operates on the stereo signal as normal, using the complete SBR data in the stream.
MPEG-2 Bit Streams
Although the focus of the present paper is on the MPEG-4 version of SBR, it should be noted that the exact same tool is standardized in MPEG-2 as well. Hence, the MPEG-2 AAC and SBR combination is also defined. This is important for certain applications relying on MPEG-2 technology while still wanting to achieve state-of-the-art compression by using SBR in combination with AAC.
2.3. Listening Tests
At the end of the two-year standardization process, a rigorous verification test was performed. Two types of tests were done: a MUSHRA (MUlti Stimulus test with Hidden Reference and Anchor) test and a CMOS (Comparative Mean Opinion Score) test. The MUSHRA test compared the performance of MPEG-4 HE-AAC with that of MPEG-4 AAC when coding mono and stereo signals at bit rates around 24 kbps per channel, while the CMOS test was used to show the difference between High-Quality SBR and Low-Power SBR. Two test sets were selected, one for mono testing and one for stereo testing. The items were selected from 50 potential candidates by a selection panel identifying ten items considered critical for all of the systems under test.
Codecs under test:
(i) MPEG-4 AAC profile: AAC at 48/60 kbps and HE-AAC at 32/48 kbps;
(ii) anchors and reference (all 16-bit PCM stereo): the reference, a 3.5 kHz band-limited anchor, and a 7 kHz band-limited anchor.
The SBR technology in combination with AAC, as standardized in MPEG under the name High Efficiency AAC (also known as aacPlus), offers a substantial improvement in compression efficiency compared to previous state-of-the-art codecs. It is the first audio codec to offer full-bandwidth audio at good quality at low bit rates. This makes it the ideal codec (and enabler) for low bit-rate applications such as Digital Radio Mondiale and streaming to mobile phones.
3. MPEG-4 SSC
3.1. Parametric Mono Coding
Current standardized and proprietary coding schemes are primarily built on waveform coding techniques. These coding algorithms translate the incoming signal to the frequency domain by means of a subband or transform technique. Furthermore, a psychoacoustic model analyzes the incoming signal and determines the number of bits for quantization of each of the subband or transform signals.
The subband or transform audio coding schemes primarily exploit the destination (human ear) model; the psychoacoustic model tells us where signal distortions (quantization) are allowed such that they are inaudible or least annoying. In speech coding, on the other hand, source models are primarily used. The incoming signal is matched to the characteristics of a source model (the vocal tract model), and the parameters of this source model are transmitted. In the decoder, the source model and its parameters are used to reconstruct the signal.
The speech coding approach guarantees that the reproduced signal is in accordance with the model. This implies that, if the model is an accurate description, the generated signal will sound as if it stems from a vocal tract and will therefore sound natural, though not necessarily identical to the incoming signal.
For audio, it is not possible to directly follow an approach like that in speech coding. There are many sources in audio, and these have quite different characteristics. The consequences of using too restrictive a source model can be devastating to the sound quality. This is already demonstrated by speech coders operating at low bit rates; input signals other than speech typically result in poor quality of the decoded output signals.
Nevertheless, a model is used in parametric coding. This is called a signal model to distinguish it from the source models used in speech coding. The signal model is based more on destination properties (i.e., the human hearing system) in the sense that it tries to describe perceptually relevant acoustic events. Consequently, parametric coding is also related to musical synthesis. However, the distinction between source and destination models is arguable; for example, many musical instruments create tonal components, and biological evolution presumably leads to a tight connection between destination and source characteristics.
The promises that the parametric approach holds are therefore as follows. First of all, the signal model should always lead to an impression of an agreeable sound even at low bit rates. Thus a graceful degradation of sound quality with bit rate should be feasible. This is a property which is difficult to attain in conventional audio coding techniques. Secondly, since the idea is to model acoustic events, we may be able to manipulate these events (like in musical synthesis), a feature clearly not feasible in conventional audio coding.
At various universities, prototype parametric audio coders have been developed [22–29]. Prior to the parametric coder described in this paper, there was only one standardized parametric audio coder: HILN in MPEG-4 Audio Version 2.
In the Sinusoidal Coder (SSC) that is described here and standardized in MPEG-4, three objects can be discerned. The first comprises tonal components, which are modeled by sinusoids; this idea originated in speech coding [32–34]. The second is a noise object. This object is also present in speech coders, where segments are typically classified as voiced or unvoiced, corresponding to periodic and noise excitations, respectively. In audio, sinusoidal and noise coding were combined at an early stage.
Both sinusoidal modeling and noise modeling assume that the signal segment being modeled is stationary. In view of bit rate and frequency resolution, these segments cannot be made too short. Consequently, one can find audio segments that, given the analysis segment length, contain clearly nonstationary events. A famous example is the castanets excerpt, which is therefore a critical item for almost any coder. In view of this, it was decided to introduce a third object: transients. The coder not only uses a separate transient object but also adapts the windowing for the sinusoidal and noise analysis and synthesis on the basis of detected transients.
3.1.1. SSC Decoder
A detailed description of the decoder can be found in the MPEG-4 standard. In the following, we merely outline the operations of the different modules.
The bit stream contains transient information. First of all, transient positions are transmitted together with a type parameter. There are two types: a step-like transient and a Meixner transient. In both cases, the transient position is used to generate adapted overlap-add windows for the sinusoidal and noise synthesis. Thus this information is shared by the three synthesizers TrS, SiS, and NoS.
The parameters of the Meixner envelope define the rise and decay times of the transient. In the case of a step-like transient, no signal is created by the transient generator. However, due to the use of the adapted overlap-add windows in the sinusoidal and noise synthesizers, a transient phenomenon is created in the mono signal for the step transient as well.
The sinusoidal data is contained in so-called sinusoidal tracks. From these tracks, information on the number of sinusoids, their frequencies, amplitudes, and phases is available for each frame. These sinusoids are synthesized to produce a waveform per frame. Typically, the frames are overlap-added using an amplitude-complementary Hanning window with 50% overlap. In case of a transient, the fade-in or fade-out of these overlap-add windows is shortened and positioned around the pertinent transient position.
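The overlap-add synthesis can be sketched as follows; Hann windows at 50% overlap are amplitude-complementary, so a phase-continuous track reconstructs the sinusoid exactly in the overlap region. Frame length and track parameters are illustrative:

```python
import numpy as np

def synthesize_frames(frames, frame_len=256):
    """frames: one parameter list [(amplitude, freq_rad, phase), ...] per frame.
    Returns the overlap-added output signal."""
    hop = frame_len // 2
    out = np.zeros(hop * (len(frames) + 1))
    # periodic Hann window: win[n] + win[n + hop] == 1, i.e., amplitude-
    # complementary at 50% overlap
    win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
    n = np.arange(frame_len)
    for i, params in enumerate(frames):
        frame = np.zeros(frame_len)
        for amp, freq, phase in params:
            frame += amp * np.cos(freq * n + phase)
        out[i * hop:i * hop + frame_len] += win * frame
    return out

# A steady track: the second frame's phase continues the first one's
# (phase advances by freq * hop between frame starts).
freq = 0.1
frames = [[(1.0, freq, 0.0)], [(1.0, freq, freq * 128)]]
y = synthesize_frames(frames, frame_len=256)
```

In the overlap region the two windowed frames sum back to the original sinusoid, which is why slowly evolving tracks survive this synthesis without amplitude modulation artifacts.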
The noise synthesizer consists of a white noise generator with unit variance. The bit stream contains data describing the temporal envelope per four frames. The envelope is generated and applied to the noise. Next, this temporally shaped noise is fed to a linear prediction synthesis filter based on a Laguerre filter. The data on the filter coefficients are contained in the bit stream per frame. The generated noise is overlap-added using power-complementary windows.
3.1.2. SSC Encoder
The SSC encoder is not standardized by MPEG and as such several designs are possible. We will discuss the structure of the encoder we developed and different possible mechanisms within this structure.
The first residual is input to a sinusoidal analyzer (SiA), which also uses the estimated transient positions. This information is exploited to prevent measuring over nonstationary data, which is achieved by adapting the analysis windows. The sinusoidal parameters are fed to a sinusoidal synthesizer (SiS), which generates a waveform. This waveform is subtracted from the first residual signal, thus generating a second residual signal.
The second residual signal is fed to a noise analyzer (NoA). This analyzer tries to capture the spectral and temporal envelopes of the remaining signal, ignoring its specific waveform. In this analysis module too, the transient position estimates are used for window adaptation.
The parameter streams generated by the transient detector and the various analysis stages are fed to a bit stream formatter (BSF). At this stage, irrelevancy and redundancy of the parameter streams are exploited and the data is quantized. The quantized data is stored in a bit stream.
Though the concept of separation into these three different objects is similar to earlier work, there are large differences between the approaches. This holds for the different models used for the noise and transient components, but also in the sense that the earlier approach subdivides the input signal into time-frequency tiles, where each tile is exclusively modeled by one of the three components.
The transient analysis is only performed when the transient detector signals the occurrence of a sudden change in the input signal. The detector can be built on the basis of detecting energy changes, where these changes are defined over the entire frequency range or over different frequency bands. In addition to detecting a transient, the detector estimates the start position of the transient.
When the transient detector signals the occurrence of a transient in a frame, the transient analysis module becomes active. On the basis of the input signal and the received transient start position, it first determines the character of the transient. If the transient phenomenon is shorter than the analysis frame length used in the sinusoidal and noise analysis (typically on the order of tens of milliseconds), a Meixner modeling stage becomes active. Otherwise, the transient is designated as a step transient and no separate modeling is applied. Instead, the transient position information is used in the sinusoidal and noise analysis for window adaptation.
For a short transient phenomenon, the Meixner modeling stage is employed. It determines a time-domain envelope and a number of sinusoids underneath the envelope. For a detailed description of the time-domain envelope modeling process, we refer to [37–39]. This transient is subtracted from the input signal in order to ensure that this intra-frame transient is removed as much as possible before entering the sinusoidal and noise analysis, since these stages operate under the assumption that the input signal is quasistationary.
Sinusoidal analysis is a well-known technique for which many algorithms exist. Of these we mention peak-picking, matching pursuit, and psychoacoustically weighted matching pursuit. Whatever method is used, a set of frequencies, amplitudes, and phases results as the outcome. Extended models including amplitude and frequency variations [43, 44] for more accurate signal modeling have been proposed as well but are not used in the SSC coder.
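The simplest of these methods, peak-picking, can be sketched in a few lines: locate local maxima of the windowed magnitude spectrum and read off a frequency, amplitude, and phase per peak. The window choice, peak count, and absence of interpolation are illustrative simplifications, not the SSC analysis.

```python
import numpy as np

def peak_pick_sinusoids(x, fs, n_peaks=5):
    """Estimate sinusoid parameters by FFT peak-picking (sketch).

    Returns (frequencies in Hz, amplitudes, phases) of the strongest
    local spectral peaks of a Hann-windowed frame.
    """
    n = len(x)
    win = np.hanning(n)
    spec = np.fft.rfft(x * win)
    mag = np.abs(spec)
    # local maxima, excluding the DC and Nyquist bins
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]]
    peaks.sort(key=lambda k: mag[k], reverse=True)
    peaks = sorted(peaks[:n_peaks])
    freqs = np.array([k * fs / n for k in peaks])
    # compensate for the coherent gain of the Hann window
    amps = np.array([2.0 * mag[k] / win.sum() for k in peaks])
    phases = np.array([np.angle(spec[k]) for k in peaks])
    return freqs, amps, phases
```

For a bin-centered test tone this recovers the frequency exactly and the amplitude to within the window leakage.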
In contrast to the HILN coder, SSC does not use harmonic complexes as an object. Though a harmonic object can act as a compaction of the sinusoidal data, it was decided not to use it for several reasons. Firstly, harmonic complexes need to be detected, which may lead to erroneous detection decisions. Secondly, the linking process becomes more complicated because links have to be established not only among sinusoids and among harmonic complexes separately, but also between the two. Lastly, signaling links between harmonic complexes and individual sinusoids would lead to a much more complex structure of the bit stream.
The sinusoids from subsequent frames are linked in order to obtain sinusoidal tracks. Transmission of track data is relatively efficient since the characteristic property of a track is the slow evolution of the sinusoidal amplitude and frequency. Only the phase has a more complicated character. In principle, the phase can be constructed from the frequency since these are related by an integral relation. Thus in order to arrive at low bit rate sinusoidal coders, the phase is typically not transmitted. However, phase is an important property: phase relations between different tracks are relevant for the perception and can be severely distorted when not transmitting the phase. Therefore, a new phase transmission mechanism was conceived which transmits the unwrapped phase and thus implicitly the frequency parameter as well . This is slightly more expensive in terms of bits than discarding the phase but improves the perceived quality and is much more efficient than separate frequency and phase transmission.
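The integral relation between frequency and phase mentioned above can be made concrete: the unwrapped phase of a track is the running sum of the per-frame phase increments, so transmitting the unwrapped phase implicitly conveys the frequency as well. The sketch below (frame hop and values are illustrative) shows both directions of this relation.

```python
import numpy as np

def unwrapped_phase_track(freqs_hz, fs, hop, phase0=0.0):
    """Cumulative (unwrapped) phase of a sinusoidal track.

    freqs_hz: per-frame frequency estimates; hop: frame hop in samples.
    The unwrapped phase is the running integral of the frequency.
    """
    phase = [phase0]
    for f in freqs_hz:
        phase.append(phase[-1] + 2 * np.pi * f * hop / fs)
    return np.array(phase)

def freq_from_phase(phase, fs, hop):
    """Recover the per-frame frequency from the unwrapped phase."""
    return np.diff(phase) * fs / (2 * np.pi * hop)
```

Differentiating the unwrapped phase returns exactly the transmitted frequencies, which is why no separate frequency parameter is needed.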
In order to remain within a predefined bit budget, the estimated sinusoids are typically ordered in importance and the number of transmitted sinusoids is reduced when necessary. An overview of methods for doing so can be found in .
The noise analysis characterizes the incoming signal by two properties only: its spectral shape (spectral envelope) and its temporal envelope (power over time). As such, the analysis consists of two distinct stages. First, the spectral envelope is extracted. The spectral envelope is obtained by using linear prediction based on the Laguerre systems . The use of these filters is motivated by the fact that it allows modeling of spectral details in accordance with their relevance on a Bark frequency scale .
The resulting spectrally flattened signal is analyzed for its temporal structure. This structure is analyzed over several frames simultaneously in order to obtain a good balance between required bit rate and modeling capability. The envelope modeling is done by linear prediction in the frequency domain [48, 49].
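Both envelope stages are instances of linear prediction. As a simplified illustration of the common core, the sketch below estimates prediction coefficients with the Levinson-Durbin recursion; note that SSC actually uses Laguerre-based (frequency-warped) prediction for the spectral envelope and prediction in the frequency domain for the temporal envelope, neither of which is reproduced here.

```python
import numpy as np

def lpc_levinson(x, order):
    """Plain LPC coefficients via the Levinson-Durbin recursion (sketch).

    Returns (a, err) with a[0] = 1 such that sum_j a[j] * x[n-j] is the
    prediction residual, and err the final prediction error power.
    """
    # biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err
```

Applied to a synthetic second-order autoregressive signal, the recursion recovers the generating coefficients.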
Since both linear prediction stages yield normalized envelopes, a separate gain parameter is determined and fed to the BSF as well.
3.1.3. SSC Bit Stream
The bit stream formatter receives the data from the analyzers and packs them, together with headers, into a bit stream. Details of the bit stream syntax are described in . We will consider the main data only.
The transient data comprises the transient position, transient type, envelope data, and sinusoids. The transient position and type are directly encoded. The envelopes are restricted to a small dictionary. The sinusoids underneath the envelope are characterized by their amplitude, frequency, and phase. Amplitude and frequency quantization can be done with different levels of accuracy. The amplitudes are uniformly quantized on a dB scale with at least 1.5 dB accuracy. The frequencies are uniformly quantized on an ERB scale . For a 1 kHz frequency the accuracy is at least 0.75%. Both amplitude and frequency are Huffman encoded. The phases are encoded using 5 bit uniform quantization.
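The two quantization grids mentioned above can be sketched directly: a uniform grid on a dB scale for the amplitudes (the 1.5 dB step is taken from the text) and a uniform grid on an ERB-rate scale for the frequencies. The ERB step size below and the Glasberg-Moore ERB-rate formula are illustrative assumptions, not the values mandated by the SSC bit stream syntax.

```python
import numpy as np

def hz_to_erb(f):
    """Frequency in Hz to the ERB-rate scale (Glasberg & Moore form)."""
    return 21.4 * np.log10(1 + 0.00437 * f)

def erb_to_hz(e):
    """Inverse of hz_to_erb."""
    return (10 ** (e / 21.4) - 1) / 0.00437

def quantize_db(amp, step_db=1.5):
    """Uniform quantization of an amplitude on a dB scale."""
    db = 20 * np.log10(amp)
    return 10 ** (np.round(db / step_db) * step_db / 20)

def quantize_erb(f_hz, step_erb=0.05):
    """Uniform quantization of a frequency on an ERB-rate scale."""
    return erb_to_hz(np.round(hz_to_erb(f_hz) / step_erb) * step_erb)
```

With these grids, the amplitude error is at most half the dB step, and around 1 kHz the relative frequency error stays well below one percent.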
The sinusoidal data comprises sinusoidal tracks. These can be divided into start data and track data, the latter covering everything after the start of a sinusoid up to and including its death. The start data are sorted according to ascending frequency, quantized uniformly on an ERB scale and differentially encoded. The amplitude data are sorted in correspondence with the frequencies, uniformly quantized on a dB scale and differentially encoded using Huffman tables. The accuracy of both the amplitude and frequency quantization can be set to different levels. The start phases are encoded using 5 bits.
The sinusoidal track data consists of unwrapped phases and amplitudes. The unwrapped phase data along a track is a combination of the originally estimated frequency and phase per frame and those from the previous frame (as established by the linking). This unwrapped phase data is input to a 2-bit ADPCM mechanism . The amplitudes are quantized on a dB scale and differentially encoded along a track using Huffman coding.
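To give a feel for how few bits suffice along a slowly evolving track, the sketch below implements a generic 2-bit ADPCM encoder. It is emphatically not the standardized SSC scheme: the previous-sample predictor, the four quantizer levels, and the step-size adaptation rule are all illustrative choices.

```python
import numpy as np

def adpcm2_encode(x, step0=0.1):
    """Generic 2-bit ADPCM sketch (not the standardized SSC scheme).

    Predicts each sample by the previous reconstructed value and
    quantizes the prediction residual to one of four levels; the step
    size grows on the outer codes and shrinks on the inner ones.
    Returns the 2-bit codes and the decoder-side reconstruction.
    """
    levels = np.array([-1.5, -0.5, 0.5, 1.5])
    step = step0
    prev = 0.0
    codes, recon = [], []
    for v in x:
        resid = v - prev
        c = int(np.argmin(np.abs(resid - levels * step)))
        y = prev + levels[c] * step          # decoder reconstruction
        step *= 1.5 if c in (0, 3) else 0.8  # step-size adaptation
        codes.append(c)
        recon.append(y)
        prev = y
    return codes, np.array(recon)
```

On a steadily increasing input, such as an unwrapped phase of roughly constant frequency, the step size locks onto the slope and the reconstruction tracks the input closely.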
The noise data consists of three parts: a gain, a spectral, and a temporal envelope. The gain is quantized uniformly on a dB scale and Huffman encoded. The prediction coefficients describing the spectral envelope are mapped onto Log Area Ratios (LARs) and quantized with an accuracy according to index number. The prediction coefficients describing the temporal envelope are mapped to Line Spectral Frequencies (LSFs) and quantized.
Most of the data is updated every 384 samples for a 44.1 kHz input signal; the remaining data is updated at a multiple of this interval. An update interval of 384 samples corresponds to a subframe. Eight consecutive subframes are stored in one frame of the bit stream.
3.2. Parametric Stereo Coding
Since most audio material is produced in stereo, an efficient coding tool should also exploit the redundancies and irrelevancies of both channels simultaneously. Standard stereo coding tools like mid/side stereo  and intensity stereo  are not straightforward to use in conjunction with parametric coding, and the aim was also to develop a general stereo coding tool for low bit rates. Therefore, the novel Parametric Stereo (PS) tool was developed, in which the stereo image is coded on the basis of spatial cues. The PS tool as standardized in MPEG was developed in 2003 and primarily aimed to enhance the performance of SSC and HE-AAC at low bit rates.
In the context of SSC, the spatial cues can be considered to form the fourth object. Here, we treat this issue separately since, basically, this coding tool can be used in conjunction with any mono coder. Depending on the mono coder, it may be worthwhile to integrate the PS tool with parts of the mono coder. This has been done with HE-AAC; by sharing infrastructural parts like the time/frequency transform, a lean implementation is enabled with great savings in complexity. Details on the combination of HE-AAC and PS can be found in Section 4, and more theoretical background on the PS tool is available in [53, 54].
3.2.1. Stereo Analysis
The PS encoder precedes the SSC encoder (see Figure 20). The PS encoder compares the two input signals (left and right) for corresponding time/frequency tiles. The frequency bands are designed to approximate the psychoacoustically motivated ERB scale, while the length of the segments is closely matched to known limitations of the binaural hearing system (see [53, 54]). Essentially, three parameters are extracted per time/frequency tile, representing the perceptually most important spatial properties.
(i)Interchannel Level Difference (ILD), representing the level difference between the channels.
(ii)Interchannel Phase Difference (IPD), representing the phase difference between the channels. In the frequency domain this feature is largely interchangeable with an Interchannel Time Difference (ITD). The IPD is augmented by an additional Overall Phase Difference (OPD), describing the distribution of the phase adjustment over the left and right channels.
(iii)Interchannel Coherence (ICC), representing the coherence or cross-correlation between the channels.
While the first two parameters are coupled to the direction of sound sources, the third parameter is more associated with a spatial diffuseness (or width) of the source.
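For one time/frequency tile, these three cues reduce to simple statistics of the complex subband samples: a power ratio, the angle of the cross-spectrum, and its normalized magnitude. The sketch below is an illustrative formulation, not the exact estimator of the standardized PS encoder.

```python
import numpy as np

def stereo_cues(L, R, eps=1e-12):
    """Spatial cues of one time/frequency tile (illustrative sketch).

    L, R: complex subband samples of the left and right channel for one
    tile (e.g. one ERB-like band over one analysis segment).
    Returns (ILD in dB, IPD in radians, ICC in [0, 1]).
    """
    pL = np.sum(np.abs(L) ** 2)
    pR = np.sum(np.abs(R) ** 2)
    cross = np.sum(L * np.conj(R))
    ild = 10 * np.log10((pL + eps) / (pR + eps))
    ipd = np.angle(cross)
    icc = np.abs(cross) / np.sqrt(pL * pR + eps)
    return ild, ipd, icc
```

A right channel that is an attenuated, phase-shifted copy of the left yields the expected level and phase differences and full coherence.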
Subsequent to parameter extraction, the input signals are down-mixed to form a mono signal. The down-mix can be made by a trivial summing process, but preferably more advanced methods employing time alignment and energy preservation are used to avoid potential phase cancellation (and hence resulting timbre changes) in the down-mix. The down-mix is subsequently encoded using a mono SSC encoder resulting in a mono bit stream. The PS data are properly quantized according to perceptual criteria , while redundancy is removed by means of Huffman coding. Finally, the mono SSC bit stream is combined with the PS data into a joint output bit stream.
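The energy-preservation idea can be sketched per subband: rescale the plain sum so that the mono power equals the mean power of the two inputs, which compensates for any partial phase cancellation. This is a simplified illustration; the standardized down-mix is described in the cited papers.

```python
import numpy as np

def downmix(L, R, eps=1e-12):
    """Energy-preserving mono down-mix of one subband (sketch).

    A plain (L + R) / 2 sum can lose energy when the channels are out
    of phase; here the sum is rescaled so that the mono power equals
    the mean power of the two input channels.
    """
    s = 0.5 * (L + R)
    target = 0.5 * (np.sum(np.abs(L) ** 2) + np.sum(np.abs(R) ** 2))
    actual = np.sum(np.abs(s) ** 2)
    return s * np.sqrt(target / (actual + eps))
```

For channels 90 degrees out of phase, the naive sum loses half the power while the rescaled down-mix preserves it.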
3.2.2. Stereo Synthesis
The SSC decoder extended with a PS decoder is also outlined in Figure 20 and basically comprises the reverse process of the corresponding encoder. The SSC decoder generates a mono down-mix. Subsequently, the PS decoder reconstructs stereo output signals based on the PS parameters.
The stereo synthesis is performed per QMF subband by means of a 2×2 up-mix of the decoded mono down-mix $s$ and a decorrelated version $d$ of it:

$$\begin{bmatrix} \hat{l}[k,n] \\ \hat{r}[k,n] \end{bmatrix} = \begin{bmatrix} h_{11}[k,n] & h_{12}[k,n] \\ h_{21}[k,n] & h_{22}[k,n] \end{bmatrix} \begin{bmatrix} s[k,n] \\ d[k,n] \end{bmatrix},$$

where $k$ and $n$ denote the frequency subband and the QMF time slot, respectively. The elements $h_{11},\ldots,h_{22}$ of the up-mix matrix are the only up-mix variables actually derived from the stereo parameters. Details about the calculation of these matrix elements can be found in [36, 54, 57]. Finally, two hybrid QMF banks are used to generate the two output signals.
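As a numerical illustration of why a 2×2 mix of the mono signal and a decorrelated companion suffices, the sketch below constructs matrix elements that realize a requested ILD and ICC. This construction is NOT the standardized computation (see the cited papers); it merely picks one valid solution under the assumption that s and d are uncorrelated and equally strong.

```python
import numpy as np

def ps_upmix_matrix(ild_db, icc):
    """One simple choice of up-mix elements h11..h22 (illustrative).

    Chosen such that the synthesized channels exhibit the requested
    ILD (dB) and ICC when s and d are uncorrelated with equal power.
    """
    c = 10 ** (ild_db / 20)           # linear amplitude ratio left/right
    g = np.sqrt(2 / (1 + c ** 2))     # keeps the total power constant
    gl, gr = c * g, g
    phi = 0.5 * np.arccos(np.clip(icc, -1.0, 1.0))
    return (gl * np.cos(phi), gl * np.sin(phi),
            gr * np.cos(phi), -gr * np.sin(phi))

def ps_upmix(s, d, h):
    """Apply the 2x2 up-mix to the mono and decorrelated signals."""
    h11, h12, h21, h22 = h
    return h11 * s + h12 * d, h21 * s + h22 * d
```

Measuring the cues of the synthesized channels recovers the requested values up to sampling noise, confirming that the matrix elements alone carry the spatial image.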
3.3. SSC Performance
In order to be included in the MPEG-4 standard, the developed high-quality parametric coder needed to pass the requirements that were set out at the start of the standardization process. These requirements were twofold.
(i)The coder should provide the same mean quality at a 25% lower bit rate compared to the existing MPEG-4 state-of-the-art technology.
(ii)The coder should not provide lower quality for any item when operating at the same bit rate as the existing MPEG-4 state-of-the-art technology.
The existing MPEG-4 state-of-the-art technology at that moment in time was AAC.
These requirements have been assessed in a subjective verification test conducted by the Institut für Rundfunktechnik (IRT). In this section, we present some of the results that were reported at that time. The data discussed are taken from .
The listening tests performed for the MPEG-4 standardization were done in two stages. A set of 53 critical items was encoded using the SSC encoder (from Philips) and the AAC encoder (Fraunhofer Gesellschaft, FhG), respectively. The encoding was done at different bit rates and for mono as well as stereo material. The Institut für Rundfunktechnik (IRT) performed a prescreening test to select a set of 9 critical items for the final listening test, which was also executed at IRT. Here, we discuss only the results of this final listening test obtained with the stereo input material, since this is the more relevant data from an application point of view. Furthermore, the results from an additional listening test performed at Philips covering all 53 test items are presented.
In addition to the 9 encoded items, the IRT test included the hidden reference and two anchors, being the original material band-limited at 3.5 and 7 kHz. The test was performed with headphones using 26 listeners. The MUSHRA tool was used and the listeners were instructed to give a Mean Opinion Score (MOS).
The test was intended to be a blind test. However, since two completely different coding strategies were used, it was effectively far from blind. The AAC coder reduces its bandwidth when operating at low bit rates, and this is immediately recognized unless the material is already band-limited. The SSC encoder never applies a band limitation, this being an ingredient for reaching a high-quality encoding. The completely different artifacts introduced by the two coding schemes not only effectively prohibited a blind comparison but also made ranking the different coders a complicated task. Also, the results tend to be subject-dependent. For example, in older experiments involving SSC and AAC performed at Philips, we found that listeners who are well acquainted with band limitations (speech-coding experts) tended to perceive the AAC band limitation as less annoying than most other listeners did.
From Figure 23 we see that SSC operating at 25% lower bit rate than AAC (32 kbps) yields a better score for 35 items, a (statistically) equal score for 16 items, and a lower score for only 3 items. We note that exactly these latter three items have been included in the selected items for the test presented earlier and, in that sense, the test presented earlier is rather critical for the SSC coder. If we look at the mean preference over all items, SSC at 24 kbps is rated as slightly better than AAC at 32 kbps, this difference being statistically significant.
4. MPEG-4 HE-AAC v2
The combination of MPEG-2/4 AAC with the SBR bandwidth extension tool, as presented in Section 2, is also known as aacPlus and was standardized in MPEG-4 as HE-AAC .
Since the bandwidth extension enabled by SBR is in principle completely orthogonal to the channel extension provided by the Parametric Stereo (PS) tool introduced in Section 3.2, it is of interest to combine both tools in order to utilize the coding gain of both tools simultaneously.
4.2. Combining HE-AAC with PS
The PS tool allows for flexible configuration of the time and frequency resolution of the stereo parameters and supports different quantization accuracies. It is also possible to omit transmission of selected parameters completely. All this, in combination with time or frequency differential parameter coding and Huffman codebooks, makes it possible to operate this PS system over a large range of bit rates.
When an HE-AAC v2 coder is operated at a target bit rate of 24 kbps, the PS parameters require an average side information bit rate of 2 to 2.5 kbps, assuming 20 stereo bands for ILD and ICC. For lower target bit rates, the PS frequency resolution can be decreased to 10 bands, reducing the PS side information bit rate accordingly. On the other hand, the PS tool permits increasing the time and frequency resolution and transmitting IPD/OPD parameters, which improves the quality of the stereo reconstruction at the cost of 10 kbps or more of PS side information.
Based on the already existing HE-AAC profile, an HE-AAC v2 profile was defined that, in addition to AAC and SBR, includes the PS tool . In level 2 of the HE-AAC v2 profile, only a "baseline" version of the PS tool is implemented in order to limit the computational complexity. This baseline version includes a simpler version of the hybrid filter bank and does not implement IPD/OPD synthesis, but is still capable of decoding all possible PS bit stream configurations. The HE-AAC v2 decoder implementing level 2 of the HE-AAC v2 profile was also standardized in 3GPP as part of Release 6 , where it is referred to as "Enhanced aacPlus."
4.3. Listening Tests
At both test sites, it was found that HE-AAC v2 at 24 kbps achieves an average subjective quality that is equal to HE-AAC v1 stereo at 32 kbps and that is significantly better than HE-AAC v1 stereo at 24 kbps. It is of interest to relate these results to the MPEG-4 verification test . There, it was found that HE-AAC v1 stereo at 32 kbps achieved a subjective quality that was significantly better than AAC stereo at 48 kbps and was similar to or slightly worse than AAC stereo at 64 kbps. This shows that HE-AAC v2 achieves approximately three times the coding efficiency of AAC for stereo signals. Further MUSHRA tests have shown that HE-AAC v2 achieves a significantly better subjective quality than HE-AAC v1 stereo at 18 and 32 kbps as well.
An overview has been given of the technology defined in Amendments 1 and 2 to the 2001 edition of the MPEG-4 Audio standard. The performance of these techniques has been discussed on the basis of the delivered audio quality as indicated by listening tests. These show that the SBR, SSC, and PS technologies reach previously unattained points in the quality/bit-rate plane. In particular for low bit rate applications, the parametric coding techniques constitute valuable tools. This was essentially the basis for their acceptance into MPEG-4.
Since the finalization of the standard, the HE-AAC v2 codec has gained a wide market acceptance and is currently used in several mobile music download services, digital radio broadcasting systems, and Internet streaming applications.
- Audio Subgroup : Call for proposals for new tools for audio coding. ISO/IEC JTC1/SC29/WG11 N3794, 2001Google Scholar
- Makhoul J, Berouti M: Predictive and residual coding of speech. The Journal of the Acoustical Society of America 1979,66(6):1633-1640. 10.1121/1.383661View ArticleGoogle Scholar
- Makhoul J, Berouti M: High frequency regeneration in speech coding systems. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '79), April 1979, Washington, DC, USA 4: 428-431.View ArticleGoogle Scholar
- Epps J, Holmes W: A new technique for wideband enhancement of coded narrowband speech. Proceedings of IEEE Workshop on Speech Coding, June 1999, Porvoo, Finland 174-176.Google Scholar
- Liljeryd L, Ekstrand P, Henn F, Kjörling K: Source coding enhancement using spectral band replication. EP0940015B1, 2004Google Scholar
- Liljeryd L, Ekstrand P, Henn F, Kjörling K: Improved spectral translation. EP1285436B1, 2003Google Scholar
- Liljeryd L, Ekstrand P, Henn F, Kjörling K: Efficient spectral envelope coding using variable time/frequency resolution. EP1216474B1, 2004Google Scholar
- Liljeryd L, Ekstrand P, Henn F, Kjörling K: Enhancing perceptual performance of SBR and related HFR coding methods by adaptive noise-floor addition and noise substitution limiting. EP1157374B1, 2004Google Scholar
- Kjörling K, Ekstrand P, Henn F, Villemoes L: Enhancing perceptual performance of high frequency reconstruction coding methods by adaptive filtering. EP1342230B1, 2004Google Scholar
- Gröschel A, Schug M, Beer M, Henn F: Enhancing audio coding efficiency of MPEG Layer-2 with spectral band replication for digital radio (DAB) in a backwards compatible way. Proceedings of the 114th Convention of the Audio Engineering Society (AES '03), March 2003, Amsterdam, The Netherlands paper number 5850Google Scholar
- Ziegler T, Ehret A, Ekstrand P, Lutzky M: Enhancing mp3 with SBR: features and capabilities of the new mp3PRO algorithm. Proceedings of the 112th Convention of the Audio Engineering Society (AES '02), May 2002, Munich, Germany paper number 5560Google Scholar
- European Telecommunications Standards Institute : Digital Radio Mondiale (DRM); System Specification. ETSI ES 201 980 V2.2.1, October 2005Google Scholar
- European Telecommunications Standards Institute : Universal Mobile Telecommunications System (UMTS); General audio codec audio processing functions; Enhanced aacPlus general audio codec; Encoder specification; Spectral Band Replication (SBR) part (3GPP TS 26.404 version 6.0.0 Release 6). ETSI TS 126 404 V6.0.0, September 2004Google Scholar
- Ekstrand P: Bandwidth extension of audio signals by spectral band replication. Proceedings of the 1st IEEE Benelux Workshop on Model Based Processing and Coding of Audio (MPCA '02), November 2002, Leuven, BelgiumGoogle Scholar
- ISO/IEC : Coding of audio-visual objects—part 3: audio, AMENDMENT 1: bandwidth extension. ISO/IEC Int. Std. 14496-3:2001/Amd.1:2003, 2003Google Scholar
- Ehret A, Kjörling K, Rödén J, Purnhagen H, Hörich H: aacPlus, only a low-bitrate codec? Proceedings of the 117th Convention of the Audio Engineering Society (AES '04), October 2004, San Francisco, Calif, USA paper number 6199Google Scholar
- ITU-R Recommend. BS.1534 : Method for the subjective assessment of intermediate quality level of coding systems (MUSHRA). 2001Google Scholar
- ITU-T Recommend. P800 : Methods for subjective determination of transmission quality. 1996Google Scholar
- ISO/IEC JTC1/SC29/WG11 : Report on the verification tests of MPEG-4 High Efficiency AAC. ISO/IEC JTC1/SC29/WG11 N6009, October 2003Google Scholar
- Painter T, Spanias A: Perceptual coding of digital audio. Proceedings of the IEEE 2000,88(4):451-512. 10.1109/5.842996View ArticleGoogle Scholar
- Spanias AS: Speech coding: a tutorial review. Proceedings of the IEEE 1994,82(10):1541-1582. 10.1109/5.326413View ArticleGoogle Scholar
- Ali M: Adaptive signal representation with application in audio coding, Ph. D. thesis. University of Minnesota, Minneapolis, Minn, USA; 1996.Google Scholar
- Masri P: Computer modelling of sound for transformation and synthesis of musical signals, Ph. D. thesis. University of Bristol, Bristol, UK; 1996.Google Scholar
- Goodwin MM: Adaptive signal models: theory, algorithms, and audio applications, Ph.D. thesis. University of California, Berkeley, Calif, USA; 1997.Google Scholar
- Levine SN: Audio representation for data compression and compressed domain processing, Ph. D. thesis. Stanford University, Stanford, Calif, USA; 1999.Google Scholar
- Anderson DV: Audio signal enhancement using multi-resolution sinusoidal modeling, Ph. D. thesis. Georgia Institute of Technology, Atlanta, Ga, USA; 1999.Google Scholar
- Myburg FP: Design of a scalable parametric audio coder, Ph. D. thesis. Technische Universiteit Eindhoven, Eindhoven, The Netherlands; 2004.Google Scholar
- Vafin R: Towards flexible audio coding, Ph. D. thesis. KTH (Royal Institute of Technology), Stockholm, Sweden; 2004.Google Scholar
- Christensen MG: Estimation and modeling problems in parametric audio coding, Ph. D. thesis. Aalborg University, Aalborg, Denmark; 2005.Google Scholar
- Purnhagen H, Meine N: HILN-the MPEG-4 parametric audio coding tools. Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS '00), May 2000, Geneva, Switzerland 3: 201-204.Google Scholar
- ISO/IEC : Coding of audio-visual objects—part 3: audio, AMENDMENT 1: audio extensions (MPEG-4 Audio Version 2). ISO/IEC Int. Std. 14496-3:1999/Amd.1:2000, 2000Google Scholar
- Hedelin P: A tone-oriented voice-excited vocoder. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '81), March 1981, Atlanta, Ga, USA 1: 205-208.View ArticleGoogle Scholar
- Almeida L, Tribolet J: Nonstationary spectral modeling of voiced speech. IEEE Transactions on Acoustics, Speech, and Signal Processing 1983,31(3):664-678. 10.1109/TASSP.1983.1164128View ArticleGoogle Scholar
- McAulay R, Quartieri T: Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing 1986,34(4):744-754. 10.1109/TASSP.1986.1164910View ArticleGoogle Scholar
- Serra X, Smith J III: Spectral modeling synthesis. A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal 1990,14(4):12-24. 10.2307/3680788View ArticleGoogle Scholar
- ISO/IEC : Coding of audio-visual objects. Part3: audio, AMENDMENT 2: parametric coding of high quality audio. ISO/IEC Int. Std. 14496-3:2001/Amd2:2004, July 2004Google Scholar
- den Brinker AC, Schuijers EGP, Oomen AWJ: Parametric coding for high-quality audio. Proceedings of the 112th Convention of the Audio Engineering Society (AES '02), May 2002, Munich, Germany paper number 5554Google Scholar
- Schuijers EGP, Oomen AWJ, den Brinker AC, Gerrits AJ: Advances in parametric coding for high-quality audio. Proceedings of the 1st IEEE Benelux Workshop on Model Based Processing and Coding of Audio (MPCA '02), November 2002, Leuven, Belgium 73-79.Google Scholar
- Schuijers EGP, den Brinker AC, Oomen AWJ, Breebaart DJ: Advances in parametric coding for high-quality audio. Proceedings of the 114th Convention of the Audio Engineering Society (AES '03), March 2003, Amsterdam, The Netherlands paper number 5852Google Scholar
- den Brinker AC, Riera-Palou F: Pure linear prediction. Proceedings of the 115th Convention of the Audio Engineering Society (AES '03), October 2003, New York, NY, USA paper number 5924Google Scholar
- Levine SN, Smith JO III: A sines+transients+noise audio representation for data compression and time/pitch scale modifications. Proceedings of the 105th Convention of the Audio Engineering Society (AES '98), September 1998, San Francisco, Calif, USA paper number 4781Google Scholar
- Kliewer J, Mertens A: Audio subband coding with improved representation of transient signal segments. In Proceedings of the 9th European Signal Processing Conference (EUSIPCO '98), September 1998, Rhodos, Greece Edited by: Theodoridis S, Pitas I, Stouraitis A, Kalouptsidis N. 2345-2348.Google Scholar
- George EB, Smith MJT: A new speech coding model based on a least-squares sinusoidal representation. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '87), April 1987, Dallas, Tex, USA 1641-1644.View ArticleGoogle Scholar
- Goodwin M: Matching pursuit with damped sinusoids. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), April 1997, Munich, Germany 3: 2037-2040.Google Scholar
- den Brinker AC, Gerrits AJ, Sluijter RJ: Phase transmission in sinusoidal audio and speech coding. Proceedings of the 115th Convention of the Audio Engineering Society (AES '03), October 2003, New York, NY, USA paper number 5983Google Scholar
- Purnhagen H, Meine N, Edler B: Sinusoidal coding using loudness-based selection. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), May 2002, Orlando, Fla, USA 2: 1817-1820.Google Scholar
- Smith JO III, Abel JS: Bark and ERB bilinear transforms. IEEE Transactions on Speech and Audio Processing 1999,7(6):697-708. 10.1109/89.799695View ArticleGoogle Scholar
- Herre J, Johnston JD: Enhancing the performance of perceptual audio coders by using temporal noise shaping. Proceedings of the 101st Convention of the Audio Engineering Society (AES '96), November 1996, Los Angeles, Calif, USA paper number 4384Google Scholar
- Athineos M, Ellis DPW: Autoregressive modeling of temporal envelopes. IEEE Transactions on Signal Processing 2007,55(11):5237-5245.MathSciNetView ArticleGoogle Scholar
- Moore BCJ: An Introduction to the Psychology of Hearing. Academic Press, San Diego, Calif, USA; 1997.Google Scholar
- Johnston JD, Ferreira AJ: Sum-difference stereo transform coding. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '92), March 1992, San Francisco, Calif, USA 569-572.Google Scholar
- Herre J, Brandenburg K, Lederer D: Intensity stereo coding. Proceedings of the 96th Convention of the Audio Engineering Society (AES '94), February-March 1994, Amsterdam, The Netherlands paper number 3799Google Scholar
- Breebaart J, van de Par S, Kohlrausch A, Schuijers E: Parametric coding of stereo audio. EURASIP Journal on Applied Signal Processing 2005,2005(9):1305-1322. 10.1155/ASP.2005.1305View ArticleMATHGoogle Scholar
- Breebaart J, Faller C: Spatial Audio Processing: MPEG Surround and Other Applications. John Wiley & Sons, Chichester; 2007.View ArticleGoogle Scholar
- Schuijers E, Breebaart J, Purnhagen H, Engdegård J: Low complexity parametric stereo coding. Proceedings of the 116th Convention of the Audio Engineering Society (AES '04), May 2004, Berlin, Germany paper number 6073Google Scholar
- Engdegård J, Purnhagen H, Rödén J, Liljeryd L: Synthetic ambience in parametric stereo coding. Proceedings of the 116th Convention of the Audio Engineering Society (AES '04), May 2004, Berlin, GermanyGoogle Scholar
- Purnhagen H: Low complexity parametric stereo coding in MPEG-4. Proceedings of Digital Audio Effects Workshop (DAFX), October 2004, Naples, ItalyGoogle Scholar
- ISO/IEC JTC1/SC29/WG11 : Report on the verification test of MPEG-4 parametric coding for high-quality audio. ISO/IEC JTC1/SC29/WG11 N6675, 2004Google Scholar
- Purnhagen H, Engdegård J, Oomen W, Schuijers E: Combining low complexity parametric stereo with High Efficiency AAC. ISO/IEC JTC1/SC29/WG11 MPEG2003/M10385, December 2003Google Scholar
- ISO/IEC : Coding of audio-visual objects—part 3: audio, AMENDMENT 2: audio lossless coding (ALS), new audio profiles and BSAC extensions. ISO/IEC Int. Std. 14496-3:2005/Amd.2:2006, 2006Google Scholar
- 3rd Generation Partnership Project : 3GPP TS 26.401 V6.2.0, 3rd Generation Partnership Project; Technical Specification Group Services and System Aspects; General audio codec audio processing functions; Enhanced aacPlus general audio codec; General description. 2005.Google Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.