- Research Article
- Open Access
Wide-Band Audio Coding Based on Frequency-Domain Linear Prediction
© Petr Motlicek et al. 2010
- Received: 30 May 2009
- Accepted: 8 November 2009
- Published: 16 February 2010
We revisit an original concept of speech coding in which the signal is separated into the carrier modulated by the signal envelope. A recently developed technique, called frequency-domain linear prediction (FDLP), is applied for the efficient estimation of the envelope. The processing in the temporal domain allows for a straightforward emulation of the forward temporal masking. This, combined with an efficient nonuniform sub-band decomposition and the application of noise shaping in the spectral domain instead of the temporal domain (a technique to suppress artifacts in tonal audio signals), yields a codec that does not rely on the linear speech production model but rather uses the well-accepted concept of frequency-selective auditory perception. As such, the codec is not specific to speech coding but is also well suited for coding other important acoustic signals such as music and mixed content. The quality of the proposed codec at 66 kbps is evaluated using objective and subjective quality assessments. The evaluation indicates competitive performance with the MPEG codecs operating at similar bit rates.
- Quantization Noise
- Tonal Signal
- Quadrature Mirror Filter
- Audio Code
- Modify Discrete Cosine Transform
Modern speech coding algorithms are based on the source-filter model, wherein the model parameters are extracted using linear prediction principles applied in the temporal domain. Most popular audio coding algorithms are based on exploiting psychoacoustic models in the spectral domain [2, 3]. In this work, we explore signal processing methods to code speech and audio signals in a unified approach.
In traditional applications of speech coding (i.e., for conversational services), the algorithmic delay of the codec is one of the most critical variables. However, there are many services, such as downloading the audio files, voice messaging, or push-to-talk communications, where the issue of the codec delay is much less critical. This allows for a whole set of different coding techniques that could be more effective than the conventional short-term frame-based coding techniques.
Due to the development of new audio services, there has been a need for new audio compression techniques which would provide sufficient generality, that is, the ability to encode any kind of the input signal (speech, music, signals with mixed audio sources, and transient signals). Traditional approaches to speech coding based on source-filter model have become very successful in commercial applications for toll quality conversational services. However, they do not perform well for mixed signals in many multimedia services. On the other hand, perceptual codecs have become useful in media coding applications, but are not as efficient for speech content. These contradictions have recently turned into new initiatives of standardization organizations (3GPP, ITU-T, and MPEG), which are interested in developing a codec for compressing mixed signals, for example, speech and audio content.
This paper describes a coding technique that revisits (similar to ) the original concept of the first speech coder , where speech is seen as a carrier signal modulated by its temporal envelope. Our approach (first introduced in ) differs from  in its use of frequency-domain linear prediction (FDLP) [7–10], which allows for the approximation of temporal (Hilbert) envelopes of sub-band energies by an autoregressive (AR) model. Temporal noise shaping (TNS) , which forms a part of the MPEG-2/4 AAC codec, also uses FDLP, but applies it to solve problems with transient attacks (impulses). In contrast, the proposed codec employs FDLP to approximate relatively long (hundreds of milliseconds) segments of Hilbert envelopes in individual frequency sub-bands. Another approach, described in , exploits FDLP for sinusoidal audio coding using short-term segments.
The goal is to develop a novel wide-band (WB)-FDLP audio coding system that would explore new potentials in encoding mixed input including speech and audio by taking into account relatively long acoustic context directly in the initial step of encoding the input signal. Due to this acoustic context, the proposed coding technique is intended to be exploited in noninteractive audio services. Unlike interactive audio services such as VoIP or interactive games, the real-time constraints for the proposed codec are not stringent.
The paper is organized as follows. Section 2 discusses fundamental aspects of the FDLP technique. Section 3 mentions initial attempts to exploit FDLP for narrow-band speech coding. Section 4 describes the WB-FDLP audio codec in general and Section 5 gives the detailed description of the major blocks in the codec. Objective quality evaluations of the individual blocks are given in Section 6. Section 7 provides subjective quality assessment of the proposed codec compared with state-of-the-art MPEG audio codecs. Section 8 contains discussions and summarizes important aspects.
Inertia of the human vocal tract organs makes the modulations in the speech signal vary gradually. While short-term predictability within time spans of 10–20 ms and AR modeling of the signal have been used effectively [12, 13], there exists a longer-term predictability due to the inertia of human vocal organs and their neural control mechanisms. Therefore, the temporal evolutions of vocal tract shapes (and subsequently also of the short-term spectral envelopes of the signal) are predictable. In terms of compression efficiency, it is desirable to capitalize on this predictability by processing a longer temporal context for coding rather than processing each short-term segment independently. While such an approach obviously introduces a longer algorithmic delay, the efficiency gained may justify its deployment in many evolving communications applications. Initial encouraging experimental results were achieved on very low bit-rate speech coding  and feature extraction for automatic speech recognition .
In the proposed audio codec, we utilize the concept of linear prediction in the spectral domain on sub-band signals. After decomposing the signal into individual critical-bandwidth sub-bands, the sub-band signals are characterized by their envelope (amplitude) and carrier (phase) modulations. FDLP is then able to exploit the predictability of slowly varying amplitude modulations. The spectral representation of amplitude modulation in sub-bands, also called the "modulation spectrum", has been used in many engineering applications. Early work done in  for predicting speech intelligibility and characterizing room acoustics is now widely used in the industry . Recently, there have been many applications of such concepts for robust speech recognition [17, 18], audio coding , noise suppression , and so forth. In order to use information in the modulation spectrum (at important frequencies starting from the low range), a signal has to be processed over relatively long time scales. This is also the case for FDLP.
Defining the analytic signal in the sub-band as $s_a(n) = s(n) + j\hat{s}(n)$, where $\hat{s}(n)$ is the Hilbert transform of $s(n)$, the Hilbert envelope of $s(n)$ is defined as $|s_a(n)|^2$ (squared magnitude of the analytic signal) and the phase is represented by instantaneous frequency, denoted by the first derivative of $\angle s_a(n)$ (scaled by $1/2\pi$). Often, the term Hilbert carrier, denoted as $\cos(\angle s_a(n))$, is used for representing the phase. Here, $n$ denotes time samples.
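As a minimal sketch of these definitions, the following Python snippet builds the analytic signal with an FFT-based Hilbert transform and splits a toy sub-band signal into its Hilbert envelope and Hilbert carrier. It assumes NumPy is available; the function names `analytic_signal` and `hilbert_envelope_and_carrier` and the toy amplitude-modulated signal are illustrative choices and are not taken from the codec itself.

```python
import numpy as np

def analytic_signal(s):
    """Return the analytic signal of a real sequence, computed via the FFT."""
    n = len(s)
    spectrum = np.fft.fft(s)
    # Keep DC (and Nyquist for even n), double the positive frequencies,
    # and zero out the negative frequencies.
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def hilbert_envelope_and_carrier(s):
    """Squared-magnitude (Hilbert) envelope and cosine-of-phase carrier."""
    s_a = analytic_signal(s)
    envelope = np.abs(s_a) ** 2
    carrier = np.cos(np.angle(s_a))
    return envelope, carrier

# Toy sub-band signal (assumption): slow amplitude modulation on a tone.
n = np.arange(256)
am = 1.0 + 0.5 * np.cos(2 * np.pi * 2 * n / 256)
s = am * np.cos(2 * np.pi * 32 * n / 256)
envelope, carrier = hilbert_envelope_and_carrier(s)
```

For this toy input the envelope equals the squared amplitude modulation, and multiplying the square root of the envelope by the carrier gives back the original signal.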
2.1. Envelope Estimation
In this section, we describe, in detail, the approximation of temporal envelopes by AR model obtained using FDLP. To simplify the notation, we present the full-band version of the technique. The sub-band version is identical except that the technique is applied to the sub-band signal obtained by a filter bank decomposition.
where $\mathcal{Z}\{\cdot\}$ stands for the $z$-transform. Let the notation $\mathcal{F}\{\cdot\}$ denote the discrete Fourier transform (DFT), which is equivalent to the $z$-transform with $z = e^{j\omega_k}$.
It has been shown, for example, in , that the conventional TDLP fits the discrete power spectrum of an all-pole model to of the input signal. Unlike TDLP, where the time-domain sequence is modeled by linear prediction, FDLP applies linear prediction on the frequency-domain representation of the sequence. In our case, is first transformed by the discrete cosine transform (DCT). It can be shown that the DCT type I odd (DCT-Io) needs to be used . DCT-Io can also be viewed as the symmetrical extension of so that a new time-domain sequence is obtained ( , and ) and then DFT projected (i.e., relationship between the DFT and the DCT-Io). We obtain the real-valued sequence , where . The symmetrical extension avoids continuity problems at the boundaries of the time signal (often called Gibbs-type ringing).
stands for squared magnitude frequency response of the all-pole model. Equation (7) can be interpreted in such a way that the FDLP all-pole model fits Hilbert envelope of the symmetrically extended time-domain sequence . FDLP models the time-domain envelope in the same way as TDLP models the spectral envelope. Therefore, the same properties appear, such as accurate modeling of peaks rather than dips.
As a consequence, the new model will fit dips more accurately than the original model. This technique has been proposed for TDLP (called spectral transform linear prediction (STLP) ), and we apply this scheme for FDLP.
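The envelope estimation described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the codec's implementation: it assumes NumPy, uses an odd-length symmetric extension followed by a DFT to obtain the real-valued frequency-domain sequence, applies autocorrelation-method linear prediction solved with a Levinson-Durbin recursion, and omits the STLP-style control of the fit. The model order of 20, the toy test signal, and the names `levinson_durbin` and `fdlp_envelope` are hypothetical.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations.

    Returns the prediction-error filter a (with a[0] == 1) and the
    final prediction error energy.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def fdlp_envelope(x, order=20):
    """All-pole approximation of the temporal envelope of x (full-band)."""
    n = len(x)
    # Odd-length symmetric extension (length 2n - 1); its DFT is real-valued.
    q = np.concatenate([x, x[:0:-1]])
    spec = np.fft.fft(q).real[:n]
    # Autocorrelation of the frequency-domain sequence.
    r = np.array([np.dot(spec[:n - lag], spec[lag:]) for lag in range(order + 1)])
    a, err = levinson_durbin(r, order)
    # Evaluate the all-pole model; the angle axis is interpreted as time.
    theta = 2.0 * np.pi * np.arange(n) / len(q)
    response = np.exp(-1j * np.outer(theta, np.arange(order + 1))) @ a
    return err / np.abs(response) ** 2

# Toy input (assumption): a tone whose amplitude peaks around sample 300.
t = np.arange(1000)
x = (0.05 + np.exp(-0.5 * ((t - 300) / 30.0) ** 2)) * np.cos(0.3 * np.pi * t)
env = fdlp_envelope(x, order=20)
```

Because linear prediction is applied to a spectral representation, the resulting all-pole "spectrum" is read along the time axis, so its peak is expected to sit near the energy burst of the input rather than near a frequency.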
Initial experiments aiming at narrow-band (NB) speech coding ( kHz), reported in , suggest that FDLP applied on long temporal segments and excited with white noise signal provides a highly intelligible speech, but with whisper-like quality without any voicing at bit-rates below kbps. In these experiments, the input speech was split into nonoverlapping segments (hundreds of milliseconds long). Then, each segment was processed by DCT and partitioned into unequal frequency subsegments to obtain critical band-sized subbands. FDLP approximation was applied on each sub-band by carrying out autocorrelation linear prediction (LP) analysis on the subsegments of DCT transformed signals, yielding line spectral pair (LSP) descriptors of FDLP models. Resulting AR models approximate the Hilbert envelopes in critical band-sized sub-bands.
In case of very low bit-rate speech coding ( 1 kbps), a frequency decomposition into sub-bands was performed for every ms long input segment. In each sub-band, the FDLP model of order of was estimated. FDLP sub-band residuals (these signals represent sub-band Hilbert carriers for the sub-band FDLP encoded Hilbert envelopes) were substituted by white noise with uniform distribution. Such an algorithm provided subjectively much more natural signal than LPC10 standard (utilizing TDLP with order model equal to estimated every ms) operating at twice higher bit-rates .
For NB speech applications operating at kHz input signal, the sub-band residuals were split into equal length partially (5%—to avoid transient noise) overlapping segments. Each segment was heterodyned to DC range and Fourier transformed to yield spectral components of low-passed sub-band residuals. Commensurate number of spectral components in each sub-band was selected (using psychoacoustic model) and their parameters were vector quantized. In the decoder, the sub-band residuals were reconstructed and modulated with corresponding FDLP envelope. Individual DCT contributions from each critical sub-band were summed and inverse DCT was applied to reconstruct output signal .
The first experiment towards wide-band (WB)-FDLP audio coding ( kHz input signal) was motivated by the structure of the NB-FDLP speech codec operating at kHz. The initial frequency sub-band decomposition based on weighting of DCT transformed signal was extended (by adding more critical sub-bands) to encode wide-band input .
In , a more efficient FDLP-based version for WB audio coding was introduced. Initial critical bandwidth sub-band decomposition was replaced by quadrature mirror filter (QMF) bank. Then, FDLP was applied directly on QMF sub-band signals. Similar to the previous schemes, the sub-band FDLP residuals (the carrier signals for the FDLP-encoded Hilbert envelope) were further processed (more detailed description is given in Section 4.1). This WB-FDLP audio coding approach exploiting QMF-bank decomposition serves as a simplified version (base-line system) of the current WB-FDLP audio codec.
The WB-FDLP approach is further improved by several additional blocks, described later, to operate at 66 kbps for audio signals sampled at kHz. The encoder and decoder sides are described in the following sections.
LSFs are restored back at the encoder side and resulting AR model computed from quantized parameters is used to derive FDLP sub-band residuals (analysis-by-synthesis). Due to this operation, the quantization noise present in the sub-band temporal envelopes does not influence the reconstructed signal quality.
Magnitudes: Since a full-search VQ in this high-dimensional space would be computationally demanding, a split VQ approach is employed. Although the split VQ approach is suboptimal, it reduces computational complexity and memory requirements to manageable limits without severely affecting the VQ performance. We divide the input vector of spectral magnitudes into separate partitions of a lower dimension. The dimension of individual partitions varies with the frequency sub-band. In sub-bands 1–10, the input vector of spectral magnitudes is split into partitions (minimum codebook length is ). Spectral magnitudes in higher sub-bands are quantized less accurately, that is, the VQ codebook length increases. Overall, the VQ codebooks are trained (on a large audio database) for each partition using the LBG algorithm.
Phases: The distribution of the phase spectral components was found to be approximately uniform (having a high entropy). Their correlation across time is not significant. Hence a uniform SQ is employed for encoding the phase spectral components. The SQ resolution varies from to bits, depending on the energy levels given by the corresponding spectral magnitudes, and is performed by a technique called dynamic phase quantization (DPQ). DPQ is described in detail in Section 5.2.
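The split VQ of the spectral magnitudes can be illustrated with a short Python sketch (assuming NumPy). The partition dimensions, the codebook size, and the random codebooks below are placeholders chosen for illustration only; in the codec the codebooks are trained with the LBG algorithm, which is not reproduced here. The function names are likewise hypothetical.

```python
import numpy as np

def split_vq_encode(vector, codebooks):
    """Quantize each partition of `vector` with its own codebook.

    `codebooks` is a list of arrays of shape (num_codewords, partition_dim);
    the partition dimensions must add up to len(vector).
    """
    indices, start = [], 0
    for codebook in codebooks:
        dim = codebook.shape[1]
        part = vector[start:start + dim]
        # Full search within the partition only (squared Euclidean distance).
        distances = np.sum((codebook - part) ** 2, axis=1)
        indices.append(int(np.argmin(distances)))
        start += dim
    return indices

def split_vq_decode(indices, codebooks):
    """Concatenate the selected codewords back into one vector."""
    return np.concatenate([cb[i] for i, cb in zip(indices, codebooks)])

rng = np.random.default_rng(0)
# Hypothetical layout: three partitions (4 + 4 + 8 dims), 16 codewords each.
codebooks = [rng.random((16, dim)) for dim in (4, 4, 8)]
magnitudes = rng.random(16)
indices = split_vq_encode(magnitudes, codebooks)
reconstructed = split_vq_decode(indices, codebooks)
```

Searching each low-dimensional partition separately keeps the search cost proportional to the sum of the codebook sizes rather than their product, which is the complexity saving the split approach trades against optimality.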
To reduce bit rates, we apply an additional block, namely temporal masking (TM) which, together with DPQ, can efficiently control process of quantization. This block is described in Section 5.5. Furthermore, a technique called spectral noise shaping (SNS, Section 5.4) is applied for improving the quality of tonal signals by applying a TDLP filter prior to the FDLP processing. Detection of tonality followed by SNS is performed in each frequency sub-band independently.
More specifically, the transmitted VQ codebook indices are used to select appropriate codebook vectors for the magnitude spectral components. ms segments of the sub-band residuals are restored in the time domain from their spectral magnitude and phase information. An overlap-add (OLA) technique is applied to obtain ms sub-band residuals, which are then modulated by the FDLP envelope to obtain the reconstructed sub-band signal.
An additional step of bit-rate reduction is performed on the decoder side (see Section 5.3). FDLP residuals in frequency sub-bands above kHz are not transmitted, but they are substituted by white noise at the decoder. Subsequently, these residuals are modulated by corresponding sub-band FDLP envelopes.
Finally, a block of QMF synthesis is applied on the reconstructed sub-band signals to produce the output full-band signal.
This section describes, in detail, the major blocks employed in WB-FDLP audio codec mentioned in Section 4.
5.1. Nonuniform Sub-Band Decomposition
where and are frequency axes in Hertz and in bark, respectively.
Unlike NB-FDLP coder, DCT is applied on the ms long sub-band signal to obtain AR model in a given QMF sub-band. STLP technique (introduced in (8)) is used to control the fit of AR model.
Such nonuniform QMF decomposition provides good compromise between fine spectral resolution for low-frequency sub-bands and smaller number of FDLP parameters for higher bands. Furthermore, nonuniform QMF decomposition fits well into the perceptual audio coding scheme, where psychoacoustic models traditionally work in nonuniform (critical) sub-bands.
5.2. Dynamic Phase Quantization (DPQ)
To reduce bit rate for representing phase spectral components of the sub-band FDLP residuals, we perform DPQ. DPQ can be seen as a special case of magnitude-phase polar quantization applied to audio coding .
As DPQ follows an analysis-by-synthesis (AbS) scheme, no side information needs to be transmitted. This means that frequency positions of the phase components being dynamically quantized using different resolution do not have to be transmitted. Such information is available at the decoder side due to the perfect (lossless) reconstruction of magnitude components processed by AbS scheme.
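A minimal Python sketch of the idea behind DPQ follows (assuming NumPy). The phase resolution of each spectral component is derived only from the decoded magnitudes, which both encoder and decoder have, so the bit allocation itself needs no side information. The resolutions of 3 and 4 bits, the median-based rule, and the function names are assumptions made for illustration; they are not the values or the rule used in the codec.

```python
import numpy as np

def phase_bits(decoded_magnitudes, low_bits=3, high_bits=4):
    """Assign a finer phase resolution to the stronger components.

    Assumption: components at or above the median decoded magnitude
    receive `high_bits`, the others `low_bits`.
    """
    threshold = np.median(decoded_magnitudes)
    return np.where(decoded_magnitudes >= threshold, high_bits, low_bits)

def quantize_phase(phase, bits):
    """Uniform scalar quantization of phase with per-component resolution."""
    levels = 2 ** bits
    step = 2.0 * np.pi / levels
    index = np.round(phase / step).astype(int) % levels
    return index, index * step

rng = np.random.default_rng(1)
decoded_magnitudes = rng.random(32)        # stand-in for VQ-decoded magnitudes
phase = rng.uniform(-np.pi, np.pi, 32)
bits = phase_bits(decoded_magnitudes)      # computable at encoder and decoder
index, phase_hat = quantize_phase(phase, bits)
```

The decoder repeats `phase_bits` on its own copy of the decoded magnitudes and therefore knows how many bits to read for each phase index.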
5.3. White Noise Substitution
The detailed analysis of sub-band FDLP residuals shows that FDLP residuals from low-frequency sub-bands resemble FM modulated signals. However, in high-frequency sub-bands, the FDLP residuals have properties of white noise. According to these findings, we replace the FDLP residuals in frequency sub-bands above kHz (last bands) with white noise generated at the decoder side. These white noise residuals are then modulated by the corresponding sub-band FDLP envelopes. White noise substitution of high-sub-band residuals has a minimal impact on the quality of reconstructed audio (even for tonal signals) while providing a significant bit-rate reduction.
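The decoder-side substitution can be sketched as follows in Python (assuming NumPy). The uniform noise distribution, the toy envelope, and the function name are illustrative assumptions; the envelope is treated as a squared magnitude, so its square root is used as the amplitude modulator.

```python
import numpy as np

def noise_substituted_subband(envelope, rng):
    """Replace the residual (carrier) by white noise and modulate it.

    `envelope` is the decoded FDLP (squared-magnitude) envelope of the
    sub-band; the carrier is drawn at the decoder, so it costs no bits.
    """
    carrier = rng.uniform(-1.0, 1.0, size=len(envelope))
    return np.sqrt(envelope) * carrier

rng = np.random.default_rng(2)
t = np.arange(400)
# Toy envelope (assumption): a single energy burst on a low floor.
envelope = 0.01 + np.exp(-0.5 * ((t - 200) / 20.0) ** 2)
subband = noise_substituted_subband(envelope, rng)
```

Only the envelope parameters have to be transmitted for such a sub-band, which is where the bit-rate saving comes from.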
5.4. Spectral Noise Shaping (SNS)
The FDLP codec is most suitable to encode signals, such as glockenspiel, having impulsive temporal content, that is, signals whose sub-band instantaneous energies can be characterized by an AR model. Therefore, FDLP is robust to "pre-echo" [7, 26] (i.e., quantization noise is spread before the onset of the signal and may even exceed the original signal components in level during certain time intervals). However, for signals having impulsive spectral content, such as tonal signals, FDLP modeling approach is not appropriate. Here, most of the important signal information is present in the FDLP residual. For such signals, the quantization error in the FDLP codec spreads across all the frequencies around the tone. This results in significant degradation in the reconstructed signal quality.
This can be seen as the dual problem to encoding transients in the time domain, as done in many conventional codecs such as . This is efficiently solved by temporal noise shaping (TNS) . Specifically, coding artifacts arise mainly in handling transient signals (like the castanets) and pitched signals. Using spectral signal decomposition for quantization and encoding implies that a quantization error introduced in this domain will spread out in time after reconstruction by the synthesis filter bank. This phenomenon is called time/frequency uncertainty principle  and can cause "pre-echo" artifacts (i.e., a short noise-like event preceding a signal onset) which can be easily perceived. TNS represents a solution to overcome this problem by shaping the quantization noise in the time domain according to the input transient.
The proposed WB-FDLP audio codec exploits SNS technique to overcome problems in encoding tonal signals . It is based on the fact that tonal signals are highly predictable in the time domain. If a sub-band signal is found to be tonal, it is analyzed using TDLP  and the residual of this operation is processed with the FDLP codec. At the decoder, the output of the FDLP codec is filtered by the inverse TDLP filter.
Tonality detector (TD): TD identifies the QMF sub-band signals which have strong tonal components. Since FDLP performs well on nontonal and partially tonal signals, TD ensures that only pure tonal signals are identified. For this purpose, global tonality detector (GTD) and local tonality detector (LTD) measures are computed and the tonality decision is taken based on both measures. GTD measure is based on the spectral flatness measure (SFM, defined as the ratio of the geometric mean to the arithmetic mean of the spectral magnitudes) of the full-band signal. If the SFM is below the threshold, that is, GTD has identified input frame as tonal, LTD is employed. LTD is defined based on the spectral autocorrelation of the sub-band signals (used for estimation of FDLP envelopes).
SNS processing: If GTD and LTD have identified a sub-band signal to have a tonal character, such sub-band signal is filtered through the TDLP filter followed by FDLP model. Model orders of both models are equal to as compared to a FDLP model order of for the nontonal signals. At the decoder side, inverse TDLP filtering on the FDLP decoded signal gives the sub-band signal back.
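The global tonality measure can be illustrated with a short Python sketch of the spectral flatness measure (assuming NumPy). Only the GTD step is shown; the local detector based on spectral autocorrelation is not reproduced. The threshold of 0.1, the small constant `eps`, the toy signals, and the function names are assumptions for illustration and are not the codec's settings.

```python
import numpy as np

def spectral_flatness(frame, eps=1e-12):
    """Ratio of geometric to arithmetic mean of the spectral magnitudes."""
    magnitudes = np.abs(np.fft.rfft(frame)) + eps  # eps avoids log(0)
    geometric_mean = np.exp(np.mean(np.log(magnitudes)))
    return float(geometric_mean / np.mean(magnitudes))

def is_globally_tonal(frame, threshold=0.1):
    """GTD-style decision: a low flatness indicates a tonal frame."""
    return spectral_flatness(frame) < threshold

n = np.arange(1024)
tone = np.cos(2 * np.pi * 64 * n / 1024)                  # strongly tonal
noise = np.random.default_rng(3).standard_normal(1024)    # noise-like
```

A flat, noise-like spectrum gives a value close to one, while a spectrum dominated by a few peaks gives a value close to zero, so the pure tone above is flagged as tonal and the noise frame is not.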
5.5. Temporal Masking (TM)
A perceptual model, which performs temporal masking, is applied in WB-FDLP codec to reduce bit-rates. Temporal masking is a property of the human ear, where the sounds appearing within a temporal interval of about ms after a signal component get masked. Such auditory masking property provides an efficient solution for quantization of a signal.
By processing relatively long temporal segments in frequency sub-bands, the FDLP audio codec allows for a straightforward exploitation of temporal masking, while its implementation in more conventional codecs based on short-term spectra has so far been quite limited; one notable exception is the recently proposed wavelet-based codec .
To estimate the masking threshold at each sample index, we compute a short-term dB SPL so that the signal is divided into ms overlapping frames with frame shifts of sample.
The assumptions made in the applied linear model, such as sinusoidal nature of the masker and the signal, minimum duration of the masker ( ms), and minimum duration of the signal ( ms) may differ from real audio signal encoding conditions. Therefore, the actual masking thresholds are much below the thresholds obtained from the linear masking model. To obtain the actual thresholds, informal listening experiments were conducted to determine the correction factors .
These masking thresholds are then utilized in quantizing the sub-band FDLP residual signals. The number of bits required for representing the sub-band FDLP residuals is reduced in accordance with TM thresholds compared to the WB-FDLP codec without TM. Since the sub-band signal is the product of its FDLP envelope and residual (carrier), the masking thresholds for the residual signal are obtained by subtracting the dB SPL of the envelope from that of the sub-band signal. First, we estimate the quantization noise present in the WB-FDLP codec without TM. If the mean of the quantization noise (in ms sub-band signal) is above the masking threshold, no bit-rate reduction is applied. If the temporal mask mean is above the noise mean, then the amount of bits needed to encode that sub-band FDLP residual signal is reduced in such a way that the noise level becomes similar to the masking threshold.
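The bit-reduction rule in this paragraph can be sketched as a small Python function. The conversion of roughly 6.02 dB of quantization noise per bit is a textbook approximation for uniform quantizers and is used here as an assumption; the codec's actual mapping from masking margin to bits is not given in the text. The function name and the `min_bits` floor are hypothetical.

```python
import math

# Assumption: each bit removed from a uniform quantizer raises the
# quantization noise by about 6.02 dB.
DB_PER_BIT = 6.02

def bits_after_temporal_masking(bits, noise_db, mask_db, min_bits=1):
    """Reduce the bit budget of a sub-band residual using the masking margin.

    `noise_db` is the mean quantization noise level without TM and
    `mask_db` the mean temporal masking threshold, both in dB SPL.
    """
    if noise_db >= mask_db:
        # Noise is already at or above the threshold: no reduction.
        return bits
    removable = math.floor((mask_db - noise_db) / DB_PER_BIT)
    return max(min_bits, bits - removable)
```

For example, with 8 bits, a noise mean of 20 dB, and a mask mean of 33 dB, the 13 dB margin allows two bits to be dropped under this assumption, whereas a noise mean above the mask leaves the allocation unchanged.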
To evaluate individual blocks employed in WB-FDLP codec, we perform quality assessment and provide achieved results with obtained bit-rate reductions. For the quality assessment, perceptual evaluation of audio quality (PEAQ) distortion measure  is used. PEAQ measure, based on the ITU-R BS.1387 standard, estimates the perceptual degradation of the test signal with respect to the reference signal. The output combines a number of model output variables (MOVs) into a single measure, the objective difference grade (ODG) score, which is an impairment scale with meanings shown in Table 2.
List of audio/speech recordings selected for objective quality assessment. denotes recordings used in MUSHRA subjective listening test. denotes recordings used in BS.1116 subjective listening test.
Chinese female, es02 , es03, louis raquin , te19
Brahms, dongwoo , es01 , phi1, phi2 -phi_3 -phi7, salvation, sc03 , te09 , te15 , trilogy
Speech and music
Arirang, Green, Wedding, te1_mg54
Speech over music
Noodleking , te16_fe49, twinkle
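Table 2: PEAQ (ODG) scores and their meanings.
ODG score: Quality
0: Imperceptible
−1: Perceptible but not annoying
−2: Slightly annoying
−3: Annoying
−4: Very annoying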
Base-Line System (170 kbps): This version of the codec employs uniform QMF decomposition, and DFT magnitudes of sub-band residuals are quantized using split VQ. Quantization of the spectral magnitudes using the split VQ allocates about kbps for all the frequency sub-bands. DFT phases are uniformly quantized using bits. Such a codec operates at 170 kbps .
Nonuniform QMF Decomposition (104 kbps): Employment of nonuniform QMF bank decomposition (Section 5.1) significantly reduces the bit rate from 170 kbps to 104 kbps (about 39%), while the overall objective quality is degraded by about .
Dynamic Phase Quantization (DPQ) (84 kbps): Employment of DPQ (Section 5.2) provides a bit-rate reduction of about 20 kbps (from 104 to 84 kbps), while the overall objective quality is degraded by about .
Noise Substitution (73 kbps): Subsequent white noise substitution of high-frequency sub-band FDLP residuals (Section 5.3) reduces the bit rate to 73 kbps (by about 11 kbps), while the overall objective quality is degraded by about .
Spectral Noise Shaping (SNS) (73 kbps): The SNS block, employed to improve encoding of highly tonal signals (Section 5.4), increases the bit rate by bps (to transmit the binary decision about employment of SNS in each sub-band). SNS does not affect the encoding of nontonal signals. Overall quality was slightly improved by about . For the purpose of detailed evaluation, additional test signals with strong tonality structure, downloaded from , were used in the experiments. Due to the application of SNS, the objective quality of each of these recordings is improved, as shown in Figure 11. The average objective quality score (average PEAQ score) for these samples is improved by about .
Temporal Masking (TM) (66 kbps): The final block simulates temporal masking to modify quantization levels of spectral components of sub-band residuals according to perceptual significance (Section 5.5). The bit-rate reduction is about 7 kbps for an average PEAQ degradation of about .
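The bit-rate figures quoted in the headings of the individual blocks can be tallied with a few lines of Python. The rates are the rounded kbps values stated for each stage; the variable and function names are illustrative.

```python
# Bit rate (kbps) after each stage, as quoted in the stage headings.
stages = [
    ("Base-line system", 170),
    ("Nonuniform QMF decomposition", 104),
    ("Dynamic phase quantization", 84),
    ("Noise substitution", 73),
    ("Spectral noise shaping", 73),
    ("Temporal masking", 66),
]

def stage_savings(stages):
    """Absolute (kbps) and relative (%) saving of each stage over the previous one."""
    rows = []
    for (_, previous), (name, rate) in zip(stages, stages[1:]):
        saved = previous - rate
        rows.append((name, saved, 100.0 * saved / previous))
    return rows

savings = stage_savings(stages)
total_percent = 100.0 * (stages[0][1] - stages[-1][1]) / stages[0][1]
```

At this rounding, SNS shows no change because its side information amounts to only some bits per second, and the overall reduction from the base-line system to the final codec is roughly 61%.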
The final version of the WB-FDLP codec operating at 66 kbps, which employs all the blocks described and evaluated in Sections 5 and 6, respectively, is compared with the state-of-the-art MPEG audio codecs. In our evaluations, the following two codecs are considered.
(2) MPEG-4 HE-AAC (V8.0.3), v1 at kbps . The HE-AAC coder is the combination of spectral band replication (SBR)  and advanced audio coding (AAC)  and was standardized as high-efficiency AAC (HE-AAC) in Extension 1 of MPEG-4 Audio .
A novel wide-band audio compression system for medium bit rates is presented. The audio codec is based on processing relatively long temporal segments of the input audio signal. Frequency-domain linear prediction (FDLP) is applied to exploit predictability of temporal evolution of spectral energies in nonuniform sub-bands of the signal. This yields sub-band residuals, which are quantized using temporal masking. The use of FDLP ensures that fine temporal details of the signal envelopes are captured with high temporal resolution. Several additional techniques are used to reduce the final bit rate. The proposed compression system is relatively simple and suitable for coding both speech and music.
The performance of some individual processing steps is evaluated using objective perceptual evaluation of audio quality, standardized by ITU-R (BS.1387). The final performance of the codec at 66 kbps is evaluated using subjective quality evaluation (MUSHRA and BS.1116 standardized by ITU-R). The subjective evaluation results suggest that the proposed WB-FDLP codec provides better audio quality than LAME-MP3 codec at kbps and produces slightly worse results compared to MPEG-4 HE-AAC standard at kbps.
We stress that the codec processes each frequency sub-band independently without taking into account sub-band correlations, which could further reduce the bit rate. This strategy has been pursued intentionally to ensure robustness to packet losses. The drop-out of bit-packets in the proposed codec corresponds to loss of sub-band signals at the decoder. In , it has been shown that the degraded sub-band signals can be efficiently recovered from the adjacent sub-bands in time-frequency plane which are unaffected by the channel.
From the computational complexity point of view, the proposed codec does not perform highly demanding operations. Linear prediction coefficients of the FDLP model are estimated using the fast Levinson-Durbin recursion. Most of the computational cost is due to the search for appropriate codewords to vector quantize magnitude spectral components of the sub-band residuals. However, codebook search limitations, which also apply in traditional speech codecs such as CELP, have already been overcome by various techniques (e.g., two-stage algebraic-stochastic quantization scheme).
QMF: it performs frequency-domain alias cancellation. This is a dual property to a technique called time-domain alias cancellation (TDAC). TDAC ensures perfect invertibility of the modified discrete cosine transform (MDCT) used in AAC codecs.
SNS: Unlike temporal noise shaping (TNS), employed in AAC codecs to overcome problems with transient signals, SNS improves the quality of highly tonal signals compressed by the WB-FDLP codec.
TM: it is a psychoacoustic phenomenon exploited to significantly reduce bit rates while maintaining the quality of the reconstructed audio. TM is often referred to as nonsimultaneous masking (part of auditory masking), where a sudden stimulus sound renders inaudible other sounds that immediately precede or follow the stimulus. Since the effectiveness of TM lasts approximately ms (in case of the offset attenuation), TM is a powerful and easily implementable technique in the FDLP codec. Frequency masking (FM) (or simultaneous masking) is a dual phenomenon to TM, where a sound is made inaudible by a "masker", a noise of the same duration as the original sound. FM is exploited in most of the psychoacoustic models used by traditional audio codecs.
Modern audio codecs combine some of the previously mentioned dual techniques (e.g., QMF and MDCT implemented in adaptive transform acoustic coding (ATRAC) developed by Sony ) to improve perceptual quality and bit rates. Due to this, we believe that there is still potential to improve the efficiency of the FDLP codec that has not been pursued yet. For instance, the proposed version of the codec does not utilize standard entropy coding. Further, SNRs in the individual sub-bands are not evaluated, and signal-dependent nonuniform quantization in different frequency sub-bands (e.g., the frequency masking module discussed above) and at different time instants (e.g., a bit reservoir) is not employed. Inclusion of these techniques should further reduce the required bit rates and provide bit-rate scalability, which forms part of our future work.
This work was partially supported by grants from ICSI Berkeley, USA and the Swiss National Center of Competence in Research (NCCR) on "Interactive Multi-modal Information Management (IM)2" and managed by the IDIAP Research Institute on behalf of the Swiss Federal Authorities. The authors would like to thank Vijay Ullal and Marios Athineos for their active involvement in the development of the codec. They would also like to thank the reviewers for providing numerous helpful comments on the manuscript.
- Schroeder MR, Atal BS: Code-excited linear prediction (CELP): high-quality speech at very low bit rates. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '85), April 1985, Tampa, Fla, USA 10: 937-940.
- Brandenburg K, et al.: The ISO/MPEG-audio codec: a generic standard for coding of high quality digital audio. Proceedings of the 92nd Convention of Audio Engineering Society (AES '92), 1992, New York, NY, USA preprint 3336
- Herre J, Dietz M: MPEG-4 high-efficiency AAC coding. IEEE Signal Processing Magazine 2008, 25(3):137-142. 10.1109/MSP.2008.918684
- Vinton MS, Atlas LE: Scalable and progressive audio codec. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), April 2001, Salt Lake City, Utah, USA 5: 3277-3280.
- Dudley H: The carrier nature of speech. Bell System Technical Journal 1940, 19(4):495-515.
- Motlicek P, Hermansky H, Garudadri H, Srinivasamurthy N: Speech coding based on spectral dynamics. In Proceedings of the 9th International Conference on Text, Speech and Dialogue (TSD '06), September 2006, Brno, Czech Republic, Lecture Notes in Computer Science. Volume 4188. Springer; 471-478.
- Herre J, Johnston JH: Enhancing the performance of perceptual audio coders by using temporal noise shaping (TNS). Proceedings of the 101st Convention of Audio Engineering Society (AES '96), November 1996 preprint 4384
- Athineos M, Ellis DPW: Sound texture modelling with linear prediction in both time and frequency domains. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), April 2003, Hong Kong 648-651.
- Kumaresan R, Rao A: Model-based approach to envelope and positive instantaneous frequency estimation of signals with speech applications. Journal of the Acoustical Society of America 1999, 105(3):1912-1924. 10.1121/1.426727
- Ganapathy S, Motlicek P, Hermansky H, Garudadri H: Autoregressive modeling of Hilbert envelopes for wide-band audio coding. Proceedings of the 124th Convention of Audio Engineering Society (AES '08), May 2008, Amsterdam, The Netherlands
- Christensen MG, Jensen SH: Computationally efficient amplitude modulated sinusoidal audio coding using frequency-domain linear prediction. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '06), May 2006, Toulouse, France 189-192.
- Itakura F, Saito S: Analysis synthesis telephony based on the maximum likelihood method. In Proceedings of the 6th International Congress on Acoustics, August 1968 Edited by: Kohasi Y. C17-C20. paper no. C-5-5
- Atal BS, Schroeder MR: Adaptive predictive coding of speech signals. Bell System Technical Journal 1970, 49(8):1973-1986.
- Ganapathy S, Thomas S, Hermansky H: Modulation frequency features for phoneme recognition in noisy speech. Journal of the Acoustical Society of America 2009, 125(1):EL8-EL12. 10.1121/1.3040022
- Houtgast T, Steeneken HJM, Plomp R: Predicting speech intelligibility in rooms from the modulation transfer function, I. General room acoustics. Acustica 1980, 46(1):60-72.
- IEC 60268-16: Sound system equipment – part 16: objective rating of speech intelligibility by speech transmission index. http://www.iec.ch
- Kingsbury BED, Morgan N, Greenberg S: Robust speech recognition using the modulation spectrogram. Speech Communication 1998, 25(1–3):117-132. 10.1016/S0167-6393(98)00032-6
- Athineos M, Hermansky H, Ellis DPW: LP-TRAP: linear predictive temporal patterns. Proceedings of the 8th International Conference on Spoken Language Processing (ICSLP '04), October 2004, Jeju, South Korea 1154-1157.
- Falk TH, Stadler S, Kleijn WB, Chan W: Noise suppression based on extending a speech-dominated modulation band. Proceedings of the European Conference on Speech Communication and Technology (Interspeech '07), August 2007, Antwerp, Belgium 970-973.
- Makhoul J: Linear prediction: a tutorial review. Proceedings of the IEEE 1975, 63(4):561-580. 10.1109/PROC.1975.9792
- Athineos M, Ellis DPW: Autoregressive modeling of temporal envelopes. IEEE Transactions on Signal Processing 2007, 55(11):5237-5245. 10.1109/TSP.2007.898783
- Oppenheim AV, Schafer RW: Discrete-Time Signal Processing. 2nd edition. Prentice-Hall, Upper Saddle River, NJ, USA; 1998.
- Churchill RV, Brown W: Introduction to Complex Variables Applications. 5th edition. McGraw-Hill, New York, NY, USA; 1982.
- Hermansky H, Fujisaki H, Sato Y: Analysis and synthesis of speech based on spectral transform linear predictive method. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '83), April 1983, Boston, Mass, USA 8: 777-780.
- Motlicek P, Hermansky H, Ganapathy S, Garudadri H: Non-uniform speech/audio coding exploiting predictability of temporal evolution of spectral envelopes. In Proceedings of the 10th International Conference on Text, Speech and Dialogue (TSD '07), September 2007, Pilsen, Czech Republic, Lecture Notes in Computer Science. Volume 4629. Springer; 350-357.
- Motlicek P, Ullal V, Hermansky H: Wide-band perceptual audio coding based on frequency-domain linear prediction. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '07), April 2007, Honolulu, Hawaii, USA 265-268.
- Motlicek P, Ganapathy S, Hermansky H, Garudadri H: Frequency domain linear prediction for QMF sub-bands and applications to audio coding. In Proceedings of the 4th International Workshop on Machine Learning for Multimodal Interaction (MLMI '07), June 2007, Brno, Czech Republic, Lecture Notes in Computer Science. Volume 4892. Springer; 248-258.
- Itakura F: Line spectrum representation of linear predictive coefficients of speech signals. Journal of the Acoustical Society of America 1975, 57: S35. 10.1121/1.1995189
- Charbonnier A, Rault J-B: Design of nearly perfect non-uniform QMF filter banks. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '88), April 1988, New York, NY, USA 1786-1789.
- Motlicek P, Ganapathy S, Hermansky H, Garudadri H, Athineos M: Perceptually motivated sub-band decomposition for FDLP audio coding. In Proceedings of the 11th International Conference on Text, Speech and Dialogue (TSD '08), September 2008, Brno, Czech Republic, Lecture Notes in Computer Science. Volume 5246. Springer; 435-442.
- Vafin R, Kleijn WB: Entropy-constrained polar quantization and its application to audio coding. IEEE Transactions on Speech and Audio Processing 2005, 13(2):220-232. 10.1109/TSA.2004.840942
- Pinsky MA: Introduction to Fourier Analysis and Wavelets. Brooks/Cole, Pacific Grove, Calif, USA; 2002.
- Ganapathy S, Motlicek P, Hermansky H, Garudadri H: Spectral noise shaping: improvements in speech/audio codec based on linear prediction in spectral domain. Proceedings of the European Conference on Speech Communication and Technology (Interspeech '08), September 2008, Brisbane, Australia
- Sinaga F, Gunawan TS, Ambikairajah E: Wavelet packet based audio coding using temporal masking. Proceedings of the IEEE Conference on Information, Communications and Signal Processing, December 2003, Singapore 1380-1383.
- Jesteadt W, Bacon SP, Lehman JR: Forward masking as a function of frequency, masker level, and signal delay. Journal of the Acoustical Society of America 1982, 71(4):950-962. 10.1121/1.387576
- Ganapathy S, Motlicek P, Hermansky H, Garudadri H: Temporal masking for bit-rate reduction in audio codec based on frequency domain linear prediction. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '08), April 2008, Las Vegas, Nev, USA 4781-4784.
- ITU-R Recommendation BS.1387: Method for objective psychoacoustic model based on PEAQ to perceptual audio measurements of perceived audio quality. December 1998.
- ISO/IEC JTC1/SC29/WG11: Framework for exploration of speech and audio coding. MPEG2007/N9254, Lausanne, Switzerland, July 2007
- Musical instrumental samples, http://theremin.music.uiowa.edu/MIS.html
- LAME-MP3 codec, http://lame.sourceforge.net
- Dietz M, Liljeryd L, Kjorling K, Kunz O: Spectral band replication, a novel approach in audio coding. Proceedings of the 112th Convention of Audio Engineering Society (AES '02), May 2002, Munich, Germany preprint 5553
- Bosi M, Brandenburg K, Quackenbush S, et al.: ISO/IEC MPEG-2 advanced audio coding. Journal of the Audio Engineering Society 1997, 45(10):789-814.
- ISO/IEC: Coding of audio-visual objects – part 3: audio, AMENDMENT 1: bandwidth extension. ISO/IEC Int. Std. 14496-3:2001/Amd.1:2003, 2003
- ITU-R Recommendation BS.1534: Method for the subjective assessment of intermediate audio quality. June 2001.
- Homepage with encoded samples, http://www.idiap.ch/~pmotlic
- ITU-R Recommendation BS.1116: Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems. October 1997.
- Ganapathy S, Motlicek P, Hermansky H: Error resilient speech coding using sub-band Hilbert envelopes. In Proceedings of the 12th International Conference on Text, Speech and Dialogue (TSD '09), September 2009, Pilsen, Czech Republic, Lecture Notes in Computer Science. Volume 5729. Springer; 355-362.
- Tsutsui K, Suzuki H, Shimoyoshi O, Sonohara M, Akagiri K, Heddle RM: ATRAC: adaptive transform acoustic coding for MiniDisc. Proceedings of the 93rd Convention of Audio Engineering Society (AES '92), October 1992, San Francisco, Calif, USA preprint 3456
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.