- Research Article
- Open Access
Comparison of Linear Prediction Models for Audio Signals
© T. van Waterschoot and M. 2008
- Received: 12 June 2008
- Accepted: 18 December 2008
- Published: 18 March 2009
While linear prediction (LP) has become immensely popular in speech modeling, it does not seem to provide a good approach for modeling audio signals. This is somewhat surprising, since a tonal signal consisting of a number of sinusoids can be perfectly predicted based on an (all-pole) LP model with a model order that is twice the number of sinusoids. We explain why this result cannot simply be extrapolated to LP of audio signals. If noise is taken into account in the tonal signal model, a low-order all-pole model appears to be appropriate only when the tonal components are uniformly distributed in the Nyquist interval. Based on this observation, different alternatives to the conventional LP model can be suggested. Either the model should be changed to a pole-zero, a high-order all-pole, or a pitch prediction model, or the conventional LP model should be preceded by an appropriate frequency transform, such as a frequency warping or downsampling. By comparing these alternative LP models to the conventional LP model in terms of frequency estimation accuracy, residual spectral flatness, and perceptual frequency resolution, we obtain several new and promising approaches to LP-based audio modeling.
- Fundamental Frequency
- Audio Signal
- Linear Prediction
- Tonal Component
- Linear Prediction Model
Linear prediction (LP) is a widely used and well-understood technique for the analysis, modeling, and coding of speech signals. Its success can be attributed to its correspondence with the speech generation process. The vocal tract can be modeled as a slowly time-varying, low-order all-pole filter, while the glottal excitation can be represented either by a white noise sequence (for unvoiced sounds) or by an impulse train generated by periodic vibrations of the vocal cords (for voiced sounds). By using this so-called source-filter model, a speech segment can be whitened with a cascade of a formant predictor for removing short-term correlation and a pitch predictor for removing long-term correlation.
The source-filter model is much less popular in audio analysis than in speech analysis. First of all, the generation of musical sounds is highly dependent on the instruments used; hence, it is hard to propose a generic audio signal generation model. Second, from a physical point of view, polyphonic audio signals should be analyzed using multiple source-filter models, which seems rather impractical. Finally, the enormous success of perceptual audio coders and the recent advent of parametric coders based on the sinusoidal model, originally proposed for speech analysis and synthesis, have shifted the research interest in audio analysis away from the LP approach. Nevertheless, some audio coding algorithms still rely on LP [6–15], which is then usually performed on a warped frequency scale. Also, in audio signal processing applications other than coding, prediction error filters obtained with LP are used for whitening audio signals, for example, to produce robust and fast-converging acoustic echo and feedback cancelers [17–20].
Since many audio signals exhibit a large degree of tonality, that is, their frequency spectrum is characterized by a finite number of dominant frequency components, it is useful to analyze LP of audio signals in the frequency domain, that is, from a spectral estimation point of view. Intuitively, one could expect that performing LP using a model order that is twice the number of tonal components leads to a signal estimate in which each of the spectral peaks is modeled with a complex conjugate pole pair close to (but inside) the unit circle. In practice, however, this does not seem to be the case, and very often a poor LP signal estimate is obtained. The fundamental problem when performing LP of an audio signal is that, apart from the tonal components, a broadband noise term should generally also be incorporated in the tonal model. The noise term can account either for imperfections in the tonal behavior of the signal, or for noise introduced when working with finite-length data windows. Whereas a sum of $P$ sinusoids can be perfectly modeled using an AR($2P$) model, that is, an autoregressive or all-pole model of order $2P$, a sum of $P$ sinusoids plus (white) noise should instead be modeled using an ARMA($2P$, $2P$) model, that is, an autoregressive moving-average or pole-zero model with $2P$ zeros and $2P$ poles [21–25].
A first consequence of incorporating a noise term in the tonal signal model is that the LP spectral estimate is smoothed [22, 26], due to the fact that the estimated poles are drawn toward the origin of the $z$-plane [22, 27]. A second consequence, which to our knowledge has not been recognized until now, is that the estimated poles tend to be equally distributed around the unit circle when noise is present, even at high signal-to-noise ratios and for low AR model orders. From this observation, it follows that signals with tonal components that are approximately equally distributed in the Nyquist interval can be better represented with an all-pole model than signals that have their tonal components concentrated in a selected region of the Nyquist interval. Unfortunately, audio signals tend to belong to the latter class of signals, since they are typically sampled at a sampling frequency that is much higher than the frequency of their dominating tonal components.
Previous work has shown that audio signals having their dominating tonal components in a frequency region that is small compared to the entire signal bandwidth may exhibit a large autocorrelation matrix eigenvalue spread, and hence tend to produce inaccurate LP models due to numerical instability. A stabilization method based on a selective LP (SLP) model was proposed, which reduces the LP model bandwidth to the frequency region of interest. The influence of the signal frequency distribution on LP performance was also recognized with the development of the so-called frequency-warped linear prediction (WLP) [12, 16]. The warping operation is a nonuniform frequency transform that is usually designed to approximate the constant-Q frequency scale, and it also provides a good match with the Bark or ERB psychoacoustic scales, provided that the warping parameter is chosen properly. WLP has been shown to outperform conventional LP in terms of resolving adjacent peaks in the signal spectrum; however, no gain in spectral flatness of the LP residual was obtained. We will review the SLP and WLP models, as well as three other LP models that appear to be suited for tonal audio signals, and show how each of these models is capable of solving the frequency distribution issue described above. More specifically, we will also consider high-order all-pole models, constrained pole-zero models [24, 25, 31–37], and pitch prediction models. Pitch prediction (PLP), also known as long-term prediction, was originally proposed for speech modeling and coding, and was more recently applied to audio signal modeling in the context of the MPEG-4 advanced audio coder (AAC) [38, 39]. High-order (HOLP) and pole-zero (PZLP) linear prediction models have not been applied to audio modeling before; however, some speech analysis techniques rely on a PZLP model [40–42].
All considered approaches result in stable LP models, and some outperform the WLP model both in terms of conventional measures, such as frequency estimation error and residual spectral flatness [43, Chapter 6], and in terms of perceptually motivated measures, such as the interpeak dip depth (IDD). Moreover, many of these alternative models perform even better when cascaded with a conventional LP model. The LP models described in this paper were previously evaluated and compared experimentally for a synthetic audio signal. This work is extended here by also performing a mathematical analysis of the different LP models, and by describing additional simulation results for synthetic signals and for true monophonic and polyphonic audio signals.
This paper is organized as follows. Section 2 provides some background material on the signal model and the LP criterion. In Section 3, we analyze the performance of the conventional LP model, and illustrate the influence of the distribution of the tonal components in the analyzed signal. In Section 4, five alternative LP models are reviewed and interpreted as potential solutions to the observed frequency distribution problem. The emphasis is on the influence of using models other than the conventional low-order all-pole model, and not on how the model parameters are estimated. However, for each LP model, references to existing estimation methods are provided. LP model pole-zero plots and magnitude responses for a synthetic audio signal are presented throughout Sections 3 and 4. A detailed analysis is only provided for the pole-zero LP model, since all other alternative LP models are all-pole models, which can be analyzed using an approach similar to the conventional LP model analysis in Section 3. In Section 5, we provide LP model pole-zero plots and magnitude responses for true monophonic and polyphonic audio signals. Furthermore, the conventional and alternative LP models are compared in terms of frequency estimation accuracy, residual spectral flatness, and perceptual frequency resolution, both for synthetic and true audio signals. Finally, Section 6 concludes the paper.
2.1. Tonal Audio Signal Model
We will only consider tonal audio signals, that is, signals having a continuous spectrum containing a finite number of dominant frequency components. This covers the majority of audio signals, with the exception of percussive sounds. The performance of the different LP models described below will be evaluated for three types of audio signals: synthetic audio signals consisting of a sum of harmonic sinusoids in white noise, true monophonic audio signals, and true polyphonic audio signals.
For most musical instruments, the fundamental frequency of monophonic audio signals lies in the range 100–1000 Hz. The number of relevant harmonics (i.e., frequency components at multiples of the fundamental frequency, having a magnitude that is significantly larger than the average signal power) is typically between 10 and 20. Hence, most dominating frequency components in audio signals sampled at $f_s = 44.1$ kHz lie in the lower half of the Nyquist interval, that is, between 0 and 11025 Hz (corresponding to the angular frequency range from 0 to $\pi/2$). This property will be a key issue in the rest of the paper.
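As a quick numerical check of this claim, the sketch below converts harmonic frequencies (in Hz) to normalized angular frequencies and counts how many fall in the lower half of the Nyquist interval, $[0, \pi/2]$. The 44.1 kHz sampling rate comes from the text; the fundamental frequencies and harmonic counts in the usage lines are illustrative assumptions, and the helper names are hypothetical.

```python
import math

FS_HZ = 44100.0  # sampling frequency used throughout the text


def hz_to_omega(f_hz: float, fs_hz: float = FS_HZ) -> float:
    """Convert a frequency in Hz to a normalized angular frequency in rad/sample."""
    return 2.0 * math.pi * f_hz / fs_hz


def fraction_in_lower_half(f0_hz: float, n_harmonics: int, fs_hz: float = FS_HZ) -> float:
    """Fraction of the harmonics k*f0 (k = 1..n_harmonics) lying at or below pi/2."""
    omegas = [hz_to_omega(k * f0_hz, fs_hz) for k in range(1, n_harmonics + 1)]
    in_lower_half = [w for w in omegas if w <= math.pi / 2.0]
    return len(in_lower_half) / n_harmonics


if __name__ == "__main__":
    print(hz_to_omega(11025.0))                # upper edge of the lower half, pi/2
    print(fraction_in_lower_half(200.0, 20))   # low-pitched note: 1.0
    print(fraction_in_lower_half(1000.0, 20))  # high-pitched note: 0.55
```

The second case shows that the statement holds for most, not all, notes: with a 1000 Hz fundamental, only the first 11 of 20 harmonics stay below 11025 Hz.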
As for speech signals, short-term stationarity can also be assumed for audio signals. Monophonic audio signals can typically be divided into musical notes of different durations. Each note can then be subdivided into four parts: the attack, decay, sustain, and release parts. The sustain part is usually the longest part of the note, and exhibits the highest degree of stationarity. The attack and decay parts are the shortest, and may show transient behavior, such that stationarity can only be assumed on very short time windows (a few milliseconds). Whereas LP of speech signals is typically performed on time windows of around 20 milliseconds, longer windows appear to be beneficial for LP of audio signals. In our examples, a time window of 46.4 milliseconds is used, corresponding to $N = 2048$ samples at $f_s = 44.1$ kHz, or, in musical terms, a 1/32 note at 161.5 beats per minute. In our theoretical derivations, however, we will assume $N \to \infty$ to avoid window end effects.
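The window length quoted above can be checked with a few lines of arithmetic. The sketch assumes a window of 2048 samples at 44.1 kHz and counts one beat as a quarter note; under these assumptions both figures in the text (46.4 ms and a 1/32 note at 161.5 beats per minute) come out at about 46.44 ms. The function names are hypothetical.

```python
FS_HZ = 44100.0    # sampling frequency
N_SAMPLES = 2048   # assumed window length in samples
TEMPO_BPM = 161.5  # tempo from the text; one beat = one quarter note (assumption)


def window_duration_ms(n_samples: int, fs_hz: float) -> float:
    """Duration of an analysis window in milliseconds."""
    return 1000.0 * n_samples / fs_hz


def note_duration_ms(note_fraction: float, tempo_bpm: float) -> float:
    """Duration of a note value (e.g. 1/32) in ms, assuming a quarter-note beat."""
    quarter_note_ms = 60000.0 / tempo_bpm
    return quarter_note_ms * (note_fraction / 0.25)


window_ms = window_duration_ms(N_SAMPLES, FS_HZ)   # about 46.44 ms
note_ms = note_duration_ms(1.0 / 32.0, TEMPO_BPM)  # about 46.44 ms
print(f"window: {window_ms:.2f} ms, 1/32 note: {note_ms:.2f} ms")
```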
where, for ease of notation, the time index has been normalized with respect to the sampling period. This signal model is referred to as the tonal signal model, and it may differ from the sinusoidal model used in speech and audio coding in that only the tonal components in the observed audio signal are modeled by sinusoids, while the nontonal components are contained in the noise term. The tonal components correspond to the fundamental frequencies and their relevant harmonics, and are characterized by their amplitudes, (radial) frequencies, and phases. The noise term will generally have a nonwhite, continuous spectrum, and may also contain low-power harmonics.
The monophonic signal model in (2) is a harmonic signal model, while the tonal and polyphonic signal models in (1) and (3) are not. We should stress that of all LP models described below, the pitch prediction model described in Section 4.3 is the only model in which the harmonicity property is exploited. The other models do not rely on harmonicity, although the calculation of the LP model parameters may be simplified by taking harmonicity into account.
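For experimentation, a synthetic signal following the harmonic (monophonic) signal model can be generated as below: a sum of harmonics of a fundamental frequency plus white Gaussian noise, scaled to a prescribed SNR. The $1/k$ amplitude roll-off, the random phases, and the function name `synthetic_harmonic_signal` are illustrative assumptions, not the settings used in Example 1.

```python
import numpy as np


def synthetic_harmonic_signal(f0_hz, n_harmonics, snr_db, n_samples,
                              fs_hz=44100.0, seed=0):
    """Sum of harmonic sinusoids in white Gaussian noise (hypothetical helper).

    Returns (observed, tonal, noise) with observed = tonal + noise.
    """
    if n_harmonics * f0_hz >= fs_hz / 2.0:
        raise ValueError("highest harmonic must lie below the Nyquist frequency")
    rng = np.random.default_rng(seed)
    t = np.arange(n_samples)              # time index normalized to the sampling period
    omega0 = 2.0 * np.pi * f0_hz / fs_hz  # fundamental frequency in rad/sample
    tonal = np.zeros(n_samples)
    for k in range(1, n_harmonics + 1):
        amplitude = 1.0 / k                # assumed 1/k amplitude roll-off
        phase = rng.uniform(-np.pi, np.pi) # random phase per harmonic
        tonal += amplitude * np.cos(k * omega0 * t + phase)
    # Choose the noise variance so that the expected SNR equals snr_db.
    noise_power = np.mean(tonal ** 2) / 10.0 ** (snr_db / 10.0)
    noise = rng.normal(0.0, np.sqrt(noise_power), n_samples)
    return tonal + noise, tonal, noise


y, s, v = synthetic_harmonic_signal(f0_hz=200.0, n_harmonics=10, snr_db=25.0, n_samples=2048)
measured_snr_db = 10.0 * np.log10(np.mean(s ** 2) / np.mean(v ** 2))
print(f"measured SNR: {measured_snr_db:.2f} dB")
```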
Example 1 (synthetic audio signal).
2.2. Linear Prediction Criterion
where $\boldsymbol{\theta}$ represents a vector that contains the LP model parameters, $Y(z)$ and $E(z)$ denote the $z$-transforms of the observed and residual signal, respectively, and $A(z, \boldsymbol{\theta})$ corresponds to the prediction error filter (PEF), which has the property of whitening the input signal. The PEF transfer function is required to be stable, whereas the LP model transfer function is not. In fact, when modeling sinusoidal components in the observed signal, an unstable LP model having poles on the unit circle can be very useful.
This approximation can be justified in the LP analysis by noting that the noise term in the tonal signal model is spectrally much flatter than the tonal part of the observed signal.
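A common way to estimate a conventional all-pole LP model is the autocorrelation method combined with the Levinson-Durbin recursion (also mentioned in Section 5). The sketch below uses NumPy only and the PEF convention $A(z) = 1 + a_1 z^{-1} + \dots + a_p z^{-p}$; the function names are our own. With the biased autocorrelation estimate used here, the resulting PEF is minimum phase, in line with the stability requirement on the PEF.

```python
import numpy as np


def autocorrelation(x, max_lag):
    """Biased autocorrelation estimate r(0), ..., r(max_lag)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) for k in range(max_lag + 1)]) / n


def levinson_durbin(r, order):
    """Solve the LP normal equations; return PEF [1, a_1, ..., a_p] and error power."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])  # correlation of current error with x(t-i)
        k = -acc / err                              # reflection coefficient
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def lp_residual(x, a):
    """Filter x with the FIR PEF A(z), zero initial conditions."""
    return np.convolve(x, a)[:len(x)]


# Usage: recover a known AR(2) model x(t) = 1.5 x(t-1) - 0.7 x(t-2) + w(t).
rng = np.random.default_rng(1)
w = rng.normal(size=20000)
x = np.zeros_like(w)
for t in range(len(w)):
    x[t] = w[t] + (1.5 * x[t - 1] if t >= 1 else 0.0) - (0.7 * x[t - 2] if t >= 2 else 0.0)
a_hat, err = levinson_durbin(autocorrelation(x, 2), 2)
print(a_hat)  # close to [1, -1.5, 0.7]
```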
We will examine the effect of setting the LP model order equal to twice the number of tonal components, since an AR model of that order should be capable of perfectly modeling a noiseless sum of sinusoids. However, in the tonal signal model (1), a noise term is also present; hence, the solution to the LP estimation problem will be a compromise between attenuating the tonal components and increasing (or maintaining) the flatness of the noise spectrum. In earlier work, this compromise was analyzed with respect to its effect on the radii of the PEF zeros, while the effect on the PEF zero angles was disregarded. In our analysis, we will focus on the effect of the noise on the estimated PEF zero angles.
The PEF, thus, behaves as a cascade of second-order all-zero notch filters, with all the zeros on the unit circle and with the notch frequencies equal to the frequencies of the tonal components. Note that the corresponding LP model transfer function is in this case unstable.
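The noiseless solution can be verified numerically. A PEF built as a cascade of second-order factors $1 - 2\cos(\omega_n) z^{-1} + z^{-2}$, one per tonal frequency $\omega_n$, has its zeros on the unit circle at $e^{\pm j\omega_n}$ and cancels a noiseless sum of sinusoids exactly once the FIR transient has passed. The frequencies and phases below are arbitrary illustrative values.

```python
import numpy as np


def notch_cascade_pef(omegas):
    """PEF coefficients of prod_n (1 - 2 cos(w_n) z^-1 + z^-2)."""
    a = np.array([1.0])
    for w in omegas:
        a = np.convolve(a, [1.0, -2.0 * np.cos(w), 1.0])
    return a


fs_hz = 44100.0
omegas = 2.0 * np.pi * np.array([220.0, 440.0, 660.0]) / fs_hz  # P = 3 tonal components
t = np.arange(2048)
x = sum(np.cos(w * t + 0.4 * n) for n, w in enumerate(omegas))   # noiseless tonal signal

a = notch_cascade_pef(omegas)     # PEF of order 2P = 6
e = np.convolve(x, a)[:len(x)]    # prediction error (residual)
P = len(omegas)
print(np.max(np.abs(e[2 * P:])))  # essentially zero (rounding error only)
```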
In this estimation problem, the squared norm of the PEF impulse response coefficient vector is minimized under a constraint that rules out the trivial solution. It is straightforward to see that the solution to (25) can be obtained by setting all PEF coefficients except the last one to zero, which results in a PEF of the form $1 + c\,z^{-2P}$ (with $P$ the number of tonal components) that behaves as a comb filter. The PEF zeros are then uniformly distributed on a circle with radius $|c|^{1/(2P)}$, with an angle $\pi/P$ between neighboring zeros. In case $c > 0$, the PEF zero angles in the Nyquist interval correspond to $(2k-1)\pi/(2P)$, $k = 1, \dots, P$, while if $c < 0$, the PEF has $P+1$ zeros in the Nyquist interval, that is, at angles $k\pi/P$, $k = 0, \dots, P$. The latter case corresponds to a one-tap pitch prediction filter (see Section 4.3), which in fact deviates from the conventional LP model in (14), since the zeros at DC and at the Nyquist frequency do not have a corresponding complex conjugate zero.
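The comb-filter behavior can be illustrated by computing the zeros of a PEF of order $2P$ whose only nonzero coefficients are the leading 1 and the last coefficient $c$, that is, $1 + c\,z^{-2P}$. Its zeros lie on a circle of radius $|c|^{1/(2P)}$ with angular spacing $\pi/P$, and the sign of $c$ decides whether zeros at DC and at the Nyquist frequency are present. The value $|c| = 0.5$ and the helper names are arbitrary choices for illustration.

```python
import numpy as np


def comb_pef_zeros(P, c):
    """Zeros of the order-2P PEF 1 + c z^-2P (all intermediate coefficients zero)."""
    a = np.zeros(2 * P + 1)
    a[0], a[-1] = 1.0, c
    return np.roots(a)


def distinct_angles(z, tol=1e-8):
    """Sorted zero angles in [0, pi], merging complex conjugate pairs."""
    angles = np.sort(np.abs(np.angle(z)))
    merged = [angles[0]]
    for ang in angles[1:]:
        if ang - merged[-1] > tol:
            merged.append(ang)
    return np.array(merged)


P = 3
for c in (0.5, -0.5):
    z = comb_pef_zeros(P, c)
    # radii are all |c|^(1/(2P)); angles are printed as multiples of pi
    print(c, np.abs(z).round(4), (distinct_angles(z) / np.pi).round(4))
```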
We can, therefore, expect that when noise is present, the estimated PEF zeros are both shifted toward the origin and rotated around the origin, hence tending to a uniform angular distribution. The extent to which the zeros are displaced, compared to the noiseless solution, depends on the noise power, which determines the relative importance of the minimum norm constraint in the LP criterion (13). The angular effect described above can also be observed in the noiseless case when the LP model order exceeds twice the number of tonal components, in which case the "extraneous" PEF zeros tend to be uniformly distributed around the unit circle if a minimum norm constraint is incorporated in the LP criterion.
Example 2 (conventional LP of synthetic audio signal).
In this section, we present five existing alternative LP models, and we illustrate how all these models attempt to compensate for the shortcomings of the conventional LP model, described in Section 3, when the input signal tonal components are concentrated in the lower half of the Nyquist interval. In the first three alternative LP models, namely, the constrained pole-zero LP (PZLP) model, the high-order LP (HOLP) model, and the pitch prediction (PLP) model, the influence of the input signal frequency distribution is decreased by using a model different from the conventional low-order all-pole model. In the last two alternative LP models, namely, the warped LP (WLP) model and the selective LP (SLP) model, the performance of the conventional low-order all-pole model is increased by first transforming the input signal such that its tonal components are spread in the entire Nyquist interval. As stated earlier, we will mainly focus on the alternative LP models, and not on how the model parameters can be estimated.
4.1. Constrained Pole-Zero LP Model
Substituting (39)–(41) in (37) and (38), and noting that the expression in (35) does not depend on the PEF pole-zero angles, we can see that all the terms in the system of (36)–(38) that are due to the noise component in the observed signal cancel out. In other words, if the PEF poles and zeros are close to the unit circle, then the solution to the LP estimation problem using the PZLP model is insensitive to (white) noise in the observed signal. This is the main strength of the PZLP model as compared to the conventional LP model, which was shown in Section 3 to be much more sensitive to noise when predicting tonal signals.
Example 3 (constrained pole-zero LP of synthetic audio signal).
The PZLP model parameters can be estimated either using an adaptive notch filtering (ANF) algorithm, for which several implementations have been suggested [24, 25, 31–35], or using the constrained pole-zero linear prediction (CPZLP) algorithm for multitone frequency estimation [36, 37]. Alternatively, if the PEF pole and zero radii are fixed a priori, any existing frequency estimation algorithm may be used to estimate the unknown PEF angles. When harmonicity can be assumed, that is, for monophonic audio signals, an adaptive comb filter (ACF) may be a useful alternative to the ANF, as it relies on only one unknown parameter (i.e., the fundamental frequency) [32, 35]. Similarly, a comb filter-based variant of the CPZLP algorithm has also been described.
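A widely used form of constrained pole-zero PEF in adaptive notch filtering places, for each tonal frequency $\omega_n$, a zero pair on the unit circle and a pole pair at the same angle with radius $\rho < 1$: $\prod_n (1 - 2\cos\omega_n z^{-1} + z^{-2}) / (1 - 2\rho\cos\omega_n z^{-1} + \rho^2 z^{-2})$. The sketch below assumes this form, which is not necessarily the exact parametrization of (36)–(41), and shows that the magnitude response is zero at the tonal frequencies and close to one away from them, which is what keeps the residual spectrum flat. The frequencies, the radius $\rho = 0.99$, and the function names are illustrative assumptions.

```python
import numpy as np


def constrained_pz_pef(omegas, rho):
    """Numerator/denominator coefficients (powers of z^-1) of a notch-type PEF."""
    num, den = np.array([1.0]), np.array([1.0])
    for w in omegas:
        num = np.convolve(num, [1.0, -2.0 * np.cos(w), 1.0])            # zeros on the unit circle
        den = np.convolve(den, [1.0, -2.0 * rho * np.cos(w), rho ** 2])  # poles at radius rho
    return num, den


def magnitude_response(num, den, omega):
    """|H(e^{j omega})| for H(z) = num(z^-1) / den(z^-1)."""
    zinv = np.exp(-1j * np.asarray(omega))
    # np.polyval expects the highest power first, hence the reversed coefficients
    return np.abs(np.polyval(num[::-1], zinv) / np.polyval(den[::-1], zinv))


omegas = np.array([0.1, 0.2, 0.3])  # hypothetical low-frequency tonal components (rad/sample)
num, den = constrained_pz_pef(omegas, rho=0.99)
print(magnitude_response(num, den, omegas))     # notches: (numerically) zero
print(magnitude_response(num, den, np.pi / 2))  # close to 1 away from the notches
```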
4.2. High-Order LP Model
It is well known that a pole-zero model can be approximated arbitrarily closely by an all-pole model, provided that the model order is chosen large enough. This means that a noisy sum of sinusoids can also be modeled using a high-order all-pole model instead of a pole-zero model. In Section 3, the LP minimization problem (13) was analyzed for the case of an all-pole model of order equal to twice the number of tonal components. When noise is present in the observed signal, the LP solution was shown to be a compromise between cancelling the tonal components and maintaining a flat high-frequency residual spectrum. By increasing the model order, the density of the zeros near the unit circle is increased accordingly, and hence the frequency resolution in the frequency range of the tonal components improves without sacrificing high-frequency residual spectral flatness. However, as the LP model order approaches the observation window length, the variance of the estimated model parameters may become unacceptably large, leading to spurious peaks in the signal spectral estimate. It has been suggested that the order of a high-order LP (HOLP) model should be chosen within a specific interval, defined relative to the observation window length, to obtain the best spectral estimate for a noisy sum of sinusoids [22, 46].
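The effect of increasing the all-pole model order can be explored by solving the autocorrelation-method normal equations for several orders and tracking the prediction error power, which cannot increase with the order for a positive definite autocorrelation matrix. The sketch uses a direct linear solve for brevity instead of a Levinson recursion; the test signal (three low-frequency tones in white noise), the orders, and the function name are illustrative assumptions, and it does not reproduce the recommended HOLP order interval.

```python
import numpy as np


def lp_by_normal_equations(x, order):
    """Autocorrelation-method LP: return PEF [1, a_1..a_p] and prediction error power."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)]) / n
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[idx]                                # order x order Toeplitz autocorrelation matrix
    a = np.linalg.solve(R, -r[1:order + 1])   # normal equations R a = -r
    err = r[0] + np.dot(a, r[1:order + 1])    # minimum prediction error power
    return np.concatenate(([1.0], a)), err


rng = np.random.default_rng(2)
t = np.arange(2048)
tones = sum(np.cos(w * t) for w in (0.05, 0.10, 0.15))  # P = 3 low-frequency tones
x = tones + rng.normal(scale=0.05, size=t.size)          # additive white noise
orders = [6, 24, 96, 384]                                # from 2P up to a high order
errors = [lp_by_normal_equations(x, p)[1] for p in orders]
for p, e in zip(orders, errors):
    print(p, e)
```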
Example 4 (high-order LP of synthetic audio signal).
4.3. Pitch Prediction Model
It can be seen that, for a suitable choice of the PLP model parameters, the PEF behaves as a comb filter, that is, the PEF zeros are positioned on and equally spaced around the unit circle, at angles corresponding to integer multiples of the fundamental frequency. In other words, referring to the analysis in Section 3, the requirements of having the PEF zeros on the unit circle at the tonal component frequencies (for cancelling the tonal components) and uniformly distributed on the unit circle (for maintaining the LP residual spectral flatness) are both fulfilled with the PLP model in (46).
it can be derived that the desired spectral shaping for our application, that is, a decreasing notch depth for increasing frequency, is obtained when .
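For a strictly periodic signal with an integer period of $T$ samples, a one-tap pitch PEF $1 - z^{-T}$ is a comb filter with zeros at $e^{j 2\pi k/T}$, that is, at all integer multiples of the fundamental frequency $2\pi/T$, and it cancels the harmonic signal completely after the first period. The sketch assumes an integer period (441 Hz at 44.1 kHz, so $T = 100$) and a unit predictor gain; fractional lags require an interpolation filter.

```python
import numpy as np

fs_hz = 44100.0
f0_hz = 441.0
T = int(round(fs_hz / f0_hz))   # pitch lag in samples (exactly 100 here)
t = np.arange(4 * T)
omega0 = 2.0 * np.pi / T
x = sum(np.cos(k * omega0 * t + 0.3 * k) / k for k in range(1, 11))  # 10 harmonics

pef = np.zeros(T + 1)
pef[0], pef[T] = 1.0, -1.0      # one-tap pitch PEF 1 - z^-T with unit gain
e = np.convolve(x, pef)[:len(x)]
zeros = np.roots(pef)           # T zeros on the unit circle, spaced by 2*pi/T
print(np.max(np.abs(e[T:])))    # essentially zero after the first period
```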
The fractional delay interpolation filter is a truncated, Hamming-windowed approximation of the ideal sinc-like interpolation filter. In (50), $D$ is the interpolation ratio (where $1/D$ is referred to as the pitch resolution) and $d$ denotes the fractional phase.
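A truncated, Hamming-windowed sinc interpolator of the kind described above can be written as follows. The parametrization is an assumption for illustration: the fractional delay is $d/D$ for an interpolation ratio $D$ and phase $d \in \{0, \dots, D-1\}$, the filter has an odd length $L$, and the window is centered on the (fractional) delay point, so the exact definition in (50) may differ. The function name is hypothetical.

```python
import numpy as np


def fractional_delay_filter(d, D, L=31):
    """Hamming-windowed sinc approximating a delay of (L-1)/2 + d/D samples.

    Hypothetical parametrization: odd length L, window centered on the delay point.
    """
    if L % 2 == 0 or not 0 <= d < D:
        raise ValueError("L must be odd and 0 <= d < D")
    center = (L - 1) / 2 + d / D     # total delay in samples
    n = np.arange(L)
    offset = n - center              # distance of each tap to the delay point
    half_width = (L - 1) / 2 + 1.0   # window half-width, keeps all taps inside
    window = 0.54 + 0.46 * np.cos(np.pi * offset / half_width)  # Hamming shape
    return np.sinc(offset) * window


h = fractional_delay_filter(d=1, D=4)  # delay of 15.25 samples for L = 31
print(h.sum())                         # DC gain, close to 1
```

For $d = 0$ the filter reduces to a pure integer delay (a unit impulse at the center tap), which is a convenient sanity check.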
Typically, for estimating the PLP model parameters, in a first step, the optimal pitch lag and fractional phase are estimated by an exhaustive search for the minimal fractional 1-tap PLP residual power over a range of candidate pitch lags and over all fractional phases. In speech analysis, the pitch lag limits correspond to the highest-pitched (female) and lowest-pitched (male) voices being analyzed. For pitch analysis of audio signals, we propose to set the pitch lag range such that it corresponds to a given fundamental frequency range in Hz at the audio sampling frequency. In a second step, the fractional 3-tap PLP model parameters are estimated using the estimated pitch lag and fractional phase from the first step. Some useful approximations for efficiently calculating the 3-tap PLP model parameters from the input signal autocorrelation function have been suggested in the literature.
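The first, exhaustive-search step can be sketched for integer lags only (the fractional-phase refinement is omitted). For each candidate lag $T$, the optimal one-tap gain is $\beta = \sum_t x(t)x(t-T) / \sum_t x^2(t-T)$, and minimizing the residual power is equivalent to maximizing $\big(\sum_t x(t)x(t-T)\big)^2 / \sum_t x^2(t-T)$. The fundamental frequency search range passed to the hypothetical `pitch_lag_search` function is a user-supplied assumption.

```python
import numpy as np


def pitch_lag_search(x, fs_hz, f_min_hz, f_max_hz):
    """Integer-lag one-tap pitch prediction by exhaustive search (hypothetical helper).

    Returns (lag, gain) minimizing the residual power of x(t) - gain * x(t - lag).
    """
    x = np.asarray(x, dtype=float)
    lag_min = int(np.floor(fs_hz / f_max_hz))
    lag_max = int(np.ceil(fs_hz / f_min_hz))
    target = x[lag_max:]                      # common segment so all lags are compared fairly
    best_lag, best_gain, best_score = lag_min, 0.0, -np.inf
    for lag in range(lag_min, lag_max + 1):
        past = x[lag_max - lag:len(x) - lag]  # x(t - lag), aligned with target
        energy = np.dot(past, past)
        if energy <= 0.0:
            continue
        corr = np.dot(target, past)
        score = corr * corr / energy          # residual power = energy(target) - score
        if score > best_score:
            best_lag, best_gain, best_score = lag, corr / energy, score
    return best_lag, best_gain


fs_hz = 44100.0
t = np.arange(4096)
omega0 = 2.0 * np.pi * 441.0 / fs_hz          # period of exactly 100 samples
x = sum(np.cos(k * omega0 * t + 0.5 * k) / k for k in range(1, 9))
lag, gain = pitch_lag_search(x, fs_hz, f_min_hz=300.0, f_max_hz=1000.0)
print(lag, gain)  # lag 100, gain close to 1
```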
Example 5 (pitch prediction of synthetic audio signal).
4.4. Warped LP Model
Since the warping parameter is chosen such that the low-frequency region is stretched, the warping operation tends to spread out the tonal components in the observed signal over the entire Nyquist interval. From the conventional LP analysis in Section 3, it can hence be expected that applying a conventional, that is, low-order all-pole LP model to the warped signal will yield a better prediction than a conventional LP model of the original signal. The optimal prediction is obtained when the frequency transformation produces a uniform spreading of the tonal components in the Nyquist interval. For monophonic audio signals, this is never the case, since the bilinear frequency warping in (51)–(52) disturbs the harmonicity of the signal. For this class of signals, the frequency transformation of the selective LP model described in Section 4.5 appears to be better suited. However, for polyphonic audio signals, the above bilinear frequency warping may be a near-optimal mapping, since in this case the different fundamental frequencies are approximately related to each other according to the Bark scale (see also the simulation results in Section 5.3).
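The frequency mapping of first-order all-pass (bilinear) warping can be evaluated directly. The sketch assumes the common all-pass convention $D(z) = (z^{-1} - \lambda)/(1 - \lambda z^{-1})$, for which a frequency $\omega$ is mapped to $\tilde\omega = \omega + 2\arctan\big(\lambda\sin\omega / (1 - \lambda\cos\omega)\big)$, so that a positive $\lambda$ stretches the low-frequency region. The Bark-matching value of $\lambda$ uses the approximation by Smith and Abel (about 0.756 at 44.1 kHz). Both the convention and that formula are assumptions that may differ from (51)–(52), and the example note is hypothetical.

```python
import numpy as np


def warped_frequency(omega, lam):
    """Frequency mapping of the all-pass D(z) = (z^-1 - lam) / (1 - lam z^-1)."""
    omega = np.asarray(omega, dtype=float)
    return omega + 2.0 * np.arctan2(lam * np.sin(omega), 1.0 - lam * np.cos(omega))


def bark_warping_parameter(fs_hz):
    """Smith-Abel approximation of the Bark-matching warping parameter."""
    fs_khz = fs_hz / 1000.0
    return 1.0674 * np.sqrt(2.0 / np.pi * np.arctan(0.06583 * fs_khz)) - 0.1916


fs_hz = 44100.0
lam = bark_warping_parameter(fs_hz)          # about 0.756 at 44.1 kHz
harmonics_hz = 200.0 * np.arange(1, 11)      # hypothetical 200 Hz note, 10 harmonics
omega = 2.0 * np.pi * harmonics_hz / fs_hz
# highest harmonic as a fraction of the Nyquist interval, before and after warping
print(omega.max() / np.pi, warped_frequency(omega, lam).max() / np.pi)
```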
Example 6 (warped LP of synthetic audio signal).
The warped spectrum of the noisy synthetic audio signal defined before is shown in Figure 7(a). Figures 7(b) and 7(c) illustrate the PEF pole-zero plot and magnitude response on a warped frequency scale, when a WLP model is calculated using the autocorrelation method. The frequency resolution of the WLP spectral estimate is very good for the five lowest tonal components, while the higher harmonics are modeled less accurately because they are too closely spaced on the warped frequency scale. The PEF transfer function can be unwarped to the original frequency scale, but then the PEF impulse response is of infinite duration. The PEF pole-zero plot and magnitude response on the original frequency scale, obtained by truncating the unwarped PEF impulse response to a finite length, are shown in Figures 7(d) and 7(e). The pole-zero plot on the original frequency scale clearly illustrates that the WLP model succeeds both at cancelling the (low-frequency) tonal components (by placing a few zeros approximately on the unit circle at the lower tonal component frequencies) and at preserving the overall spectral flatness of the residual (by placing a large number of zeros uniformly spaced around and close to the unit circle).
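A common way to compute a WLP model is to replace the unit delays in the autocorrelation computation by first-order all-pass sections: the warped autocorrelation at lag $k$ is the inner product of the signal with a copy of itself filtered by $k$ cascaded all-pass sections, after which the usual normal equations are solved. The sketch below follows that approach under the all-pass convention $D(z) = (z^{-1} - \lambda)/(1 - \lambda z^{-1})$; for $\lambda = 0$ it reduces to the conventional (unnormalized) autocorrelation method. The function names are our own.

```python
import numpy as np


def allpass(x, lam):
    """First-order all-pass D(z) = (z^-1 - lam) / (1 - lam z^-1), zero initial state."""
    y = np.zeros(len(x))
    x_prev = y_prev = 0.0
    for n, xn in enumerate(x):
        y[n] = -lam * xn + x_prev + lam * y_prev
        x_prev, y_prev = xn, y[n]
    return y


def warped_autocorrelation(x, order, lam):
    """r~(k) = sum_n x(n) * [D^k x](n) for k = 0..order."""
    x = np.asarray(x, dtype=float)
    r = np.zeros(order + 1)
    xk = x.copy()
    for k in range(order + 1):
        r[k] = np.dot(x, xk)
        xk = allpass(xk, lam)  # one more all-pass section for the next lag
    return r


def wlp(x, order, lam):
    """Warped LP: coefficients of the warped PEF 1 + sum_k a_k D(z)^k."""
    r = warped_autocorrelation(x, order, lam)
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[idx], -r[1:order + 1])
    return np.concatenate(([1.0], a))
```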
4.5. Selective LP Model
which, when combined with a conventional LP model, is known as a selective LP (SLP) model.
Note that the optimal downsampling factor given in (57) is highly signal-dependent, and that noninteger downsampling is required in general. These difficulties can easily be avoided by using an approximate, integer downsampling factor (see Section 5) that is fixed for the entire signal analysis. This factor should preferably be chosen using some prior knowledge about the frequency range of the instrument generating the audio signal being analyzed.
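The integer-downsampling variant can be sketched as an anti-aliasing low-pass filter followed by keeping every $M$th sample; a conventional low-order LP model is then fitted to the result, whose tonal components now sit at $M$ times larger normalized frequencies. The rule used below for choosing $M$ (the largest integer keeping the highest tonal frequency of interest below a fraction `margin` of the new Nyquist frequency), the filter length, and the function names are illustrative assumptions, not the optimal factor of (57).

```python
import numpy as np


def integer_downsampling_factor(f_max_hz, fs_hz, margin=0.9):
    """Largest integer M with f_max <= margin * fs / (2 M) (hypothetical selection rule)."""
    return max(1, int(np.floor(margin * fs_hz / (2.0 * f_max_hz))))


def downsample(x, M, half_taps_per_factor=16):
    """Windowed-sinc low-pass (cutoff pi/M), then keep every M-th sample."""
    if M == 1:
        return np.asarray(x, dtype=float)
    n = np.arange(-half_taps_per_factor * M, half_taps_per_factor * M + 1)
    h = np.sinc(n / M) * np.hamming(len(n))
    h /= h.sum()  # unit gain at DC
    return np.convolve(x, h, mode="same")[::M]


fs_hz = 44100.0
M = integer_downsampling_factor(f_max_hz=2000.0, fs_hz=fs_hz)  # 9 for these values
omega = 2.0 * np.pi * 2000.0 / fs_hz
print(M, omega / np.pi, M * omega / np.pi)  # normalized frequency before and after
```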
Example 7 (selective LP of synthetic audio signal).
In this section, we evaluate the conventional and alternative LP models described in Sections 3 and 4 in terms of frequency estimation accuracy, residual spectral flatness, and perceptual frequency resolution for a synthetic harmonic audio signal with varying fundamental frequency and SNR. Afterwards, we apply the different LP models to true monophonic and polyphonic audio signals, and we analyze the PEF behavior by examining the pole-zero plots and magnitude responses. Residual spectral flatness figures are given for true audio signals as a function of pitch and time offset of the analysis window within the signal.
We should stress that the aim is to compare different LP models, and not the algorithms that can be used to estimate the model parameters. Some models come with parameter estimation algorithms that are well established (e.g., covariance method or autocorrelation method with Levinson-Durbin algorithm [51, Chapter 6] for all-pole models), yet other models do not. In particular, PZLP models typically result in a nonconvex parameter estimation problem that is solved either in an adaptive or iterative way. As a consequence, the performance of the corresponding estimation algorithms (e.g., ANF or CPZLP) depends heavily on the initial conditions. In the simulation results presented below, the initial conditions are chosen in the neighborhood of the true fundamental frequencies in the observed audio signal, such that the PZLP estimation algorithms yield a solution that corresponds with high probability to the global solution. In this way, the emphasis is on the model performance rather than on the estimation algorithm performance. For the same reason, knowledge of the true fundamental frequencies is also assumed when determining the optimal downsampling factor in the SLP estimation algorithms, and for designing a PLP model for polyphonic audio signals. For the conventional LP model, the performance may differ substantially for the autocorrelation and covariance estimation methods, hence the results for both methods are included.
5.1. Synthetic Audio Signal
Throughout Examples 2–7, the performance of conventional and alternative LP models was illustrated by inspecting the PEF pole-zero plots and magnitude responses resulting from the prediction of a noisy synthetic audio signal with a fixed fundamental frequency and SNR = 25 dB. We now present a more quantitative evaluation of the different LP models for a synthetic audio signal with variable fundamental frequency and SNR.
with $E(k)$ the $K$-point DFT of the LP residual. The SFM is a real number between 0 and 1, with SFM = 1 corresponding to a flat spectrum, and it is often expressed on a dB scale (0 dB corresponding to a flat spectrum). Monte Carlo simulation results of the residual SFM after prediction of the synthetic audio signals with varying fundamental frequency and SNR described above are shown in Figures 9(c) and 9(d). The residual SFM of the low-order all-pole models (conventional LP with the autocorrelation and covariance methods, WLP, and SLP) decreases with increasing fundamental frequency and increasing SNR. The first observation can be explained by noting that at low fundamental frequency values, the low-order all-pole models tend to model multiple tonal components with one complex conjugate pole pair, while the remaining poles are used to model the high-frequency noise spectrum. As a consequence, most of the poles are located relatively far away from the unit circle, hence resulting in a smoother spectral behavior. The residual SFM drop at high SNR values should not be surprising, since the low-order all-zero PEFs generally do not succeed at completely cancelling the tonal components from the observed signal. On the other hand, the residual SFM of the PLP and PZLP models increases with increasing fundamental frequency, and decreases (PLP) or remains quasi-constant (PZLP) with increasing SNR. The HOLP model residual SFM is the highest among all LP models, and appears to be independent of both fundamental frequency and SNR. The SFM of the synthetic audio signals before LP was on average −10 dB in the varying fundamental frequency case, and −35 dB in the varying SNR case. A relevant extension to the low-order alternative LP models described in Section 4 is to cascade them with a conventional LP model. Such a cascaded model can be motivated by noting that for true audio signals, the noise term in the tonal signal models (1)–(3) may be nonwhite.
Hence, an alternative LP model could be applied first for predicting the tonal components, and in a second step a conventional LP model could be used for whitening both the noise and the unpredicted tonal components in the residual of the alternative LP model. This cascaded structure appears to be beneficial for the low-order alternative LP models (PZLP, PLP, WLP, and SLP) in terms of increasing the residual SFM, especially at high SNR values and, for the PZLP and PLP models, also at low fundamental frequency values.
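The spectral flatness measure can be computed as the ratio of the geometric mean to the arithmetic mean of the residual power spectrum $|E(k)|^2$ over the DFT bins. The sketch below assumes this standard definition, adds a tiny floor to avoid $\log 0$, and reports the result in dB; the signal lengths, the floor value, and the function name are illustrative choices.

```python
import numpy as np


def spectral_flatness_db(e, n_fft=None, floor=1e-20):
    """SFM = geometric mean / arithmetic mean of |E(k)|^2, in dB (0 dB = flat)."""
    power = np.abs(np.fft.fft(e, n=n_fft)) ** 2 + floor
    geometric_mean = np.exp(np.mean(np.log(power)))
    arithmetic_mean = np.mean(power)
    return 10.0 * np.log10(geometric_mean / arithmetic_mean)


rng = np.random.default_rng(4)
white = rng.normal(size=4096)
t = np.arange(4096)
tonal = np.cos(2.0 * np.pi * 50 * t / 4096) + 0.01 * rng.normal(size=4096)
print(spectral_flatness_db(white), spectral_flatness_db(tonal))
```

For a white residual the periodogram fluctuates around a flat level, so the measured SFM sits a few dB below 0 (about −2.5 dB) rather than exactly at 0 dB, while a strongly tonal residual gives a value far below 0 dB.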
5.2. Monophonic Audio Signal
5.3. Polyphonic Audio Signal
In this paper, we have analyzed the performance of the conventional LP model when applied to tonal audio signals, and illustrated how the quality of this model depends on the distribution of the signal tonal components in the Nyquist interval. It was shown that the conventional LP model, with a model order equal to twice the number of tonal components and calculated by minimizing an LS criterion, produces a PEF that features a tradeoff between cancelling the tonal components and keeping the residual spectrum as flat as possible. This tradeoff occurs since the tonal components in an audio signal sampled at $f_s = 44.1$ kHz are typically located in the lower half of the Nyquist interval.
Five existing alternative LP models were described, applied to tonal audio signals, and interpreted in terms of relieving the tradeoff inherent in the conventional LP model. The first three alternative LP approaches solve the frequency distribution problem by considering a model different from the low-order all-pole model, namely, a (constrained) pole-zero (PZLP) model, a high-order all-pole (HOLP) model, or a pitch prediction (PLP) model. Two other alternative approaches aim at improving the low-order all-pole model performance, by first transforming the input signal and hence altering the distribution of its tonal components. If an all-pass bilinear transform is used, we end up with the warped all-pole (WLP) model, whereas a linear frequency transform leads to the selective all-pole (SLP) model.
Extensive simulation results were reported with the aim of assessing the performance of the conventional and alternative LP models. Summarizing, we can state that a high-order all-pole model appears to be better suited to the audio LP problem than a conventional, low-order all-pole model. However, the HOLP model, which typically has half as many model parameters as the number of samples in the analysis window, is impractically complex in many applications. It could hence be expected that the PZLP model is a good alternative, since it can approximate the HOLP PEF impulse response with fewer parameters. This seems to be true only for monophonic audio signals, and even in this case, estimating the model parameters without prior knowledge of the fundamental frequency range is not a trivial task. Another good alternative to the HOLP model in the case of monophonic signals is the PLP model, especially when cascaded with a conventional LP model, as is common practice in speech analysis. Finally, for polyphonic audio LP, the WLP model performance comes very close to the optimal HOLP model performance; however, the WLP model performs poorly in terms of perceptual frequency resolution, unless its model order is chosen to be an order of magnitude larger than the number of tonal components in the observed signal.
This research work was carried out at the ESAT laboratory of the Katholieke Universiteit Leuven, within the framework of the Katholieke Universiteit (KU) Leuven Research Council CoE EF/05/006 Optimization in Engineering (OPTEC), the Belgian Programme on Interuniversity Attraction Poles initiated by the Belgian Federal Science Policy Office, IUAP P6/04 ("Dynamical systems, control and optimization" (DYSCO), 2007–2011), and the Concerted Research Action GOA-AMBioRICS, and was supported by the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT-Vlaanderen). The scientific responsibility is assumed by the authors.
- Makhoul J: Linear prediction: a tutorial review. Proceedings of the IEEE 1975,63(4):561-580.
- Ramachandran RP, Kabal P: Pitch prediction filters in speech coding. IEEE Transactions on Acoustics, Speech, and Signal Processing 1989,37(4):467-478. 10.1109/29.17527
- Brandenburg K, Stoll G: ISO-MPEG-1 audio: a generic standard for coding of high-quality digital audio. Journal of the Audio Engineering Society 1994,42(10):780-792.
- ISO/IEC: IS 14496-4:2004/Amd 13:2007: parametric coding for high quality audio conformance. International Organization for Standardization, Geneva, Switzerland; January 2007.
- McAulay RJ, Quatieri TF: Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing 1986,34(4):744-754. 10.1109/TASSP.1986.1164910
- Härmä A, Laine UK, Karjalainen M: Warped linear prediction in audio coding. Proceedings of IEEE Nordic Signal Processing Symposium (NORSIG '96), September 1996, Espoo, Finland 447-450.
- Iwakami N, Moriya T: Transform-domain weighted interleave vector quantization. Proceedings of the 101st AES Convention, November 1996, Los Angeles, Calif, USA AES preprint 4377.
- Bessette B, Salami R, Laflamme C, Lefebvre R: A wideband speech and audio codec at 16/24/32 kbit/s using hybrid ACELP/TCX techniques. Proceedings of IEEE Workshop on Speech Coding, June 1999, Porvoo, Finland 7-9.
- Härmä A, Laine UK: Warped low delay CELP for wideband audio coding. Proceedings of the 17th AES International Conference on High-Quality Audio Coding, September 1999, Florence, Italy 207-215.
- Rongshan Y, Chung KC: High quality audio coding using a novel hybrid WLP-subband coding algorithm. Proceedings of the 5th International Symposium on Signal Processing and Its Applications (ISSPA '99), August 1999, Brisbane, Australia 1: 483-486.
- Edler B, Faller C, Schuller G: Perceptual audio coding using a time-varying linear pre- and post-filter. Proceedings of the 109th AES Convention, September 2000, Los Angeles, Calif, USA AES preprint 5274.
- Härmä A, Laine UK: A comparison of warped and conventional linear predictive coding. IEEE Transactions on Speech and Audio Processing 2001,9(5):579-588. 10.1109/89.928922
- Deriche M, Ning D: A novel audio coding scheme using warped linear prediction model and the discrete wavelet transform. IEEE Transactions on Audio, Speech, and Language Processing 2006,14(6):2039-2048.
- Biswas A, den Brinker AC: Perceptually biased linear prediction. Journal of the Audio Engineering Society 2006,54(12):1179-1188.
- Nakatoh Y, Matsumoto H: A low-bit-rate audio codec using mel-scaled linear predictive analysis. Acoustical Science and Technology 2007,28(3):147-152. 10.1250/ast.28.147
- Strube HW: Linear prediction on a warped frequency scale. Journal of the Acoustical Society of America 1980,68(4):1071-1076. 10.1121/1.384992
- van Waterschoot T, Rombouts G, Verhoeve P, Moonen M: Double-talk-robust prediction error identification algorithms for acoustic echo cancellation. IEEE Transactions on Signal Processing 2007,55(3):846-858.
- Rombouts G, van Waterschoot T, Struyve K, Moonen M: Acoustic feedback cancellation for long acoustic paths using a nonstationary source model. IEEE Transactions on Signal Processing 2006,54(9):3426-3434.
- van Waterschoot T, Moonen M: Adaptive feedback cancellation for audio signals using a warped all-pole near-end signal model. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '08), March-April 2008, Las Vegas, Nev, USA 269-272.
- van Waterschoot T, Moonen M: Adaptive feedback cancellation for audio applications. Submitted to Signal Processing, ESAT-SISTA Technical Report TR 07-30, Katholieke Universiteit Leuven, Belgium, December 2008.
- Pagano M: Estimation of models of autoregressive signal plus white noise. The Annals of Statistics 1974,2(1):97-108.
- Kay SM: The effects of noise on the autoregressive spectral estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 1979,27(5):478-485. 10.1109/TASSP.1979.1163275
- Chan YT, Lavoie JMM, Plant JB: A parameter estimation approach to estimation of frequencies of sinusoids. IEEE Transactions on Acoustics, Speech, and Signal Processing 1981,29(2):214-219. 10.1109/TASSP.1981.1163543
- Rao DVB, Kung S-Y: Adaptive notch filtering for the retrieval of sinusoids in noise. IEEE Transactions on Acoustics, Speech, and Signal Processing 1984,32(4):791-802. 10.1109/TASSP.1984.1164398
- Fitzgerald WJ, Geere R: Class of constrained ARMA models for line enhancement using real-time QR implementation. Electronics Letters 1991,27(24):2230-2231. 10.1049/el:19911379
- Pisarenko VF: The retrieval of harmonics from a covariance function. Geophysical Journal International 1973,33(3):347-366. 10.1111/j.1365-246X.1973.tb03424.x
- Jackson LB, Tufts DW, Soong FK, Rao RM: Frequency estimation by linear prediction. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '78), April 1978, Tulsa, Okla, USA 352-356.
- Nam SH: Stabilizing discrete spectral modeling of audio signals. IEEE Signal Processing Letters 2002,9(9):292-294. 10.1109/LSP.2002.803406
- Oppenheim AV, Johnson DH, Steiglitz K: Computation of spectra with unequal resolution using the fast Fourier transform. Proceedings of the IEEE 1971,59(2):299-301.
- Smith JO III, Abel JS: Bark and ERB bilinear transforms. IEEE Transactions on Speech and Audio Processing 1999,7(6):697-708. 10.1109/89.799695
- Nehorai A: A minimal parameter adaptive notch filter with constrained poles and zeros. IEEE Transactions on Acoustics, Speech, and Signal Processing 1985,33(4):983-996. 10.1109/TASSP.1985.1164643
- Nehorai A, Porat B: Adaptive comb filtering for harmonic signal enhancement. IEEE Transactions on Acoustics, Speech, and Signal Processing 1986,34(5):1124-1138. 10.1109/TASSP.1986.1164952
- Ng TS: Some aspects of an adaptive digital notch filter with constrained poles and zeros. IEEE Transactions on Acoustics, Speech, and Signal Processing 1987,35(2):158-161. 10.1109/TASSP.1987.1165114
- Travassos-Romano JM, Bellanger M: Fast least squares adaptive notch filtering. IEEE Transactions on Acoustics, Speech, and Signal Processing 1988,36(9):1536-1540.
- Li G: A stable and efficient adaptive notch filter for direct frequency estimation. IEEE Transactions on Signal Processing 1997,45(8):2001-2009. 10.1109/78.611196
- van Waterschoot T, Moonen M: Constrained pole-zero linear prediction: an efficient and near-optimal method for multi-tone frequency estimation. Proceedings of the 16th European Signal Processing Conference (EUSIPCO '08), August 2008, Lausanne, Switzerland.
- van Waterschoot T, Diehl M, Moonen M: Constrained pole-zero linear prediction: optimization of cascaded biquadratic notch filters for multi-tone and multi-pitch estimation. Katholieke Universiteit Leuven, Leuven, Belgium; February 2008.
- Ojanperä J, Väänänen M, Yin L: Long term predictor for transform domain perceptual audio coding. Proceedings of the 107th AES Convention, September 1999, New York, NY, USA AES preprint 5036.
- Herre J, Grill B: Overview of MPEG-4 audio and its applications in mobile communications. Proceedings of the 5th International Conference on Signal Processing Proceedings (WCCC-ICSP '00), August 2000, Beijing, China 11-20.
- Kopec GE, Oppenheim AV, Tribolet JM: Speech analysis by homomorphic prediction. IEEE Transactions on Acoustics, Speech, and Signal Processing 1977,25(1):40-49. 10.1109/TASSP.1977.1162909
- Steiglitz K: On the simultaneous estimation of poles and zeros in speech analysis. IEEE Transactions on Acoustics, Speech, and Signal Processing 1977,25(3):229-234. 10.1109/TASSP.1977.1162939
- Mitiche L, Derras B, Adamou-Mitiche ABH: Efficient low-order auto regressive moving average (ARMA) models for speech signals. Acoustics Research Letters Online 2004, 5: 75-81. 10.1121/1.1651193
- Markel JD, Gray AH Jr.: Linear Prediction of Speech. Springer, New York, NY, USA; 1976.
- van Waterschoot T, Moonen M: Linear prediction of audio signals. Proceedings of the 8th Annual Conference of the International Speech Communication Association (INTERSPEECH '07), August 2007, Antwerp, Belgium 3: 518-521.
- Kumaresan R: On the zeros of the linear prediction-error filter for deterministic signals. IEEE Transactions on Acoustics, Speech, and Signal Processing 1983,31(1):217-220. 10.1109/TASSP.1983.1164021
- Ulrych TJ, Bishop TN: Maximum entropy spectral analysis and autoregressive decomposition. Reviews of Geophysics and Space Physics 1975,13(1):183-200. 10.1029/RG013i001p00183
- Qian Y, Chahine G, Kabal P: Pseudo-multi-tap pitch filters in a low bit-rate CELP speech coder. Speech Communication 1994,14(4):339-358. 10.1016/0167-6393(94)90027-2
- Kroon P, Atal BS: Pitch predictors with high temporal resolution. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '90), April 1990, Albuquerque, NM, USA 2: 661-664.
- Laakso TI, Välimäki V, Karjalainen M, Laine UK: Splitting the unit delay [FIR/all pass filters design]. IEEE Signal Processing Magazine 1996,13(1):30-60. 10.1109/79.482137
- Härmä A, Karjalainen M, Savioja L, Välimäki V, Laine UK, Huopaniemi J: Frequency-warped signal processing for audio applications. Journal of the Audio Engineering Society 2000,48(11):1011-1031.
- Haykin S: Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, NJ, USA; 1996.
- Zwicker E, Fastl H: Psychoacoustics, Facts and Models. Springer, Berlin, Germany; 1990.
- Moore BCJ, Glasberg BR: A revision of Zwicker's loudness model. Acta Acustica United with Acustica 1996,82(2):335-345.
- van Waterschoot T, Moonen M: A pole-zero placement technique for designing second-order IIR parametric equalizer filters. IEEE Transactions on Audio, Speech, and Language Processing 2007,15(8):2561-2565.
- Opolko F, Wapnick J: McGill University Master Samples. DVD edition. McGill University, Montreal, Canada; 2006.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.