  • Research Article
  • Open access

Comparison of Linear Prediction Models for Audio Signals

Abstract

While linear prediction (LP) has become immensely popular in speech modeling, it does not seem to provide a good approach for modeling audio signals. This is somewhat surprising, since a tonal signal consisting of a number of sinusoids can be perfectly predicted based on an (all-pole) LP model with a model order that is twice the number of sinusoids. We provide an explanation why this result cannot simply be extrapolated to LP of audio signals. If noise is taken into account in the tonal signal model, a low-order all-pole model appears to be only appropriate when the tonal components are uniformly distributed in the Nyquist interval. Based on this observation, different alternatives to the conventional LP model can be suggested. Either the model should be changed to a pole-zero, a high-order all-pole, or a pitch prediction model, or the conventional LP model should be preceded by an appropriate frequency transform, such as a frequency warping or downsampling. By comparing these alternative LP models to the conventional LP model in terms of frequency estimation accuracy, residual spectral flatness, and perceptual frequency resolution, we obtain several new and promising approaches to LP-based audio modeling.

1. Introduction

Linear prediction (LP) is a widely used and well-understood technique for the analysis, modeling, and coding of speech signals [1]. Its success can be attributed to its correspondence with the speech generation process. The vocal tract can be modeled as a slowly time-varying, low-order all-pole filter, while the glottal excitation can be represented either by a white noise sequence (for unvoiced sounds), or by an impulse train generated by periodic vibrations of the vocal cords (for voiced sounds). By using this so-called source-filter model, a speech segment can be whitened with a cascade of a formant predictor for removing short-term correlation, and a pitch predictor for removing long-term correlation [2].

The source-filter model is much less popular in audio analysis than in speech analysis. First of all, the generation of musical sounds is highly dependent on the instruments used, hence it is hard to propose a generic audio signal generation model. Second, from a physical point of view, polyphonic audio signals should be analyzed using multiple source-filter models, which seems to be rather impractical. Finally, the enormous success of perceptual audio coders [3] and the recent advent of parametric coders based on the sinusoidal model [4], originally proposed for speech analysis and synthesis [5], have shifted the research interest in audio analysis away from the LP approach. Nevertheless, some audio coding algorithms still rely on LP [6–15], which is then usually performed on a warped frequency scale [16]. Also, in audio signal processing applications other than coding, prediction error filters obtained with LP are used for the whitening of audio signals, for example, to produce robust and fast converging acoustic echo and feedback cancelers [17–20].

Since many audio signals exhibit a large degree of tonality, that is, their frequency spectrum is characterized by a finite number of dominant frequency components, it is useful to analyze LP of audio signals in the frequency domain, that is, from a spectral estimation point of view. Intuitively, one could expect that performing LP using a model order that is twice the number of tonal components leads to a signal estimate in which each of the spectral peaks is modeled with a complex conjugate pole pair close to (but inside) the unit circle. In practice, however, this does not seem to be the case, and very often a poor LP signal estimate is obtained. The fundamental problem when performing LP of an audio signal is that, apart from the tonal components, a broadband noise term should generally also be incorporated in the tonal model. The noise term can account either for imperfections in the tonal behavior of the signal or for noise introduced when working with finite-length data windows. Whereas a sum of $N$ sinusoids can be perfectly modeled using an AR($2N$) model, that is, an autoregressive or all-pole model of order $2N$, a sum of $N$ sinusoids plus (white) noise should instead be modeled using an ARMA($2N$, $2N$) model, that is, an autoregressive moving-average or pole-zero model with $2N$ zeros and $2N$ poles [21–25].

A first consequence of incorporating a noise term in the tonal signal model is that the LP spectral estimate is smoothed [22, 26] because the estimated poles are drawn toward the origin of the $z$-plane [22, 27]. A second consequence, which to our knowledge has not been recognized until now, is that the estimated poles tend to be equally distributed around the unit circle when noise is present, even at high signal-to-noise ratios and for low AR model orders. From this observation, it follows that signals with tonal components that are approximately equally distributed in the Nyquist interval can be better represented with an all-pole model than signals that have their tonal components concentrated in a selected region of the Nyquist interval. Unfortunately, audio signals tend to belong to the latter class of signals, since they are typically sampled at a sampling frequency that is much higher than the frequency of their dominating tonal components.

In [28], it was shown that audio signals having their dominating tonal components in a frequency region that is small compared to the entire signal bandwidth may exhibit a large autocorrelation matrix eigenvalue spread and hence tend to produce inaccurate LP models due to numerical instability. A stabilization method based on a selective LP (SLP) model [1] was proposed, which reduces the LP model bandwidth to the frequency region of interest. The influence of the signal frequency distribution on LP performance was also recognized with the development of the so-called frequency-warped linear prediction (WLP) [12, 16]. The warping operation is a nonuniform frequency transform which is usually designed to approximate the constant-$Q$ frequency scale [29], and also provides a good match with the Bark or ERB psychoacoustic scales, provided that the warping parameter is chosen properly [30]. In [12], WLP was shown to outperform conventional LP in terms of resolving adjacent peaks in the signal spectrum; however, no gain in spectral flatness of the LP residual was obtained. We will review the SLP and WLP models, as well as three other LP models that appear to be suited for tonal audio signals, and show how all of these models are capable of solving the frequency distribution issue described above. More specifically, we will also consider high-order all-pole models [22], constrained pole-zero models [24, 25, 31–37], and pitch prediction models. Pitch prediction (PLP), also known as long-term prediction, was originally proposed for speech modeling and coding, and was more recently applied to audio signal modeling in the context of the MPEG-4 advanced audio coder (AAC) [38, 39]. High-order (HOLP) and pole-zero (PZLP) linear prediction models have not been applied to audio modeling before; however, some speech analysis techniques rely on a PZLP model [40–42].
All considered approaches result in stable LP models, and some outperform the WLP model both in terms of conventional measures, such as frequency estimation error and residual spectral flatness [43, Chapter 6], and in terms of perceptually motivated measures, such as interpeak dip depth (IDD) [12]. Moreover, many of these alternative models perform even better when cascaded with a conventional LP model. The LP models described in this paper were evaluated and compared experimentally for a synthetic audio signal in [44]. This work is extended here by also performing a mathematical analysis of the different LP models, and describing additional simulation results for synthetic signals and true monophonic and polyphonic audio signals.

This paper is organized as follows. Section 2 provides some background material on the signal model and the LP criterion. In Section 3, we analyze the performance of the conventional LP model, and illustrate the influence of the distribution of the tonal components in the analyzed signal. In Section 4, five alternative LP models are reviewed and interpreted as potential solutions to the observed frequency distribution problem. The emphasis is on the influence of using models other than the conventional low-order all-pole model, and not on how the model parameters are estimated. However, for each LP model, references to existing estimation methods are provided. LP model pole-zero plots and magnitude responses for a synthetic audio signal are presented throughout Sections 3 and 4. A detailed analysis is only provided for the pole-zero LP model, since all other alternative LP models are all-pole models, which can be analyzed using an approach similar to the conventional LP model analysis in Section 3. In Section 5, we provide LP model pole-zero plots and magnitude responses for true monophonic and polyphonic audio signals. Furthermore, the conventional and alternative LP models are compared in terms of frequency estimation accuracy, residual spectral flatness, and perceptual frequency resolution, both for synthetic and true audio signals. Finally, Section 6 concludes the paper.

2. Preliminaries

2.1. Tonal Audio Signal Model

We will only consider tonal audio signals, that is, signals having a continuous spectrum containing a finite number of dominant frequency components. In this way, the majority of audio signals is covered, except for the class of percussive sounds. The performance of the different LP models described below will be evaluated for three types of audio signals: synthetic audio signals consisting of a sum of harmonic sinusoids in white noise, true monophonic audio signals, and true polyphonic audio signals.

The fundamental frequency of monophonic audio signals is, for most musical instruments, in the range 100–1000 Hz. The number of relevant harmonics (i.e., frequency components at multiples of the fundamental frequency, having a magnitude that is significantly larger than the average signal power) is typically between 10 and 20. It can thus be seen that most dominating frequency components in audio signals sampled at $f_s = 44.1$ kHz lie in the lower half of the Nyquist interval, that is, between 0 and 11025 Hz (corresponding to the angular frequency range from 0 to $\pi/2$). This property will be a key issue in the rest of the paper.

As for speech signals, we can also assume short-term stationarity for audio signals. Monophonic audio signals can typically be divided into musical notes of different durations. Each note can then be subdivided into four parts: the attack, decay, sustain, and release parts. The sustain part is usually the longest part of the note, and exhibits the highest degree of stationarity. The attack and decay parts are the shortest, and may show transient behavior, such that stationarity can only be assumed on very short time windows (a few milliseconds). Whereas LP of speech signals is typically performed on time windows of around 20 milliseconds, longer windows appear to be beneficial for LP of audio signals. In our examples, a time window of 46.4 milliseconds is used, corresponding to $M = 2048$ samples at $f_s = 44.1$ kHz, or, in musical terms, a 1/32 note at 161.5 beats per minute. In our theoretical derivations, however, we will assume $M \to \infty$ to avoid window end effects.
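The window length quoted above can be checked with a few lines of arithmetic. The sketch below assumes a window of 2048 samples at 44.1 kHz (values inferred from the stated 46.4 milliseconds) and assumes that the beat is a quarter note, so that a 1/32 note lasts one eighth of a beat; the constant names are illustrative only.

```python
# Assumed values: 2048-sample window at 44.1 kHz, quarter-note beat.
FS_HZ = 44100.0
WINDOW_SAMPLES = 2048
TEMPO_BPM = 161.5

# Window duration in milliseconds.
window_ms = 1000.0 * WINDOW_SAMPLES / FS_HZ

# Duration of a 1/32 note: one eighth of a quarter-note beat.
quarter_note_ms = 60000.0 / TEMPO_BPM
note_32_ms = quarter_note_ms / 8.0

print(f"window: {window_ms:.2f} ms, 1/32 note: {note_32_ms:.2f} ms")
```

Under these assumptions both durations agree to within a fraction of a microsecond, which is consistent with the musical interpretation given in the text.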

The underlying signal model that is assumed for all audio signals throughout this paper is as follows:

$$y(t) = \sum_{n=1}^{N} \alpha_n \cos(\omega_n t + \phi_n) + r(t), \quad t = 1, \ldots, M, \tag{1}$$

where, for ease of notation, the time index $t$ has been normalized with respect to the sampling period $T_s$. This signal model is referred to as the tonal signal model, and may differ from the sinusoidal model [5] used in speech and audio coding in that only the tonal components in the observed audio signal are modeled by sinusoids, while the nontonal components are contained in the noise term $r(t)$. The tonal components correspond to the fundamental frequencies and their relevant harmonics and are characterized by their amplitudes $\alpha_n$, (radial) frequencies $\omega_n$, and phases $\phi_n$. The noise term will generally have a nonwhite, continuous spectrum, and may also contain low-power harmonics.

Two special cases of the tonal signal model are of particular interest in audio signal modeling. In the monophonic signal model, it is assumed that all tonal components are harmonically related to a single fundamental frequency $\omega_0$, that is,

$$y(t) = \sum_{n=1}^{N} \alpha_n \cos(n \omega_0 t + \phi_n) + r(t). \tag{2}$$

In the polyphonic signal model, the signal is assumed to contain multiple sets of harmonically related sinusoids, with multiple fundamental frequencies $\omega_{0,k}$, $k = 1, \ldots, K$:

$$y(t) = \sum_{k=1}^{K} \sum_{n=1}^{N_k} \alpha_{n,k} \cos(n \omega_{0,k} t + \phi_{n,k}) + r(t). \tag{3}$$

Note that the number of relevant harmonics ($N_k$) may differ for each of the fundamental frequencies $\omega_{0,k}$, and that only one overall noise term $r(t)$ is added.

The monophonic signal model in (2) is a harmonic signal model, while the tonal and polyphonic signal models in (1) and (3) are not. We should stress that of all LP models described below, the pitch prediction model described in Section 4.3 is the only model in which the harmonicity property is exploited. The other models do not rely on harmonicity, although the calculation of the LP model parameters may be simplified by taking harmonicity into account.

Example 1 (synthetic audio signal).

A synthetic audio signal, generated from the monophonic signal model in (2), is well suited for examining the properties of the LP models presented below, since it provides exact knowledge of the fundamental frequency and the number of harmonics. In the examples throughout Sections 3 and 4, a synthetic audio signal is used with $M = 2048$ samples, $N = 15$ tonal components, and random, uniformly distributed amplitudes $\alpha_n$ and phases $\phi_n$. The synthetic audio signal and its magnitude spectrum are shown in Figures 1(a) and 1(b), respectively. The radial fundamental frequency was chosen to be $\omega_0 = \pi/32$, that is, with 64 samples per period, such that, at $f_s = 44.1$ kHz, the fundamental frequency $f_0 \approx 689.1$ Hz is in the midrange of musical notes (i.e., slightly lower than F5). The fundamental frequency and its harmonics are then also in the discrete set of frequencies at which the length-$M$ discrete Fourier transform (DFT) is evaluated (see Figure 1(b)). The pitch period being equal to an integer number of sampling periods (64) will allow us to clearly illustrate the effect of pitch prediction in Section 4.3. Finally, this choice of $\omega_0$ will also yield an integer downsampling operation in the SLP method in Section 4.5.
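A signal of this kind can be generated along the following lines. This is only a sketch of the harmonic model in (2) without the noise term: the window length (2048), the number of harmonics (15), the amplitude range, and the random seed are assumptions made for illustration, and the function names are hypothetical. The last lines check that each harmonic falls exactly on a DFT bin, as stated in the example.

```python
import cmath
import math
import random

FS_HZ = 44100.0
M = 2048          # assumed window length in samples
N_HARM = 15       # assumed number of harmonics
PERIOD = 64       # samples per pitch period
OMEGA0 = 2.0 * math.pi / PERIOD
F0_HZ = FS_HZ / PERIOD

def synth_harmonic(m=M, n_harm=N_HARM, omega0=OMEGA0, seed=0):
    """Noiseless sum of harmonics with random amplitudes and phases."""
    rng = random.Random(seed)
    amps = [rng.uniform(0.5, 1.0) for _ in range(n_harm)]   # assumed range
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_harm)]
    y = [
        sum(a * math.cos((n + 1) * omega0 * t + p)
            for n, (a, p) in enumerate(zip(amps, phases)))
        for t in range(m)
    ]
    return y, amps, phases

def dft_bin(y, k):
    """Single bin of the length-len(y) DFT, computed directly."""
    m = len(y)
    return sum(v * cmath.exp(-2j * math.pi * k * t / m) for t, v in enumerate(y))

y, amps, phases = synth_harmonic()
bins_per_harmonic = M // PERIOD           # harmonic n sits at bin n * 32
third_harmonic = abs(dft_bin(y, 3 * bins_per_harmonic))
between_harmonics = abs(dft_bin(y, 48))   # halfway between harmonics 1 and 2
print(F0_HZ, third_harmonic, between_harmonics)
```

Because the window holds an integer number of pitch periods, the bin magnitude at harmonic $n$ equals $\alpha_n M / 2$ and the bins between harmonics are zero up to rounding, so no spectral leakage appears in the noiseless case.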

Figure 1

Synthetic audio signal: (a) time-domain waveform, (b) magnitude spectrum.

2.2. Linear Prediction Criterion

The aim of LP is to obtain a linear parametric model $H(z, \theta)$ that predicts the observed signal $y(t)$ up to an uncorrelated residual $e(t)$:

$$Y(z) = H(z, \theta) E(z), \tag{4}$$

or

$$E(z) = H^{-1}(z, \theta) Y(z), \tag{5}$$

where $\theta$ represents a vector that contains the LP model parameters, $Y(z)$ and $E(z)$ denote the $z$-transform of the observed and residual signal, respectively, and $H^{-1}(z, \theta)$ corresponds to the prediction error filter (PEF), which has the property of whitening the input signal $y(t)$. The PEF transfer function $H^{-1}(z, \theta)$ is required to be stable, while the LP model transfer function $H(z, \theta)$ is not. In fact, when modeling sinusoidal components in the observed signal $y(t)$, an unstable LP model having poles on the unit circle can be very useful.

The LP model is generally an infinite impulse response (IIR) model, that is,

(6)

with the numerator and denominator orders defined as $n_B$ and $n_A$, respectively. Whereas the conventional LP model is an all-pole model (i.e., $n_B = 0$), in this paper we also consider pole-zero LP models. For analyzing the LP performance for tonal input signals, it will be useful to consider the radial representation of the LP model transfer function:

(7)

with denoting the zero and pole radii, and the numerator and denominator resonance frequencies, respectively. In the sequel, we will assume , such that the LP model parameter vector can be defined as follows:

(8)

From a spectral estimation point of view, the parameter vector $\theta$ should be estimated such that the LP residual $e(t, \theta)$ has an approximately flat spectrum [1]. In the case of audio LP, the residual does not have to be a white noise signal, as is often assumed in other LP applications, but it can also be a Dirac impulse, which also has a flat spectrum. The parameter vector estimate $\hat{\theta}$ is the result of minimizing a least-squares (LS) criterion, which can be expressed in the time domain as well as in the frequency domain, following the Parseval theorem:

$$\hat{\theta} = \arg\min_{\theta} \sum_{t=1}^{M} e^2(t, \theta) = \arg\min_{\theta} \frac{1}{M} \sum_{k=0}^{M-1} |E(k, \theta)|^2, \tag{9}$$

with $E(k, \theta) = \sum_{t=1}^{M} e(t, \theta) e^{-j 2\pi k t / M}$, $k = 0, \ldots, M-1$, the $M$-point discrete Fourier transform (DFT) of the LP residual, and $M$ the observation window length.
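The remark that a Dirac impulse is as spectrally flat as white noise can be checked numerically. The sketch below uses the common definition of spectral flatness as the ratio of the geometric to the arithmetic mean of the power spectrum; this definition is an assumption here and may differ in detail from the measure used later in the paper. The decaying exponential is a hypothetical example of a residual that is not flat.

```python
import cmath
import math

def power_spectrum(x):
    """Squared magnitude of the length-len(x) DFT, computed directly."""
    m = len(x)
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * t / m) for t, v in enumerate(x))) ** 2
        for k in range(m)
    ]

def spectral_flatness(power):
    """Geometric mean over arithmetic mean of a strictly positive power spectrum."""
    geometric = math.exp(sum(math.log(p) for p in power) / len(power))
    arithmetic = sum(power) / len(power)
    return geometric / arithmetic

impulse = [1.0] + [0.0] * 63              # Dirac impulse residual
colored = [0.9 ** t for t in range(64)]   # low-pass, clearly non-flat sequence

flat_impulse = spectral_flatness(power_spectrum(impulse))
flat_colored = spectral_flatness(power_spectrum(colored))
print(flat_impulse, flat_colored)
```

The impulse reaches the maximum flatness of one, whereas the colored sequence scores well below one, which is why a residual consisting of a single impulse is acceptable under a flatness criterion.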

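In the theoretical analysis, we will assume an infinitely long observation window ($M \to \infty$), such that (9), after normalization by the window length, becomes

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{2\pi} \int_{-\pi}^{\pi} \Phi_e(\omega, \theta) \, d\omega = \arg\min_{\theta} \frac{1}{2\pi} \int_{-\pi}^{\pi} |H^{-1}(e^{j\omega}, \theta)|^2 \Phi_y(\omega) \, d\omega, \tag{10}$$

using (5) to obtain the second equality, in which $\Phi_e(\omega, \theta)$ is the power spectrum of the LP residual, $|H^{-1}(e^{j\omega}, \theta)|$ denotes the PEF magnitude response, and $\Phi_y(\omega)$ is the power spectrum of $y(t)$. From the tonal signal model in (1), and assuming that the cross-spectrum of the tonal part and the noise part of $y(t)$ is zero, we obtain

$$\Phi_y(\omega) = \frac{\pi}{2} \sum_{n=1}^{N} \alpha_n^2 \left[ \delta(\omega - \omega_n) + \delta(\omega + \omega_n) \right] + \Phi_r(\omega), \tag{11}$$

where $\alpha_n$ and $\omega_n$ are the amplitudes and frequencies of the tonal components and $\Phi_r(\omega)$ is the power spectrum of the noise term $r(t)$, such that (10) can be rewritten, using $|H^{-1}(e^{-j\omega}, \theta)| = |H^{-1}(e^{j\omega}, \theta)|$, as

$$\hat{\theta} = \arg\min_{\theta} \left\{ \sum_{n=1}^{N} \frac{\alpha_n^2}{2} |H^{-1}(e^{j\omega_n}, \theta)|^2 + \frac{1}{2\pi} \int_{-\pi}^{\pi} |H^{-1}(e^{j\omega}, \theta)|^2 \Phi_r(\omega) \, d\omega \right\}. \tag{12}$$

To simplify the analysis, we assume that the noise term in the tonal signal model has a flat spectrum, that is, $\Phi_r(\omega) = \sigma_r^2$ for all $\omega$, such that

$$\hat{\theta} = \arg\min_{\theta} \left\{ \sum_{n=1}^{N} \frac{\alpha_n^2}{2} |H^{-1}(e^{j\omega_n}, \theta)|^2 + \frac{\sigma_r^2}{2\pi} \int_{-\pi}^{\pi} |H^{-1}(e^{j\omega}, \theta)|^2 \, d\omega \right\}. \tag{13}$$

This approximation can be justified in the LP analysis by noting that the noise term in the tonal signal model is spectrally much flatter than the tonal part of the observed signal.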
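To make the two competing terms of the criterion concrete, the following sketch evaluates them for a single sinusoid and three candidate all-zero PEFs. It assumes the criterion takes the standard form of a tonal term, $\sum_n (\alpha_n^2/2)\,|A(e^{j\omega_n})|^2$, plus a white-noise term, which by Parseval's theorem equals the noise variance times the sum of squared PEF coefficients. The function names and the numerical values (frequency, noise variance, zero radius 0.9) are hypothetical.

```python
import cmath
import math

def pef_response(a, omega):
    """Frequency response of an all-zero PEF with coefficients a[0..p]."""
    return sum(c * cmath.exp(-1j * omega * i) for i, c in enumerate(a))

def lp_criterion(a, amps, omegas, noise_var):
    """Return (tonal term, noise term) of the assumed LP criterion."""
    tonal = sum(0.5 * al ** 2 * abs(pef_response(a, w)) ** 2
                for al, w in zip(amps, omegas))
    noise = noise_var * sum(c * c for c in a)   # Parseval: mean |A|^2 = sum a_i^2
    return tonal, noise

omega1 = math.pi / 8
amps, omegas, noise_var = [1.0], [omega1], 0.01

candidates = {
    "no prediction": [1.0],
    "zeros on unit circle": [1.0, -2.0 * math.cos(omega1), 1.0],
    "zeros at radius 0.9": [1.0, -2.0 * 0.9 * math.cos(omega1), 0.81],
}
totals = {}
for name, coeffs in candidates.items():
    tonal, noise = lp_criterion(coeffs, amps, omegas, noise_var)
    totals[name] = tonal + noise
    print(f"{name}: tonal={tonal:.6f} noise={noise:.6f} total={tonal + noise:.6f}")
```

In this toy setting the exact notch removes the tonal term completely but pays for it in the noise term, and moving the zeros slightly inside the unit circle lowers the total, which is the compromise discussed in the next section.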

3. Conventional Linear Prediction Model

We now analyze the minimization of the LP criterion in (13) for a conventional, all-pole LP model. The PEF is in this case an all-zero filter:

(14)

We will examine the effect of setting the model order to $2N$, since we know that an AR($2N$) model should be capable of perfectly modeling a noiseless sum of $N$ sinusoids [25]. However, in the tonal signal model (1), a noise term is also present, hence the solution to the LP estimation problem will be a compromise between attenuating the tonal components and increasing (or maintaining) the flatness of the noise spectrum. In [22], this compromise was analyzed with respect to its effect on the radii of the PEF zeros, while disregarding the effect on the PEF zero angles. In our analysis, we will focus on the effect of the noise on the estimated PEF zero angles.

The LP model parameters can be obtained as the solution to a system of equations that is obtained by differentiating the LP criterion in (13) with respect to the PEF zero radii and angles, that is,

(15)

We will first consider the case in which the noise term is equal to zero. In this case, the LP estimation problem can be formulated as follows:

(16)

which leads to the following system of equations:

(17)
(18)

From the PEF transfer function in (14), we can calculate the PEF magnitude response and its partial derivatives with respect to the parameters:

(19)
(20)
(21)

The system of equations (17)-(18) with (20)-(21) generally has multiple solutions, even when the PEF zero angles are constrained to lie in $[0, \pi]$, which correspond to (local) minima of the LP criterion. The global minimum in the noiseless case is obtained for the parameter values

(22)

The PEF, thus, behaves as a cascade of second-order all-zero notch filters, with all the zeros on the unit circle and with the notch frequencies equal to the frequencies of the tonal components. Note that the corresponding LP model transfer function is in this case unstable.
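The noiseless result can be verified directly: a cascade of second-order notch sections with zeros on the unit circle at the tonal frequencies annihilates a sum of sinusoids. The sketch below builds such a PEF by convolving the sections $1 - 2\cos(\omega_n) z^{-1} + z^{-2}$ and filters a three-component test signal; the frequencies, amplitudes, phases, and helper names are arbitrary choices for illustration.

```python
import math

def convolve(x, h):
    """Full linear convolution of two coefficient lists."""
    out = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            out[i + j] += xv * hv
    return out

def notch_cascade(omegas):
    """PEF coefficients with one unit-circle zero pair per tonal frequency."""
    a = [1.0]
    for w in omegas:
        a = convolve(a, [1.0, -2.0 * math.cos(w), 1.0])
    return a

def pef_residual(y, a):
    """Apply the all-zero PEF, skipping the first len(a) - 1 start-up samples."""
    p = len(a) - 1
    return [sum(a[i] * y[t - i] for i in range(p + 1)) for t in range(p, len(y))]

omegas = [0.3, 0.45, 1.1]      # arbitrary tonal frequencies (rad/sample)
amps = [1.0, 0.6, 0.8]
phases = [0.2, 1.0, 2.5]
y = [sum(al * math.cos(w * t + ph) for al, w, ph in zip(amps, omegas, phases))
     for t in range(200)]

a = notch_cascade(omegas)       # order 2N = 6 for N = 3 sinusoids
residual = pef_residual(y, a)
max_residual = max(abs(e) for e in residual)
print(len(a) - 1, max_residual)
```

The residual vanishes up to rounding error for any amplitudes and phases, since each second-order section cancels one sinusoid independently of the others.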

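Next, we will illustrate the influence of a nonzero noise term on the solution (22) obtained in the noiseless case. The second term in the LP criterion (13), which is due to the noise, can be rewritten using the Parseval theorem as follows:

$$\frac{\sigma_r^2}{2\pi} \int_{-\pi}^{\pi} |H^{-1}(e^{j\omega}, \theta)|^2 \, d\omega = \sigma_r^2 \sum_{i=0}^{2N} a_i^2, \tag{23}$$

where $\sigma_r^2$ is the noise variance and $a_i$, $i = 0, \ldots, 2N$, are the PEF impulse response coefficients. It can, hence, be seen that this term acts as a minimum norm constraint in the LP criterion, in the sense that it penalizes the squared norm of the PEF impulse response coefficient vector:

$$\|\mathbf{a}\|_2^2 = \mathbf{a}^T \mathbf{a}, \quad \mathbf{a} = [a_0 \; a_1 \; \cdots \; a_{2N}]^T. \tag{24}$$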
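The Parseval step can be confirmed numerically for any finite-length PEF: the average of the squared magnitude response over a uniform frequency grid equals the sum of squared impulse response coefficients, provided the grid has more points than the filter order. The coefficients below are an arbitrary second-order example and the function name is hypothetical.

```python
import cmath
import math

def mean_square_response(a, n_grid=256):
    """Average of |A(e^{j omega})|^2 over n_grid uniformly spaced frequencies."""
    total = 0.0
    for k in range(n_grid):
        omega = 2.0 * math.pi * k / n_grid
        response = sum(c * cmath.exp(-1j * omega * i) for i, c in enumerate(a))
        total += abs(response) ** 2
    return total / n_grid

a = [1.0, -1.8, 0.81]                   # arbitrary all-zero PEF coefficients
frequency_side = mean_square_response(a)
time_side = sum(c * c for c in a)       # squared norm of the coefficient vector
print(frequency_side, time_side)
```

Multiplying either side by the noise variance gives the noise contribution to the criterion, which is why white noise acts as a penalty on the coefficient norm.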

This minimum norm constraint has two effects on the solution (22) that was obtained in the noiseless case. A first effect, which was investigated in [22], is that the estimated PEF zeros are drawn toward the origin of the $z$-plane, and hence the estimated PEF zero radii are less than one. A second effect is related to the estimated PEF zero angles. Consider the following constrained estimation problem:

(25)

In this estimation problem, the squared norm of the PEF impulse response coefficient vector is minimized under a constraint that rules out the trivial solution in which all prediction coefficients are zero. It is straightforward to see that the solution to (25) can be obtained by setting $a_1 = \cdots = a_{2N-1} = 0$ and $a_{2N} = \pm\rho^{2N}$, with $\rho$ the common zero radius, which results in a PEF that behaves as a comb filter. The PEF zeros are then uniformly distributed on a circle with radius $\rho$, and with an angle $\pi/N$ between the neighboring zeros. In case $a_{2N} = \rho^{2N}$, the PEF zero angles in the Nyquist interval correspond to $(2i-1)\pi/(2N)$, $i = 1, \ldots, N$, while if $a_{2N} = -\rho^{2N}$, the PEF has $N+1$ zeros in the Nyquist interval, that is, at $i\pi/N$, $i = 0, \ldots, N$. The latter case corresponds to a one-tap pitch prediction filter (see Section 4.3), which in fact deviates from the conventional LP model in (14), since the zeros at DC and at the Nyquist frequency do not have a corresponding complex conjugate zero.

We can, therefore, expect that when noise is present, the estimated PEF zeros are both shifted toward the origin and rotated around the origin, hence tending to a uniform angular distribution. The extent to which the zeros are displaced as compared to the noiseless solution depends on the noise power, which determines the relative importance of the minimum norm constraint in the LP criterion (13). The angular effect described above can also be observed in the noiseless case when the LP model order exceeds $2N$, in which case the "extraneous" PEF zeros tend to be uniformly distributed around the unit circle if a minimum norm constraint is incorporated in the LP criterion [45].

Example 2 (conventional LP of synthetic audio signal).

When we estimate a conventional LP model of order $2N = 30$ for the synthetic audio signal defined in Example 1, using the covariance method [1] to calculate the model parameters, we obtain a PEF as illustrated by the pole-zero plot and magnitude response in Figures 2(a) and 2(b), respectively. The conventional LP model nearly succeeds at correctly modeling all the tonal components in the synthetic audio signal. However, if we add Gaussian white noise to the observed signal, the covariance method yields the estimated conventional LP model shown in Figures 3(a) and 3(b), for a signal-to-noise ratio (SNR) of 25 dB. The PEF zero configuration is in this case clearly a compromise between the LP solutions to the tonal part and the noise part of the signal. The PEF has 9 complex conjugate zero pairs in the frequency region of the sum of sinusoids, and another 6 complex conjugate zero pairs which are nearly uniformly distributed in the upper half of the Nyquist interval. A similar result is obtained when we use the autocorrelation method [1] instead of the covariance method to predict the noiseless synthetic audio signal. Indeed, the autocorrelation method introduces noise in the autocorrelation domain by distorting the signal periodicity due to zero padding. This example illustrates the above statement that for conventional LP models, the PEF zero configuration is a tradeoff between suppressing the tonal components and keeping the noise spectrum as flat as possible. Note that in the absence of noise (Figure 2(b)), the PEF high-frequency response may become extremely large.
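A conventional low-order LP analysis of this kind can be sketched as follows. Note that this sketch uses the autocorrelation method with the Levinson-Durbin recursion rather than the covariance method used for the figures, because it needs no linear algebra library; the signal parameters (2048 samples, 15 harmonics, amplitude range, seed) are assumptions, and the function names are hypothetical.

```python
import math
import random

def synth_noisy(m=2048, n_harm=15, period=64, snr_db=25.0, seed=1):
    """Harmonic signal plus white Gaussian noise at the requested SNR."""
    rng = random.Random(seed)
    omega0 = 2.0 * math.pi / period
    amps = [rng.uniform(0.5, 1.0) for _ in range(n_harm)]   # assumed range
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_harm)]
    tonal_power = sum(0.5 * a * a for a in amps)
    sigma = math.sqrt(tonal_power / 10.0 ** (snr_db / 10.0))
    return [
        sum(a * math.cos((n + 1) * omega0 * t + p)
            for n, (a, p) in enumerate(zip(amps, phases))) + rng.gauss(0.0, sigma)
        for t in range(m)
    ]

def autocorrelation(y, max_lag):
    """Autocorrelation sums r[0..max_lag] of the windowed signal (unnormalized)."""
    m = len(y)
    return [sum(y[t] * y[t - k] for t in range(k, m)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations; returns PEF coefficients, error, reflection coeffs."""
    a = [1.0] + [0.0] * order
    err = r[0]
    reflection = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        reflection.append(k)
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a, err, reflection

y = synth_noisy()
r = autocorrelation(y, 30)
pef, err, reflection = levinson_durbin(r, 30)   # PEF: 1 + sum_j pef[j] z^-j
print(f"residual-to-signal energy ratio: {err / r[0]:.4f}")
```

The reflection coefficients returned by the recursion all have magnitude below one, which guarantees a minimum-phase PEF; the zero locations themselves can be inspected by rooting the polynomial with a numerical library of choice.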

Figure 2

Conventional LP model of synthetic audio signal with order $2N = 30$ and covariance method: (a) PEF pole-zero plot, (b) PEF magnitude response.

Figure 3

Conventional LP model of synthetic audio signal plus noise (SNR = 25 dB) with order $2N = 30$ and covariance method: (a) PEF pole-zero plot, (b) PEF magnitude response.

4. Alternative Linear Prediction Models

In this section, we present five existing alternative LP models, and we illustrate how all these models attempt to compensate for the shortcomings of the conventional LP model, described in Section 3, when the input signal tonal components are concentrated in the lower half of the Nyquist interval. In the first three alternative LP models, namely, the constrained pole-zero LP (PZLP) model, the high-order LP (HOLP) model, and the pitch prediction (PLP) model, the influence of the input signal frequency distribution is decreased by using a model different from the conventional low-order all-pole model. In the last two alternative LP models, namely, the warped LP (WLP) model and the selective LP (SLP) model, the performance of the conventional low-order all-pole model is increased by first transforming the input signal such that its tonal components are spread in the entire Nyquist interval. As stated earlier, we will mainly focus on the alternative LP models, and not on how the model parameters can be estimated.

4.1. Constrained Pole-Zero LP Model

It is well known that whereas a sum of $N$ sinusoids can be exactly modeled using an AR($2N$) model, a sum of $N$ sinusoids plus white noise should be modeled using an ARMA($2N$, $2N$) model [21–24] with equal coefficients in the AR and MA parts, that is, with the zeros coinciding with the poles [23, 25]. This observation can be extended to a sum of (finite-bandwidth) damped sinusoids plus white noise, but in this case the zeros should be slightly displaced toward the origin, remaining on the same radial line as the poles [24, 25]. The LP model in (7) can then be simplified to a constrained pole-zero LP (PZLP) model with an equal number of poles and zeros:

$$H(z, \theta) = \prod_{i=1}^{N} \frac{1 - 2\nu_i \cos\vartheta_i \, z^{-1} + \nu_i^2 z^{-2}}{1 - 2\rho_i \cos\vartheta_i \, z^{-1} + \rho_i^2 z^{-2}}, \tag{26}$$

where $\nu_i$ and $\rho_i$ denote the zero and pole radii, with the constraint being that the poles and zeros are on the same radial lines, that is, each pole pair shares its angle $\vartheta_i$ with a zero pair, and with the poles positioned between the zeros and the unit circle, that is, $\nu_i < \rho_i \le 1$.
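The notch behavior implied by this constraint can be illustrated by evaluating the magnitude response of the corresponding PEF, that is, the inverse of the PZLP model, whose zeros are the model poles and whose poles are the model zeros. The sketch below places the PEF zeros on the unit circle and the PEF poles at radius 0.95 on the same radial lines, for 15 harmonics of $\pi/32$; these values and the function name are assumptions chosen for illustration, not the settings used in the paper.

```python
import cmath
import math

def pzlp_pef_response(omega, angles, zero_radius, pole_radius):
    """PEF response: product of constrained second-order pole-zero sections."""
    z_inv = cmath.exp(-1j * omega)
    response = 1.0 + 0.0j
    for theta in angles:
        num = 1.0 - 2.0 * zero_radius * math.cos(theta) * z_inv + zero_radius ** 2 * z_inv ** 2
        den = 1.0 - 2.0 * pole_radius * math.cos(theta) * z_inv + pole_radius ** 2 * z_inv ** 2
        response *= num / den
    return response

omega0 = math.pi / 32
angles = [n * omega0 for n in range(1, 16)]   # 15 harmonic notch frequencies
ZERO_RADIUS = 1.0     # PEF zeros (LP model poles) on the unit circle
POLE_RADIUS = 0.95    # PEF poles (LP model zeros) slightly inside

notch_gain = abs(pzlp_pef_response(angles[4], angles, ZERO_RADIUS, POLE_RADIUS))
gain_high_a = abs(pzlp_pef_response(0.7 * math.pi, angles, ZERO_RADIUS, POLE_RADIUS))
gain_high_b = abs(pzlp_pef_response(0.9 * math.pi, angles, ZERO_RADIUS, POLE_RADIUS))
print(notch_gain, gain_high_a, gain_high_b)
```

The response is essentially zero at each notch frequency and nearly constant across the upper part of the Nyquist interval, so the tonal components are removed without reshaping the noise spectrum there. The constant gain is not unity in this unnormalized form; it depends on the pole radius and the number of sections.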

We now analyze the PZLP model performance for predicting tonal signals corresponding to the signal model (1), when , by substituting the PEF magnitude response , obtained by inverting the magnitude response of in (26), in the LP criterion (13). First, we evaluate the second term of the LP criterion (13). Using the direct-form representation of the PZLP model in (6), with and , the PEF magnitude response can be calculated as

(27)
(28)

with and the autocorrelation functions of the PEF numerator and denominator coefficients, respectively. Note that when predicting tonal signals, the PEF poles and zeros are typically very close to the unit circle, and the PEF zeros are allowed to lie on the unit circle. We can then approximately state that the PEF pole radii are equal, that is, and likewise that the PEF zero radii are equal, that is, . In this case, the numerator and denominator of the PEF transfer function admit a particular structure, as shown in [31]:

(29)

and, as a consequence, the autocorrelation function of the PEF numerator coefficients can be rewritten, for , as

(30)

and similarly for , by replacing with in (30). Since and are assumed to be close to 1, we can make the following approximations:

(31)

where denotes the floor function, which returns the highest integer less than or equal to . We can hence rewrite in (30) and as

(32)

with

(33)

Substituting (32) in (28) yields

(34)

which is expected to be a good approximation except in the close neighborhood of the PEF pole-zero angles , where the PEF magnitude response approaches zero because the PEF zeros are closer to the unit circle than the poles. However, when integrating the PEF magnitude response over the entire frequency range , the notches in at are negligible, such that the second term in the LP criterion (13) can be written as

(35)

We now consider the minimization of the LP criterion (13) for the PZLP model (26), assuming that and with and using the approximation (31) such that the result in (35) can be applied. Since and are close to each other, they cannot be treated as independent variables, and minimizing the LP criterion with respect to and can be achieved by setting the total derivative with respect to and to zero, which leads to the following system of equations:

(36)
(37)
(38)

with

(39)

Since and are close to each other, we can assume

(40)

Moreover,

(41)

Substituting (39)–(41) in (37) and (38) and noting that the expression in (35) does not depend on the PEF pole-zero angles , we can see that all the terms in the system of (36)–(38) that are due to the noise component in the observed signal cancel out. In other words, if the PEF poles and zeros are close to the unit circle, then the solution to the LP estimation problem using the PZLP model is insensitive to (white) noise in the observed signal. This is the main strength of the PZLP model as compared to the conventional LP model, which was shown in Section 3 to be much more sensitive to noise when predicting tonal signals.

It remains to show that the PEF angles calculated from (36)–(38) converge to the frequencies of the tonal components. The PZLP PEF magnitude response and its partial derivatives with respect to , , and can be calculated as

(42)

where denotes with

(43)

The global minimum of (13) with , corresponding to , is obtained when

(44)

or, equivalently,

(45)

and, hence, following the assumption that the PEF poles are close to the zeros, .

Example 3 (constrained pole-zero LP of synthetic audio signal).

The PZLP model parameters can be estimated, either using an adaptive notch filtering (ANF) algorithm, for which several implementations have been suggested [24, 25, 31–35], or using the constrained pole-zero linear prediction (CPZLP) algorithm for multitone frequency estimation [36, 37]. Alternatively, if the PEF pole and zero radii are fixed a priori, any existing frequency estimation algorithm may be used to estimate the unknown PEF angles. When harmonicity can be assumed, that is, for monophonic audio signals, an adaptive comb filter (ACF) may be a useful alternative to the ANF, as it relies on only one unknown parameter (i.e., the fundamental frequency) [32, 35]. Similarly, a comb filter-based variant of the CPZLP algorithm has been described in [37].

Figures 4(a) and 4(b) show the PEF pole-zero plot and magnitude response of a PZLP model of the synthetic audio signal introduced in Example 1, and with additive Gaussian white noise (SNR  = 25 dB). The PZLP model parameters were calculated using the CPZLP algorithm with a comb filter model [37] of order , pole radius , and zero radius , and with a numerical line search method using the BFGS quasi-Newton algorithm with initial fundamental frequency estimate and line search parameters as suggested in [36]. It can be seen that the PEF magnitude response exhibits a notch filter behavior at the frequencies of the tonal components, while being approximately flat in the remainder of the Nyquist interval.

Figure 4

Constrained pole-zero LP model of synthetic audio signal plus noise (SNR  = 25 dB) with order and CPZLP algorithm: (a) PEF pole-zero plot, (b) PEF magnitude response.

4.2. High-Order LP Model

It is well known that a pole-zero model can be arbitrarily closely approximated with an all-pole model, provided that the model order is chosen large enough. This means that a noisy sum of sinusoids can also be modeled using a high-order all-pole model instead of a pole-zero model [22]. In Section 3, the LP minimization problem (13) was analyzed for the case of an all-pole model of order $2N$. When noise is present in the observed signal, the LP solution was shown to be a compromise between cancelling the tonal components and maintaining a flat high-frequency residual spectrum. By increasing the model order, the density of the zeros near the unit circle is increased accordingly, and hence the frequency resolution in the frequency range of the tonal components improves without sacrificing high-frequency residual spectral flatness. However, as the LP model order approaches the observation window length $M$, the variance of the estimated model parameters may be unacceptably large, leading to spurious peaks in the signal spectral estimate [22]. It has been suggested that the order of a high-order LP (HOLP) model should be chosen within a restricted interval to obtain the best spectral estimate for a noisy sum of sinusoids [22, 46].

Example 4 (high-order LP of synthetic audio signal).

Performing a th-order LP of the noisy synthetic audio signal fragment defined before, using the autocorrelation method to estimate the model parameters, we obtain a PEF pole-zero plot and magnitude response as shown in Figures 5(a) and 5(b). Examining the distribution of the PEF zeros in the complex plane reveals that this approach produces approximately zeros, lying on and nearly equally spaced around the unit circle (to provide overall spectral flatness of the PEF magnitude response), and additional zeros at the frequencies of the tonal components (to provide the notch filter behavior). Note that when applying the covariance method to the estimation of the HOLP model parameters, a similar result is obtained.
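As a rough illustration of the autocorrelation method referred to in this example, the following Python sketch computes biased autocorrelation estimates and solves the resulting normal equations with the Levinson-Durbin recursion. The function names `autocorrelation` and `levinson_durbin` and the pure-Python implementation are illustrative assumptions; this is not the code that produced Figure 5.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of a real sequence x."""
    n = len(x)
    return [sum(x[t] * x[t - k] for t in range(k, n)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations for autocorrelation values r[0..order].

    Returns (a, err): a = [1, a_1, ..., a_order] holds the coefficients of the
    PEF A(z) = 1 + a_1 z^-1 + ... + a_order z^-order, and err is the residual
    power. A perfectly predictable input (err reaching 0) is not handled here.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for stage m.
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k
    return a, err
```

For a high-order model the same recursion is simply run up to a larger `order`; the PEF zeros are then the roots of the polynomial with coefficients `a`.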

Figure 5

High-order LP model of synthetic audio signal plus noise (SNR  = 25 dB) with order and autocorrelation method: (a) PEF pole-zero plot, (b) PEF magnitude response.

4.3. Pitch Prediction Model

In LP of speech signals, the conventional LP model is usually cascaded with the so-called pitch prediction (PLP) model, with the aim of removing the long-term correlation from the signal. This technique can also be used to remove the (quasi) periodicity from monophonic audio signals, since it implicitly relies on the harmonicity of the observed signal. If we consider a sum of harmonic sinusoids having a pitch period that corresponds to an integer number of sampling periods , where is referred to as the pitch lag, then perfect prediction can be obtained by using a one-tap pitch predictor, of which the PEF transfer function is given by

(46)

The PEF magnitude response corresponding to (46) is

(47)

It can be seen that at , which corresponds to a comb filter behavior, that is, the PEF zeros are positioned on and equally spaced around the unit circle, at angles corresponding to integer multiples of the fundamental frequency . In other words, referring to the analysis in Section 3, the requirements of having the PEF zeros on the unit circle at angles (for cancelling the tonal components) and uniformly distributed on the unit circle (for maintaining the LP residual spectral flatness) are both fulfilled with the PLP model in (46).
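The comb filter behavior can be checked numerically. The sketch below assumes the one-tap PEF has the form A(z) = 1 - z^(-T) for an integer pitch lag T, which follows from the periodicity y(t) = y(t - T) of a harmonic signal; the symbol T and the value `lag = 8` are our own illustrative choices, since the original notation is not reproduced in this text. Under that assumption the magnitude response equals 2|sin(ωT/2)|.

```python
import cmath
import math


def one_tap_pef_response(omega, lag):
    """Magnitude of A(e^{j omega}) for the one-tap PEF A(z) = 1 - z^{-lag}."""
    return abs(1.0 - cmath.exp(-1j * omega * lag))


lag = 8  # hypothetical integer pitch lag in samples
# Notches at integer multiples of the fundamental frequency 2*pi/lag ...
notches = [one_tap_pef_response(2.0 * math.pi * k / lag, lag) for k in range(lag)]
# ... and maximal gain (equal to 2) halfway between neighbouring notches.
peaks = [one_tap_pef_response(2.0 * math.pi * (k + 0.5) / lag, lag) for k in range(lag)]
```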

However, for the PLP model to be capable of producing a good spectral estimate of a monophonic audio signal, we should improve the model in (46) in two ways. First of all, in audio signals the amplitudes of the harmonics typically decrease with increasing (see, e.g., Figures 11(b) and 14(b) in Section 5). This effect requires the PEF magnitude response to be spectrally shaped such that the comb filter notch depth decreases for increasing frequency. This can be achieved by using a multitap PLP model [47] which features multiple nonzero filter coefficients centered around the pitch lag value. In speech processing, a 3-tap PLP model is often applied, since this configuration usually provides enough flexibility in terms of spectral shaping:

(48)

From the 3-tap PEF magnitude response

(49)

it can be derived that the desired spectral shaping for our application, that is, a decreasing notch depth for increasing frequency, is obtained when [47].
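To make the spectral shaping effect concrete, the following sketch evaluates a 3-tap PEF of the assumed form A(z) = 1 - Σ b_i z^(-(T+i)), i = -1, 0, 1, at the harmonic frequencies. The symmetric tap choice (β, 1 - 2β, β) is one hypothetical parameterization picked for illustration and is not claimed to be the condition derived in [47]. For this choice the PEF magnitude at the harmonic frequency ω_k = 2πk/T reduces to 2β(1 - cos ω_k), so the notches become shallower as the frequency increases.

```python
import cmath
import math


def three_tap_pef_response(omega, lag, taps):
    """|A(e^{j omega})| for A(z) = 1 - sum_i taps[i+1] * z^{-(lag+i)}, i = -1, 0, 1."""
    acc = 1.0 + 0.0j
    for i, b in zip((-1, 0, 1), taps):
        acc -= b * cmath.exp(-1j * omega * (lag + i))
    return abs(acc)


lag = 16                          # hypothetical integer pitch lag
beta = 0.1                        # hypothetical side-tap weight
taps = (beta, 1.0 - 2.0 * beta, beta)
# PEF magnitude at the harmonics k * 2*pi/lag, k = 0 .. lag/2.
harmonic_gain = [three_tap_pef_response(2.0 * math.pi * k / lag, lag, taps)
                 for k in range(lag // 2 + 1)]
```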

Secondly, the PLP model in (46) is based on the assumption that the pitch lag is an integer number, which is generally not the case. Noninteger pitch lags can be incorporated in the PLP model in two ways: either by using a multitap PLP model for interpolation (see, e.g., [2]) or by using a fractional delay filter [48], for which numerous design methods exist [49]. We prefer to combine both approaches, such that the multitap structure may be primarily used for spectral shaping, whereas interpolation for noninteger pitch lags is achieved with a fractional delay filter. A combined fractional multitap PLP model has been proposed in [47], with

(50)

The fractional delay interpolation filter is a Hamming-windowed, truncated (length-) approximation of the ideal sinc-like interpolation filter [49], with denoting the Hamming window (centered at ). In (50), is the interpolation ratio (where is referred to as the pitch resolution) and denotes the fractional phase.

Typically, for estimating the PLP model parameters, in a first step, the optimal pitch lag and fractional phase are estimated by an exhaustive search of the minimal fractional 1-tap PLP residual power over the interval and . In speech analysis, the pitch lag limits correspond to the highest-pitched (female) and lowest-pitched (male) voices being analyzed and are typically chosen in the range and samples, at  kHz. For pitch analysis of audio signals, we propose to set the pitch lag range such that it corresponds to a fundamental frequency range of  Hz, that is, at  kHz, . In a second step, the fractional 3-tap PLP model parameters are estimated using the estimated pitch lag and fractional phase from the first step. Some useful approximations for efficiently calculating the 3-tap PLP model parameters from the input signal autocorrelation function have been suggested in [2].
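The first estimation step can be sketched as follows for the simplified case of integer pitch lags only; the fractional phase search and the second (3-tap) step are deliberately omitted. The function names, the common summation range, and the numerical values in the usage note are hypothetical choices made for illustration, not values taken from the text.

```python
import math


def best_integer_pitch_lag(x, lag_min, lag_max):
    """Exhaustive search for the integer lag minimising the 1-tap PLP residual energy.

    For each candidate lag T the optimal tap is b = c(T) / e(T), with
    c(T) = sum x[t] x[t-T] and e(T) = sum x[t-T]^2, which leaves a residual
    energy of sum x[t]^2 - c(T)^2 / e(T).
    """
    start = lag_max               # same summation range for every candidate lag
    energy = sum(v * v for v in x[start:])
    best_lag, best_residual = None, float("inf")
    for lag in range(lag_min, lag_max + 1):
        cross = sum(x[t] * x[t - lag] for t in range(start, len(x)))
        delayed = sum(x[t - lag] ** 2 for t in range(start, len(x)))
        if delayed == 0.0:
            continue
        residual = energy - cross * cross / delayed
        if residual < best_residual:
            best_lag, best_residual = lag, residual
    return best_lag


def lag_range_for_f0(f0_min_hz, f0_max_hz, fs_hz):
    """Integer pitch-lag limits (in samples) covering a fundamental-frequency range."""
    return math.floor(fs_hz / f0_max_hz), math.ceil(fs_hz / f0_min_hz)
```

For instance, a hypothetical fundamental frequency range of 100 Hz to 1000 Hz at a 44.1 kHz sampling rate would give `lag_range_for_f0(100.0, 1000.0, 44100.0) == (44, 441)`.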

Example 5 (pitch prediction of synthetic audio signal).

The parameters of the fractional 3-tap PLP model given in (50) were estimated for the noisy synthetic audio signal defined earlier using the method proposed in [47], with an interpolation filter of length and an interpolation ratio , and by forcing the input correlation matrix to be Toeplitz [2]. The resulting PEF magnitude response and pole-zero plot are shown in Figures 6(a) and 6(b). Note the additional circle of zeros around the origin in Figure 6(a), which is due to the fractional part of the PEF transfer function, and the spectral shaping effect in Figure 6(b), which is obtained by using multiple taps in the PLP model.

Figure 6

Fractional 3-tap PLP model of synthetic audio signal plus noise (SNR  = 25 dB): (a) PEF pole-zero plot, (b) PEF magnitude response.

4.4. Warped LP Model

Warped linear prediction (WLP) is probably the most well-known technique for LP of audio signals, see [12] and references therein. In WLP, the input signal undergoes a nonuniform frequency transformation before a conventional LP is performed, with the aim of enhancing the frequency resolution in certain frequency regions. The frequency transformation is usually defined by an all-pass bilinear transform in the -domain, which maps the unit circle onto itself:

(51)

The so-called warping parameter is typically chosen such that the corresponding frequency mapping

(52)

approximates the Bark auditory scale [30], that is, when the sampling rate is expressed in kHz:

(53)

Since , the warping operation tends to spread out the tonal components in the observed signal over the entire Nyquist interval. From the conventional LP analysis in Section 3, it can hence be expected that applying a conventional, that is, low-order all-pole LP model to the warped signal will yield a better prediction than a conventional LP model of the original signal. The optimal prediction is obtained when the frequency transformation produces a uniform spreading of the tonal components in the Nyquist interval. For monophonic audio signals, this is never the case, since the bilinear frequency warping in (51)-(52) disturbs the harmonicity of the signal. For this class of signals, the frequency transformation of the selective LP model described in Section 4.5 appears to be better suited. However, for polyphonic audio signals, the above bilinear frequency warping may be a near-optimal mapping, since in this case the different fundamental frequencies are approximately related to each other according to the Bark scale (see also the simulation results in Section 5.3).
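The effect of the bilinear warping on a harmonic signal can be illustrated with a short sketch. It assumes the first-order all-pass D(z) = (z^(-1) - λ)/(1 - λ z^(-1)), whose frequency mapping is ω + 2 arctan(λ sin ω / (1 - λ cos ω)), and the closed-form Bark approximation of the warping parameter that is commonly attributed to Smith and Abel. We assume these correspond to (51)-(53); the symbol λ, the sign convention, and the 440 Hz test signal are our own choices.

```python
import math


def warp_frequency(omega, lam):
    """Frequency mapping of the first-order all-pass D(z) = (z^-1 - lam) / (1 - lam z^-1)."""
    return omega + 2.0 * math.atan2(lam * math.sin(omega), 1.0 - lam * math.cos(omega))


def bark_warping_parameter(fs_khz):
    """Closed-form approximation of the Bark-matching warping parameter (fs in kHz)."""
    return 1.0674 * math.sqrt(2.0 / math.pi * math.atan(0.06583 * fs_khz)) - 0.1916


lam = bark_warping_parameter(44.1)
f0_hz, fs_hz = 440.0, 44100.0     # hypothetical harmonic signal
warped = [warp_frequency(2.0 * math.pi * k * f0_hz / fs_hz, lam) for k in range(1, 6)]
ratios = [w / warped[0] for w in warped]   # no longer 1, 2, 3, ... after warping
```

The list `ratios` shows that the warped components are pushed toward higher frequencies but are no longer integer multiples of the warped fundamental, which is the loss of harmonicity mentioned above.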

Example 6 (warped LP of synthetic audio signal).

The warped spectrum of the noisy synthetic audio signal defined before is shown in Figure 7(a) for . Figures 7(b) and 7(c) illustrate the PEF pole-zero plot and magnitude response on a warped frequency scale , when a th-order WLP model is calculated using the autocorrelation method. The frequency resolution of the signal WLP spectral estimate is very good for the five lowest tonal components , while the higher harmonics are modeled less accurately because they are too closely spaced on the warped frequency scale. The PEF transfer function can be unwarped to the original frequency scale, but then the PEF impulse response is of infinite duration. The PEF pole-zero plot and magnitude response on the original frequency scale, obtained by truncating the unwarped PEF impulse response to a length of samples, are shown in Figures 7(d) and 7(e). The pole-zero plot on the original frequency scale clearly illustrates that the WLP model succeeds both at cancelling the (low-frequency) tonal components (by placing a few zeros approximately on the unit circle at the lower tonal component frequencies) and at preserving the overall spectral flatness of the residual (by placing a large number of zeros uniformly spaced around and close to the unit circle).

Note that the WLP residual can be calculated without unwarping the PEF transfer function, but instead by considering the PEF as a warped FIR filter [50]. Moreover, before feeding the WLP residual to a synthesis filter or calculating its spectral flatness (see Section 5), it should be postfiltered with a high-pass filter defined as [12]

(54)

Figure 7

Warped LP model of synthetic audio signal plus noise (SNR  = 25 dB) with order , warping parameter , and autocorrelation method: (a) Noisy synthetic audio signal magnitude spectrum (warped scale), (b) PEF pole-zero plot (warped scale), (c) PEF magnitude response (warped scale), (d) PEF pole-zero plot (original scale), (e) PEF magnitude response (original scale).

4.5. Selective LP Model

In some cases, for example, when dealing with monophonic audio signals, a uniform frequency mapping may be more useful than a nonuniform mapping such as the warping operation described in Section 4.4, since it preserves the harmonic relation between the tonal components. A uniform mapping, which allows one to "zoom in" on a certain frequency region , is accomplished by

(55)

which, when combined with a conventional LP model, is known as a selective LP (SLP) model [1].

To obtain a uniform spreading of the tonal components over the entire Nyquist interval, we should choose and , with and the frequencies of the lowest and highest tonal components, see (1). This leads to

(56)

with

(57)

In the -domain, this corresponds to the mapping:

(58)

which is a downsampling operation with downsampling factor . In the case of a monophonic audio signal, the downsampling factor can be rewritten using (2):

(59)

and in the polyphonic case, using (3):

(60)

Note that the optimal downsampling factor , given in (57), is highly signal-dependent, and noninteger downsampling is required in general. These difficulties can be easily avoided by using an approximate, integer downsampling factor (see Section 5) which is chosen to be fixed for the entire signal analysis. It should then typically be chosen in the range , if possible, using some prior knowledge about the frequency range of the instrument generating the audio signal being analyzed.
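The following sketch illustrates how an integer downsampling factor stretches the tonal components over the Nyquist interval while leaving their harmonic relation intact. It assumes ideal anti-aliasing and a lowest frequency of interest equal to zero; the function name, the 440 Hz test signal, and the factor 8 are hypothetical and do not reproduce the optimal factor in (57).

```python
import math


def downsampled_frequencies(freqs_hz, fs_hz, factor):
    """Normalised angular frequencies (rad/sample) after downsampling by `factor`.

    Assumes ideal anti-aliasing, so every component must lie below the new
    Nyquist frequency fs_hz / (2 * factor).
    """
    new_nyquist_hz = fs_hz / (2.0 * factor)
    if any(f >= new_nyquist_hz for f in freqs_hz):
        raise ValueError("component at or above the downsampled Nyquist frequency")
    return [2.0 * math.pi * f * factor / fs_hz for f in freqs_hz]


f0_hz, fs_hz = 440.0, 44100.0     # hypothetical harmonic signal
harmonics_hz = [k * f0_hz for k in range(1, 6)]
original = downsampled_frequencies(harmonics_hz, fs_hz, 1)
zoomed = downsampled_frequencies(harmonics_hz, fs_hz, 8)
```

In this example the five harmonics occupy roughly the lowest 10% of the Nyquist interval before downsampling and about 80% of it afterwards, and the ratios between the mapped frequencies remain 1, 2, 3, 4, 5.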

Example 7 (selective LP of synthetic audio signal).

The spectrum of the noisy synthetic audio signal defined before, downsampled with a factor (obtained from (59) with and ), is shown in Figure 8(a), and the PEF pole-zero plot and magnitude response, resulting from using a th-order SLP model, calculated with the autocorrelation method, are plotted on the downsampled frequency scale in Figures 8(b) and 8(c). The PEF zeros are nearly perfectly distributed in a uniform way around the unit circle with exactly one complex conjugate zero pair for each tonal component in the downsampled signal. After upsampling, the PEF pole-zero plot and magnitude response shown in Figures 8(d) and 8(e) are obtained. The PEF behavior on the original frequency scale is comparable to the PLP model PEF behavior, that is, nearly perfect cancellation of the tonal components is achieved, at the cost of having additional notches in the upper half of the Nyquist interval, which may result in a nonsmooth high-frequency residual spectrum. The LP residual can either be calculated on the downsampled or on the original time scale.

Figure 8

Selective LP model of synthetic audio signal plus noise (SNR  = 25 dB) with order , downsampling factor , and autocorrelation method: (a) noisy synthetic audio signal magnitude spectrum (downsampled scale), (b) PEF pole-zero plot (downsampled scale), (c) PEF magnitude response (downsampled scale), (d) PEF pole-zero plot (original scale), (e) PEF magnitude response (original scale).

5. Simulation Results

In this section, we evaluate the conventional and alternative LP models described in Sections 3 and 4 in terms of frequency estimation accuracy, residual spectral flatness, and perceptual frequency resolution for a synthetic harmonic audio signal with varying fundamental frequency and SNR. Afterwards, we apply the different LP models to true monophonic and polyphonic audio signals, and we analyze the PEF behavior by examining the pole-zero plots and magnitude responses. Residual spectral flatness figures are given for true audio signals as a function of pitch and time offset of the analysis window within the signal.

We should stress that the aim is to compare different LP models, and not the algorithms that can be used to estimate the model parameters. Some models come with parameter estimation algorithms that are well established (e.g., covariance method or autocorrelation method with Levinson-Durbin algorithm [51, Chapter 6]  for all-pole models), yet other models do not. In particular, PZLP models typically result in a nonconvex parameter estimation problem that is solved either in an adaptive or iterative way. As a consequence, the performance of the corresponding estimation algorithms (e.g., ANF or CPZLP) depends heavily on the initial conditions. In the simulation results presented below, the initial conditions are chosen in the neighborhood of the true fundamental frequencies in the observed audio signal, such that the PZLP estimation algorithms yield a solution that corresponds with high probability to the global solution. In this way, the emphasis is on the model performance rather than on the estimation algorithm performance. For the same reason, knowledge of the true fundamental frequencies is also assumed when determining the optimal downsampling factor in the SLP estimation algorithms, and for designing a PLP model for polyphonic audio signals. For the conventional LP model, the performance may differ substantially for the autocorrelation and covariance estimation methods, hence the results for both methods are included.

5.1. Synthetic Audio Signal

Throughout Examples 2–7, the performance of conventional and alternative LP models was illustrated by inspecting the PEF pole-zero plots and magnitude responses, resulting from the prediction of a noisy synthetic audio signal with fundamental frequency  Hz and SNR  = 25 dB. We also present a more quantitative evaluation of the different LP models, for a synthetic audio signal with variable fundamental frequency and SNR.

A first performance measure is the mean square frequency error (MSFE), which is defined with the aim of evaluating the frequency estimation accuracy of the different LP models,

(61)

with

(62)

In other words, the MSFE is calculated as the mean square difference between each of the frequencies of the tonal components in the observed signal, and the angle of the PEF zero that is closest to the point in the complex plane. The MSFE was calculated for a synthetic audio signal with , , , , and as in Example 2, with additive Gaussian white noise resulting in an SNR = 25 dB and with a varying fundamental frequency equal to the first 11 center frequencies of the Bark scale [52]. A second experiment was conducted with similar signals having a fixed fundamental frequency  Hz, and an SNR varying between −50 dB and 50 dB in steps of 10 dB. The MSFE results, averaged over 100 Monte Carlo trials for different realizations of the Gaussian white noise sequence, are shown in Figures 9(a) and 9(b). The MSFE of the low-order all-pole models (L, L, WLP, and SLP) appears to be more or less invariant with respect to varying fundamental frequency and SNR, with MSFE values varying between −50 and −20 dB, the highest of which is obtained with the conventional LP model. It can be observed that models for which the PEF zeros are on (PLP and PZLP) or very close to (HOLP) the unit circle generally provide a higher frequency estimation accuracy. The HOLP model produces MSFE values between −70 and −50 dB, which are invariant with varying fundamental frequency, and slightly lower for high than for low SNRs. At sufficiently high fundamental frequency and SNR values, the PLP and PZLP models achieve an MSFE as low as −90 (PLP) and −100 (PZLP) dB. However, the MSFE performance of the PLP and PZLP models is seen to be worse for lower fundamental frequencies and SNR values. The sensitivity of these models to the fundamental frequency is presumably related to the fact that these are the only models that explicitly rely on the harmonicity of the observed signal (since in the PZLP case, the comb filter model is used). The performance drop of the PZLP model at low SNR values is due to the accuracy of the CPZLP algorithm, which is known to be relatively poor at SNR values below −5 dB [37].
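The verbal definition of the MSFE given above can be sketched in a few lines. The function below follows that description literally (closest PEF zero to the unit-circle point at each tonal frequency, then the squared angle difference, averaged over the tonal components); any additional normalization that may appear in (61)-(62) is not reproduced, and the function name is our own.

```python
import cmath


def mean_square_frequency_error(tonal_freqs, pef_zeros):
    """Mean square difference between each tonal frequency (rad/sample) and the
    angle of the PEF zero closest to the unit-circle point exp(j * omega)."""
    total = 0.0
    for omega in tonal_freqs:
        target = cmath.exp(1j * omega)
        closest = min(pef_zeros, key=lambda z: abs(z - target))
        total += (omega - cmath.phase(closest)) ** 2
    return total / len(tonal_freqs)
```

A value in dB, as plotted in Figures 9(a) and 9(b), would then be obtained as ten times the base-10 logarithm of the returned value.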

Figure 9

Mean square frequency error (MSFE) and residual SFM curves of Monte Carlo simulations for a synthetic audio signal with variable fundamental frequency and SNR: (a) MSFE (dB) versus fundamental frequency (Hz), (b) MSFE (dB) versus SNR (dB), (c) residual SFM (dB) versus fundamental frequency (Hz), (d) residual SFM (dB) versus SNR (dB).

A second performance measure is the spectral flatness measure (SFM) of the LP residual, defined as [43, Chapter 6]

(63)

with the -point DFT of the LP residual . The SFM is a real number between 0 and 1, with SFM = 1 corresponding to a flat spectrum, and is often expressed on a dB-scale (0 dB corresponding to a flat spectrum). Monte Carlo simulation results of the residual SFM after prediction of the synthetic audio signals with varying fundamental frequency and SNR described above are shown in Figures 9(c) and 9(d). The residual SFM of the low-order all-pole models (L, L, WLP, and SLP) decreases with increasing fundamental frequency and increasing SNR. The first observation can be explained by noting that at low fundamental frequency values, the low-order all-pole models tend to model multiple tonal components with one complex conjugate pole pair, while the remaining poles are used to model the high-frequency noise spectrum. As a consequence, most of the poles are located relatively far away from the unit circle, hence resulting in a smoother spectral behavior. The residual SFM drop at high SNR values should not be surprising, since the low-order all-zero PEFs generally do not succeed at completely cancelling the tonal components from the observed signal. On the other hand, the residual SFM of the PLP and PZLP models can be seen to increase with increasing fundamental frequency, and to decrease (PLP) or remain quasiconstant (PZLP) with increasing SNR. The HOLP model residual SFM is the highest among all LP models, and appears to be independent of both fundamental frequency and SNR. The SFM of the synthetic audio signals before LP was on average −10 dB in the varying fundamental frequency case, and −35 dB in the varying SNR case.

A relevant extension to the low-order alternative LP models described in Section 4 is to cascade them with a conventional LP model. Such a cascaded model can be motivated by noting that for true audio signals, the noise term in the tonal signal models (1)–(3) may be nonwhite. Hence, an alternative LP model could be applied first for predicting the tonal components, and in a second step a conventional LP model could be used for whitening both the noise and the unpredicted tonal components in the residual of the alternative LP model. This cascaded structure appears to be beneficial for the low-order alternative LP models (PZLP, PLP, WLP, and SLP) in terms of increasing the residual SFM, especially at high SNR values and, for the PZLP and PLP models, also at low fundamental frequency values.
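For reference, the residual SFM can be computed as in the following sketch, which assumes the common definition of the SFM as the ratio of the geometric mean to the arithmetic mean of the DFT power spectrum; we take this to be the content of (63). The direct O(N²) DFT and the small power floor that avoids taking the logarithm of zero are implementation conveniences of ours.

```python
import cmath
import math


def spectral_flatness(residual, floor=1e-12):
    """Ratio of the geometric to the arithmetic mean of the DFT power spectrum."""
    n = len(residual)
    power = []
    for k in range(n):
        # Direct O(n^2) DFT, adequate for a short illustrative frame.
        xk = sum(residual[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(max(abs(xk) ** 2, floor))   # floor avoids log(0)
    geometric = math.exp(sum(math.log(p) for p in power) / n)
    arithmetic = sum(power) / n
    return geometric / arithmetic


def to_db(sfm):
    """Express an SFM value on a dB scale (0 dB corresponds to a flat spectrum)."""
    return 10.0 * math.log10(sfm)
```

A unit impulse, whose spectrum is perfectly flat, yields an SFM of 1 (0 dB), whereas a pure sinusoid yields a value close to 0.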

Finally, the third performance measure we will use is the interpeak dip depth (IDD)  [12], a perceptually motivated measure which reflects the separability of spectral peaks for a certain model. It is defined for an LP model of a length- sum of two sinusoids at frequencies and Hz, separated by two times the equivalent rectangular bandwidth (ERB)  [53] at the lower frequency , that is, , as

(64)

with and corresponding to the amplitude of the two peaks in the LP model magnitude response, and to the minimal amplitude between the two peaks. The higher the IDD, the better the perceptual frequency resolution of the model is expected to be. The IDD was measured for all LP models except the PLP model, for 24 sets of two sinusoids, with corresponding to the center frequency of the 24 Bark scale bands [52]. The PLP model is not appropriate for this type of signal, since the sinusoid frequencies are not harmonically related. The IDD results for the conventional LP, PZLP, WLP, and SLP models with order and for the HOLP model with order are shown in Figure 10. The low-order all-pole models perform poorly, except for the conventional LP model with the covariance estimation method, which has a very high IDD even in the low-frequency region. For true audio signals, however, the L model will perform worse in terms of perceptual frequency resolution since the estimated model parameters can strongly differ for noise-free and noisy sinusoidal signals, see Figures 2(a) and 3(a). The HOLP model IDD exhibits a trend similar to that of the L model IDD, as it slightly increases with increasing frequency, remaining on average 14 dB below the L model IDD curve. The PZLP model can be seen to produce high IDD values at low and high frequencies, but performs poorly in the midfrequency range (250 to 1370 Hz), which is exactly the frequency range of interest in audio applications. Of course, the IDD performance of an LP model is strongly related to the bandwidth of the spectral peaks that it can produce. As a consequence, the PZLP model IDD performance can be improved by increasing the pole radius (e.g., , see Figure 10), which is equivalent to reducing the smallest achievable bandwidth [54]. However, when dealing with true audio signals, a lower value of the pole radius is expected to be more appropriate for taking into account the damping of the tonal components.

Figure 10

IDD results for two-tone signal with frequencies and .

Figure 11

Monophonic audio signal: (a) time-domain waveform, (b) magnitude spectrum.

5.2. Monophonic Audio Signal

A length- monophonic audio fragment was extracted from a Bb clarinet sound recording in the McGill University Master Samples collection [55]. The fragment, which corresponds to the samples 70001 to 72048 of the G4 note recording, is shown in Figure 11(a), along with its magnitude spectrum in Figure 11(b). The fundamental frequency corresponds to  Hz, and the number of relevant harmonics is chosen to be . A conventional LP model of order , calculated using the autocorrelation method, produces a PEF as illustrated in Figures 12(a) and 12(d), which is again a compromise between cancelling the tonal components and keeping the residual spectrum relatively flat. A better resolution is obtained using a PZLP model with , and , , as shown in Figures 12(b) and 12(e), and using an HOLP model with , see Figures 12(c) and 12(f). A fractional 3-tap PLP model was calculated using the method proposed in [47], with the algorithm parameters given in Example 5, resulting in the PEF shown in Figures 12(g) and 12(j), in which the spectral shaping capability of the 3-tap PLP model is clearly exploited. A WLP model with and produces an unwarped PEF as shown in Figures 12(h) and 12(k). Finally, the SLP model with , for which the optimal downsampling factor from (57) was rounded to , has a PEF after upsampling which is given in Figures 12(i) and 12(l).

Figure 12

Monophonic audio signal: PEF pole-zero plots (first and third rows) and PEF magnitude responses (second and fourth rows) for conventional and alternative LP models: (a), (d) conventional LP model, (b), (e) pole-zero LP model, (c), (f) high-order LP model, (g), (j) pitch prediction model, (h), (k) warped LP model, (i), (l) selective LP model.

The residual SFM values obtained with the different LP models were calculated for 2048 sample fragments taken from the sustain part of the Bb clarinet recordings in [55] with varying pitch, ranging from D3 to D6 (corresponding to  Hz to 1174.7 Hz), and are shown in Figure 13(a). The original signal fragments have an average SFM value of −31 dB. The residual SFM curves for the PZLP and PLP models are not shown, as they are (partially) outside the displayed SFM range, with an average residual SFM of −12 and −19 dB, respectively. Figure 13(c) contains the residual SFM results when the analysis window time offset is varied in steps of 2048 samples from the onset till the end of the Bb clarinet G4 note in [55], which is plotted in Figure 13(b). Again, the PZLP and PLP curves are omitted, with an average residual SFM of −10 and −19 dB, respectively, while the original signal fragments have an average SFM of −29 dB. From Figure 13(a), we can observe that the residual SFM does not exhibit a notable trend with varying fundamental frequency for any of the LP models, which is somewhat contradictory with the results obtained for synthetic signals (see Figure 9(c)). This can be explained by suggesting that the residual SFM value for true audio signals is primarily determined by the (low-power) harmonics which are modeled as noise components instead of tonal components. This undermodelling effect is generally independent of the fundamental frequency, but rather depends on which musical instrument is considered. Figures 13(b) and 13(c) show that the LP model performance is comparable in the decay, sustain, and release part of the note, but somewhat worse in the attack part. This is mainly due to the fact that the attack part exhibits much less stationarity than the other signal parts. 
In both experiments, the HOLP model, as well as the PZLP and PLP models cascaded with a conventional LP model, provides the best residual SFM results, which is consistent with the results obtained for synthetic signals (see Figures 9(c) and 9(d)). The WLP model, potentially cascaded with a conventional LP model, performs somewhat worse yet still outperforms the L model, while the SLP and L models yield significantly poorer results.

Figure 13

Residual SFM curves for a true monophonic audio signal with variable fundamental frequency and analysis window time offset: (a) residual SFM (dB) versus fundamental frequency (Hz), (b) time-domain waveform of analyzed Bb clarinet G4 note, (c) residual SFM (dB) versus analysis window time offset (second).

Figure 14

Polyphonic audio signal: (a) time-domain waveform, (b) magnitude spectrum.

5.3. Polyphonic Audio Signal

From the concert hall Steinway recordings in [55], a polyphonic audio signal was generated by adding four monophonic piano sounds. The samples 2001 to 4048 of the C4, E4, G4, and C5 note recordings were added to obtain a length- C major chord, plotted in Figures 14(a) and 14(b). The four fundamental frequencies are  Hz, and each of the monophonic components has 7 relevant harmonics, that is, . The PEF obtained with a conventional LP model of order is shown in Figures 15(a) and 15(d). It can be seen that the PEF has only one low-frequency notch and an overall high-pass shape. The PZLP model with , and , produces exactly as many PEF notches as there are nonoverlapping tonal components, as can be seen in Figures 15(b) and 15(e). The same holds true for the HOLP model with , of which the PEF is shown in Figures 15(c) and 15(f). The PLP model does not seem to be suited for predicting polyphonic signals since the tonal components do not obey an integer harmonic relation. An alternative PLP approach could consist of cascading as many PLP models as there are different fundamental frequencies in the polyphonic signal, but this does not yield good results. Another alternative PLP approach may be based on the fractional harmonic relations which exist between the fundamental frequencies in a musical chord, for example, for a major chord (consisting of dominant, third, fifth, and octave) it can be verified that , , and . As a consequence, a fractional PLP model with pitch lag samples would produce PEF notches at all the tonal components in the polyphonic signal. However, allowing such large pitch lags deteriorates the performance of the algorithm for calculating the PLP model parameters, since the allowable pitch lag search space becomes very large, rendering the algorithm slower and less reliable. Moreover, the large number of spurious notches in the PEF frequency response leads to an extremely nonsmooth residual spectrum. 
As an example, a fractional pseudo-3-tap PLP model [47], assuming knowledge of the pitch lag samples, was constructed by setting and . The resulting PEF when and is shown in Figures 15(g) and 15(j). Finally, the WLP and SLP models were applied to the polyphonic signal, both with , a warping parameter resulting in the unwarped PEF in Figures 15(h) and 15(k), and a downsampling factor (rounded from the optimal value in (60)) resulting in the upsampled PEF shown in Figures 15(i) and 15(l).
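The idea of a common pitch lag for a chord can be illustrated with exact rational arithmetic. The sketch below assumes just-intonation frequency ratios of 5/4, 3/2, and 2 relative to the lowest note of a major chord; these ratios are our assumption (the relations stated in the text are not reproduced here), and they hold only approximately for an equal-tempered piano.

```python
from fractions import Fraction
from math import gcd


def common_fundamental_ratio(ratios):
    """Largest fraction of the lowest fundamental that divides every chord note."""
    num, den = 0, 1
    for r in ratios:
        num = gcd(num, r.numerator)
        den = den * r.denominator // gcd(den, r.denominator)   # running lcm
    return Fraction(num, den)


# Hypothetical just-intonation ratios for a major chord (lowest note, third, fifth, octave).
major_chord = [Fraction(1), Fraction(5, 4), Fraction(3, 2), Fraction(2)]
common = common_fundamental_ratio(major_chord)   # Fraction(1, 4)
lag_multiplier = 1 / common                      # 4 times the lag of the lowest note
```

Under these assumed ratios all four notes are harmonics (numbers 4, 5, 6, and 8) of a common fundamental at one quarter of the lowest fundamental frequency, so the common pitch lag is four times the pitch lag of the lowest note, which illustrates why the required pitch lag search range becomes large.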

Figure 15

Polyphonic audio signal: PEF pole-zero plots (first and third rows) and PEF magnitude responses (second and fourth rows) for conventional and alternative LP models: (a), (d) conventional LP model, (b), (e) pole-zero LP model, (c), (f) high-order LP model, (g), (j) pitch prediction model, (h), (k) warped LP model, (i), (l) selective LP model.

Two experiments similar to those in the monophonic case were performed, for calculating the residual SFM values after prediction of a polyphonic audio signal with varying pitch and analysis window time offset. Figure 16(a) shows the residual SFM results for LP of a 4-note major chord (consisting of dominant, third, fifth, and octave) created from the concert hall Steinway recordings in [55], in which the dominant varies from A0 to C7 (corresponding to  Hz to 2093 Hz), and the analysis window is in the release part of the chord. The L and PLP curves are not shown, since they are partially below the displayed residual SFM range, having a residual SFM value of −11 and −30 dB, respectively. The original polyphonic signals have an average SFM of −32 dB. At very low-pitched chords, the L, HOLP, WLP, SLP models and the PZLP and PLP models cascaded with a conventional LP model are quite competitive; however, toward higher pitch values, the HOLP and WLP models outperform the other models. The superior performance of the WLP model as compared to the other low-order models should not be a surprise. As noted in Section 4.4, the tonal components in a polyphonic signal are approximately distributed according to the Bark scale and are hence mapped to a nearly uniform frequency distribution after frequency warping. The L and SLP models still perform reasonably well for high-pitched chords, while the cascaded PZLP and PLP models perform worse. It appears that the approach of decomposing the polyphonic signal into a number of harmonic signals (which is what the PZLP and PLP models attempt to do) is not beneficial in terms of residual spectral flatness. In Figure 16(b), the 4-note major chord with dominant C4 is plotted, for which the residual SFM results of LP with a variable analysis window time offset are shown in Figure 16(c). During the attack part of the chord (analysis window offset = 0 second), all LP models perform poorly. 
In the next 5 positions of the analysis window, which correspond to the decay and sustain parts, the residual SFM performance is the best. Again, the HOLP and WLP models yield better results than the L and SLP models, which in turn outperform the PZLP and PLP models, cascaded with a conventional LP model. In the release part of the chord (analysis window offset  = ca. 0.6 second to 9.8 second), the residual SFM performance is highly fluctuating for all models, and particularly, the cascaded PZLP model residual SFM curve exhibits a decreasing trend toward the end of the chord due to the decreasing SNR. The original C major chord has an average SFM of −37 dB, and the L and PLP models, resulting in an average residual SFM of −12 and −28 dB, respectively, are not shown in the graph.
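The SFM figures discussed above can be illustrated with a short numerical sketch. It assumes the usual definition of the SFM as the ratio of the geometric to the arithmetic mean of the power spectrum, expressed in dB. The Hann window, the FFT-based periodogram, the test signals, and the function name `spectral_flatness_db` are illustrative assumptions and not necessarily the settings used for Figure 16.

```python
import numpy as np

def spectral_flatness_db(x, eps=1e-12):
    """SFM in dB: geometric mean over arithmetic mean of the power spectrum."""
    x = np.asarray(x, dtype=float)
    # Windowed periodogram; eps avoids log(0) for exactly zero bins.
    power = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2 + eps
    geometric_mean = np.exp(np.mean(np.log(power)))
    arithmetic_mean = np.mean(power)
    return 10.0 * np.log10(geometric_mean / arithmetic_mean)

# Two assumed test signals: white noise (flat spectrum) and a nearly pure tone.
rng = np.random.default_rng(0)
n = np.arange(2048)
noise = rng.standard_normal(n.size)
tone = np.sin(2 * np.pi * 0.05 * n) + 1e-3 * noise

sfm_noise = spectral_flatness_db(noise)  # close to 0 dB (flat spectrum)
sfm_tone = spectral_flatness_db(tone)    # strongly negative (peaky spectrum)
print(f"SFM noise: {sfm_noise:.1f} dB, SFM tone: {sfm_tone:.1f} dB")
```

An SFM near 0 dB indicates a flat (white) spectrum, so a predictor that whitens the signal well moves the residual SFM from the strongly negative value of the tonal input toward 0 dB.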

Figure 16

Residual SFM curves for a true polyphonic audio signal with variable fundamental frequency and analysis window time offset. (a) Residual SFM (dB) versus lower fundamental chord frequency (Hz). (b) Time-domain waveform of analyzed C major piano chord. (c) Residual SFM (dB) versus analysis window time offset (second).

6. Conclusion

In this paper, we have analyzed the performance of the conventional LP model when applied to tonal audio signals, and illustrated how the quality of this model depends on the distribution of the signal tonal components in the Nyquist interval. It was shown that the conventional LP model, with a model order equal to two times the number of tonal components, and calculated by minimizing an LS criterion, produces a PEF that features a tradeoff between cancelling the tonal components and keeping the residual spectrum as flat as possible. This tradeoff occurs since the tonal components in an audio signal, sampled at  kHz, are typically located in the lower half of the Nyquist interval.
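The role of the model order can be made concrete with a small experiment. The sketch below fits a least-squares predictor of order twice the number of sinusoids, first to a noiseless two-tone signal and then to the same signal with added white noise. The covariance-method formulation, the sampling rate of 44.1 kHz, the tone frequencies, the noise level, and the helper names `ls_linear_predictor` and `pef_frequencies_hz` are assumptions made for illustration only.

```python
import numpy as np

def ls_linear_predictor(x, order):
    """Least-squares predictor: x[t] ~ sum_k a[k] * x[t-1-k] (covariance method)."""
    x = np.asarray(x, dtype=float)
    # Row for time t holds the `order` past samples x[t-1], ..., x[t-order].
    X = np.array([x[t - order:t][::-1] for t in range(order, len(x))])
    target = x[order:]
    a, *_ = np.linalg.lstsq(X, target, rcond=None)
    return a, target - X @ a  # coefficients and prediction residual

def pef_frequencies_hz(a, fs):
    """Frequencies (Hz) of the PEF zeros in the upper half of the z-plane."""
    pef = np.concatenate(([1.0], -a))  # A(z) = 1 - sum_k a[k] z^-(k+1)
    roots = np.roots(pef)
    angles = np.angle(roots[roots.imag > 0])
    return np.sort(angles) * fs / (2.0 * np.pi)

fs = 44100.0                           # assumed sampling rate
tones_hz = np.array([2000.0, 5000.0])  # assumed tonal components
n = np.arange(1024)
clean = np.sum([np.sin(2 * np.pi * f * n / fs + 0.3) for f in tones_hz], axis=0)
order = 2 * len(tones_hz)              # twice the number of sinusoids

a_clean, res_clean = ls_linear_predictor(clean, order)
est_clean = pef_frequencies_hz(a_clean, fs)

rng = np.random.default_rng(0)
noisy = clean + 0.1 * rng.standard_normal(n.size)
a_noisy, res_noisy = ls_linear_predictor(noisy, order)
est_noisy = pef_frequencies_hz(a_noisy, fs)

print("noiseless residual energy:", np.sum(res_clean ** 2))
print("noiseless frequency estimates (Hz):", est_clean)
print("noisy residual energy:", np.sum(res_noisy ** 2))
print("noisy frequency estimates (Hz):", est_noisy)
```

In the noiseless case the residual vanishes up to rounding and the PEF zeros sit at the tone frequencies. Once noise is present, the same low-order predictor can no longer cancel the tones exactly, which is where the tradeoff with residual flatness arises.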

Five existing alternative LP models were described, applied to tonal audio signals, and interpreted in terms of relieving the tradeoff inherent in the conventional LP model. The first three alternative LP approaches solve the frequency distribution problem by considering a model different from the low-order all-pole model, namely, a (constrained) pole-zero (PZLP) model, a high-order all-pole (HOLP) model, or a pitch prediction (PLP) model. Two other alternative approaches aim at improving the low-order all-pole model performance, by first transforming the input signal and hence altering the distribution of its tonal components. If an all-pass bilinear transform is used, we end up with the warped all-pole (WLP) model, whereas a linear frequency transform leads to the selective all-pole (SLP) model.
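To see how the two frequency transforms redistribute tonal components, the following sketch maps a few low-frequency tones through each. It assumes the first-order all-pass D(z) = (z^-1 − λ)/(1 − λ z^-1), whose phase gives the warped frequency ω + 2 arctan(λ sin ω / (1 − λ cos ω)), and it models the linear transform as a simple scaling of frequency by a downsampling factor K. The warping parameter λ = 0.756, the factor K = 8, the sampling rate, and the chord frequencies are illustrative assumptions, not values taken from the experiments.

```python
import numpy as np

def warp_frequency(omega, lam):
    """Frequency mapping of the all-pass D(z) = (z^-1 - lam) / (1 - lam z^-1)."""
    omega = np.asarray(omega, dtype=float)
    return omega + 2.0 * np.arctan2(lam * np.sin(omega), 1.0 - lam * np.cos(omega))

fs = 44100.0   # assumed sampling rate
lam = 0.756    # assumed warping parameter (Bark-like warping at 44.1 kHz)
K = 8          # assumed downsampling factor for the linear transform

# Assumed tonal components: approximate fundamentals of a C major chord (Hz).
tones_hz = np.array([261.6, 329.6, 392.0, 523.3])
omega = 2.0 * np.pi * tones_hz / fs   # radians per sample, all close to 0

warped = warp_frequency(omega, lam)   # nonlinear stretch of the low band
scaled = K * omega                    # linear stretch, valid below pi / K

for f, w, wl, ws in zip(tones_hz, omega, warped, scaled):
    print(f"{f:7.1f} Hz: original {w / np.pi:.4f} pi, "
          f"warped {wl / np.pi:.4f} pi, downsampled {ws / np.pi:.4f} pi")
```

Both mappings move the tones away from the origin and spread them over a larger part of the Nyquist interval, which is the property that the low-order all-pole model benefits from.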

Extensive simulation results were reported to assess the performance of the conventional and alternative LP models. In summary, a high-order all-pole model appears to be better suited to the audio LP problem than a conventional, low-order all-pole model. However, the HOLP model, which typically has half as many model parameters as there are samples in the analysis window, is impractically complex in many applications. The PZLP model could hence be expected to be a good alternative, since it can approximate the HOLP PEF impulse response with fewer parameters. This seems to be true only for monophonic audio signals, and even in this case, estimating the model parameters without prior knowledge of the fundamental frequency range is not a trivial task. Another good alternative to the HOLP model in the case of monophonic signals is the PLP model, especially when cascaded with a conventional LP model, as is common practice in speech analysis. Finally, for polyphonic audio LP, the WLP model performance comes very close to the optimal HOLP model performance; however, the WLP model performs poorly in terms of perceptual frequency resolution, unless its model order is chosen to be an order of magnitude larger than the number of tonal components in the observed signal [12].

References

  1. Makhoul J: Linear prediction: a tutorial review. Proceedings of the IEEE 1975,63(4):561-580.

  2. Ramachandran RP, Kabal P: Pitch prediction filters in speech coding. IEEE Transactions on Acoustics, Speech, and Signal Processing 1989,37(4):467-478. 10.1109/29.17527

  3. Brandenburg K, Stoll G: ISO-MPEG-1 audio: a generic standard for coding of high-quality digital audio. Journal of the Audio Engineering Society 1994,42(10):780-792.

  4. ISO/IEC : IS 14496-4:2004/Amd 13:2007: parametric coding for high quality audio conformance. International Organization for Standardization, Geneva, Switzerland; January 2007.

  5. McAulay RJ, Quatieri TF: Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing 1986,34(4):744-754. 10.1109/TASSP.1986.1164910

  6. Härmä A, Laine UK, Karjalainen M: Warped linear prediction in audio coding. Proceedings of IEEE Nordic Signal Processing Symposium (NORSIG '96), September 1996, Espoo, Finland 447-450.

  7. Iwakami N, Moriya T: Transform-domain weighted interleave vector quantization. Proceedings of the 101st AES Convention, November 1996, Los Angeles, Calif, USA AES preprint 4377

  8. Bessette B, Salami R, Laflamme C, Lefebvre R: A wideband speech and audio codec at 16/24/32 kbit/s using hybrid ACELP/TCX techniques. Proceedings of IEEE Workshop on Speech Coding, June 1999, Porvoo, Finland 7-9.

  9. Härmä A, Laine UK: Warped low delay CELP for wideband audio coding. Proceedings of the 17th AES International Conference on High-Quality Audio Coding, September 1999, Florence, Italy 207-215.

  10. Rongshan Y, Chung KC: High quality audio coding using a novel hybrid WLP-subband coding algorithm. Proceedings of the 5th International Symposium on Signal Processing and Its Applications (ISSPA '99), August 1999, Brisbane, Australia 1: 483-486.

  11. Edler B, Faller C, Schuller G: Perceptual audio coding using a time-varying linear pre- and post-filter. Proceedings of the 109th AES Convention, September 2000, Los Angeles, Calif, USA AES preprint 5274

  12. Härmä A, Laine UK: A comparison of warped and conventional linear predictive coding. IEEE Transactions on Speech and Audio Processing 2001,9(5):579-588. 10.1109/89.928922

  13. Deriche M, Ning D: A novel audio coding scheme using warped linear prediction model and the discrete wavelet transform. IEEE Transactions on Audio, Speech, and Language Processing 2006,14(6):2039-2048.

  14. Biswas A, den Brinker AC: Perceptually biased linear prediction. Journal of the Audio Engineering Society 2006,54(12):1179-1188.

  15. Nakatoh Y, Matsumoto H: A low-bit-rate audio codec using mel-scaled linear predictive analysis. Acoustical Science and Technology 2007,28(3):147-152. 10.1250/ast.28.147

  16. Strube HW: Linear prediction on a warped frequency scale. Journal of the Acoustical Society of America 1980,68(4):1071-1076. 10.1121/1.384992

  17. van Waterschoot T, Rombouts G, Verhoeve P, Moonen M: Double-talk-robust prediction error identification algorithms for acoustic echo cancellation. IEEE Transactions on Signal Processing 2007,55(3):846-858.

  18. Rombouts G, van Waterschoot T, Struyve K, Moonen M: Acoustic feedback cancellation for long acoustic paths using a nonstationary source model. IEEE Transactions on Signal Processing 2006,54(9):3426-3434.

  19. van Waterschoot T, Moonen M: Adaptive feedback cancellation for audio signals using a warped all-pole near-end signal model. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '08), March-April 2008, Las Vegas, Nev, USA 269-272.

  20. van Waterschoot T, Moonen M: Adaptive feedback cancellation for audio applications. submitted to Signal Processing, ESAT-SISTA Technical Report TR 07-30, Katholieke Universiteit Leuven, Belgium, December 2008

  21. Pagano M: Estimation of models of autoregressive signal plus white noise. The Annals of Statistics 1974,2(1):97-108.

  22. Kay SM: The effects of noise on the autoregressive spectral estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 1979,27(5):478-485. 10.1109/TASSP.1979.1163275

  23. Chan YT, Lavoie JMM, Plant JB: A parameter estimation approach to estimation of frequencies of sinusoids. IEEE Transactions on Acoustics, Speech, and Signal Processing 1981,29(2):214-219. 10.1109/TASSP.1981.1163543

  24. Rao DVB, Kung S-Y: Adaptive notch filtering for the retrieval of sinusoids in noise. IEEE Transactions on Acoustics, Speech, and Signal Processing 1984,32(4):791-802. 10.1109/TASSP.1984.1164398

  25. Fitzgerald WJ, Geere R: Class of constrained ARMA models for line enhancement using real-time QR implementation. Electronics Letters 1991,27(24):2230-2231. 10.1049/el:19911379

  26. Pisarenko VF: The retrieval of harmonics from a covariance function. Geophysical Journal International 1973,33(3):347-366. 10.1111/j.1365-246X.1973.tb03424.x

  27. Jackson LB, Tufts DW, Soong FK, Rao RM: Frequency estimation by linear prediction. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '78), April 1978, Tulsa, Okla, USA 352-356.

  28. Nam SH: Stabilizing discrete spectral modeling of audio signals. IEEE Signal Processing Letters 2002,9(9):292-294. 10.1109/LSP.2002.803406

  29. Oppenheim AV, Johnson DH, Steiglitz K: Computation of spectra with unequal resolution using the fast Fourier transform. Proceedings of the IEEE 1971,59(2):299-301.

  30. Smith JO III, Abel JS: Bark and ERB bilinear transforms. IEEE Transactions on Speech and Audio Processing 1999,7(6):697-708. 10.1109/89.799695

  31. Nehorai A: A minimal parameter adaptive notch filter with constrained poles and zeros. IEEE Transactions on Acoustics, Speech, and Signal Processing 1985,33(4):983-996. 10.1109/TASSP.1985.1164643

  32. Nehorai A, Porat B: Adaptive comb filtering for harmonic signal enhancement. IEEE Transactions on Acoustics, Speech, and Signal Processing 1986,34(5):1124-1138. 10.1109/TASSP.1986.1164952

  33. Ng TS: Some aspects of an adaptive digital notch filter with constrained poles and zeros. IEEE Transactions on Acoustics, Speech, and Signal Processing 1987,35(2):158-161. 10.1109/TASSP.1987.1165114

  34. Travassos-Romano JM, Bellanger M: Fast least squares adaptive notch filtering. IEEE Transactions on Acoustics, Speech, and Signal Processing 1988,36(9):1536-1540.

  35. Li G: A stable and efficient adaptive notch filter for direct frequency estimation. IEEE Transactions on Signal Processing 1997,45(8):2001-2009. 10.1109/78.611196

  36. van Waterschoot T, Moonen M: Constrained pole-zero linear prediction: an efficient and near-optimal method for multi-tone frequency estimation. Proceedings of the 16th European Signal Processing Conference (EUSIPCO '08), August 2008, Lausanne, Switzerland

  37. van Waterschoot T, Diehl M, Moonen M: Constrained pole-zero linear prediction: optimization of cascaded biquadratic notch filters for multi-tone and multi-pitch estimation. Katholieke Universiteit Leuven, Leuven, Belgium; February 2008.

  38. Ojanperä J, Väänänen M, Yin L: Long term predictor for transform domain perceptual audio coding. Proceedings of the 107th AES Convention, September 1999, New York, NY, USA AES preprint 5036

  39. Herre J, Grill B: Overview of MPEG-4 audio and its applications in mobile communications. Proceedings of the 5th International Conference on Signal Processing Proceedings (WCCC-ICSP '00), August 2000, Beijing, China 11-20.

  40. Kopec GE, Oppenheim AV, Tribolet JM: Speech analysis by homomorphic prediction. IEEE Transactions on Acoustics, Speech, and Signal Processing 1977,25(1):40-49. 10.1109/TASSP.1977.1162909

  41. Steiglitz K: On the simultaneous estimation of poles and zeros in speech analysis. IEEE Transactions on Acoustics, Speech, and Signal Processing 1977,25(3):229-234. 10.1109/TASSP.1977.1162939

  42. Mitiche L, Derras B, Adamou-Mitiche ABH: Efficient low-order auto regressive moving average (ARMA) models for speech signals. Acoustics Research Letters Online 2004, 5: 75-81. 10.1121/1.1651193

  43. Markel JD, Gray AH Jr.: Linear Prediction of Speech. Springer, New York, NY, USA; 1976.

  44. van Waterschoot T, Moonen M: Linear prediction of audio signals. Proceedings of the 8th Annual Conference of the International Speech Communication Association (INTERSPEECH '07), August 2007, Antwerp, Belgium 3: 518-521.

  45. Kumaresan R: On the zeros of the linear prediction-error filter for deterministic signals. IEEE Transactions on Acoustics, Speech, and Signal Processing 1983,31(1):217-220. 10.1109/TASSP.1983.1164021

  46. Ulrych TJ, Bishop TN: Maximum entropy spectral analysis and autoregressive decomposition. Reviews of Geophysics and Space Physics 1975,13(1):183-200. 10.1029/RG013i001p00183

  47. Qian Y, Chahine G, Kabal P: Pseudo-multi-tap pitch filters in a low bit-rate CELP speech coder. Speech Communication 1994,14(4):339-358. 10.1016/0167-6393(94)90027-2

  48. Kroon P, Atal BS: Pitch predictors with high temporal resolution. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '90), April 1990, Albuquerque, NM, USA 2: 661-664.

  49. Laakso TI, Välimäki V, Karjalainen M, Laine UK: Splitting the unit delay [FIR/all pass filters design]. IEEE Signal Processing Magazine 1996,13(1):30-60. 10.1109/79.482137

  50. Härmä A, Karjalainen M, Savioja L, Välimäki V, Laine UK, Huopaniemi J: Frequency-warped signal processing for audio applications. Journal of the Audio Engineering Society 2000,48(11):1011-1031.

  51. Haykin S: Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, NJ, USA; 1996.

  52. Zwicker E, Fastl H: Psychoacoustics, Facts and Models. Springer, Berlin, Germany; 1990.

  53. Moore BCJ, Glasberg BR: A revision of Zwicker's loudness model. Acta Acustica United with Acustica 1996,82(2):335-345.

  54. van Waterschoot T, Moonen M: A pole-zero placement technique for designing second-order IIR parametric equalizer filters. IEEE Transactions on Audio, Speech, and Language Processing 2007,15(8):2561-2565.

  55. Opolko F, Wapnick J: McGill University Master Samples. DVD edition. McGill University, Montreal, Canada; 2006.

Acknowledgments

This research work was carried out at the ESAT laboratory of the Katholieke Universiteit Leuven, in the frame of Katholieke Universiteit (KU) Leuven Research Council: CoE EF/05/006 Optimization in Engineering (OPTEC) and the Belgian Programme on Interuniversity Attraction Poles, initiated by the Belgian Federal Science Policy Office IUAP P6/04 ("Dynamical systems, control and optimization" (DYSCO), 2007–2011) and the Concerted Research Action GOA-AMBioRICS, and was supported by the Institute for the Promotion of Innovation through Science and Technology, Flanders (IWT-Vlaanderen). The scientific responsibility is assumed by its authors.

Author information

Corresponding author

Correspondence to Toon van Waterschoot.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

van Waterschoot, T., Moonen, M. Comparison of Linear Prediction Models for Audio Signals. J AUDIO SPEECH MUSIC PROC. 2008, 706935 (2009). https://doi.org/10.1155/2008/706935

  • DOI: https://doi.org/10.1155/2008/706935

Keywords