Adaptive Long-Term Coding of LSF Parameters Trajectories for Large-Delay/Very- to Ultra-Low Bit-Rate Speech Coding
EURASIP Journal on Audio, Speech, and Music Processing volume 2010, Article number: 597039 (2010)
Abstract
This paper presents a model-based method for coding the LSF parameters of LPC speech coders on a "long-term" basis, that is, beyond the usual 20–30 ms frame duration. The objective is to provide efficient LSF quantization for a speech coder with large delay but very- to ultra-low bit rate (i.e., below 1 kb/s). To do this, speech is first segmented into voiced/unvoiced segments. A Discrete Cosine model of the time trajectory of the LSF vectors is then applied to each segment to capture the LSF interframe correlation over the whole segment. Bidirectional transformation between the model coefficients and a reduced set of LSF vectors enables both efficient "sparse" coding (here using multistage vector quantizers) and the generation of interpolated LSF vectors at the decoder. The proposed method provides up to 50% gain in bit rate over frame-by-frame quantization while preserving signal quality, and competes favorably with 2D-transform coding for the lower range of tested bit rates. Moreover, the implicit time-interpolation nature of the long-term coding process gives this technique high potential for use in speech synthesis systems.
1. Introduction
The linear predictive coding (LPC) model has enjoyed considerable success in speech processing for forty years [1]. It is now widely used in many speech compression systems [2]. As a result of the underlying well-known "source-filter" representation of the signal, LPC-based coders generally separate the quantization of the LPC filter, supposed to represent the vocal tract evolution, from the quantization of the residual signal, supposed to represent the vocal source signal. In modern speech coders, low-rate quantization of the LPC filter coefficients is usually achieved by applying vector quantization (VQ) techniques to the Line Spectral Frequency (LSF) parameters [3, 4], which are an appropriate dual representation of the filter coefficients that is particularly robust to quantization and interpolation [5].
In speech coders, the LPC analysis and coding process is performed on a short-term frame-by-frame basis: LSF parameters (and excitation parameters) are usually extracted, quantized, and transmitted every 20 ms or so, following the speech time dynamics. Since the evolution of the vocal tract is quite smooth and regular for many speech sequences, high correlation between successive LPC parameters has been observed and can be exploited in speech coders. For example, the difference between LSF vectors is coded in [6]. Both intraframe and interframe LSF correlations are exploited in the 2D coding scheme of [7]. Alternatively, matrix quantization was applied to jointly quantize up to three successive LSF vectors in [8, 9]. More generally, Recursive Coding, with application to LPC/LSF vector quantization, is described in [2] as a general source coding framework where the quantization of one vector depends on the result of the quantization of the previous vector(s).^{1} Recent theoretical and experimental developments on recursive (vector) coding are provided in, for example, [10, 11], leading to LSF vector coding at less than 20 bits/frame. In the same vein, Kalman filtering has recently been used to combine one-step tracking of LSF trajectories with GMM-based vector quantization [12]. In parallel, some studies have attempted to explicitly take into account the smoothness of the evolution of spectral parameters in speech coding techniques. For example, a target matching method was proposed in [13]: the authors match the output of the LPC predictor to a target signal constructed using a smoothed version of the excitation signal, in order to jointly smooth both the residual signal and the frame-to-frame variation of LSF coefficients. This idea has recently been revisited in a different form in [14], by introducing a memory term in the widely used Spectral Distortion measure that controls the LSF quantization. This memory term penalizes "noisy fluctuations" of LSF trajectories and tends to "smooth" the quantization process across consecutive frames.
In all those studies, the interframe correlation has been considered "locally", that is, between only two (or three for matrix quantization) consecutive frames. This is mainly because the telephony target application requires limiting the coding delay. When the constraint on the delay can be relaxed, for example, in half-duplex communication, speech storage, or speech synthesis applications, the coding process can be considered on larger signal windows. In that vein, the Temporal Decomposition technique introduced by Atal [15] and studied by several researchers (e.g., [16]) consists of decomposing the trajectory of (LPC) spectral parameters into "target vectors" which are sparsely distributed in time and linked by interpolative functions. This method has seldom been applied to speech coding (though see an interesting example in [17]), but it remains a powerful tool for modeling the temporal structure of speech. Following another idea, the authors of [18] proposed to compress time-frequency matrices of LSF parameters using a two-dimensional (2D) Discrete Cosine Transform (DCT). They provided interesting results for different temporal sizes, from 1 to 10 LSF vectors (spaced 10 ms apart). A major point of this method is that it jointly exploits the time and frequency correlation of LSF values. An adaptive version of this scheme was implemented in [19], allowing a varying size from 1 to 20 vectors for voiced speech sections and 1 to 8 vectors for unvoiced speech. Also, the optimal Karhunen-Loève Transform (KLT) was tested in addition to the 2D-DCT.
More recently, Dusan et al. have proposed in [20, 21] to model the trajectories of ten consecutive LSF parameters by a fourth-order polynomial model. In addition, they implemented a very low bit rate speech coder exploiting this idea. At the same time, we proposed in [22, 23] to model the long-term^{2} (LT) trajectory of sinusoidal speech parameters (i.e., phases and amplitudes) with a Discrete Cosine model. In contrast to [20, 21], where the length of parameter trajectories and the order of the model were fixed, in [22, 23] the long-term frames are continuously voiced (V) or continuously unvoiced (UV) sections of speech. Those sections result from a preliminary V/UV segmentation, and they exhibit highly variable size and "shape". For example, such a segment can contain several phonemes or syllables (it can even be a quite long all-voiced sentence in some cases). Therefore, we proposed a fitting algorithm to automatically adjust the complexity (i.e., the order) of the LT model according to the characteristics of the modeled speech segment. As a result, the trajectory size/model order could exhibit quite different (and often larger) combinations than the ten-to-four conversion of [20, 21]. Finally, we carried out in [24] a variable-rate coding of the trajectory of LSF parameters by adapting our (sinusoidal) adaptive LT modeling approach of [22, 23] to the LPC quantization framework. The V/UV segmentation and the Discrete Cosine model are retained,^{3} but the fitting algorithm is significantly modified to include quantization issues. For instance, the same bidirectional procedure as the one used in [20, 21] is used to switch from the LT model coefficients to a reduced set of LSF vectors at the coder, and vice versa at the decoder. The reduced set of LSF vectors is quantized by multistage vector quantizers, and the corresponding LT model is recalculated at the decoder from the quantized reduced set of LSFs. An extended set of interpolated LSF vectors is finally derived from the "quantized" LT model. The model order is determined by an iterative adjustment of the Spectral Distortion (SD) measure, which is classic in LPC filter quantization, instead of the perceptual criteria adapted to the sinusoidal model used in [22, 23]. It can be noted that the implicit time-interpolation nature of the long-term decoding process makes this technique potentially very suitable for joint decoding-transformation in speech synthesis systems (in particular, in unit-based concatenative speech synthesis for mobile/autonomous systems). This point is not developed in this paper, which focuses on coding, but it is discussed as an important perspective (see Section 5).
The present paper builds directly on [24]. Its first objective is to present the adaptive long-term LSF quantization method in more detail. Its second objective is to provide additional material that was not developed in [24]: some rate/distortion issues related to the adaptive variable-rate aspect of the method are discussed, and a new series of rate/distortion curves obtained with a refined LSF analysis step is presented. Furthermore, in addition to the comparison with usual frame-by-frame quantization, those results are compared with the ones obtained with an adaptive version (for fair comparison) of the 2D-based methods of [18, 19]. The results show that the trajectories of the LSFs can be coded by the proposed method with far fewer bits than usual frame-by-frame coding techniques using the same type of quantizers. They also show that the proposed method significantly outperforms the 2D-transform methods for the lower tested bit rates. Finally, the results of a formal listening test are presented, showing that the proposed method can preserve a fair speech quality with LSFs coded at very- to ultra-low bit rates.
This paper is organized as follows. The proposed long-term model is described in Section 2. The complete long-term coding of LSF vectors is presented in Section 3, including the description of the fitting algorithm and the quantization steps. Experiments and results are given in Section 4. Section 5 is a discussion/conclusion section.
2. The Long-Term Model for LSF Trajectories
In this section, we first consider the problem of modeling the time trajectory of a sequence of K consecutive LSF parameters. These LSF parameters correspond to a given (all voiced or unvoiced) section of speech signal $s(n)$, running arbitrarily from $n = 1$ to $N$. They are obtained from $s(n)$ using a standard LPC analysis procedure applied on successive short-term analysis windows, with a window size and a hop size within the range 10–30 ms (see Section 4.2). For the following, let us denote by $\mathbf{N} = [n_1\; n_2\; \cdots\; n_K]$ the vector containing the sample indexes of the analysis frame centers. Each LSF vector extracted at time instant $n_k$ is denoted $\boldsymbol{\omega}_{(I),k} = [\omega_{1,k}\; \omega_{2,k}\; \cdots\; \omega_{I,k}]^T$, for $k = 1$ to $K$ ($^T$ denotes the transpose operator^{4}). $I$ is the order of the LPC model [1, 5], and we take here the standard value $I = 10$ for 8-kHz telephone speech. Thus, we actually have $I$ LSF trajectories of K values to model. To this end, let us denote by $\boldsymbol{\omega}_{(I),(K)}$ the $I \times K$ matrix of general entry $\omega_{i,k}$: the LSF trajectories are the $I$ row vectors, denoted $\boldsymbol{\omega}_{i,(K)}$, for $i = 1$ to $I$.
Different kinds of models can be used for representing these trajectories. As mentioned in the introduction, a fourth-order polynomial model was used in [20] for representing ten consecutive LSF values. In [23], we used a sum of discrete cosine functions, close to the well-known Discrete Cosine Transform (DCT), to model the trajectories of sinusoidal (amplitude and phase) parameters. We called this model a Discrete Cosine Model (DCM). In [25], we compared the DCM with a mixed cosine-sine model and the polynomial model, still in the sinusoidal framework. Overall, the results were quite close, but the polynomial model could lead to numerical problems when the size of the modeled trajectory was large. Therefore, and to keep the number of experimental configurations in Section 4 manageable, we consider only the DCM in the present paper. Note that, more generally, this model is known to be efficient in capturing the variations of a signal (e.g., when directly applied to signal samples as for the DCT, or when applied on log-scaled spectral envelopes, as in [26, 27]). Thus, it should be well suited to capture the global shape of LSF trajectories.
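Formally, the DCM is defined for each of the $I$ LSF trajectories by

$$\tilde{\omega}_i(n) = \sum_{p=0}^{P} c_{i,p} \cos\left(p\pi\frac{n}{N}\right), \quad 1 \le n \le N. \qquad (1)$$

The model coefficients $c_{i,p}$ are all real. $P$ is a positive integer defining the order of the model. Here, it is the same for all LSFs (i.e., $P_i = P$), since this significantly simplifies the overall coding scheme presented next. Note that, although the LSFs are initially defined framewise, the model provides an LSF value for each time index n. This property is exploited in the proposed quantization process of Section 3.1. It is also expected to be very useful for speech synthesis systems, as it provides a direct and simple way to perform time interpolation of LSF vectors for time-stretching/compression of speech: interpolated LSF vectors can be calculated using (1) at any arbitrary instant, while the general shape of the trajectory is preserved.
Let us now consider the calculation of the matrix of model coefficients $\mathbf{C}$, that is, the $I \times (P+1)$ matrix of general term $c_{i,p}$, given that $P$ is known. We will see in Section 3.2 how an optimal value of $P$ is estimated for each LSF vector sequence to be quantized. Let us denote by $\mathbf{M}$ the $(P+1) \times K$ model matrix that gathers the DCM terms evaluated at the sample indexes $n_1, \ldots, n_K$ of the analysis frame centers (the entries of $\mathbf{N}$):

$$\mathbf{M} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ \cos\left(\pi\frac{n_1}{N}\right) & \cos\left(\pi\frac{n_2}{N}\right) & \cdots & \cos\left(\pi\frac{n_K}{N}\right) \\ \vdots & \vdots & & \vdots \\ \cos\left(P\pi\frac{n_1}{N}\right) & \cos\left(P\pi\frac{n_2}{N}\right) & \cdots & \cos\left(P\pi\frac{n_K}{N}\right) \end{bmatrix}. \qquad (2)$$

The modeled LSF trajectories are thus given by the rows of

$$\tilde{\boldsymbol{\omega}}_{(I),(K)} = \mathbf{C}\,\mathbf{M}. \qquad (3)$$

$\mathbf{C}$ is estimated by minimizing the mean square error (MSE) between the modeled LSF data and the original LSF data, gathered in the $I \times K$ matrix $\boldsymbol{\omega}_{(I),(K)}$. Since the modeling process aims at providing data dimension reduction for efficient coding, we assume that $P + 1 < K$, and the optimal coefficient matrix is classically given by

$$\mathbf{C} = \boldsymbol{\omega}_{(I),(K)}\,\mathbf{M}^T\left(\mathbf{M}\mathbf{M}^T\right)^{-1}. \qquad (4)$$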
Finally, note that in practice we used the "regularized" version of (4) proposed in [27]: a diagonal "penalizing" term is added to the inverted matrix in (4) to fix possible ill-conditioning problems. In our study, setting the regularizing factor of [27] to 0.01 gave very good results (no ill-conditioned matrix over the entire database of Section 4.2).
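The least-squares fit of (4) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `dcm_matrix` and `fit_dcm`, the toy coefficient values, and the frame positions are hypothetical, and the penalty is taken as the regularizing factor times the identity matrix, which is a simple stand-in for the diagonal term of [27] whose exact form is not given in this section.

```python
import numpy as np

def dcm_matrix(positions, n_samples, order):
    """(order+1) x len(positions) matrix of cos(p*pi*n/N) terms, as in (2)."""
    p = np.arange(order + 1)[:, None]
    n = np.asarray(positions, dtype=float)[None, :]
    return np.cos(p * np.pi * n / n_samples)

def fit_dcm(lsf, positions, n_samples, order, reg=0.01):
    """Least-squares coefficients C (I x (order+1)) for an I x K LSF matrix.

    Assumption: the penalty is reg * identity, a stand-in for the
    diagonal penalizing term of [27].
    """
    M = dcm_matrix(positions, n_samples, order)
    gram = M @ M.T + reg * np.eye(order + 1)
    # Solve C @ gram = lsf @ M.T without forming an explicit inverse.
    return np.linalg.solve(gram.T, (lsf @ M.T).T).T

# Hypothetical toy section: K = 12 frames, 160-sample hop (20 ms at 8 kHz).
K, hop, order = 12, 160, 3
positions = hop * np.arange(K) + hop // 2
n_samples = hop * K
true_C = np.array([[0.30, 0.05, -0.02, 0.01],
                   [0.80, -0.04, 0.03, 0.00]])
lsf = true_C @ dcm_matrix(positions, n_samples, order)  # noiseless trajectories
C = fit_dcm(lsf, positions, n_samples, order, reg=0.0)
max_err = float(np.max(np.abs(C - true_C)))
```

With noiseless trajectories and no penalty, the fit recovers the generating coefficients up to rounding error; a nonzero `reg` trades a small bias for numerical robustness.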
3. Coding of LSF Based on the LT Model
In this section, we present the overall algorithm for quantizing every sequence of K LSF vectors, based on the LT model presented in Section 2. As mentioned in the introduction, the shape of spectral parameter trajectories can vary widely, depending on, for example, the length of the considered section, the phoneme sequence, the speaker, the prosody, or the rank of the LSF. Therefore, the appropriate order P of the LT model can also vary widely, and it must be estimated: within the coding context, a trade-off between LT model accuracy (for an efficient representation of data) and sparseness (for bit rate limitation) is required. The proposed LT model will be efficiently exploited in low bit rate LSF coding if in practice P is significantly lower than K while the modeled and original LSF trajectories remain close enough.
For simplicity, the overall LSF coding process is presented in several steps. In Section 3.1, the quantization process is described given that the order P is known. Then in Section 3.2, we present an iterative global algorithm that uses the process of Section 3.1 as an analysis-by-synthesis process to search for the optimal order P. The quantizer block that is used in the above-mentioned algorithm is presented in Section 3.3. Finally, we discuss in Section 3.4 some points regarding the rate-distortion relationship in this specific context of long-term coding.
3.1. Long-Term Model and Quantization
Let us first address the problem of quantizing the LSF information, that is, representing it with limited binary resource, given that P is known. Direct quantization of the DCM coefficients of (3) could be considered, as in [18, 19]. However, in the present study the DCM is in one dimension,^{5} as opposed to the 2D-DCT of [18, 19]. We thus prefer to avoid the quantization of DCM coefficients by applying a one-to-one transformation between the DCM coefficients and a reduced set of LSF vectors, as was done in [20, 21].^{6} This reduced set of LSF vectors is quantized using vector quantization, which is efficient for exploiting the intraframe LSF redundancy. At the decoder, the complete "quantized" set of LSF vectors is retrieved from the reduced set, as detailed below. This approach has several advantages. First, it enables the control of correct global trajectories of quantized LSFs by using the reduced set as "breakpoints" for these trajectories. Second, it allows the use of usual techniques for LSF vector quantization. Third, it enables a fair comparison of the proposed method, which mixes LT modeling with VQ, with usual frame-by-frame LSF quantization using the same type of quantizers. Therefore, a quantitative assessment of the gain due to the LT modeling can be derived (see Section 4.4).
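Let us now present the one-to-one transformation between the coefficient matrix $\mathbf{C}$ of (4) and the reduced set of LSF vectors. For this, let us first define an arbitrary function $f(P, N)$ that uniquely allocates $P+1$ time positions, denoted $\mathbf{J} = [j_1\; j_2\; \cdots\; j_{P+1}]$, among the $N$ samples of the considered speech section. Let us also define $\mathbf{Q}$, a new model matrix evaluated at the instants of $\mathbf{J}$ (hence $\mathbf{Q}$ is a "reduced" version of the model matrix $\mathbf{M}$ of (2), since $P+1 < K$):

$$\mathbf{Q} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ \cos\left(\pi\frac{j_1}{N}\right) & \cos\left(\pi\frac{j_2}{N}\right) & \cdots & \cos\left(\pi\frac{j_{P+1}}{N}\right) \\ \vdots & \vdots & & \vdots \\ \cos\left(P\pi\frac{j_1}{N}\right) & \cos\left(P\pi\frac{j_2}{N}\right) & \cdots & \cos\left(P\pi\frac{j_{P+1}}{N}\right) \end{bmatrix}. \qquad (5)$$

The reduced set of LSF vectors is the set of $P+1$ modeled LSF vectors calculated at the instants of $\mathbf{J}$, that is, the columns $\tilde{\boldsymbol{\omega}}_{(I),j_p}$, $p = 1$ to $P+1$, of the matrix

$$\tilde{\boldsymbol{\omega}}_{(I),(J)} = \mathbf{C}\,\mathbf{Q}. \qquad (6)$$

The one-to-one transformation of interest is based on the following general property of MMSE estimation techniques: the matrix C of (4) can be exactly recovered using the reduced set of LSF vectors by

$$\mathbf{C} = \tilde{\boldsymbol{\omega}}_{(I),(J)}\,\mathbf{Q}^T\left(\mathbf{Q}\mathbf{Q}^T\right)^{-1}, \qquad (7)$$

which reduces to $\tilde{\boldsymbol{\omega}}_{(I),(J)}\,\mathbf{Q}^{-1}$ since $\mathbf{Q}$ is square and assumed invertible.
Therefore, the quantization strategy is the following. Only the reduced set of LSF vectors is quantized using VQ (instead of the overall set of K original vectors, as would be the case in usual coding techniques). The indexes of the codewords are transmitted. At the decoder, the corresponding quantized vectors are gathered in a matrix denoted $\hat{\boldsymbol{\omega}}_{(I),(J)}$, and the DCM coefficient matrix is estimated by applying (7) with this quantized reduced set of LSF vectors instead of the unquantized reduced set:

$$\hat{\mathbf{C}} = \hat{\boldsymbol{\omega}}_{(I),(J)}\,\mathbf{Q}^T\left(\mathbf{Q}\mathbf{Q}^T\right)^{-1}. \qquad (8)$$

Finally, the "quantized" LSF vectors at the original K indexes are given by applying a variant of (3) using (8):

$$\hat{\boldsymbol{\omega}}_{(I),(K)} = \hat{\mathbf{C}}\,\mathbf{M}. \qquad (9)$$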
Note that the resulting LSF vectors, which are the columns of the above matrix, are loosely called the "quantized" LSF vectors, although they are not directly generated by VQ. This is because they are the LSF vectors used at the decoder for signal reconstruction. Note also that (8) implies that the matrix Q, or alternatively the vector J, is available at the decoder. In this study, the $P+1$ positions are regularly spaced in the considered speech section (with rounding to the nearest integer if necessary). Thus Q can be generated at the decoder and need not be transmitted. Only the size K of the sequence and the order P must be transmitted in addition to the LSF vector codewords. A quantitative assessment of the corresponding additional bit rate is given in Section 4.4. We will see that it is very small compared to the bit rate gain provided by the LT coding method. The whole process is summarized in Figure 1.
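The decoder side of this scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, the regularly spaced positions are assumed to run from sample 1 to sample N inclusive (the text only states that they are regularly spaced and rounded), and no quantizer is applied, so the reconstruction should match the model exactly.

```python
import numpy as np

def dcm_matrix(positions, n_samples, order):
    """(order+1) x len(positions) matrix of cos(p*pi*n/N) terms."""
    p = np.arange(order + 1)[:, None]
    n = np.asarray(positions, dtype=float)[None, :]
    return np.cos(p * np.pi * n / n_samples)

def regular_positions(n_samples, order):
    """order+1 regularly spaced sample indexes (one possible choice for J)."""
    return np.rint(np.linspace(1, n_samples, order + 1)).astype(int)

def decode(reduced_lsf, n_samples, order, frame_positions):
    """Rebuild the coefficients from the reduced set, then evaluate the
    model at the frame centers. Only K and P are needed besides the vectors."""
    Q = dcm_matrix(regular_positions(n_samples, order), n_samples, order)
    C_hat = np.linalg.solve(Q.T, reduced_lsf.T).T   # solves C_hat @ Q = reduced
    return C_hat @ dcm_matrix(frame_positions, n_samples, order)

# Hypothetical toy section: 12 frames, 160-sample hop, model order 3.
n_samples, order = 1920, 3
frames = 160 * np.arange(12) + 80
C = np.array([[0.30, 0.05, -0.02, 0.01],
              [0.80, -0.04, 0.03, 0.00]])
Q = dcm_matrix(regular_positions(n_samples, order), n_samples, order)
reduced = C @ Q                                     # unquantized reduced set
rebuilt = decode(reduced, n_samples, order, frames)
max_err = float(np.max(np.abs(rebuilt - C @ dcm_matrix(frames, n_samples, order))))
```

Because cos(p·θ) is a polynomial of degree p in cos(θ), the square matrix Q is invertible whenever the chosen positions are distinct, which is why the coefficients can be recovered exactly from the reduced set.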
3.2. Iterative Estimation of Model Order
In this subsection, we present the iterative algorithm that is used to estimate the optimal DCM order P for each sequence of K LSF vectors. For this, a performance criterion for the overall process is first defined. This performance criterion is the usual Average Spectral Distortion (ASD) measure, which is a standard in LPC-based speech coding [28]:

$$\mathrm{ASD} = \frac{1}{K}\sum_{k=1}^{K}\sqrt{\frac{1}{2\pi}\int_{-\pi}^{\pi}\left[10\log_{10}P_k\left(e^{j\omega}\right) - 10\log_{10}\hat{P}_k\left(e^{j\omega}\right)\right]^2 d\omega}, \qquad (10)$$

where $P_k(e^{j\omega})$ and $\hat{P}_k(e^{j\omega})$ are the LPC power spectra corresponding to the original and quantized LSF vectors, respectively, for frame k (recall that K is the size of the quantized LSF vector sequence). In practice, the integral in (10) is calculated using a 512-bin FFT.
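As an illustration of how such a distortion can be evaluated numerically, the sketch below computes the spectral distortion between two all-pole filters on 512 frequency bins and averages it over frames. It is written under simplifying assumptions: the function names are hypothetical, the inputs are LPC polynomial coefficients rather than LSF vectors (the LSF-to-LPC conversion is omitted), and a direct evaluation on the unit circle replaces the FFT for readability.

```python
import cmath
import math

def lpc_power_spectrum_db(a, n_bins=512):
    """10*log10(1/|A(e^{jw})|^2) on n_bins frequencies around the unit circle.

    `a` holds the LPC polynomial coefficients [a0, a1, ..., aI].
    """
    spectrum = []
    for b in range(n_bins):
        w = 2.0 * math.pi * b / n_bins
        A = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(a))
        spectrum.append(-20.0 * math.log10(abs(A)))
    return spectrum

def spectral_distortion(a_ref, a_quant, n_bins=512):
    """Root mean square difference (dB) between the two log power spectra."""
    ref = lpc_power_spectrum_db(a_ref, n_bins)
    qnt = lpc_power_spectrum_db(a_quant, n_bins)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ref, qnt)) / n_bins)

def average_spectral_distortion(frames_ref, frames_quant):
    """Mean of the per-frame spectral distortions over the K frames."""
    sds = [spectral_distortion(r, q) for r, q in zip(frames_ref, frames_quant)]
    return sum(sds) / len(sds)

# Toy first-order filters (hypothetical values).
sd_same = spectral_distortion([1.0, -0.9], [1.0, -0.9])
sd_gain = spectral_distortion([1.0, -0.9], [2.0, -1.8])   # pure gain change
asd = average_spectral_distortion([[1.0, -0.9], [1.0, -0.5]],
                                  [[1.0, -0.85], [1.0, -0.5]])
```

A pure gain change by a factor of 2 shifts the log spectrum by a constant 20·log10(2) ≈ 6.02 dB, so `sd_gain` takes that value, which is a convenient sanity check.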
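For a given quantizer, an ASD target value, denoted $\mathrm{ASD}_{\max}$, is set. Then, starting with $P = 1$, the complete process of Section 3.1 is applied. The ASD between the original and quantized LSF vector sequences is then calculated. If it is below $\mathrm{ASD}_{\max}$, the order is fixed to $P$; otherwise, $P$ is increased by one and the process is repeated. The algorithm is terminated for the first value of $P$ for which the ASD is below $\mathrm{ASD}_{\max}$, or otherwise, for $P = K - 2$, since we must have $P + 1 < K$. All this can be formalized by the following algorithm:
(1) choose a value for $\mathrm{ASD}_{\max}$. Set $P = 1$;
(2) apply the LT coding process of Section 3.1, that is:
(i) calculate C with (4),
(ii) calculate $\mathbf{J} = f(P, N)$,
(iii) calculate the reduced set $\tilde{\boldsymbol{\omega}}_{(I),(J)}$ with (6),
(iv) quantize $\tilde{\boldsymbol{\omega}}_{(I),(J)}$ to obtain $\hat{\boldsymbol{\omega}}_{(I),(J)}$,
(v) calculate $\hat{\boldsymbol{\omega}}_{(I),(K)}$ by combining (9) and (8);
(3) calculate the ASD between $\boldsymbol{\omega}_{(I),(K)}$ and $\hat{\boldsymbol{\omega}}_{(I),(K)}$ with (10);
(4) if $\mathrm{ASD} > \mathrm{ASD}_{\max}$ and $P < K - 2$, set $P \leftarrow P + 1$, and go to step (2), else (if $\mathrm{ASD} \le \mathrm{ASD}_{\max}$ or $P = K - 2$), terminate the algorithm.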
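The control flow of this search can be sketched independently of the coder itself. In the sketch below, `code_section` is a hypothetical callback standing for steps (2) and (3), and the section is assumed to contain at least three vectors so that an order of 1 is admissible; how shorter sections are handled is not described here.

```python
def search_order(K, asd_max, code_section):
    """Return (P, asd) for the first model order P that meets the ASD target.

    `code_section(P)` is a hypothetical callback that fits the model of
    order P, quantizes the reduced set, decodes, and returns the ASD in dB.
    The search stops at P = K - 2 even if the target is not met.
    """
    P = 1
    while True:
        asd = code_section(P)
        if asd <= asd_max or P >= K - 2:
            return P, asd
        P += 1

# Toy distortion profiles standing in for the real coder.
found = search_order(K=15, asd_max=2.1, code_section=lambda P: 4.0 / P)
capped = search_order(K=6, asd_max=2.1, code_section=lambda P: 5.0)
```

In the first call the target is met at order 2; in the second the target is never met, so the loop ends at the largest admissible order, K − 2 = 4.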
3.3. Quantizers
In this subsection, we present the quantizers that are used to quantize the reduced set of LSF vectors in step (2) of the above algorithm. As briefly mentioned in the introduction, vector quantization (VQ) has become the general approach for LSF coefficient quantization in modern speech coders [1, 3, 4]. However, for high-quality coding, basic single-stage VQ is generally limited by codebook storage capacity, search complexity, and training procedure. Thus, different suboptimal but still efficient schemes have been proposed to reduce complexity. For example, split-VQ, which consists of splitting the vectors into several subvectors for quantization, has been proposed at 24 bits/frame and offered coding transparency [28].^{7}
In this study, we used multistage VQ (MSVQ),^{8} which consists of cascading several low-resolution VQ blocks [29, 30]: the output of a block is an error vector which is quantized by the next block. The quantized vectors are reconstructed by adding the outputs of the different blocks. Therefore, each additional block increases the quantization accuracy, while the global complexity (in terms of codebook generation and search) is highly reduced compared to a single-stage VQ with the same overall bit rate. Also, different quantizers were designed and used for voiced and unvoiced LSF vectors, as in, for example, [31]. This is because we want to benefit from the V/UV signal segmentation to improve the quantization process by better fitting the general trends of voiced or unvoiced LSFs. Detailed information on the structure of the MSVQ used in this study, their design, and their performance is given in Section 4.3.
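The cascade principle can be illustrated with a toy two-stage quantizer. The sketch below is a simplification under stated assumptions: the codebooks are tiny hand-written examples, the search is greedy (one candidate kept per stage), and a plain squared Euclidean distance is used instead of a perceptually weighted one.

```python
def nearest(codebook, vector):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def dist(c):
        return sum((x - y) ** 2 for x, y in zip(c, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def msvq_encode(stages, vector):
    """Greedy stage-by-stage search; returns one codeword index per stage."""
    indexes, residual = [], list(vector)
    for codebook in stages:
        i = nearest(codebook, residual)
        indexes.append(i)
        # The next stage quantizes what this stage failed to represent.
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indexes

def msvq_decode(stages, indexes):
    """Sum the selected codewords of all stages."""
    out = [0.0] * len(stages[0][0])
    for codebook, i in zip(stages, indexes):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out

# Hypothetical codebooks: a 1-bit first stage and a 2-bit second stage.
stages = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25], [0.25, 0.25]],
]
indexes = msvq_encode(stages, [1.2, 0.8])
decoded = msvq_decode(stages, indexes)
```

Here the first stage picks [1.0, 1.0], the second stage refines the residual [0.2, -0.2] with [0.25, -0.25], and 3 bits in total are needed to transmit the two indexes.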
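3.4. Rate-Distortion Considerations
Now that the long-term coding method has been presented, it is interesting to derive an expression of the error between the original and quantized LSF matrices. Indeed, from (9) we have

$$\boldsymbol{\omega}_{(I),(K)} - \hat{\boldsymbol{\omega}}_{(I),(K)} = \boldsymbol{\omega}_{(I),(K)} - \hat{\mathbf{C}}\,\mathbf{M}. \qquad (11)$$

Combining (11) with (8), and introducing the modeled LSF matrix $\tilde{\boldsymbol{\omega}}_{(I),(K)} = \mathbf{C}\mathbf{M}$ of (3) together with (7), basic algebraic manipulation leads to

$$\boldsymbol{\omega}_{(I),(K)} - \hat{\boldsymbol{\omega}}_{(I),(K)} = \left(\boldsymbol{\omega}_{(I),(K)} - \tilde{\boldsymbol{\omega}}_{(I),(K)}\right) + \left(\tilde{\boldsymbol{\omega}}_{(I),(J)} - \hat{\boldsymbol{\omega}}_{(I),(J)}\right)\mathbf{Q}^T\left(\mathbf{Q}\mathbf{Q}^T\right)^{-1}\mathbf{M}. \qquad (12)$$

Equation (12) shows that the overall quantization error on LSF vectors can be seen as the sum of the contributions of the LT modeling and the quantization process. Indeed, on the right side of (12), the first term, $\boldsymbol{\omega}_{(I),(K)} - \tilde{\boldsymbol{\omega}}_{(I),(K)}$, is the LT modeling error, defined as the difference between the original and the modeled LSF vector sequences. Additionally, $\tilde{\boldsymbol{\omega}}_{(I),(J)} - \hat{\boldsymbol{\omega}}_{(I),(J)}$ is the quantization error of the reduced set of LSF vectors. It is "spread" over the K original time indexes by a linear transformation built from matrices M and Q. The modeling and quantization errors are independent. Therefore, the proposed method will be efficient if the bit rate gain resulting from quantizing only the reduced set of LSF vectors (compared to quantizing the whole set of K vectors in frame-by-frame quantization) compensates for the loss due to the modeling.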
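The additive structure of the error can be checked numerically. The sketch below is only an illustration under stated assumptions: random data stand in for real LSFs, additive Gaussian noise stands in for the vector quantizer, the positions of the reduced set are assumed regularly spaced from sample 1 to sample N, and all sizes are hypothetical. It verifies that the total error equals the modeling error plus the reduced-set quantization error mapped through Q and M.

```python
import numpy as np

rng = np.random.default_rng(0)
I, K, P, N = 10, 15, 4, 2400          # hypothetical sizes, N = K * 160 samples

def dcm_matrix(positions, n_samples, order):
    """(order+1) x len(positions) matrix of cos(p*pi*n/N) terms."""
    p = np.arange(order + 1)[:, None]
    n = np.asarray(positions, dtype=float)[None, :]
    return np.cos(p * np.pi * n / n_samples)

frames = 160 * np.arange(K) + 80      # frame centers
J = np.rint(np.linspace(1, N, P + 1)).astype(int)
M, Q = dcm_matrix(frames, N, P), dcm_matrix(J, N, P)

lsf = rng.random((I, K))                                  # stand-in for LSFs
C = lsf @ M.T @ np.linalg.inv(M @ M.T)                    # least-squares fit
reduced = C @ Q                                           # reduced set
reduced_q = reduced + 0.01 * rng.standard_normal(reduced.shape)  # fake VQ
Q_pinv = Q.T @ np.linalg.inv(Q @ Q.T)
lsf_q = reduced_q @ Q_pinv @ M                            # decoder output

total_error = lsf - lsf_q
modeling_error = lsf - C @ M
spread_quant_error = (reduced - reduced_q) @ Q_pinv @ M
gap = float(np.max(np.abs(total_error - (modeling_error + spread_quant_error))))
```

The value of `gap` is at the level of floating-point rounding, which confirms that the two contributions simply add; it says nothing about their statistical independence, which depends on the actual quantizer.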
In the proposed LT LSF coding method, the bit rate b for a given section of speech is given by $b = (P+1)\,r/(K h)$, where r is the resolution of the quantizer (in bits/vector) and h is the hop size of the LSF analysis window (h = 20 ms). Since the LT coding scheme is an intrinsically variable-rate technique, we also define an average bit rate, which results from encoding a large number of LSF vector sequences:

$$\bar{b} = \frac{r\sum_{m=1}^{M}\left(P_m + 1\right)}{h\sum_{m=1}^{M}K_m}, \qquad (13)$$

where $m$ indexes each sequence of LSF vectors of the considered database, M being the number of sequences, and $P_m$ and $K_m$ are the model order and the size of sequence $m$. In the LT coding process, increasing the quantizer resolution does not necessarily increase the bit rate, as opposed to usual coding methods, since it may decrease the number of LT model coefficients (for the same overall ASD target). Therefore, an optimal LT coding configuration is expected to result from a trade-off between quantizer resolution and LT modeling accuracy. In Section 4.4, we provide extensive distortion-rate results by testing the method on a large speech database, and varying both the resolution of the quantizer and the ASD target value.
4. Experiments
In this section, we describe the set of experiments that were conducted to test the long-term coding of LSF trajectories. We first briefly describe in Section 4.1 the 2D-transform coding techniques [18, 19] that we implemented in parallel for comparison with the proposed technique. The database used in the experiments is presented in Section 4.2. Section 4.3 presents the design of the MSVQ quantizers used in the LT coding algorithm. Finally, in Section 4.4, the results of the LSF long-term coding process are presented.
4.1. 2D-Transform Coding Reference Methods
As briefly mentioned in the introduction, the basic principle of the 2D-transform coding methods consists of applying either a 2D-DCT or a Karhunen-Loève Transform (KLT) to the LSF matrices. In contrast to the present study, the resulting transform coefficients are directly quantized using scalar quantization (after normalization). Bit allocation tables, transform coefficient means and variances, and optimal (nonuniform) scalar quantizers are determined during a training phase applied to a training corpus of data (see Section 4.2): bit allocation among the set of transformed coefficients is determined from their variance [32] and the quantizers are designed using the LBG algorithm [33] (see [18, 19] for details). This is done for each considered temporal size K, and for a large range of bit rates (see Section 4.4).
4.2. Database
We used American English sentences from the TIMIT database [34]. The signals were resampled at 8 kHz and band-pass filtered to the 300–3400 Hz telephone band. The LSF vectors were calculated every 20 ms using the autocorrelation method, with a 30 ms Hann window (hence a 33% overlap),^{9} high-frequency pre-emphasis with the filter , and 10 Hz bandwidth expansion. The voiced/unvoiced segmentation was based on the TIMIT label files, which contain the phoneme labels and boundaries (given as sample indexes) for each sentence. An LSF vector was classified as voiced if at least 25% of the analysis frame was part of a voiced phoneme region. Otherwise, it was classified as an unvoiced LSF vector.
Eight sentences from each of 176 speakers (half male and half female) from the eight different dialect regions of the TIMIT database were used for building the training corpus. This represents about 47 min of voiced speech and 16 min of unvoiced speech. This resulted in 141,058 voiced LSF vectors from 9,744 sections, and 45,220 unvoiced LSF vectors from 9,271 sections. This corpus was used to design the MSVQ quantizers used in the proposed LT coding technique (see Section 4.3). It was also used to design the bit allocation tables and associated optimal scalar quantizers for the 2D-transform coefficients of the reference methods.^{10}
In parallel, eight other sentences from 84 other speakers (also 50% male, 50% female, and from the eight dialect regions) were used for the test corpus. It contains 67,826 voiced vectors from 4,573 sections (about 23 min of speech), and 22,242 unvoiced vectors from 4,351 sections (about 8 min of speech). This test corpus was used to test the LT coding method, and to compare it with frame-by-frame VQ and the 2D-transform methods.
The histograms of the temporal size K of the (voiced and unvoiced) LSF sequences for both the training and test corpora are given in Figure 2. Note that the average size of an unvoiced sequence (about 5 vectors ≈ 100 ms) is significantly smaller than the average size of a voiced sequence (about 15 vectors ≈ 300 ms). Since there are almost as many voiced as unvoiced sections, the average number of voiced or unvoiced sections per second is about 2.5.
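The 25% voicing rule can be sketched as a simple overlap computation. The sketch below is hypothetical in its details: the function names, the representation of voiced regions as half-open sample intervals, and the convention that frame k starts at sample k times the hop are all assumptions, and label boundaries are assumed to be already expressed at the 8 kHz sampling rate.

```python
def voiced_fraction(frame_start, frame_len, voiced_regions):
    """Fraction of the frame [start, start + len) covered by voiced regions.

    `voiced_regions` is a list of non-overlapping half-open (start, end)
    sample intervals, a hypothetical format for the label information.
    """
    frame_end = frame_start + frame_len
    covered = 0
    for start, end in voiced_regions:
        covered += max(0, min(end, frame_end) - max(start, frame_start))
    return covered / frame_len

def classify_frames(n_frames, hop, frame_len, voiced_regions, threshold=0.25):
    """Label each analysis frame 'V' or 'UV' from its voiced coverage."""
    return ["V" if voiced_fraction(k * hop, frame_len, voiced_regions) >= threshold
            else "UV" for k in range(n_frames)]

# At 8 kHz, a 20 ms hop is 160 samples and a 30 ms window is 240 samples.
labels = classify_frames(5, 160, 240, [(400, 800)])
```

With a single voiced region covering samples 400 to 799, the first two frames have no voiced content and the last three are covered at 67% or more, so they are labeled voiced.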
4.3. MSVQ Codebooks Design
As mentioned in Section 3.3, for quantizing the reduced set of LSF vectors, we implemented a set of MSVQ for both voiced LSF vectors and unvoiced LSF vectors. In this study, we used two-stage and three-stage quantizers, with a resolution ranging from 20 to 36 bits/vector, in steps of 2 bits. Generally, a resolution of about 25 bits/vector is necessary to provide transparent or "close to transparent" quantization, depending on the structure of the quantizer [29, 30]. In parallel, it was reported in [31] that significantly fewer bits were necessary to encode unvoiced LSF vectors compared to voiced LSF vectors. Therefore, the large range of resolutions that we used allowed us to test a wide set of configurations, for both voiced and unvoiced speech.
The quantizers were designed by applying the LBG algorithm [33] to the (voiced or unvoiced) training corpus described in Section 4.2, using the perceptually weighted Euclidean distance between LSF vectors proposed in [28]. The two/three-stage quantizers are obtained as follows. The LBG algorithm is first used to design the first codebook block. Then, the difference between each LSF vector of the training corpus and its associated codeword is calculated. The overall resulting set of vectors is used as a new training corpus for the design of the next block, again with the LBG algorithm. The decoding of a quantized LSF vector is made by adding the outputs of the different blocks. For resolutions ranging from 20 to 24, two-stage quantizers were designed, with a balanced bit allocation between stages, that is, 10-10, 11-11, and 12-12. For resolutions within the range 26–36, a third stage was added with 2 to 12 bits. This is because computational considerations limit the resolution of each block to 12 bits. Note that the multistage structure does not guarantee that the quantized LSF vector is correctly conditioned (i.e., in some cases, LSF pairs can be too close to each other or even permuted). Therefore, a regularization procedure was added to ensure correct sorting and a minimal distance of 50 Hz between LSFs.
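The bit allocation between stages and the final conditioning step can be sketched as follows. Both functions are illustrations under stated assumptions: the three-stage split is read as 12-12-x from the description above, and the repair is one simple way of enforcing ordering and a 50 Hz minimum gap, since the actual regularization procedure is not detailed; in particular, this version does not guard against pushing the last LSF above the Nyquist frequency.

```python
def stage_bits(resolution):
    """Bits per MSVQ stage for an even resolution between 20 and 36."""
    if resolution % 2 or not 20 <= resolution <= 36:
        raise ValueError("resolution must be even and within 20..36")
    if resolution <= 24:
        return [resolution // 2, resolution // 2]     # balanced two-stage
    return [12, 12, resolution - 24]                  # third stage: 2..12 bits

def repair_lsf(lsf_hz, min_gap=50.0):
    """Sort the LSFs and push each one at least min_gap above its predecessor."""
    out = sorted(lsf_hz)
    for i in range(1, len(out)):
        out[i] = max(out[i], out[i - 1] + min_gap)
    return out

allocation = [stage_bits(r) for r in (20, 24, 26, 36)]
fixed = repair_lsf([300.0, 280.0, 900.0])
```

In the toy call, the permuted pair (300, 280) is first sorted to (280, 300), then the second value is moved to 330 Hz to restore the minimum gap, while 900 Hz is left untouched.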
4.4. Results
In this subsection, we present the results obtained by the proposed method for LT coding of LSF vectors. We first briefly present a typical example of a sentence. We then give a complete quantitative assessment of the method over the entire test database, in terms of distortion-rate. Comparative results obtained with classic frame-by-frame quantization and the 2D-transform coding techniques are provided. Finally, we give a perceptual evaluation of the proposed method.
4.4.1. A Typical Example of a TIMIT Sentence
We first illustrate the behavior of the algorithm of Section 3.2 on a given sentence of the corpus. The sentence is "Elderly people are often excluded" pronounced by a female speaker. It contains five voiced sections and four unvoiced sections (see Figure 3). In this experiment, the target was 2.1 dB for the voiced sections, and 1.9 dB for the unvoiced sections. For the voiced sections, setting , 22 and 24 bits/vector respectively, leads to a bit rate of 557.0, 515.2 and 531.6 bits/s respectively, for an actual ASD of 1.99, 2.01 and 1.98 dB respectively. The corresponding total number of model coefficients is 44, 37 and 35 respectively, to be compared with the total number of voiced LSF vectors which is 79. This illustrates the fact that, as mentioned in Section 3.4, for the LT coding method, the bit rate does not necessarily decrease as the resolution increases, since the number of model coefficients also varies. In this case, r = 22 bits/s seems to be the best choice. Note that in comparison, the framebyframe quantization provides 2.02 dB of ASD at 700 bits/s. For the unvoiced sections, the best results are obtained with r = 20 bits/vector: we obtain 1.82 dB of ASD at 620.7 bits/s (the framebyframe VQ provides 1.81 dB at 700 bits/s).
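The bit rates quoted for the voiced sections can be reproduced from the figures given in the text, assuming that the "total number of model coefficients" counts the reduced-set vectors summed over the five voiced sections and that the side information (K and P) is excluded. The function name below is hypothetical.

```python
def lt_bit_rate(n_reduced_vectors, n_frames, resolution, hop_s=0.02):
    """Bit rate in bits/s when n_reduced_vectors vectors of `resolution`
    bits each replace n_frames frames spaced hop_s seconds apart."""
    return n_reduced_vectors * resolution / (n_frames * hop_s)

# (number of reduced-set vectors, resolution in bits/vector) from the example,
# for 79 voiced LSF vectors in total.
rates = [round(lt_bit_rate(n, 79, r), 1) for n, r in [(44, 20), (37, 22), (35, 24)]]
```

The three values come out as 557.0, 515.2, and 531.6 bits/s, matching the reported figures: going from 20 to 22 bits/vector saves seven vectors, which more than offsets the two extra bits per vector, whereas going from 22 to 24 saves only two.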
We can see on Figure 3 the corresponding original and LTcoded LSF trajectories. This figure illustrates the ability of the LT model of LSF trajectories to globally fit the original trajectories, even if the model coefficients are calculated from the quantized reduced set of LSF vectors.
4.4.2. Average Distortion-Rate Results
In this subsection, we generalize the results of the previous subsection by (i) varying the ASD target and the MSVQ resolution r within a large set of values, (ii) applying the LT coding algorithm on all sections of the test database, and averaging the bit rate (13) and the ASD (10) across either all 4,573 voiced sections or all 4,351 unvoiced sections of the test database, and (iii) comparing the results with the ones obtained with the 2D-transform coding methods and the frame-by-frame VQ.
As already mentioned in Section 4.2, the resolution range for the MSVQ quantizers used in LT coding is 20 to 36 bits/vector. The ASD target was varied from 2.6 dB down to a minimum value with a 0.2 dB step. The minimum value is 1.0 dB for r = 36, 34, 32 and 30 bits/vector, and it is then increased by 0.2 dB each time the resolution is decreased by 2 bits/vector (it is thus 1.2 dB for r = 28 bits/vector, 1.4 dB for r = 26 bits/vector, and so on). In parallel, the distortion-rate values were also calculated for the usual frame-by-frame quantization, using the same quantizers as in the LT coding process and the same test corpus. In this case, the resolution range was extended to lower values for a better comparison. For the 2D-transform coding methods, the temporal size was varied from 1 to 20 for voiced LSFs, and from 1 to 10 for unvoiced LSFs. This choice was based on the histograms of Figure 2 and on considerations of computational limitations.^{11} It is consistent with the values considered in [19]. We calculated the corresponding ASD for the complete test corpus, and for seven values of the resolution of the optimal scalar quantizers: 0.75, 1, 1.25, 1.5, 1.75, 2.0 and 2.25 bits/parameter. This corresponds to 375, 500, 625, 750, 875, 1,000 and 1,125 bits/s, respectively (since the hop size is 20 ms). We also calculated for each of these resolutions a weighted average value of the spectral distortion (ASD), the weights being the bins of the histogram of Figure 2 (for the test corpus) normalized by the total size of the corpus. This makes it possible to take into account the distribution of the temporal size of the LSF sequences in the rate-distortion relationship, for a fair comparison with the proposed LT coding technique. In this way, we assume that both the proposed method and the 2D-transform coding methods work with the same "adaptive" temporal-block configuration.
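The conversion from scalar-quantizer resolution to bit rate, and the histogram-weighted averaging of the ASD, can be sketched as follows. This is only an illustration: the LSF order of 10 is an assumption (it is consistent with the quoted rates, since 0.75 bits/parameter × 10 parameters / 20 ms = 375 bits/s), the function names are hypothetical, and the small histogram in the usage lines is made up for the example rather than taken from Figure 2.

```python
LSF_ORDER = 10       # assumed number of LSF parameters per vector
HOP_SIZE_S = 0.020   # 20 ms hop size, as stated in the text


def bits_per_second(bits_per_parameter, order=LSF_ORDER, hop=HOP_SIZE_S):
    """Bit rate obtained when every LSF parameter of every frame is coded."""
    return bits_per_parameter * order / hop


def weighted_asd(asd_by_size, count_by_size):
    """Average ASD over temporal sizes, weighted by the histogram counts."""
    total = sum(count_by_size.values())
    return sum(asd_by_size[k] * count_by_size[k] for k in asd_by_size) / total


resolutions = [0.75, 1, 1.25, 1.5, 1.75, 2.0, 2.25]
rates = [round(bits_per_second(b)) for b in resolutions]
print(rates)

# Toy numbers only (not from the paper): two temporal sizes K = 1 and K = 2.
toy_asd = {1: 3.0, 2: 2.0}     # ASD in dB for each temporal size
toy_counts = {1: 2, 2: 6}      # number of sections of each size
print(weighted_asd(toy_asd, toy_counts))
```

Dividing by the sum of the counts is equivalent to normalizing the histogram bins by the total size of the corpus before weighting.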
The results are presented in Figures 4 and 5 for the voiced sections, and in Figures 6 and 7 for the unvoiced sections. Let us begin the analysis with the voiced sections. Figure 4 displays the results of the LT coding technique in terms of ASD as a function of the bit rate. Each of the curves on the left of the figure corresponds to a fixed MSVQ resolution (whose value is plotted), the ASD target being varied. It can be seen that the different resolutions provide an array of intertwined curves, each one following the classic general rate-distortion relationship: an increase of the ASD goes with a decrease of the bit rate. These curves are generally situated on the left of the curve corresponding to the frame-by-frame quantization, which is also plotted. They thus generally correspond to smaller bit rates. Moreover, the gain in bit rate for approximately the same ASD can be very large, depending on the considered region and the resolution (see more details below). In general, the way the curves are intertwined implies that increasing the resolution of the MSVQ quantizer increases the bit rate in the upper-left region of the curves, but this is no longer the case in the lower-right region, after the "crossing" of the curves. This illustrates the specific trade-off that must be tuned between quantization accuracy and modeling accuracy, as mentioned in Section 3.4. The ASD target value has a strong influence on this trade-off. For a given ASD level, the lowest bit rate is obtained with the leftmost point, which depends on the resolution. The set of optimal points for the different ASD values, that is, the lower-left envelope of the curves, can be extracted, and it forms what will be referred to as the optimal LT coding curve.
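One simple way to extract such a lower-left envelope from the pooled (bit rate, ASD) points of all resolutions is to keep only the points that are not dominated by a point with both a lower-or-equal bit rate and a lower-or-equal ASD. The sketch below is an assumption about how this could be done, not the procedure actually used for Figures 4 and 5; the function name is hypothetical and the points in the usage lines are toy values.

```python
def lower_left_envelope(points):
    """Return the non-dominated (bit_rate, asd) points, sorted by bit rate.

    A point is kept only if its ASD is strictly lower than the ASD of every
    point with a lower (or equal, already visited) bit rate.
    """
    envelope = []
    best_asd = float("inf")
    for rate, asd in sorted(points):   # ascending bit rate, then ascending ASD
        if asd < best_asd:
            envelope.append((rate, asd))
            best_asd = asd
    return envelope


# Toy operating points (bits/s, dB), pooled over several MSVQ resolutions.
toy_points = [(520, 2.1), (400, 2.4), (600, 1.9), (450, 2.5), (500, 2.0), (600, 1.8)]
print(lower_left_envelope(toy_points))
```

Reading the result from left to right gives, for each attainable ASD level, the lowest bit rate among all tested resolutions.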
For easier comparison, we report this optimal curve in Figure 5, where we also plot the results obtained with the 2D-DCT and KLT transform coding methods (and, again, the frame-by-frame quantization curve). The curves of the 2D-DCT transform coding are given for the temporal sizes 2, 5, 10 and 20, and also for the "adaptive" curve (i.e., the values averaged according to the distribution of the temporal size), which is the main reference in this variable-rate study. We can see that for the 2D-DCT transform coding, the larger the temporal size, the lower the ASD. The average curve lies between the curves corresponding to K = 5 and K = 10. For clarity, the KLT transform coding curve is only given for the adaptive configuration. This curve is about 0.05 to 0.1 dB below the adaptive 2D-DCT curve, which corresponds to about 23 bits/vector savings, depending on the bit rate (this is consistent with the optimal character of the KLT and with the results reported in [19]).
We can see in Figure 5 that the curves of the 2D-transform coding techniques cross the optimal LT coding curve from top-left to bottom-right. This implies that for the higher part of the considered bit-rate range (say, above about 900 bits/s) the 2D-transform coding techniques perform better than the proposed method. Their performance tends toward the 1 dB transparency bound for bit rates above 1 kbit/s, which is consistent with the results of [18]. With the considered configuration, the LT coding technique is limited to about 1.1 dB of ASD, and the corresponding bit rate is not competitive with the bit rate of the 2D-transform techniques (above 1.2 kbits/s it is even comparable to the simple frame-by-frame quantization). In contrast, for lower bit rates, the optimal LT coding technique clearly outperforms both 2D-transform methods. For example, at 2.0 dB of ASD, the bit rates of the LT, KLT, and 2D-DCT coding methods are about 489, 587, and 611 bits/s, respectively. Therefore, the bit rate gain provided by the LT coding technique over the KLT and 2D-DCT techniques is about 98 bits/s (i.e., 16.7%) and 122 bits/s (i.e., 20%), respectively. Note that for this ASD value, the frame-by-frame VQ requires about 770 bits/s. Therefore, compared to this method, the relative gain in bit rate of the LT coding is about 36.5%. Moreover, since the slope of the LT coding curve is smaller than the slope of the other curves, the relative gain in bit rate (or in ASD) provided by the LT coding significantly increases as we go towards lower bit rates. For instance, at 2.4 dB, we have about 346 bits/s for the LT coding, 456 bits/s for the KLT, 476 bits/s for the 2D-DCT, and 630 bits/s for the frame-by-frame quantization. The relative bit rate gains are respectively 24.1% (110 out of 456), 27.3% (130 out of 476), and 45.1% (284 out of 630).
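The relative gains quoted above all follow the same arithmetic, namely the bit rate saved by LT coding divided by the bit rate of the reference method at the same ASD. The following short sketch simply recomputes them from the approximate voiced-section bit rates given in the text; the dictionary layout and function name are illustrative choices.

```python
def relative_gain(reference_rate, lt_rate):
    """Fraction of the reference bit rate saved by LT coding."""
    return (reference_rate - lt_rate) / reference_rate


# Approximate bit rates (bits/s) for voiced sections, as quoted in the text.
voiced_rates = {
    2.0: {"LT": 489, "KLT": 587, "2D-DCT": 611, "frame-by-frame": 770},
    2.4: {"LT": 346, "KLT": 456, "2D-DCT": 476, "frame-by-frame": 630},
}

gains_percent = {
    asd: {
        method: round(100 * relative_gain(rate, rates["LT"]), 1)
        for method, rate in rates.items()
        if method != "LT"
    }
    for asd, rates in voiced_rates.items()
}

for asd, gains in gains_percent.items():
    print(f"ASD = {asd} dB: {gains}")
```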
In terms of ASD, we have for example 1.76 dB, 1.90 dB, and 1.96 dB for the LT coding, the KLT, and the 2D-DCT, respectively, at 625 bits/s. This represents a relative gain of 7.4% and 10.2% for the LT coding over the two 2D-transform coding techniques. At 375 bits/s this gain reaches 15.8% and 18.1%, respectively (2.30 dB for the LT coding, 2.73 dB for the KLT, and 2.81 dB for the 2D-DCT).
For unvoiced sections, the general trends of the LT quantization technique discussed in the voiced case are found again in Figure 6. However, at a given bit rate, the ASD obtained in this case is generally slightly lower than in the voiced case, especially for the frame-by-frame quantization. This is because unvoiced LSF vectors are easier to quantize than voiced LSF vectors, as pointed out in [31]. Also, the LT coding curves are more "spread" than for the voiced sections of speech. As a result, the bit rate gains compared to the frame-by-frame quantization are positive only below, say, 900 bits/s, and they are generally lower than in the voiced case, although they remain significant for the lower bit rates. This can be seen more easily in Figure 7, where the optimal LT curve is reported for unvoiced sections. For example, at 2.0 dB the LT quantization bit rate is about 464 bits/s, while the frame-by-frame quantizer bit rate is about 618 bits/s (thus the relative gain is 24.9%). Compared to the 2D-transform techniques, the LT coding technique is also less efficient than in the voiced case. The "crossing point" between LT coding and 2D-transform coding is here at about 700–720 bits/s, 1.6 dB. To the right of this point, the 2D-transform techniques clearly provide better results than the proposed LT coding technique. In contrast, below 700 bits/s, the LT coding performs better, even if the gains are lower than in the voiced case. An idea of the maximum gain of LT coding over 2D-transform coding is given at 1.8 dB: the LT coding bit rate is 561 bits/s, whereas it is 592 bits/s for the KLT and 613 bits/s for the 2D-DCT (the corresponding relative gains are 5.2% and 8.5%, resp.).
Let us close this subsection with a calculation of the approximate bit rate which is necessary to encode the pair (K, P) (see Section 3.1). It is a classical result that any finite alphabet can be encoded with a code of average length L, with H ≤ L < H + 1, where H is the entropy of the alphabet [1]. We estimated the entropy of the set of (K, P) pairs obtained on the test corpus after termination of the LT coding algorithm. This was done for the set of configurations corresponding to the optimal LT coding curve. Values within the interval and were obtained for the voiced sections and unvoiced sections respectively. Since the average number of voiced or unvoiced sections is about 2.5 per second (see Section 4.2), the additional bit rate is about bits/s for the voiced sections and about bits/s for the unvoiced sections. Therefore, it is quite small compared to the bit rate gain provided by the proposed LT coding method over the frame-by-frame quantization. Besides, the 2D-transform coding methods require the transmission of the size K of each section. Following the same idea, the entropy for the set of K values was found to be 5.1 bits for the voiced sections, and 3.4 bits for the unvoiced sections. Therefore, the corresponding coding rates are bits/s and bits/s respectively. The difference between encoding K and encoding the pair (K, P) is less than 5 bits/s in any case. This shows that (i) the values of K and P are significantly correlated, and (ii) because of this correlation, the additional cost of encoding P in addition to K is very small compared to the bit rate difference between the proposed method and the 2D-transform methods within the bit rate range of interest.
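The order of magnitude of this side-information rate can be checked with a few lines of code. The sketch below assumes one symbol (K, or the pair of K and P) transmitted per section and about 2.5 sections per second; it estimates an empirical entropy from symbol counts and converts an entropy into the range of bit rates implied by the bound H ≤ L < H + 1. The function names are hypothetical, and whether a practical coder sits closer to the lower or the upper end of the range is not specified here.

```python
import math


def entropy_bits(counts):
    """Empirical entropy, in bits per symbol, from a list of symbol counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def side_info_rate_bounds(entropy, sections_per_second=2.5):
    """Bit-rate range (bits/s) implied by H <= L < H + 1, one symbol per section."""
    return entropy * sections_per_second, (entropy + 1) * sections_per_second


# Entropies of the K values quoted in the text (bits per section).
k_voiced = side_info_rate_bounds(5.1)
k_unvoiced = side_info_rate_bounds(3.4)
print("K, voiced sections:   %.2f to %.2f bits/s" % k_voiced)
print("K, unvoiced sections: %.2f to %.2f bits/s" % k_unvoiced)

# Sanity check of the entropy estimator: four equiprobable symbols give 2 bits.
print(entropy_bits([10, 10, 10, 10]))
```

In every case the result stays in the range of roughly 10 to 15 bits/s, which is small compared with the bit-rate gaps of about 100 bits/s or more discussed above.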
4.4.3. Listening Tests
To confirm the efficiency of the long-term coding of LSF parameters from a subjective point of view, signals with quantized LSFs were generated by filtering the original signals with the filter A(z)/Â(z), where Â(z) is the LPC analysis filter derived from the quantized LSF vector, and A(z) is the original (unquantized) LPC filter (this implies that the residual signal is not modified). The sequence of filters Â(z) was generated with both the LT method and 2D-DCT transform coding. Ten sentences of TIMIT were selected for a formal listening test (5 by a male speaker and 5 by a female speaker, from different dialect regions). For each of them, the following conditions were verified for both voiced and unvoiced sections: (i) the bit rate was lower than 600 bits/s; (ii) the ASD was between 1.8 dB and 2.2 dB; (iii) the absolute ASD difference between LT coding and 2D-DCT coding was less than 0.02 dB; and (iv) the LT coding bit rate was at least 20% (resp., 7.5%) lower than the 2D-DCT coding bit rate for the voiced (resp., unvoiced) sections. Twelve subjects with normal hearing listened to the 10 pairs of sentences coded with the two methods and presented in random order, using a high-quality PC soundcard and Sennheiser HD280 headphones, in a quiet environment. They were asked to make a forced choice (i.e., perform an AB test), based on the perceived best quality.
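The filtering step can be illustrated for a single frame as follows. This is a minimal sketch assuming fixed filter coefficients and zero initial state: the numerator is the original analysis filter (which produces the residual) and the denominator is the analysis filter derived from the quantized LSFs (which resynthesizes speech from that residual). In an actual experiment the coefficients change from frame to frame and the filter state must be carried across frames, which is not shown; the function name is hypothetical.

```python
def filter_a_over_a_hat(x, a, a_hat):
    """Filter x with H(z) = A(z) / A_hat(z), direct form I, zero initial state.

    a and a_hat are coefficient lists [1, c1, ..., cp] of the polynomials
    A(z) = 1 + c1 z^-1 + ... + cp z^-p (original and quantized filters).
    """
    y = []
    for n in range(len(x)):
        # Numerator A(z): compute the residual sample from the input.
        acc = sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
        # Denominator A_hat(z): feed back previous output samples.
        acc -= sum(a_hat[k] * y[n - k] for k in range(1, len(a_hat)) if n - k >= 0)
        y.append(acc / a_hat[0])
    return y


# With identical filters the input is returned unchanged (no quantization).
x = [1.0, -0.5, 0.25, 0.0, 0.3]
a = [1.0, -0.9, 0.2]
print(filter_a_over_a_hat(x, a, a))
```

When the two coefficient sets are identical the cascade reduces to the identity, which is a convenient sanity check before plugging in quantized filters.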
The overall preference score across sentences and subjects is 52.5% for the long-term coding versus 47.5% for the 2D-DCT transform coding. Therefore, the difference between the two overall scores does not seem to be significant. Considering the scores sentence by sentence reveals that, for two sentences, the LT coding is significantly preferred (83.3% versus 16.7%, and 66.7% versus 33.3%). For one other sentence, the 2D-DCT coding method is significantly preferred (75% versus 25%). In those cases, both the LT-coded and the 2D-DCT-coded signals exhibit audible (although rather small) artifacts. For the seven other sentences, the scores range from 41.7%–58.3% to the inverse 58.3%–41.7%, thus indicating that for these sentences the two methods provide very close signals. In this case, and for both methods, the quality of the signals, although not transparent, is fairly good for such low rates (below 600 bits/s): the overall sound quality is preserved, and there is no significant artifact.
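The statement about the overall score can be checked with an exact two-sided binomial test against the null hypothesis of no preference. The text does not say which statistical test, if any, was applied, so the sketch below is only one plausible check; the vote count is reconstructed from the 52.5% score over 10 sentences and 12 subjects, and the function name is hypothetical.

```python
from math import comb


def two_sided_binomial_p(successes, trials):
    """Exact two-sided p-value under the null hypothesis p = 0.5."""
    k = max(successes, trials - successes)
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * upper_tail)


votes = 10 * 12                    # 10 sentence pairs, 12 listeners
lt_votes = round(0.525 * votes)    # votes in favor of LT coding
p_overall = two_sided_binomial_p(lt_votes, votes)
print(lt_votes, "of", votes, "votes, p =", round(p_overall, 3))
```

A p-value far above the usual 0.05 threshold agrees with the reading that neither method is preferred overall.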
These observations are confirmed by extended informal listening tests on many other signals of the test database: it has been observed that the quality of the signals obtained by the LT coding technique (and also by the 2D-DCT transform coding) at rates as low as 300–500 bits/s varies a lot. Some coded sentences are characterized by quite annoying artifacts, whereas others exhibit surprisingly good quality. Moreover, in many cases, the strength of the artifacts does not seem to be directly correlated with the ASD value. This seems to indicate that the quality of very-to-ultra-low bit rate LSF quantization may largely depend on the signal itself (e.g., speaker and phonetic content). The influence of such factors is beyond the scope of this paper, but it should be considered more carefully in future work.
4.4.4. A Few Computational Considerations
The complete LT LSF coding and decoding process runs in approximately half of real time using MATLAB on a PC with a 2.3 GHz processor (i.e., 0.5 s is necessary to process 1 s of speech).^{12} Experiments were conducted with the "raw" exhaustive search of the optimal order P in the algorithm of Section 3.2. A refined (e.g., dichotomous) search procedure would decrease the computational cost and time by a factor of about 4 to 5. Therefore, an optimized C implementation would be expected to run several orders of magnitude below real time. Note that the decoding time is only a small fraction (typically 1/10 to 1/20) of the coding time, since decoding consists in applying (8) and (9) only once, using the reduced set of decoded LSF vectors and the decoded (K, P) pair.
5. Summary and Perspectives
In this paper, a variable-rate long-term approach to LSF quantization has been proposed for off-line or large-delay speech coding. It is based on the modeling of the time trajectories of LSF parameters with a Discrete Cosine model, combined with a "sparse" vector quantization of a reduced set of LSF vectors. An iterative algorithm has been shown to provide joint efficient shaping of the model and estimation of its optimal order. As a result, the method generally provides a very large gain in bit rate (up to 45%) compared to short-term (frame-by-frame) quantization, at an equivalent coding quality. Also, for the lower range of tested bit rates (i.e., below 600–700 bits/s), the method compares favorably with transform coding techniques that also exploit the interframe correlation of LSFs across many frames. This has been demonstrated by extensive distortion-rate benchmarks and listening tests. The bit rate gain is up to about 7.5% for unvoiced speech, and up to about 25% for voiced speech, depending on coding accuracy. Of course, at the considered low bit rates, the ASD is significantly above the 1.0 dB bound which is associated with transparent quality. However, the proposed method provides a new bound of attainable performance for LSF quantization at very to ultra-low bit rates. It can also be used as a first stage in a refined LSF coding scheme at higher rates: the difference between the original and LT-coded LSFs can be coded by other techniques once the long-term interframe correlation has been removed.
It must be mentioned here that, although efficient, the MSVQs used in this study are not the best quantizers available. For instance, we have not used fully optimized MSVQ (i.e., using trellis search as in [30]), but basic MSVQ (i.e., with sequential search). Also, more sophisticated framewise methods have been proposed to obtain transparent LSF quantization at rates lower than the ones required for MSVQ, but at the cost of increased complexity [35, 36]. Refined versions of split-VQ are also good candidates for improved performance. We restricted ourselves to a relatively simple VQ technique because the goal of the present study was primarily to show the interest of the long-term approach. Therefore, it is very likely that the performance of the proposed LT coding algorithm can be significantly improved by using high-performance (but more complex) quantizers,^{13} since the reduced set of LSF vectors may be quantized with lower ASD/resolution compared to the MSVQ. In contrast, it seems very difficult to improve the performance of the reference 2D-transform methods, since we used optimal (nonuniform) quantizers to encode the corresponding 2D coefficients.
As mentioned before, the analysis settings have been shown to noticeably influence the performance of the proposed method. As pointed out in [13], "it is desirable for the formant filter parameters to evolve slowly, since their [short-term] fluctuations may be accentuated under quantization, creating audible distortions at update instants". Hence it may be desirable to carefully configure the analysis, or to preprocess the LSFs with a smoothing method (such as [13, 14] or a different one) before long-term quantization, to obtain trajectories freed from undesirable local fluctuations partly due to analysis (see Figure 3). This is likely to enable the proposed fitting algorithm to significantly lower the LT model order and hence lower the bit rate, without impairing signal quality. A deeper investigation of this point is needed.
Beyond those potential improvements, future work may focus on the elaboration of complete speech coders functioning at very to ultra-low bit rates and exploiting the long-term approach. This requires an appropriate adaptation of the proposed algorithm to the coding of the excitation (residual signal). For example, ultra-low bit rate coding with acceptable quality may be attainable with the long-term coding of basic excitation parameters such as fundamental frequency, voicing frequency (i.e., the frequency that "separates" the voiced region and the unvoiced region for mixed V/UV sounds), and corresponding gains. Also, we intend to test the proposed long-term approach within the framework of (unit-based concatenative) speech synthesis. As mentioned in Section 2, the long-term model that is used here to exploit the predictability of LSF trajectories can also be directly used for time interpolation of those trajectories (a property that is not provided by 2D-transform coding; see Endnote 5). In other words, the proposed method offers an efficient framework for direct combination of decoding and time interpolation, as required for speech transformation in (e.g., TTS) synthesis systems. It can be used to interpolate "natural" trajectories of LSFs (and also of source parameters), to be compared in future work with more or less complex existing interpolation schemes. Note that the proposed method is particularly suitable for unit-based synthesis, since it is naturally frame-length- and bit-rate-adaptive. Therefore, an appropriate mapping between speech units and long-term frames can be defined.^{14} As suggested by [13], the interaction between filter parameters and source parameters should be carefully examined within this long-term coding and interpolating framework.
Endnotes

1. The differential VQ and other schemes such as predictive VQ and finite-state VQ can be seen as special cases of recursive VQ [2, 10], depending on the configuration.
2. In the following, the term "long-term" refers to considering long sections of speech, including several to many short-term frames of about 20 ms. Hence, it has a different meaning than in the "long-term (pitch) predictor" of speech coders.
3. The V/UV segmentation is compliant with the expectation of somewhat "coherent" LSF trajectories on a given long-term section. Indeed, it is well known that these parameters have a different general behavior for voiced or unvoiced sounds (see, e.g., [31]).
4. In the following, all vectors of consecutive values in time are row vectors, while vectors of simultaneous values taken at a given time instant are column vectors. Matrices are organized accordingly.
5. This means that, despite the matrix formalism, each line of (3) is a modeled trajectory of one LSF coefficient that is modeled independently of the trajectories of the other coefficients (except for the common model order). Accordingly, the regression of (4) can be calculated separately for each line, that is, for each set of model coefficients of (1). Hence, the coefficients of C are time model coefficients. In contrast, 2D-transform coefficients jointly concentrate both time and frequency information from the data (and those 2D models cannot be directly interpolated in one dimension).
6. For the fixed-size 10-to-4 conversion of LSFs into polynomial coefficients. Recall that in the present study, the K-to-P conversion is of variable dimension.
7. "Coding transparency" means that speech signals synthesized with the quantized and unquantized LSFs are perceptually indistinguishable.
8. The methods [6–14] exploiting interframe LSF correlation are not pertinent in the present study. Indeed, the LSF vectors of the reduced set are sparsely distributed in the considered section of speech, and their correlation is likely to be poor.
9. The analysis settings have been shown to slightly influence the performance of the proposed method, since they can provide successive LSF vectors with slightly different degrees of correlation. The present settings are different from the ones used in [24], and they provided slightly better results. They were partly suggested by [37]. Also, this suggests that the proposed method is likely to significantly benefit from a preprocessing of the LSFs with "short-term" smoothing methods, such as [13, 14] (see Section 5).
10. Note that for the 2D-DCT the coefficients are fixed whereas they depend on the data for the KLT; thus, for each tested temporal size, the KLT coefficients are also determined from the training data.
11. We must ensure (i) a sufficient number of (voiced or unvoiced) sections of a given size to compute the corresponding bit allocation tables and optimal scalar quantizers (and transform coefficients for the KLT), and (ii) a reasonable calculation time for experiments on such an extended corpus. Note that for the 2D-transform coding methods, voiced (resp., unvoiced) sequences larger than 20 (resp., 10) vectors are split into subsequences.
12. In comparison, the adaptive (variable-size) 2D-transform coding methods require only approximately 1/10th of real time, hence 1/5th of the resources of the proposed method. This is mainly because they do not require inverse matrix calculation but only direct matrix products.
13. The proposed method is very flexible in the sense that it can be directly applied with any type of framewise quantizer.
14. In the present study we used V/UV segmentation (and adapted coding), but other segmentations, better suited to concatenative synthesis, can be considered (e.g., "CV" or "VCV"). Alternatively, all voiced or all unvoiced (subsets of) units could be considered in a synthesis system using the proposed method.
References
1. Markel JD, Gray AH Jr.: Linear Prediction of Speech. Springer, New York, NY, USA; 1976.
2. Gray RM, Gersho A: Vector Quantization and Signal Compression. Kluwer Academic Publishers, Boston, Mass, USA; 1992.
3. Pan J, Fischer TR: Vector quantization of speech line spectrum pair parameters and reflection coefficients. IEEE Transactions on Speech and Audio Processing 1998, 6(2):106–115. 10.1109/89.661470
4. Hedelin P: Single stage spectral quantization at 20 bits. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '94), 1994, Adelaide, Australia 525–528.
5. Sugamura N, Itakura F: Speech analysis and synthesis method developed at ACL in NTT—from LPC to LSP. Speech Communication 1986, 5(2):199–215. 10.1016/0167-6393(86)90008-7
6. Yong M, Davidson G, Gersho A: Encoding of LPC spectral parameters using switched-adaptive interframe vector prediction. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '88), 1988, New York, NY, USA 402–405.
7. Jean FR, Wang HC: Transparent quantization of speech LSP parameters based on KLT and 2D-prediction. IEEE Transactions on Speech and Audio Processing 1996, 4(1):60–66. 10.1109/TSA.1996.481453
8. Tsao C, Gray RM: Matrix quantizer design for LPC speech using the generalized Lloyd algorithm. IEEE Transactions on Acoustics, Speech, and Signal Processing 1985, 33(3):537–545. 10.1109/TASSP.1985.1164584
9. Xydeas CS, Papanastasiou C: Split matrix quantization of LPC parameters. IEEE Transactions on Speech and Audio Processing 1999, 7(2):113–125. 10.1109/89.748117
10. Samuelsson J, Hedelin P: Recursive coding of spectrum parameters. IEEE Transactions on Speech and Audio Processing 2001, 9(5):492–502. 10.1109/89.928914
11. Subramaniam AD, Gardner WR, Rao BD: Low-complexity source coding using Gaussian mixture models, lattice vector quantization, and recursive coding with application to speech spectrum quantization. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(2):524–532.
12. Subasingha S, Murthi MN, Andersen SV: Gaussian mixture Kalman predictive coding of line spectral frequencies. IEEE Transactions on Audio, Speech and Language Processing 2009, 17(2):379–391.
13. Zad-Issa MR, Kabal P: Smoothing the evolution of the spectral parameters in linear prediction of speech using target matching. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '97), 1997, Munich, Germany 3: 1699–1702.
14. Nordén F, Eriksson T: Time evolution in LPC spectrum coding. IEEE Transactions on Speech and Audio Processing 2004, 12(3):290–301. 10.1109/TSA.2004.825664
15. Atal BS: Efficient coding of LPC parameters by temporal decomposition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '83), 1983, Boston, Mass, USA 1: 81–84.
16. Van Dijk-Kappers AML, Marcus SM: Temporal decomposition of speech. Speech Communication 1989, 8(2):125–135. 10.1016/0167-6393(89)90039-3
17. Cheng YM, O'Shaughnessy D: On 450–600 b/s natural sounding speech coding. IEEE Transactions on Speech and Audio Processing 1993, 1(2):207–220. 10.1109/89.222879
18. Farvardin N, Laroia R: Efficient encoding of speech LSP parameters using the discrete cosine transformation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '89), 1989, Glasgow, UK 1: 168–171.
19. Mudugamuwa DJ, Bradley AB: Optimal transform for segmented parametric speech coding. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), 1998, Seattle, Wash, USA 1: 525–528.
20. Dusan S, Flanagan J, Karve A, Balaraman M: Speech coding using trajectory compression and multiple sensors. Proceedings of the International Conference on Speech & Language Processing, 2004, Jeju, South Korea
21. Dusan S, Flanagan J, Karve A, Balaraman M: Speech compression by polynomial approximation. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(2):387–395.
22. Girin L, Firouzmand M, Marchand S: Long-term modeling of phase trajectories within the speech sinusoidal model framework. Proceedings of the International Conference on Speech & Language Processing, 2004, Jeju, South Korea
23. Girin L, Firouzmand M, Marchand S: Perceptual long-term variable-rate sinusoidal modeling of speech. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(2):851–861.
24. Girin L: Long-term quantization of speech LSF parameters. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), 2007, Honolulu, Hawaii, USA 4: 845–848.
25. Girin L, Firouzmand M, Marchand S: Comparing several models for perceptual long-term modeling of amplitude and phase trajectories of sinusoidal speech. Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech '05), 2005, Lisboa, Portugal 357–360.
26. Galas T, Rodet X: An improved cepstral method for deconvolution of source-filter systems with discrete spectra: application to musical sound signals. Proceedings of the International Computer Music Conference (ICMC '90), 1990, Glasgow, UK 82–84.
27. Cappé O, Laroche J, Moulines E: Regularized estimation of cepstrum envelope from discrete frequency points. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '95), October 1995, New Paltz, NY, USA
28. Paliwal KK, Atal BS: Efficient vector quantization of LPC parameters at 24 bits/frame. IEEE Transactions on Speech and Audio Processing 1993, 1(1):3–14. 10.1109/89.221363
29. Phamdo N, Farvardin N, Moriya T: Combined source-channel coding of LSP parameters using multi-stage vector quantization. Proceedings of the IEEE Workshop on Speech Coding for Telecommunications, 1991 36–38.
30. LeBlanc WP, Bhattacharya B, Mahmoud SA, Cuperman V: Efficient search and design procedures for robust multi-stage VQ of LPC parameters for 4 kb/s speech coding. IEEE Transactions on Speech and Audio Processing 1993, 1(4):373–385. 10.1109/89.242483
31. Hagen R, Paksoy E, Gersho A: Voicing-specific LPC quantization for variable-rate speech coding. IEEE Transactions on Speech and Audio Processing 1999, 7(5):485–494. 10.1109/89.784101
32. Jayant NS, Noll P: Digital Coding of Waveforms: Principles and Applications to Speech and Video. Prentice-Hall, Englewood Cliffs, NJ, USA; 1984.
33. Linde Y, Buzo A, Gray RM: An algorithm for vector quantizer design. IEEE Transactions on Communications 1980, 28(1):84–95. 10.1109/TCOM.1980.1094577
34. Garofolo JS, Lamel LF, Fisher WM, et al.: TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium, Philadelphia, Pa, USA; 1993.
35. Ferrer-Ballester MA, Figueiras-Vidal AR: Efficient adaptive vector quantization of LPC parameters. IEEE Transactions on Speech and Audio Processing 1995, 3(4):314–317. 10.1109/89.397097
36. Subramaniam AD, Rao BD: PDF optimized parametric vector quantization of speech line spectral frequencies. IEEE Transactions on Speech and Audio Processing 2003, 11(2):130–142. 10.1109/TSA.2003.809192
37. Kabal P: Personal communication.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Girin, L. Adaptive LongTerm Coding of LSF Parameters Trajectories for LargeDelay/Very to UltraLow BitRate Speech Coding. J AUDIO SPEECH MUSIC PROC. 2010, 597039 (2010). https://doi.org/10.1155/2010/597039
DOI: https://doi.org/10.1155/2010/597039