 Research Article
 Open Access
Adaptive Long-Term Coding of LSF Parameters Trajectories for Large-Delay/Very- to Ultra-Low Bit-Rate Speech Coding
EURASIP Journal on Audio, Speech, and Music Processing volume 2010, Article number: 597039 (2010)
Abstract
This paper presents a model-based method for coding the LSF parameters of LPC speech coders on a "long-term" basis, that is, beyond the usual 20–30 ms frame duration. The objective is to provide efficient LSF quantization for a speech coder with large delay but very to ultra-low bit rate (i.e., below 1 kb/s). To do this, speech is first segmented into voiced/unvoiced segments. A Discrete Cosine model of the time trajectory of the LSF vectors is then applied to each segment to capture the LSF interframe correlation over the whole segment. A bidirectional transformation between the model coefficients and a reduced set of LSF vectors enables both efficient "sparse" coding (using here multistage vector quantizers) and the generation of interpolated LSF vectors at the decoder. The proposed method provides up to 50% gain in bit rate over frame-by-frame quantization while preserving signal quality, and competes favorably with 2D-transform coding for the lower range of tested bit rates. Moreover, the implicit time-interpolation nature of the long-term coding process gives this technique high potential for use in speech synthesis systems.
1. Introduction
The linear predictive coding (LPC) model has enjoyed considerable success in speech processing for forty years [1]. It is now widely used in many speech compression systems [2]. As a result of the underlying well-known "source-filter" representation of the signal, LPC-based coders generally separate the quantization of the LPC filter, assumed to represent the vocal tract evolution, from the quantization of the residual signal, assumed to represent the vocal source signal. In modern speech coders, low-rate quantization of the LPC filter coefficients is usually achieved by applying vector quantization (VQ) techniques to the Line Spectral Frequency (LSF) parameters [3, 4], which are a dual representation of the filter coefficients that is particularly robust to quantization and interpolation [5].
In speech coders, the LPC analysis and coding process is carried out on a short-term frame-by-frame basis: LSF parameters (and excitation parameters) are usually extracted, quantized, and transmitted every 20 ms or so, following the time dynamics of speech. Since the evolution of the vocal tract is quite smooth and regular for many speech sequences, high correlation between successive LPC parameters has been demonstrated and can be exploited in speech coders. For example, the difference between LSF vectors is coded in [6]. Both intraframe and interframe LSF correlations are exploited in the 2D coding scheme of [7]. Alternatively, matrix quantization was applied to jointly quantize up to three successive LSF vectors in [8, 9]. More generally, Recursive Coding, with application to LPC/LSF vector quantization, is described in [2] as a general source coding framework in which the quantization of one vector depends on the result of the quantization of the previous vector(s).^{1} Recent theoretical and experimental developments on recursive (vector) coding are provided in, for example, [10, 11], leading to LSF vector coding at less than 20 bits/frame. In the same vein, Kalman filtering has recently been used to combine one-step tracking of LSF trajectories with GMM-based vector quantization [12]. In parallel, some studies have attempted to explicitly take into account the smoothness of spectral parameter evolution in speech coding techniques. For example, a target matching method was proposed in [13]: the authors match the output of the LPC predictor to a target signal constructed from a smoothed version of the excitation signal, in order to jointly smooth both the residual signal and the frame-to-frame variation of the LSF coefficients. This idea has recently been revisited in a different form in [14], by introducing a memory term in the widely used Spectral Distortion measure that controls the LSF quantization. This memory term penalizes "noisy fluctuations" of the LSF trajectories and tends to smooth the quantization process across consecutive frames.
In all those studies, the interframe correlation has been considered "locally", that is, between only two (or three for matrix quantization) consecutive frames. This is mainly because the telephony target application requires limiting the coding delay. When the constraint on the delay can be relaxed, for example, in half-duplex communication, speech storage, or speech synthesis applications, the coding process can be considered on larger signal windows. In that vein, the Temporal Decomposition technique introduced by Atal [15] and studied by several researchers (e.g., [16]) consists of decomposing the trajectory of (LPC) spectral parameters into "target vectors" which are sparsely distributed in time and linked by interpolative functions. This method has not been applied much to speech coding (though see an interesting example in [17]), but it remains a powerful tool for modeling the temporal structure of speech. Following another idea, the authors of [18] proposed to compress time-frequency matrices of LSF parameters using a two-dimensional (2D) Discrete Cosine Transform (DCT). They provided interesting results for different temporal sizes, from 1 to 10 (10 ms-spaced) LSF vectors. A major point of this method is that it jointly exploits the time and frequency correlation of LSF values. An adaptive version of this scheme was implemented in [19], allowing a size varying from 1 to 20 vectors for voiced speech sections and 1 to 8 vectors for unvoiced speech. Also, the optimal Karhunen-Loève Transform (KLT) was tested in addition to the 2D-DCT.
More recently, Dusan et al. proposed in [20, 21] to model the trajectories of ten consecutive LSF parameters by a fourth-order polynomial model. In addition, they implemented a very low bit rate speech coder exploiting this idea. At the same time, we proposed in [22, 23] to model the long-term^{2} (LT) trajectory of sinusoidal speech parameters (i.e., phases and amplitudes) with a Discrete Cosine model. In contrast to [20, 21], where the length of the parameter trajectories and the order of the model were fixed, in [22, 23] the long-term frames are continuously voiced (V) or continuously unvoiced (UV) sections of speech. Those sections result from a preliminary V/UV segmentation, and they exhibit highly variable size and "shape". For example, such a segment can contain several phonemes or syllables (it can even be a quite long all-voiced sentence in some cases). Therefore, we proposed a fitting algorithm to automatically adjust the complexity (i.e., the order) of the LT model according to the characteristics of the modeled speech segment. As a result, the trajectory size/model order could exhibit quite different (and often larger) combinations than the ten-to-four conversion of [20, 21]. Finally, we carried out in [24] a variable-rate coding of the trajectory of LSF parameters by adapting our (sinusoidal) adaptive LT modeling approach of [22, 23] to the LPC quantization framework. The V/UV segmentation and the Discrete Cosine model are retained,^{3} but the fitting algorithm is significantly modified to include quantization issues. In particular, the same bidirectional procedure as in [20, 21] is used to switch from the LT model coefficients to a reduced set of LSF vectors at the coder, and vice versa at the decoder. The reduced set of LSF vectors is quantized by multistage vector quantizers, and the corresponding LT model is recalculated at the decoder from the quantized reduced set of LSFs.
An extended set of interpolated LSF vectors is finally derived from the "quantized" LT model. The model order is determined by an iterative adjustment of the Spectral Distortion (SD) measure, which is classic in LPC filter quantization, instead of the perceptual criteria adapted to the sinusoidal model used in [22, 23]. It can be noted that the implicit time-interpolation nature of the long-term decoding process makes this technique potentially very suitable for joint decoding-transformation in speech synthesis systems (in particular, in unit-based concatenative speech synthesis for mobile/autonomous systems). This point is not developed in this paper, which focuses on coding, but it is discussed as an important perspective (see Section 5).
The present paper clearly builds on [24]. Its first objective is to present the adaptive long-term LSF quantization method in more detail. Its second objective is to provide additional material that was not developed in [24]: some rate-distortion issues related to the adaptive variable-rate aspect of the method are discussed, and a new series of rate-distortion curves obtained with a refined LSF analysis step is presented. Furthermore, in addition to the comparison with usual frame-by-frame quantization, those results are compared with the ones obtained with an adaptive version (for fair comparison) of the 2D-transform methods of [18, 19]. The results show that the trajectories of the LSFs can be coded by the proposed method with much fewer bits than with usual frame-by-frame coding techniques using the same type of quantizers. They also show that the proposed method significantly outperforms the 2D-transform methods for the lower tested bit rates. Finally, the results of a formal listening test are presented, showing that the proposed method can preserve fair speech quality with LSFs coded at very to ultra-low bit rates.
This paper is organized as follows. The proposed longterm model is described in Section 2. The complete longterm coding of LSF vectors is presented in Section 3, including the description of the fitting algorithm and the quantization steps. Experiments and results are given in Section 4. Section 5 is a discussion/conclusion section.
2. The Long-Term Model for LSF Trajectories
In this section, we first consider the problem of modeling the time trajectory of a sequence of K consecutive LSF parameter vectors. These LSF parameters correspond to a given (all-voiced or all-unvoiced) section of speech signal s(n), running arbitrarily from n = 0 to n = N − 1. They are obtained from s(n) using a standard LPC analysis procedure applied on successive short-term analysis windows, with a window size and a hop size within the range 10–30 ms (see Section 4.2). In the following, let us denote by t = [t_1, …, t_K]^T the vector containing the sample indexes of the analysis frame centers. Each LSF vector extracted at time instant t_k is denoted x_k = [x_{1,k}, …, x_{L,k}]^T, for k = 1 to K (^T denotes the transpose operator^{4}). L is the order of the LPC model [1, 5], and we take here the standard value L = 10 for 8-kHz telephone speech. Thus, we actually have L LSF trajectories of K values to model. For this aim, let us denote by X the L × K matrix of general entry x_{i,k}: the LSF trajectories are its row vectors, denoted x_i, for i = 1 to L.
Different kinds of models can be used to represent these trajectories. As mentioned in the introduction, a fourth-order polynomial model was used in [20] to represent ten consecutive LSF values. In [23], we used a sum of discrete cosine functions, close to the well-known Discrete Cosine Transform (DCT), to model the trajectories of sinusoidal (amplitude and phase) parameters. We called this model a Discrete Cosine Model (DCM). In [25], we compared the DCM with a mixed cosine-sine model and with the polynomial model, still in the sinusoidal framework. Overall, the results were quite close, but the use of the polynomial model could lead to numerical problems when the size of the modeled trajectory was large. Therefore, and to limit the number of experimental configurations in Section 4, we consider only the DCM in the present paper. Note that, more generally, this model is known to be efficient at capturing the variations of a signal (e.g., when directly applied to signal samples as in the DCT, or when applied to log-scaled spectral envelopes, as in [26, 27]). Thus, it should be well suited to capture the global shape of LSF trajectories.
Formally, the DCM is defined for each of the L LSF trajectories by

x̂_i(n) = Σ_{p=0}^{P−1} c_{i,p} cos(pπn/N), i = 1, …, L. (1)

The model coefficients c_{i,p} are all real. P is a positive integer defining the order of the model. Here, it is the same for all LSFs (i.e., P_i = P for every i), since this significantly simplifies the overall coding scheme presented next. Note that, although the LSFs are initially defined frame-wise, the model provides an LSF value for each time index n. This property is exploited in the proposed quantization process of Section 3.1. It is also expected to be very useful for speech synthesis systems, as it provides a direct and simple way to perform time interpolation of LSF vectors for time-stretching/compression of speech: interpolated LSF vectors can be calculated using (1) at any arbitrary instant, while the general shape of the trajectory is preserved.
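As a concrete illustration, the DCM of (1) and the resulting modeled trajectories can be sketched in a few lines of NumPy. This is only an illustrative sketch with arbitrary values (section length, number of frames, coefficients), not the implementation used in this work:

```python
import numpy as np

def dcm_basis(instants, order, section_length):
    """Discrete Cosine Model basis: entry (p, k) is cos(p * pi * t_k / N),
    for model orders p = 0..P-1 evaluated at the given time instants."""
    t = np.asarray(instants, dtype=float)[None, :]
    p = np.arange(order)[:, None]
    return np.cos(p * np.pi * t / section_length)

# Example: L = 10 LSF trajectories over a K = 15-frame section of N samples
N, K, P, L = 2400, 15, 5, 10
centers = np.linspace(0, N - 1, K)   # analysis frame centers t_k
M = dcm_basis(centers, P, N)         # (P x K) model matrix
C = np.random.randn(L, P)            # DCM coefficients c_{i,p}, one row per LSF
X_hat = C @ M                        # modeled LSF trajectories (L x K)
```

Because the basis is defined for every sample index n, the same `dcm_basis` call evaluated at arbitrary instants yields the interpolated LSF vectors mentioned above.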
Let us now consider the calculation of the matrix of model coefficients, that is, the L × P matrix C of general term c_{i,p}, given that P is known. We will see in Section 3.2 how an optimal value of P is estimated for each LSF vector sequence to be quantized. Let us denote by M the P × K model matrix that gathers the DCM terms evaluated at the entries of t:

M_{p,k} = cos(pπt_k/N), p = 0, …, P − 1, k = 1, …, K. (2)
The modeled LSF trajectories are thus given by the rows of

X̂ = CM. (3)

C is estimated by minimizing the mean square error (MSE) between the modeled and original LSF data. Since the modeling process aims at providing data dimension reduction for efficient coding, we assume that P ≤ K, and the optimal coefficient matrix is classically given by

C = XM^T(MM^T)^{−1}. (4)
Finally, note that in practice we used the "regularized" version of (4) proposed in [27]: a diagonal "penalizing" term is added to the matrix to be inverted in (4) to fix possible ill-conditioning problems. In our study, setting the regularization factor of [27] to 0.01 gave very good results (no ill-conditioned matrix over the entire database of Section 4.2).
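The regularized least-squares fit can be sketched as follows. The exact form of the penalty in [27] may differ; here a relative diagonal term is assumed, and the demo values are illustrative:

```python
import numpy as np

def fit_dcm(X, M, reg=0.01):
    """Least-squares DCM coefficient estimate of Eq. (4), with an assumed
    diagonal regularization term (weight 0.01 in this study).

    X : (L x K) matrix of LSF trajectories (one row per LSF).
    M : (P x K) model matrix evaluated at the frame centers.
    With reg = 0 this is exactly C = X M^T (M M^T)^{-1}.
    """
    G = M @ M.T                          # (P x P) Gram matrix
    G = G + reg * np.diag(np.diag(G))    # penalize possible ill-conditioning
    return X @ M.T @ np.linalg.inv(G)

# Sanity demo: recover known coefficients from noiseless trajectories
p = np.arange(3)[:, None]
t = np.linspace(0, 999, 12)[None, :]
M = np.cos(p * np.pi * t / 1000)         # (3 x 12) model matrix
C_true = np.array([[1.0, -0.5, 0.25]])
X = C_true @ M
C_hat = fit_dcm(X, M, reg=0.0)           # exact recovery without regularization
```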
3. Coding of LSF Based on the LT Model
In this section, we present the overall algorithm for quantizing every sequence of K LSF vectors, based on the LT model presented in Section 2. As mentioned in the introduction, the shape of spectral parameter trajectories can vary widely, depending on, for example, the length of the considered section, the phoneme sequence, the speaker, the prosody, or the rank of the LSF. Therefore, the appropriate order P of the LT model can also vary widely, and it must be estimated: Within the coding context, a tradeoff between LT model accuracy (for an efficient representation of data) and sparseness (for bit rate limitation) is required. The proposed LT model will be efficiently exploited in low bit rate LSF coding if in practice P is significantly lower than K while the modeled and original LSF trajectories remain close enough.
For simplicity, the overall LSF coding process is presented in several steps. In Section 3.1, the quantization process is described given that the order P is known. Then, in Section 3.2, we present an iterative global algorithm that uses the process of Section 3.1 as an analysis-by-synthesis procedure to search for the optimal order P. The quantizer block used in this algorithm is presented in Section 3.3. Finally, we discuss in Section 3.4 some points regarding the rate-distortion relationship in this specific context of long-term coding.
3.1. Long-Term Model and Quantization
Let us first address the problem of quantizing the LSF information, that is, representing it with a limited binary resource, given that P is known. Direct quantization of the DCM coefficients of (3) can be considered, as in [18, 19]. However, in the present study the DCM is in one dimension,^{5} as opposed to the 2D-DCT of [18, 19]. We thus prefer to avoid the quantization of the DCM coefficients by applying a one-to-one transformation between the DCM coefficients and a reduced set of LSF vectors, as was done in [20, 21].^{6} This reduced set of LSF vectors is quantized using vector quantization, which is efficient at exploiting the intraframe LSF redundancy. At the decoder, the complete "quantized" set of LSF vectors is retrieved from the reduced set, as detailed below. This approach has several advantages. First, it enables control of the global shape of the quantized LSF trajectories by using the reduced set as "breakpoints" for these trajectories. Second, it allows the use of usual techniques for LSF vector quantization. Third, it enables a fair comparison of the proposed method, which combines LT modeling with VQ, with usual frame-by-frame LSF quantization using the same type of quantizers. Therefore, a quantitative assessment of the gain due to the LT modeling can be derived (see Section 4.4).
Let us now present the one-to-one transformation between the matrix C and the reduced set of LSF vectors. For this, let us first define an arbitrary function that uniquely allocates P time positions, denoted J = [j_1, …, j_P], among the N samples of the considered speech section. Let us also define Q, a new model matrix evaluated at the instants of J (hence Q is a "reduced" P × P version of M, since P ≤ K):

Q_{p,q} = cos(pπj_q/N), p = 0, …, P − 1, q = 1, …, P. (5)
The reduced set of LSF vectors is the set of modeled LSF vectors calculated at the instants of J, that is, the columns y_q, for q = 1 to P, of the matrix

Y = CQ. (6)
The one-to-one transformation of interest is based on the following general property of MMSE estimation techniques: the matrix C of (4) can be exactly recovered from the reduced set of LSF vectors by

C = YQ^{−1}. (7)
Therefore, the quantization strategy is the following. Only the P vectors of the reduced set are quantized using VQ (instead of the overall set of K original vectors, as would be the case in usual coding techniques). The indexes of the codewords are transmitted. At the decoder, the corresponding quantized vectors are gathered in a matrix denoted Ỹ, and the DCM coefficient matrix is estimated by applying (7) with this quantized reduced set of LSF vectors instead of the unquantized reduced set:

C̃ = ỸQ^{−1}. (8)
Eventually, the "quantized" LSF vectors at the original K time indexes are given by applying a variant of (3) using (8):

X̃ = C̃M = ỸQ^{−1}M. (9)
Note that the resulting LSF vectors, which are the columns of the above matrix, are abusively called the "quantized" LSF vectors, although they are not directly generated by VQ. This is because they are the LSF vectors used at the decoder for signal reconstruction. Note also that (8) implies that the matrix Q, or alternatively the vector J, is available at the decoder. In this study, the positions j_q are regularly spaced in the considered speech section (with rounding to the nearest integer if necessary). Thus, J can be generated at the decoder and need not be transmitted. Only the size K of the sequence and the order P must be transmitted in addition to the LSF vector codewords. A quantitative assessment of the corresponding additional bit rate is given in Section 4.4. We will see that it is very small compared to the bit rate gain provided by the LT coding method. The whole process is summarized in Figure 1.
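The coder/decoder chain above (quantization left aside) can be sketched as follows; without quantization error, the decoder recovers the model coefficients exactly. All sizes and values are illustrative:

```python
import numpy as np

def dcm_basis(instants, order, section_length):
    # cos(p*pi*t/N) DCM basis, one row per model order p = 0..P-1
    p = np.arange(order)[:, None]
    t = np.asarray(instants, dtype=float)[None, :]
    return np.cos(p * np.pi * t / section_length)

N, K, P, L = 2000, 12, 4, 10
centers = np.linspace(0, N - 1, K)
M = dcm_basis(centers, P, N)                      # (P x K) full model matrix
C = np.random.default_rng(0).normal(size=(L, P))  # DCM coefficients

# Coder: P regularly spaced "breakpoint" positions and the reduced set Y
J = np.round(np.linspace(0, N - 1, P))            # regenerable at the decoder
Q = dcm_basis(J, P, N)                            # (P x P) reduced model matrix
Y = C @ Q                                         # reduced set of P vectors, Eq. (6)

# Decoder (quantization omitted here): recover C, then the K decoded vectors
C_rec = Y @ np.linalg.inv(Q)                      # Eq. (7)
X_dec = C_rec @ M                                 # Eq. (9) without quantization error
```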
3.2. Iterative Estimation of Model Order
In this subsection, we present the iterative algorithm that is used to estimate the optimal DCM order P for each sequence of K LSF vectors. For this, a performance criterion for the overall process is first defined. This criterion is the usual Average Spectral Distortion (ASD) measure, which is a standard in LPC-based speech coding [28]:

ASD = (1/K) Σ_{k=1}^{K} √[ (1/2π) ∫_{−π}^{π} (10 log_{10} S_k(ω) − 10 log_{10} S̃_k(ω))² dω ] (in dB), (10)

where S_k(ω) and S̃_k(ω) are the LPC power spectra corresponding to the original and quantized LSF vectors, respectively, for frame k (recall that K is the size of the quantized LSF vector sequence). In practice, the integral in (10) is calculated using a 512-bin FFT.
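The ASD computation can be sketched as follows, with the integral of (10) replaced by an average over a 512-bin FFT grid, as in the text. The LPC filters are assumed to be given by their coefficient vectors [1, a_1, …, a_L]; names are illustrative:

```python
import numpy as np

def spectral_distortion(a1, a2, nfft=512):
    """SD (in dB) between two all-pole LPC filters given the coefficient
    vectors [1, a_1, ..., a_L] of A(z); the frequency integral of Eq. (10)
    is approximated by a mean over an nfft-bin FFT grid."""
    def log_power_spectrum(a):
        A = np.fft.rfft(a, nfft)                    # A(e^{jw}) on the FFT grid
        # 10 log10 of the all-pole power spectrum 1/|A|^2
        return -10.0 * np.log10(np.maximum(np.abs(A) ** 2, 1e-20))
    d = log_power_spectrum(a1) - log_power_spectrum(a2)
    return np.sqrt(np.mean(d ** 2))

def average_spectral_distortion(frames1, frames2):
    """ASD: mean SD over the K frames of a sequence, as in Eq. (10)."""
    return float(np.mean([spectral_distortion(a, b)
                          for a, b in zip(frames1, frames2)]))
```

For example, two identical filters give an SD of 0 dB, and scaling the coefficients by a factor g shifts the log spectrum uniformly, giving an SD of 20 log10 g dB.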
For a given quantizer, an ASD target value, denoted ASD_t, is set. Then, starting with P = 1, the complete process of Section 3.1 is applied. The ASD between the original and quantized LSF vector sequences is then calculated. If it is below ASD_t, the order is fixed to the current value of P; otherwise, P is increased by one and the process is repeated. The algorithm thus terminates at the first value of P for which the ASD falls below ASD_t, or otherwise at P = K, since we must have P ≤ K. All this can be formalized by the following algorithm:
(1) choose a value for ASD_t. Set P = 1;
(2) apply the LT coding process of Section 3.1, that is:
(i) calculate C with (4),
(ii) calculate J,
(iii) calculate Y with (6),
(iv) quantize Y to obtain Ỹ,
(v) calculate X̃ by combining (9) and (8);
(3) calculate the ASD between X and X̃ with (10);
(4) if ASD > ASD_t and P < K, set P = P + 1 and go to step (2); else (if ASD ≤ ASD_t or P = K), terminate the algorithm.
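The order-search loop of steps (1)–(4) can be sketched as follows. For self-containment, a plain RMS error stands in for the ASD of (10), and a fine uniform rounding stands in for the MSVQ; the toy trajectories are generated by an order-3 DCM so that the search is expected to stop at P = 3:

```python
import numpy as np

def dcm_basis(t, P, N):
    p = np.arange(P)[:, None]
    return np.cos(p * np.pi * np.asarray(t, dtype=float)[None, :] / N)

def lt_code(X, centers, N, P, quantize):
    """One pass of the Section 3.1 chain for a trial order P (steps (2)(i)-(v))."""
    M = dcm_basis(centers, P, N)
    C = X @ M.T @ np.linalg.inv(M @ M.T)     # Eq. (4), regularization omitted
    J = np.round(np.linspace(0, N - 1, P))   # regularly spaced positions
    Q = dcm_basis(J, P, N)
    Y_q = quantize(C @ Q)                    # quantized reduced set
    return (Y_q @ np.linalg.inv(Q)) @ M      # Eqs. (8)-(9): decoded trajectories

def estimate_order(X, centers, N, quantize, distortion, target):
    """Increase P until the distortion target is met, or P reaches K (step (4))."""
    K = X.shape[1]
    for P in range(1, K + 1):
        X_dec = lt_code(X, centers, N, P, quantize)
        if distortion(X, X_dec) <= target:
            break
    return P, X_dec

# Toy run: two trajectories generated by an order-3 DCM, fine rounding quantizer
N, K = 2000, 12
centers = np.linspace(0, N - 1, K)
C_true = np.array([[0.5, 1.0, 1.0], [1.0, -1.0, 0.5]])
X = C_true @ dcm_basis(centers, 3, N)
rms = lambda A, B: np.sqrt(np.mean((A - B) ** 2))
P_opt, X_dec = estimate_order(X, centers, N,
                              quantize=lambda y: np.round(y * 100) / 100,
                              distortion=rms, target=0.05)
```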
3.3. Quantizers
In this subsection, we present the quantizers that are used to quantize the reduced set of LSF vectors in step (2)(iv) of the above algorithm. As briefly mentioned in the introduction, vector quantization (VQ) has become the standard technique for LSF quantization in modern speech coders [1, 3, 4]. However, for high-quality coding, basic single-stage VQ is generally limited by codebook storage capacity, search complexity, and training procedure. Thus, different suboptimal but still efficient schemes have been proposed to reduce complexity. For example, split-VQ, which consists of splitting the vectors into several subvectors for quantization, has been proposed at 24 bits/frame and offered coding transparency [28].^{7}
In this study, we used multistage VQ (MSVQ),^{8} which consists of cascading several low-resolution VQ blocks [29, 30]: the output of a block is an error vector which is quantized by the next block. The quantized vectors are reconstructed by adding the outputs of the different blocks. Therefore, each additional block increases the quantization accuracy, while the global complexity (in terms of codebook generation and search) is greatly reduced compared to a single-stage VQ with the same overall bit rate. Also, different quantizers were designed and used for voiced and unvoiced LSF vectors, as in, for example, [31]. This is because we want to benefit from the V/UV signal segmentation to improve the quantization process by better fitting the general trends of voiced or unvoiced LSFs. Detailed information on the structure of the MSVQs used in this study, their design, and their performance is given in Section 4.3.
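The MSVQ encode/decode principle can be sketched as follows, on a deliberately tiny 2-D example with two 1-bit stages (a real LSF MSVQ operates on 10-D vectors with 10–12 bits per stage, and would use the weighted distance of Section 4.3):

```python
import numpy as np

def msvq_encode(x, codebooks):
    """Multistage VQ: each stage quantizes the residual left by the previous one."""
    indices, residual = [], np.asarray(x, dtype=float)
    for cb in codebooks:                  # cb: (n_codewords x dim) array
        i = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        indices.append(i)
        residual = residual - cb[i]
    return indices

def msvq_decode(indices, codebooks):
    """Reconstruction: sum of the selected codewords of all stages."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

# Toy example: stage 2 refines the residual left by stage 1
stage1 = np.array([[0.0, 0.0], [1.0, 1.0]])
stage2 = np.array([[0.0, 0.0], [0.25, -0.25]])
idx = msvq_encode([1.25, 0.75], [stage1, stage2])
x_hat = msvq_decode(idx, [stage1, stage2])
```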
3.4. Rate-Distortion Considerations
Now that the long-term coding method has been presented, it is interesting to derive an expression of the error between the original and quantized LSF matrices. Indeed, we have

X − X̃ = (X − X̂) + (X̂ − X̃). (11)

Combining (11) with (8), and introducing Y = CQ from (6), basic algebraic manipulation leads to

X − X̃ = (X − CM) + (Y − Ỹ)Q^{−1}M. (12)

Equation (12) shows that the overall quantization error on the LSF vectors can be seen as the sum of the contributions of the LT modeling and of the quantization process. Indeed, on the right side of (12), X − CM is the LT modeling error, defined as the difference between the original and modeled LSF vector sequences. Additionally, Y − Ỹ is the quantization error of the reduced set of LSF vectors. It is "spread" over the K original time indexes by a linear transformation built from the matrices M and Q. The modeling and quantization errors are independent. Therefore, the proposed method will be efficient if the bit rate gain resulting from quantizing only the reduced set of P LSF vectors (compared to quantizing the whole set of K vectors in frame-by-frame quantization) compensates for the loss due to the modeling.
In the proposed LT LSF coding method, the bit rate b for a given section of speech is given by

b = rP/(Kh),

where r is the resolution of the quantizer (in bits/vector) and h is the hop size of the LSF analysis (h = 20 ms). Since the LT coding scheme is an intrinsically variable-rate technique, we also define an average bit rate, which results from encoding a large number of LSF vector sequences:

b̄ = (r Σ_{m=1}^{M} P_m) / (h Σ_{m=1}^{M} K_m), (13)

where m indexes each sequence of LSF vectors of the considered database, M being the number of sequences. In the LT coding process, increasing the quantizer resolution does not necessarily increase the bit rate, as opposed to usual coding methods, since it may decrease the number of LT model coefficients (for the same overall ASD target). Therefore, an optimal LT coding configuration is expected to result from a tradeoff between quantizer resolution and LT modeling accuracy. In Section 4.4, we provide extensive distortion-rate results by testing the method on a large speech database and varying both the resolution of the quantizer and the ASD target value.
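The per-section bit rate and the database average of (13) amount to a total-bits-over-total-time computation, which can be sketched as follows (function names and the (P_m, K_m) list format are illustrative):

```python
def lt_bit_rate(r, P, K, h=0.020):
    """Bit rate (bits/s) of one LT-coded section: r bits for each of the P
    reduced-set vectors, spread over K frames of hop size h seconds."""
    return r * P / (K * h)

def average_bit_rate(sections, r, h=0.020):
    """Average bit rate of Eq. (13): total bits over total duration, with
    sections given as a list of (P_m, K_m) pairs."""
    total_bits = r * sum(P for P, _ in sections)
    total_time = h * sum(K for _, K in sections)
    return total_bits / total_time
```

For a single section, the average reduces to the per-section rate; for example, a section with P = 5 and K = 15 coded at r = 22 bits/vector costs 22·5/(15·0.02) ≈ 367 bits/s.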
4. Experiments
In this section, we describe the set of experiments that were conducted to test the long-term coding of LSF trajectories. We first briefly describe in Section 4.1 the 2D-transform coding techniques [18, 19] that we implemented in parallel for comparison with the proposed technique. The database used in the experiments is presented in Section 4.2. Section 4.3 presents the design of the MSVQ quantizers used in the LT coding algorithm. Finally, in Section 4.4, the results of the LSF long-term coding process are presented.
4.1. 2D-Transform Coding Reference Methods
As briefly mentioned in the introduction, the basic principle of the 2D-transform coding methods consists of applying either a 2D-DCT or a Karhunen-Loève Transform (KLT) to the LSF matrices. In contrast to the present study, the resulting transform coefficients are directly quantized using scalar quantization (after being normalized, though). Bit allocation tables, the means and variances of the transform coefficients, and optimal (nonuniform) scalar quantizers are determined during a training phase applied to a training corpus of data (see Section 4.2): the bit allocation among the set of transformed coefficients is determined from their variances [32], and the quantizers are designed using the LBG algorithm [33] (see [18, 19] for details). This is done for each considered temporal size K, and for a large range of bit rates (see Section 4.4).
4.2. Database
We used American English sentences from the TIMIT database [34]. The signals were resampled at 8 kHz and band-pass filtered to the 300–3400 Hz telephone band. The LSF vectors were calculated every 20 ms using the autocorrelation method, with a 30 ms Hann window (hence a 33% overlap),^{9} high-frequency preemphasis, and a 10 Hz bandwidth expansion. The voiced/unvoiced segmentation was based on the TIMIT label files, which contain the phoneme labels and boundaries (given as sample indexes) for each sentence. An LSF vector was classified as voiced if at least 25% of the analysis frame was part of a voiced phoneme region. Otherwise, it was classified as an unvoiced LSF vector.
Eight sentences from each of 176 speakers (half male and half female) from the eight different dialect regions of the TIMIT database were used to build the training corpus. This represents about 47 min of voiced speech and 16 min of unvoiced speech, resulting in 141,058 voiced LSF vectors from 9,744 sections, and 45,220 unvoiced LSF vectors from 9,271 sections. This corpus was used to design the MSVQ quantizers used in the proposed LT coding technique (see Section 4.3). It was also used to design the bit allocation tables and associated optimal scalar quantizers for the 2D-transform coefficients of the reference methods.^{10}
In parallel, eight other sentences from 84 other speakers (also 50% male, 50% female, and from the eight dialect regions) were used for the test corpus. It contains 67,826 voiced vectors from 4,573 sections (about 23 min of speech), and 22,242 unvoiced vectors from 4,351 sections (about 8 min of speech). This test corpus was used to test the LT coding method and to compare it with frame-by-frame VQ and the 2D-transform methods.
The histograms of the temporal size K of the (voiced and unvoiced) LSF sequences for both the training and test corpora are given in Figure 2. Note that the average size of an unvoiced sequence (about 5 vectors ≈ 100 ms) is significantly smaller than the average size of a voiced sequence (about 15 vectors ≈ 300 ms). Since there are almost as many voiced as unvoiced sections, the average number of voiced or unvoiced sections per second is about 2.5.
4.3. MSVQ Codebook Design
As mentioned in Section 3.3, for quantizing the reduced set of LSF vectors, we implemented a set of MSVQs for both voiced and unvoiced LSF vectors. In this study, we used two-stage and three-stage quantizers, with a resolution ranging from 20 to 36 bits/vector in 2-bit steps. Generally, a resolution of about 25 bits/vector is necessary to provide transparent or "close to transparent" quantization, depending on the structure of the quantizer [29, 30]. In parallel, it was reported in [31] that significantly fewer bits are necessary to encode unvoiced LSF vectors than voiced LSF vectors. Therefore, the large resolution range that we used allowed us to test a wide set of configurations, for both voiced and unvoiced speech.
The quantizers were designed by applying the LBG algorithm [33] to the (voiced or unvoiced) training corpus described in Section 4.2, using the perceptually weighted Euclidean distance between LSF vectors proposed in [28]. The two/three-stage quantizers are obtained as follows. The LBG algorithm is first used to design the first codebook block. Then, the difference between each LSF vector of the training corpus and its associated codeword is calculated. The resulting set of vectors is used as a new training corpus for the design of the next block, again with the LBG algorithm. The decoding of a quantized LSF vector is done by adding the outputs of the different blocks. For resolutions ranging from 20 to 24 bits/vector, two-stage quantizers were designed with a balanced bit allocation between stages, that is, 10-10, 11-11, and 12-12. For resolutions within the range 26–36, a third stage was added with 2 to 12 bits. This is because computational considerations limit the resolution of each block to 12 bits. Note that the multistage structure does not guarantee that the quantized LSF vector is correctly conditioned (i.e., in some cases, LSF pairs can be too close to each other or even permuted). Therefore, a regularization procedure was added to ensure correct sorting and a minimal distance of 50 Hz between LSFs.
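The sequential stage-by-stage design can be sketched as follows, with plain k-means and an unweighted Euclidean distance standing in for the LBG algorithm and the perceptually weighted distance of [28] (toy data sizes; no LSF sorting regularization):

```python
import numpy as np

def kmeans(data, n_codewords, n_iter=25, seed=0):
    """Plain k-means as a stand-in for the LBG codebook design."""
    rng = np.random.default_rng(seed)
    cb = data[rng.choice(len(data), n_codewords, replace=False)].copy()
    for _ in range(n_iter):
        labels = ((data[:, None, :] - cb[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(n_codewords):
            if np.any(labels == j):
                cb[j] = data[labels == j].mean(0)   # update non-empty cells
    return cb

def design_msvq(train, stage_sizes):
    """Sequential design: each stage is trained on the residuals of the previous."""
    codebooks, residual = [], np.asarray(train, dtype=float)
    for size in stage_sizes:
        cb = kmeans(residual, size)
        idx = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1).argmin(1)
        residual = residual - cb[idx]               # pass residuals to next stage
        codebooks.append(cb)
    return codebooks

# Toy corpus: 200 random 4-D "LSF-like" vectors, two 8-codeword stages
rng = np.random.default_rng(1)
train = rng.normal(size=(200, 4))
books = design_msvq(train, [8, 8])
```

Each added stage can only reduce the aggregate training error, which mirrors the accuracy/complexity argument of Section 3.3.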
4.4. Results
In this subsection, we present the results obtained by the proposed method for LT coding of LSF vectors. We first briefly present a typical example on a single sentence. We then give a complete quantitative assessment of the method over the entire test database, in terms of distortion-rate performance. Comparative results obtained with classic frame-by-frame quantization and with the 2D-transform coding techniques are provided. Finally, we give a perceptual evaluation of the proposed method.
4.4.1. A Typical Example of a TIMIT Sentence
We first illustrate the behavior of the algorithm of Section 3.2 on a given sentence of the corpus. The sentence is "Elderly people are often excluded", pronounced by a female speaker. It contains five voiced sections and four unvoiced sections (see Figure 3). In this experiment, the ASD target was 2.1 dB for the voiced sections and 1.9 dB for the unvoiced sections. For the voiced sections, setting r = 20, 22, and 24 bits/vector leads to bit rates of 557.0, 515.2, and 531.6 bits/s, respectively, for actual ASDs of 1.99, 2.01, and 1.98 dB, respectively. The corresponding total numbers of model coefficients are 44, 37, and 35, respectively, to be compared with the total number of voiced LSF vectors, which is 79. This illustrates the fact that, as mentioned in Section 3.4, for the LT coding method the bit rate does not vary monotonically with the resolution, since the number of model coefficients also varies. In this case, r = 22 bits/vector seems to be the best choice. Note that, in comparison, frame-by-frame quantization provides 2.02 dB of ASD at 700 bits/s. For the unvoiced sections, the best results are obtained with r = 20 bits/vector: we obtain 1.82 dB of ASD at 620.7 bits/s (frame-by-frame VQ provides 1.81 dB at 700 bits/s).
Figure 3 shows the corresponding original and LT-coded LSF trajectories. It illustrates the ability of the LT model to globally fit the original LSF trajectories, even though the model coefficients are calculated from the quantized reduced set of LSF vectors.
4.4.2. Average Distortion-Rate Results
In this subsection, we generalize the results of the previous subsection by (i) varying the ASD target and the MSVQ resolution r within a large set of values, (ii) applying the LT coding algorithm to all sections of the test database and averaging the bit rate (13) and the ASD (10) across either all 4,573 voiced sections or all 4,351 unvoiced sections of the test database, and (iii) comparing the results with those obtained with the 2D-transform coding methods and the frame-by-frame VQ.
As already mentioned in Section 4.2, the resolution range for the MSVQ quantizers used in LT coding is 20 to 36 bits/vector. The ASD target was varied from 2.6 dB down to a minimum value with a 0.2 dB step. The minimum value is 1.0 dB for r = 36, 34, 32 and 30 bits/vector, and it is then increased by 0.2 dB each time the resolution is decreased by 2 bits/vector (it is thus 1.2 dB for r = 28 bits/vector, 1.4 dB for r = 26 bits/vector, and so on). In parallel, the distortion-rate values were also calculated for usual frame-by-frame quantization, using the same quantizers as in the LT coding process and the same test corpus. In this case, the resolution range was extended to lower values for a better comparison. For the 2D-transform coding methods, the temporal size was varied from 1 to 20 for voiced LSFs, and from 1 to 10 for unvoiced LSFs. This choice was based on the histograms of Figure 2 and on computational limitations.^{11} It is consistent with the values considered in [19]. We calculated the corresponding ASD for the complete test corpus, and for seven values of the optimal scalar quantizer resolution: 0.75, 1, 1.25, 1.5, 1.75, 2.0 and 2.25 bits/parameter. This corresponds to 375, 500, 625, 750, 875, 1,000 and 1,125 bits/s, respectively (since the hop size is 20 ms). We also calculated for each of these resolutions a weighted average value of the spectral distortion (ASD), the weights being the bins of the histogram of Figure 2 (for the test corpus) normalized by the total size of the corpus. This takes into account the distribution of the temporal size of the LSF sequences in the rate-distortion relationship, for a fair comparison with the proposed LT coding technique. This way, we assume that both the proposed method and the 2D-transform coding methods work with the same "adaptive" temporal-block configuration.
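For reference, the spectral distortion between an original and a quantized LPC filter can be computed as the RMS log-spectral difference between the two all-pole model spectra. The sketch below uses a common textbook definition evaluated on a uniform grid over [0, π); the paper's ASD, its Eq. (10), may band-limit or weight the average differently, so the grid and averaging here are our assumptions.

```python
import numpy as np

def lpc_power_spectrum(a, w):
    """|A(e^{jw})|^2 on the grid w, for an LPC polynomial a = [1, a1, ..., ap].
    The all-pole model spectrum is proportional to 1/|A|^2."""
    E = np.exp(-1j * np.outer(w, np.arange(len(a))))
    return np.abs(E @ a) ** 2

def spectral_distortion(a, a_q, nfft=512):
    """RMS log-spectral difference (dB) between models 1/A(z) and 1/A_q(z)."""
    w = np.linspace(0.0, np.pi, nfft, endpoint=False)
    # 10*log10 of the ratio of the two model power spectra
    d = 10.0 * np.log10(lpc_power_spectrum(a_q, w) / lpc_power_spectrum(a, w))
    return np.sqrt(np.mean(d ** 2))

a = np.array([1.0, -0.9])            # toy first-order LPC filter
print(spectral_distortion(a, a))     # identical filters -> 0 dB
print(spectral_distortion(a, np.array([1.0, -0.8])))  # nonzero distortion
```

Averaging this quantity over all frames of a section (or corpus) gives an average spectral distortion in the spirit of the ASD figures quoted throughout this section.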
The results are presented in Figures 4 and 5 for the voiced sections, and in Figures 6 and 7 for the unvoiced sections. Let us begin the analysis with the voiced sections. Figure 4 displays the results of the LT coding technique in terms of ASD as a function of the bit rate. Each curve on the left of the figure corresponds to a fixed MSVQ resolution (whose value is plotted), the ASD target being varied. It can be seen that the different resolutions provide an array of intertwined curves, each following the classic rate-distortion relationship: an increase of the ASD goes with a decrease of the bit rate. These curves are generally situated to the left of the frame-by-frame quantization curve, which is also plotted; they thus generally correspond to smaller bit rates. Moreover, the gain in bit rate at approximately the same ASD can be very large, depending on the considered region and the resolution (see details below). In a general manner, the way the curves are intertwined implies that increasing the MSVQ resolution increases the bit rate in the upper-left region of the curves, but this is no longer the case in the lower-right region, after the curves "cross". This illustrates the specific trade-off, mentioned in Section 3.4, that must be tuned between quantization accuracy and modeling accuracy. The ASD target value has a strong influence on this trade-off. For a given ASD level, the lowest bit rate is obtained with the leftmost point, which depends on the resolution. The set of optimal points for the different ASD values, that is, the lower-left envelope of the curves, can be extracted; it forms what will be referred to as the optimal LT coding curve.
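Extracting this lower-left envelope amounts to keeping the Pareto-optimal (rate, distortion) points: a point survives only if no other point has both a lower bit rate and a lower ASD. A minimal sketch follows (the point values are made up, not taken from the figures):

```python
def optimal_envelope(points):
    """Lower-left envelope of (bit_rate, distortion) points.
    Sort by rate; keep each point that strictly improves the best
    distortion seen so far (Pareto-optimal points)."""
    env, best_dist = [], float("inf")
    for rate, dist in sorted(points):
        if dist < best_dist:
            env.append((rate, dist))
            best_dist = dist
    return env

# made-up operating points from several fixed-resolution curves
curves = [(500, 2.0), (550, 1.9), (600, 2.1), (450, 2.3), (700, 1.7)]
print(optimal_envelope(curves))
# -> [(450, 2.3), (500, 2.0), (550, 1.9), (700, 1.7)]  (600, 2.1) is dominated
```

The point (600, 2.1) is discarded because (550, 1.9) achieves a lower distortion at a lower rate, which is exactly how the intertwined fixed-resolution curves collapse into the single optimal LT coding curve.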
For easier comparison, we report this optimal curve in Figure 5, together with the results obtained with the 2D-DCT and KLT transform coding methods (and, again, the frame-by-frame quantization curve). The curves of the 2D-DCT transform coding are given for temporal sizes 2, 5, 10 and 20, and also for the "adaptive" curve (i.e., the values averaged according to the distribution of the temporal size), which is the main reference in this variable-rate study. We can see that for the 2D-DCT transform coding, the longer the temporal size, the lower the ASD. The average curve lies between the curves corresponding to K = 5 and K = 10. For clarity, the KLT transform coding curve is only given for the adaptive configuration. This curve is about 0.05 to 0.1 dB below the adaptive 2D-DCT curve, which corresponds to savings of about 2–3 bits/vector, depending on the bit rate (this is consistent with the optimal character of the KLT and with the results reported in [19]).
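The 2D-DCT reference scheme can be sketched as follows: transform a 10 × K block of LSF trajectories with a separable 2D DCT, keep/quantize only a budgeted subset of coefficients, and invert. The crude magnitude-based selection below stands in for the paper's bit-allocation tables and optimal scalar quantizers, and the LSF matrix is synthetic, so this is an illustration of the transform-coding principle only.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n (rows = frequencies)."""
    k = np.arange(n)
    D = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    D[0] *= np.sqrt(1.0 / n)
    D[1:] *= np.sqrt(2.0 / n)
    return D

def dct2(X):
    """Separable 2D DCT: along LSF index (rows) and time (columns)."""
    return dct_matrix(X.shape[0]) @ X @ dct_matrix(X.shape[1]).T

def idct2(C):
    """Inverse 2D DCT (orthonormal basis, so the inverse is the transpose)."""
    return dct_matrix(C.shape[0]).T @ C @ dct_matrix(C.shape[1])

# synthetic 10-LSF x 8-frame block: ordered means plus a small random walk
rng = np.random.default_rng(0)
X = np.linspace(0.1, 0.9, 10)[:, None] + np.cumsum(
    rng.normal(0.0, 0.01, (10, 8)), axis=1)
C = dct2(X)
# keep the 16 largest of 80 coefficients (stand-in for bit allocation)
mask = np.abs(C) >= np.sort(np.abs(C).ravel())[-16]
X_hat = idct2(C * mask)
print(np.sqrt(np.mean((X - X_hat) ** 2)))  # small reconstruction RMSE
```

Because the LSF energy is concentrated in the low 2D frequencies, a small coefficient subset reconstructs the block accurately; this joint time-frequency compaction is also why, per Endnote 5, such 2D coefficients cannot be directly time-interpolated the way the LT model's coefficients can.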
We can see in Figure 5 that the curves of the 2D-transform coding techniques cross the optimal LT coding curve from top-left to bottom-right. This implies that for the upper part of the considered bit-rate range (say, above about 900 bits/s) the 2D-transform coding techniques perform better than the proposed method. Their performance tends toward the 1 dB transparency bound for bit rates above 1 kbit/s, which is consistent with the results of [18]. With the considered configuration, the LT coding technique is limited to about 1.1 dB of ASD, and the corresponding bit rate is not competitive with that of the 2D-transform techniques (it is even comparable to simple frame-by-frame quantization above 1.2 kbit/s). In contrast, for lower bit rates, the optimal LT coding technique clearly outperforms both 2D-transform methods. For example, at 2.0 dB of ASD, the bit rates of the LT, KLT, and 2D-DCT coding methods are about 489, 587, and 611 bits/s, respectively. The bit-rate gain provided by the LT coding technique over the KLT and 2D-DCT techniques is therefore about 98 bits/s (i.e., 16.7%) and 122 bits/s (i.e., 20%), respectively. Note that for such an ASD value, the frame-by-frame VQ requires about 770 bits/s; compared to this method, the relative gain in bit rate of the LT coding is about 36.5%. Moreover, since the slope of the LT coding curve is smaller than the slopes of the other curves, the relative gain in bit rate (or in ASD) provided by the LT coding increases significantly as we move towards lower bit rates. For instance, at 2.4 dB, we have about 346 bits/s for the LT coding, 456 bits/s for the KLT, 476 bits/s for the 2D-DCT, and 630 bits/s for the frame-by-frame quantization. The relative bit-rate gains are respectively 24.1% (110 out of 456), 27.3% (130 out of 476), and 45.1% (284 out of 630).
In terms of ASD, we have for example 1.76 dB, 1.90 dB, and 1.96 dB, respectively, for the LT coding, the KLT, and the 2D-DCT at 625 bits/s. This represents a relative gain of 7.4% and 10.2% for the LT coding over the two 2D-transform coding techniques. At 375 bits/s this gain reaches 15.8% and 18.1%, respectively (2.30 dB for the LT coding, 2.73 dB for the KLT, and 2.81 dB for the 2D-DCT).
For unvoiced sections, the general trends of the LT quantization technique discussed in the voiced case can be seen again in Figure 6. However, at a given bit rate, the ASD obtained here is generally slightly lower than in the voiced case, especially for the frame-by-frame quantization. This is because unvoiced LSF vectors are easier to quantize than voiced LSF vectors, as pointed out in [31]. Also, the LT coding curves are more "spread out" than for the voiced sections of speech. As a result, the bit-rate gains over the frame-by-frame quantization are positive only below, say, 900 bits/s, and they are generally lower than in the voiced case, although they remain significant for the lower bit rates. This can be seen more easily in Figure 7, where the optimal LT curve is reported for unvoiced sections. For example, at 2.0 dB the LT quantization bit rate is about 464 bits/s, while the frame-by-frame quantizer bit rate is about 618 bits/s (a relative gain of 24.9%). Compared to the 2D-transform techniques, the LT coding technique is also less efficient than in the voiced case. The "crossing point" between LT coding and 2D-transform coding is here at about 700–720 bits/s and 1.6 dB. To the right of this point, the 2D-transform techniques clearly provide better results than the proposed LT coding technique. In contrast, below 700 bits/s, the LT coding performs better, even if the gains are lower than in the voiced case. An idea of the maximum gain of LT coding over 2D-transform coding is given at 1.8 dB: the LT coding bit rate is 561 bits/s, versus 592 bits/s for the KLT and 613 bits/s for the 2D-DCT (relative gains of 5.2% and 8.5%, resp.).
Let us close this subsection with a calculation of the approximate bit rate needed to encode the pair (K, P) (see Section 3.1). It is a classical result that any finite alphabet can be encoded with a code of average length L satisfying H ≤ L < H + 1, where H is the entropy of the alphabet [1]. We estimated the entropy of the set of (K, P) pairs obtained on the test corpus after termination of the LT coding algorithm, for the set of configurations corresponding to the optimal LT coding curve. Since the average number of voiced or unvoiced sections is about 2.5 per second (see Section 4.2), the resulting additional bit rate is quite small compared to the bit-rate gain provided by the proposed LT coding method over the frame-by-frame quantization. Besides, the 2D-transform coding methods require the transmission of the size K of each section. Following the same idea, the entropy of the set of K values was found to be 5.1 bits for the voiced sections and 3.4 bits for the unvoiced sections, hence coding rates of about 12.8 bits/s and 8.5 bits/s, respectively. The difference between encoding K and encoding the pair (K, P) is less than 5 bits/s in any case. This shows that (i) the values of K and P are significantly correlated, and (ii) because of this correlation, the additional cost of encoding P in addition to K is very small compared to the bit-rate difference between the proposed method and the 2D-transform methods within the bit-rate range of interest.
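These side-information estimates can be reproduced mechanically from the multiset of coded symbols. The sketch below uses made-up (K, P) pairs to show both the first-order entropy estimator and why jointly coding a strongly correlated pair costs little more than coding K alone.

```python
import math
from collections import Counter

def empirical_entropy(symbols):
    """First-order empirical entropy (bits/symbol) of a symbol sequence."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# made-up (K, P) pairs for illustration; in the paper they come from the coder
pairs = [(10, 4), (10, 4), (12, 5), (12, 5), (20, 8), (10, 4), (12, 5), (20, 8)]
H_pair = empirical_entropy(pairs)
H_K = empirical_entropy([k for k, _ in pairs])
print(H_pair, H_K)    # equal here, since P is fully determined by K in this toy data
print(2.5 * H_pair)   # extra side-information rate in bits/s at ~2.5 sections/s
```

In this extreme toy case P is a deterministic function of K, so the joint entropy equals the entropy of K alone; the paper's less-than-5-bits/s gap between the two codes reflects a softer version of the same correlation.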
4.4.3. Listening Tests
To confirm the efficiency of the long-term coding of LSF parameters from a subjective point of view, signals with quantized LSFs were generated by filtering the original signals with the filter A(z)/A_q(z), where A_q(z) is the LPC analysis filter derived from the quantized LSF vector and A(z) is the original (unquantized) LPC filter (this implies that the residual signal is not modified). The sequence of filters was generated with both the LT method and 2D-DCT transform coding. Ten sentences of TIMIT were selected for a formal listening test (5 by a male speaker and 5 by a female speaker, from different dialect regions). For each of them, the following conditions were verified for both voiced and unvoiced sections: (i) the bit rate was lower than 600 bits/s; (ii) the ASD was between 1.8 dB and 2.2 dB; (iii) the absolute ASD difference between LT coding and 2D-DCT coding was less than 0.02 dB; and (iv) the LT coding bit rate was at least 20% (resp., 7.5%) lower than the 2D-DCT coding bit rate for the voiced (resp., unvoiced) sections. Twelve subjects with normal hearing listened to the 10 pairs of sentences coded with the two methods and presented in random order, using a high-quality PC soundcard and Sennheiser HD280 headphones, in a quiet environment. They were asked to make a forced choice (i.e., perform an AB test), based on the perceived best quality.
The overall preference score across sentences and subjects is 52.5% for the long-term coding versus 47.5% for the 2D-DCT transform coding; the difference between the two overall scores is therefore not significant. Considering the scores sentence by sentence reveals that, for two sentences, the LT coding is significantly preferred (83.3% versus 16.7%, and 66.6% versus 33.3%). For one other sentence, the 2D-DCT coding method is significantly preferred (75% versus 25%). In those cases, both the LT-coded and 2D-DCT-coded signals exhibit audible (although rather small) artifacts. For the seven other sentences, the scores range from 41.7%/58.3% to the inverse 58.3%/41.7%, indicating that for these sentences the two methods provide very close signals. In this case, and for both methods, the quality of the signals, although not transparent, is fairly good for such low rates (below 600 bits/s): the overall sound quality is preserved, and there is no significant artifact.
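Whether such preference scores are significant can be checked with an exact binomial test under the null hypothesis of no preference (our addition; the paper does not state which test, if any, it used). With 12 subjects and 10 sentences, the overall score rests on 120 judgments (52.5% = 63 of 120) and a per-sentence score on 12 judgments (83.3% = 10 of 12):

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    no more likely than the observed count k under the null Binomial(n, p)."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(q for q in pmf if q <= pmf[k] + 1e-12)

print(binomial_two_sided_p(63, 120))  # overall 52.5% of 120: well above 0.05
print(binomial_two_sided_p(10, 12))   # per-sentence 10/12 (83.3%): below 0.05
```

This matches the qualitative reading in the text: the overall 52.5%/47.5% split is consistent with chance, while a 10-of-12 per-sentence preference is not.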
These observations are confirmed by extended informal listening tests on many other signals of the test database: the quality of the signals obtained by the LT coding technique (and also by the 2D-DCT transform coding) at rates as low as 300–500 bits/s varies a lot. Some coded sentences are characterized by quite annoying artifacts, whereas others exhibit surprisingly good quality. Moreover, in many cases, the strength of the artifacts does not seem to be directly correlated with the ASD value. This suggests that the quality of very- to ultra-low bit-rate LSF quantization may largely depend on the signal itself (e.g., speaker and phonetic content). The influence of such factors is beyond the scope of this paper, but it should be considered more carefully in future work.
4.4.4. A Few Computational Considerations
The complete LT LSF coding and decoding process runs in approximately half real time using MATLAB on a PC with a 2.3 GHz processor (i.e., 0.5 s is necessary to process 1 s of speech).^{12} Experiments were conducted with the "raw" exhaustive search of the optimal order P in the algorithm of Section 3.2. A refined (e.g., dichotomous) search procedure would decrease the computational cost and time by a factor of about 4 to 5. An optimized C implementation would therefore run well below real time. Note that the decoding time is only a small fraction (typically 1/10 to 1/20) of the coding time, since decoding consists of applying (8) and (9) only once, using the reduced set of decoded LSF vectors and the decoded pair (K, P).
5. Summary and Perspectives
In this paper, a variable-rate long-term approach to LSF quantization has been proposed for offline or large-delay speech coding. It is based on the modeling of the time trajectories of LSF parameters with a Discrete Cosine model, combined with a "sparse" vector quantization of a reduced set of LSF vectors. An iterative algorithm has been shown to provide jointly an efficient shaping of the model and an estimation of its optimal order. As a result, the method generally provides a very large gain in bit rate (up to 45%) compared to short-term (frame-by-frame) quantization, at an equivalent coding quality. Also, for the lower range of tested bit rates (i.e., below 600–700 bits/s), the method compares favorably with transform coding techniques that also exploit the interframe correlation of LSFs across many frames. This has been demonstrated by an extensive distortion-rate benchmark and by listening tests. The bit-rate gain is up to about 7.5% for unvoiced speech and up to about 25% for voiced speech, depending on the coding accuracy. Of course, at the considered low bit rates, the ASD is significantly above the 1.0 dB bound that is correlated with transparent quality. However, the proposed method provides a new bound of attainable performance for LSF quantization at very to ultra-low bit rates. It can also be used as a first stage in a refined LSF coding scheme at higher rates: the difference between the original and LT-coded LSFs can be coded by other techniques once the long-term interframe correlation has been removed.
It must be mentioned here that, although efficient, the MSVQs used in this study are not the best quantizers available. For instance, we have not used fully optimized MSVQ (i.e., using trellis search as in [30]), but basic (i.e., sequential-search) MSVQ. Also, more sophisticated framewise methods have been proposed to obtain transparent LSF quantization at rates lower than those required for MSVQ, but at the cost of increased complexity [35, 36]. Refined versions of split-VQ are also good candidates for improved performance. We restricted ourselves to a relatively simple VQ technique because the goal of the present study was primarily to show the interest of the long-term approach. Therefore, it is very likely that the performance of the proposed LT coding algorithm can be significantly improved by using high-performance (but more complex) quantizers,^{13} since the reduced set of LSF vectors may then be quantized with a lower ASD/resolution compared to the MSVQ. In contrast, it seems very difficult to improve the performance of the reference 2D-transform methods, since we used optimal (nonuniform) quantizers to encode the corresponding 2D coefficients.
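The "basic (sequential-search) MSVQ" mentioned above can be sketched as follows. The codebooks here are random stand-ins rather than trained ones, and each includes the zero codeword so that a stage can never increase the error; both choices are conveniences of this toy setup, not features of the quantizers used in the paper.

```python
import numpy as np

def msvq_encode(x, codebooks):
    """Sequential (greedy) multistage VQ: each stage quantizes the residual
    left by the previous stages using its own codebook."""
    residual, indices = x.copy(), []
    for cb in codebooks:
        i = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        indices.append(i)
        residual = residual - cb[i]
    return indices

def msvq_decode(indices, codebooks):
    """Reconstruction is the sum of the selected codewords, one per stage."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

rng = np.random.default_rng(1)
codebooks = []
for scale in (1.0, 0.3, 0.1):             # 3 stages of 64 entries = 18 bits total
    cb = rng.normal(0.0, scale, (64, 10))
    cb[0] = 0.0                           # zero codeword: a stage may "do nothing"
    codebooks.append(cb)

x = rng.normal(0.0, 1.0, 10)              # stand-in for a 10-dim LSF vector
idx = msvq_encode(x, codebooks)
x_hat = msvq_decode(idx, codebooks)
print(np.linalg.norm(x - x_hat))          # residual quantization error after 3 stages
```

The greedy stage-by-stage search is what makes this the "basic" variant: a trellis search as in [30] would instead explore combinations of codewords across stages before committing to the indices.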
As mentioned before, the analysis settings have been shown to noticeably influence the performance of the proposed method. As pointed out in [13], "it is desirable for the formant filter parameters to evolve slowly, since their [short-term] fluctuations may be accentuated under quantization, creating audible distortions at update instants". Hence it may be desirable to carefully configure the analysis, or to preprocess the LSFs with a smoothing method (such as [13, 14] or a different one) before long-term quantization, in order to obtain trajectories free of the undesirable local fluctuations partly due to analysis (see Figure 3). This is likely to enable the proposed fitting algorithm to significantly lower the LT model order, and hence the bit rate, without impairing signal quality. A deeper investigation of this point is needed.
Beyond these potential improvements, future work may focus on the elaboration of several complete speech coders operating at very to ultra-low bit rates and exploiting the long-term approach. This requires an appropriate adaptation of the proposed algorithm to the coding of the excitation (residual signal). For example, ultra-low bit-rate coding with acceptable quality may be attainable with the long-term coding of basic excitation parameters such as the fundamental frequency, the voicing frequency (i.e., the frequency that "separates" the voiced region from the unvoiced region for mixed V/UV sounds), and the corresponding gains. Also, we intend to test the proposed long-term approach within the framework of (unit-based concatenative) speech synthesis. As mentioned in Section 2, the long-term model that is used here to exploit the predictability of LSF trajectories can also be directly used for time interpolation of those trajectories (a property not offered by 2D-transform coding; see Endnote 5). In other words, the proposed method offers an efficient framework for the direct combination of decoding and time interpolation, as required for speech transformation in (e.g., TTS) synthesis systems. It can be used to interpolate "natural" LSF (and also source parameter) trajectories, to be compared in future work with existing interpolation schemes of various complexities. Note that the proposed method is particularly suitable for unit-based synthesis, since it is naturally frame-length and bit-rate adaptive. Therefore, an appropriate mapping between speech units and long-term frames can be defined.^{14} As suggested by [13], the interaction between filter parameters and source parameters should be carefully examined within this long-term coding and interpolation framework.
Endnotes

1.
The differential VQ and other schemes such as predictive VQ and finite-state VQ can be seen as special cases of recursive VQ [2, 10], depending on the configuration.

2.
In the following, the term "long-term" refers to considering long sections of speech, including several to many short-term frames of about 20 ms. Hence, it has a different meaning from the "long-term (pitch) predictor" of speech coders.

3.
The V/UV segmentation is consistent with the expectation of somewhat "coherent" LSF trajectories over a given long-term section. Indeed, it is well known that these parameters behave quite differently for voiced and unvoiced sounds (see, e.g., [31]).

4.
In the following, all vectors of consecutive values in time are row vectors, while vectors of simultaneous values taken at a given time instant are column vectors. Matrices are organized accordingly.

5.
This means that, despite the matrix formalism, each line of (3) is the trajectory of one LSF coefficient, modeled independently of the trajectories of the other coefficients (except for the common model order). Accordingly, the regression of (4) can be calculated separately for each line, that is, for each set of model coefficients of (1). Hence, the coefficients of C are purely temporal model coefficients. In contrast, 2D-transform coefficients jointly concentrate both time and frequency information from the data (and such 2D models cannot be directly interpolated in one dimension).

6.
For the fixed-size 10-to-4 conversion of LSFs into polynomial coefficients. Recall that in the present study, the K-to-P conversion is of variable dimension.

7.
"Coding transparency" means that speech signals synthesized with the quantized and unquantized LSFs are perceptually indistinguishable.

8.
The methods [6–14] exploiting interframe LSF correlation are not pertinent in the present study. Indeed, the LSF vectors of the reduced set are sparsely distributed in the considered section of speech, and their correlation is likely to be poor.

9.
The analysis settings have been shown to slightly influence the performance of the proposed method, since they can provide successive LSF vectors with slightly different degrees of correlation. The present settings are different from those used in [24], and they provided slightly better results. They were partly suggested by [37]. Also, this suggests that the proposed method is likely to significantly benefit from a preprocessing of the LSFs with "short-term" smoothing methods, such as [13, 14] (see Section 5).

10.
Note that the 2D-DCT transform coefficients are fixed, whereas for the KLT they depend on the data; thus, for each tested temporal size, the KLT coefficients are also determined from the training data.

11.
We must ensure (i) a sufficient number of (voiced or unvoiced) sections of a given size to compute the corresponding bit-allocation tables and optimal scalar quantizers (and transform coefficients for the KLT), and (ii) a reasonable calculation time for experiments on such an extended corpus. Note that for the 2D-transform coding methods, voiced (resp., unvoiced) sequences larger than 20 (resp., 10) vectors are split into subsequences.

12.
In comparison, the adaptive (variable-size) 2D-transform coding methods require only approximately one tenth of real time, hence about one fifth of the resources of the proposed method. This is mainly because they do not require inverse matrix calculations but only direct matrix products.

13.
The proposed method is very flexible in the sense that it can be directly applied with any type of framewise quantizer.

14.
In the present study we used V/UV segmentation (and adapted coding), but other segmentations, better adapted to concatenative synthesis, can be considered (e.g., "CV" or "VCV"). Alternatively, all voiced or all unvoiced (subsets of) units could be considered in a synthesis system using the proposed method.
References
 1.
Markel JD, Gray AH Jr.: Linear Prediction of Speech. Springer, New York, NY, USA; 1976.
 2.
Gersho A, Gray RM: Vector Quantization and Signal Compression. Kluwer Academic Publishers, Boston, Mass, USA; 1992.
 3.
Pan J, Fischer TR: Vector quantization of speech line spectrum pair parameters and reflection coefficients. IEEE Transactions on Speech and Audio Processing 1998, 6(2):106–115. doi:10.1109/89.661470
 4.
Hedelin P: Single stage spectral quantization at 20 bits. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '94), 1994, Adelaide, Australia, 525–528.
 5.
Sugamura N, Itakura F: Speech analysis and synthesis method developed at ECL in NTT—from LPC to LSP. Speech Communication 1986, 5(2):199–215. doi:10.1016/0167-6393(86)90008-7
 6.
Yong M, Davidson G, Gersho A: Encoding of LPC spectral parameters using switched-adaptive interframe vector prediction. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '88), 1988, New York, NY, USA, 402–405.
 7.
Jean FR, Wang HC: Transparent quantization of speech LSP parameters based on KLT and 2-D prediction. IEEE Transactions on Speech and Audio Processing 1996, 4(1):60–66. doi:10.1109/TSA.1996.481453
 8.
Tsao C, Gray RM: Matrix quantizer design for LPC speech using the generalized Lloyd algorithm. IEEE Transactions on Acoustics, Speech, and Signal Processing 1985, 33(3):537–545. doi:10.1109/TASSP.1985.1164584
 9.
Xydeas CS, Papanastasiou C: Split matrix quantization of LPC parameters. IEEE Transactions on Speech and Audio Processing 1999, 7(2):113–125. doi:10.1109/89.748117
 10.
Samuelsson J, Hedelin P: Recursive coding of spectrum parameters. IEEE Transactions on Speech and Audio Processing 2001, 9(5):492–502. doi:10.1109/89.928914
 11.
Subramaniam AD, Gardner WR, Rao BD: Low-complexity source coding using Gaussian mixture models, lattice vector quantization, and recursive coding with application to speech spectrum quantization. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(2):524–532.
 12.
Subasingha S, Murthi MN, Andersen SV: Gaussian mixture Kalman predictive coding of line spectral frequencies. IEEE Transactions on Audio, Speech and Language Processing 2009, 17(2):379–391.
 13.
Zad-Issa MR, Kabal P: Smoothing the evolution of the spectral parameters in linear prediction of speech using target matching. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '97), 1997, Munich, Germany, 3: 1699–1702.
 14.
Nordén F, Eriksson T: Time evolution in LPC spectrum coding. IEEE Transactions on Speech and Audio Processing 2004, 12(3):290–301. doi:10.1109/TSA.2004.825664
 15.
Atal BS: Efficient coding of LPC parameters by temporal decomposition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '83), 1983, Boston, Mass, USA, 1: 81–84.
 16.
van Dijk-Kappers AML, Marcus SM: Temporal decomposition of speech. Speech Communication 1989, 8(2):125–135. doi:10.1016/0167-6393(89)90039-3
 17.
Cheng YM, O'Shaughnessy D: On 450–600 b/s natural sounding speech coding. IEEE Transactions on Speech and Audio Processing 1993, 1(2):207–220. doi:10.1109/89.222879
 18.
Farvardin N, Laroia R: Efficient encoding of speech LSP parameters using the discrete cosine transformation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '89), 1989, Glasgow, UK, 1: 168–171.
 19.
Mudugamuwa DJ, Bradley AB: Optimal transform for segmented parametric speech coding. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), 1998, Seattle, Wash, USA, 1: 525–528.
 20.
Dusan S, Flanagan J, Karve A, Balaraman M: Speech coding using trajectory compression and multiple sensors. Proceedings of the International Conference on Speech & Language Processing, 2004, Jeju, South Korea.
 21.
Dusan S, Flanagan J, Karve A, Balaraman M: Speech compression by polynomial approximation. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(2):387–395.
 22.
Girin L, Firouzmand M, Marchand S: Long-term modeling of phase trajectories within the speech sinusoidal model framework. Proceedings of the International Conference on Speech & Language Processing, 2004, Jeju, South Korea.
 23.
Girin L, Firouzmand M, Marchand S: Perceptual long-term variable-rate sinusoidal modeling of speech. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(2):851–861.
 24.
Girin L: Long-term quantization of speech LSF parameters. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), 2007, Honolulu, Hawaii, USA, 4: 845–848.
 25.
Girin L, Firouzmand M, Marchand S: Comparing several models for perceptual long-term modeling of amplitude and phase trajectories of sinusoidal speech. Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech '05), 2005, Lisbon, Portugal, 357–360.
 26.
Galas T, Rodet X: An improved cepstral method for deconvolution of source-filter systems with discrete spectra: application to musical sound signals. Proceedings of the International Computer Music Conference (ICMC '90), 1990, Glasgow, UK, 82–84.
 27.
Cappé O, Laroche J, Moulines E: Regularized estimation of cepstrum envelope from discrete frequency points. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '95), October 1995, New Paltz, NY, USA.
 28.
Paliwal KK, Atal BS: Efficient vector quantization of LPC parameters at 24 bits/frame. IEEE Transactions on Speech and Audio Processing 1993, 1(1):3–14. doi:10.1109/89.221363
 29.
Phamdo N, Farvardin N, Moriya T: Combined source-channel coding of LSP parameters using multistage vector quantization. Proceedings of the IEEE Workshop on Speech Coding for Telecommunications, 1991, 36–38.
 30.
LeBlanc WP, Bhattacharya B, Mahmoud SA, Cuperman V: Efficient search and design procedures for robust multi-stage VQ of LPC parameters for 4 kb/s speech coding. IEEE Transactions on Speech and Audio Processing 1993, 1(4):373–385. doi:10.1109/89.242483
 31.
Hagen R, Paksoy E, Gersho A: Voicing-specific LPC quantization for variable-rate speech coding. IEEE Transactions on Speech and Audio Processing 1999, 7(5):485–494. doi:10.1109/89.784101
 32.
Jayant NS, Noll P: Digital Coding of Waveforms: Principles and Applications to Speech and Video. Prentice-Hall, Englewood Cliffs, NJ, USA; 1984.
 33.
Linde Y, Buzo A, Gray RM: An algorithm for vector quantizer design. IEEE Transactions on Communications 1980, 28(1):84–95. doi:10.1109/TCOM.1980.1094577
 34.
Garofolo JS, Lamel LF, Fisher WM, et al.: TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium, Philadelphia, Pa, USA; 1993.
 35.
Ferrer-Ballester MA, Figueiras-Vidal AR: Efficient adaptive vector quantization of LPC parameters. IEEE Transactions on Speech and Audio Processing 1995, 3(4):314–317. doi:10.1109/89.397097
 36.
Subramaniam AD, Rao BD: PDF optimized parametric vector quantization of speech line spectral frequencies. IEEE Transactions on Speech and Audio Processing 2003, 11(2):130–142. doi:10.1109/TSA.2003.809192
 37.
Kabal P: Personal communication.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cite this article
Girin, L. Adaptive LongTerm Coding of LSF Parameters Trajectories for LargeDelay/Very to UltraLow BitRate Speech Coding. J AUDIO SPEECH MUSIC PROC. 2010, 597039 (2010). https://doi.org/10.1155/2010/597039