The algorithm described previously was implemented and tested on 10 mono audio pieces of 6.4 seconds duration each. These audio pieces comprise four generic music pieces and six single-instrument pieces from the EBU-SQAM [29] testing database. The generic music pieces are one Rock piece with a male singing voice, one Jazz piece, one Electronic piece with very high energy at high frequencies, and one Classical music symphony piece. The six SQAM pieces are recordings of a flute, a violin, a piano, a double bass, drums, and a harp. In all 10 cases, we attempt to enhance a 6.4-second source segment which is MP3 encoded (with LAME) at a constant bit rate of either 32 kbps or 64 kbps. The testing target segments are the corresponding uncompressed (WAV) versions of the same audio segments. The source and target pieces in all examples are time-aligned and, since the algorithm is applied in a post-processing stage, the MP3 sources are also converted to WAV format (PCM data).
5.1. Critical Band Analysis
The first stage of the algorithm implementation is to apply a subband analysis on the source and target signals as well as on the training set. A popular method for subband analysis uses wavelet filters which achieve perfect reconstruction [30] but other perfect reconstruction filterbanks can be used instead. A suitable wavelet candidate is the Daubechies [31] wavelet filter of order 40, which achieves a very efficient subband separation.
For the source and target signals, several different wavelet tree structures were tested (e.g., equidistant subbands) but the most successful structure proved to be one that emulates the critical bands of the human hearing system as in [32]. It assigns increased frequency resolution to the range 20 Hz–5.5 kHz in which human hearing is most sensitive. In addition, the large number of subbands selected allows us, as we show later, to take advantage of the interband redundancy and also to process accurately the subbands that are the most significant (i.e., the ones that are more degraded or carry the perceptually important parts of the signal). The actual wavelet filterbank is shown in Figure 3 and is applied to both source (compressed) and target (uncompressed) signals, leading to 17 source and 17 target subbands. Note that each time a signal is wavelet-filtered, the resulting two signals are decimated by a factor of two (i.e., critically sampled). The training set is also separated into subbands but, for reasons explained in the next subsection, its subband tree is different from that of the source and target signals.
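The critical sampling and perfect reconstruction properties mentioned above can be illustrated with a minimal sketch. It makes two simplifying assumptions that are not the paper's setup: the two-tap Haar filter pair stands in for the order-40 Daubechies filter, and the tree splits only the low band at each level instead of following the critical-band tree of Figure 3. The function names are hypothetical.

```python
import math

_S = 1.0 / math.sqrt(2.0)


def haar_analysis(x):
    """Split an even-length signal into a low and a high band, each decimated by two."""
    low = [_S * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [_S * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high


def haar_synthesis(low, high):
    """Invert haar_analysis: interleave the reconstructed even and odd samples."""
    x = []
    for l, h in zip(low, high):
        x.append(_S * (l + h))
        x.append(_S * (l - h))
    return x


def analyze_tree(x, depth):
    """Split the low band `depth` times; returns [high_1, ..., high_depth, low_depth]."""
    bands = []
    current = list(x)
    for _ in range(depth):
        current, high = haar_analysis(current)
        bands.append(high)
    bands.append(current)
    return bands


def synthesize_tree(bands):
    """Rebuild the time-domain signal from the output of analyze_tree."""
    current = bands[-1]
    for high in reversed(bands[:-1]):
        current = haar_synthesis(current, high)
    return current


signal = [float((i * 7) % 5) for i in range(16)]
subbands = analyze_tree(signal, 3)
rebuilt = synthesize_tree(subbands)
# Critical sampling: the subbands together hold exactly as many samples as the input.
print([len(b) for b in subbands], max(abs(a - b) for a, b in zip(signal, rebuilt)))
```

Because every split is followed by decimation by two, the total number of subband samples equals the input length, and the synthesis stage recovers the input up to rounding error.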
5.2. The Training Set
An important part of the algorithm is to derive a generalized Gaussian mixture pdf that does not have to adjust to the particular testing music piece. This probabilistic model should be global in the sense that it will include the statistical properties of all possible audio pieces and both transmitting and receiving ends will have access to it (e.g., prestored on both sides). This does not mean that the mixture pdf will accurately model any of the particular testing music pieces but rather that it will capture the main (subband-specific) statistical properties of the testing source cepstral vectors. In essence, we ensure that the conversion function acquires appropriate mixture model parameters (i.e., cluster means and variances) so that the derivation of the conversion parameters is not ill-conditioned.
Several candidate training sets were processed to produce a mixture pdf, among which were the multichannel training set of [2] (1 minute of an orchestra recording), a white noise training set, a Brownian noise training set, and a pink noise training set. Pink noise proved to be the most suitable training set and led to smaller cepstral reconstruction errors (up to 5% less in all subbands compared to the other sets) during enhancement of the 4 generic music pieces. The power spectrum of pink noise is proportional to $1/f$, where $f$ is the frequency. An approximation to pink noise can be created by starting from the discrete Fourier transform (DFT) magnitude (taken as proportional to $1/\sqrt{f}$), adding uniformly distributed random phase and applying the inverse DFT (real part).
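The pink noise construction described above can be sketched as follows. This is only an illustration under stated assumptions: since pink noise power falls off as 1/f, the DFT magnitude is taken as 1/sqrt(k) for bin k, the DC bin is set to zero, and Hermitian symmetry is imposed so that the inverse DFT is real up to rounding (taking the real part then loses nothing). The naive O(n^2) inverse DFT and the name `pink_noise` are choices made for this sketch, not the authors' implementation.

```python
import cmath
import math
import random


def pink_noise(n, seed=0):
    """Approximate pink noise of even length n via a random-phase inverse DFT."""
    rng = random.Random(seed)
    spectrum = [0j] * n  # the DC bin stays at zero
    for k in range(1, n // 2):
        magnitude = 1.0 / math.sqrt(k)  # magnitude ~ 1/sqrt(f), so power ~ 1/f
        phase = rng.uniform(0.0, 2.0 * math.pi)  # uniformly distributed phase
        spectrum[k] = cmath.rect(magnitude, phase)
        spectrum[n - k] = spectrum[k].conjugate()  # mirror bin for a real signal
    spectrum[n // 2] = complex(1.0 / math.sqrt(n // 2), 0.0)  # Nyquist bin is real

    samples = []
    for t in range(n):  # naive inverse DFT, adequate for short test signals
        acc = 0j
        for k in range(n):
            acc += spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
        samples.append(acc.real / n)
    return samples


noise = pink_noise(64, seed=1)
print(len(noise), abs(sum(noise)) < 1e-9)
```

For a training set of realistic length, an FFT routine would replace the double loop; the spectral shaping step is unchanged.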
In order to reduce the training model size and to provide the data diversity needed for ML estimation with many mixture components, we divide the training data set into 4 large equidistant subbands (instead of 17 subbands) covering the frequency range 20 Hz–22 kHz. Each training subband consists of 12 000 cepstral vectors of cepstral order 30. ML parameter estimation, as described in Section 2, is performed on each training subband separately. The training procedure is shown in the flow diagram of Figure 4.
In Figure 5, the validity of the estimation algorithm, as described in Section 2, is illustrated. Even though a generalized Gaussian mixture model of 40 groups is recommended (as determined by the MDL [33] and AIC [34, 35] criteria), we decrease this number to 4 for all 4 training subbands. The fitting of the mixture pdf to the histogram is still accurate.
5.3. Signal Enhancement
The signal enhancement procedure that takes place at the transmitter side for a source (compressed) and target (uncompressed) signal is shown in Figure 6. The source and target signals are separated into 17 subbands as mentioned before. The resulting subband signals are LPC analyzed and the LPC cepstral and residual vectors are extracted. The cepstral order for subbands 1–13 is 8 while for subbands 13–17 it is 15, accounting for the larger frequency bandwidth of the high subbands. Generally, not all subband signals and not all vectors within each subband require cepstral or residual conversion, as explained later, and this is the role of the subband and vector selection that occurs right after the LPC analysis. The selected source and target vectors (cepstral or residual) of the selected subband are then sent for statistical conversion. However, the statistical conversion process requires the mixture model parameters, and each selected subband acquires these model parameters from the one of the 4 larger training subbands that it is part of. This is shown in Figure 6 as the role of the training subband switch. Statistical conversion can now be performed for the selected subband and the corresponding conversion parameters and sorting information are derived as described in Sections 2 and 3. These are finally transmitted to the receiver as the transmitted parameters. During cepstral conversion, the cepstral order of the training model is truncated appropriately for each source/target subband to match the lower cepstral order of the particular source and target cepstral vectors (8 or 15). The reason for this is that the source cepstral vectors are assumed to be generated by the mixture pdf derived during training, so the dimensionality of the training and source cepstral vectors must be the same.
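The training subband switch and the truncation of the training model can be sketched as below. The sketch rests on assumptions that the text does not confirm: a subband is assigned to a training subband by a single representative (center) frequency, the 4 training subbands split 20 Hz–22 kHz into equal widths, and each mixture component stores a weight, a mean vector, and per-dimension variances (diagonal covariance). All names are hypothetical.

```python
def training_band_index(center_hz, f_lo=20.0, f_hi=22000.0, n_bands=4):
    """0-based index of the equidistant training subband that contains center_hz."""
    if not f_lo <= center_hz <= f_hi:
        raise ValueError("frequency outside the training range")
    width = (f_hi - f_lo) / n_bands
    return min(int((center_hz - f_lo) / width), n_bands - 1)


def truncate_model(model, order):
    """Keep only the first `order` cepstral dimensions of every mixture component."""
    return [
        {
            "weight": comp["weight"],
            "mean": comp["mean"][:order],
            "var": comp["var"][:order],
        }
        for comp in model
    ]


# A toy 2-component model of cepstral order 30 (values are placeholders).
model_30 = [
    {"weight": 0.4, "mean": [0.0] * 30, "var": [1.0] * 30},
    {"weight": 0.6, "mean": [0.5] * 30, "var": [2.0] * 30},
]
model_8 = truncate_model(model_30, 8)
print(training_band_index(1000.0), len(model_8[0]["mean"]))
```

Under the diagonal-covariance assumption, dropping the trailing dimensions yields exactly the marginal distribution over the retained cepstral coefficients, so the mixture weights are left unchanged.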
At the receiver side, the compressed source signal is separated into 17 subbands and LPC analyzed in the same way as at the transmitter side. The transmitted parameters are applied on each source subband signal that was selected for statistical conversion during the parameter extraction procedure of Figure 6. Specifically, the conversion parameters are used to convert each of the selected sorted source subband vectors (cepstral or residual) to the resynthesized ones and the sorting information is used to rearrange them to their correct order. After the resynthesized subband vectors are created, we perform LPC synthesis on each resynthesized subband to produce the time domain subband signals. These signals, along with the source subband signals that were not converted, are combined through wavelets (i.e., the inverse of the 17-subband separation) and the final time domain resynthesized signal is created.
5.3.1. Phase Redundancy
In the low subbands particularly, it has been observed that the signs of the source and target vectors are mostly the same. To take advantage of this observation, we take the absolute value of the target vectors before the derivation of the conversion function. The absolute value of the target vectors can be estimated more accurately compared to the raw target data. To recover the lost sign of the resynthesized vectors at the receiver, a 1-bit sequence is included along with the sorting information. For every source-target coefficient pair, a 0 is sent if the signs are equal, otherwise a 1 is sent. For subbands below 5.5 kHz, the proportion of 1's in each sign sequence is between 1% and 10% while in the higher subbands it tends towards the expected 50% value. This means that in the high frequencies there is little phase redundancy.
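The sign handling described above can be sketched as follows. Function names are hypothetical, and treating a zero coefficient as nonnegative is an assumption made here for definiteness.

```python
def encode_signs(source, target):
    """Transmitter: return |target| and one bit per pair (0 = same sign, 1 = different)."""
    bits = [0 if (s >= 0) == (t >= 0) else 1 for s, t in zip(source, target)]
    magnitudes = [abs(t) for t in target]
    return magnitudes, bits


def restore_signs(source, magnitudes, bits):
    """Receiver: copy the source sign and flip it wherever the transmitted bit is 1."""
    restored = []
    for s, m, b in zip(source, magnitudes, bits):
        sign = 1.0 if s >= 0 else -1.0
        if b:
            sign = -sign
        restored.append(sign * m)
    return restored


source = [0.5, -1.0, 2.0, -0.3]
target = [0.7, -0.8, -1.5, 0.2]
magnitudes, bits = encode_signs(source, target)
print(bits, restore_signs(source, magnitudes, bits))
```

In this sketch the receiver is handed the exact magnitudes; in the scheme of the text they would instead be the estimates produced by the conversion function.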
5.3.2. Intraband Redundancy
Further redundancies can be found in the time domain for each subband signal pair that goes through cepstral or residual conversion. The differences between the samples of the source and target subband signals do not carry the same significance in terms of resulting audio quality and in some cases the source and target subband samples of a particular time frame are almost identical. For instance, the first subband of Figure 3 for a 32-kbps bit rate MP3 signal is usually not severely degraded and many source-target sample pairs can be neglected during conversion. Another example is the silence regions that occur in a speech signal.
For the subbands that preserve their energy in the source signal (and thus are not severely degraded), we adopt a threshold rule based solely on the source signal information. According to this rule, the source subband samples whose absolute value is below a certain threshold are excluded from conversion. The rationale behind this is that source subband samples with relatively small amplitude either correspond to less audible parts of the subband target signal or have been suppressed because the codec that compressed the source classified them as perceptually insignificant. The advantage of this method is that the receiver knows which samples have been discarded, as long as the threshold for each processed subband is transmitted, which is side information of negligible size. The disadvantage of this method is that other samples that are perceptually irrelevant are not detected. This is a more general problem of our algorithm since, as mentioned before, it does not use a psychoacoustic model.
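A minimal sketch of this threshold rule is given below, with hypothetical function names. It shows why only the threshold needs to be transmitted: the receiver recomputes the same selection from the source subband, which it already has.

```python
def select_by_threshold(source, threshold):
    """Indices of source subband samples whose magnitude reaches the threshold."""
    return [i for i, s in enumerate(source) if abs(s) >= threshold]


def merge_converted(source, converted, threshold):
    """Receiver: recompute the selection and splice the converted samples back in."""
    indices = select_by_threshold(source, threshold)
    if len(indices) != len(converted):
        raise ValueError("converted samples do not match the selection")
    output = list(source)  # unselected samples are taken directly from the source
    for i, value in zip(indices, converted):
        output[i] = value
    return output


source = [0.05, -0.8, 0.3, -0.02, 0.6]
print(select_by_threshold(source, 0.1))
print(merge_converted(source, [-1.0, 0.4, 0.7], 0.1))
```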
The samples that pass the threshold test will now form new source and target subband signals on which cepstral and residual extraction is applied for the derivation of the particular conversion function. Nevertheless, residual conversion incurs significant overhead compared to cepstral conversion, and thus a second, less strict, threshold rule can now be applied on the residual vectors only, to further reduce the residual conversion overhead. The most significant source-target residual vector pairs are selected by locating the pairs that yield a high quadratic vector distance between them. Consequently, a small amount of side information, indicating which residual vectors in each subband are selected for conversion, has to be transmitted to the receiver.
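The selection of residual vector pairs can be sketched as below. It assumes that the quadratic vector distance is the squared Euclidean distance and that a pair is kept when this distance exceeds a per-subband threshold; the text does not give the exact rule, so both points and the function names are assumptions. The returned index list plays the role of the side information.

```python
def quadratic_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def select_residual_pairs(source_vecs, target_vecs, threshold):
    """Indices of source-target residual pairs whose distance exceeds the threshold."""
    return [
        i
        for i, (s, t) in enumerate(zip(source_vecs, target_vecs))
        if quadratic_distance(s, t) > threshold
    ]


source_vecs = [[1.0, 0.0], [0.0, 0.0], [2.0, 2.0]]
target_vecs = [[1.0, 0.1], [1.0, 1.0], [0.0, 0.0]]
print(select_residual_pairs(source_vecs, target_vecs, 1.0))
```

Unlike the sample-level threshold, this rule depends on the target, which the receiver does not have, so the selected indices must be sent explicitly.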
In the case that a source-target sample pair is determined to be insignificant, the source sample of that pair is used directly for signal resynthesis at the receiver, bypassing the conversion process. However, note that in the case where a source subband has lost most of its energy, and it is perceptually important, all of its samples pass through cepstral and residual conversion and no threshold rules are applied. The LPC target gains for the selected subbands are transmitted as side information since they are crucial in recovering the lost energy of the source subbands.
5.3.3. Interband Redundancy
Naturally, not all subbands are expected to be severely degraded by the compression process and not all of them are perceptually important. A 32-kbps MP3 signal will usually sustain moderate distortion in the frequency range 20 Hz–5.5 kHz and some of the corresponding subbands can be completely neglected from residual or even cepstral conversion. On the other hand, the higher subbands are the most distorted, mainly because they are less perceptible to the human ear. The 7 highest subbands, as seen from Figure 3, are large and therefore require longer cepstral and residual vectors during LPC analysis compared to the first 10 subbands. These subbands, if selected for cepstral and residual conversion, will add considerable transmission overhead. A simple method to determine which high-frequency subbands of the source signal to process is to compare them with the subband energies of a signal that has been compressed with the same codec as the source file but at a higher bit rate such that its audio quality is roughly comparable to the expected quality of the enhanced signal. This should give us an insight into which subbands are perceptually important with the use of the codec's own psychoacoustic model.
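One way to realize this energy comparison is sketched below. The decision rule is an assumption: a subband is selected when the higher-bit-rate reference carries at least `ratio` times the energy of the source subband. The text specifies neither the rule nor a value for the ratio, and the function names are hypothetical.

```python
def subband_energy(samples):
    """Sum of squared samples of one subband signal."""
    return sum(s * s for s in samples)


def select_high_subbands(source_bands, reference_bands, ratio=2.0):
    """Indices of subbands where the reference has at least `ratio` times the source energy."""
    selected = []
    for i, (src, ref) in enumerate(zip(source_bands, reference_bands)):
        e_src = subband_energy(src)
        e_ref = subband_energy(ref)
        # An empty reference subband means the codec judged it unimportant even
        # at the higher bit rate, so it is skipped.
        if e_ref > 0.0 and e_ref >= ratio * e_src:
            selected.append(i)
    return selected


source_bands = [[1.0, 1.0], [0.1, 0.1], [0.0, 0.0]]
reference_bands = [[1.0, 1.1], [0.5, 0.5], [0.0, 0.0]]
print(select_high_subbands(source_bands, reference_bands))
```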
The selected high subbands require only an approximate reconstruction such that the overall envelope of the desired subband signal is preserved. The human ear is more sensitive to the low subbands but even for these we determined that the cepstral and residual conversion, as described in the previous sections, is extremely accurate at the cost of high overhead size. For this reason, we apply a more subband-adaptive technique by increasing the degree of sorting similarity between the source vectors and target vectors according to the perceptual significance of the subband they belong to. The straightforward way to achieve this is to add the source vectors to the target vectors multiple times, creating a modified target set. Combined with the phase redundancy observation, the modified target set consists of the absolute values of the target vectors plus a constant number of copies of the source vectors. After sorting the modified target set, the original positions of its coefficients will be more similar to the original positions of the sorted source set, depending on how many times the source set was added to the target set. We call this constant the multiplier. This modification is easily reversible at the receiver because the source set is always available and the multiplier can be transmitted as side information. The resulting sorting information size will now be smaller than the size derived in Section 4 and can be adjusted through the multiplier depending on the degree of enhancement desired, enabling scalable overhead transmission. As a rule of thumb, we increase the multiplier as we move to higher subbands so that we progressively decrease the reconstruction accuracy and the sorting information size.
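The effect of the multiplier on sorting similarity can be sketched as follows. The sketch assumes the modified target is the absolute value of the target plus the multiplier times the source, and it uses nonnegative source values for simplicity; the exact form used by the authors may differ. The function names and the mismatch count used as a similarity measure are hypothetical.

```python
def argsort(values):
    """Original positions of the coefficients, listed in ascending order of value."""
    return sorted(range(len(values)), key=lambda i: values[i])


def modify_target(source, target, multiplier):
    """Add `multiplier` copies of the source to the absolute-valued target."""
    return [abs(t) + multiplier * s for s, t in zip(source, target)]


def undo_modification(source, modified, multiplier):
    """Receiver: subtract the known source contribution to recover |target|."""
    return [v - multiplier * s for s, v in zip(source, modified)]


def order_mismatches(a, b):
    """Number of sorted positions at which the two sets disagree on the original index."""
    return sum(1 for i, j in zip(argsort(a), argsort(b)) if i != j)


source = [0.1, 0.4, 0.2, 0.9, 0.6]
target = [0.5, -0.3, 0.8, 0.7, 0.1]
for m in (0, 10):
    modified = modify_target(source, target, m)
    print(m, order_mismatches(source, modified))
```

With a multiplier of 0 the two sort orders in this toy example disagree at every position, whereas with a multiplier of 10 they coincide, so almost no sorting information would be needed. The price is that the target contributes less to the ordering, which matches the accuracy-versus-overhead trade-off described above.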