Compared with a conventional ECVQ, such as the quantization methods of USAC, whose architecture will be described in detail in Section 4, the FS-ECVQ can be regarded as a super-ECVQ, in which multiple component ECVQs are contained and combined by an FSVQ. Thus, the FS-ECVQ consists of two steps. The first step is to split the source sequence into multiple clusters using the FSVQ (the main FSVQ), and the second step is to apply a dedicated conventional ECVQ to each cluster. Supposing that the largest available vector dimension is 8, the whole coding scheme of an FS-ECVQ is demonstrated in Figure 1.

### 3.1 Main FSVQ

The major function of the main FSVQ is to partition the input space into four non-overlapping clusters according to the four states it contains. For each resulting cluster, a component ECVQ is constructed with a different vector dimension and different memory requirements. By this means, the total memory requirements can be efficiently decreased. The state transition is determined by a next-state function, which is the key component of the main FSVQ. In the remainder of this section, we mainly discuss the construction of the next-state function.

The next-state function of the main FSVQ is built on the dependencies among audio MDCT coefficients. In practice, audio signals are usually divided into a series of time intervals (often referred to as 'frames') owing to their long-term time-varying nature, and over a given frame they are assumed to be stationary. Thus, as an expression of the audio signal in the MDCT domain, there are strong dependencies among the MDCT coefficients of adjacent frames as well as among the coefficients within one frame. Based on these inter- and intra-frame correlations, the MDCT coefficients of the current frame can be estimated by prediction. In our work, the audio MDCT coefficient frames are further divided into small blocks, and then, by estimating the shape properties of these blocks over multiple sequential frames, the next-state function is constructed to exploit both the intra- and inter-frame dependencies. In fact, within an audio MDCT coefficient sequence, the occurrence frequency of a block is highly related to its shape features. Supposing the block size is 4, the relationship between the shape feature and the occurrence of a block is demonstrated in Figure 2.

To characterize the shape of a block, three statistical parameters, the block energy *e*, the block deviation *σ*, and the block skewness *g*, are employed in our work. Let *μ* be the mean value of block **x**. Then the parameters *e*, *σ*, and *g* can be written as

e = \frac{1}{N}\sum_{i=1}^{N} x_i^2

(18)

\sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2

(19)

g = \frac{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^3}{\left(\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2\right)^{3/2}}

(20)

where *x*_{i} and *N* are the *i*-th element and the length of block **x**, respectively. To describe the shape feature of block **x** in a simplified form, a new statistical parameter, *V*_{**x**}, is defined, which is given as

V_{\mathbf{x}} = (\sigma + e)\cdot\bigl(1 + \log(1 + |g|)\bigr).

(21)
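As a concrete illustration, the shape parameters of Equations (18) to (20) and the combined feature of Equation (21) can be computed as follows. This is a minimal sketch; the helper names `shape_params` and `shape_feature` are our own and not part of the reference implementation.

```python
import math

def shape_params(x):
    """Block energy e, deviation sigma, and skewness g (Equations 18-20)."""
    n = len(x)
    mu = sum(x) / n
    e = sum(v * v for v in x) / n                    # (18) block energy
    sigma = sum((v - mu) ** 2 for v in x) / n        # (19) block deviation
    m3 = sum((v - mu) ** 3 for v in x) / n
    g = m3 / sigma ** 1.5 if sigma > 0 else 0.0      # (20) block skewness
    return e, sigma, g

def shape_feature(x):
    """Combined shape feature V_x of Equation (21)."""
    e, sigma, g = shape_params(x)
    return (sigma + e) * (1.0 + math.log(1.0 + abs(g)))
```

For a constant block the deviation and skewness vanish, so the feature reduces to the block energy alone.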

Once a source sequence is split into a series of blocks, the value of *V*_{**x**} is calculated for each block. Thus, a mapping can be established between the set of all possible values of *V*_{**x**} and the input space *Ω*. Then, by splitting the possible values of *V*_{**x**} into two segments, we can partition the input space *Ω* into two clusters, *Ω*_{k} and {\Omega}_{k}^{\text{C}}, where *k* denotes the dimension of cluster *Ω*_{k}. To implement the split, a threshold *V*_{T} is employed, whose value is obtained from the training data by maximizing the coding gain of the FS-ECVQ under the constraint on the total memory requirements. Of the two resulting clusters, *Ω*_{k} is supposed to contain the blocks that occur relatively frequently, whereas {\Omega}_{k}^{\text{C}} is assumed to hold those that occur relatively rarely.

To construct the next-state function, four previous blocks, *A*, *B*, *C*, and *D*, adjacent to the current block **x**, can be employed [29]. For simplicity, we assume that the current block and its previous neighbors form a Markov chain [26]. The relative positions of all these blocks are demonstrated in Figure 3. Assuming that the shape parameters of the four adjacent blocks are independent measurements, then according to the work of Nasrabadi et al. [30], the conditional joint posterior probability, on which the next-state function is built, is given as

P(V_{\mathbf{x}} \mid V_A, V_B, V_C, V_D) = \frac{P(V_{\mathbf{x}})\,\prod_{i=A}^{D} P(V_i \mid V_{\mathbf{x}})}{\prod_{i=A}^{D} P(V_i)}

(22)

where *V*_{**x**} and *V*_{i} are the shape parameters of block **x** and of its four neighbors *A*, *B*, *C*, and *D*, respectively. Suppose that *P*(*V*_{**x**}) and *P*(*V*_{i}) are measured independently and considered to be equal; then, by Bayes' rule, the probability *P*(*V*_{i}|*V*_{**x**}) equals *P*(*V*_{**x**}|*V*_{i}), the conditional probability of the parameter *V*_{**x**} given one of its neighbors *V*_{i}, for *i*=*A*, *B*, *C*, and *D*. These conditional probabilities can be obtained by recording all the cluster pairs occurring together in the training data. Assume further that all the shape parameters obey the same probability distribution; then the conditional joint probability *P*(*V*_{**x**}|(*V*_{A},*V*_{B},*V*_{C},*V*_{D})) depends only on the four conditional probabilities *P*(*V*_{**x**}|*V*_{i}), for *i*=*A*, *B*, *C*, and *D*, and the other terms in (22) are constant for any input block. Therefore, we can build the next-state function (6) on these four conditional probabilities, and the current state of the main FSVQ, *s*, can be given by

s = \gamma(V_A, V_B, V_C, V_D) = \max_{\mathbf{x}\in\{\Omega_k,\;\Omega_k^{\text{C}}\}} \prod_{i=A}^{D} P(V_{\mathbf{x}} \mid V_i)

(23)

which denotes an estimation of the cluster to which the current block is most likely to be classified.
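Under these assumptions, evaluating (23) amounts to forming, for each candidate cluster, the product of four conditional probabilities learned from training counts and picking the most probable cluster. The following minimal sketch assumes a lookup table `cond_prob[c][v]` holding *P*(*V*_{**x**}=*c* | *V*_{i}=*v*); both the table layout and the discretized neighbor labels are illustrative, not the paper's data structures.

```python
def next_state(neighbors, cond_prob):
    """Estimate the current state per Equation (23).

    neighbors: the four shape-parameter values (V_A, V_B, V_C, V_D),
               discretized to the same labels used in cond_prob.
    cond_prob: cond_prob[c][v] ~ P(V_x = c | V_i = v), e.g. obtained by
               counting co-occurring cluster pairs in training data.
    """
    best, best_p = None, -1.0
    for cluster, table in cond_prob.items():
        p = 1.0
        for v in neighbors:          # product over the four neighbors
            p *= table.get(v, 0.0)
        if p > best_p:
            best, best_p = cluster, p
    return best
```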

To split the source sequence into smaller clusters, a pyramidal decomposition algorithm is employed, as demonstrated in Figure 4. In this algorithm, a block **x**, whose length equals the largest available vector dimension (assumed to be 8), is first separated from the source sequence. Then the current state *s* of block **x**, calculated through (23), is compared with a given threshold, *T*_{8}. If *s* is lower than *T*_{8}, block **x** is taken as an element of cluster *Ω*_{8}. Otherwise, it is decomposed equally into two smaller blocks, {\mathbf{x}}_{4}^{\left(1\right)} and {\mathbf{x}}_{4}^{\left(2\right)}, whose vector dimensions are both 4, and block {\mathbf{x}}_{4}^{\left(1\right)} becomes the new current block. Once again, the current state *s* is calculated and compared with another threshold, *T*_{4}. If the obtained state *s* is lower than *T*_{4}, block {\mathbf{x}}_{4}^{\left(1\right)} is taken as an element of cluster *Ω*_{4}; otherwise, it is decomposed once more. This procedure continues iteratively until the lowest available vector dimension (assumed to be 1) is reached. Since each threshold can be regarded as an occurrence-frequency criterion, blocks considered to have low occurrence frequency are split iteratively until a suitable vector dimension is found. The whole procedure is summarized in Algorithm 1.
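The decomposition above can be sketched as a recursion over block halves. Here `state_fn` and `thresholds` are hypothetical stand-ins for Equation (23) and the trained thresholds *T*_{8}, *T*_{4}, …; applying the same rule to the second half-block is our reading of Algorithm 1.

```python
def pyramidal_decompose(x, state_fn, thresholds, min_dim=1):
    """Split block x in half whenever its state s = state_fn(x) is not
    below the threshold T_d for its dimension d, down to min_dim."""
    d = len(x)
    # Keep the block if it is small enough or its state is below T_d;
    # a missing threshold means "never split at this dimension".
    if d <= min_dim or state_fn(x) < thresholds.get(d, float("inf")):
        return [x]                     # element of cluster Omega_d
    half = d // 2
    return (pyramidal_decompose(x[:half], state_fn, thresholds, min_dim)
            + pyramidal_decompose(x[half:], state_fn, thresholds, min_dim))
```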

At the beginning, there is no previous block; therefore, an initial state, *s*_{0}, must be set by the main FSVQ.

### 3.2 ECVQ

Based on the work of Gray et al. [25], in our work, the *Z*_{n} lattice quantizer and an arithmetic coder are selected as the lattice quantizer and the length function of each component ECVQ, respectively. Unlike a conventional ECVQ [15, 17], in which all the vector indices generated by the lattice quantizer share the same length function regardless of their possible differences, in our work multiple length functions are available and the optimal one is selected by another FSVQ (the sub-FSVQ) for each generated vector index. Moreover, to improve the robustness and, at the same time, decrease the memory requirements of each component ECVQ, the design of the sub-FSVQ is optimized and an iterative method for merging similar length functions is proposed.

The length functions are implemented by an arithmetic coder based on the pdf models of the input indices. Hence, the main task of the sub-FSVQ is to search a predesigned collection of pdf models for the optimal one, based on the information obtained from previous indices.

#### 3.2.1 Lattice quantizer

The issue of whether an optimal ECVQ has a finite or infinite number of codevectors has been investigated in depth by György and Linder [31]. They found that an optimal ECVQ has a finite number of codevectors only if the tail of the source distribution is lighter than that of a Gaussian distribution. Regarding the probability distribution of an audio MDCT coefficient sequence, Yu et al. [32] showed that a generalized Gaussian function with distribution parameter *r*=0.5 provides a good approximation. Moreover, in practice, the possible values of the audio MDCT coefficients are finite and concentrated in a finite range. Therefore, in our work, all codevectors of the lattice quantizer are simply constrained to the range

P\{\|X\| \le t_0\} \ge p_0

(24)

where *X* denotes an input vector, and *t*_{0} and *p*_{0} are two thresholds that constrain the norm and the probability of input vector *X*, respectively.

Since all the codevectors are constructed within the range (24), input vectors falling outside this range suffer a larger quantization loss than those inside it. Such a situation should usually be avoided in audio MDCT coefficient quantization. To keep the possible quantization error constant, an input vector that falls outside the range (24) is split into two parts, the least significant bits (LSB) and the most significant bits (MSB), and the two parts are encoded separately. Let **x**=(*x*_{1},…,*x*_{k}) be a candidate vector of dimension *k*. Assume that after each split the generated MSB and LSB parts are denoted by **x**^{∗} and {B}_{i}=({b}_{0}^{i},{b}_{1}^{i},\dots ,{b}_{k}^{i}), respectively, where *i* denotes the *i*-th split. To indicate that an overflow has occurred, a special symbol, *escape symbol*, is employed. The whole procedure is demonstrated in Algorithm 2.
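The split can be sketched as repeatedly shifting out one bit plane until the vector falls inside the range (24). The predicate `in_range` is a hypothetical stand-in for the test against *t*_{0}, and the bit-plane layout here is illustrative rather than the exact format of Algorithm 2; Python's floor right-shift on negative integers keeps the round-trip `(m << 1) | b` exact for signed coefficients.

```python
def split_msb_lsb(x, in_range):
    """Shift out LSB planes B_i of integer vector x until the remaining
    MSB part x* satisfies the range predicate; return (x*, planes)."""
    lsb_planes = []
    while not in_range(x):
        lsb_planes.append([xi & 1 for xi in x])   # current LSB plane B_i
        x = [xi >> 1 for xi in x]                 # keep the MSB part x*
    return x, lsb_planes
```

A decoder would rebuild the vector by re-inserting the planes in reverse order.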

#### 3.2.2 Sub-FSVQ

This FSVQ is used to search a predesigned collection of length functions for the optimal one with which to encode the current vector index generated by the lattice quantizer. The next-state function of the sub-FSVQ, {\gamma}_{{s}_{i}}, is built on the four previous indices *I*_{A}, *I*_{B}, *I*_{C}, and *I*_{D}, adjacent to the current input, *I*_{**x**}. Since the ECVQ holds a finite number of codevectors, the simplest way to construct the next-state function is to enumerate all possible combinations of the four neighbors, each denoting a certain state. However, as the number of codevectors increases, the number of possible current states becomes extremely large, and thus the memory requirements and the computational cost skyrocket.

To reduce the number of possible current states, the different dependencies between the current index and its four previous neighbors must be taken into account. In practice, less emphasis is placed on indices *I*_{A} and *I*_{C} than on indices *I*_{B} and *I*_{D}, because the current vector **x** is less correlated with vectors *A* and *C* than with vectors *B* and *D*. Thus, we apply the operation ||·||_{2} to vectors *A* and *C* so as to reduce the number of their possible values.

The location of the current vector should also be considered. The frame in which the current vector is located can generally be classified into two types: the normal frame and the reset frame. In addition, within a frame the current vector can be located either at a normal position or at the starting position. Thus, four cases exist, as demonstrated in Figure 5. In particular, if the current vector is located at the starting position of a reset frame, there is no adjacent vector on which to build the next-state function, so a special state, {s}_{{s}_{0}}, is assigned.

As a result, the next-state function of the sub-FSVQ can be written as

\begin{aligned}
s_{s_i} &= \gamma_{s_i}\bigl(I_B, I_D, I_{\|A\|_2}, I_{\|C\|_2}\bigr) \\
&= \begin{cases}
t_0 I_B + t_1 I_D + t_2\bigl(I_{\|C\|_2} + I_{\|A\|_2}\bigr) & \text{Case (a)} \\
t_0 I_B + t_2 I_{\|A\|_2} & \text{Case (b)} \\
t_1 I_D & \text{Case (c)} \\
s_{s_0} & \text{Case (d)}
\end{cases}
\end{aligned}

(25)

where *i* denotes that the sub-FSVQ belongs to the *i*-th ECVQ, and *t*_{0}, *t*_{1}, and *t*_{2} are three constants that make each combination of the four indices correspond to a different current state. This is feasible because, for an audio MDCT coefficient sequence, the values of the four variables *I*_{B}, *I*_{D}, {I}_{\left|\right|A|{|}_{2}}, and {I}_{\left|\right|C|{|}_{2}} are all finite; given their maximum possible values, it is easy to find suitable values for the three constants.
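A direct transcription of (25) might look as follows. The default constants `t = (100, 10, 1)` and the reset state `s0 = -1` are illustrative assumptions, as is the association of cases (b) and (c) with particular frame/position combinations (the exact mapping is defined by Figure 5).

```python
def sub_state(case, I_B=0, I_D=0, I_A2=0, I_C2=0, t=(100, 10, 1), s0=-1):
    """Next-state function of the sub-FSVQ (Equation 25).

    I_A2 and I_C2 are the indices of ||A||_2 and ||C||_2; t = (t0, t1, t2)
    are constants chosen so that distinct index combinations map to
    distinct states; s0 is the special reset state.
    """
    t0, t1, t2 = t
    if case == "a":                       # all four neighbors available
        return t0 * I_B + t1 * I_D + t2 * (I_C2 + I_A2)
    if case == "b":                       # only B and A available
        return t0 * I_B + t2 * I_A2
    if case == "c":                       # only D available
        return t1 * I_D
    return s0                             # (d): start of a reset frame
```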

#### 3.2.3 Length function

The length functions are realized by an arithmetic coder holding multiple pdf models. There are two difficulties in building an optimal arithmetic coder for an optimal ECVQ. First, the memory required for storing the predesigned pdf models becomes infeasible as the number of states derived from (25) increases. Second, as the volumes of the partitions split by the sub-FSVQ shrink, the available data may not provide a credible pdf estimate. Popat and Picard [33] proposed a solution to the second problem using a Gaussian mixture model (GMM) to describe the source pdf. Thus, this work mainly focuses on reducing the memory required for storing the pdf models needed by the arithmetic coder.

The memory requirements can be reduced by merging similar pdf models. However, according to (16), if one pdf model is replaced by another, a mismatch inevitably takes place. Let *g* be the true pdf of the input signal and suppose that *Ω*_{g} is its support. Assume that \{{S}_{m};m\in \mathcal{U}\}, with corresponding pdf models \{{g}_{m};m\in \mathcal{U}\} for \mathcal{U}=\{1,\dots ,M\}, is a finite partition of *Ω*_{g} and that *P*_{g}(*S*_{m})>0 for all *m*. Assume also that model *g*_{m} is replaced by another model, *g*_{n}; then, according to (17), the mismatch of the pdf model pair *g*_{m} and *g*_{n}, denoted by *d*_{mis}(*m*,*n*), is given as

\begin{aligned}
d_{\text{mis}}(m,n) &= \int_{S_m} \rho_m\, g_m(x)\,\ln\frac{g_m(x)}{g_n(x)}\,dx \\
&= \rho_m\, I(g_m \,\|\, g_n)
\end{aligned}

(26)

where *ρ*_{m}, which equals the probability *P*_{g}(*S*_{m}), is the weight of model *g*_{m}. Thus, the mismatch *d*_{mis} can be seen as a distance measure on a pdf model pair: the more similar the two models are, the smaller the mismatch. Therefore, we can efficiently decrease the memory required for storing the pdf models by merging model pairs whose mismatches are small enough into a new pdf model, with a negligible loss of coding performance.
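For discrete pmf tables over a common symbol set, the mismatch of (26) reduces to a weighted Kullback-Leibler divergence; a minimal sketch, assuming `g_m` and `g_n` are dictionaries mapping symbols to probabilities with the same support:

```python
import math

def mismatch(rho_m, g_m, g_n):
    """Weighted KL divergence rho_m * I(g_m || g_n) between two pmf
    tables (Equation 26, discrete analogue)."""
    return rho_m * sum(p * math.log(p / g_n[s])
                       for s, p in g_m.items() if p > 0.0)
```

The mismatch is zero exactly when the two models coincide on the support of *g*_{m}, and positive otherwise, matching its use as a merge criterion.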

For a pdf model collection, once the *d*_{mis} values of all model pairs have been obtained, the pair with the minimal *d*_{mis} value can be merged into a new pdf model so as to reduce the memory requirements. If the memory size is still above the budget, the merging of similar pdf models continues; each time a new pdf model is generated, the mismatches among the pdf models are first updated, and then a new merge is executed. The whole procedure is carried out iteratively until the memory size meets the requirements. Once the final pdf models are obtained, a remapping between these models and their corresponding states is needed.
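The iteration can be sketched as a greedy loop: find the model pair with minimal mismatch, replace it by the weighted average of the two pmfs, and repeat until the memory budget (approximated here by a maximum model count) is met. The state remapping is omitted for brevity, and the replacement-by-weighted-average rule is an assumption about how the merged model is formed.

```python
def merge_models(models, weights, max_models, mismatch_fn):
    """Greedily merge pmf tables until at most max_models remain.

    models:      list of dicts mapping symbols to probabilities
    weights:     list of model weights rho_m
    mismatch_fn: mismatch_fn(rho_m, g_m, g_n), as in Equation (26)
    """
    models, weights = [dict(m) for m in models], list(weights)
    while len(models) > max_models:
        # Find the ordered pair (i, j) with the smallest mismatch.
        _, i, j = min((mismatch_fn(weights[a], models[a], models[b]), a, b)
                      for a in range(len(models))
                      for b in range(len(models)) if a != b)
        # Merge into the weighted average of the two pmfs.
        w = weights[i] + weights[j]
        merged = {s: (weights[i] * models[i].get(s, 0.0)
                      + weights[j] * models[j].get(s, 0.0)) / w
                  for s in set(models[i]) | set(models[j])}
        for k in sorted((i, j), reverse=True):
            models.pop(k)
            weights.pop(k)
        models.append(merged)
        weights.append(w)
    return models, weights
```

Recomputing all pairwise mismatches each round mirrors the "update first, then merge" rule described above, at the cost of quadratic work per iteration.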