Figure 1(a) is a block diagram that illustrates the speech feature extraction methods of MFCC and the proposed speech feature. The proposed feature is obtained by applying an IPS transform instead of the DCT in the logarithmic mel-frequency filter bank domain. The IPS transform consists of two transforms: projection onto phoneme subspaces and integration of phoneme subspaces, as shown in Figure 1(b). Both transforms are performed by multiplying the feature vector by linear transform matrices.
2.1. Base Feature Extraction
To estimate the IPS transform matrix, we use logarithmic mel-frequency filter bank (LogMFB) coefficients. As shown in Figure 1(b), speech signals are pre-emphasized using a first-order FIR filter, and the stream of speech samples is segmented into a series of frames, with each frame windowed by a Hamming window. Next, an FFT is applied to each frame to obtain its power spectrum. The power spectra are filtered using a mel-frequency filter bank whose center frequencies are equally spaced on the mel scale and whose coefficients are weighted according to a triangular shape. Finally, the logarithms of the MFB components are computed, because the human auditory system perceives speech loudness on a logarithmic scale.
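The LogMFB pipeline described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the sampling rate, frame length, frame shift, FFT size, number of filters, pre-emphasis coefficient, and the mel formula are all assumed values chosen for the example, and the function names `mel_filterbank` and `log_mfb` are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # A commonly used mel-scale formula (assumed here).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose center frequencies are equally spaced in mel."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):        # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):       # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def log_mfb(signal, sample_rate=16000, frame_len=400, frame_shift=160,
            n_fft=512, n_filters=24, preemph=0.97):
    """Return LogMFB features of shape (n_frames, n_filters)."""
    # First-order FIR pre-emphasis: y[n] = x[n] - a * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # Segment into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    frames = emphasized[idx] * np.hamming(frame_len)
    # Power spectrum of each frame (zero-padded to n_fft).
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2
    # Mel filter bank energies, floored to avoid log(0).
    mfb = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    return np.log(np.maximum(mfb, 1e-10))
```

Calling `log_mfb` on a one-dimensional array of samples at least `frame_len` long returns one row of log filter bank energies per frame.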
2.2. Phoneme Subspaces Using PCA
To extract phonemic information from speech signals, we use the subspace method with Principal Component Analysis (PCA). PCA is an orthogonal linear transformation that maps data to a new coordinate system. It is also commonly used for dimensionality reduction and decorrelation of feature coefficients. By applying PCA to each clean phoneme feature set, as shown in Figure 2, the respective phoneme subspace is obtained.
PCA is applied to each phoneme data matrix $X_k$, a set of $N$-dimensional LogMFB vectors $x$ randomly sampled from the frame set for phoneme $k$. The eigenvectors that form the new coordinate system are computed by eigenvalue decomposition of the covariance matrix as follows:
$$\Sigma_k = E\left[(x - \mu_k)(x - \mu_k)^T\right], \qquad \Sigma_k v_{k,i} = \lambda_{k,i} v_{k,i}.$$
Here $\mu_k$ and $\lambda_{k,i}$ are the mean vector and the eigenvalue corresponding to the eigenvector $v_{k,i}$, respectively.
When an unknown vector $x$ is input, projecting $x$ onto the $k$th phoneme subspace with the eigenvectors $V_k = [v_{k,1}, \ldots, v_{k,M_k}]$ corresponding to the $M_k$ largest eigenvalues defines a feature vector $y_k$, ignoring the constant term, as follows:
$$y_k = V_k^T x.$$
In the next subsection, the method of selecting the optimal dimension of each phoneme subspace is described.
Finally, the supervector $y$ is obtained by concatenating the $y_k$ as follows:
$$y = \left[y_1^T, \ldots, y_K^T\right]^T = V^T x.$$
Here, $K$ indicates the number of phonemes and $V$ is the matrix of the whole phoneme subspace defined as $V = [V_1, \ldots, V_K]$. The dimensionality of $y$, $M$, is
$$M = \sum_{k=1}^{K} M_k.$$
When a frame of reverberant speech is input, the clean speech portion is projected onto the subspace $V$. The reverberant portion, projected onto $V^{\perp}$ (the complementary space of $V$), is then reduced as in [7]. The phoneme subspace estimation scheme is shown in Figure 2.
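The per-phoneme PCA and the concatenation into a supervector can be sketched as below. This is an illustrative sketch under assumptions: the function names `phoneme_subspace` and `project_supervector` are hypothetical, the subspace dimensions are passed in by hand rather than chosen by the MDL criterion, and the constant (mean) term is dropped from the projection as the text describes.

```python
import numpy as np

def phoneme_subspace(frames, dim):
    """PCA for one phoneme.

    frames: array (n_frames, n_features) of LogMFB vectors for this phoneme.
    Returns the mean vector and a basis whose columns are the eigenvectors
    of the covariance matrix with the `dim` largest eigenvalues.
    """
    mean = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dim]    # indices of the largest ones
    return mean, eigvecs[:, order]

def project_supervector(x, bases):
    """Project x onto every phoneme subspace and concatenate the results.

    The constant term (basis.T @ mean) is ignored, so each block is basis.T @ x.
    """
    return np.concatenate([basis.T @ x for basis in bases])
```

For example, with three phoneme bases of dimensions 4, 5, and 6, `project_supervector` returns a 15-dimensional supervector, the sum of the individual subspace dimensions.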
2.3. Optimal Phoneme Subspace Selection Based on MDL
Determining the dimension of each phoneme subspace requires a truncation criterion. In [5], the MDL criterion was applied to the subspace selection problem for noisy speech enhancement. Assuming that the redundancy of clean speech is additive white Gaussian in the logarithmic domain, the MDL criterion can be applied to clean speech data as follows:
where , , and are the dimension parameter, the selectivity of MDL, and the number of free parameters, respectively. We set then the optimal is obtained as follows:
This criterion provides both consistent and automatic phoneme subspace estimates.
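To make the idea of MDL-based truncation concrete, the following sketch selects a subspace dimension from PCA eigenvalues. It is not the exact criterion of this paper or of [5]: it assumes a standard eigenvalue-based MDL form for real-valued Gaussian data with a white residual, a free-parameter count of k·n − k²/2 + k/2 + 1 for a k-dimensional subspace in n dimensions, and a hypothetical weight `gamma` standing in for the selectivity parameter.

```python
import numpy as np

def mdl_dimension(eigvals, n_samples, gamma=1.0):
    """Return the subspace dimension that minimizes an MDL-style score.

    eigvals:   positive eigenvalues of the covariance matrix.
    n_samples: number of vectors used to estimate the covariance.
    gamma:     hypothetical selectivity weight on the penalty term.
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    n = lam.size
    scores = []
    for k in range(n):                      # candidate dimensions 0 .. n-1
        tail = lam[k:]                      # eigenvalues outside the subspace
        # log(geometric mean / arithmetic mean) of the discarded eigenvalues;
        # it is 0 when they are all equal (a perfectly white residual).
        log_ratio = np.mean(np.log(tail)) - np.log(np.mean(tail))
        neg_log_lik = -0.5 * n_samples * (n - k) * log_ratio
        free_params = k * n - k * k / 2.0 + k / 2.0 + 1.0
        penalty = gamma * 0.5 * free_params * np.log(n_samples)
        scores.append(neg_log_lik + penalty)
    return int(np.argmin(scores))
```

With eigenvalues `[10, 8, 6, 1, 1, 1, 1, 1]` estimated from 1000 samples, the three dominant directions are kept, since the remaining eigenvalues are consistent with a white residual and adding further dimensions only increases the penalty.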
2.4. Integration of Phoneme Subspaces
Having constructed the optimal phoneme subspaces, we obtain feature vectors that emphasize the phonemic information in input speech signals. Note that these feature vectors are high-dimensional (their dimension is the sum of the optimal phoneme subspace dimensions), and some basis vectors may be correlated. For speech recognition, it is more efficient to reduce the dimension of the feature vector and to decorrelate its components. For this purpose, we apply PCA or ICA to a set of these feature vectors to obtain the integration matrix, as shown in Figure 3. This integration matrix is time-invariant and linear, under the assumption that phoneme structures are time-invariant and are composed linearly of decorrelated components. Our proposed speech feature vectors are generated by multiplying each supervector by the integration matrix.
In our experiments, for a Hidden Markov Model (HMM)-based recognizer, we normalized the proposed feature vectors to zero mean and appended their time derivatives to the normalized values, so that the final dimensionality is .
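The mean normalization and time-derivative step can be sketched as follows. The text does not specify how the derivatives were computed, so this example assumes a common regression-based delta over ±2 frames with edge replication and only first-order derivatives; the function names are hypothetical.

```python
import numpy as np

def mean_normalize(features):
    """Subtract the per-dimension mean over all frames."""
    return features - features.mean(axis=0, keepdims=True)

def delta(features, width=2):
    """Regression-based time derivative over +/- `width` frames.

    Edge frames are replicated so the output has the same number of frames.
    """
    n_frames = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(t * t for t in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for t in range(1, width + 1):
        ahead = padded[width + t: width + t + n_frames]
        behind = padded[width - t: width - t + n_frames]
        out += t * (ahead - behind)
    return out / denom

def add_deltas(features):
    """Zero-mean static features with their deltas appended (doubles the width)."""
    static = mean_normalize(features)
    return np.hstack([static, delta(static)])
```

Under these assumptions the output has twice as many columns as the input; appending second-order derivatives as well would triple it.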
2.4.1. Integration Using PCA
As stated previously, PCA can both reduce dimensionality and decorrelate the components. Eigenvalue decomposition of the covariance matrix of the data matrix yields eigenvalues and eigenvectors, and the eigenvectors corresponding to the largest eigenvalues are used to construct the integration matrix.
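A minimal sketch of building a PCA integration matrix is given below. The function name `pca_integration_matrix` and the convention of storing the selected eigenvectors as rows are assumptions made for the example; the output dimension must be supplied by the caller.

```python
import numpy as np

def pca_integration_matrix(supervectors, out_dim):
    """Build a PCA integration matrix from supervectors.

    supervectors: array (n_frames, n_dims), one supervector per row.
    Returns an (out_dim, n_dims) matrix whose rows are the eigenvectors of
    the covariance matrix with the largest eigenvalues, in descending order.
    """
    cov = np.cov(supervectors, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:out_dim]
    return eigvecs[:, order].T

# Usage: integrated = supervectors @ W.T gives decorrelated, lower-dimensional
# features, where W = pca_integration_matrix(supervectors, out_dim).
```

Because the rows are covariance eigenvectors, the covariance of the integrated features is diagonal, which is the decorrelation property the text refers to.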
2.4.2. Integration Using ICA
Independent component analysis (ICA) is a method for separating mutually independent source signals from mixed signals. In [9], ICA was used for speech feature extraction and phoneme recognition, resulting in good recognition performance, and it was shown that the filter obtained by applying ICA to a time-domain speech data set from a single microphone worked like a bandpass filter. Here, we use ICA for integrating phoneme subspaces.
The generative model of ICA is linear, $x = As$, where $x$, $A$, and $s$ are the observed data vector, mixing matrix, and source vector, respectively. Assuming only that the components of the source vector are mutually independent, an unmixing matrix $W$ (ideally $A^{-1}$) and the independent components $y$ are estimated as $y = Wx$. The unmixing matrix is estimated by maximizing the statistical independence of the estimated components. Statistical independence is usually measured by negentropy or by kurtosis, a fourth-order cumulant, and its maximization is implemented with a gradient algorithm or a fixed-point algorithm.
In this paper, we used FastICA [8], which is based on a fixed-point iteration scheme that maximizes negentropy. The FastICA algorithm for finding one unmixing vector $w$ that yields one independent component is as follows.

(1) Center the data to make its mean zero.
(2) Whiten the data to give $z$.
(3) Choose an initial (e.g., random) vector $w$ of unit norm.
(4) Let $w \leftarrow E\{z\,g(w^T z)\} - E\{g'(w^T z)\}\,w$, where $g$ is the derivative of the nonquadratic function used to approximate negentropy.
(5) Let $w \leftarrow w / \|w\|$.
(6) If it has not converged, go back to step (4).
To estimate more independent components, different kinds of decorrelation schemes should be used; please refer to [8] for more information.
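The one-unit procedure listed above can be sketched in NumPy as follows. This is an illustrative sketch rather than the implementation used in the experiments: it assumes the nonlinearity g(u) = tanh(u), a convergence test on the direction of the vector up to sign, and synthetic two-source data; the names `whiten` and `fastica_one_unit` are hypothetical.

```python
import numpy as np

def whiten(x):
    """Center the data (rows are samples) and whiten it to identity covariance."""
    centered = x - x.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    # Symmetric whitening matrix E D^(-1/2) E^T.
    return centered @ (eigvecs / np.sqrt(eigvals)) @ eigvecs.T

def fastica_one_unit(z, max_iter=200, tol=1e-8, seed=0):
    """One-unit fixed-point FastICA on whitened data z of shape (n, d)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(z.shape[1])
    w /= np.linalg.norm(w)                      # step (3): unit-norm start
    for _ in range(max_iter):
        u = z @ w
        g = np.tanh(u)                          # assumed nonlinearity
        g_prime = 1.0 - g ** 2
        # Step (4): w <- E{z g(w^T z)} - E{g'(w^T z)} w
        w_new = (z * g[:, None]).mean(axis=0) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)          # step (5): renormalize
        # Step (6): converged when the direction stops changing (sign-agnostic).
        converged = abs(abs(w_new @ w) - 1.0) < tol
        w = w_new
        if converged:
            break
    return w

# Synthetic example: two independent non-Gaussian sources, linearly mixed.
rng = np.random.default_rng(0)
sources = np.column_stack([rng.uniform(-1, 1, 5000), rng.laplace(size=5000)])
mixed = sources @ np.array([[1.0, 0.5], [0.3, 1.0]])
z = whiten(mixed)
w = fastica_one_unit(z)
component = z @ w    # one estimated independent component (up to sign and scale)
```

The recovered component is defined only up to sign and scale, which is why the convergence check compares directions with an absolute value.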
Applying ICA to the data matrix extracts the independent components among phonemes and compresses the dimensionality. The resulting unmixing matrix is used as the integration matrix. The PCA integration matrix decorrelates the components, whereas the ICA integration matrix makes the components mutually independent.