The motivation behind SVM supervector reconstruction is to provide vector space modeling diversification to improve the performance of the overall language recognition system. In the language recognition system employed in this paper, we focus on how a change in the input to the SVM affects the output.
Given an SVM supervector φ(x), we define a function ϕSSR which operates on φ(x):
(5)
We are interested in understanding how ϕSSR(φ(x)) affects the behavior of the output scores of the SVM. The goal is to define the relationship between φ(x) and ϕSSR(φ(x)) to enhance the variety of the supervector input.
Selecting SVM supervector reconstruction methods is an open question, so here we propose some typical methods to the implementation. In this section, three SVM supervector reconstruction methods are proposed: relative SVM supervector reconstruction, functional SVM supervector reconstruction, and perturbative SVM supervector reconstruction. Relative SVM supervector reconstruction has been presented in [25], while functional and perturbative reconstructions are new methods. Relative reconstruction is a linear reconstruction, while functional and perturbative reconstructions are non-linear ones.
3.1 Relative SVM supervector reconstruction
The relative SVM supervector method uses a relative feature approach. Relative features in contrast to absolute features, which represent directly calculable information, are defined by the relationship between an utterance and a set of selected datum utterances. We have presented the concept of relative features in [25].
Calculating relative features requires a relationship measurement, such as distance on similarity. By selecting a proper relationship measurement, the identifiable characteristics can be strengthened and nuisance attributes of the utterance can be discarded. Unlike absolute features, relative features make utterances more convenient to classify by showing the relationship between the utterances and the datum database directly.
Here, we introduce a relative SVM supervector reconstruction defined using the similarity between the utterance SVM supervectors. The widely used kernel methods offer efficient similarity measurements between two SVM supervectors. In this paper, the empirical kernel [29] is introduced into language recognition and a relativized SVM supervector developed. Kernel methods have been used for face recognition [30] and handwritten digit recognition [31] and achieved higher robustness to noise [30]. Using the SVM supervectors that are already built into a language recognition system, we can easily compose a new relativized SVM supervector with only a small increase in computation.
The architecture of the relative SVM supervector reconstruction subsystem is shown in Figure 1. To construct the SVM supervector relativization map, a database s =[s1,s2,…s
m
] containing m utterances is used as the datum mark of similarity. The datum database is stochastically selected from some corpus, whose language need not be the same with the target language. s is mapped into vector space:
(6)
The vector relativizational (VR) kernel between two supervectors φ(x
i
) and φ(x
j
) is
(7)
The VR kernel is similar to the TFLLR kernel, but normalized by the observed probability across all lattices of the datum dataset p(d
i
|ℓ
s
). This kernel reflects the degree of similarity between two supervectors.
The utterance x is mapped from the input data space to a relativized m-dimensional Euclidean space m: ΦREL:→m as follows:
In general, KVR(φ(x),φ(S )) defines a space in which each dimension corresponds to the similarities to a prototype. Thus, KVR(·,φ(S )) can be viewed as a mapping onto an m-dimensional relativized vector space.
The SVM output score is computed as
(8)
where are support vectors obtained from the training set using the Mercer condition. Selecting a radial basis function (RBF) kernel, K RBF′ is computed as
(9)
where DRF is the dimension of the relativized SVM supervector. Selecting a TFLLR kernel, K TFLLR′ is computed as
(10)
3.2 Functional SVM supervector reconstruction
In actual test conditions, the training and test data are variable in speakers, background noise, and channel conditions. To achieve higher robustness to variable test conditions, the widely used kernel methods offer efficient similarity measurements between two SVM supervectors in PR-VSM system [30]. The geometrical structure of the SVM vector space is completely determined by the kernel, so the selection of the kernel has a crucial impact on the performance of the language recognition systems. The functional SVM supervector reconstruction method defines a mixture between the functional and the original kernels, which can offer the robust discriminative information of the data and get robust language model.
But how to select a proper function is an open problem. There are many functions that can be used to the reconstruction, while not every function is available for the reconstruction that can reduce the equal error rate (EER). What we need to do is to find out what kind of functions can be used in feature reconstruction. The functions need to satisfy the following conditions: (1) monotonic and (2) can make the identifiable characteristics strengthened and nuisance attributes of the utterance discarded. The proposed functional SVM supervector reconstruction method does not rely on prior knowledge to select the functional to reconstruct the supervector. A development database is used for cross validation to select the function. So, here, three functions selected to be used in this paper include
-
(a)
(11)
-
(b)
(12)
-
(c)
(13)
The utterance x is mapped onto a functionalized vector space:
(14)
The three functions are all monotonic in the range of amplitude of the SVM supervector. Selecting a TFLLR kernel, K TFLLR′ is computed as
(15)
The SVM output score is computed as
(16)
3.3 Perturbational SVM supervector reconstruction
For spoken language recognition, the first and most essential step is to tokenize the running speech into sound units or lattices using a phone recognizer. The phoneme error rate is around 40% to 60% [32] when tokenizing an utterance. The decoding errors are deletion, insertion, and substitution errors, which are expressed as some discrete ‘noise’ when mapped to the high-dimensional SVM supervector space (shown in Figure 2). So, here, we introduce a perturbational denoising method for the SVM supervector. Given a supervector φ(x) and some perturbation operator on φ(x), we are interested in understanding how a small perturbation added to the supervector affects the behavior of the SVM [33]. This relationship can be represented using a mapping onto a perturbational vector space.
There are three purposes of proposing perturbational SVM supervector reconstruction method: first, adding perturbational noise to reduce the impact of noise in the SVM supervector introduced by the decoding errors; second, generating a more robust language model to provide input variety to the SVM classifier; and third, highlighting the most discriminative information of the SVM supervector and drowning the non-discriminative information into the perturbation (shown in Figure 3).
To accomplish the above goals, the type and strength of the perturbation must be selected carefully. How to define a proper perturbation is an open problem. There are a wide variety of perturbations,which can be categorized into multiple ways, including (1) global perturbation and local perturbation, (2) stochastic perturbation and constant perturbation according to the amplitude, (3) absolute perturbation and relative perturbation according to the relationship between the SVM supervector and the perturbation, and (4) addictive perturbation and multiplicative perturbation.
For feature supervectors in vector space, the perturbations are always discrete, maybe random in a certain range or change with the amplitude of the expected value of the supervector. So, we consider both deterministic perturbation and stochastic perturbation , where is the mean of the SVM supervector, is the uniform distribution between 0 and , and w∗ is the perturbation weight. More details of the perturbation methods are discussed below.
3.3.1 Perturbational approach 1 (deterministic additive perturbation)
(17)
This kind of perturbation represents the assumption that the expected count of every phoneme sequence is perturbed by an equivalent additive amount.
3.3.2 Perturbational approach 2 (stochastic additive perturbation)
(18)
This perturbation represents the assumption that the expected count of every phoneme sequence is perturbed by an amount proportional to the frequency of the phoneme sequences.
3.3.3 Perturbational approach 3 (deterministic multiplicative perturbation)
(19)
This kind of perturbation represents the assumption that the expected count of every phoneme sequence perturbed by an equivalent multiple amount.
3.3.4 Perturbational approach 4 (stochastic multiplicative perturbation)
(20)
This perturbation represents the assumption that the expected count of every phoneme sequence is perturbed by a proportional to the frequency of the phoneme sequences.
From above, it can be seen that methods 1 and 2 implement absolute perturbation, and methods 3 and 4 implement relative perturbation. All are global perturbation algorithms, operating across the entire vector space. We can also investigate local perturbation using these same approaches. Local perturbation is more flexible and realistic for the noises would have effect on part of the expected counting. The proposed methods also do not rely on prior knowledge to put noising into the supervector; we use development database for cross validation to select a better perturbation.
The utterance x is mapped onto a perturbational vector space:
(21)
where φPER(x) is a perturbation of φ(x). Selecting a TFLLR kernel, K TFLLR′ is computed as
(22)
The SVM output score is computed as
(23)