Open Access

Audio Query by Example Using Similarity Measures between Probability Density Functions of Features

EURASIP Journal on Audio, Speech, and Music Processing 2009, 2010:179303

https://doi.org/10.1155/2010/179303

Received: 22 May 2009

Accepted: 9 November 2009

Published: 22 December 2009

Abstract

This paper proposes a query by example system for generic audio. We estimate the similarity of the example signal and the samples in the queried database by calculating the distance between the probability density functions (pdfs) of their frame-wise acoustic features. Since the features are continuous valued, we propose to model them using Gaussian mixture models (GMMs) or hidden Markov models (HMMs). The models parametrize each sample efficiently and retain sufficient information for similarity measurement. To measure the distance between the models, we apply a novel Euclidean distance, approximations of the Kullback-Leibler divergence, and a cross-likelihood ratio test. The performance of the measures was tested in simulations where audio samples are automatically retrieved from a general audio database, based on the estimated similarity to a user-provided example. The simulations show that the distance between probability density functions is an accurate measure of similarity. Measures based on GMMs or HMMs are shown to produce better results than existing methods based on simpler statistics or histograms of the features. Good performance at low computational cost is obtained with the proposed Euclidean distance.

1. Introduction

The enormous growth of personal and on-line multimedia content has created the need for tools for automatic database management. Such management tools include, for instance, query by humming or query by example, multimedia classification, and speaker recognition. Query by example is an audio retrieval task where a user provides an example signal and the retrieval system returns similar samples from the database. The main problem in query by example and the other content management applications mentioned above is to determine the similarity between two database items.

The fundamental problem when measuring the similarity between audio samples is the imperfect definition of similarity. For example, a human can judge the similarity of two speech signals by the topic of the speech, by the speaker identity, or by any sounds in the background. Existing retrieval approaches circumvent this imperfect definition in different ways. First, the similarity criterion can be defined beforehand. For example, query by humming [1, 2] retrieves pieces of music whose melody is musically similar to an input humming. Query-by-beat-boxing [3], on the other hand, aims at retrieving music pieces which are rhythmically similar to the example. These retrieval methods are based on extracting features which are tuned for the particular retrieval problem.

Second, supervised classification can be used to classify each database signal into a predefined class, for instance, speech, music, or environmental sounds. Supervised classification in general has been widely studied, and audio classifiers typically employ neural networks [4] or hidden Markov models (HMMs) [5] on frame-wise features. In general audio classification, extracting features in short (approximately 40 ms) frames has turned out to produce good results (see Section 2.1 for a detailed discussion).

Since the above approaches define the similarity beforehand, they limit the applicability of the method to a certain application area or to certain classes of signals. Generic query by example of audio does not restrict the type of signals, but aims at finding similarity criteria which correlate with perceptual similarity in general [6, 7].

Combinations of the above-mentioned methods have also been used. Kiranyaz et al. performed initial segmentation and supervised classification into four predefined classes, after which query by example was applied to samples that were classified into the same class [8]. For image databases, the use of multiple examples [9] and user feedback [10] has also been suggested.

This paper proposes a query by example system for generic audio. Section 2 gives an overview of the system and previous similarity measures. We observe that the similarity of audio signals can be measured by the difference between the probability density functions (pdfs) of their frame-wise features. The empirical pdfs of continuous-valued features cannot be estimated directly, so they are modeled using Gaussian mixture models (GMMs). A GMM parametrizes each sample efficiently with a small number of parameters, retaining the necessary information for similarity measurement. An overview of other applications utilizing GMMs in music information retrieval can be found in [11].

In Section 3 we present similarity measures between pdfs parametrized by GMMs. We propose a novel method for calculating the Euclidean distance between GMMs with full covariance matrices. We also present approximations for the Kullback-Leibler divergence between GMMs, which have not been previously used in audio similarity measurement. A cross-likelihood ratio test is presented and extended to hidden Markov models, which allow modeling the temporal characteristics of the signals. Simulation experiments on a database consisting of a wide range of sounds were conducted, and in Section 4 the distance measures between pdfs are shown to outperform the existing methods in the audio retrieval task.

2. Query by Example

Figure 1 illustrates the block diagram of the query by example system. An example signal is given by a user. A set of features is extracted, and a GMM or an HMM is trained for the example signal and for each database signal. The similarity between the example and each database signal is estimated by calculating a distance measure between their GMMs or HMMs, and the signals having the smallest distance are retrieved as similar to the example signal.
Figure 1

Query by example system overview.

2.1. Feature Extraction

Feature extraction aims at modeling the perceptually most relevant information of a signal using only a small number of features. In audio classification, features are usually extracted in short (20–60 ms) frames, and typically they parametrize the spectrum of the sound. In comparison to the time-domain signal, the spectrum correlates better with the human sound perception, and the human auditory system has been found to perform frequency analysis [12, pages 20–53]. The most commonly used features in audio classification are Mel-frequency cepstral coefficients (MFCCs) which were used for example by Mandel and Ellis [13].

In our earlier studies [6, 7], different feature sets were tested in general audio retrieval, and based on the experiments the best feature set was chosen. Features were MFCCs (the first three coefficients were found to give the best results), spectral spread, spectral flux, harmonic ratio [14], maximum autocorrelation lag, crest factor, noise likeness [15], total energy, and variance of instantaneous power. Even though the feature set was tuned for a particular data set and similarity measures, the evaluated distance measures are general and can be applied to any set of features. In more specific retrieval tasks it is likely that better results will be obtained by using feature sets tuned for the particular tasks.

2.2. Previous Similarity Measures

Previous distance measures have used statistical measures (mean, covariance, etc.) of the features (see Sections 2.2.1 and 2.2.2) or quantized the feature vectors and then measured the similarity by the distance between feature histograms, as will be explained in Section 2.2.3. Recently, specific distance measures between the pdfs of the feature vectors have been observed to be good similarity measures [7, 16–18]. Section 3 describes distance measures which can be calculated between pdfs parametrized by GMMs.

2.2.1. Mahalanobis Distance

The Mahalanobis distance calculates the distance between two samples based on their mean feature vectors $\mu_A$ and $\mu_B$ and the covariance matrix $\Sigma$ of the features across all samples in the database. The distance is given as
$$D_M(\mu_A, \mu_B) = (\mu_A - \mu_B)^T \Sigma^{-1} (\mu_A - \mu_B). \tag{1}$$

If the distribution of feature vectors of all observations is ellipsoidal, then the Mahalanobis distance between two mean vectors in feature space is dependent on the distance along each feature dimension but also on the variance of that feature dimension. This property makes the Mahalanobis distance independent of the scale of the features. In supervised classification of music, Mandel and Ellis [13] used a version of Mahalanobis distance, where the mean vector consisted of all the entries of the sample-wise mean vector and covariance matrix.
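
As a rough illustration of the scale independence described above, the following Python sketch evaluates the quadratic-form Mahalanobis distance between two mean vectors. For brevity it assumes a diagonal database covariance (one variance per feature); the function name `mahalanobis_diag` and the toy numbers are hypothetical and are not taken from the paper.

```python
def mahalanobis_diag(mu_a, mu_b, variances):
    """Quadratic-form Mahalanobis distance between two mean vectors.

    Assumption: the database covariance is diagonal, so it is passed as a
    list of per-feature variances instead of a full matrix.
    """
    return sum((a - b) ** 2 / v for a, b, v in zip(mu_a, mu_b, variances))


# Toy example: two features with database variances 4 and 9.
d = mahalanobis_diag([1.0, 2.0], [3.0, 5.0], [4.0, 9.0])

# Rescaling the first feature by 10 multiplies its variance by 100,
# so the distance stays the same.
d_scaled = mahalanobis_diag([10.0, 2.0], [30.0, 5.0], [400.0, 9.0])
```

With a full covariance matrix the division by the variance would be replaced by a multiplication with the inverse covariance matrix.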

2.2.2. Bayesian Information Criterion

The Bayesian information criterion (BIC), which is a statistical criterion for model selection, has been used especially with speech material to segment and cluster a database [19]. BIC has been used to measure the changing point in audio by having two hypotheses: the first assumes that the whole sequence is generated by a single Gaussian model, whereas the second assumes that two segments separated by a changing point are generated by two different Gaussian models. The BIC difference between the hypotheses is
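$$\Delta\mathrm{BIC} = \frac{N}{2}\log|\Sigma| - \frac{N_A}{2}\log|\Sigma_A| - \frac{N_B}{2}\log|\Sigma_B| - \frac{1}{2}\lambda\left(d + \frac{1}{2}d(d+1)\right)\log N, \tag{2}$$

where $N$ is the total number of observations, $N_A$ is the number of observations in sequence $A$, and $N_B$ is the number of observations in sequence $B$. $\Sigma$, $\Sigma_A$, and $\Sigma_B$ are the covariance matrices of all the observations, sequence $A$, and sequence $B$, respectively. $d$ is the number of dimensions and $\lambda$ is the penalty factor to compensate for small sample sizes. A changing point is detected if the BIC measure is above zero [20].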

2.2.3. Histogram Method

Kashino et al. [21] proposed quantizing the frame-wise feature vectors and estimating the similarity of two audio samples by calculating the distance between the feature histograms of the samples. The centers of the quantization levels were found using the Linde-Buzo-Gray [22] vector quantization algorithm. The feature histogram for each sample was generated by counting the number of frame-wise feature vectors falling on each quantization level. The quantization level of a feature vector was chosen by measuring the Euclidean distance between the feature vector and the center of each level and choosing the level that minimizes the distance. Finally, the similarity between samples was estimated by calculating the chosen distance (e.g., $L_1$-norm or $L_2$-norm) between the feature histograms.

The use of histograms is very flexible and straightforward compared to other distance measures between distributions, because practically any distance measure can be used to calculate the distance between histogram bins. However, a problem of using a quantized version of probability distribution is that even if two feature vectors are closely spaced, it is possible that they fall in a different quantization level. Since each histogram bin is used independently, the resulting quantization error may have a negative effect on the performance of the similarity measure.
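
The histogram method and its quantization problem can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the quantization centers are given directly instead of being trained with the LBG algorithm, the histograms are normalized by the number of frames, and the $L_1$ norm is used as the histogram distance. All function names and data values are hypothetical.

```python
def nearest_level(x, centers):
    """Index of the quantization center closest to x (Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((xi - ci) ** 2 for xi, ci in zip(x, centers[k])),
    )


def feature_histogram(frames, centers):
    """Normalized histogram of quantization levels over all frames."""
    counts = [0] * len(centers)
    for x in frames:
        counts[nearest_level(x, centers)] += 1
    return [c / len(frames) for c in counts]


def l1_distance(h1, h2):
    """Sum of absolute bin-wise differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


centers = [[0.0, 0.0], [1.0, 1.0]]  # assumed codebook with two levels
frames_a = [[0.1, 0.0], [0.49, 0.49], [0.9, 1.0], [0.2, 0.1]]
frames_b = [[0.0, 0.1], [0.51, 0.51], [1.0, 0.9], [0.1, 0.2]]

hist_a = feature_histogram(frames_a, centers)
hist_b = feature_histogram(frames_b, centers)
dist = l1_distance(hist_a, hist_b)
```

In this toy data the frames `[0.49, 0.49]` and `[0.51, 0.51]` are almost identical but fall on different quantization levels, which is the quantization error discussed above.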

2.3. Query Output

After feature extraction, the chosen distance measure between the feature vectors of the example and each database sample is calculated. Samples having the smallest distances are considered similar and are returned to the user. There are two main possibilities for this. The first is the k-nearest neighbor (k-NN) query, which retrieves a fixed number of samples having the shortest distance to the example [23]. The second is the ϵ-range query, which retrieves all the samples having a shorter distance to the example than a predefined threshold ϵ [23].

In an optimal situation, the ϵ-range query can retrieve all the similar samples, whereas the k-NN query always retrieves a fixed number of samples. Furthermore, in the k-NN query the whole database has to be browsed before any samples can be retrieved, but in the ϵ-range query the samples can be retrieved already during the query processing. On the other hand, finding the threshold ϵ in the ϵ-range query is a complex task, and it might require estimating all the distances between database samples before the actual query. One possibility for estimating the threshold was suggested by Kashino et al. [21]. They determined the threshold as $\epsilon = \mu + c\sigma$, where $\mu$ is the mean, $\sigma$ is the standard deviation of all distances, and $c$ is an empirically determined constant.
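
The difference between the two query types can be sketched as follows. The snippet assumes that the distances from the example to all database samples are already available in a list; the function names and the numbers are illustrative only.

```python
def knn_query(distances, k):
    """Indices of the k database samples with the smallest distances."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    return order[:k]


def range_query(distances, epsilon):
    """Indices of all database samples closer than the threshold epsilon."""
    return [i for i, d in enumerate(distances) if d < epsilon]


distances = [0.9, 0.1, 0.4, 2.5, 0.3]
nearest = knn_query(distances, k=2)            # always exactly k samples
within = range_query(distances, epsilon=0.5)   # size depends on the data
```

Note that `range_query` can emit a sample as soon as its distance is known, whereas `knn_query` needs all distances before it can sort them.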

3. Distribution Based Distance Measures

The distance between the pdfs of feature vectors has been observed to be a good similarity measure [7, 16–18]: the smaller the distance, the more similar the signals are. The most commonly used audio features are continuous valued; thus, distance measures for continuous probability distributions are required. A fundamental problem when using continuous-valued features is that the empirical pdf cannot be represented as a histogram of samples, but has to be approximated by a model.

We model the pdfs using GMMs or HMMs and then calculate the distance between samples from the model parameters. GMM for the features is explained in Section 3.1, and Section 3.2 proposes a method for calculating the Euclidean distance between full-covariance GMMs. Section 3.3 presents methods for approximating the Kullback-Leibler divergence between GMMs. Section 3.4 presents the likelihood ratio test based similarity measure, which is then extended for HMMs. The section also shows the connection of the methods to likelihood-ratio test and maximum likelihood classification.

3.1. Gaussian Mixture Model for the Features

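GMMs are commonly used to model continuous pdfs, since they can flexibly approximate arbitrary distributions. A GMM for a feature vector $x$ is defined as
$$p(x) = \sum_{i=1}^{I} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i), \tag{3}$$
where $w_i$ is the weight of the $i$th Gaussian component, $I$ is the number of components, and
$$\mathcal{N}(x; \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{N/2} |\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right) \tag{4}$$

is the multivariate normal distribution with mean vector $\mu_i$ and covariance matrix $\Sigma_i$. $N$ is the dimensionality of the feature vector. The weights $w_i$ are nonnegative and sum to unity. The distribution of the $i$th component of GMM $A$ is referred to as $p_{A,i}(x)$.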

The similarity is measured between two signals, both of which are divided into short (e.g., 40 ms) frames, and a feature vector is extracted in each frame. $A = [a_1, \ldots, a_{T_A}]$ and $B = [b_1, \ldots, b_{T_B}]$ denote the feature sequence matrices of the two signals, where $T_A$ and $T_B$ are the numbers of frames in signals $A$ and $B$, respectively. Here we do not restrict ourselves to a certain set of features. An example of a possible set of features is given in Section 2.1.

For the two observation sequences $A$ and $B$, the parameters of two GMMs are estimated using the expectation maximization (EM) algorithm [24]. Let us denote the resulting pdfs of signals $A$ and $B$ by $p_A(x)$ and $p_B(x)$, respectively. $I$ and $J$ are the numbers of Gaussian components, and $w_i$ and $v_j$ are the weights of the $i$th component in GMM $A$ and the $j$th component in GMM $B$, respectively.

3.2. Euclidean Distance between GMMs

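The squared Euclidean distance $e$ between two distributions $p_A(x)$ and $p_B(x)$ can be calculated in closed form. In [7] we derived the calculations for diagonal-covariance GMMs, and here we extend the method to full-covariance GMMs.

The Euclidean distance is obtained by integrating the squared difference over the whole feature space:
$$e = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left[p_A(x) - p_B(x)\right]^2 dx_1 \cdots dx_N, \tag{5}$$
where $x_n$ denotes the $n$th feature. To simplify the notation, we rewrite the above multiple integral as
$$e = \int \left[p_A(x) - p_B(x)\right]^2 dx. \tag{6}$$
By writing the pdfs explicitly as weighted sums of Gaussians, the above equals
$$e = \int \left[\sum_{i=1}^{I} w_i \, p_{A,i}(x) - \sum_{j=1}^{J} v_j \, p_{B,j}(x)\right]^2 dx. \tag{7}$$
The squared distance (5) can be written as $e = e_{AA} + e_{BB} - 2e_{AB}$, where the three terms are defined as
$$\begin{aligned}
e_{AA} &= \sum_{i=1}^{I} \sum_{j=1}^{I} w_i w_j \int p_{A,i}(x) \, p_{A,j}(x) \, dx, \\
e_{BB} &= \sum_{i=1}^{J} \sum_{j=1}^{J} v_i v_j \int p_{B,i}(x) \, p_{B,j}(x) \, dx, \\
e_{AB} &= \sum_{i=1}^{I} \sum_{j=1}^{J} w_i v_j \int p_{A,i}(x) \, p_{B,j}(x) \, dx.
\end{aligned} \tag{8}$$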

All the above terms are weighted sums of definite integrals of the product of two normal distributions. The integrals can be solved in closed form as shown in the appendix.

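Let us denote the integral of the product of the $i$th component of GMM $A$ and the $j$th component of GMM $B$ by
$$Q_{i,j,A,B} = \int p_{A,i}(x) \, p_{B,j}(x) \, dx. \tag{9}$$
The values for the terms $e_{AA}$, $e_{BB}$, and $e_{AB}$ in (8) can now be calculated as
$$\begin{aligned}
e_{AA} &= \sum_{i=1}^{I} \sum_{j=1}^{I} w_i w_j \, Q_{i,j,A,A}, \\
e_{BB} &= \sum_{i=1}^{J} \sum_{j=1}^{J} v_i v_j \, Q_{i,j,B,B}, \\
e_{AB} &= \sum_{i=1}^{I} \sum_{j=1}^{J} w_i v_j \, Q_{i,j,A,B}.
\end{aligned} \tag{10}$$

Finally, the squared Euclidean distance is $e = e_{AA} + e_{BB} - 2e_{AB}$.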
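
The closed-form computation above can be sketched in Python. This is not the full-covariance derivation of the appendix: the sketch assumes diagonal covariances (as used in the experiments of Section 4) and relies on the standard identity that the integral of the product of two Gaussians equals a Gaussian density with the summed covariances, evaluated at the difference of the means. The GMM representation as a list of `(weight, means, variances)` tuples and all function names are hypothetical.

```python
import math


def gauss_product_integral(mu1, var1, mu2, var2):
    """Integral of the product of two diagonal-covariance Gaussians."""
    q = 1.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        s = v1 + v2  # the variances add up in the closed form
        q *= math.exp(-((m1 - m2) ** 2) / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)
    return q


def cross_term(gmm_x, gmm_y):
    """Weighted sum of product integrals over all component pairs."""
    return sum(
        wx * wy * gauss_product_integral(mx, vx, my, vy)
        for wx, mx, vx in gmm_x
        for wy, my, vy in gmm_y
    )


def squared_euclidean(gmm_a, gmm_b):
    """Squared Euclidean distance between two GMM pdfs."""
    return cross_term(gmm_a, gmm_a) + cross_term(gmm_b, gmm_b) - 2.0 * cross_term(gmm_a, gmm_b)


# Each GMM is a list of (weight, means, variances) tuples.
gmm_a = [(1.0, [0.0], [1.0])]
gmm_b = [(1.0, [2.0], [1.0])]
e_ab = squared_euclidean(gmm_a, gmm_b)
e_aa = squared_euclidean(gmm_a, gmm_a)
```

The cost grows with the product of the component counts of the two models, and no access to the original feature vectors is needed once the GMMs have been trained.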

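We observe that the Euclidean distance between two Gaussians with means $\mu_A$ and $\mu_B$ and the same covariance matrix $\Sigma$ is equal to the Mahalanobis distance (1), up to a monotonic function
$$e = \frac{1}{2^{N-1} \pi^{N/2} |\Sigma|^{1/2}} \left[1 - \exp\left(-\frac{1}{4}(\mu_A - \mu_B)^T \Sigma^{-1} (\mu_A - \mu_B)\right)\right], \tag{11}$$

which preserves the order of samples when the distance is used in similarity measurement.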

3.3. Kullback-Leibler Divergence

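The Kullback-Leibler (KL) divergence is an information-theoretically motivated measure between two probability distributions. The KL divergence between two distributions $p_A(x)$ and $p_B(x)$ is defined as
$$\mathrm{KL}(p_A \,\|\, p_B) = \int p_A(x) \log\frac{p_A(x)}{p_B(x)} \, dx, \tag{12}$$

which can be symmetrized by adding the term $\mathrm{KL}(p_B \,\|\, p_A)$.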

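The KL divergence between two Gaussian distributions [25] with means $\mu_A$ and $\mu_B$ and covariances $\Sigma_A$ and $\Sigma_B$ is
$$\mathrm{KL}(p_A \,\|\, p_B) = \frac{1}{2}\left[\log\frac{|\Sigma_B|}{|\Sigma_A|} + \mathrm{tr}\left(\Sigma_B^{-1}\Sigma_A\right) + (\mu_A - \mu_B)^T \Sigma_B^{-1} (\mu_A - \mu_B) - N\right]. \tag{13}$$

For the KL divergence between GMMs which have several Gaussian components, there is no closed-form solution. Several approximations exist, many of which were tested by Hershey and Olsen [26]. They found that the variational approximation, the Goldberger approximation, and Monte Carlo sampling produced good results.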

3.3.1. KL Variational Approximation

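The variational approximation [26] of the KL divergence is given as
$$\mathrm{KL}_{\mathrm{var}}(p_A \,\|\, p_B) = \sum_{i=1}^{I} w_i \log\frac{\sum_{i'=1}^{I} w_{i'} \, e^{-\mathrm{KL}(p_{A,i} \| p_{A,i'})}}{\sum_{j=1}^{J} v_j \, e^{-\mathrm{KL}(p_{A,i} \| p_{B,j})}}, \tag{14}$$
where the KL divergences between the individual Gaussian components are calculated using (13).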

3.3.2. KL Goldberger's Approximation

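The Goldberger approximation [25] is given as
$$\mathrm{KL}_{\mathrm{Gold}}(p_A \,\|\, p_B) = \sum_{i=1}^{I} w_i \left[\mathrm{KL}(p_{A,i} \,\|\, p_{B,m(i)}) + \log\frac{w_i}{v_{m(i)}}\right], \tag{15}$$
where
$$m(i) = \arg\min_{j} \left[\mathrm{KL}(p_{A,i} \,\|\, p_{B,j}) - \log v_j\right]. \tag{16}$$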
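
The closed-form Gaussian KL divergence and the two GMM approximations above can be sketched as follows. The sketch assumes diagonal covariances and represents each GMM as a list of `(weight, means, variances)` tuples; these choices and the function names are illustrative assumptions, and the matching rule in `kl_goldberger` follows the usual matched-pair formulation of Goldberger's approximation.

```python
import math


def kl_gauss(mu_a, var_a, mu_b, var_b):
    """Closed-form KL divergence between two diagonal-covariance Gaussians."""
    return 0.5 * sum(
        math.log(vb / va) + va / vb + (ma - mb) ** 2 / vb - 1.0
        for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b)
    )


def kl_variational(gmm_a, gmm_b):
    """Variational approximation of KL(A || B) between two GMMs."""
    total = 0.0
    for w_i, mu_i, var_i in gmm_a:
        num = sum(w * math.exp(-kl_gauss(mu_i, var_i, mu, var)) for w, mu, var in gmm_a)
        den = sum(v * math.exp(-kl_gauss(mu_i, var_i, mu, var)) for v, mu, var in gmm_b)
        total += w_i * math.log(num / den)  # den may underflow for distant models
    return total


def kl_goldberger(gmm_a, gmm_b):
    """Matched-pair (Goldberger) approximation of KL(A || B)."""
    total = 0.0
    for w_i, mu_i, var_i in gmm_a:
        # Match component i of A with the component of B that minimizes
        # the component-wise KL divergence minus the log weight.
        kl_m, v_m = min(
            ((kl_gauss(mu_i, var_i, mu, var), v) for v, mu, var in gmm_b),
            key=lambda pair: pair[0] - math.log(pair[1]),
        )
        total += w_i * (kl_m + math.log(w_i / v_m))
    return total


gmm_a = [(0.3, [0.0], [1.0]), (0.7, [5.0], [1.0])]
gmm_b = [(0.5, [0.5], [1.0]), (0.5, [4.0], [2.0])]
sym_var = kl_variational(gmm_a, gmm_b) + kl_variational(gmm_b, gmm_a)
sym_gold = kl_goldberger(gmm_a, gmm_b) + kl_goldberger(gmm_b, gmm_a)
```

The last two lines form symmetric versions by adding the divergence in both directions, in the same way as the symmetrization of (12).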

3.3.3. Monte-Carlo Approximation

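The Monte Carlo approximation estimates (12) by
$$\mathrm{KL}_{\mathrm{MC}}(p_A \,\|\, p_B) = \frac{1}{T} \sum_{t=1}^{T} \log\frac{p_A(x_t)}{p_B(x_t)}, \tag{17}$$
where the $T$ random samples $x_t$ are drawn from distribution $p_A(x)$. An accurate approximation requires a large number of samples and is therefore computationally inefficient. In [18], we proposed to use the samples of the observation sequence $A$ that were used to train the distribution $p_A(x)$. We observe that the resulting empirical Kullback-Leibler divergence can be written as
$$\mathrm{KL}_{\mathrm{emp}}(A \,\|\, B) = \frac{1}{T_A} \log\frac{p_A(A)}{p_B(A)}. \tag{18}$$

Here $p_A(A)$ and $p_B(A)$ denote the product of frame-wise pdfs evaluated at the points of the argument $A$, that is, $p_A(A) = \prod_{t=1}^{T_A} p_A(a_t)$ and $p_B(A) = \prod_{t=1}^{T_A} p_B(a_t)$, respectively.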

3.4. Cross-Likelihood Ratio Test

The likelihood ratio test is widely used in speech clustering and segmentation (see, e.g., [16, 17, 27]) to measure the likelihood that two segments are spoken by the same speaker. The likelihood ratio test statistic is a ratio of the likelihoods of two hypotheses. The first assumes that two feature sequences $A$ and $B$ are generated by two separate models having pdfs $p_A(x)$ and $p_B(x)$, respectively. The second assumes that the sequences are generated by the same model having pdf $p_C(x)$. This results in the similarity measure
$$L(A, B) = \frac{p_A(A) \, p_B(B)}{p_C(A) \, p_C(B)}, \tag{19}$$

where $p_C(x)$ is a model trained using both $A$ and $B$.

A commonly used modification of the above is the cross-likelihood ratio test given as
$$C(A, B) = \frac{p_A(A) \, p_B(B)}{p_B(A) \, p_A(B)}. \tag{20}$$

Here the denominator measures the likelihood that signal $A$ is generated by model $p_B$ and signal $B$ is generated by model $p_A$, whereas the numerator acts as a normalization term which takes into account the complexity of both signals. The measure (20) is computationally less expensive to calculate than (19) because it does not require training a model for signal combinations, and therefore it has been used in many speaker segmentation studies (see, e.g., [16, 28, 29]). In our simulations it also produced better results than the likelihood ratio test. However, the distance measure still requires access to the original feature vectors, and therefore more storage space than the Euclidean distance or the KL divergence [30].

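By taking the logarithm of (20) we end up with a measure which, up to the normalization by the numbers of frames $T_A$ and $T_B$, is identical to the symmetric version of the empirical KL divergence (18), which is
$$\mathrm{KL}_{\mathrm{emp}}(A \,\|\, B) + \mathrm{KL}_{\mathrm{emp}}(B \,\|\, A) = \frac{1}{T_A} \log\frac{p_A(A)}{p_B(A)} + \frac{1}{T_B} \log\frac{p_B(B)}{p_A(B)}. \tag{21}$$

Reynolds et al. [27] denoted (21) as the symmetric Cross Entropy distance. The lower the above measure, the more similar $A$ and $B$ are.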
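
As an illustration of the frame-normalized log cross-likelihood ratio, the sketch below evaluates each signal's frames under its own model and under the other signal's model. It assumes diagonal-covariance GMMs stored as lists of `(weight, means, variances)` tuples that have already been trained; the function names and the toy data are hypothetical.

```python
import math


def gmm_logpdf(x, gmm):
    """Log density of one frame x under a diagonal-covariance GMM."""
    logs = []
    for w, mu, var in gmm:
        log_n = -0.5 * sum(
            math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
            for xi, m, v in zip(x, mu, var)
        )
        logs.append(math.log(w) + log_n)
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))


def log_likelihood(frames, gmm):
    """Sum of frame-wise log densities, i.e. the log of their product."""
    return sum(gmm_logpdf(x, gmm) for x in frames)


def symmetric_empirical_kl(frames_a, gmm_a, frames_b, gmm_b):
    """Frame-normalized log cross-likelihood ratio of two signals."""
    d_ab = (log_likelihood(frames_a, gmm_a) - log_likelihood(frames_a, gmm_b)) / len(frames_a)
    d_ba = (log_likelihood(frames_b, gmm_b) - log_likelihood(frames_b, gmm_a)) / len(frames_b)
    return d_ab + d_ba


gmm_a = [(1.0, [0.0], [1.0])]
gmm_b = [(1.0, [3.0], [1.0])]
frames_a = [[0.0], [0.1]]
frames_b = [[3.0], [2.9]]
d = symmetric_empirical_kl(frames_a, gmm_a, frames_b, gmm_b)
```

Unlike the Euclidean distance and the KL approximations, this measure needs the original frames of both signals at query time.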

The empirical KL divergence was derived here for GMMs, but in (19) and (20) we can also use HMMs to model the signals. An HMM extends the GMM by using multiple states, the emission probabilities of which are modeled by GMMs. A state indicator variable is allowed to move from a state to another at each frame. This is controlled by using state transition probabilities, allowing modeling of time-varying signals. The parameters of an HMM can also be estimated by using a special version of EM algorithm, the Baum-Welch algorithm [31]. In other applications, estimating the HMM parameters from an individual signal may require modifying the EM algorithm [32], but in our studies this was not found to be necessary since good results were obtained by the basic Baum-Welch algorithm. The value of the pdf parametrized by an HMM was here evaluated by the Viterbi algorithm, that is, we used only the most likely state transition sequence. The cross-likelihood test has been previously used with HMMs to cluster time-series data in [29]. An alternative HMM similarity measure was recently proposed by Hershey and Olsen [33] who derived a variational approximation for the Bhattacharyya divergence between HMMs.
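
The Viterbi scoring mentioned above, where only the most likely state sequence contributes to the likelihood, can be sketched as follows. The sketch takes the log emission densities of each frame in each state as given; in the system described here they would come from the state-wise GMMs. The function name and the toy probabilities are hypothetical.

```python
import math


def viterbi_log_score(log_init, log_trans, log_emis):
    """Log-likelihood of the single most likely state sequence.

    log_init[s]     : log initial probability of state s
    log_trans[r][s] : log transition probability from state r to state s
    log_emis[t][s]  : log emission density of frame t in state s
    """
    n_states = len(log_init)
    delta = [log_init[s] + log_emis[0][s] for s in range(n_states)]
    for t in range(1, len(log_emis)):
        # Best score of any path that ends in state s at frame t.
        delta = [
            max(delta[r] + log_trans[r][s] for r in range(n_states)) + log_emis[t][s]
            for s in range(n_states)
        ]
    return max(delta)


log = math.log
score = viterbi_log_score(
    [log(0.5), log(0.5)],
    [[log(0.9), log(0.1)], [log(0.1), log(0.9)]],
    [[log(0.8), log(0.2)], [log(0.8), log(0.2)]],
)
```

In the toy example the best path stays in the first state for both frames, so the score is the logarithm of 0.5 × 0.8 × 0.9 × 0.8.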

The measure (20) has a connection to maximum likelihood classification. If we consider each signal as an individual class , the maximum likelihood classification principle classifies an observation into the class having the highest conditional probability . If we assume that each class has the same prior probability, the likelihood of a class is . The likelihood can be divided by a normalization term without affecting the classification to obtain . In similarity measurement we do "two-way'' classification where the likelihood of signal belonging to class and the likelihood of signal belonging to class are multiplied. When each class is parametrized by model , this results to the measure (20).

4. Experiments

To evaluate the performance of the above similarity measures, they were tested in the query by example system described in Section 2. The simulations were made using an audio database which contained 1332 samples. The signals were manually annotated into 4 main categories and 17 subcategories. In the evaluation, samples falling into each category (main or subcategory depending on the evaluation metric) were considered to be similar. The categories and the number of samples in each category are listed in Table 1.
Table 1

Audio categories in our database and the number of samples in each category.

| Main category | Subcategory |
| --- | --- |
| Environmental (231) | Inside a car (151) |
| | In a restaurant (42) |
| | Road (38) |
| Music (620) | Jazz (264) |
| | Drums (56) |
| | Popular (249) |
| | Classical (51) |
| Sing (165) | Humming (52) |
| | Singing (60) |
| | Whistling (53) |
| Speech (316) | Speaker1 (50) |
| | Speaker2 (47) |
| | Speaker3 (44) |
| | Speaker4 (40) |
| | Speaker5 (47) |
| | Speaker6 (38) |
| | Speaker7 (50) |

Samples for the environmental main category were taken from the recordings used in [34]. The subcategories correspond to the car, restaurant, and road classes used in that study. The drum subcategory consists of acoustic drum sequences used by Paulus and Virtanen [35]. The rest of the music main category was taken from the RWC Music Database [36], the subcategories corresponding to the individual collections. The sing main category was taken from the Vox database presented in [37]. The speech samples are from the CMU Arctic speech database [38], and the subcategories correspond to individual speakers. The samples within categories were selected randomly, but the samples were screened by listening, and the samples having a significant amount of content from categories other than their own were discarded.

All the samples in our database were 10 seconds long. The speech samples in the Arctic database were 2–4 seconds long; thus, multiple samples from each speaker were concatenated so that 10-second samples were obtained. Original samples in the other source databases were longer than 10 seconds; thus, random 10-second excerpts were used. Before the feature extraction, all the samples were downsampled to 16 kHz.

4.1. Evaluation Procedure

One sample at a time was drawn from the database to serve as an example for a query, and the rest were considered as the database. The distance from the example to all the other samples in the database was calculated; thus, the total number of distance calculations in the test was $S(S-1)$, where $S$ is the number of samples in the database. Then the database samples having the shortest distance to the example were retrieved. Unless otherwise stated, the simulations here use the k-NN query where the number of retrieved samples is 10. A database sample was regarded as correctly retrieved if it was retrieved and was annotated in the same category as the example.

The results are presented here as average values of the recall and precision rates. Precision gives the proportion of correctly retrieved samples $c$ among all the retrieved samples $r$:
$$\mathrm{precision} = \frac{c}{r}. \tag{22}$$
Recall gives the proportion of the similar samples that was retrieved from the database:
$$\mathrm{recall} = \frac{c}{s}, \tag{23}$$

where $s$ is the number of samples in the database that are similar to the example. The recall is only used in the ϵ-range query. To clarify the results we also use a precision error rate, which is defined as $1 - \mathrm{precision}$.
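
The evaluation procedure can be sketched as a leave-one-out loop: every sample serves once as the example, the k nearest of the remaining samples are retrieved, and the precision is averaged over all examples. The sketch works with any distance function; the one-dimensional toy features, the labels, and the function name are hypothetical.

```python
def leave_one_out_precision(features, labels, distance, k):
    """Average k-NN precision when every sample serves once as the example."""
    precisions = []
    for q in range(len(features)):
        # All samples except the example form the database for this query.
        others = [i for i in range(len(features)) if i != q]
        others.sort(key=lambda i: distance(features[q], features[i]))
        retrieved = others[:k]
        correct = sum(1 for i in retrieved if labels[i] == labels[q])
        precisions.append(correct / len(retrieved))
    return sum(precisions) / len(precisions)


features = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
labels = ["a", "a", "a", "b", "b", "b"]


def abs_diff(x, y):
    return abs(x - y)


precision = leave_one_out_precision(features, labels, abs_diff, k=2)
precision_error = 1.0 - precision
```

With k larger than the number of similar samples in the database, the precision necessarily drops below one, which is why the results depend on the chosen k.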

4.2. Tested Methods

A set of the similarity measures explained in Section 2.2 and the novel ones proposed in Section 3 were used in the evaluation. The measures and their acronyms in parentheses are as follows.

(i) Distance between histograms (Histogram). The number of quantization levels was 8 for the whole database and the quantization levels were estimated using the Linde-Buzo-Gray (LBG) vector quantization algorithm [22]. The distance metric was the -norm.

(ii) Mahalanobis distance, calculated as in (1) (Mahalanobis).

(iii) Bhattacharyya distance [39] between single Gaussians (Bhattacharyya).

(iv) KL divergence between two normal distributions (KL-Gaussian).

(v) Goldberger approximation of the KL divergence between multiple component GMMs (KL-Goldberger).

(vi) Variational approximation of the KL divergence between multiple component GMMs (KL-variational).

(vii) Monte Carlo approximation of the KL divergence between multiple component GMMs using 10000 random samples (KL-Monte Carlo).

(viii) Euclidean distance between GMMs (Euclidean).

(ix) Cross-likelihood ratio test using GMMs (CLRT-GMM).

(x) Cross-likelihood ratio test using HMMs (CLRT-HMM).

     

For GMMs and HMMs, diagonal covariance matrices were used and the number of Gaussians was 12 unless otherwise stated later. In HMMs the number of states was 3 and the number of Gaussians per state was 4. We also tested the correlation between pdfs parametrized by GMMs (10), which resulted in significantly worse results than Euclidean distance. The KL divergence approximations used here were all symmetric. We also tested a version of the Euclidean distance where each GMM was normalized so that its distance from zero is unity, but this did not improve the results and was therefore not used in the tests.

All the systems use the feature set described in Section 2.1. Features were extracted in 46 ms frames. After the extraction, each feature was normalized to have zero mean and unity variance over the whole database.

We observed that low-variance Gaussians may dominate the distance measures. To prevent this, we restricted the variances of each Gaussian above a fixed minimum level. We used threshold 0.01 in approximations of KL divergence, and threshold 1 in Euclidean distance and cross-likelihood ratio test.
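
The two preprocessing steps described above, feature normalization and variance flooring, can be sketched as follows. The sketch assumes that the frames of the whole database are pooled into one list and that no feature is constant (which would give a zero standard deviation); the function names are hypothetical, while the floor values 0.01 and 1 are the thresholds stated in the text.

```python
import statistics


def normalize_features(frames):
    """Scale every feature to zero mean and unit variance over all frames."""
    columns = list(zip(*frames))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(frame, means, stds)] for frame in frames]


def floor_variances(variances, minimum):
    """Keep each Gaussian variance at or above a fixed minimum level."""
    return [max(v, minimum) for v in variances]


frames = [[1.0, 10.0], [3.0, 30.0]]
normalized = normalize_features(frames)

variances = [0.001, 0.5, 2.0]
floored_kl = floor_variances(variances, 0.01)   # threshold used with KL
floored_euc = floor_variances(variances, 1.0)   # threshold used with Euclidean/CLRT
```

Flooring prevents a very narrow Gaussian from producing extreme density values that would dominate the distance.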

4.3. Experimental Results

Table 2 presents the results for different similarity estimation methods in k-NN query, where the number of retrieved samples is 10. The results are precision error rates for the main categories and the subcategories. The confidence interval for subcategories with 95% confidence level is around 0.9% and for main categories 0.3%. The cross-likelihood ratio test using GMMs and KL approximations give the most accurate results for the subcategories. The precision error for these methods was 6.0%. For the main categories cross-likelihood ratio test using GMMs gives 0.5% precision error followed by Euclidean distance having 1.0% precision error.
Table 2

The average precision error rates for k-NN query for main and subcategories. The number of retrieved samples was 10.

| Method | Main categories | Subcategories | Comp. time |
| --- | --- | --- | --- |
| Histogram | 7.7% | 24.3% | 0.41 ms |
| Mahalanobis | 1.2% | 6.8% | 0.013 ms |
| Bhattacharyya | 1.3% | 7.9% | 6.5 ms |
| KL-Gaussian | 5.0% | 14.1% | 0.19 ms |
| KL-Goldberger, GMM (12 comp.) | 1.1% | 6.0% | 9.30 ms |
| KL-variational, GMM (12 comp.) | 1.1% | 6.0% | 20.2 ms |
| KL-Monte Carlo, GMM (12 comp.) | 1.2% | 8.6% | 510 ms |
| Euclidean dist. GMM (12 comp.) | 1.0% | 6.5% | 0.87 ms |
| CLRT-GMM (12 comp.) | 0.5% | 6.0% | 16.6 ms |
| CLRT-HMM (3 state, 4 comp.) | 1.1% | 8.5% | 39.3 ms |

The histogram method and the KL divergence between single Gaussians performed clearly worse than the measures based on GMMs. However, the Mahalanobis distance also gave competitive results. Since the cross-likelihood ratio test (empirical KL divergence) provided the best results, we can assume that the original samples contain information which is not captured by the GMMs.

Table 2 also shows the computational time of a single distance calculation for each measure. The Euclidean distance is over 10 times faster than Goldberger's approximation, which is the second fastest of the measures that use multiple Gaussian components. Since the Euclidean distance also provides one of the lowest precision errors, it is well suited to practical applications. However, it should be noted that different distance measures require varying amounts of offline preprocessing, for example, generating different kinds of signal models and histograms. Also, further optimization of the algorithms might slightly accelerate some of the measures.

Figure 2 presents the precision of k-NN query for different methods when k was varied from 1 to 35. The larger the area below the curve, the better the method is. Here we can see that the cross-likelihood ratio test using GMMs gave the best results, followed closely by Euclidean distance and Mahalanobis distance.
Figure 2

Results of the different methods for subcategories when the k is changed from 1 to 35 in k-NN query.

Figure 3 illustrates precision and recall when ϵ is changed in the ϵ-range query. Here we can see that over most of the curve, the cross-likelihood ratio test using GMMs gives the highest precision. However, when a small number of signals is retrieved (low recall/high precision), the approximations of the KL divergence, the Euclidean distance, and the Mahalanobis distance produce the highest accuracy.
Figure 3

Results of the different methods in ϵ-range query for subcategories when ϵ is changed.

In Figure 4, the distance measures are tested with different numbers of GMM components in the k-NN query when k is 10. Generally, the accuracy of all the methods increases when the number of components is increased. However, beyond 12 GMM components there is no significant change. Thus, 12-component GMMs are used in our other simulations. Pampalk [40] used the cross-likelihood ratio test in music similarity, and the results using 1-component GMMs were similar to those using 30 components.
Figure 4

Results of the Euclidean distance of pdfs for subcategories when the number of GMM components is changed in k-NN query.

Table 3 is a confusion matrix of the query by example when the Euclidean distance was used and the 10 nearest samples were retrieved. The values in the matrix are the percentages of the signals retrieved from each category (rows) when the example was from a given category (columns). Most of the confusion occurred between the music subcategories, especially between jazz and popular music. However, these categories are close to each other also from the human perspective. On the other hand, the speakers were separated from each other almost perfectly. The confusion matrix is presented here only for the Euclidean distance, but for the other methods the matrices are rather similar.
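Table 3

Confusion matrix for Euclidean distance when 10 nearest neighbors were retrieved. The values in the matrix are the percentage of the signals retrieved from each category (rows) when the example was from the certain category (columns).

| | Inside a car | In a restaurant | Road | Jazz | Drums | Popular | Classical | Humming | Singing | Whistling | Speaker1 | Speaker2 | Speaker3 | Speaker4 | Speaker5 | Speaker6 | Speaker7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Inside a car | 99.5 | 1.2 | 4.7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| In a restaurant | 0 | 98.8 | 2.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Road | 0.2 | 0 | 92.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Jazz | 0.1 | 0 | 0 | 90.2 | 0 | 11.6 | 5.9 | 0.2 | 0.2 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Drums | 0 | 0 | 0 | 0 | 93.6 | 0.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Popular | 0 | 0 | 0.3 | 8.4 | 0.2 | 87.8 | 13.5 | 0 | 0.2 | 0.9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Classical | 0 | 0 | 0 | 0.5 | 0 | 0.5 | 78.0 | 0 | 0 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Humming | 0 | 0 | 0 | 0.4 | 0 | 0 | 0.6 | 90.8 | 4.0 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Singing | 0 | 0 | 0 | 0.2 | 1.1 | 0 | 0.4 | 3.7 | 93.5 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Whistling | 0 | 0 | 0 | 0.1 | 0.7 | 0 | 0.4 | 0 | 0.5 | 97.9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Speaker1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.8 | 0.4 | 0 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 |
| Speaker2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 99.9 | 2.1 | 0 | 0 | 0 | 0 |
| Speaker3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1 | 97.7 | 0 | 0 | 0.3 | 0 |
| Speaker4 | 0.2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 0 | 0 |
| Speaker5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.3 | 1.8 | 0 | 0 | 0 | 0 | 0 | 100 | 0 | 0 |
| Speaker6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2 | 0 | 0 | 99.7 | 0 |
| Speaker7 | 0 | 0 | 0 | 0 | 4.5 | 0 | 0.4 | 0 | 0 | 1.3 | 0 | 0 | 0 | 0 | 0 | 0 | 100 |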
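A confusion matrix of the kind shown in Table 3 could be accumulated from k-NN query results roughly as follows. This is a sketch under assumptions: the helper names and the toy distance matrix are hypothetical, and excluding the example itself from its own result list is an assumption about the protocol, not something stated in the text.

```python
def knn_query(distance_row, k, exclude=None):
    """Indices of the k database samples with the smallest distance to the example."""
    candidates = [i for i in range(len(distance_row)) if i != exclude]
    candidates.sort(key=lambda i: distance_row[i])
    return candidates[:k]


def confusion_matrix(dist, labels, classes, k):
    """Entry [r][c] is the percentage of retrieved samples belonging to class r
    when the example belongs to class c (rows: retrieved, columns: example)."""
    index = {name: n for n, name in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for q, row in enumerate(dist):
        # Assumption: the example is left out of its own result list.
        for j in knn_query(row, k, exclude=q):
            counts[index[labels[j]]][index[labels[q]]] += 1
    result = [[0.0] * len(classes) for _ in classes]
    for c in range(len(classes)):
        total = sum(counts[r][c] for r in range(len(classes)))
        for r in range(len(classes)):
            result[r][c] = 100.0 * counts[r][c] / total if total else 0.0
    return result


# Toy example (hypothetical): pairwise distances between four samples.
labels = ["speech", "speech", "music", "music"]
dist = [
    [0, 1, 5, 6],
    [1, 0, 5, 4],
    [5, 5, 0, 1],
    [6, 4, 1, 0],
]
matrix = confusion_matrix(dist, labels, ["speech", "music"], k=1)
print(matrix)
```

Each column is normalized separately, so the entries of a column sum to 100 percent, matching the layout of Table 3.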

5. Discussion

The above results show that the proposed similarity measures perform well in query by example with the database used. The good performance is partly explained by the good quality of the database: the signals within a class are usually significantly different from those in other classes, and they do not contain acoustic interference, which would make the problem harder.

Even though the methods are intended for generic audio similarity, it is likely that as such they are restricted only to relatively low-level similarities. For example, it is very unlikely that the measure will be able to measure the similarity of speech samples by their topic. This is naturally affected by the features. In our study the features measure mostly the spectral characteristics of the signals, and therefore the methods are able to find spectrally similar signals, for example samples from the same speaker or the same musical instrument. It is also likely that the measures will be affected by the recording setup which affects the spectral characteristics.

A single audio recording may contain different sound sources. Depending on the situation, a human can interpret a mixture consisting of several sources as a whole or as separate sound sources. For example, in music all the instruments contribute to the rhythm and harmonicity, but one can also concentrate on and identify single instruments. Furthermore, a long recording can consist of sequential entities which differ significantly from each other. In practice this requires processing a recording in smaller entities. For example, Eronen et al. [41] segmented the input signal and applied supervised classification on each segment.

For practical applications, the speed of operations is an essential factor. The computational complexity of the proposed methods is relatively low. With the tested GMM distances, the distance calculation between two 10-second samples takes from 0.87 ms (Euclidean distance) to 510 ms (Monte Carlo approximation of KL divergence), depending on the measure. The algorithms were implemented in Matlab and the simulations were run on a 3.0 GHz PC. The estimation of GMM or HMM parameters is also time consuming, but the models need to be estimated only once for each sample.
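The difference in cost between the two extremes can be seen from their structure. The sketch below, which is an illustration rather than the authors' implementation, computes the squared Euclidean (L2) distance between two diagonal-covariance GMMs in closed form, using the fact that the integral of a product of two Gaussians is itself a Gaussian evaluated at the difference of the means, and a plain Monte Carlo estimate of the KL divergence. The GMM representation as a list of (weight, mean, variance) tuples, the function names, and the sample count are assumptions; any normalization the paper applies to the Euclidean distance is not reproduced here.

```python
import math
import random


def gauss_overlap(mu1, var1, mu2, var2):
    """Integral of the product of two diagonal Gaussians: N(mu1; mu2, var1 + var2)."""
    log_val = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = v1 + v2
        log_val += -0.5 * (math.log(2.0 * math.pi * v) + (m1 - m2) ** 2 / v)
    return math.exp(log_val)


def cross_term(gmm_a, gmm_b):
    """Integral of p_a(x) * p_b(x) over x, summed over all component pairs."""
    return sum(wa * wb * gauss_overlap(ma, va, mb, vb)
               for wa, ma, va in gmm_a for wb, mb, vb in gmm_b)


def squared_euclidean(gmm_a, gmm_b):
    """Closed-form integral of (p_a(x) - p_b(x))^2; no sampling is needed."""
    return (cross_term(gmm_a, gmm_a) - 2.0 * cross_term(gmm_a, gmm_b)
            + cross_term(gmm_b, gmm_b))


def log_pdf(gmm, x):
    """Log-density of a diagonal GMM at x, using the log-sum-exp trick."""
    logs = []
    for w, mu, var in gmm:
        s = math.log(w)
        for xi, m, v in zip(x, mu, var):
            s += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        logs.append(s)
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))


def kl_monte_carlo(gmm_p, gmm_q, n_samples=2000, seed=0):
    """Estimate KL(p || q) as the average of log p(x) - log q(x) over samples from p."""
    rng = random.Random(seed)
    weights = [w for w, _, _ in gmm_p]
    total = 0.0
    for _ in range(n_samples):
        _, mu, var = rng.choices(gmm_p, weights=weights, k=1)[0]
        x = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu, var)]
        total += log_pdf(gmm_p, x) - log_pdf(gmm_q, x)
    return total / n_samples


# Toy one-dimensional, single-component models (hypothetical values).
p = [(1.0, [0.0], [1.0])]
q = [(1.0, [1.0], [1.0])]
print(squared_euclidean(p, q), kl_monte_carlo(p, q))
```

The closed form needs one pass over the component pairs, whereas the Monte Carlo estimate evaluates both mixtures at every drawn sample, which is consistent with the large gap in the timings reported above.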

When a search is performed in a very large database, it becomes too expensive to go through the whole database and calculate the distance between the example and every database sample. One solution proposed for this problem is to cluster the database prior to the search. In the search phase it is then possible to restrict the search to only a few clusters [42].
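The general idea of restricting a search to a few clusters can be sketched as follows. This is a generic illustration and not the key-sample distance transformation of [42]: the cluster representatives, the function names, and the one-dimensional toy data with an absolute-difference distance are all assumptions chosen to keep the example short.

```python
def build_clusters(items, representatives, distance):
    """Assign every database item to its nearest cluster representative (done offline)."""
    clusters = {r: [] for r in range(len(representatives))}
    for idx, item in enumerate(items):
        nearest = min(range(len(representatives)),
                      key=lambda r: distance(item, representatives[r]))
        clusters[nearest].append(idx)
    return clusters


def clustered_query(example, items, representatives, clusters, distance,
                    n_clusters=1, k=10):
    """Search only the n_clusters clusters closest to the example, then rank by distance."""
    ranked = sorted(range(len(representatives)),
                    key=lambda r: distance(example, representatives[r]))
    candidates = [i for r in ranked[:n_clusters] for i in clusters[r]]
    candidates.sort(key=lambda i: distance(example, items[i]))
    return candidates[:k]


def abs_distance(a, b):
    """Stand-in for a model distance; in the paper this would be a distance between pdfs."""
    return abs(a - b)


# Toy database (hypothetical): scalars in place of GMMs.
items = [0.0, 0.2, 0.1, 5.0, 5.3, 9.8, 10.1]
representatives = [0.0, 5.0, 10.0]
clusters = build_clusters(items, representatives, abs_distance)
result = clustered_query(5.1, items, representatives, clusters, abs_distance,
                         n_clusters=1, k=2)
print(result)
```

With one cluster searched, only two of the seven items are compared against the example. The price is that true neighbors lying in an unsearched cluster are missed, so `n_clusters` trades accuracy for speed.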

The way the GMMs are trained has an effect on the accuracy of the similarity estimation. We also tested the Parzen-window approach [43, pages 164–174], which assigns a GMM component with fixed variance to each observation, so that the number of components $I$ equals the number of frames, the mean $\mu_i$ is the feature vector within frame $i$, the covariance $\Sigma_i$ is fixed, and $w_i = 1/I$. However, the results were quite similar to those obtained with the EM algorithm, and the Parzen-window method is not very practical, since its computational complexity is very high compared to that of the GMMs obtained with the EM algorithm. The Euclidean distance was also calculated between full-covariance GMMs. However, the results with diagonal covariances were clearly better. A major problem with full-covariance GMMs is that within a short signal (430 frames in our simulations) the features often exhibit multicollinearity, and therefore the covariance matrices easily become singular, making robust estimation of full covariance matrices difficult.
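A Parzen-window model of this kind is simple to construct, which also makes its cost visible. The sketch below assumes an isotropic fixed variance and a list-of-tuples GMM representation; both, along with the function names and toy frames, are illustrative choices rather than details taken from the paper.

```python
import math


def parzen_gmm(frames, variance):
    """One equally weighted Gaussian per frame, centered on that frame's feature vector."""
    n = len(frames)
    return [(1.0 / n, list(x), [variance] * len(x)) for x in frames]


def pdf(gmm, x):
    """Density of a diagonal-covariance GMM at point x."""
    total = 0.0
    for w, mu, var in gmm:
        log_n = sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
                    for xi, m, v in zip(x, mu, var))
        total += w * math.exp(log_n)
    return total


# Toy example (hypothetical): two one-dimensional frames.
frames = [[0.0], [2.0]]
model = parzen_gmm(frames, variance=1.0)
print(len(model), pdf(model, [1.0]))
```

Because the number of components equals the number of frames, a pairwise distance between two 430-frame samples involves on the order of 430 × 430 component pairs, compared with 12 × 12 for the 12-component GMMs trained with the EM algorithm.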

6. Conclusions

This paper proposed a query by example system for generic audio. We measure the similarity between two audio samples by the distance of the pdfs of their frame-wise feature vectors. Based on the simulation results, we conclude that the distance between pdfs can be used as an accurate similarity estimate for audio signals. Estimating the pdfs of continuous-valued features cannot be done exactly, but the use of GMMs or HMMs turned out to be a good solution.

The simulations revealed that the cross-likelihood ratio test between GMMs and the Euclidean distance gave the most accurate results in query by example. Of the methods based on simpler statistics, the Mahalanobis distance gave quite competitive results. However, none of the tested methods was clearly the best, and thus the similarity measure should be chosen according to the application at hand.

Declarations

Authors’ Affiliations

(1)
Department of Signal Processing, Tampere University of Technology

References

  1. Song J, Bae S-Y, Yoon K: Query by humming: matching humming query to polyphonic audio. Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '02), August 2002, Lausanne, Switzerland 329-332.
  2. Lu L, You H, Zhang H-J: A new approach to query by humming in music retrieval. Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '01), August 2001, Tokyo, Japan 595-598.
  3. Kapur A, Benning M, Tzanetakis G: Query-by-beat-boxing: music retrieval for the DJ. Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR '04), October 2004, Barcelona, Spain.
  4. Kung S-Y, Hwang J-N: Neural networks for intelligent multimedia processing. Proceedings of the IEEE 1998, 86(6):1244-1271. 10.1109/5.687838
  5. Pikrakis A, Theodoridis S, Kamarotos D: Classification of musical patterns using variable duration hidden Markov models. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(5):1795-1807.
  6. Helén M, Lahti T: Query by example methods for audio signals. Proceedings of the 7th Nordic Signal Processing Symposium (NORSIG '06), June 2006, Reykjavik, Iceland 302-305.
  7. Helén M, Virtanen T: Query by example of audio signals using Euclidean distance between Gaussian mixture models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), April 2007, Honolulu, Hawaii, USA 1: 225-228.
  8. Kiranyaz S, Qureshi AF, Gabbouj M: A generic audio classification and segmentation approach for multimedia indexing and retrieval. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(3):1062-1081.
  9. Assfalg J, Del Bimbo A, Pala P: Image retrieval by positive and negative examples. Proceedings of the International Conference on Pattern Recognition (ICPR '00), September 2000, Barcelona, Spain 15: 267-270.
  10. Aggarwal G, Dubey P, Ghosal S, Kulshreshtha A, Sarkar A: iPURE: perceptual and user-friendly retrieval of images. Proceedings of IEEE International Conference on Multi-Media and Expo (ICME '00), July-August 2000, New York, NY, USA 693-696.
  11. Aucouturier J-J, Pachet F: Improving timbre similarity: how high is the sky? Journal of Negative Results in Speech and Audio Sciences 2004, 1(1):1-13.
  12. Zwicker E, Fastl H: Psychoacoustics: Facts and Models. Springer, Berlin, Germany; 1999.
  13. Mandel M, Ellis D: Song-level features and support vector machines for music classification. Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR '05), September 2005, London, UK.
  14. Burred JJ, Lerch A: A hierarchical approach to automatic musical genre classification. Proceedings of the 6th Conference on Digital Audio Effects (DAFx '03), September 2003, London, UK.
  15. Uhle C, Dittmar C, Sporer T: Extraction of drum tracks from polyphonic music using independent subspace analysis. Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA '03), April 2003, Nara, Japan.
  16. Stadelmann T, Freisleben B: Fast and robust speaker clustering using the earth mover's distance and Mixmax models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), May 2006, Toulouse, France 1: 989-992.
  17. Meignier S, Bonastre J, Magrin-Chagnolleau I: Speaker utterances tying among speaker segmented audio documents using hierarchical classification: towards speaker indexing of audio databases. Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP '02), September 2002, Denver, Colo, USA 577-580.
  18. Virtanen T, Helén M: Probabilistic model based similarity measures for audio query-by-example. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '07), October 2007, New Paltz, NY, USA 82-85.
  19. Zhou B, Hansen JHL: Unsupervised audio stream segmentation and clustering via the Bayesian information criterion. Proceedings of the International Conference on Spoken Language Processing (ICSLP '00), October 2000, Beijing, China 3: 714-717.
  20. Chen S, Gopalakrishnan P: Speaker, environment and channel change detection and clustering via the Bayesian information criterion. Proceedings of the Broadcast News Transcription and Understanding Workshop (DARPA '98), February 1998, Lansdowne, Va, USA.
  21. Kashino K, Kurozumi T, Murase H: A quick search method for audio and video signals based on histogram pruning. IEEE Transactions on Multimedia 2003, 5(3):348-357. 10.1109/TMM.2003.813281
  22. Linde Y, Buzo A, Gray R: An algorithm for vector quantizer design. IEEE Transactions on Communications Systems 1980, 28(1):84-95. 10.1109/TCOM.1980.1094577
  23. Ferhatosmanoglu H, Tuncel E, Agrawal D, El Abbadi A: Approximate nearest neighbor searching in multimedia databases. Proceedings of the 17th IEEE International Conference on Data Engineering (ICDE '01), April 2001, Heidelberg, Germany 503-511.
  24. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 1977, 39(1):1-38.
  25. Goldberger J, Gordon S, Greenspan H: An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures. Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV '03), October 2003, Nice, France 1: 487-493.
  26. Hershey JR, Olsen PA: Approximating the Kullback Leibler divergence between Gaussian mixture models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), April 2007, Honolulu, Hawaii, USA 4: 317-320.
  27. Reynolds DA, Singer E, Carlson BA, O'Leary GC, McLaughlin JJ, Zissman MA: Blind clustering of speech utterances based on speaker and language characteristics. Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP '98), December 1998, Sydney, Australia 3193-3196.
  28. Solomonoff A, Mielke A, Schmidt M, Gish H: Clustering speakers by their voices. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '98), May 1998, Seattle, Wash, USA 2: 757-760.
  29. Yin J, Yang Q: Integrating hidden Markov models and spectral analysis for sensory time series clustering. Proceedings of the IEEE International Conference on Data Mining (ICDM '05), November 2005, Houston, Tex, USA 506-513.
  30. Aucouturier J-J: Ten experiments on the modelling of polyphonic timbre, Ph.D. dissertation. University of Paris, Paris, France; 2006.
  31. Baum LE, Petrie T, Soules G, Weiss N: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics 1970, 41(1):164-171. 10.1214/aoms/1177697196
  32. Laurila K: Noise robust speech recognition with state duration constraints. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '97), April 1997, Munich, Germany 2: 871-874.
  33. Hershey JR, Olsen PA: Variational Bhattacharyya divergence for hidden Markov models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), March 2008, Las Vegas, Nev, USA 4557-4560.
  34. Peltonen V, Tuomi J, Klapuri A, Huopaniemi J, Sorsa T: Computational auditory scene recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '02), May 2002, Orlando, Fla, USA 2: 1941-1944.
  35. Paulus J, Virtanen T: Drum transcription with non-negative spectrogram factorisation. Proceedings of the 13th European Signal Processing Conference (EUSIPCO '05), September 2005, Antalya, Turkey.
  36. Goto M, Hashiguchi H, Nishimura T, Oka R: RWC music database: popular, classical, and jazz music databases. Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR '02), October 2002, Paris, France.
  37. Viitaniemi T, Klapuri A, Eronen A: A probabilistic model for the transcription of single-voice melodies. Proceedings of the Finnish Signal Processing Symposium (FINSIG '03), May 2003, Tampere, Finland 59-63.
  38. Kominek J, Black A: The CMU ARCTIC speech databases. Proceedings of the 5th ISCA Speech Synthesis Workshop (SSW '04), June 2004, Pittsburgh, Pa, USA 223-224.
  39. Rahman MM, Bhattacharya P, Desai BC: Similarity searching in image retrieval with statistical distance measures and supervised learning. Proceedings of the 3rd International Conference on Advances in Pattern Recognition (ICAPR '05), August 2005, Bath, UK, Lecture Notes in Computer Science 3686: 315-324.
  40. Pampalk E: Computational models of music similarity and their applications in music information retrieval, Ph.D. dissertation. Technische Universität, Wien, Austria; 2006.
  41. Eronen AJ, Peltonen VT, Tuomi JT, et al.: Audio-based context recognition. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(1):321-329.
  42. Helén M, Lahti T: Query by example in large databases using key-sample distance transformation and clustering. Proceedings of the 3rd IEEE International Workshop on Multimedia Information Processing and Retrieval (MIPR '07), December 2007, Taichung, Taiwan 303-308.
  43. Duda RO, Hart PE, Stork DG: Pattern Classification. 2nd edition. John Wiley & Sons, New York, NY, USA; 2001.
  44. Ahrendt P: The multivariate Gaussian probability distribution. IMM, Technical University of Denmark, Bygning, Denmark; 2005.
  45. Gales MJF, Airey SS: Product of Gaussians for speech recognition. Computer Speech and Language 2006, 20(1):22-40. 10.1016/j.csl.2004.12.002

Copyright

© M. Helén and T. Virtanen. 2010

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.