
Analytical Features: A Knowledge-Based Approach to Audio Feature Generation

Abstract

We present a feature generation system designed to create audio features for supervised classification tasks. The main contribution to feature generation studies is the notion of analytical features (AFs), a construct designed to support the representation of knowledge about audio signal processing. We describe the most important aspects of AFs, in particular their dimensional type system, on which pattern-based random generators, heuristics, and rewriting rules are built. We show how AFs generalize or improve previous approaches used in feature generation. We report on several projects using AFs for difficult audio classification tasks, demonstrating their advantage over standard audio features. More generally, we propose analytical features as a paradigm to bring raw signals into the world of symbolic computation.

1. Introduction

This paper addresses two fundamental questions of human perception: (1) to what extent is human perceptual categorization of items based on objective features of those items, and (2) in such situations, can we identify these objective features explicitly? A natural paradigm for addressing these questions is supervised classification. Given a data set with perceptual labels considered as ground truth, the question becomes how to train classifiers on this ground truth so that they generalize and classify new items correctly, that is, as humans would. A crucial ingredient in supervised classification is the feature set that describes the items to be classified. In machine-learning research, it is typically assumed that features naturally arise from the problem definition. However, good feature sets may not be directly available, motivating the need for techniques to generate features from the raw representations of objects, signals in particular. In this paper, we claim that the generation of good feature sets for signal-based classification requires the representation of various types of knowledge about signal processing, and we propose a framework that demonstrates and accomplishes this process.

1.1. Concept Induction from Symbolic Data

The idea of automatically deriving features that describe objects or situations probably originated from Samuel's pioneering work on a program that played the game of checkers [1]. In order to evaluate a position, the program needed a number of features describing the important properties of the board [2]. These features were designed by hand, and Samuel considered the automatic construction of these features to be a major goal for AI [3]. The machine-learning community first addressed automatic feature generation through supervised concept induction [4–7]. Several algorithms adapted to problem solvers were developed to automatically infer object descriptions by combining elementary features using construction operators. Most works in inductive concept learning consider feature induction as a transformational process (see [3] for a review); new features are induced by operating on existing primitive features. The composition operators are based on general mathematical (e.g., Boolean or arithmetic) functions.

The next step taken in this area was to apply feature generation to supervised classification. In this context, classifiers are trained on labeled data sets based on these features. The performance of these classifiers strongly depends on the feature characteristics, and a great deal of research in machine learning has addressed how to define a "good feature set" for a particular classification problem. Notably, the field of feature selection has been widely studied, leading to the development of many algorithms to selectively choose an expressive and compact set of features out of a possibly much larger set [8, 9]. In this field, the notions of feature interaction [10] and feature construction have naturally emerged, both for improving classifier performance and for improving the comprehensibility of the classifiers, for example, for explanation purposes. A number of feature generation algorithms have been proposed [11], resulting in significant improvements in performance over a range of classification tasks, and further establishing the field in machine learning. These works were generalized simultaneously by researchers in several areas. Markovitch and Rosenstein [12] proposed a general framework for feature generation based on a machine-learning perspective. This framework uses a grammar for specifying construction operators. An algorithm iteratively constructs new features and tests them using a training set. This framework was evaluated using the Irvine repository of classification problems, showing that generated features improved the performance of traditional classifiers on various problems. An interesting extension to this approach was applied to text categorization [13]. Krawiec [14] proposed a similar approach from a genetic programming perspective and performed the same evaluation on reference databases.

Despite these favorable results, a careful examination of the generated features in relation to the corresponding grammars reveals that it is relatively easy to generate the "right features" (as known by the researchers) from the grammars. Therefore, the question remains whether this general approach will scale up for concrete cases in which (1) knowledge about the domain cannot be expressed as an efficient grammar, or is otherwise "hidden" in the construction rules, and (2) the input data are not symbols but raw signals.

1.2. From Symbols to Signals

Earlier work addressed problems in which the object features were naturally obtained from the problem definition. In the game of checkers, the basic elements from which features are constructed are the rows and columns of the board. Similarly, reference classification problems (such as [15]) are also described by sets of features which contain all of the relevant problem information.

When the data to be classified consists of raw signals, the situation changes radically. As opposed to symbolic problems in which the data representation is naturally induced by the problem definition, there are no "universally good" features to represent signals.

Raw signals are not suited to direct use by classifiers for two reasons: size and representation. In the audio signal domain, signals are represented as time series made up of thousands of discrete samples. Sampling a single second of monophonic audio at 22 kHz yields 22,000 16-bit values (44,000 bytes). Because of the curse of dimensionality [16], this amount of information would require an impractical number of training samples. The representation issue poses a larger problem: the temporal representation of signals (amplitude varying with time) is unsuitable for perceptual coding. For example, the superposition of two sine waves with a phase difference produces very different signals in the temporal domain (see Figure 1). However, since the human ear is insensitive to phase differences [17], it will not perceive any significant difference. In general, most perceptual dimensions (e.g., loudness, timbre) have no obvious signal counterpart [18].

Figure 1. The human ear is insensitive to phase differences, making the temporal representation of audio signals ill-suited for perceptual coding; (a) shows the sum of two sine waves (1 kHz and 3 kHz); (b) shows the same sum with the 3 kHz sine wave phase-shifted by π before summation. Although the waveforms are substantially different, the ear perceives them as similar.
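This phase insensitivity is easy to check numerically. The following sketch (Python with numpy; the 1 kHz and 3 kHz frequencies come from Figure 1, while the sample rate and one-second duration are assumptions made for illustration) builds both signals and compares their waveforms and magnitude spectra:

```python
import numpy as np

sr = 22050                      # sample rate in Hz (assumed for illustration)
t = np.arange(sr) / sr          # one second of samples

# (a) sum of two sine waves at 1 kHz and 3 kHz
a = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 3000 * t)
# (b) the same sum, with the 3 kHz component phase-shifted by pi
b = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 3000 * t + np.pi)

# The waveforms differ substantially in the time domain...
print(np.max(np.abs(a - b)))    # ~2.0: large sample-wise difference
# ...but their magnitude spectra are virtually identical.
A, B = np.abs(np.fft.rfft(a)), np.abs(np.fft.rfft(b))
print(np.max(np.abs(A - B)))    # ~0: no difference the ear would notice
```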

Most of the work in signal-based classification performed to date exploits general-purpose features which are well known, well understood, and have a precise mathematical definition. Indeed, technical features for describing signals abound in the huge body of signal processing literature; see, for example, [19]. A limitation of these methods is that the designer of a signal classifier has to select an adequate feature set based on intuition. Although this step is crucial to the design of an overall classifier, it tends to be neglected in the literature. The manual selection of features is a source of suboptimal solutions and, more importantly, gives the whole process of signal classification a debatable epistemological status. How can we interpret the performance of a classifier which uses manually designed features? In particular, what would have happened if other features had been used instead?

Applying automated feature generation to signal classification is a promising avenue, as it may produce better features than those designed by humans. More importantly, feature generation replaces manual feature design by a systematic search process. Ritthoff et al. [20] proposed to combine feature generation and feature selection algorithms, and applied them to the interpretation of chromatography time series. They confirmed that generated features performed better as a feature set than the raw time series data. Recently, feature generation ideas have spread into many domains of signal classification. Lim et al. [21] describe an approach for handwritten Chinese character recognition, in which high-level signal features are constructed using the explanation-based learning (EBL) framework. The feature construction process in EBL emphasizes the use of domain-specific knowledge in the learning process, using explanations of training examples. Similar improvements were reported with approaches based on general image features. In the engineering domain, Guo et al. [22] describe a feature generation approach to fault classification. They use a small set of transformation operators to produce features of the raw vibration signals from a rotating machine. The same framework was applied to breast cancer diagnosis [23] with similar conclusions. In the biology domain, Dogan et al. [24] describe a feature generation algorithm for analyzing sequences, applied to splice-site prediction. Details on the feature generation algorithm were not provided, but they did report an improvement of around 6% compared with traditional approaches using well-known features. Similarly, speech prosody classification was addressed by Solorio et al. [25], applying ideas from computer vision to the definition of the primitive operators from which features are generated. More recently, Eads et al. [26] generalized their earlier work on lightning classification [27] by proposing a grammar-based approach for generating Matlab code to extract features from time series, similar to the work of Markovitch and Rosenstein [12]. Krawiec and Bhanu [28] extended earlier work [14] on coevolutionary feature generation, applying it to 3D-object recognition using a method in which several populations evolve simultaneously.

1.3. Audio Feature Generation

Audio classification has received a great deal of attention due to the explosion of electronic music distribution, and many studies have been devoted to finding good feature sets to classify signals into perceptual categories [29]. Most of this work addresses the issues of feature selection, classifier parameter tuning, and the inherent difficulty in producing reliable "ground truth" databases for training and testing classifiers. General-purpose feature sets have been applied to a great variety of musical classification tasks.

Much previous work has been based on the so-called "bag-of-frames" (BOF) approach [30]. This approach handles the signal in a systematic and general fashion, by slicing it into consecutive, possibly overlapping frames (typically 50 milliseconds) from which a vector of short-term features is computed. These vectors are then aggregated and fed to the rest of the chain. The temporal succession of the frames is usually lost in the aggregation process (hence the "bag"), and Aucouturier and Pachet [31] and Aucouturier [32] showed that preserving the order does not generally improve performance. First, a subset of available features is identified using a feature selection algorithm. Then, the feature set is used to train a classifier on a database of labeled signals (training set). The classifier is then tested against another database (test set) to assess its performance. The BOF approach is the most natural way to compute a signal-level feature (e.g., at the scale of an entire song).

The BOF approach is highly successful in the music information retrieval (MIR) community, and has been applied to virtually every possible musical dimension: genre classification [33, 34]; instrument classification [35, 36]; percussion instrument classification [37, 38]; audio fingerprinting [39]; noise classification [40]. It has also been applied to identification tasks, such as vocal identification [41] and mood detection [42]. The BOF approach achieves a reasonable degree of success on some problems. For instance, speech music discrimination systems based on BOF yield almost perfect results [43], as do carefully designed music genre classifiers. However, the BOF approach has limitations when it is applied to more difficult problems. Although classification difficulty is hard to define precisely, it can be observed that problems involving classes with a lower degree of abstraction are usually much more difficult than others. For instance, we have noticed that genre classification works well to differentiate between abstract, large categories (Jazz versus Rock), but its performance degrades for more precise classes (e.g., Be-bop versus Hard-bop). For such difficult problems, the systematic limitations of BOF, referred to as glass ceilings [44], make it impervious to all known techniques for improving classifier performance (e.g., feature selection, boosting, parameter tuning).

The realization that general-purpose features do not always represent relevant information in signals from difficult classification problems was at the origin of the development of the extractor discovery system (EDS), probably the first audio feature generation system [45, 46]. This paper provides a survey of the most important technical contributions of this work regarding current feature generation studies, and reports on its applications to several difficult audio classification problems.

Recently, researchers have recognized the importance of deriving ad hoc domain-specific features for specific audio classification problems. For instance, Mörchen et al. [47, 48] performed an extensive study of a large set of bag-of-frames features obtained from a cross-product of general short-term features and several temporal aggregation methods. The result was a space of about 66,000 features. Their systematic evaluation showed that some of these features could improve the performance of a music genre classifier, compared with features used in the literature.

An interesting approach to audio feature generation, with similar goals to EDS, was pursued in parallel by Mierswa and Morik [49], based on earlier work on time series feature generation [20]. They proposed a framework for extracting domain-dependent "method trees", representing ad hoc features for a time series. They applied their framework to music genre classification and reported an improvement over approaches that use off-the-shelf audio features [50]. The proposed method trees are essentially BOF features constructed from basic audio operators, with the addition of a complexity constraint (method trees are designed to have polynomial complexity). Using this method, Schuller et al. [51] reported an improvement in classifier performance in the domain of speech emotion recognition.

In this paper, we describe two contributions to the field of signal feature generation, with an emphasis on the most important aspects for successfully generating "interesting" audio features. First, we describe a generic framework, EDS, which has been specialized for the generation of audio features and is used for supervised classification and regression problems. This framework uses analytical features as a general representation of audio features. The framework bears some similarities with other feature generation frameworks, such as those of Markovitch and Rosenstein [12] and Mierswa [50]. Notably, EDS uses genetic programming as a core generation algorithm to explore the function space. However, it differs from other work in signal feature generation in that it was specifically designed to integrate knowledge representation schemes about signal processing. Specifically, we graft type inference, heuristics, and patterns onto analytical features (AFs) to control their generation. Second, we describe several projects using EDS to solve difficult audio classification tasks, thereby extending the usual scope of examples used in the literature. We report on the main results achieved, and emphasize the lessons learned with these experiments.

2. Analytical Features

In this section, we introduce the notion of analytical features (AFs), which represent a large subset of all possible digital signal processing functions having an audio signal as input. We also introduce the algorithms that generate and select AFs for supervised classification and for regression problems.

2.1. Origin

Previous work in musical signal analysis has addressed the automatic extraction of accurate high-level music descriptors. Scheirer's tempo algorithm ([52], see Figure 2) is a prototypical example of a complex function implementing a high-level musical feature (the tempo) from a raw music signal (taken from a CD). This example clearly shows that the design of a good audio feature involves engineering skills, signal processing knowledge, and technical know-how. The initial motivation for EDS originated from the observation that manually designing such a descriptor is costly, yet to a large extent it could be automated by a search algorithm. In fact, a careful examination of Scheirer's seminal tempo paper reveals that many of the design decisions for the tempo extractor were arbitrary, and could be automated and possibly optimized. By examining other high-level feature extractors, it could be seen that there are general patterns in the way features are constructed. These observations have led to the design of a system that automates the construction of high-level features in the context of supervised classification tasks. The key issue was not so much the algorithm, but rather the encoding of the necessary knowledge to guide this search.

Figure 2. The tempo extractor from Scheirer [52] is a typical example of a high-level audio feature.

An important source of inspiration was the automated mathematician (AM) system [53]. This system invented mathematical conjectures based on a set of primitive operations which were gradually combined using a set of powerful and general heuristics. Although AM was a milestone in AI, the a posteriori analysis by Lenat and Feigenbaum [54] revealed that it had deep limitations, as summarized in the famous quote, "machines only learn at the fringe of what they already know" [55]. Indeed, AM and related systems could explore only the fringe of the set defined by the initial basic operators. However, AM's success led to the development of genetic programming [56], a general method to search function spaces using genetic search [57] that is now systematically used in feature generation systems. The AM tutorial reconstruction by Haase [58] emphasized a crucial aspect of AM that is at the heart of EDS' design, namely the concept of "representational density": syntactically simple constructions should be maximally "interesting", since the system will in practice explore only forms that are relatively easy to generate. This remark was made during the last days of AM, when it was noticed that no interesting conjectures could be invented after the initial frenzy of mathematical discoveries; this is the so-called "AM malaise".

The driving force behind the design of EDS and of our primitive audio operators is that the generated features should be optimally interesting (statistically) and simple (syntactically). To this end, the feature generation should exploit a large amount of knowledge about signal processing, as was demonstrated by Scheirer's tempo example. Without this knowledge, the feature space generated would be too small to contain interesting features, or would include features which are only superficially better than well-known ones. Moreover, some kernels (e.g., Boolean or arithmetic) embedded in classifiers (such as support vector machines, SVMs) are able to automatically reconstruct implicit feature combinations at the training phase [59]. These may incorporate domain-specific knowledge [60], and they eliminate the need for an extra and costly feature generation phase. Therefore, the primary task for feature generation is the construction of features containing in-depth signal information that cannot be extracted or reconstructed otherwise. This "representational density" argument was translated into a series of important extensions to the basic genetic programming algorithm. These are described in the next section, beginning with the definition of our core concept: analytical features.

2.2. Analytical Features: A Subset of All Possible Audio Digital Signal Processing Functions

Analytical features (AFs) are expressed as functional terms, taking the input signal (represented here as x) as their only argument. Such a term is composed of basic operators, and its output is a value or vector of values whose dimensional type is inferred automatically from the operators by a dimensional type inference system. The feature's value is designed to be fed directly to a classifier, such as an SVM, using a traditional train/test process.

AFs depart from approaches based on heterogeneous data types and operations, such as that of Mierswa [50], who distinguishes between basis transformations, filters, functions, and windowing. In contrast, AFs are designed with two simplicity constraints: (1) only one composition operation is used (functional composition), and (2) AFs encompass the whole processing chain, from the raw signal to the classifier. The processing chain includes in particular windowing operations, which are crucial for building signal features in a BOF approach [49]. As explained above, a signal of substantial length (say, more than a few hundred milliseconds) is frequently sliced into frames rather than being treated as a single time series. Features are then extracted for each frame, and the resulting vector set is aggregated using various statistical means, such as statistical moments or Gaussian mixture models [47].

These operations are all included in the definition of an AF, through a specific operator Split. For instance, the following AF (1) computes the mel-frequency cepstrum coefficients (MFCC) of successive frames of length 100 (samples) with no (0) overlap, and then computes the variance of this value vector:

Variance(Mfcc(Split(x, 100, 0)))        (1)
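As an illustration, a minimal sketch of AF (1) in Python could look as follows; librosa is assumed here as a stand-in MFCC implementation (EDS uses its own Mfcc operator), and the number of coefficients is an arbitrary choice:

```python
import numpy as np
import librosa  # assumed MFCC implementation, not part of EDS

def af_variance_mfcc(x, sr, frame_len=100, overlap=0):
    """Sketch of AF (1): Variance(Mfcc(Split(x, 100, 0))).

    Split the signal into frames of frame_len samples with the given
    overlap, compute an MFCC vector per frame, and return the variance
    of these vectors across frames."""
    x = np.asarray(x, dtype=float)
    hop = frame_len - overlap
    frames = [x[i:i + frame_len]
              for i in range(0, len(x) - frame_len + 1, hop)]
    mfccs = np.stack([
        librosa.feature.mfcc(y=f, sr=sr, n_mfcc=13,
                             n_fft=frame_len).mean(axis=1)
        for f in frames
    ])
    return mfccs.var(axis=0)    # one variance per MFCC coefficient
```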

This uniform view of windowing, signal operators, aggregation operators, and operator parameters introduces a great deal of flexibility into feature generation. It also simplifies the entire generation phase by eliminating the choice of specific windowing or aggregation parameters. Note that AFs are not, by design, limited to BOF features (i.e., statistical reductions of short-term features), nor are they limited to polynomial-complexity features as in other feature generation approaches [49]. AFs can express arbitrary signal functions, which might not make obvious sense. For instance, the following AF (2) computes a complex expression involving a low-pass filter, whose cutoff frequency is itself computed as the maximum peak of the signal spectrum:

Mean(LpFilter(x, MaxPos(FFT(x))))        (2)

where FFT indicates a fast Fourier transform. To be manipulated efficiently, each AF is associated with a dimensional type, inferred automatically by a type inference system. These types, introduced in the next section, form the basis for implementing random function generators, for defining powerful heuristics, and for guiding the search through feature patterns.

It is important to note that AFs, by definition, do not capture all possible digital signal processing (DSP) functions that could be used as audio features, that is, that reduce a signal to a lower-dimensional vector. For instance, most programs written using an imperative programming style cannot be expressed as AFs. It is difficult to give a precise semantic definition of the space covered by AFs (this is an interesting avenue of research currently being investigated). As we will see below, Scheirer's tempo extractor can be reasonably well represented as an AF, as well as variations on this feature. But it is easy to devise other DSP functions which cannot be expressed as AFs. We chose to restrict our investigation to AFs for a number of reasons.

(1) The space of "reasonable size" AFs is huge, and likely contains features which are sufficiently good for the problem at hand. The number of AFs of size (number of operators) less than 10 is estimated at 10²⁰ [61], even without accounting for the combinatorial exploration of operator parameters.

(2)Thanks to their homogeneity and conceptual simplicity, AFs can be easily typed, which eases the implementation of control structures (e.g., heuristics, function patterns, and rewriting rules), as explained below. Without these control structures, the generation can only superficially improve on existing features. Consequently, AFs are a good compromise between BOF features, as generalized by Mierswa [50], and arbitrary DSP programs [26], whose automatic generation is notoriously difficult to control.

(3)Many non-AF functions can be approximated by AFs or by sets of AFs. We give an example in Section 3.1.2, where EDS is used as a regression classifier to approximate a known audio feature that is not included in the basic operator set.

2.3. A Dimensional Type Inference System

The need for typing functions is well known in genetic programming (GP), and ensures that the generated functions are at least syntactically correct. Different typing systems have been proposed for GP, including strong typing [62], which explicitly represents the "programming" types (floats, vectors, or matrices) of the inputs and outputs of functions. However, for this application, programming types are superficial and are not needed at the AF level. For example, the operator absolute value (Abs) can be applied to a float, a vector, or a matrix; this polymorphism provides needed simplicity in the function expression. However, we do need to distinguish how AFs handle signals in terms of their physical dimensions, since knowledge about DSP operators naturally involves these dimensions. For instance, audio signals and spectra can both be seen as float vectors from the usual typing perspective, but they have different dimensions. A signal is a time-to-amplitude function, whereas a spectrum is a frequency-to-amplitude function. It is important for the heuristics to be represented in terms of physical dimensions instead of programming types.

Surprisingly, to our knowledge, there is no type inference system for describing the physical dimensions of signal processing functions. Programming languages such as Matlab or Mathematica manipulate implicit data types for expressing DSP functions, which are solely used by the compiler to generate machine code. In our context, we need a polymorphic type inference system that can produce the type of an arbitrary AF from its syntactic structure. The design of a type inference system has been well addressed in theoretical computer science [63]. The choice of the primitive types is again determined by the representational density principle, in which the most used types should be as simple as possible so that heuristics and patterns can be expressed and generated easily. Accordingly, we define a primitive type system and type inference rules to apply to each operator in the next section.

2.3.1. Basic Dimensional Types

In the current version of EDS, we have chosen a dimensional type system based on only three physical dimensions: time "t", frequency "f", and amplitude or nondimensional data "a". All of the AF data types can be represented by constructing atomic, vector, and functional types out of primitive types.

"Atomic" types describe the physical dimension of a single value. For instance:

(i)position of a drum onset in a signal: "t",

(ii)cut-off frequency of a filter: "f",

(iii)amplitude peak in a spectrum: "a".

"Functional" types represent data of a given type, which yield a dimension of a different type. The target type is separated from the data type using the ":" notation. For instance,

(i) audio signal (amplitude as a function of time): "t:a",

(ii) spectrum (amplitude as a function of frequency): "f:a".

"Vector" types, notated "", are special cases of functions used to specify the types for homogeneous sets of values. For instance,

(i)temporal positions of the autocorrelation peaks of an audio signal: "Vt"

(ii)amplitudes of autocorrelation peaks: "Va".

Vector and functional notations can be combined. Currently the type system is restricted to two-dimensional constructs. For instance:

(i) a signal split into frames: "Vt:a",

(ii)autocorrelation peaks for each frame: "VVa"

(iii) types like "VVt:a" are currently not implemented.

Note that this notation is equivalent to the traditional notation of type inference, where a functional type such as "t:a" corresponds to "t → a" and a vector type such as "Va" corresponds to a list of values of type "a". Also, there are many ways to represent DSP function dimensions. For instance, the physical dimension of frequency is the inverse of time, and we could use a dimension system with exponents so that "f" is replaced by "t⁻¹". However, this involves a more complex type inference system (e.g., the type of Energy would be "a²"). More importantly, we did not identify knowledge (heuristics or patterns) that would require such a detailed view of dimensions. But our choice is arbitrary and has limitations, as discussed in Section 4. Hereafter in this paper, the word "type" will mean "dimensional type".

2.3.2. Typing Rules

For each operator, we define typing rules so that the type of its output data is a function of its input data types. For instance, the Split operator transforms

(i) a signal ("t:a") into a set of signals ("Vt:a");

(ii)a set of time values ("Vt") into multiple sets of time values ("VVt").

These rules are defined for each operator, so that the types of arbitrary functions can be inferred automatically by the type inference system. For instance, the following AF is typed automatically as follows (types are written as subscripts):

Max_a(FFT_f:a(x_t:a))        (3)

(In traditional type notation: x : t → a, FFT(x) : f → a, Max(FFT(x)) : a.) This AF yields an amplitude value "a" from a given input signal x of type "t:a".
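Such a dimensional type inference reduces to a recursive lookup in a table of per-operator typing rules. The following sketch illustrates the idea on a toy rule set; the rules for these few operators are assumptions made for illustration and do not reproduce the exact EDS operator library:

```python
# Toy typing rules: operator -> {input type: output type} (assumed)
TYPING_RULES = {
    "FFT":   {"t:a": "f:a"},
    "Max":   {"t:a": "a", "f:a": "a", "Va": "a"},
    "Split": {"t:a": "Vt:a", "Vt": "VVt"},
    "Mfcc":  {"t:a": "Va", "Vt:a": "VVa"},
}

def infer_type(term):
    """Infer the dimensional type of a nested-tuple AF such as
    ("Max", ("FFT", "x")); the input signal "x" has type "t:a"."""
    if term == "x":
        return "t:a"
    op, arg = term
    arg_type = infer_type(arg)
    if arg_type not in TYPING_RULES.get(op, {}):
        raise TypeError(f"{op} cannot be applied to type {arg_type}")
    return TYPING_RULES[op][arg_type]

print(infer_type(("Max", ("FFT", "x"))))  # -> "a", as in AF (3)
```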

Equipped with a type system, EDS can support constructs for pattern-based generators, heuristics, and rewriting rules. These are described in the following sections.

2.4. The AF Generation Algorithm

AFs are generated with a relatively standard function search algorithm. The main goal is to quickly explore the most promising areas of the AF space for a given problem. Like many feature generation approaches, our AF generation algorithm is based on genetic programming [56], one of the most convenient methods to search a function space. The general framework is illustrated in Figure 3. It bears some similarity to the framework proposed by Markovitch and Rosenstein [12], and to other frameworks proposed since in the literature, such as that of Mierswa and Morik [49]. As described in Section 2.2, AFs are designed specifically to support a series of mechanisms that represent operational knowledge about signal processing operators (see Section 2.5). From a machine-learning perspective, most feature generation frameworks use a wrapper approach to evaluate generated features on the training set. Our approach differs in the way that AFs are evaluated (individually, in our case), and in how AF populations are successively selected using a feature selection algorithm based on the syntactic structure of AFs, as described in the next section.

Figure 3. The architecture of AF generation. AFs are evaluated individually using a wrapper approach and selected for the next population using a syntactic feature selection algorithm.

2.4.1. Main Algorithm

The algorithm works on audio description problems, composed of a database of audio signals with associated labels (numeric values or class labels). The algorithm builds an initial population of AFs, and then tries to improve them through the application of various genetic transformations. The precise steps of the algorithm are the following:

(1)build an initial population using random AFs, possibly constrained by patterns,

(2)compute the fitness of each AF in the population. This fitness depends on the nature of the problem and other criteria, as explained in Section 2.4.3,

(3) if the stop condition holds: STOP and RETURN the best AFs,

(4) else, select the AFs with the highest fitness, and apply transformations to them to create a new population,

(5)return to step (2) and repeat.

Arbitrary stop conditions can be specified. For example, a stop condition can be defined as a conjunction of several criteria.

(i) The maximum number of iterations is reached: the search stops automatically after a given number of populations.

(ii) The fitness of the population converges and ceases to improve; that is, the fitness of the best function of a new population does not exceed that of the previous population, and this condition persists for, say, several consecutive populations.

(iii)An optimal AF is found. That is, the fitness is maximal.
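A minimal sketch of this loop, with the three stop conditions, is shown below. A scalar fitness is assumed for brevity (EDS actually uses a per-class F-measure vector, see Section 2.4.3), and random_af, fitness, select_features, and transform stand for the mechanisms described in this section:

```python
def eds_search(problem, pop_size=50, max_populations=30, patience=5):
    """Sketch of the AF generation loop (steps 1-5 above)."""
    population = [random_af(problem.patterns) for _ in range(pop_size)]
    best_score, unimproved, scored = None, 0, []
    for _ in range(max_populations):            # stop (i): max iterations
        scored = [(fitness(af, problem), af) for af in population]
        top = max(score for score, _ in scored)
        if best_score is None or top > best_score:
            best_score, unimproved = top, 0
        else:
            unimproved += 1
        if unimproved >= patience:              # stop (ii): convergence
            break
        if top >= problem.max_fitness:          # stop (iii): optimal AF found
            break
        kept = select_features(scored)          # syntactic feature selection
        population = transform(kept) + \
            [random_af(problem.patterns) for _ in range(pop_size // 4)]
    return [af for _, af in
            sorted(scored, key=lambda s: s[0], reverse=True)[:10]]
```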

2.4.2. Genetic Operations

New populations are created by applying genetic transformations to the fittest functions of the current population. These operations are relatively standard in genetic programming. In addition to selection, five transformations are used in EDS: cloning, mutation, substitution, addition, and crossover.

(i) Cloning maintains the tree structure of a function and applies variations to its constant parameters, such as the cut-off frequencies of filters or the computation window sizes.

(ii) Mutation removes a branch of a function and replaces it with another composition of operators of the same type.

(iii) Substitution, a special case of mutation, replaces a single operator with a type-wise compatible one.

Substitution typically corresponds to what a sound engineer does when replacing a low-pass filter with a band-pass filter. Mutation, on the other hand, corresponds to a more general principle: replacing an elementary task with a more complex process, or the other way around.

(iv) Addition adds an operator to form a new root of the feature.

(v) Crossover cuts a branch from a function and replaces it with a branch cut from another function.

In addition to the genetically transformed functions, the new population contains a set of new randomly generated AFs, thereby ensuring its diversity and introducing new operations to the population evolution. Random AF generators are described in Section 2.5.2.

The distribution of the features in a new population is as follows: 25% are randomly generated; 10% are clones (of the features kept from the previous generation); 10% are mutations; 10% are additions; 10% are crossovers; 10% are substitutions; the remaining 25% are the selected features themselves.
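The sketch below illustrates two of these transformations on AFs represented as nested lists ([operator, child, *constant parameters], an illustrative encoding); a real implementation would pick branches at random and check dimensional type compatibility:

```python
import copy
import random

AF1 = ["Variance", ["Mfcc", ["Split", "x", 100, 0]]]
AF2 = ["Mean", ["FFT", ["LpFilter", "x", 1000]]]

def clone(af, jitter=0.2):
    """Cloning: same tree structure, perturbed constant parameters."""
    out = copy.deepcopy(af)
    def walk(node):
        if isinstance(node, list):
            for i in range(2, len(node)):       # constant parameters
                if isinstance(node[i], (int, float)):
                    node[i] = type(node[i])(
                        node[i] * (1 + random.uniform(-jitter, jitter)))
            walk(node[1])                       # recurse into the child
    walk(out)
    return out

def crossover(a, b):
    """Crossover: replace a branch of `a` with a branch of `b`
    (here, simplistically, the root's child)."""
    out = copy.deepcopy(a)
    out[1] = copy.deepcopy(b[1])
    return out

print(clone(AF1))           # e.g. ["Variance", ["Mfcc", ["Split", "x", 108, 0]]]
print(crossover(AF1, AF2))  # ["Variance", ["FFT", ["LpFilter", "x", 1000]]]
```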

2.4.3. Evaluation of Features and Feature Sets

The evaluation of features is a delicate issue in feature generation, as it is well known that good individual features may not form a good feature set when they are combined with others, due to feature interaction. In principle, only feature sets should be considered during the search process, as there is no principled way to guarantee that an individual feature will be a good member of a given feature set. The approach taken by Markovitch and Rosenstein [12] is conservative in this sense, as feature interaction is considered up-front during the search. However, they point out that there is both a risk of narrowing the search and a high evaluation cost.

Based on our experience with large-scale feature generation experiments, we chose another option in EDS, in which we favor the exploration of large areas of the AF space. Within a feature population, the features are evaluated individually. Feature interaction is considered during the selection step used to create new populations.

Individual Feature Evaluation

There are several ways to assess the fitness of a feature. For classification problems, the Fisher discriminant ratio [64] is often used because it is simple to compute and reliable for binary classification problems. However, it does not adapt well to multiclass problems, in particular those with nonconvex distributions of data. To improve feature evaluation, we chose a wrapper approach to feature selection [65]: features are evaluated using an SVM classifier built during the feature search, with 5-fold cross-validation on the training database. The fitness is assessed from the performance of a classifier built with this unique feature. As we often deal with multiclass classification (not binary), the average F-measure could be used to assess the classifier's performance [66]. However, as training databases are not necessarily balanced class-wise, the average F-measure can be artificially high. Therefore, the fitness in EDS is defined by an F-measure vector (one F-measure per class) of the wrapper classifier. For regression problems, we use the Pearson correlation coefficient [67]. Other methods could also be applied, such as a wrapper approach with a regression SVM.
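As a sketch, this wrapper evaluation of a single AF could be written as follows, with scikit-learn assumed as a stand-in for the SVM implementation used in EDS; feature_values holds the AF's scalar value for each labeled signal of the training database:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def wrapper_fitness(feature_values, labels):
    """Train/test an SVM on one scalar feature with 5-fold
    cross-validation and return the per-class F-measure vector."""
    X = np.asarray(feature_values, dtype=float).reshape(-1, 1)
    predictions = cross_val_predict(SVC(), X, labels, cv=5)
    return f1_score(labels, predictions, average=None)
```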

Note that training and testing an SVM classifier on a single scalar feature require little computation time. Indeed, the fitness computation is generally much faster than the actual feature extraction.

Feature Set Evaluation: Taking Advantage of the Syntactic Form of AFs

After a population has been created and each feature has been individually evaluated, we need to select a subset of these features to be retained for the next population. Any feature selection algorithm could be used for this purpose, such as InfoGain [68]. However, feature selection algorithms usually require a calculation of redundancy measures, for example, by computing correlations of a feature's values across samples [9]. As our features are all AFs, we can take advantage of their syntactic expression to efficiently compute an approximate redundancy measure. This is possible because syntactically similar AFs have (statistically) correlated value series [69]. Also, our algorithm considers the performance of features in each class, rather than globally for all classes.

Finding an optimal feature set would require a costly multicriteria optimization. As an alternative, we propose a low-complexity, one-pass selection loop: the best feature is selected first, and then the next best feature that is not redundant with the current selection is iteratively added, until the required number of features is reached. The algorithm cycles through each class in the problem, and accounts for the redundancy between a feature and the currently built feature set using the syntactic structure of the feature. The algorithm is stated (in its simplest form) as in Algorithm 1.

Algorithm 1
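In the absence of the original listing, the selection loop can be sketched as follows; scored_afs is a list of (af, per-class F-measure) pairs produced by the individual evaluation, and s_correlated is the Boolean redundancy predicate defined in the next paragraphs:

```python
def select_features(scored_afs, classes, n_wanted, s_correlated):
    """One-pass selection: cycle through the classes, each time adding
    the best remaining feature for that class that is not s-correlated
    with any already-selected feature."""
    selected = []
    while len(selected) < n_wanted:
        progress = False
        for c in classes:
            ranked = sorted(scored_afs, key=lambda p: p[1][c], reverse=True)
            for af, _ in ranked:
                if af not in selected and not any(
                        s_correlated(af, other) for other in selected):
                    selected.append(af)
                    progress = True
                    break
            if len(selected) >= n_wanted:
                break
        if not progress:        # every remaining feature is redundant
            break
    return selected
```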

The syntactic correlation (s-correlation) between two features is computed from their syntactic form. This not only speeds up the selection, but also forces the search algorithm to find features with a greater diversity of operators. S-correlation is defined from the tree edit distance [70] between two AFs. The edit distance uses specific edit operation costs, taking into account the AFs' types: the cost of replacing operator Op1 by Op2 in an AF is a function of the types of Op1 and Op2.

To yield a Boolean s-correlation function, we compute the edit distance for all pairs of features in the considered set (the AF population in our case) and take the maximum of these distances; two features are considered s-correlated when their edit distance is small relative to this maximum.

With this procedure, our mechanism can efficiently evaluate individual features, allowing the exploration of a larger feature space. It also ensures a syntactic diversity within feature populations. A comparison of this feature selection scheme with an information-theoretic approach was performed for the Pandeiro study, described in Section 3.3.

The syntactic correlation is used for the feature selection, after the generation process. Using the syntactic correlation during the generation is an open problem that is not investigated in this article.

2.5. Function Patterns and Function Generators

Genetic programming traditionally relies on the generation of random functions to create the initial population and to supplement the population at each iteration. However, our experiments showed that, relying on random generation only, feature generation algorithms explore only a superficial region of the search space and in many cases do not find really novel features. This limitation can be overcome by providing the system with specific search strategies based on the designer's intuitions about particular feature extraction problems.

2.5.1. Patterns

To perform specific search strategies, we constrain the random generation so that the system explores specific areas of the AF space. Although there is no known general paradigm for extracting relevant features from a signal, the design of such features usually follows regular patterns. One pattern consists of filtering the signal, splitting it into frames, applying specific treatments to each frame, and aggregating the results to produce a single value. This is the case in the beat tracking system described in [52]: it expands the input signal into several frequency bands, treats each band separately, and finally aggregates the resulting coefficients using various aggregation operators, ultimately yielding a float value representing (or strongly correlated with) the tempo.

To represent this kind of knowledge, we introduce the notion of function pattern. A function pattern is a regular expression denoting subsets of AFs that correspond to a particular building strategy. Syntactically, patterns look like AFs, with the addition of regular expression operators such as "!", "?", and "*". Patterns make use of types to specify the collections of targeted AFs in a generic way. More precisely (the current system uses additional regular expression tokens not described here, notably to control the arity of operators):

(i)"" stands for one operator whose type is ,

(ii)"" stands for a composition of several operators, all of which have types ,

(iii)"" stands for several operators whose final type is (the types of the other operators are arbitrary).

For instance, the pattern "?_a(x)" represents all scalar AFs, and the pattern "?_Va(x)" represents all vector AFs, for example, Mfcc(Split(x, 512)).

The following pattern represents a construction strategy which abstracts the tempo extractor strategy investigated manually by Scheirer [52]:

!_a(?_Va(Split(*_t:a(x))))        (4)

This pattern can be paraphrased as follows.

(1)"Apply some signal transformations in the temporal domain" ().

(ii)"Split the resulting signal into frames" (Split).

(iii)"Find a vector of characteristic values, one for each frame" ().

(iv)"Find one operation that aggregates a unique value for the entire signal" ().

It can be instantiated by concrete AFs such as:

Mean(Max(Split(Autocorrelation(LpFilter(x, 1000)), 512, 0)))        (5)

Another typical example of a pattern, referred to as the BOF pattern, is

!_a(!_Va(Hanning(Split(x, 512, 256))))

This pattern imposes a fixed windowing operation with a Hanning filter, on successive frames of 512 samples with a 50% overlap, and then allows one operation to be performed for each frame (either in the temporal or spectral domain). This is followed by an aggregation function, which is typically a statistical reduction (including operators such as Mean, Variance, Kurtosis, and Skewness). This pattern corresponds approximately to the set of about 66,000 features described and explored manually by Mörchen et al. [47]. It is difficult to propose patterns corresponding exactly to the method proposed by Mierswa [50], due to the lack of information about their system. However, it is likely that their "method trees" could be reasonably approximated by one or several AF patterns.

Another interesting example is the BOBOF pattern, in which windowing operations of various sizes are chained together to produce a nonlinear aggregation of the short-term feature vectors.

A more complex pattern consists of transforming the signal to the spectral domain and back to the temporal domain, eventually yielding a time series.

Patterns are used by AF generators to randomly generate corresponding AFs. The systematic generation of all possible concrete AFs for a given pattern is a difficult task, but it is probably not needed. Instead, we designed a random generator that produces a given number of AFs satisfying a given pattern, as described in the following section.

2.5.2. Pattern-Based AF Generators

Most patterns can, in principle, generate arbitrarily complex features. For several reasons, we generate only simple features from a given pattern: the features should use the fewest possible type transitions and should not be too long. The generated features are kept simple because the subsequent genetic operations will later introduce complexity into them.

Technically, the main problem is the "?" pattern, because it allows arbitrary type transitions. In its first phase, our algorithm rewrites the pattern into explicit type transitions, so that the final pattern no longer contains "?". Each "?" has an input type, given by its child, and an output type. The explicit type transition is found by computing the shortest type transition path between the input and output types. This shortest-path computation uses a type transition table that contains all of the valid type transitions (some type transitions are not possible for a given operator library).

For instance, when a "?" must produce a type that no single operator in the default library can reach from the input type, the pattern is rewritten with an explicit chain of intermediate types, namely the shortest path between the input and output types in the type transition graph.

Once the pattern is rewritten in this form, it can be completed by randomly drawing operators corresponding to each variable type. For "*", a random number n of operators between 1 and 10 is drawn first, followed by n operators that are randomly drawn and chained together.
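The shortest type transition path itself is a plain breadth-first search over the type transition table, as in the following sketch (the table's contents depend on the operator library and are assumed given):

```python
from collections import deque

def shortest_type_path(transitions, src, dst):
    """Return the shortest chain of types from src to dst, where
    `transitions` maps a type to the set of types reachable with a
    single operator of the library; None if no chain exists."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in transitions.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```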

2.5.3. Heuristics

Most approaches to feature generation have been purely algorithmic: genetic programming is applied as the search paradigm and generates functions from primitive operators. The current literature lacks detailed descriptions of the search sessions produced. It appears that only relatively "simple" features are generated, considering the information given in the primitive attributes, the construction process [50], or the grammars [12].

Our AF framework is no exception: with random generation and the genetic transformation operators alone as a search paradigm, our observations show that searches rarely converge to really interesting AFs, unless these features are found quickly, within the very first iterations of the search.

Another observation is that real-world problems described in the literature only partially reuse the proposed frameworks (see, e.g., the text application by Gabrilovitch and Markovitch [13], the work of Markovitch and Rosenstein [12], the application by Mierswa [50], or the speech emotion classification by Schuller et al. [51]). This indicates that more than a framework is needed to build effective features, namely, heuristics.

We introduce explicit heuristics to guide the search. These heuristics represent know-how about signal processing operators. Heuristics can promote a priori interesting functions, or eliminate obviously uninteresting ones. They are a vital component of EDS, as they were in AM. A heuristic in EDS is a function that gives a score to an AF, ranging from 0 (forbidden) to 10 (highly recommended). The score is determined prior to the AF's effective construction and integration into a population. These scores are systematically used when EDS builds a new AF, to select the candidates from all of the possible operations. In the current version of EDS, these heuristics were designed and implemented manually. The most interesting and powerful heuristics are the following.

(i) Control the Structure of the Functions

AFs can in principle have arbitrarily complex forms, including the forms of the arguments of operators (see AF (2) in Section 2.2). However, it is rare that a very complex function is needed to compute the value of a scalar argument (e.g., the cut-off frequency of a high-pass filter). A heuristic penalizes such complexity instead: with x the input signal, and Branch the sub-AF considered as a potential argument of an HpFilter operator, the resulting AF is scored 5 if Branch is a constant operator, 4 if its length is 1, and so on.

(ii) Avoid Bad Combinations of Operators

There are specific combinations of operators that we know are not of interest. For instance, one heuristic penalizes compositions of multiple high-pass filters; another considers two consecutive Split operations to be a bad composition. Note that these heuristics differ from rewriting rules, which would simply combine the filters.

(iii) Range Constant Parameter Values

Some heuristics restrict the range of the constant parameter values of certain operators, for example, keeping a filter's cut-off frequency within a meaningful range.

(iv) Avoid Too Many Operator Repetitions

It is frequently useful to compute the spectral representation of a signal (FFT in concrete cases). In signal processing, it is also common for an operation to be repeated twice; for instance, MFCC coefficients can be seen as a "double FFT" of a signal. However, it seems unlikely that three consecutive applications of the FFT would generate interesting data. This idea is easily represented as a heuristic, programmed explicitly using the number of occurrences of a given operator (e.g., FFT).

(v) Avoid Too Many Type Repetitions

The ideas above can be applied to types instead of concrete operators. In particular, we wish to disregard compositions containing several (i.e., more than 3) consecutive operators of the same type, in particular when this type is scalar. For instance, expressions chaining many operators of type "a" are probably less interesting to explore than AFs with a more balanced distribution of types.

(vi) Favor Particular Compositions

There are recommended compositions of operators; for example, it is useful to apply a windowing function (such as Hanning) to each frame after a Split. Note that this heuristic can be generalized to any operator which does not alter the type of its input (usually a signal, "t:a").
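As an illustration, heuristic (iv) can be sketched as a scoring function over the nested-list AF encoding used earlier; the scoring values are illustrative:

```python
def heuristic_fft_repetition(af):
    """Score an AF between 0 (forbidden) and 10 (highly recommended):
    a double FFT is common (cf. MFCC), but three consecutive FFTs are
    deemed uninteresting and forbidden."""
    def longest_fft_run(node, run=0):
        if not isinstance(node, list):
            return run
        run = run + 1 if node[0] == "FFT" else 0
        return max(run, longest_fft_run(node[1], run))
    return 0 if longest_fft_run(af) >= 3 else 10

print(heuristic_fft_repetition(["Max", ["FFT", ["FFT", "x"]]]))           # 10
print(heuristic_fft_repetition(["Max", ["FFT", ["FFT", ["FFT", "x"]]]]))  # 0
```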

All of these heuristics can be considered as a manual bootstrap that makes the system operational. Ideally, these heuristics could be learned automatically by a self-analysis of the system. Some work has begun in this domain (see Section 4).

2.5.4. Rewriting Rules

Rewriting rules simplify functions prior to their evaluation and speed up the search. Rewriting rules are rudimentary representations of DSP theorems. Unlike heuristics, they are not used by the genetic algorithm to favor combinations, but they do impact the search by

(i) avoiding the need to compute a function multiple times under different but equivalent forms,

(ii)reducing the computational cost. For example, Parseval's equality ([71], see below) avoids computing a possibly costly FFT of a signal.

The rules are triggered iteratively, using a fixed-point algorithm, until a normal form is obtained [72]. The confluence of our rule set has not been proven, but is likely, as symmetric rules were avoided. Here are the most useful rewriting rules used by EDS:

(i) Abs(Abs(x)) => Abs(x),

(ii) HpFilter(x, 0) => x,

(iii) Envelope(x, 1) => x,

(iv) Envelope(Abs(x), n) => Envelope(x, n),

(v) Abs(FFT(x)) => FFT(x),

(vi) Abs(Square(x)) => Square(x),

(vii) (Parseval) Mean(Square(FFT(x))) => Sum(Square(x)),

(viii) Var(x) => Mean(Square(x − Mean(x))) = Mean(Square(x)) − Square(Mean(x)),

(ix) Max(Autocorr(x)) => Autocorr(x)(0) = Sum(Square(x)) = E(x).
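A fixed-point rewriting of this kind can be sketched as follows on the nested-list encoding, with two of the rules above; the actual EDS rule engine handles the full rule set:

```python
def rw_abs_abs(t):
    # (i) Abs(Abs(x)) => Abs(x)
    if t[0] == "Abs" and isinstance(t[1], list) and t[1][0] == "Abs":
        return t[1]
    return None

def rw_hpfilter_zero(t):
    # (ii) HpFilter(x, 0) => x
    if t[0] == "HpFilter" and len(t) > 2 and t[2] == 0:
        return t[1]
    return None

RULES = [rw_abs_abs, rw_hpfilter_zero]

def normalize(term):
    """Rewrite subterms bottom-up and reapply the rules until no rule
    fires, yielding the normal form of the AF."""
    if isinstance(term, list):
        term = [term[0]] + [normalize(c) for c in term[1:]]
        for rule in RULES:
            rewritten = rule(term)
            if rewritten is not None:
                return normalize(rewritten)
    return term

print(normalize(["Abs", ["Abs", ["HpFilter", "x", 0]]]))  # -> ['Abs', 'x']
```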

2.5.5. Computation of AFs

In contrast to most approaches in feature generation, AFs are not computed in isolation, but globally, for a whole population. This approach reduces the number of computations by exploiting the redundancies between features in each population. When a population is generated, a tree representing the set of all AFs is also generated. Each shared subexpression is then computed once (for each sample in the training set), using a bottom-up evaluation of the population tree. The increase in computational efficiency ranges from a factor of 2 to 10, depending on the patterns used and on the population's variability and size. This computation can easily be distributed over several processors.
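This shared evaluation can be sketched with a simple memoization on the printed form of each subterm; apply_op, which executes one operator on already-computed data, is assumed given:

```python
def evaluate_population(afs, signal, apply_op):
    """Evaluate all AFs of a population on one signal, computing each
    distinct sub-AF only once (bottom-up, via memoization)."""
    cache = {}
    def ev(term):
        if term == "x":
            return signal
        key = repr(term)
        if key not in cache:
            op, child, *params = term
            cache[key] = apply_op(op, ev(child), *params)
        return cache[key]
    return [ev(af) for af in afs]
```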

2.6. Basic Operators: The "General" Library

The design of the primitive operators is essential, but this topic has not yet been addressed in the feature generation literature. Currently, this choice must be made by the user, for example, through a grammar. This choice is complex because many of the features generated by traditional techniques can now be produced systematically by kernel-based classifiers. For instance, features generated by Boolean or simple mathematical expressions can be easily reconstructed using Boolean kernels [73]. Conversely, the representational density principle suggests that generated features should be kept as analytically simple as possible. Operators should therefore embody signal characteristics that are useful for general-purpose feature construction. For instance, a "filter" operator that applies to any signal is useful. Conversely, overly specific operators will not lead to interesting constructions: an operator such as "log-attack-time" is not relevant outside of specific attack detection applications, so it was not included. A mathematical "plus" operator is also problematic, as it would generate exponential growth in the feature space size and would be difficult to control.

The following describes the operators in the basic version of EDS. Note that a signal is considered as a vector of values (following the time/amplitude representation). In the descriptions, we generally use the term "vector of values", and use "signal" only when it makes the description clearer.

Abs:the absolute value

Arcsin: the arc-sine

AttackTime: the duration of the attack of a signal; applies a low-pass filter at 50 Hz, then computes the duration in seconds of the attack (the attack spans amplitudes between 20% and 80% of the maximum value of the signal)

Autocorrelation: the autocorrelation of a signal computed using the FFT

Bandwidth: the width of the highest peak at a percent threshold of the peak; the threshold is a parameter of this operator

BarkBands: computes the sums of amplitudes in filtered Bark bands covering the whole spectrum (all bands have the same size in terms of Bark pitches)

Bartlett: applies a Bartlett filter

Blackman: applies a Blackman filter

BpFilter:band-pass filter

Centroid: the Centroid of a vector of values

Chroma: computes the average of amplitudes of a signal in filtered pitch-bands covering the whole spectrum; all bands have the size of one pitch, and all pitches are wrapped onto a single octave

Correlation: correlation computed using FFT

dB: dB scale: 20 × log10(x)

Differentiation: the first derivative of a vector of values: the vector containing the differences of successive elements in the input vector

Division: divides every value by a constant parameter

Envelope: the outline of the signal: a signal that connects all the peaks of the input signal, calculated using the Hilbert transform

FFT: the discrete Fourier transform of a signal, based on FFTW

FilterBank: band-pass filters the signal across n bands of equal width

Flatness: measures the flatness of a signal

Hamming: applies a Hamming filter

Hann: applies a Hann filter

Hanning: applies a Hanning filter

HarmSpectralCentroid: the harmonic spectral centroid is computed as the average of the instantaneous harmonic spectral centroid within a running window. The instantaneous harmonic spectral centroid is computed as the amplitude-weighted (linear scale) mean of the harmonic peaks of the spectrum

HarmSpectralDeviation: the Harmonic Spectral Deviation is the average of the instantaneous harmonic spectral deviation within a running window. The instantaneous harmonic spectral deviation is the spectral deviation of amplitude (logarithmic scale) components from a global spectral envelope

HarmSpectralSpread: the harmonic spectral spread is the average of the instantaneous harmonic spectral spread within a running window. The instantaneous harmonic spectral spread is the amplitude (linear scale) weighted standard deviation of the harmonic peaks of the spectrum, normalized by the instantaneous harmonic spectral centroid

HarmSpectralVariation: the harmonic spectral variation is the mean of the instantaneous harmonic spectral variation. The instantaneous harmonic spectral variation is the normalized correlation between the amplitude (linear scale) of the harmonic peaks of two adjacent frames

HFC: the high-frequency content of a signal spectrum. It characterizes the amount of high-frequency content in the signal: the magnitudes of the spectral bins are summed after multiplying each magnitude by the bin position (proportional to the frequency)

HpFilter: high-pass filter

Integration: the cumulative sum of the values of a vector

Inverse: the multiplicative inverse (1/x) of each value

IQR: the interquartile range of a vector of values: the difference between the percentile at 75% and the percentile at 25%

Length: the number of items in a vector

Log10: log10(x)

LpFilter: low-pass filter

Max: the maximal value of any vector of values

MaxPos: the index of the maximal value of a vector

Mean: the mean value of a vector

Median: the median value of a vector

MelBands: computes the sums of amplitudes in filtered Mel bands covering the whole spectrum (all bands have the same size in terms of Mel pitches)

Mfcc: the Mel-Frequency Cepstral coefficients of a signal

Mfcc0: the Mel-Frequency Cepstral coefficients of a signal, including the coefficient 0 (the energy)

Min: the minimal value of a vector

Multiplication: multiplication by a constant

Normalize: normalizes a signal

Nth: the value of the n-th item in a vector

PeakPos: returns the positions of all the peaks of a signal

Percentile: for a vector v of values, of length l, and a parameter n, sorts the values of v in increasing order, and returns the value at position n × l / 100 (rounded if n × l / 100 is not an integer)

Pitch: the pitch of a signal computed using the spectral multiplication method

PitchBands: computes the sums of amplitudes in filtered pitch bands covering the whole spectrum

Power: for a parameter n, returns x^n

Range: difference between the max and min values of a vector of values

RemoveSilentFrames: takes a list of frames and returns the list without the silent frames (frames with only 0s)

RHF: the ratio of high frequencies of a signal: the sum of the squares of the high-frequency half of the spectrum divided by the sum of the squares of the low-frequency half of the spectrum

RMS: the root mean square of the signal

SpectralCentroid: the first moment of the spectrum, often interpreted as the brightness of the sound

SpectralDecrease: measures the decrease of the spectrum energy with perceptive criteria

SpectralFlatness: measures the similarity of the spectrum to a white noise

SpectralKurtosis: the peakedness of the spectrum compared to a Gaussian distribution (4th moment)

SpectralRolloff: the frequency below which most of the spectral energy is found (a rough approximation of the cutoff frequency between the harmonic part of the signal and noise)

SpectralSkewness: the asymmetry of the spectrum compared to a Gaussian distribution (3rd moment)

SpectralSpread: the spread of the spectral energy around the Centroid

Split: splits the input signal into constant-size frames (the size is specified as an argument); returns the list of frames

SplitOverlap: same as Split, but with overlapping frames; the overlap factor is a parameter

Sqrt: the square root of a value

Square: the square of a value

Sum: the sum of the values in a vector

Triangle: applies a triangle filter to a signal

Variance: the second central moment of a vector of values

ZCR: the zero-crossing rate of a signal

Harmonicity: the degree of acoustic periodicity, also called harmonics-to-noise ratio (HNR); see Praat, [74]

LTAS: the long-term average spectrum; see Praat, [74].
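As announced above, here is a minimal sketch of a few of these operators, under our own reading of their descriptions; filter orders, window conventions, and the attack envelope computation are assumptions for illustration, not the exact EDS implementations:

```python
import numpy as np
from scipy.signal import butter, lfilter

def zcr(x):
    """ZCR: fraction of successive samples whose sign changes."""
    return np.mean(np.signbit(x[:-1]) != np.signbit(x[1:]))

def rms(x):
    """RMS: root mean square of the signal."""
    return np.sqrt(np.mean(x ** 2))

def spectral_centroid(x, sr=44100):
    """SpectralCentroid: amplitude-weighted mean frequency (first moment)."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return np.sum(freqs * mag) / np.sum(mag)

def attack_time(x, sr=44100):
    """AttackTime: 50 Hz low-pass, then the time taken by the envelope
    to rise from 20% to 80% of its maximum (as described above)."""
    b, a = butter(2, 50.0 / (sr / 2), btype="low")   # assumed filter order
    env = lfilter(b, a, np.abs(x))
    t20 = np.argmax(env >= 0.2 * env.max())          # first crossing of 20%
    t80 = np.argmax(env >= 0.8 * env.max())          # first crossing of 80%
    return (t80 - t20) / sr

def split(x, size):
    """Split: cut the signal into constant-size frames."""
    return [x[i:i + size] for i in range(0, len(x) - size + 1, size)]
```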

3. Experiments

Many experiments were conducted with the EDS system to evaluate the extent to which interesting AFs can be found for difficult audio classification tasks. Several systematic studies are reported by Zils [46], including the detection of a singing voice in polyphonic recordings, the classification of musical instruments, and the classic problem of music genre classification. Music genre classification has received a lot of attention in music information retrieval research. Although it has been shown that existing large-scale music taxonomies are inconsistent with each other [75], several works have addressed the problem of automatic classification for small-scale genre taxonomies based on the acoustic properties of music signals (see, e.g., [33]). Like most music classification systems, published music genre classifiers are based on the statistical aggregation of short-term signal features. Because of its relative simplicity, four-genre classification has become a textbook case for signal classification research. The problem also provides a typical example in which AFs are useful. Zils [46] reports an improvement due to the use of AFs over known BOF features in four-class genre classification; the four easily distinguishable musical genres are Dance, Hip-Hop, Symphonic Classical, and Pop/Rock (see Table 1). Similarly, Mierswa [50] reports good results, although no comparison was made with other approaches on the same data set.

Table 1 Performance of AFs generated for a four-class music genre problem [46].

However, it should be noted that in all cases, the features found by Zils [46] or by Mierswa [50] fall into the category of BOF features. We do not think that music genre classification is an interesting example for deriving more complex features, because current systems (including our own approach) already achieve about 80% performance; the remaining error is mostly due to the difficulty of defining the "ground truth" precisely.

Applications in less studied domains were also explored. Interestingly, initial attempts to apply EDS to chord recognition problems showed reasonable performance, but without improvement over existing approaches [76]. Subsequent comparative studies using an improved version of the basic operator library were performed on f0 estimation and chord recognition problems [77]. These studies showed that EDS is able to match, and occasionally surpass, traditional ad hoc approaches that are known to be particularly efficient and difficult to improve upon. A study of urban sound analysis considered the use of AFs for hierarchical classification [78], and similarly concluded that AFs have advantages over standard methods. For movie remastering, Monceaux et al. [79] showed that perceptive descriptors built with AFs could be used to automate the tedious task of audio upmixing. All of these studies reported substantial improvements in classification using AFs over state-of-the-art audio features.

In the following sections, we describe the results obtained with AFs on several difficult audio classification tasks, and report on the most salient aspects of these studies.

3.1. Artificial Experiments

In this section, we describe two artificial experiments aimed at situating AFs empirically in the space of "known" features. The first experiment demonstrates that a single AF can perform better than a hundred known, generic features. The second shows the converse: that AFs can closely approximate a given known feature.

3.1.1. The Pink Noise Experiment

The problem we consider here, briefly sketched by Pachet and Zils [80], consists of detecting a sine waveform in a given frequency range (say, 0–1000 Hz), mixed with a powerful colored noise in a separate frequency range (1000–2000 Hz). Since the colored noise is the predominant characteristic of the signal, we show that generic features are unable to detect the isolated sine. For instance, examining the spectrum of a 650 Hz sine mixed with a 1000–2000 Hz colored noise (Figure 4), the peak of the sine is visible but not predominant; thus it is hard to extract automatically with general spectral features.

Figure 4

(a) The spectrum of a 650 Hz sine mixed with 1000–2000 Hz colored noise. (b) The spectrum of a 650 Hz sine mixed with 1000–2000 Hz colored noise, prefiltered by a 1000 Hz low-pass filter.

Of course, this problem is easy to solve manually with knowledge of how these signals are constructed. Prefiltering can be applied to cut off most of the colored noise frequencies, allowing the sine to emerge from the spectrum. Figure 4 shows the sine peak emerging when the signal is low-pass filtered, making it easier to extract automatically.

Although simple, this problem cannot be adequately solved using generic features, as demonstrated in the following experiment. We build a database of sine-plus-colored-noise signals with varying sine frequencies (from 0 to 1000 Hz). Each signal is labeled with the corresponding frequency of the sine. We then train two regression classifiers for this task. The first one (called REF) is trained with a "generic feature set" containing features taken from the Mpeg-7 audio standards; its dimension varies from 1 to 100 (details in [61]). The second one (called EDS) is trained with AFs built specifically for this problem. EDS explored about 40 000 AFs to find an almost optimal AF for this problem, using the most general pattern "". The result is illustrated in Figure 5, which shows clearly that one analytical feature (dimension 1) can solve the problem almost perfectly, whereas a full set of generic features (dimension 100) cannot. The best analytical feature found by EDS is the following:

This AF (A) is easy to interpret, and does almost exactly what should be done in this case, although probably not in the simplest way. Its fitness (Pearson coefficient) is 0.99. The most likely human feature, based on knowledge of the database creation process, would be something like

This theoretically perfect feature is also good (fitness = 0.87), but its fitness is lower than that of (A). This shows, incidentally, that in practice the best features are not always the most mathematically justified.
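The low-pass-and-peak-picking idea behind the "human" feature can be sketched as follows; the sample rate, filter orders, and noise level are illustrative assumptions, and the fitness is the Pearson correlation used throughout this paper:

```python
import numpy as np
from scipy.signal import butter, lfilter

SR = 8000  # assumed sample rate

def make_sample(f_sin, sr=SR, dur=1.0):
    """A test signal: a sine at f_sin plus band-limited (1-2 kHz) noise."""
    t = np.arange(int(sr * dur)) / sr
    b, a = butter(4, [1000 / (sr / 2), 2000 / (sr / 2)], btype="band")
    noise = lfilter(b, a, np.random.randn(len(t)))
    return np.sin(2 * np.pi * f_sin * t) + 3.0 * noise

def human_feature(x, sr=SR):
    """Low-pass at 1000 Hz, then return the frequency of the spectral peak."""
    b, a = butter(4, 1000 / (sr / 2), btype="low")
    mag = np.abs(np.fft.rfft(lfilter(b, a, x)))
    return np.fft.rfftfreq(len(x), 1 / sr)[np.argmax(mag[1:]) + 1]  # skip DC

# Fitness as used in the text: Pearson correlation between the feature
# values and the ground-truth sine frequencies.
freqs = np.random.uniform(50, 1000, size=200)
feats = [human_feature(make_sample(f)) for f in freqs]
fitness = np.corrcoef(feats, freqs)[0, 1]
```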

Figure 5

Results of the pink noise experiment. The solid line shows the performance of AFs, and the dotted line shows the reference features. One AF is enough to solve the problem almost perfectly, whereas a feature set of dimension 100 built from generic features cannot. The difference is even more drastic for smaller feature sets.

3.1.2. Approximating a Known Feature

The second experiment assesses how well AFs can approximate a known feature that is not directly expressible with the operator library. To this end, we chose an audio feature called "spectral compactness", with an implementation available from the jAudio feature extraction library [81]. This feature is described as follows (p. 603, ibid.): "Compactness is closely related to Spectral Smoothness as defined by McAdams [82]. The difference is that instead of summing over partials, compactness sums over frequency bins of an FFT. This provides an indication of the noisiness of the signal."

We do not know the details of the implementation of this feature, nor its precise definition, but they are not needed here. The aim of this experiment is to see how "reachable" this feature is, using the set of generic features and using AFs. Because spectral features are known to be correlated, and because MFCC coefficients are known to represent spectral information well in general, we conducted several experiments to show that spectral compactness is not easily reached from other known sets of generic features. We used a database of 2500 percussion sounds described by Roy et al. [83]. We used patterns "", "", and the usual BOF pattern "". The results are illustrated in Figure 6.
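Since the exact jAudio implementation is not specified here, the following is only a plausible rendering of such a smoothness-style target, summing over FFT bins the deviation of each log-magnitude from the mean of its neighborhood (after McAdams [82]); the exact form and normalization are assumptions:

```python
import numpy as np

def compactness_like(x, eps=1e-10):
    """A smoothness-style measure summed over FFT bins (assumed form)."""
    logm = np.log(np.abs(np.fft.rfft(x)) + eps)
    neighborhood = (logm[:-2] + logm[1:-1] + logm[2:]) / 3.0
    return np.sum(np.abs(logm[1:-1] - neighborhood))
```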

Figure 6

(a) A performance comparison of analytical versus reference features, for various feature set dimensions. AFs perform better, and the improvement is more important for feature sets of small dimension. (b) A comparison of various regression classifiers approximating the "spectral compactness" feature. The best AF yields a performance of 85%. It outperforms (1) the best reference feature (43%); (2) a feature set with 6 general spectral features (19%); (3) the first 10 MFCCs (38%); and (4) the whole set of reference features of dimension 100 (83%). A set with the 100 best AFs further improves the correlation, up to 90%.

We can see that the performance gain of analytical features increases as the size of the feature set decreases, with a spectacular result for single features (84% versus 43%). Additionally, Figure 6 shows that AFs outperform all the specific feature sets considered here: general spectral features, MFCCs, and the entire set of general features! EDS produced the following analytical feature, an instantiation of the pattern "":

This AF yields a correlation coefficient of 0.79. This is an interesting result, since it contains operators that explicitly appear in the definition of spectral compactness (see above). However, EDS produced an even better analytical feature,

whose correlation with spectral compactness is over 0.84. As a comparison, the best reference feature identified is

3.2. Perceived Intensity

One of the first problems addressed by EDS was the perceived energy problem (also called the subjective intensity problem). Perceived energy relates to the relative energy levels that listeners intuitively attribute to different songs played at the same volume level. For example, a punchy punk-rock song with loud saturated guitars and screaming voices sounds relatively "more energetic" than, say, an acoustic guitar ballad with a soft voice. To model this subjective impression, users were asked to label musical extracts of various genres with the "energy felt", independent of the listening volume [84]. We collected more than 2600 answers, about 12-13 answers per title. A consensus for the perceived intensity value was found for each title, using a process that eliminated extreme values, assigned a numeric value to each energy level (normalized between 0 for "Low" and 1 for "Very High"), and computed the mean value from this normalized data. To evaluate the relevance of this intensity value, we computed the standard deviation of the results for all of the listeners. For 200 titles (98%), the standard deviation was less than the distance between two successive categories (0.33), so the mean value was assumed to be a correct evaluation of the intensity. The remaining 2% of titles were removed from the database, as their intensity was considered to be too indefinite to be evaluated automatically. The resulting intensity values were used as ground truth to train EDS and to perform feature selection. We proceeded the same way on a second series of perceptive tests, building a 200-title database to test the AFs and classifiers.
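The consensus procedure described above can be sketched as follows; the intermediate label values and the exact trimming rule are assumptions, since the text only states that extreme values are eliminated and that "Low" maps to 0 and "Very High" to 1:

```python
import numpy as np

# Assumed four-level scale, evenly spaced so that the distance between two
# successive categories is 1/3 = 0.33, matching the threshold in the text.
LEVELS = {"Low": 0.0, "Medium": 1 / 3, "High": 2 / 3, "Very High": 1.0}

def consensus(answers, max_std=0.33):
    """Return (mean intensity, keep flag) for one title's listener answers."""
    values = np.array([LEVELS[a] for a in answers])
    lo, hi = np.percentile(values, [10, 90])     # assumed trimming rule
    trimmed = values[(values >= lo) & (values <= hi)]
    keep = trimmed.std() < max_std               # drop titles without consensus
    return trimmed.mean(), keep
```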

A statistical analysis of these tests showed that musical energy is a consensual concept; that is, most users perceive the same energy for the same songs, with a 10% statistical variance. We then built two labeled databases of 200 signals each, of length 5 s and sampled at 11025 Hz; one database was used for training and the other for testing. The results are shown in Table 2.

Table 2 The performances for the perceived intensity classifiers using various feature sets.

We then compared three different feature sets. The first set was obtained from 30 low-level Mpeg-7 descriptors, taken from Herrera et al. [85]. The second set was composed of manually derived AFs based on our "intuition" about perceived intensity, namely that intensity is related to the tempo or, more generally, to the energy variations of the signal. Several new features were built out of these intuitive concepts. Using the template of Scheirer's tempo extractor [52], we manually designed AFs describing the signal energy variations. We then correlated these features with the perceptive tests. The most strongly correlated AF, found by hand, was

with a correlation of 0.57 with the perceptive tests. This AF relates to the amplitude and frequency of the signal's raw energy. As a comparison, the tempo that was extracted automatically with Scheirer's method yielded a correlation of 0.55.

The third set of AFs was derived automatically by EDS using the AF BOF pattern (see Section 2.5.1). After evolving over 22 populations, 100% of the AFs created by EDS were found to perform better than the best Mpeg-7 features on the training database. They also yielded 91% accuracy on the test database.

The best AF found by EDS was:

with a fitness (Pearson correlation coefficient) of 0.744 on train, and 0.812 on test.

It is interesting to examine how EDS yielded its best AFs. At first, EDS found AFs close to the Mpeg7 low-level descriptors, but it eventually improved upon them with various types of preprocessing. For instance, EDS found:

which yielded a fitness (Pearson correlation) of 0.60. Then, EDS found AFs which were close to the empirically discovered Scheirer-inspired feature, although they had been improved with additional operators. The AF with the strongest correlation was:

with a correlation coefficient of 0.64. Finally, EDS found genuinely novel features, such as:

which yielded a fitness of 0.69, as well as the best feature shown above (BestEDS).

Interestingly, these findings (notably feature Mpeg7) were confirmed by a larger-scale manual study by Sandvold and Herrera [86], using a traditional feature selection approach (by definition, they could not find the other features found by EDS). To conclude our study, we evaluated the performance of classifiers built with the optimal combination of the resulting AFs. After running a final selection, EDS retained 24 features to use in the final energy classifier. The best method found was a kNN that yielded a score (accuracy) of 78.2% on train and 85.1% on test. It combined both the Mpeg7 features and the best AFs found by EDS. The model error of this best classifier was 11.3%. Considering the 10% variance of the perceptive tests, we believe that EDS found a virtually optimal model for this problem. The performances of these classifiers are compared in Table 2.

The study of perceived energy was an interesting one, since it demonstrated that EDS can outperform the available general-purpose feature sets, and since the results were confirmed by a later manual study.

3.3. Classification of Pandeiro Sounds

Analytical audio features can naturally be used to classify instrument sounds. Some small-scale studies were reported by Zils [46] for instrument databases, again demonstrating improvements using AFs compared with general-purpose features. However, the limitation of these kinds of studies is the difficulty of building reference ground truth databases [87]. As a consequence, any deficiencies in these classifiers cannot be attributed with certainty to their features. Some studies have addressed the subproblem of drum sound classification, for which the ground truth issue is less salient; however, traditional feature sets already yield excellent performance there, suggesting that the problem is not a very difficult one [38].

Again, the difficulty becomes apparent when we consider lower-level abstractions. For instance, tabla transcription [88] appears to be more difficult than general drum sound identification. An interesting study was performed on the focused but difficult problem of pandeiro sound classification [83, 89].

This study was motivated by the design of interactive music systems that are controlled in real time by percussive instruments. The pandeiro is a Brazilian percussion instrument that is able to produce six categories of sounds (see Figure 7). The problem addressed in our study was to classify pandeiro sounds in each of these six categories, using only the attack portion of the signal. Because the playing modes of the pandeiro, and therefore the six categories of sound, are well known, the ground truth problem becomes irrelevant and we can focus on the feature generation.

Figure 7

The gestures that produce the six basic Pandeiro sounds (tung, ting, tchi, tr, PA, pa).

The classification of complete sounds was reported by Roy et al. [83] and was shown to be a relatively easy task. However, the targeted application is an interactive music system. Therefore, we need to classify pandeiro sounds in real time, or at least within a few milliseconds, so that no interruption is perceived during the interaction. As a consequence, we cannot use complete sounds to classify a stroke, as they last approximately 150 milliseconds. Instead, we try to classify from the first 3 milliseconds (128 samples at 44 100 Hz) of each sound.

For this study, we recorded a database of 2448 complete, isolated Pandeiro sounds that were produced by the same instrument during different sessions. The 2448 samples were equally distributed (408 per class).

We extracted the attack portion of the sounds so that the attack lies in the middle of the considered sample (see Figure 8, [89]).

Figure 8

A full sound and its attack portion. The attack is defined by an amplitude threshold: when the amplitude goes above the threshold, the attack begins, and when it eventually goes under, the attack ends. In practice, we consider 128 samples after the beginning of the attack, that is, about 3 milliseconds.
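A sketch of the attack extraction described in Figure 8; the amplitude threshold value is an assumption, and for simplicity we keep the 128 samples that follow the onset rather than centering the attack:

```python
import numpy as np

def extract_attack(x, threshold=0.1, n=128):
    """Return the n samples following the first threshold crossing."""
    above = np.abs(x) > threshold
    if not above.any():
        return None                      # silent sample, no attack
    start = int(np.argmax(above))        # first sample above the threshold
    return x[start:start + n]
```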

To assess the efficiency of AFs, we compared them to results obtained with a reference feature set (a complete list of the reference features is given by [83]). This set includes features commonly used in audio signal classification tasks, notably the Mpeg-7 audio list, and Chroma, which is often used for music analysis [90]. We systematically evaluated the performance of two classifiers: one built with the reference set, the other built with the AFs found by EDS using the set of basic operators defined in Section 2.6.

Each evaluation experiment was divided into two parts. In the first part, classifiers were trained on the training database and tested on the test database. The databases were divided so that two-thirds of the samples were used for training and one-third for testing. The samples were chosen at random to avoid artifacts, such as those caused by the evolution of the membrane during the recording session, or by small variations in the player's gestures. In the second part of the experiment, classifiers were trained and tested on the test database only, using 10-fold cross-validation: each classifier was trained on 9/10 of the test database and its precision evaluated on the remaining 1/10, an operation repeated 10 times using random draws. The objective of this double experiment was to demonstrate that the advantages of using AFs do not depend on the conditions of the experiments. Cross-validating on the test database only is further justified because EDS uses the training database to evaluate the analytical features; reusing it to train the classifiers could introduce a bias.
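The double evaluation protocol can be sketched as follows with scikit-learn, assuming the feature matrix X (reference features or AFs) and labels y are already computed; the SVM with a default polynomial kernel matches the classifier described below, while the fold-splitting details are simplified:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

def double_evaluation(X, y, seed=0):
    # Part 1: 2/3 train, 1/3 test, samples drawn at random.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1 / 3, random_state=seed, stratify=y)
    holdout = SVC(kernel="poly").fit(X_tr, y_tr).score(X_te, y_te)

    # Part 2: 10-fold cross-validation on the test set only, so that the
    # data EDS used to evolve the AFs never trains these classifiers.
    cv = cross_val_score(SVC(kernel="poly"), X_te, y_te, cv=10)
    return holdout, cv.mean()
```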

Finally, we also evaluated the signal itself (128 samples) as a feature (this is possible here only because these attack signals are very short). This was done to confirm that the raw signal is not a good feature, and also to provide a point of comparison for the other experiments.

We used an SVM with a default polynomial kernel. EDS was run fully automatically to create and select AFs. The patterns used included the following simple patterns (we deliberately avoided the Split operator, since the samples are very short):

(i)"",

(ii)"".

We ran the genetic search until the improvements in feature fitness converged. EDS evaluated about 50,000 features. Because attacks are small signals, only 17 reference features could be computed and evaluated, with a total feature set dimension of 90. We therefore selected 90 AFs out of the AFs found by EDS on the attack. We present results obtained for various sizes of the feature sets, ranging from 1 to 90; this is an important aspect for real-time applications. As seen in Table 3, EDS finds not only better features, but also feature sets with a smaller dimension.

Figure 9

The best AFs found for dog bark classification, using operators borrowed from speech recognition. Note their relative syntactic simplicity, and the use of different window sizes.

Again, AFs found by EDS improve the classification performance. AFs are superior to general features, in particular for small feature sets; three AFs perform as well as the 50 best general features.

For the experiments reported in Table 3, the feature selection algorithm used is based on syntactic properties of the analytical features. Other feature selection schemes, such as Information Gain Ratio (IGR), yield the same results. However, for feature sets of small dimension, the performance of IGR decreases substantially (this result is discussed in [91]).
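A hedged sketch of such a syntax-based selection: greedily pick features that are far, under some tree distance (e.g., the Zhang-Shasha tree edit distance [70]), from those already selected. The actual EDS criterion is not fully specified here, so the distance and the greedy scheme are assumptions:

```python
def select_by_syntax(features, distance, k):
    """features: AF trees sorted by fitness; distance: tree distance."""
    selected = [features[0]]                 # seed with the fittest AF
    for _ in range(k - 1):                   # assumes k <= len(features)
        best = max((f for f in features if f not in selected),
                   key=lambda f: min(distance(f, s) for s in selected))
        selected.append(best)                # most syntactically novel AF
    return selected
```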

Table 3 Results obtained for the attack problem. Reference refers to a reference feature set of MPEG7 audio features. AF refers to analytical features generated with EDS. Feature sets of smaller dimensions are subsets of the feature sets of dimension 90, constructed using our feature selection algorithm based on AF syntactical structure. Signal refers to the signal used as a feature set, for comparison.

The performance gain due to analytical features for small feature sets is advantageous, especially for real-time applications. Table 3 shows that three AF features yield better precision than 50 reference features. These features are:

3.4. Classification of Animal Sounds

A pioneering large-scale study was conducted on the automatic classification of companion dog barks using machine learning algorithms [92]. The goal of this study was to determine whether specific but speaker-independent (so to speak) acoustic features of barks were encoded in different "barking situations". Previous studies have shown that humans are able to categorize the various barks they hear and to associate them with the appropriate emotional content [93]. They also showed that humans with different levels of dog experience exhibit similar trends when categorizing the possible inner state of a barking dog. Another study showed that human perception of the motivational state of dogs is influenced by acoustic parameters of the barks [94]. In contrast, humans show only modest accuracy in discriminating between individual dogs by hearing their barks [95].

The database consists of 6646 barks recorded from 14 individuals of the Mudi breed, categorized into 6 different situations: "stranger", "fight", "walk", "alone", "ball", and "play". We investigated the performance of AFs for training and testing six-category classifiers on this database. An interesting aspect of this study is that we used an external library [74] to complement the basic operator library. The recognition of dog barks requires the analysis of specific signal features, similar to the features used routinely in human speech recognition. Spectral operators used in speech recognition, such as MFCCs, are included in the basic EDS library, but others are not, such as formants, which are specific to speech. We used the BOF patterns "" and "". With this extension of the basic operator library, the best AFs found by EDS were the (admittedly relatively simple) ones shown in Figure 9.

The most interesting conclusion of this study was that the automatic classifier could reliably discriminate between individual dogs while humans could not [95], as shown in Figure 10. This was the main difference between the performances of the machines and humans.

Figure 10

An AF-based classifier outperforms humans at dog-independent bark recognition tasks.

This supports the concept that there are individual differences in dog barks, but humans are not able to recognize them easily. Incidentally, this study (discussed in detail in Molnár et al. [92]) provides a new perspective for understanding animal communication. There may be "hidden" acoustic features in animal sounds that could eventually be identified. Similar issues have been raised regularly by behavioral scientists of all sorts, for example in the study of bird songs [96].

4. Discussion

The traditional view of feature generation, inspired by supervised concept induction, is that the performance of classifiers can be boosted if the "natural" features are not expressive enough. In this context, feature generation is an additional method for improving classifier performance, along with feature selection, kernel parameter tuning, and boosting. The situation is different when the items to classify are raw signals, as no "natural" features are known a priori. However, there is a huge body of technical knowledge about particular signal types (e.g., audio) and particular operators. For the most part, the link between signals and features for a particular classification task is still established heuristically and manually in the signal processing research community. Here we introduce analytical features, in association with a feature generation scheme that produces feature sets adequate for arbitrary supervised audio classification problems. For efficiency, the generation of AFs exploits explicit knowledge representation constructs, notably heuristics, rewriting rules, and function patterns, all based on the AF dimensional type system. We have shown that AFs improve classifier performance on concrete and difficult classification problems. More importantly, they fill the gap between the body of scientific knowledge about audio signals and the feature identification problem inherent to all signal classification tasks. In this respect, AFs represent the knowledge-based counterpart of feature generation studies.

However, the use of analytical features raises several open issues, related to the very nature of signals.

  (i)

    The operators considered so far were chosen by studying existing, efficient features designed manually. However, it cannot be assumed that these operators are the best ones. More cognitively plausible operators could substantially improve the performance of generated features. One possibility is to design operators by taking inspiration from models of the cochlea, as suggested by Pressnitzer and Gnansia [18], as a way to evolve more biologically grounded features. Another possibility is to manipulate lower-level, sample-based operators, for instance, by using data flow models of signal processing functions [97]. For applications dealing with psychologically high-level classes like musical genre, operators explicitly dealing with long-term properties of the signal that are not expressible through statistical moments should be investigated in more detail. Finally, the intrinsic limitation of AFs compared with more general programs can be overcome by allowing the system to create new operators on the fly. For instance, a classifier achieving good performance on different problems could be abstracted into a new operator, and then reused on other problems.

  (ii)

    We do not know much about the topology of the AF space. How can we define similarity measures for AFs? Are there clusters or hubs in this space? How can we infer feature properties (fitness) from their syntactic representation? Preliminary work has started to address these issues [69], but much remains to be done before the space is better understood mathematically, and before we can predict the usefulness of generated features.

  (iii)

    To date, the set of heuristics (and rewriting rules) is hand-coded. Ideally, this knowledge about signal processing operators and functions would be acquired automatically by the system, similarly to AM's mathematical discoveries [53]. To do so, the system needs to be able to design experiments, assess their interestingness, and infer general properties from them. This would allow the system to update its body of knowledge autonomously and continuously. Such an autotelic automated researcher is currently being investigated, bootstrapping from the problems and results of the EDS study.

  (iv)

    As predicted by Lenat's observations [54], it appears that EDS, like other feature generation systems, eventually reaches the fringe of the AF space, although it certainly does not explore all of it. This phenomenon (most AFs being similar after a certain point) may be related to the intrinsic brittleness of systems built upon a finite set of basic operators. A way to increase the potential performance of feature generation is to open the system so that it exploits information or knowledge produced "outside", for example, knowledge available online, in an e-science perspective. Technically, this requires a different design of the operator library and of the type system onto which everything is grafted. The use of tags as substitutes for operator types is an interesting avenue of research. Tags have already proven effective for indexing knowledge implicitly expressed in online scientific papers, most of which are not fully exploited (or even read).

Whereas supervised concept induction is primarily used as a tool for optimizing the performance of problem solvers (starting with Samuel's checkers), the AF construct allows symbolic techniques to be carried over to signal-based problems. This transition offers a practical way to investigate two fundamental questions of human perception: (1) to what extent is human perceptual categorization of items based on objective features of those items, and (2) in those cases, can we identify these features explicitly? Analytical features do not provide a definitive answer to these two questions in general. However, by performing a systematic search in large function spaces, this approach lessens the impact of arbitrarily chosen feature sets on classifier performance. As such, analytical features provide a promising avenue for addressing these issues, as illustrated here in the audio domain, but also, we think, in all other domains of human categorization.

References

  1. Samuel AL: Some studies in machine learning using the game of checkers. IBM Journal of Research and Development 1959,3(3):211-229.

  2. Buro M: From simple features to sophisticated evaluation functions. In Proceedings of the 1st International Conference on Computers and Games (CG '98), Lecture Notes in Computer Science. Volume 1558. Springer, London, UK; 1998:126-145.

  3. Fawcett TE, Utgoff PE: Automatic feature generation for problem solving systems. In Proceedings of the 9th International Workshop on Machine Learning (ML '92), July 1992, Aberdeen, Scotland. Edited by: Sleeman D, Edwards P. Morgan-Kaufmann; 144-153.

  4. Michalski RS: A theory and methodology of inductive learning. In Machine Learning: An Artificial Intelligence Approach. Edited by: Michalski RS, Carbonell JG, Mitchell TM. Morgan Kaufmann, San Mateo, Calif, USA; 1983:83-134.

  5. De Jong KA, Spears WM, Gordon DF: Using genetic algorithms for concept learning. Machine Learning 1993,13(2-3):161-188.

  6. Fawcett TE: Knowledge-based feature discovery for evaluation functions. Computational Intelligence 1996,12(1):42-64. 10.1111/j.1467-8640.1996.tb00252.x

  7. Bagallo G, Haussler D: Boolean feature discovery in empirical learning. Machine Learning 1990,5(1):71-99.

  8. Salzberg S: Improving classification methods via feature selection. Department of Computer Science, Johns Hopkins University, Baltimore, Md, USA; 1992.

  9. Guyon I, Elisseeff A: An introduction to variable and feature selection. The Journal of Machine Learning Research 2003, 3: 1157-1182.

  10. Zhao Z, Liu H: Searching for interacting features. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI '07), January 2007, Hyderabad, India 1156-1161.

  11. Yang D, Rendell LA, Blix G: A scheme for feature construction and a comparison of empirical methods. Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI '91), August 1991, Sydney, Australia 699-704.

  12. Markovitch S, Rosenstein D: Feature generation using general constructor functions. Machine Learning 2002,49(1):59-98. 10.1023/A:1014046307775

  13. Gabrilovich E, Markovitch S: Feature generation for text categorization using world knowledge. Proceedings of the International Joint Conferences on Artificial Intelligence (IJCAI '05), August 2005, Edinburgh, UK 1048-1053.

  14. Krawiec K: Genetic programming-based construction of features for machine learning and knowledge discovery tasks. Genetic Programming and Evolvable Machines 2002,3(4):329-343. 10.1023/A:1020984725014

  15. Asuncion A, Newman DJ: UC Irvine Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, Calif, USA, 2007, http://archive.ics.uci.edu/ml

  16. Bellman RE: Dynamic Programming. Princeton University Press, Princeton, NJ, USA; 1957.

  17. Risset J-C: Timbre analysis by synthesis: representations, imitations, and variants for musical composition. In Representation of Musical Signals. MIT Press, Cambridge, Mass, USA; 1991:7-43.

  18. Pressnitzer D, Gnansia D: Real-time auditory models. Proceedings of the International Computer Music Conference (ICMC '05), September 2005, Barcelona, Spain 295-298.

  19. Theodoridis S, Koutroumbas K: Pattern Recognition. 3rd edition. Academic Press, San Diego, Calif, USA; 2006:370-396.

  20. Ritthoff O, Klinkenberg R, Fischer S, Mierswa I: A hybrid approach to feature selection and generation using an evolutionary algorithm. In Proceedings of the UK Workshop on Computational Intelligence (UKCI '02), September 2002, Birmingham, UK Edited by: Bullinaria J. 147-154.

  21. Lim SH, Wang L-L, DeJong G: Explanation-based feature construction. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI '07), January 2007, Hyderabad, India 931-936.

  22. Guo H, Jack LB, Nandi AK: Feature generation using genetic programming with application to fault classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B 2005,35(1):89-99. 10.1109/TSMCB.2004.841426

  23. Guo H, Nandi AK: Breast cancer diagnosis using genetic programming generated feature. Pattern Recognition 2006,39(5):980-987. 10.1016/j.patcog.2005.10.001

  24. Dogan RI, Getoor L, Wilbur WJ: A feature generation algorithm for sequences with application to splice-site prediction. Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD '06), September 2006, Berlin, Germany, Lecture Notes in Computer Science 4213: 553-560.

  25. Solorio T, Fuentes O, Ward N, Bayyari YA: Prosodic feature generation for back-channel prediction. Proceedings of the 9th International Conference on Spoken Language Processing (INTERSPEECH-ICSLP '06), September 2006, Pittsburgh, Pa, USA 5: 2398-2401.

  26. Eads D, Glocer K, Perkins S, Theiler J: Grammar-guided feature extraction for time series classification. Proceedings of the 9th Annual Conference on Neural Information Processing Systems (NIPS '05), December 2005, Vancouver, Canada

  27. Eads DR, Hill D, Davis S, et al.: Genetic algorithms and support vector machines for time series classification. Applications and Science of Neural Networks, Fuzzy Systems, and Evolutionary Computation V, July 2002, Seattle, Wash, USA, Proceedings of SPIE 4787: 74-85.

  28. Krawiec K, Bhanu B: Visual learning by coevolutionary feature synthesis. IEEE Transactions on Systems, Man, and Cybernetics, Part B 2005,35(3):409-425. 10.1109/TSMCB.2005.846644

  29. Peeters G: A large set of audio features for sound description (similarity and classification) in the CUIDADO project. 2003, http://www.ircam.fr/anasyn/peeters/ARTICLES/Peeters_2003_cuidadoaudiofeatures.pdf

  30. West K, Cox S: Features and classifiers for the automatic classification of musical audio signals. Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR '04), October 2004, Barcelona, Spain 531-536.

  31. Aucouturier J-J, Pachet F: Improving timbre similarity: how high's the sky? Journal of Negative Results in Speech and Audio Sciences 2004,1(11):1-11.

  32. Aucouturier J-J: Ten experiments on the modeling of polyphonic timbre, Ph.D. thesis. University of Paris 6, Paris, France; 2006.

  33. Tzanetakis G, Cook P: Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 2002,10(5):293-302. 10.1109/TSA.2002.800560

  34. McKay C, Fujinaga I: Automatic genre classification using large high-level musical feature sets. Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR '04), October 2004, Barcelona, Spain 525-530.

  35. Eronen A, Klapuri A: Musical instrument recognition using cepstral coefficients and temporal features. Proceedings of the IEEE Interntional Conference on Acoustics, Speech, and Signal Processing (ICASSP '00), June 2000, Istanbul, Turkey 2: 753-756.

  36. Peeters G: Automatic classification of large musical instrument databases using hierarchical classifiers with inertia ratio maximization. Proceedings of the 115th Convention of the Audio Engineering Society (AES '03), October 2003, New York, NY, USA

  37. Tindale AR, Kapur A, Tzanetakis G, Fujinaga I: Retrieval of percussion gestures using timbre classification techniques. Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR '04), October 2004, Barcelona, Spain 541-545.

  38. Herrera P, Yeterian A, Gouyon F: Automatic classification of drum sounds: a comparison of feature selection methods and classification techniques. In Proceedings of the 2nd International Conference on Music and Artificial Intelligence (ICMAI '02), September 2002, Edinburgh, UK. Springer; 69-80.

  39. Cano P, Batlle E, Kalker T, Haitsma J: A review of audio fingerprinting. Journal of VLSI Signal Processing 2005,41(3):271-284. 10.1007/s11265-005-4151-3

  40. Hanna P, Louis N, Desainte-Catherine M, Benois-Pineau J: Audio features for noisy sound segmentation. Proceedings of the 5th International Symposium on Music Information Retrieval (ISMIR '04), October 2004, Barcelona, Spain 120-123.

  41. Nwe TL, Wang Y: Automatic detection of vocal segments in popular songs. Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR '04), October 2004, Barcelona, Spain 138-145.

  42. Lu L, Liu D, Zhang H-J: Automatic mood detection and tracking of music audio signals. IEEE Transactions on Audio, Speech, and Language Processing 2006,14(1):5-18.

  43. Scheirer E, Slaney M: Construction and evaluation of a robust multifeature speech/music discriminator. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '97), April 1997, Munich, Germany 2: 1331-1334.

  44. Aucouturier J-J, Pachet F: Tools and architecture for the evaluation of similarity measures: case study of timbre similarity. Proceedings of the International Symposium on Music Information Retrieval (ISMIR '04), October 2004, Barcelona, Spain 198-203.

  45. Pachet F, Zils A: Evolving automatically high-level music descriptors from acoustic signals. Proceedings of the International Symposium on Computer Music Modeling and Retrieval (CMMR '03), May 2004, Montpellier, France, Lecture Notes in Computer Science 2771: 42-53.

  46. Zils A: Extraction de descripteurs musicaux: une approche évolutionniste, Ph.D. thesis. University of Paris 6, Paris, France; September 2004. http://aymeric.zils.free.fr/papiers/these04.pdf

  47. Mörchen F, Ultsch A, Thies M, Löhken I: Modeling timbre distance with temporal statistics from polyphonic music. IEEE Transactions on Audio, Speech, and Language Processing 2006,14(1):81-90.

  48. Mörchen F, Mierswa I, Ultsch A: Understandable models of music collections based on exhaustive feature generation with temporal statistics. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06), August 2006, Philadelphia, Pa, USA 882-891.

  49. Mierswa I, Morik K: Automatic feature extraction for classifying audio data. Machine Learning 2005,58(2-3):127-149. 10.1007/s10994-005-5824-7

  50. Mierswa I: Automatic feature extraction from large time series. In Classification—The Ubiquitous Challenge: Proceedings of the 28th Annual Conference of the Gesellschaft für Klassifikation e.V.. Edited by: Weihs C, Gaul W. Springer, Berlin, Germany; 2004:600-607.

  51. Schuller B, Reiter S, Rigoll G: Evolutionary feature generation in speech emotion recognition. Proceedings of IEEE International Conference on Multimedia and Expo (ICME '06), July 2006, Toronto, Canada 5-8.

  52. Scheirer ED: Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America 1998,103(1):588-601. 10.1121/1.421129

  53. Lenat DB: AM: an artificial intelligence approach to discovery in mathematics as heuristic search, Ph.D. thesis. Stanford University, Stanford, Calif, USA; 1976.

  54. Lenat DB, Feigenbaum EA: On the thresholds of knowledge. Artificial Intelligence 1991,47(1–3):185-250.

  55. Lenat DB: When will machines learn? Machine Learning 1989,4(3-4):255-257. 10.1007/BF00130713

  56. Koza JR: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, Mass, USA; 1992.

  57. Goldberg DE: Genetic Algorithms in Search, Optimization & Machine Learning. Addison-Wesley, Reading, Mass, USA; 1989.

  58. Haase KW Jr.: Invention and exploration in discovery, Ph.D. thesis. Massachusetts Institute of Technology, Cambridge, Mass, USA; 1990.

  59. Cristianini N, Shawe-Taylor J: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, UK; 2000.

  60. Sun Q, DeJong G: Feature kernel functions: improving SVMs using high-level knowledge. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005, San Diego, Calif, USA 2: 177-183.

  61. Pachet F, Roy P: Exploring billions of audio features. Proceedings of the International Workshop on Content-Based Multimedia Indexing (CBMI '07), June 2007, Bordeaux, France 227-235.

  62. Montana DJ: Strongly typed genetic programming. Evolutionary Computation 1995,3(2):199-230. 10.1162/evco.1995.3.2.199

  63. Pierce B: Types and Programming Languages. MIT Press, Cambridge, Mass, USA; 2002.

  64. Fisher RA: The use of multiple measurements in taxonometric problems. Annals of Eugenics 1936, 7: 179-186. 10.1111/j.1469-1809.1936.tb02137.x

  65. Kohavi R, John GH: The wrapper approach. In Feature Selection for Knowledge Discovery and Data Mining. Edited by: Liu H, Motoda H. Kluwer Academic Publishers, Dordrecht, The Netherlands; 1998:33-50.

  66. van Rijsbergen CJ: Information Retrieval. Butterworths, London, UK; 1979.

  67. Edwards AL: The correlation coefficient. In An Introduction to Linear Regression and Correlation. W. H. Freeman, San Francisco, Calif, USA; 1976:33-46.

  68. Quinlan JR: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, Calif, USA; 1993.

  69. Barbieri G, Degli Esposti M, Pachet F, Roy P: Complexity of the fitness landscape for analytical features. Sony Computer Science Laboratory-Paris, Paris, France; 2009.

  70. Zhang K, Shasha D: Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing 1989,18(6):1245-1262. 10.1137/0218082

  71. Oppenheim AV, Schafer RW: Discrete-Time Signal Processing. Prentice Hall, Upper Saddle River, NJ, USA; 1999.

  72. Dershowitz N, Jouannaud J-P: Rewrite systems. In Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics. Elsevier, Amsterdam, The Netherlands; 1990:243-320.

  73. Khardon R, Servedio RA: Maximum margin algorithms with Boolean kernels. The Journal of Machine Learning Research 2005, 6: 1405-1429.

  74. Boersma P, Weenink D: PRAAT: a system for doing phonetics by computer. Institute for Phonetic Sciences of the University of Amsterdam, Amsterdam, The Netherlands; 1996. http://www.praat.org

  75. Pachet F, Cazaly D: A taxonomy of musical genre. In Proceedings of Content-Based Multimedia Information Access (RIAO) Conference, April 2000, Paris, France. Volume 2. Collège de France; 1238-1246.

  76. Cabral G, Pachet F, Briot J-P: Recognizing chords with EDS: part one. Proceedings of the 3rd International Symposium on Computer Music Modeling and Retrieval (CMMR '06), September 2006, Pisa, Italy, Lecture Notes in Computer Science 3902: 185-195.

  77. Cabral G, Briot J-P, Krakowski S, Velho L, Pachet F, Roy P: Some case studies in automatic descriptor extraction. Proceedings of the Brazilian Symposium on Computer Music (SBCM '07), May 2007, Sao Paulo, Brazil

  78. Defréville B, Roy P, Rosin C, Pachet F: Automatic recognition of urban sound sources. Proceedings of the 120th AES Conference, May 2006, Paris, France

  79. Monceaux J, Pachet F, Amadu F, Roy P, Zils A: Descriptor-based spatialization. Proceedings of the 118th Convention of the Audio Engineering Society (AES '05), May 2005, Barcelona, Spain 1-8.

  80. Pachet F, Zils A: Automatic extraction of music descriptors from acoustic signals. Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR '04), October 2004, Barcelona, Spain 353-356.

  81. McEnnis D, McKay C, Fujinaga I, Depalle P: Jaudio: a feature extraction library. Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR '05), September 2005, London, UK 600-603.

  82. McAdams S: Perspectives on the contribution of timbre to musical structure. Computer Music Journal 1999,23(3):85-102. 10.1162/014892699559797

  83. Roy P, Pachet F, Krakowski S: Improving the classification of percussive sounds with analytical features: a case study. Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07), September 2007, Vienna, Austria 229-232.

  84. Zils A, Pachet F: Extracting automatically the perceived intensity of music titles. Proceedings of the 6th Conference on Digital Audio Effects (DAFx '03), September 2003, London, UK 180-183.

  85. Herrera P, Serra X, Peeters G: Audio descriptors and descriptors schemes in the context of MPEG-7. Proceedings of the International Computer Music Conference (ICMC '99), October 1999, Beijing, China

  86. Sandvold V, Herrera P: Towards a semantic descriptor of subjective intensity in music. Proceedings of the International Computer Music Conference (ICMC '05), September 2005, Barcelona, Spain

  87. Livshin A, Rodet X: The importance of cross database evaluation in musical instrument sound classification: a critical approach. Proceedings of the International Symposium on Music Information Retrieval (ISMIR '03), October 2003, Baltimore, Md, USA 241-242.

  88. Chordia P: Segmentation and recognition of Tabla strokes. Proceedings of the 11th International Symposium on Music Information Retrieval (ISMIR '05), September 2005, London, UK 107-114.

  89. Roy P, Pachet F, Krakowski S: Analytical features for the classification of percussive sounds: the case of the Pandeiro. Proceedings of the 10th International Conference on Digital Audio Effects (DAFx '07), September 2007, Bordeaux, France 213-220.

  90. Gómez E: Tonal description of music audio signals, Ph.D. thesis. Universitat Pompeu Fabra, Barcelona, Spain; 2006.

  91. Blum AL, Langley P: Selection of relevant features and examples in machine learning. Artificial Intelligence 1997,97(1-2):245-271. 10.1016/S0004-3702(97)00063-5

  92. Molnár C, Kaplan F, Roy P, et al.: Classification of dog barks: a machine learning approach. Animal Cognition 2008,11(3):389-400. 10.1007/s10071-007-0129-9

  93. Pongrácz P, Molnár C, Miklósi Á, Csányi V: Human listeners are able to classify dog (Canis familiaris) barks recorded in different situations. Journal of Comparative Psychology 2005,119(2):136-144.

  94. Pongrácz P, Molnár C, Miklósi Á: Acoustic parameters of dog barks carry emotional information for humans. Applied Animal Behavior Science 2006,100(3-4):228-240. 10.1016/j.applanim.2005.12.004

  95. Molnár C, Pongrácz P, Dóka A, Miklósi Á: Can humans discriminate between dogs on the base of the acoustic parameters of barks? Behavioural Processes 2006,73(1):76-83. 10.1016/j.beproc.2006.03.014

  96. Draganoiu T, Pasteau M, Nagle L: Do different elements of Black redstart song have different threat values? Journal of Ornithology 2006, 147: 158-159.

  97. Orlarey Y, Fober D, Letz S: Syntactical and semantical aspects of Faust. Soft Computing 2004,8(9):623-632.

Author information

Correspondence to François Pachet.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Pachet, F., Roy, P. Analytical Features: A Knowledge-Based Approach to Audio Feature Generation. J AUDIO SPEECH MUSIC PROC. 2009, 153017 (2009). https://doi.org/10.1155/2009/153017
