
Time–frequency scattering accurately models auditory similarities between instrumental playing techniques


Instrumental playing techniques such as vibratos, glissandos, and trills often denote musical expressivity, both in classical and folk contexts. However, most existing approaches to music similarity retrieval fail to describe timbre beyond the so-called “ordinary” technique, use instrument identity as a proxy for timbre quality, and do not allow for customization to the perceptual idiosyncrasies of a new subject. In this article, we ask 31 human participants to organize 78 isolated notes into a set of timbre clusters. Analyzing their responses suggests that timbre perception operates within a more flexible taxonomy than those provided by instruments or playing techniques alone. In addition, we propose a machine listening model to recover the cluster graph of auditory similarities across instruments, mutes, and techniques. Our model relies on joint time–frequency scattering features to extract spectrotemporal modulations as acoustic features. Furthermore, it minimizes triplet loss in the cluster graph by means of the large-margin nearest neighbor (LMNN) metric learning algorithm. Over a dataset of 9346 isolated notes, we report a state-of-the-art average precision at rank five (AP@5) of 99.0%±1. An ablation study demonstrates that removing either the joint time–frequency scattering transform or the metric learning algorithm noticeably degrades performance.

1 Introduction

Music information retrieval (MIR) operates at two levels: symbolic and auditory [1]. By relying on a notation system, the symbolic level allows the comparison of musical notes in terms of quantitative attributes, such as duration, pitch, and intensity at the source. Timbre, in contrast, is a qualitative attribute of music and is thus not reducible to a one-dimensional axis [2]. As a result, symbolic representations describe timbre indirectly, either via visuotactile metaphors (e.g., bright, rough, and so forth [3]) or via an instrumental playing technique (e.g., bowed or plucked) [4].

Despite their widespread use, purely linguistic references to timbre fail to convey the intention of the composer. On the one hand, adjectives such as bright or rough are prone to misunderstanding, as they do not prescribe any musical gesture that is capable of achieving them [5]. On the other hand, the sole mention of a playing technique does not specify its effect in terms of auditory perception. For instance, although the term breathy alludes to a playing technique that is specific to wind instruments, a cellist may accomplish a seemingly breathy timbre by bowing near the fingerboard, i.e., sul tasto in the classical terminology. Yet, in a diverse instrumentarium, the semantic similarity between playing technique denominations does not reflect such acoustical similarity [6].

Although a notation-based study of playing techniques in music has research potential in music information retrieval [7], the prospect of modeling timbre perception necessarily exceeds the symbolic domain. Instead, it involves a cognitive process which arises from the subjective experience of listening [8]. The simulation of this cognitive process amounts to the design of a multidimensional feature space wherein some distance function evaluates pairs of stimuli. Rather than merely discriminating instruments as mutually exclusive categories, this function must reflect judgments of acoustic dissimilarity, all other parameters—duration, pitch, and intensity—being equal [9].

1.1 Use case

Behind the overarching challenge of coming up with a robust predictive model for listening behaviors in humans, the main practical application of timbre similarity retrieval lies in the emerging topic of computer-assisted orchestration [10]. In this context, the composer queries the software with an arbitrary audio signal. The outcome is another audio signal, which is selected from a database of instrumental samples and is perceptually similar to the query.

The advantage of this search is that, unlike the query, the retrieved sound is precisely encoded in terms of duration, pitch, intensity, instrument, and playing technique. Thus, following the esthetic tradition of spectralism in contemporary music creation, the computer serves as a bridge from the auditory level to the symbolic level, i.e., from a potentially infinite realm of timbral sensations to a musical score of predefined range [11].

Composers may personalize their search engine by manually defining a cluster graph via the Cyberlioz interface (see Section 3). This preliminary annotation lasts 30 to 60 min, which is relatively short in comparison with the duration of the N=9346 audio samples in the database: i.e., roughly three hours of audio.

1.2 Goal

This article proposes a machine listening system which computes the dissimilarity in timbre between two audio samples. Crucially, this dissimilarity is not evaluated in terms of acoustic tags, but in terms of ad hoc clusters, as defined by a human consensus of auditory judgments. Our system consists of two stages: unsupervised feature extraction and supervised metric learning. The feature extraction stage is a nonlinear map which relies on the joint time–frequency scattering transform [12, 13], followed by per-feature Gaussianization [14]. It encodes patterns of spectrotemporal modulation in the acoustic query while offering numerical guarantees of stability to local deformations [15]. The metric learning stage is a linear map, optimized via large-margin nearest neighbors (LMNN) [16]. It reweights scattering coefficients so that pairwise distances between samples more accurately reflect human judgments on a training set. These human judgments may be sourced from a single subject or the intersubjective consensus of multiple participants [17].

Figure 1 summarizes our experimental protocol: it illustrates how visual annotations (top) can inform feature extraction (center) to produce a nearest-neighbor search engine which is consistent with human judgments of timbre similarity (bottom).

Fig. 1 Overview of the proposed approach. See Section 1 for details

1.3 Approach

The main contribution of this article can be formulated as the intersection between three topics. To the best of our knowledge, prior literature has addressed these topics separately, but never in combination.

First, our dataset encompasses a broad range of extended playing techniques, well beyond the so-called “ordinary” mode of acoustic production. Specifically, we fit pairwise judgments for 78 different techniques arising from 16 instruments, some of which include removable timbre-altering devices such as mutes.

Secondly, we purposefully disregard the playing technique metadata underlying each audio sample during the training phase of our model. In other words, we rely on listeners, not performers, to define and evaluate the task at hand.

Thirdly, we supplement our quantitative benchmark with visualizations of time–frequency scattering coefficients in the rate–scale domain for various typical samples of instrumental playing techniques. These visualizations are in line with visualizations of the modulation power spectrum in auditory neurophysiology [18], while offering an accelerated algorithm for scalable feature extraction.

Our paper strives to fill the gap in scholarship between MIR and music cognition in the context of extended playing techniques. From the standpoint of MIR, the model presented here offers an efficient and generic multidimensional representation for timbre similarity, alongside theoretical guarantees of robustness to elastic deformations in the time–frequency domain. Conversely, from the standpoint of music cognition, our model offers a scalable and biologically plausible surrogate for stimulus-based collection of acoustic dissimilarity judgments, which is readily tailored to subjective preferences.

2 Related work

Timbre involves multiple time scales in conjunction, from a few microseconds for an attack transient to several seconds for a sustained tone. Therefore, computational models of timbre perception must summarize acoustic information over a long analysis window [19]. Mapping this input to a feature space in which distances denote timbral dissimilarity requires a data-driven stage of dimensionality reduction. In this respect, the scientific literature exhibits a methodological divide as regards the collection of human-annotated data [20]: while the field of MIR mostly encodes timbre under the form of “audio tags,” music psychology mostly measures timbre similarity directly from pairwise similarity judgments.

2.1 Automatic classification of musical instruments and playing techniques

On the one hand, most publications in music information retrieval cast timbre modeling as an audio classification problem [21–30]. In this context, the instrumentation of each musical excerpt serves as an unstructured set of “audio tags,” encoded as binary outputs within some predefined label space. Because such tags often belong to the metadata of music releases, the process of curating a training set for musical instrument classification requires little or no human intervention. Although scraping user-generated content from online music platforms may not always reflect the true instrumentation with perfect accuracy, it offers a scalable and ecologically valid insight onto the acoustic underpinnings of musical timbre.

Furthermore, supplementing user-generated content with the outcome of a crowdsourced annotation campaign allows an explicit verification of instrument tags. For instance, the Open-MIC dataset [31], maintained by the Community for Open and Sustainable Music Information Research (COSMIR) [32], comprises a vast corpus of 20k polyphonic music excerpts spanning 20 instruments as a derivative of the Free Music Archive (FMA) dataset [33]. Another example is the Medley-solos-DB dataset [34], which comprises 21k monophonic excerpts from eight instruments as a derivative of the MedleyDB dataset of multitrack music [35].

Over the past decade, the availability of large digital audio collections, together with the democratization of high-performance computing on dedicated hardware, has spurred the development of deep learning architectures in music instrument recognition [36–38]. Notwithstanding the growing accuracy of these architectures in the large-scale data regime, it remains unclear how to extend them from musical instrument recognition to playing technique recognition, where labeled samples are considerably more scarce [39]. We refer to [40] for a recent review of the state of the art in this domain.

2.2 Spectrotemporal receptive fields (STRF) in music cognition

On the other hand, the field of music psychology investigates timbre with the aim of discovering its physiological and behavioral foundations [41]. In this setting, prior knowledge of instrumentation, however accurate, does not suffice to conduct a study on timbre perception: rather, the timbre perception relies on an interplay of acoustic and categorical information [42]. Yet, collecting subjective responses to acoustic stimuli is a tedious and unscalable procedure, which restricts the size of the musical corpus under study. These small corpus sizes hamper the applicability of optimization algorithms for representation learning, such as stochastic gradient descent in deep neural networks.

While training artificial neurons is prone to statistical overfitting, advanced methods in electrophysiology make it possible to observe the firing patterns of biological neurons in the presence of controlled stimuli. This observation, originally carried out on the ferret, has led to a comprehensive mapping of the primary auditory cortex in terms of its spectrotemporal receptive fields (STRFs) [43]. The STRF of a neuron is a function of time and frequency which represents the optimal predictor of its post-stimulus time histogram during exposure to a diverse range of auditory stimuli [44]. The simplest method to compute it in practice is by reverse correlation, i.e., by averaging all stimuli that trigger an action potential [45]. Historically, STRFs were defined by their Wigner–Ville distribution [46], thereby sparing the choice of a tradeoff in time–frequency localization, but eliciting cross-term interferences [47]. Since then, the STRF of a neuron has been redefined as a spectrographic representation of its spike-triggered average [48].

Although this new definition is necessarily tied to a choice of spectrogram parameters, it yields more interpretable patterns than a Wigner–Ville distribution. In particular, a substantial portion of spectrographic STRFs exhibit a ripple-like response around a given region (t,λ) of the time–frequency domain [49]. This response can be approximately described by a pair of scalar values: a temporal modulation rate α in Hertz and a frequential modulation rate β in cycles per octave.

Interestingly, both α and β appear to be arranged in a geometric series and independent from the center time t and center frequency λ. This observation has led auditory neuroscientists to formulate an idealized computational model for STRF, known as the “full cortical model” [50], which densely covers the rate–scale domain (α,β) using geometric progressions. Because they do not require a data-driven training procedure, STRF yield a useful form of domain-specific knowledge for downstream machine listening applications, especially when the number of annotated samples is relatively small.

2.3 Spectrotemporal receptive fields (STRFs) as a feature extractor

Over recent years, several publications have employed the full cortical model as a feature extractor for a task of musical instrument classification, both in isolated recordings [18] and in solo phrases [51]. These biologically inspired features outperform the state of the art, especially in the small data regime where deep learning is inapplicable. Furthermore, the confusion matrix of the full cortical model in the label space of musical instruments is strongly correlated with the confusion matrix between a human listener and the ground truth. Another appeal of the full cortical model is that the three-way tensor of frequency λ, rate α, and scale β can be segmented into contiguous regions of maximal perceptual relevance for each instrument [52]. This is unlike fully end-to-end learning architectures, whose post hoc interpretability requires advanced techniques for feature inversion [53]. Lastly, beyond the realm of supervised classification, a previous publication [54] has shown that query-by-example search with STRFs can discriminate categories of environmental soundscapes, even after temporal integration and unsupervised dimensionality reduction.

The reasons above make STRFs an appealing feature extractor for a perceptual description of timbral similarity across instrumental playing techniques. Nonetheless, current implementations of STRF suffer from a lack of scalability, which explains why they have found few applications in MIR thus far. Indeed, the full cortical model is usually computed via two-dimensional Fourier transforms over adjacent time–frequency regions, followed by averaging around specific rates and scales. This approach requires a uniform discretization of the scalogram, and thus an oversampling of the lower-frequency subbands to the Nyquist frequency of the original signal. In contrast, joint time–frequency scattering offers a faster extraction of spectrotemporal modulations while preserving properties of differentiability [55] and invertibility [56]. Such acceleration is made possible by discretizing the wavelet transforms involved in time–frequency scattering according to a multirate scheme, both along the time and the log-frequency variables [13]. In this multirate scheme, every subband is discretized at its critical sample rate, i.e., in proportion to its center frequency. As a by-product, the multirate approach draws an explicit connection between scattering networks and deep convolutional networks, because both involve numerous convolutions with small kernels, pointwise rectifying nonlinearities, and pooling operations [57].

Moreover, a quantitative benchmark over Medley-solos-DB has demonstrated that joint time–frequency scattering, unlike purely temporal scattering, outperforms deep convolutional networks in supervised musical instrument classification, even in a relatively large data regime with 500 to 5k samples per class [13]. However, it remains to be seen whether joint time–frequency scattering is capable of fine-grained auditory categorization, involving variability in instrument, mute, and playing technique. In addition, previous publications on joint time–frequency scattering lack a human-centric evaluation, independently from any classification task. Beyond the case of STRF, we refer to [58] for a detailed review of the state of the art on audio descriptors of timbre.

3 Perceptual data collection

The philharmonic orchestra encompasses four families of instruments: strings, woodwinds, brass, and percussion. In this article, we focus on the first three, and leave the question of learning auditory similarities between percussion instruments to future research. We refer to [59] and [60] for reviews of the recent literature on the timbre modeling of percussive instruments, from the standpoints of MIR and music cognition, respectively.

3.1 Dataset

We consider a list of 16 instruments: violin (Vn), viola (Va), cello (Vc), contrabass (Cb), concert harp (Hp), Spanish guitar (Gtr), accordion (Acc), flute (Fl), soprano clarinet (BbCl), alto saxophone (ASax), oboe (Ob), bassoon (Bn), trumpet in C (TpC), French horn (Hn), tenor trombone (TTbn), and bass tuba (BBTb). Among this list, the first six are strings, the next six are woodwinds, and the last four are brass. Some of these instruments may be temporarily equipped with timbre-altering mutes, such as a rubber sordina on the bridge of a violin or an aluminum “wah-wah,” also known as harmon, inside the bell of a trumpet. Once augmented with mutes, the list of 16 instruments grows to 33. Furthermore, every instrument, whether equipped with a mute or not, affords a panel of playing techniques ranging in size between 11 (for the accordion) and 41 (for the bass tuba). In the rest of this paper, we abbreviate instrument–mute–technique by means of the acronym “IMT.” One example of IMT is TpC+S-ord, i.e., trumpet in C with a straight mute played in the ordinary technique. Another example of IMT is Vn-pont, i.e., violin without any mute played in the sul ponticello technique (bowing near the bridge).

Performers can play each IMT at various pitches according to the tessitura of their instrument. This tessitura may depend on the choice of playing technique but is independent of the choice of mute. Among the 16 instruments in this study, the two instruments with widest and narrowest tessituras, in their respective ordinary techniques, are the accordion (81 semitones) and the trumpet in C (32 semitones) respectively. Lastly, each IMT may be played at up to five intensity dynamics, ranging from quietest to loudest as pianissimo (pp), piano (p), mezzo forte (mf), forte (f), and fortissimo (ff). The resort to a non-ordinary playing technique may restrict both the tessitura and the dynamics range of the instrument–mute pair under consideration. For example, the pitch of pedal tones in brass instruments is tied to the fundamental mode of the bore, i.e., usually B or F. Likewise, the intensity of key clicks in the oboe is necessarily pp, while the intensity of snap pizzicato à la Bartók in plucked strings is necessarily ff.

In summary, audio signals from isolated musical notes may vary across three categorical variables (instrument, mute, and technique) and two quantitative variables (intensity and pitch). The Studio On Line dataset (SOL), recorded at Ircam in 1998, offers a joint sampling of these variables. The version of SOL that we use throughout this paper, named “0.9 HQ,” amounts to a total of 25444 audio signals. Beyond playing techniques, we should note that SOL erases other factors of acoustic variability, such as identity of performer, identity of instrument manufacturer, audio acquisition equipment, and room response characteristics, which are all restricted to singletons. Addressing these factors of variability is beyond the scope of this paper, which focuses on the influence of playing technique. Despite this restriction, the SOL dataset remains impractically large for collecting human similarity judgments. Our protocol addresses this problem by means of three complementary approaches: disentanglement of factors, expert pre-screening, and the use of an efficient annotation interface.

3.2 Disentanglement of factors

First, we purposefully disentangle categorical variables (IMTs) from continuous variables (pitch and intensity) in the SOL dataset. Indeed, under first approximation, the perception of timbre is invariant to pitch and intensity. Therefore, we select auditory stimuli according to a reference pitch and a reference intensity, in our case, middle C (C4) and mf. After this selection, every IMT triplet contains a single acoustic exemplar, regarded as canonical in the following. The number of canonical stimuli for the entire SOL dataset is equal to 235. We should note, however, that the proposed pitch and intensity cannot be strictly enforced across all IMTs. Indeed, as explained above, a fraction of IMTs can only be achieved at restricted values of pitch and intensity parameters, e.g., pedal tones or key clicks. Therefore, at a small cost of consistency, we only enforce the pitch–intensity reference (i.e., C4 and mf) when practically feasible, and fall back to other pitches and intensities if necessary.
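The selection rule above can be illustrated with a minimal Python sketch. The fallback criterion is not specified in this section; the sketch assumes, purely for illustration, that the sample closest to C4 in semitones is preferred, with ties broken by the dynamic level closest to mf. The function name `canonical_sample`, the MIDI encoding of pitch (C4 = 60), and the example values are hypothetical.

```python
# Dynamics ordered from quietest to loudest, as listed in Section 3.1.
DYNAMICS = ["pp", "p", "mf", "f", "ff"]


def canonical_sample(samples, ref_pitch=60, ref_dynamic="mf"):
    """Pick one (midi_pitch, dynamic) pair for an IMT.

    Assumed fallback rule (not taken from the source): minimize the
    distance to the reference pitch first, then to the reference dynamic.
    """
    ref_level = DYNAMICS.index(ref_dynamic)

    def distance(sample):
        pitch, dynamic = sample
        return (abs(pitch - ref_pitch), abs(DYNAMICS.index(dynamic) - ref_level))

    return min(samples, key=distance)


# An IMT available at the reference: the reference itself is selected.
ordinary = [(58, "mf"), (60, "pp"), (60, "mf"), (62, "ff")]
print(canonical_sample(ordinary))

# A hypothetical IMT restricted to low pitches: the nearest pitch is used.
restricted = [(29, "mf"), (34, "f")]
print(canonical_sample(restricted))
```

Under this assumed rule, an IMT that exists at C4 and mf always maps to that exemplar, so the fallback only affects restricted IMTs such as pedal tones or key clicks.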

Table 1 Full list of audio stimuli (1/2). In each instrument, a blank space in the rightmost column denotes the ordinary playing technique (ordinario)

3.3 Expert pre-screening

Secondly, we reduce the number of IMTs in our study by focusing on those which are deemed to be most relevant. Here, we define the relevance of an IMT as the possibility of imitating it by means of another IMT from a different instrument. One example of such imitation is the acoustic similarity between slap tonguing in reed instruments and a snap pizzicato in string instruments. To collect perceptual ratings of relevance, we recruited two professors in music composition at the Paris Conservatory (CNSMDP). Each of them inspected the entire corpus of 235 IMTs and annotated them in terms of relevance according to a Likert scale with seven ticks. In this Likert scale, the value 1 (least relevant) denotes that the IMT under consideration has a timbre that is idiosyncratic, and that therefore, it is unlikely that humans will pair it with other IMTs. Conversely, the value 7 (most relevant) denotes that the IMT under consideration bears a strong similarity with some other IMT from the corpus.

Once both experts completed their annotations, we retained all IMTs whose average score was equal to 3 or higher, thus resulting in a shortlist of N=78 IMTs (see Tables 1 and 2). It is worth noting that, according to both experts, the timbre of the accordion was judged too idiosyncratic to be relevant for this experiment, regardless of playing technique. Indeed, the accordion is the only instrument in the aforementioned list of instruments to have free reeds, keyboard-based actuation, or handheld airflow. Consequently, regardless of mute and technique, the set of instruments \(\mathcal {I}\) in our study contains 15 elements.
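As a minimal sketch of this shortlisting rule, the following Python snippet averages two sets of Likert ratings and keeps the IMTs whose mean reaches 3. The ratings shown are made-up placeholders, not the experts' actual annotations, and the function name `shortlist` and the label `Acc-ord` are hypothetical.

```python
def shortlist(ratings_a, ratings_b, threshold=3.0):
    """Keep the IMTs whose mean rating over two experts is >= threshold.

    ratings_a, ratings_b: dicts mapping an IMT name to a score in 1..7.
    """
    return [
        name
        for name in ratings_a
        if (ratings_a[name] + ratings_b[name]) / 2 >= threshold
    ]


# Placeholder ratings for illustration only.
expert_a = {"Vn-pont": 5, "TpC+S-ord": 4, "Acc-ord": 1}
expert_b = {"Vn-pont": 6, "TpC+S-ord": 2, "Acc-ord": 2}
print(shortlist(expert_a, expert_b))
```

With these placeholder values, the second IMT sits exactly on the threshold (mean of 3) and is retained, which matches the "equal to 3 or higher" wording of the rule.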

Table 2 Full list of audio stimuli (2/2). In each instrument, a blank space in the rightmost column denotes the ordinary playing technique (ordinario)

3.4 Efficient annotation interface

Thirdly, we design a graphical user interface for partitioning a corpus of short audio samples. The need for such an interface arises from the unscalability of Likert scales in the context of pairwise similarity judgments. Assuming that similarity is a symmetric quantity, collecting a dense matrix of continuously valued ratings of similarity among a dataset of N items would require \(\frac {1}{2}(N^{2}-N)\) Likert scales. In the case of N=78 IMTs, the task would amount to about 3k horizontal sliders, i.e., several hours of cumbersome work for the human annotator.
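The order of magnitude quoted above can be checked with a short Python computation. This is only an arithmetic illustration; the function name `n_pairwise_ratings` is hypothetical.

```python
from math import comb


def n_pairwise_ratings(n_items: int) -> int:
    """Number of unordered pairs among n_items, i.e. (N^2 - N) / 2."""
    return comb(n_items, 2)


n_ratings = n_pairwise_ratings(78)
print(n_ratings)  # 3003
```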

Engaging as many participants as possible in our study called for a more streamlined form of human–computer interaction, even if it sacrificed the availability of continuously valued ratings. To this end, we implemented a web application, named Cyberlioz, in which the user can spontaneously listen to sounds and arrange them into clusters of timbre similarity. The name Cyberlioz is a portmanteau of the prefix cyber- and the name of the French composer Hector Berlioz. The choice is by no means coincidental: Berlioz is famous for having, in his Treatise on Orchestration (1844), placed particular emphasis on the role of timbre as a parameter for musical expression.

Cyberlioz consists of a square panel on which is displayed a collection of circular gray dots, each of them corresponding to one of the IMTs, and initially distributed uniformly at random. Hovering the screen pointer onto each dot results in a playback of a representative audio sample of this IMT, i.e., C4 and mf in most cases. Furthermore, each dot can be freely placed on the screen by clicking, dragging, and dropping. Lastly, the user can assign a color to each dot among a palette of 20 hues. The goal of the Cyberlioz interface is to form clusters of timbre similarity between IMTs, expressed by sameness of color.

Cyberlioz implements a data collection procedure known as “free sorting.” In comparison with the direct collection of timbre dissimilarity ratings, free sorting is more efficient yet less accurate [61]. We refer to [62] for an example protocol in which timbre similarity judgments rely on stimuli pairs rather than on a free sorting task.

In comparison with web-based forms, Cyberlioz offers a more intuitive and playful user experience, while limiting the acquisition of similarity judgments to a moderate duration of 30 to 60 min for each participant. Another advantage of Cyberlioz is that it presents all stimuli at once rather than according to a randomized sequence.

In May and June 2016, we recruited volunteers to use Cyberlioz on their own computers, via a web browser, and equipped with a pair of earphones. The subjects were asked to “cluster sounds into groups by assigning the same color to the corresponding dots according to how similar the sounds are.”

We publicized this study on the internal mailing list of students at CNSMDP, as well as two international mailing lists for research in music audio processing: AUDITORY and ISMIR Community. Within 2 months, K=31 participants accessed Cyberlioz and completed the task.

We did not collect personal information on the age, sex, or musical background of participants, because the goal of our perceptual study is to build a consensus of similarity judgments, rather than to compare demographic subgroups.

In particular, we leave the important question of the effect of musical training on the perception of auditory similarities between playing techniques to future work.

3.5 Hypergraph partitioning

Once the data collection campaign was complete, we analyzed the color assignments of each subject k and converted them into a cluster graph \(\mathcal {G}_{k}\), where the integer k is an anonymized subject index, ranging between 1 and K. For a given k, the graph \(\mathcal {G}_{k}\) contains N vertices, each representing a different IMT in the corpus. In \(\mathcal {G}_{k}\), an edge connects any two vertices m and n if the corresponding dots in Cyberlioz have the same color. Otherwise, there is no edge connecting m and n. Thus, \(\mathcal {G}_{k}\) contains as many connected components as the number of similarity clusters for the subject k, i.e., the number of distinct colors on the Cyberlioz interface in the response of k.
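The conversion from color assignments to a cluster graph can be sketched as follows. This is a minimal illustration, assuming that each response is stored as a list of integer color indices (one per IMT); the function names and the toy response are hypothetical.

```python
import numpy as np


def cluster_graph(colors):
    """Boolean adjacency matrix of the cluster graph of one subject.

    An edge connects vertices m and n (m != n) if and only if the two
    corresponding dots were assigned the same color.
    """
    colors = np.asarray(colors)
    adjacency = colors[:, None] == colors[None, :]
    np.fill_diagonal(adjacency, False)  # no self-loops
    return adjacency


def n_clusters(colors):
    """Number of connected components, i.e. number of distinct colors."""
    return len(set(colors))


# Toy response with five stimuli and three colors.
response = [0, 1, 0, 2, 1]
print(cluster_graph(response).astype(int))
print(n_clusters(response))
```

Because every connected component is a clique, the adjacency matrix is block-diagonal up to a permutation of the vertices.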

For a particular subject k, let us denote by \(C_{k}\) the number of clusters in the graph \(\mathcal {G}_{k}\). Figure 2a shows the histogram of \(C_{k}\) across the cohort of K=31 participants. We observe that the number of clusters varies between 3 and 19 with a median value of 10. Accordingly, the number of samples belonging to a cluster varies between 1 (the most frequent value) and 50, as shown in Fig. 2b.

Fig. 2 Inter-subject variability. Histogram of the number of clusters (a) and the size of the clusters (b) defined by the 31 subjects. See Section 3 for details

We aggregate the similarity judgments from all K participants by embedding them into a hypergraph \(\mathcal {H}\), that is, a graph whose edges may connect three or more vertices at once. Specifically, \(\mathcal {H}\) contains N vertices, each representing an IMT, and each “hyperedge” in \(\mathcal {H}\) corresponds to some connected component in one of the graphs \(\mathcal {G}_{1}, \ldots, \mathcal {G}_{K}\). Then, we convert the hypergraph \(\mathcal {H}\) back into a conventional graph \(\mathcal {G}_{0}\) by means of a combinatorial optimization algorithm known as hypergraph partitioning [63].

To construct \(\mathcal {G}_{0}\), we select a number of clusters that is equal to the maximal value of the \(C_{k}\)’s, that is, \(C_{0}=19\). Then, we run hypergraph partitioning on \(\mathcal {H}\) to assign each vertex i to one of the \(C_{0}\) clusters in \(\mathcal {G}_{0}\). Intuitively, hypergraph partitioning optimizes a tradeoff between two objectives: first, balancing the size of all clusters in terms of their respective numbers of vertices, and secondly, keeping most hyperedges enclosed within as few distinct clusters as possible [64, 65].
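The aggregation step can be sketched as follows, assuming that each participant's response is stored as a list of color indices. The sketch builds the hyperedges of the hypergraph and evaluates a candidate partition with the connectivity cost (number of clusters spanned by a hyperedge, minus one, summed over hyperedges). This cost is a common hypergraph partitioning objective; whether it matches the exact criterion of [63] is an assumption. The sketch only scores a given partition and does not perform the optimization itself. All function names and toy values are hypothetical.

```python
from collections import Counter, defaultdict


def hyperedges_from_subjects(assignments):
    """One hyperedge per cluster and per subject.

    assignments: list of K responses, each a list of N color indices.
    Returns a list of frozensets of vertex indices.
    """
    edges = []
    for colors in assignments:
        groups = defaultdict(set)
        for vertex, color in enumerate(colors):
            groups[color].add(vertex)
        edges.extend(frozenset(group) for group in groups.values())
    return edges


def connectivity_cost(edges, partition):
    """Sum over hyperedges of (number of clusters spanned - 1)."""
    return sum(len({partition[v] for v in edge}) - 1 for edge in edges)


def cluster_sizes(partition):
    """Number of vertices per cluster, used to assess balance."""
    return Counter(partition)


# Two toy subjects annotating four stimuli.
subjects = [[0, 0, 1, 1], [0, 1, 1, 1]]
edges = hyperedges_from_subjects(subjects)

# Compare two balanced candidate partitions into two clusters.
for candidate in ([0, 0, 1, 1], [0, 1, 0, 1]):
    print(candidate, connectivity_cost(edges, candidate), dict(cluster_sizes(candidate)))
```

In this toy case, both candidates are equally balanced, so the one with the lower connectivity cost better preserves the clusters drawn by the two subjects.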

While the graphs \(\mathcal {G}_{1}, \ldots, \mathcal {G}_{K}\) encode the subjective similarity judgments of participants 1 to K, the graph \(\mathcal {G}_{0}\) represents a form of consensual judgment that is shared across all participants while discarding intersubjective variability. Although the rest of our paper focuses on the consensus \(\mathcal {G}_{0}\), it is worth pointing out that the same technical framework could apply to a single subject k, or to a subgroup of the K=31 participants. This remark emphasizes the potential of our similarity learning method as a customizable tool for visualizing and extrapolating the timbre similarity space of a new subject.

4 Machine listening methods

The previous section described our protocol for collecting timbral similarity judgments between instrumental playing techniques. In this section, we aim to recover these similarity judgments from digital audio recordings according to a paradigm of supervised metric learning. To this end, we present a machine listening system that combines joint time–frequency scattering and LMNN.

4.1 Joint time–frequency scattering transform

Let \(\boldsymbol {\psi } \in \mathbf {L}^{2}(\mathbb {R}, \mathbb {C})\) be a complex-valued filter with zero average, dimensionless center frequency equal to one, and an equivalent rectangular bandwidth (ERB) equal to 1/Q. We define a constant-Q wavelet filterbank as the family \(\boldsymbol{\psi}_{\lambda} : t \longmapsto \lambda \boldsymbol{\psi}(\lambda t)\). Each wavelet \(\boldsymbol{\psi}_{\lambda}\) has a center frequency of λ, an ERB of λ/Q, and an effective receptive field of (2πQ/λ) in the time domain. In practice, we define \(\boldsymbol{\psi}\) as a Morlet wavelet:

$$ \boldsymbol{\psi}:t \longmapsto \exp\left(-\frac{t^{2}}{2\sigma_{\psi}^{2}}\right) \left(\exp\left(2\pi \mathrm{i}t\right) - \kappa_{\psi} \right), $$

where the Gaussian width \(\sigma_{\psi}\) grows in proportion with the quality factor Q and the corrective term \(\kappa_{\psi}\) ensures that \(\boldsymbol{\psi}\) has a zero average. Moreover, we discretize the frequency variable λ according to a geometric progression of common ratio \(2^{\frac {1}{Q}}\). Thus, the base-two logarithm of center frequency, denoted by \(\log_{2} \lambda\), follows an arithmetic progression. We set the constant quality factor of the wavelet filterbank \((\boldsymbol{\psi}_{\lambda})_{\lambda}\) to Q=12, thus matching twelve-tone equal temperament in music.
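The following NumPy sketch illustrates the Morlet wavelet and the geometric progression of center frequencies described above. It is a minimal illustration rather than a reference implementation: the proportionality constant between the Gaussian width and Q (`sigma_over_q`) is an assumption of this sketch, the corrective term is estimated numerically on the sampling grid, and all function names are hypothetical.

```python
import numpy as np


def center_frequencies(lambda_max, n_octaves, q=12):
    """Geometric progression of common ratio 2**(1/q), from high to low."""
    j = np.arange(n_octaves * q)
    return lambda_max * 2.0 ** (-j / q)


def morlet(t, q=12, sigma_over_q=0.5 / np.sqrt(np.pi)):
    """Mother wavelet psi with unit center frequency, sampled at times t.

    sigma_over_q is an assumed constant (not taken from the source).
    kappa is computed on the grid so that the samples sum to zero.
    """
    sigma = sigma_over_q * q
    envelope = np.exp(-t**2 / (2 * sigma**2))
    carrier = np.exp(2j * np.pi * t)
    kappa = np.sum(envelope * carrier) / np.sum(envelope)
    return envelope * (carrier - kappa)


def dilated_morlet(t, lam, q=12):
    """psi_lambda(t) = lam * psi(lam * t), with center frequency lam."""
    return lam * morlet(lam * t, q=q)


freqs = center_frequencies(lambda_max=1760.0, n_octaves=3)
print(len(freqs), freqs[0] / freqs[12])

t = np.linspace(-30.0, 30.0, 6001)
print(abs(np.sum(morlet(t))))
```

With Q=12, moving by twelve steps in the progression halves the center frequency, which is the sense in which the filterbank matches twelve-tone equal temperament.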

Convolving the wavelets in this filterbank with an input waveform \(\boldsymbol {x}\in \mathbf {L}^{2}(\mathbb {R})\), followed by an application of the pointwise complex modulus yields the wavelet scalogram

$$ {}\mathbf{U_{1}}\boldsymbol{x}(t,\lambda) = \left\vert \boldsymbol{x} \ast \boldsymbol{\psi}_{\lambda} \right\vert(t) = \left\vert \int_{\mathbb{R}} \boldsymbol{x}\left(t - t^{\prime}\right) \, \boldsymbol{\psi}_{\lambda} \left(t^{\prime}\right) \; \mathrm{d}{t^{\prime}} \right\vert, $$

which is discretized similarly to the constant-Q transform of [66]. Then, we define a two-dimensional Morlet wavelet Ψ of the form

$$ {}\Psi : (t, u) \longmapsto \exp\left(-\frac{t^{2}+u^{2}}{2\sigma_{\Psi}^{2}}\right) \left(\exp\left(2\pi \mathrm{i} (t + u)\right) - \kappa_{\Psi} \right), $$

taking two real variables t and u as input. The former is the time variable while the latter is the base-two logarithm of frequency: u= log2λ. Note that u roughly corresponds to the human perception of relative pitch [67]. In the rest of this paper, we shall refer to Ψ as a time–frequency wavelet.
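Before moving on to the two-dimensional filterbank, the scalogram of Eq. 2 and its log-frequency axis can be made concrete with the sketch below. It is only an approximation under stated assumptions: each band-pass filter is a Gaussian bump in the Fourier domain (which resembles a Morlet wavelet when Q is large), the bandwidth constant λ/Q is taken at face value, and the function name `scalogram` is hypothetical.

```python
import numpy as np

def scalogram(x, sr, lambdas, Q=12):
    """Modulus of band-pass filtered versions of x, one row per center frequency (Hz)."""
    n = len(x)
    freqs = np.fft.fftfreq(n, d=1.0 / sr)
    x_hat = np.fft.fft(x)
    U1 = np.empty((len(lambdas), n))
    for i, lam in enumerate(lambdas):
        # Gaussian bump centered on lam with a width of lam / Q (assumed constant).
        # Negative frequencies receive a negligible weight, so the filter is analytic.
        psi_hat = np.exp(-0.5 * ((freqs - lam) / (lam / Q)) ** 2)
        U1[i] = np.abs(np.fft.ifft(x_hat * psi_hat))
    return U1

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)                 # one second of a 440 Hz tone
lambdas = 1760.0 * 2.0 ** (-np.arange(36) / 12)   # three octaves, rows equally spaced in log2(lambda)
U1 = scalogram(x, sr, lambdas)
peak = lambdas[np.argmax(U1.mean(axis=1))]        # center frequency with the most energy
```

Because the rows are equally spaced in u= log2λ, transposing the input by one semitone shifts the pattern in U1 by one row.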

We choose the Gaussian width σΨ in Eq. 3 such that the quality factor of the Morlet wavelet Ψ is equal to one, both over the time dimension and over the log-frequency dimension. Furthermore, the corrective term κΨ ensures that Ψ has a zero average over \(\mathbb {R}^{2}\), similarly to Eq. 1. From Ψ, we define a two-dimensional wavelet filterbank of the form:

$$ \mathbf{\Psi}_{\alpha,\beta} : (t, u) \longmapsto \alpha \, \beta \, \mathbf{\Psi}(\alpha t, \beta u). $$

In the equation above, α is a temporal modulation rate and β is a frequential modulation scale, following the terminology of spectrotemporal receptive fields (STRF, see Section 2). While α is measured in Hertz and is strictly positive, β is measured in cycles per octave and may take positive as well as negative values. Both α and β are discretized by geometric progressions of common ratio equal to two. Furthermore, the edge case β=0 corresponds to Ψα,β being a Gaussian low-pass filter over the log-frequency dimension, while remaining a Morlet band-pass filter of center frequency α over the time dimension. We denote this low-pass filter by ϕF and set its width to F=2 octaves.
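The discretization of rates and scales described above can be enumerated as follows. This is a sketch only: the lower and upper bounds of each progression are hypothetical values chosen for the example, and the helper `dyadic_grid` is not part of any library.

```python
import numpy as np

def dyadic_grid(lo, hi):
    """Geometric progression of common ratio two: lo, 2*lo, ..., not exceeding hi."""
    n = int(np.floor(np.log2(hi / lo))) + 1
    return lo * 2.0 ** np.arange(n)

alphas = dyadic_grid(1.0, 64.0)      # rates in Hz, strictly positive (bounds assumed)
betas_pos = dyadic_grid(0.5, 4.0)    # scales in cycles per octave (bounds assumed)
# Scales take both signs; beta = 0 stands for the low-pass filter phi_F.
betas = np.concatenate([-betas_pos[::-1], [0.0], betas_pos])
paths = [(a, b) for a in alphas for b in betas]
```

Keeping both signs of β distinguishes upward from downward spectrotemporal patterns, whereas the sign of α is redundant for a real-valued input.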

We now convolve the scalogram U1 with time–frequency wavelets Ψα,β and apply the complex modulus, yielding the four-way tensor

$$ \begin{aligned} \mathbf{U_{2}}\boldsymbol{x} \left(t, \lambda, \alpha, \beta \right) =&\left \vert \mathbf{U_{1}}\boldsymbol{x} \circledast \boldsymbol{\Psi}_{\alpha,\beta} \right \vert (t, \lambda)\\ =&\left\vert \iint_{\mathbb{R}^{2}} \mathbf{U_{1}}\boldsymbol{x}\left(t - t^{\prime}, \log \lambda - u^{\prime}\right)\, \boldsymbol{\Psi}_{\alpha,\beta}\left(t^{\prime}, u^{\prime}\right)\; \mathrm{d}t^{\prime} \, \mathrm{d}u^{\prime} \right\vert, \end{aligned} $$

where the circled asterisk operator \(\circledast \) denotes a joint convolution over time and log-frequency. In the equation above, the sample rate of time t is proportional to α. Conversely, the sample rate of log-frequency u= log2λ is proportional to |β| if β≠0 and proportional to F−1 otherwise.

Let ϕT be a Gaussian low-pass filter. We define the joint time–frequency scattering coefficients of the signal x as the four-way tensor

$$ \begin{aligned} \mathbf{S_{2}}\boldsymbol{x}\left(t, \lambda, \alpha, \beta \right) =&\mathbf{U_{2}}\boldsymbol{x} \circledast \left(\boldsymbol{\phi}_{T} \otimes \boldsymbol{\phi}_{F}\right) (t, \lambda) \\ =&\iint_{\mathbb{R}^{2}} \mathbf{U_{2}}\boldsymbol{x}\left(t - t^{\prime}, \log \lambda - u^{\prime}, \alpha, \beta\right) \,\boldsymbol{\phi}_{T} \left(t^{\prime}\right) \,\boldsymbol{\phi}_{F} \left(u^{\prime}\right) \;\mathrm{d}t^{\prime} \, \mathrm{d}u^{\prime}, \end{aligned} $$

where the symbol \(\otimes \) denotes the outer product over time and log-frequency. In the equation above, the sample rate of time t is proportional to T−1 and the sample rate of log-frequency u= log2λ is proportional to F−1. Furthermore, the rate α spans a geometric progression ranging from T−1 to λ/Q. In the following, we set the time constant to T=1000 ms unless specified otherwise.

The tensor S2 bears a strong resemblance to the idealized response of an STRF at the rate α and the scale β. Nevertheless, in comparison with the “full cortical model” [18], joint time–frequency scattering enjoys a thirtyfold reduction in dimensionality while covering a time span that is four times larger (1000 ms) and an acoustic bandwidth that is also four times larger (0–16 kHz). This is due to the multirate discretization scheme applied throughout the application of wavelet convolutions and pointwise modulus nonlinearities.

In addition to second-order scattering coefficients (Eq. 6), we compute joint time–frequency scattering at the first order by convolving the scalogram U1x (Eq. 2) with the low-pass filter ϕT over the time dimension and with wavelets ψβ (β≥0) over the log-frequency dimension, and by applying the complex modulus:

$$ \begin{aligned} \mathbf{S_{1}}\boldsymbol{x}\left(t, \lambda, \alpha=0, \beta\right) &= \left\vert \mathbf{U_{1}}\boldsymbol{x} \circledast \left(\boldsymbol{\phi}_{T} \otimes \boldsymbol{\psi}_{\beta} \right) \right\vert (t, \lambda) \\ &=\left \vert \iint_{\mathbb{R}^{2}} \mathbf{U_{1}}\boldsymbol{x}\left(t - t^{\prime}, \log \lambda - u^{\prime}\right)\ \,\boldsymbol{\phi}_{T} \left(t^{\prime}\right) \,\boldsymbol{\psi}_{\beta} \left(u^{\prime}\right) \;\mathrm{d}t^{\prime} \, \mathrm{d}u^{\prime} \right \vert, \end{aligned} $$

where the time constant T is the same as in Eq. 6, i.e., T=1000 ms by default.

Over the time variable, we set the modulation rate α of S1 to zero in the equation above. Conversely, over the log-frequency variable, the edge case β=0 corresponds to replacing the wavelet ψβ by the low-pass filter ϕF. We refer to [13] for more details on the implementation of joint time–frequency scattering.

We adopt the multi-index notation p=(λ,α,β) as a shorthand for the tuple of frequency, rate, and scale. The tuple p is known as a scattering path (see [68]), and may apply to index both first-order (S1) and second-order (S2) coefficients. Given an input waveform x, we denote by Sx the feature vector resulting from the concatenation of S1x and S2x:

$$ \mathbf{S}\boldsymbol{x}\left(t, p=(\lambda, \alpha, \beta)\right) =\left\{ \begin{array}{ll} \mathbf{S_{1}}\boldsymbol{x}(t,\lambda,\alpha,\beta) & \textrm{if }\alpha=0,\\ \mathbf{S_{2}}\boldsymbol{x}(t,\lambda,\alpha,\beta) & \textrm{otherwise.} \end{array}\right. $$

4.2 Median-based logarithmic compression and affine standardization

Now, we apply a pointwise nonlinear transformation on averaged joint time-frequency scattering coefficients. The role of this transformation, which is adapted to the dataset in an unsupervised way, is to Gaussianize the histogram of amplitudes of each scattering path p. We consider a collection \(\mathcal {X}\) of N waveforms x1,…,xN. For every path p in the joint time–frequency scattering transform operator S, we average the response of each scattering coefficient Sxn over time and take its median value across all samples n from 1 to N:

$$ \boldsymbol{\mu}(p) =\underset{1\leq n \leq N}{\mathrm{median}} \int_{\mathbb{R}} \mathbf{S}\boldsymbol{x}_{n}(t, p) \;\mathrm{d}t. $$

If the collection is split between a training set and a test set (see Section 5), we compute μ on the training set only. Then, to match a decibel-like perception of loudness, we apply the following adaptive transformation, which composes a median-based renormalization and a logarithmic compression:

$$ \mathbf{\widetilde{S}}\boldsymbol{x}_{n} (p) = \log \left(1 + \frac{ \int_{\mathbb{R}} \mathbf{S}\boldsymbol{x}_{n} (t, p) \;\mathrm{d}t }{\varepsilon\boldsymbol{\mu}(p)} \right) $$

where ε is a predefined constant. The offset of one before the application of the pointwise logarithm ensures that the transformation is nonexpansive in the sense of Lipschitzian maps: there exists a constant c such that

$$ \left \Vert \mathbf{\widetilde{S}}\boldsymbol{x}_{m} - \mathbf{\widetilde{S}}\boldsymbol{x}_{n} \right \Vert \leq c \left \Vert \mathbf{S}\boldsymbol{x}_{m} - \mathbf{S}\boldsymbol{x}_{n} \right \Vert $$

for every pair of samples (xm,xn). On a dataset of environmental audio recordings, a previous publication has shown empirically that Eq. 10 brings the histogram of \(\mathbf {\widetilde {S}}\boldsymbol {x}_{n} (p)\) closer to a Gaussian distribution [14]. Since then, this finding has also been confirmed in the case of musical sounds [4].

Lastly, we standardize every feature \(\mathbf {\widetilde {S}}\boldsymbol {x}_{n}\) to null mean and unit variance, across the dataset \(\mathcal {X} = \left \{\boldsymbol {x}_{1} \ldots \boldsymbol {x}_{N}\right \}\), independently for each scattering path p. Again, if \(\mathcal {X}\) is split between training and test sets, we measure means and variances over the training set only and propagate them as constants to the test set. With a slight abuse of notation, we still denote by \(\mathbf {\widetilde {S}}\boldsymbol {x}_{n} (p)\) the standardized log-scattering features at path p for sample n, even though its value differs from Eq. 10 by an affine transformation.
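The two steps of this subsection (median-based logarithmic compression, then affine standardization) can be sketched as a pair of functions that fit statistics on the training set and apply them unchanged to the test set. The sketch assumes that scattering coefficients have already been averaged over time into a matrix with one row per sample and one column per path; the value of ε and the function names are hypothetical.

```python
import numpy as np

def fit_compression(S_train, eps=1e-3):
    """S_train: (n_samples, n_paths) nonnegative, time-averaged scattering coefficients."""
    mu = np.median(S_train, axis=0)               # per-path median over training samples
    logS = np.log1p(S_train / (eps * mu))         # log(1 + S / (eps * mu))
    return {"eps": eps, "mu": mu, "mean": logS.mean(axis=0), "std": logS.std(axis=0)}

def apply_compression(S, stats):
    """Compress and standardize S with statistics measured on the training set only."""
    logS = np.log1p(S / (stats["eps"] * stats["mu"]))
    return (logS - stats["mean"]) / stats["std"]

rng = np.random.default_rng(0)
S_train = rng.lognormal(size=(200, 5))            # toy data, not real scattering features
S_test = rng.lognormal(size=(50, 5))
stats = fit_compression(S_train)
Z_train = apply_compression(S_train, stats)
Z_test = apply_compression(S_test, stats)
```

By construction, each column of `Z_train` has zero mean and unit variance, whereas the columns of `Z_test` only do so approximately.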

4.3 Metric learning with large-margin nearest neighbors (LMNN)

Let x be some arbitrary audio sample in the dataset \(\mathcal {X}\). Let \(\mathcal {G}\) be a cluster graph with \(N = \text {card} ~\mathcal {X}\) vertices and C clusters in total. We denote by \(\mathcal {G}(\boldsymbol {x})\) the cluster to which the sample x belongs. Given another sample y in \(\mathcal {X}\), y is similar to x if and only if y belongs to the cluster \(\mathcal {G}(\boldsymbol {x})\). Because \(\mathcal {G}\) is a disjoint union of complete graphs, this relation is symmetric: \(\boldsymbol {x} \in \mathcal {G}(\boldsymbol {y})\) is equivalent to \(\boldsymbol {y} \in \mathcal {G}(\boldsymbol {x})\).

In our protocol, x contains the sound of an isolated musical note and the cluster graph \(\mathcal {G}\) encodes auditory similarities within the dataset \(\mathcal {X}\). Moreover, we take \(\mathcal {G}\) to be equal to the “consensus” cluster graph \(\mathcal {G}_{0}\), i.e., arising from the partition of a hypergraph \(\mathcal {H}\) which contains the judgments of all K participants from our perceptual study (see Section 3).

We denote by \(\mathbf {\widetilde {S}}\boldsymbol {x}\) the feature vector of joint time–frequency scattering resulting from x. This vector includes both first-order and second-order scattering coefficients, after median-based logarithmic compression and affine standardization (see subsections above). Furthermore, we denote by \(\mathcal {Y}_{R} (\boldsymbol {x})\) the list of R nearest neighbors to \(\mathbf {\widetilde {S}}\boldsymbol {x}\) in the feature space of joint time–frequency scattering coefficients according to the Euclidean metric. Unlike cluster similarity, this relationship is not symmetric: \(\boldsymbol {y}\in \mathcal {Y}_{R} (\boldsymbol {x})\) does not necessarily imply \(\boldsymbol {x}\in \mathcal {Y}_{R} (\boldsymbol {y})\). Note that the dependency of \(\mathcal {Y}_{R} (\boldsymbol {x})\) upon the operator S is left implicit. In all of the following, we set the constant R to 5; this is in accordance with our chosen evaluation metric, average precision at rank 5 (AP@5, see Section 5).

Let P be the number of scattering paths in the operator S. The LMNN algorithm learns a matrix L with P rows and P columns by minimizing an error function of the form:

$$ \mathcal{E}(\mathbf{L}) = \frac{1}{2} \mathcal{E}_{\text{pull}} (\mathbf{L}) + \frac{1}{2} \mathcal{E}_{\text{push}} (\mathbf{L}) $$

where, intuitively, \(\mathcal {E}_{\text {pull}}\) tends to shrink local Euclidean neighborhoods in feature space while \(\mathcal {E}_{\text {push}}\) tends to penalize small distances between samples that belong to different clusters in \(\mathcal {G}\).

The definition of \(\mathcal {E}_{\text {pull}}\) is:

$$ \mathcal{E}_{\text{pull}} (\mathbf{L}) = \sum_{\boldsymbol{x}\in\mathcal{X}} \sum_{\boldsymbol{y}\in\mathcal{Y}_{R} (\boldsymbol{x})} \left\Vert \mathbf{L}\mathbf{\widetilde{S}}\boldsymbol{x} - \mathbf{L}\mathbf{\widetilde{S}}\boldsymbol{y} \right\Vert^{2}. $$

Note that the error term \(\mathcal {E}_{\text {pull}}\) is unsupervised, in the sense that it does not depend on the cluster assignments of x and y in \(\mathcal {G}\).

While the term \(\mathcal {E}_{\text {pull}}\) operates on pairs of samples, the term \(\mathcal {E}_{\text {push}}\) operates on triplets \((\boldsymbol {x}, \boldsymbol {y}, \boldsymbol {z})\in \mathcal {X}^{3}\). The first sample, x, is known as an “anchor.” The second sample, y, is known as a “positive” and is assumed to belong to the Euclidean neighborhood of the anchor: \(\boldsymbol {y} \in \mathcal {Y}_{R} (\boldsymbol {x})\). The third sample, z, is known as a “negative” and is assumed to belong to a different similarity cluster than the anchor: \(\boldsymbol {z}\not \in \mathcal {G}(\boldsymbol {x})\). The term \(\mathcal {E}_{\text {push}}\) penalizes L unless the positive-to-anchor distance is smaller than the negative-to-anchor distance by a margin of at least 1.

The definition of \(\mathcal {E}_{\text {push}}\) is:

$$ \begin{aligned} \mathcal{E}_{\text{push}} (\mathbf{L}) = \sum_{\boldsymbol{x}\in\mathcal{X}} \sum_{\boldsymbol{y}\in\mathcal{Y}_{R} (\boldsymbol{x})} \sum_{\boldsymbol{z}\not\in\mathcal{G}(\boldsymbol{x})} \rho\left(1 +\Vert\mathbf{L}\mathbf{\widetilde{S}}\boldsymbol{x} - \mathbf{L}\mathbf{\widetilde{S}}\boldsymbol{y} \Vert^{2}-\Vert\mathbf{L}\mathbf{\widetilde{S}}\boldsymbol{x} - \mathbf{L}\mathbf{\widetilde{S}}\boldsymbol{z} \Vert^{2} \right), \end{aligned} $$

where the function \(\rho : u \mapsto \max (0,u)\) denotes the activation function of the rectified linear unit (ReLU), also known as hinge loss. The cost function described in the equation above is known in deep learning as “triplet loss” and has recently been applied to train large-vocabulary audio classifiers in an unsupervised way [69]. We refer to [70] for a review of the state of the art in metric learning.
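To make the two error terms concrete, the sketch below evaluates the objective for a given matrix L on a toy dataset. It follows the definitions above literally (positives are the R nearest Euclidean neighbors in feature space, negatives are all samples outside the anchor's cluster) and does not perform any optimization; the function name `lmnn_loss` and the toy data are assumptions of the example.

```python
import numpy as np

def lmnn_loss(L, X, labels, R=5):
    """Evaluate 0.5 * E_pull + 0.5 * E_push for a linear map L.

    X: (n_samples, n_paths) standardized features; labels: cluster index per sample.
    """
    # Neighborhoods Y_R(x): R nearest Euclidean neighbors in the input feature space.
    d_in = np.sum((X[:, None] - X[None]) ** 2, axis=-1)
    np.fill_diagonal(d_in, np.inf)                # a sample is not its own neighbor
    neighbors = np.argsort(d_in, axis=1)[:, :R]
    # Squared distances after applying L.
    Z = X @ L.T
    d = np.sum((Z[:, None] - Z[None]) ** 2, axis=-1)
    pull, push = 0.0, 0.0
    for i in range(len(X)):
        negatives = labels != labels[i]           # samples outside the anchor's cluster
        for j in neighbors[i]:
            pull += d[i, j]
            push += np.maximum(0.0, 1.0 + d[i, j] - d[i, negatives]).sum()
    return 0.5 * pull + 0.5 * push

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
labels = np.array([0, 0, 0, 1, 1, 1])
loss_identity = lmnn_loss(np.eye(3), X, labels, R=2)
loss_zero = lmnn_loss(np.zeros((3, 3)), X, labels, R=2)
```

The degenerate map L=0 cancels the pull term but pays the full margin on every triplet: with 6 anchors, 2 positives each, and 3 negatives each, the push term equals 36 and the loss equals 18. This shows why the margin prevents the metric from collapsing.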

4.4 Extension to diverse pitches and dynamics

In order to suit the practical needs of contemporary music composers, computer-assisted orchestration must draw from a diverse realm of instruments and techniques. Therefore, whereas our data collection procedure for timbre similarity judgments focused on a single pitch (middle C) and a single intensity level (mf), we formulate our machine listening experiment on an expanded dataset of audio samples, containing variations in pitch and dynamics.

Given an audio stimulus xn from our perceptual study, we seek its position in the cluster graph \(\mathcal {G}_{0}\). Then, we identify its IMT triplet, scrape for audio samples in SOL matching this triplet, and assign them all to the same cluster as the original audio stimulus. We repeat the same procedure for all N=78 nodes in \(\mathcal {G}_{0}\), resulting in N′=9346 samples in total. Thus, from a limited amount of human annotation, we curate a relatively large subset of SOL, amounting to about one third of the entire dataset (9346 out of 25444 samples).
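The propagation of cluster labels from annotated stimuli to the rest of the sample library amounts to a lookup keyed by the IMT triplet. The sketch below illustrates this with plain dictionaries; the field names, identifiers, and records are hypothetical and do not reflect the actual metadata format of SOL.

```python
def propagate_clusters(stimuli, cluster_of, library):
    """Assign each library sample to the cluster of the stimulus sharing its IMT triplet."""
    triplet_to_cluster = {
        (s["instrument"], s["mute"], s["technique"]): cluster_of[s["id"]]
        for s in stimuli
    }
    labeled = []
    for sample in library:
        key = (sample["instrument"], sample["mute"], sample["technique"])
        if key in triplet_to_cluster:             # samples without a matching stimulus are skipped
            labeled.append((sample["id"], triplet_to_cluster[key]))
    return labeled

# Toy records (hypothetical identifiers and field names).
stimuli = [
    {"id": "stim-1", "instrument": "Violin", "mute": None, "technique": "tremolo"},
    {"id": "stim-2", "instrument": "Flute", "mute": None, "technique": "key-click"},
]
cluster_of = {"stim-1": 0, "stim-2": 1}
library = [
    {"id": "vn-trem-G3-pp", "instrument": "Violin", "mute": None, "technique": "tremolo"},
    {"id": "fl-key-A4-ff", "instrument": "Flute", "mute": None, "technique": "key-click"},
    {"id": "hp-ord-C2-mf", "instrument": "Harp", "mute": None, "technique": "ordinario"},
]
labeled = propagate_clusters(stimuli, cluster_of, library)
```

Pitch and dynamics do not appear in the lookup key, which is how the sketch encodes the invariance assumption discussed next.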

In doing so, we assume that the perception of timbre is fully invariant to frequency transposition as well as changes in dynamics. This assumption coincides with the commonplace definition of timbre as an autonomous parameter of musical sound. Previous studies on real-world musical sounds have confirmed that listeners are able to ignore salient pitch differences while rating timbre similarity [71, 72], insofar as these differences do not exceed one octave. Since then, another study has shown that trained musicians can identify the similarity within pairs of notes from two instruments (horn and bassoon) spanning a range of 2.5 octaves with an accuracy of 80% [73]. In comparison, non-musicians have great difficulty identifying whether two notes were produced by the same instrument or not when the notes are separated by one octave. Thus, the influence of pitch on timbre perception appears to depend upon musical training. With this important caveat in mind, we leave as future work the question of disentangling the effects of pitch and musical training in the modeling of auditory similarities between instrumental playing techniques.

4.5 Evaluation metric

Let us denote by \(\boldsymbol {x}_{1}, \ldots, \boldsymbol {x}_{{N}^{\prime }}\) the N′ audio samples associated with our annotated dataset. Given a sample n and a human subject k, we denote by \(\mathcal {G}_{k}\) the cluster graph associated with the subject k, and by \(\mathcal {G}_{k} (n)\) the cluster to which the sample xn belongs. Our machine listening system takes the waveform xn as input and returns a ranked list of nearest neighbors: Φ1(xn),Φ2(xn),Φ3(xn), and so forth.

In the context of browsing an audio collection by timbre similarity, xn is a user-defined query while the function Φ plays the role of a search engine. We consider the first retrieved sample, Φ1(xn), to be relevant to user k if and only if it belongs to the same cluster as xn in the cluster graph \(\mathcal {G}_{k}\), hence the Boolean condition \(\mathbf {\Phi }_{1}(\boldsymbol {x}_{n}) \in \mathcal {G}_{k} (n)\). Likewise, the second retrieved sample is relevant if and only if \(\mathbf {\Phi }_{2}(\boldsymbol {x}_{n}) \in \mathcal {G}_{k} (n)\). To evaluate Φ on the query xn, we measure the relevance of all nearest neighbors Φr(xn) up to some fixed rank R and average the result:

$$ p_{\mathbf{\Phi}}(n, k, R) = \frac{1}{R} \sum_{r=1}^{R} {\mathbbm{1}} \left(\mathbf{\Phi}_{r}(\boldsymbol{x}_{n}) \in \mathcal{G}_{k} (n) \right). $$

In the equation above, the indicator function 𝟙 converts Booleans to integers, i.e., 𝟙(b) returns one if b is true and returns zero if b is false. Thus, the function pΦ takes fractional values between 0 and 1, which are typically expressed as percentages.

The precision at rank R of the system Φ is defined as the average value taken by the function pΦ over the entire corpus of N′ audio samples:

$$ \mathrm{P}_{\mathbf{\Phi}}(k, R) = \frac{1}{N^{\prime}} \sum_{n=1}^{N^{\prime}} p_{\mathbf{\Phi}}(n, k, R) $$

Lastly, the “average precision at rank R” (henceforth, AP@R) is the average value of PΦ, for constant R, across all K=31 participants from our perceptual study:

$$ \text{AP}_{\mathbf{\Phi}}(R) = \frac{1}{K} \sum_{k=1}^{K} \mathrm{P}_{\mathbf{\Phi}}(k, R) $$

It appears from the above that an effective system Φ should retrieve sounds whose IMT triplets are similar according to all of the K cluster graphs \(\mathcal {G}_{1} \ldots \mathcal {G}_{K}\).

In the rest of this paper, we set R to 5. This is in accordance with the protocol of [4], in which the authors trained a metric learning algorithm on the SOL dataset to search for similar instruments and playing techniques, yet without the intervention of a human subject.
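The three nested averages of this subsection (over ranks, over queries, and over participants) can be sketched as follows. The sketch assumes that the search engine Φ ranks samples by squared Euclidean distance in some feature space Z and that a query never retrieves itself; both are assumptions of the example, as are the function names and the toy data.

```python
import numpy as np

def precision_at_rank(Z, clusters, R=5):
    """P(k, R): mean fraction of the R nearest neighbors lying in the query's cluster."""
    d = np.sum((Z[:, None] - Z[None]) ** 2, axis=-1)
    np.fill_diagonal(d, np.inf)                   # assumption: the query is excluded
    ranked = np.argsort(d, axis=1)[:, :R]         # Phi_1, ..., Phi_R for every query
    relevant = clusters[ranked] == clusters[:, None]
    return relevant.mean()

def average_precision_at_rank(Z, clusterings, R=5):
    """AP@R: average of P(k, R) over the K per-subject clusterings."""
    return float(np.mean([precision_at_rank(Z, c, R) for c in clusterings]))

# Toy example: six samples on a line, two subjects who cluster them differently.
Z = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
clusterings = [np.array([0, 0, 0, 1, 1, 1]), np.array([0, 0, 1, 1, 1, 1])]
ap = average_precision_at_rank(Z, clusterings, R=2)
```

In this toy example, the first subject yields a precision of 1 and the second a precision of 2/3, hence an AP@2 of 5/6: a single ranking cannot fully satisfy two subjects who disagree.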

5 Results

The previous section described our methods for extracting spectrotemporal modulations in audio signals, as well as learning a non-Euclidean similarity metric between them. We now turn to apply these methods to the problem of allocating isolated musical notes to clusters of some timbre similarity graph \(\mathcal {G}\). In practice, for training purposes, the cluster graph \(\mathcal {G}\) represents the consensus of the K=31 clusterings provided by the users interacting with the Cyberlioz web application, which was described in Section 3 (\(\mathcal {G}=\mathcal {G}_{0}\)). However, for evaluation purposes, this cluster graph corresponds to the subjective preferences of a single user k≥1, in which case we take \(\mathcal {G}=\mathcal {G}_{k}\).

5.1 Best performing system

Our best performing system comprises five computational blocks:

  1. Joint time–frequency scattering up to a maximal time scale of T=1000 ms,

  2. Temporal averaging at the scale of the whole musical note,

  3. Median-based logarithmic compression,

  4. Affine standardization so that each feature has zero mean and unit variance, and

  5. Nearest-neighbor search according to a previously learned non-Euclidean metric.

Note that the non-Euclidean metric is learned via LMNN (see Section 4) on the “consensus” cluster graph \(\mathcal {G}_{0}\). Therefore, the system Φ performs timbre similarity retrieval in a user-agnostic way and can serve as a convenient default algorithm for new users. That being said, it is conceivable to replicate the five-stage protocol above on the cluster graph \(\mathcal {G}_{k}\) of a specific user k, instead of the cluster graph \(\mathcal {G}_{0}\). This operation would lead to a new configuration of the search engine Φ that is better tailored to the perceptual idiosyncrasy of user k in terms of timbre similarity.

Within the default setting (\(\mathcal {G}=\mathcal {G}_{0}\)), our system Φ achieves an average precision at rank five (AP@5) of 99.0%, with a standard deviation across K=31 participants of the order of 1%. This favorable result suggests that joint time–frequency scattering provides a useful feature map for learning similarities between instrumental playing techniques. In doing so, it is in line with a recent publication [74], in which the authors successfully trained a supervised classifier on joint time–frequency scattering features in order to detect and classify playing techniques from the Chinese bamboo flute (dizi). However, the originality of our work is that Φ relies purely on auditory information (i.e., timbre similarity judgments), and does not require any supervision from the symbolic domain. In particular, it does not assume the metadata (instrument, mute, technique, pitch, dynamics, and so forth) of any musical sample xn to be observable, in part or in full, at training time.

5.2 Visualization of joint time–frequency scattering coefficients

At the second order, joint time–frequency scattering coefficients depend upon four variables: time t, frequency λ, temporal modulation rate α in Hertz, and frequential modulation scale β in cycles per octave. From a data visualization standpoint, rendering the four-dimensional tensor S2x(t,λ,α,β) directly is impractical. To address this limitation, a recent publication has projected this tensor into a two-dimensional “slice,” thus yielding an image raster [13]. In accordance with their protocol, we compute the following matrix:

$$ \mathbf{V}\boldsymbol{x}(\alpha, \beta) = \iint_{\mathbb{R}^{2}} \mathbf{U_{2}}\boldsymbol{x}\left(t - t^{\prime}, \log \lambda - u^{\prime}, \alpha, \beta\right) \;\mathrm{d}t^{\prime} \, \mathrm{d}u^{\prime}. $$

Observe that the equation above is a limit case of S2 (Eq. 6) in which the constants T and F tend towards infinity. As a result, Vx depends solely upon rate α and scale β. In the scientific literature on STRF, the matrix Vx is known as the “cortical output collapsed on the rate-scale axes” [75].
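In discrete form, collapsing onto the rate and scale axes is a sum over time and log-frequency. The sketch below assumes, for simplicity, that U2x is stored as a dense array indexed by (time, log-frequency, rate, scale), which ignores the multirate sampling described in Section 4; the toy tensor is synthetic and the function name is hypothetical.

```python
import numpy as np

def collapse_rate_scale(U2):
    """Sum a dense tensor U2[t, u, alpha, beta] over time and log-frequency."""
    return U2.sum(axis=(0, 1))

rng = np.random.default_rng(0)
U2 = 0.01 * rng.random((50, 24, 7, 9))            # toy nonnegative tensor
U2[:, :, 3, 6] += 1.0                             # inject energy at one (rate, scale) path
V = collapse_rate_scale(U2)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(V), V.shape))
```

The location of the maximum of V recovers the indices of the path at which energy was injected, which is the kind of local energy maximum discussed in the figures of this section.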

Previous publications on STRFs have demonstrated the interest of visualizing the slice Vx in the case of speech [76], lung sounds [77], and music [18]. However, the visualization of musical sounds has been restricted to some of the most common playing techniques, i.e., piano played staccato and violin played pizzicato. Furthermore, prior publications on time–frequency scattering have displayed slices of S2x in the case of synthetic signals; but there is a gap in the literature as regards the interpretability of the scale–rate domain in the case of real-world signals.

To remedy this gap, we select twelve isolated notes from the SOL dataset from two instruments: violin (Fig. 3) and flute (Fig. 4). By and large, we find that joint time–frequency scattering produces patterns in the scale–rate domain that are comparable to the “cortical output” of the STRF. For example, Fig. 3a shows that a violin note played in the ordinario technique has a local energy maximum at the rate α=6 Hz. A visual inspection of U1 demonstrates that this rate coincides with the rate of vibrato of the left hand in the violin note. As seen in Fig. 3b, this local energy maximum is absent when the playing technique is denoted as nonvibrato. Furthermore, Fig. 3c shows that the local energy maximum is displaced to a higher rate (α=12 Hz) when the vibrato is replaced by a tremolo.

Fig. 3

Six playing techniques of the violin. Subfigures: a ordinario (with vibrato), b nonvibrato, c tremolo, d sforzando, e pizzicato (laissez vibrer, i.e., let ring), and f staccato. In each subfigure, the top image shows the wavelet scalogram as a function of time t (in seconds) and frequency λ (in Hertz). Conversely, the bottom image shows the average time–frequency scattering coefficients, as a function of temporal modulation rate α (in Hertz) and frequential modulation scale β (in cycles per octave). Darker shades denote greater values of acoustic energy. See Section 4 for details

Fig. 4

Six playing techniques of the flute. Subfigures: a ordinario, b trill (C4D4), c interference (play C4 while singing C4), d sforzando, e key click, and f staccato. In each subfigure, the top image shows the wavelet scalogram as a function of time t (in seconds) and frequency λ (in Hertz). Conversely, the bottom image shows the average time–frequency scattering coefficients, as a function of temporal modulation rate α (in Hertz) and frequential modulation scale β (in cycles per octave). Darker shades denote greater values of acoustic energy. See Section 4 for details

The visual interpretation of playing techniques in terms of their joint time–frequency scattering coefficients is not restricted to periodic modulation, such as vibrato or tremolo: rather, it also encompasses the analysis of attack transients. Figure 3d, e, and f show the matrix Vx for three instances of impulsive violin sounds: sforzando, pizzicato, and staccato, respectively. These three techniques create ridges in the scale–rate domain (α,β), where the cutoff rate α is lowest with sforzando and highest with staccato. These variations in cutoff rate coincide with perceptual variations in “hardness”, i.e., impulsivity, of the violin sound. Moreover, in the case of staccato, we observe a slight asymmetry in the frequential scale parameter β. This asymmetry could be due to the fact that higher-order harmonics decay faster than the fundamental, thus yielding a triangular shape in the time–frequency domain.

Figure 4 shows six playing techniques of the flute. Similarly to the violin (Fig. 3), we observe that periodic modulations, such as a trill (Fig. 4b) or a beating tone (Fig. 4c), cause local energy maxima whose rate α is physically interpretable. Likewise, impulsive flute sounds such as sforzando (Fig. 4d), key click (Fig. 4e), and staccato (Fig. 4f) create ridges in the scale–rate domain of varying cutoff rates α. We distribute the implementation of these figures as part of the MATLAB library scattering.m, which is released under the MIT license (see Footnote 5).

5.3 Ablation study

We now turn to alter certain key choices in the design of the above-described computational blocks, and discuss their respective impacts on downstream performance.

Figure 5 summarizes our results. Interestingly, the system Φ is not only best on average, but also best for every subject in the cohort. Specifically, replacing Φ by a simpler model Φ′ (see subsections below for examples of such models) results in \(\mathrm {P}_{\mathbf {\Phi ^{\prime }}}(k, 5) < \mathrm {P}_{\mathbf {\Phi }}(k, 5)\) for every k. Borrowing from the terminology of welfare economics, Φ can be said to be uniquely Pareto-efficient [78]. This observation suggests that the increase in performance afforded by the state-of-the-art model with respect to the baseline does not come at the detriment of user fairness.

Fig. 5

Impact of different processing architecture or protocol designs. For each condition, the central mark indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. The performance achieved for each reference clustering is depicted by a lozenge whose color is chosen arbitrarily but consistently across conditions. See Section 5 for details

5.4 Role of dataset size

First of all, it is worth noting that the presented AP@5 figure of 99.0%±1 does not abide by the conventional training set vs. test set paradigm, as is most often done in machine learning research. Rather, the LMNN algorithm is trained on all available samples (N′=9346 isolated notes) with the “consensus” cluster graph as training objective (\(\mathcal {G}=\mathcal {G}_{0}\)). Then, it is evaluated on the same samples with individual cluster graphs as ground truth (\(\mathcal {G}=\mathcal {G}_{k}\) for k≥1). The reason behind this choice is that, in the context of computer-assisted orchestration, the collection of audio samples has a fixed size, as it is shipped alongside the software itself. This is the case, for example, of Orchidea (see Footnote 6), which comes paired with its own version of SOL, named OrchideaSOL [79]. Hence, our main goal was to evaluate the generalization ability of our metric learning algorithm beyond the restricted set of samples for which human annotations are directly available (see Section 3), that is, beyond one pitch class (middle C) and one intensity level (mf).

Despite this caveat, we may adopt a “query-by-example” (QbE) paradigm by partitioning the database of audio samples in half \(\left (\frac {1}{2}N^{\prime }=4673\right)\), training the LMNN algorithm on the first half, and querying it with samples from the other half. In this evaluation framework, our system reaches an AP@5 of 96.2%±2. Interestingly, querying the system with samples from the training set leads to an AP@5 of 96.5%±2, i.e., roughly on par with the test set. Thus, it appears that the gap in performance between the evaluation presented in Section 5.1 (99.0%±1) and query-by-example evaluation (96.2%±2) is primarily attributable to a reduction of the size of the training set by half, whereas statistical overfitting of the training set with respect to the test set likely plays a minor role.

These findings are in line with a previous publication [18] which modeled perceived dissimilarity between musical sounds by means of STRF features. Likewise, a recent publication has observed similar performance in instrument identification of a human and a machine classifier using separable Gabor filterbank (GBFB) features [80]. We note that both STRF and GBFB bear a strong computational resemblance to time–frequency scattering.

5.5 Role of metric learning

Replacing the LMNN metric learning algorithm by linear discriminant analysis (LDA) leads to an AP@5 of 76.6%±11. Moreover, we evaluate the nearest neighbor algorithm Φ in the absence of any metric learning at all. This corresponds to using a Euclidean distance to compare scattering coefficients, i.e., to setting L to the identity matrix. Note that the runtime complexity of Euclidean nearest-neighbor search is identical to that of LMNN search. We report an average precision at rank five (AP@5) of 92.9%±3, which is noticeably worse than the best performing system. This gap in performance demonstrates the importance of complementing unsupervised feature extraction by supervised metric learning in the design of computational models for timbre similarity between instrumental playing techniques.
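The relationship between the two search procedures can be illustrated as follows: a learned metric amounts to a Euclidean search after multiplying every feature vector by L, so that leaving L out is the same as setting it to the identity. The sketch is a brute-force search on toy data; the function name and the matrix L used here are hypothetical, not a learned LMNN solution.

```python
import numpy as np

def nearest_neighbors(Z_query, Z_db, L=None, R=5):
    """Indices of the R nearest database items under the metric ||L a - L b||."""
    if L is not None:
        # Database features can be projected once, ahead of query time.
        Z_query, Z_db = Z_query @ L.T, Z_db @ L.T
    d = np.sum((Z_query[:, None] - Z_db[None]) ** 2, axis=-1)
    return np.argsort(d, axis=1)[:, :R]

Z_db = np.array([[0.0, 10.0], [3.0, 0.0], [1.0, 5.0]])
Z_query = np.array([[0.0, 0.0]])
euclid = nearest_neighbors(Z_query, Z_db, R=3)
L = np.diag([1.0, 0.0])                           # toy metric ignoring the second feature
learned = nearest_neighbors(Z_query, Z_db, L=L, R=3)
```

In this toy case, the Euclidean ranking is (1, 2, 0) whereas the ranking under L is (0, 2, 1): a linear map can reorder neighbors by discounting features that are irrelevant to the similarity judgment.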

5.6 Role of temporal context T

Our best performing system operates with joint time–frequency scattering coefficients as spectrotemporal modulation features. These features are extracted within a temporal context of duration equal to T=1000 ms. This value is considerably larger than the frame size of purely spectral features, such as spectral centroid, spectral flux, or mel-frequency cepstral coefficients (MFCCs). Indeed, the frame size of spectral features for machine listening is typically set to T=23 ms, i.e., \(2^{10}=1024\) samples at a sampling rate of 44.1 kHz [22, 23].
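The orders of magnitude quoted above can be checked with a few lines of arithmetic; the variable names are ours.

```python
sample_rate = 44100                               # Hz
frame_length = 2 ** 10                            # samples
frame_ms = 1000 * frame_length / sample_rate      # duration of one spectral frame
context_ratio = 1000 / frame_ms                   # T = 1000 ms relative to one frame
print(f"{frame_ms:.1f} ms, ratio {context_ratio:.0f}")
```

A frame of 1024 samples thus lasts about 23.2 ms, so that a context of T=1000 ms spans roughly 43 such frames.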

As a point of comparison, we set the maximum time scale of joint time–frequency scattering coefficients to T=25ms, hence a 40-fold reduction in context size. Over our cohort of K=31 participants, we report an AP@5 of 90.9%±4, which is noticeably worse than the best performing system. This gap in performance extends the findings of a previous publication [4], which reported that metric learning with temporal scattering coefficients tends to improve with growing values of T, until reaching a plateau of performance around T=500ms.
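For reference, the frame duration and the reduction factor quoted above follow from simple arithmetic:

$$ \frac{2^{10}}{44100\,\text{Hz}} = \frac{1024}{44100}\,\text{s} \approx 23.2\,\text{ms}, \qquad \frac{1000\,\text{ms}}{25\,\text{ms}} = 40. $$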

5.7 Role of joint time–frequency scattering

Let us recall the full definition of second-order joint time–frequency scattering (see Section 4):

$$ {}\mathbf{S_{2}}\boldsymbol{x}(t,\lambda,\alpha,\beta) = \left(\left\vert \left\vert \boldsymbol{x} \ast \boldsymbol{\psi}_{\lambda} \right\vert \circledast \boldsymbol{\Psi}_{\alpha,\beta} \right\vert \circledast \left(\boldsymbol{\phi}_{T} \otimes \boldsymbol{\phi}_{F}\right) \right)(t,\lambda), $$

where ψλ denotes a Morlet wavelet of center frequency λ and ϕT denotes a Gaussian low-pass filter of width T. Besides the joint time–frequency scattering transform, the generative grammar of scattering transforms [81] also encompasses the separable time–frequency scattering transform:

$$ \begin{aligned} \mathbf{S_{2}^{\text{sep}}}\boldsymbol{x}(t,\lambda,\alpha,\beta) = \left(\bigg\vert \Big\vert \big\vert \boldsymbol{x} \ast \boldsymbol{\psi}_{\lambda} \big\vert \ast \boldsymbol{\psi}_{\alpha} \Big\vert \circledast \left(\boldsymbol{\phi}_{T} \otimes \boldsymbol{\psi}_{\beta}\right) \bigg\vert \circledast \left(\boldsymbol{\phi}_{T} \otimes \boldsymbol{\phi}_{F}\right) \right)(t, \lambda), \end{aligned} $$

where the wavelet ψλ has a quality factor of Q=12 whereas the wavelets ψα and ψβ have a quality factor of one. Previous publications have successfully applied separable time–frequency scattering in order to classify environmental sounds [82] as well as playing techniques from the Chinese bamboo flute [83].

In comparison with its joint counterpart, separable time–frequency scattering contains about half as many coefficients. This is because the temporal wavelet transform with ψα and the frequential wavelet transform with ψβ are separated by an operation of complex modulus. Hence, ψβ operates on a real-valued input. Because separable time–frequency scattering cannot distinguish ascending chirps from descending chirps, flipping the sign of the scale variable β is no longer necessary. Moreover, separable time–frequency scattering has a lower algorithmic complexity than joint time–frequency scattering. Indeed, in Eq. 20, the frequential wavelet transform with ψβ operates on a tensor whose time axis is subsampled at a fixed rate T⁻¹, thus allowing vectorized computations. Conversely, in Eq. 19, the frequential wavelet transform must operate on a multiresolution input, whose sample rate varies depending on α: it ranges between T⁻¹ and the sample rate of x itself.
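The loss of chirp direction can be checked numerically on a toy example. The sketch below is not our implementation: it assumes a small circular grid, Gabor filters in place of Morlet wavelets, a global average in place of the low-pass filters ϕT and ϕF, and it omits the intermediate low-pass filter of the separable transform. All names and parameter values are hypothetical. An ascending ridge and its mirror image along the frequency axis stand in for an ascending and a descending chirp in the scalogram.

```python
import cmath
import math

N = 32  # grid size along both time and log-frequency (toy value)

def gabor(n, xi, sigma):
    """Complex Gabor filter on a circular grid of n samples, centered at 0."""
    out = []
    for i in range(n):
        x = i if i < n // 2 else i - n  # signed sample index
        out.append(math.exp(-0.5 * (x / sigma) ** 2)
                   * cmath.exp(2j * math.pi * xi * x))
    return out

def conv_time(u, h):
    """Circular convolution of u[t][f] with h along the time axis."""
    n = len(u)
    return [[sum(u[s][f] * h[(t - s) % n] for s in range(n))
             for f in range(len(u[0]))] for t in range(n)]

def conv_freq(u, h):
    """Circular convolution of u[t][f] with h along the frequency axis."""
    n = len(u[0])
    return [[sum(row[g] * h[(f - g) % n] for g in range(n))
             for f in range(n)] for row in u]

def modulus(u):
    return [[abs(v) for v in row] for row in u]

def mean(u):
    return sum(sum(row) for row in u) / (len(u) * len(u[0]))

def joint(u, psi_t, psi_f):
    # One two-dimensional complex filter (outer product), then one modulus.
    return mean(modulus(conv_freq(conv_time(u, psi_t), psi_f)))

def separable(u, psi_t, psi_f):
    # A modulus sits between the temporal and the frequential filter.
    return mean(modulus(conv_freq(modulus(conv_time(u, psi_t)), psi_f)))

def ridge(sign):
    """Gaussian ridge along f = sign * t (mod N): a toy chirp."""
    u = []
    for t in range(N):
        row = []
        for f in range(N):
            d = (f - sign * t) % N
            d = min(d, N - d)  # circular distance to the ridge
            row.append(math.exp(-0.5 * (d / 2.0) ** 2))
        u.append(row)
    return u

psi_t = gabor(N, 1 / 8, 4.0)
psi_f = gabor(N, 1 / 8, 4.0)
up, down = ridge(+1), ridge(-1)

joint_up, joint_down = joint(up, psi_t, psi_f), joint(down, psi_t, psi_f)
sep_up, sep_down = separable(up, psi_t, psi_f), separable(down, psi_t, psi_f)
```

Under these assumptions, `sep_up` and `sep_down` coincide up to rounding error, because the intermediate modulus makes the input of the frequential filter real-valued, whereas `joint_up` and `joint_down` differ substantially. Responding to the other chirp direction in the joint case requires flipping the sign of the frequential center frequency, which is the role played by the sign of β.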

Yet, despite its greater simplicity, separable time–frequency scattering suffers from known weaknesses in its ability to represent spectrotemporal modulations. In particular, a previous publication has shown that frequency-dependent time shifts affect joint time–frequency scattering coefficients while leaving separable time–frequency scattering coefficients almost unchanged [13]. The same observation was made by [84] in the case of joint and separable Gabor filterbank (GBFB) features, which bear some resemblance to joint and separable time–frequency scattering coefficients, respectively.

Over our cohort of K=31 participants, separable time–frequency scattering achieves an AP@5 of 91.9%±4. This figure is noticeably worse than joint time–frequency scattering (99.0%±1), all other things being equal. This gap in performance, together with the theory of STRFs in auditory neuroscience (see Section 2), demonstrates the importance of joint spectrotemporal modulations in the modeling of timbre similarity across instrumental playing techniques.

5.8 Comparison with mel–frequency cepstral coefficient (MFCC) baseline

Lastly, we train a baseline system in which joint time–frequency scattering coefficients are replaced by MFCCs. Specifically, we extract a 40-band mel-frequency spectrogram by means of the RASTAMAT library, apply the pointwise logarithm, and compute a discrete cosine transform (DCT) over the mel-frequency axis (see footnote 7). This operation results in a 40-dimensional feature vector over frames of duration T=25ms. Over our cohort of K=31 participants, we report an AP@5 of 81.8%±7.

Arguably, this figure is not directly comparable with our best performing system (AP@5 of 99.0%±1), due to the mismatch in dimensionality between MFCC and joint time–frequency scattering coefficients. In order to clarify the role of feature dimensionality in our computational pipeline, we apply a feature engineering technique involving multiplicative combinations of MFCC. We construct the following Gram matrix:

$$ {}\mathbf{G} \boldsymbol{x}(\alpha, \beta) = \int_{0}^{+\infty} \text{MFCC}(\boldsymbol{x})(t, \alpha) \text{MFCC}(\boldsymbol{x})(t, \beta) \;\mathrm{d}t, $$

where α and β represent different dimensions (“quefrencies”) of the MFCC feature vector. The symmetric matrix Gx contains 40 rows and 40 columns, hence 800 unique coefficients. Concatenating these coefficients to the 40 averaged MFCC features results in a feature vector of 840 coefficients. This dimension is of the same order of magnitude as the dimension of our joint time–frequency scattering representation (d=1180, see Section 4).

Training the LMNN algorithm on this 840-dimensional representation is analogous to a “kernel trick” in support vector machines [85]. In our case, the implicit similarity kernel is a homogeneous quadratic kernel. Despite this increase in representational power, we obtain an AP@5 of 81.5%±7, i.e., essentially the same as MFCC under a linear kernel. Therefore, it appears that the gap in performance between MFCC (81.8%±7) and joint time–frequency scattering (99.0%±1) is primarily attributable to the multiscale extraction of joint spectrotemporal modulations, whereas high-dimensional embedding likely plays a minor role.
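The link between the Gram matrix and a homogeneous quadratic kernel can be verified on a small example. The sketch below is illustrative and rests on stated assumptions: the integral over time is replaced by a sum over frames, the full matrix is used rather than its unique coefficients, and the three-dimensional "MFCC" frames are made-up numbers. It checks that the Euclidean inner product between two Gram matrices equals the sum, over all pairs of frames, of the squared dot product between frames.

```python
def gram(frames):
    """Discrete counterpart of G(a, b): sum over frames of m[a] * m[b]."""
    d = len(frames[0])
    return [[sum(fr[a] * fr[b] for fr in frames) for b in range(d)]
            for a in range(d)]

def frobenius_inner(A, B):
    """Entrywise (Euclidean) inner product between two matrices."""
    return sum(a * b for row_a, row_b in zip(A, B)
               for a, b in zip(row_a, row_b))

def quadratic_kernel(frames_x, frames_y):
    """Sum over all frame pairs of the squared dot product."""
    return sum(sum(p * q for p, q in zip(fx, fy)) ** 2
               for fx in frames_x for fy in frames_y)

# Hypothetical toy sounds with two and three frames of dimension three.
x = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0]]
y = [[2.0, 0.0, 1.0], [1.0, 1.0, -1.0], [0.5, 0.5, 0.5]]

lhs = frobenius_inner(gram(x), gram(y))  # linear model on Gram features
rhs = quadratic_kernel(x, y)             # quadratic kernel on raw frames
```

Both quantities agree, which is why a linear method applied to the entries of G acts as a quadratic kernel on the underlying MFCC frames.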

6 Conclusion

We see from the ablation study conducted above that each step of the proposed model is necessary for its high performance. Among other things, it indicates the joint time–frequency scattering transform itself should not be used directly for similarity comparisons, but is best augmented by a learned stage. This explains the relative success of such models [4, 14] over others where distances on raw scattering coefficients were used for judging audio similarity.

On the other hand, we note that the complexity of the learned model does not need to be high. Indeed, for this task, a linear model on the scattering coefficients is sufficient. There is no need for deep networks with large numbers of parameters to accurately represent the similarity information. In other words, the scattering transform parametrizes the signal structure in a way that many relevant quantities (such as timbre similarity) can be extracted through a simple linear mapping. This is in line with other classification results, where linear support vector machines applied to scattering coefficients have achieved significant success [12, 13].

We also see the necessity of a fully joint time–frequency model for accessing timbre, as opposed to a purely spectral model or one that treats the two axes in a separable manner. This fact has also been observed in other contexts, such as the work of Patil et al. [18]. A related observation is the need for capturing large-scale structure. Indeed, reducing the window size to 25 ms means that we lose a great deal of time–frequency structure, bringing results closer to those of the separable model.

The success of the above system in identifying timbral similarities has immediate applications in browsing music databases. These are typically organized according to taxonomies of instruments and playing techniques, with additional keywords offering a more flexible organization. Accessing the sounds in these databases therefore requires some knowledge of the taxonomy and keywords used. Furthermore, the user needs to have some idea of the particular playing technique they are searching for.

As an alternative, content-based searches allow the user to identify sounds based on similarity to some given query sound. This query-by-example approach provides an opportunity to search for sounds without having a specific instrument or playing technique in mind, yielding a wider range of available sounds. A composer with access to such a system would therefore be able to draw on a more diverse palette of musical timbre.

The computational model proposed in this work is well suited to such a query-by-example task. We have shown that it is able to adequately approximate the timbral judgments of a wide range of participants included in our study. Not only that, but the system can be easily retrained to approximate an individual user’s timbre perception by having that user perform the clustering task on the reduced set of 78 IPTs and running the LMNN training step on those clustering assignments. This model can then be applied to new data, or, alternatively, be retrained with these new examples if the existing model proves unsatisfactory.

We shall note, however, that the current model has several drawbacks. First, it is only applied to instrumental sounds. While this has the advantage of simplifying the interpretation of the results, the range of timbre under consideration is necessarily limited (although less restricted than only considering ordinario PTs). This also makes applications such as query-by-humming difficult, since we cannot guarantee that the timbral similarity measure is accurate for such sounds.

That being said, the above model is general enough to encompass a wide variety of recordings, not just instrumental sounds. Indeed, we have good reason to assume that the tools used (scattering transforms and LMNN weighting matrices) do not depend strongly on the type of sound being processed. Future work will investigate whether more general classes of sounds are also well modeled. To extend the model, it is only necessary to retrain the LMNN weighting matrix by supplying it with new cluster assignments. These can again be obtained by performing a new clustering experiment with one or more human subjects.

Another aspect is the granularity of the similarity judgments. In the above method, we have used hard clustering assignments to build our model. A more nuanced similarity judgment would ask users to rate the similarity of a pair of IPTs on a more graduated scale, which would yield a finer, or soft, assignment. This approach, however, makes it difficult to provide a consistent scale across participants, although it could be feasible if the goal is only to adapt the timbral space to a single individual. An approach not based on clustering would also have to replace the LMNN algorithm with one that accounts for such soft assignments.

Availability of data and materials

The processing code, the Cyberlioz interface and an anonymized version of the perceptual judgments gathered using this interface are available at the companion website:

The SOL dataset is available online:


  1. For the sake of research reproducibility, the source code for the experimental protocol of this paper is available online, alongside anonymized data from human subjects:

  2. CNSMDP: Conservatoire National Supérieur de Musique et de Danse de Paris. Official website:

  3. The Web application for efficient audio annotation, as well as the raw anonymized responses of all 31 participants to our study, is available at:

  4. For more information about these mailing lists, please visit:

  5. Link to download the scattering.m library:

  6. Link to Orchidea software and OrchideaSOL dataset:

  7. Link to RASTAMAT library:



Abbreviations

AP@5: Average precision at rank 5

CNSMDP: Conservatoire national supérieur de musique et de danse de Paris

COSMIR: Community for open and sustainable music information research

DCT: Discrete cosine transform

FMA: Free music archive

GBFB: Gabor filterbank features

LMNN: Large-margin nearest neighbor

MFCC: Mel-frequency cepstral coefficients

MIR: Music information retrieval

P@5: Precision at rank five

ReLU: Rectified linear unit

STRF: Spectrotemporal receptive fields

SOL: Studio on line dataset

Musical instruments

Concert harp

Spanish guitar

Soprano clarinet

Alto saxophone

Trumpet in C

French horn

Tenor trombone

Bass tuba

Musical nuances

pp: Pianissimo

p: Piano

mf: Mezzo forte

f: Forte

ff: Fortissimo



  1. J. S. Downie, Music information retrieval. Ann. Rev. Inf. Sci. Technol. 37(1), 295–340 (2003).

  2. K. Siedenburg, C. Saitis, S. McAdams, in Timbre: Acoustics, Perception, and Cognition, ed. by K. Siedenburg, C. Saitis, S. McAdams, A. N. Popper, and R. R. Fay. The Present, Past, and Future of Timbre Research (Springer International Publishing, Cham, 2019), pp. 1–19.

  3. A. Faure, S. McAdams, V. Nosulenko, in Proceedings of the International Conference on Music Perception and Cognition (ICMPC). Verbal correlates of perceptual dimensions of timbre, (1996), pp. 79–84.

  4. V. Lostanlen, J. Andén, M. Lagrange, in Proceedings of the International Conference on Digital Libraries for Musicology (DLfM). Extended playing techniques: the next milestone in musical instrument recognition (ACM, 2018), pp. 1–10.

  5. A. Antoine, E. R. Miranda, in Proceedings of the International Symposium on Musical Acoustics (ISMA). Musical Acoustics, Timbre, and Computer-Aided Orchestration Challenges, (2018), pp. 151–154.

  6. S. Kolozali, M. Barthet, G. Fazekas, M. B. Sandler, in Proceedings of the International Society on Music Information Retrieval (ISMIR) Conference. Knowledge Representation Issues in Musical Instrument Ontology Design, (2011), pp. 465–470.

  7. J. Calvo-Zaragoza, J. Hajič Jr., A. Pacha, Understanding optical music recognition. ACM Comput. Surv., 1–42 (2020).

  8. R. Erickson, Sound structure in music (University of California Press, Oakland, 1975).

  9. E. Thoret, B. Caramiaux, P. Depalle, S. McAdams, Human dissimilarity ratings of musical instrument timbre: a computational meta-analysis. J. Acoust. Soc. Am. 143(3), 1745–1746 (2018).

  10. Y. Maresz, On computer-assisted orchestration. Contemp. Music. Rev. 32(1), 99–109 (2013).

  11. M. Caetano, A. Zacharakis, I. Barbancho, L. J. Tardón, Leveraging diversity in computer-aided musical orchestration with an artificial immune system for multi-modal optimization. Swarm Evol. Comput. 50, 100484 (2019).

  12. J. Andén, V. Lostanlen, S. Mallat, in Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP). Joint time-frequency scattering for audio classification (IEEE, 2015), pp. 1–6.

  13. J. Andén, V. Lostanlen, S. Mallat, Joint Time–Frequency Scattering. IEEE Trans. Signal Process. 67(14), 3704–3718 (2019).

  14. V. Lostanlen, G. Lafay, J. Andén, M. Lagrange, Relevance-based quantization of scattering features for unsupervised mining of environmental audio. EURASIP J. Audio. Speech. Music. Process. 2018(1), 15 (2018).

  15. J. Andén, S. Mallat, in Proceedings of the International Conference on Digital Audio Effects (DAFx). Scattering Representation of Modulated Sounds, (2012), pp. 1–4.

  16. K. Q. Weinberger, L. K. Saul, Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. 10, 207–244 (2009).

  17. S. McAdams, S. Winsberg, S. Donnadieu, G. De Soete, J. Krimphoff, Perceptual scaling of synthesized musical timbres: common dimensions, specificities, and latent subject classes. Psychol. Res. 58(3), 177–192 (1995).

  18. K. Patil, D. Pressnitzer, S. Shamma, M. Elhilali, Music in our ears: the biological bases of musical timbre perception. PLoS Comput. Biol. 8(11), e1002759 (2012).

  19. C. Joder, S. Essid, G. Richard, Temporal integration for audio classification with application to musical instrument classification. IEEE Trans. Audio. Speech. Lang. Process. 17(1), 174–186 (2009).

  20. K. Siedenburg, I. Fujinaga, S. McAdams, A comparison of approaches to timbre descriptors in music information retrieval and music psychology. J. New. Music. Res. 45(1), 27–41 (2016).

  21. K. D. Martin, Y. E. Kim, in Proceedings of the Acoustical Society of America. Musical instrument identification: A pattern recognition approach, (1998), pp. 1–12.

  22. J. C. Brown, Computer identification of musical instruments using pattern recognition with cepstral coefficients as features. J. Acoust. Soc. Am. 105(3), 1933–1941 (1999).

  23. A. Eronen, A. Klapuri, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Musical instrument recognition using cepstral coefficients and temporal features, (2000).

  24. P. Herrera Boyer, G. Peeters, S. Dubnov, Automatic classification of musical instrument sounds. J. New. Music. Res. 32(1), 3–21 (2003).

  25. A. A. Wieczorkowska, J. M. Żytkow, Analysis of feature dependencies in sound description. J. Intell. Inf. Syst. 20(3), 285–302 (2003).

  26. A. Livshin, X. Rodet, in Proceedings of the International Conference on Digital Audio Effects (DAFx). Musical instrument identification in continuous recordings, (2004).

  27. A. G. Krishna, T. V. Sreenivas, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Music instrument recognition: from isolated notes to solo phrases, (2004).

  28. I. Kaminskyj, T. Czaszejko, Automatic recognition of isolated monophonic musical instrument sounds using kNNC. J. Intell. Inf. Syst. 24(2-3), 199–221 (2005).

  29. E. Benetos, M. Kotti, C. Kotropoulos, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Musical instrument classification using non-negative matrix factorization algorithms and subset feature selection, (2006).

  30. D. G. Bhalke, C. B. R. Rao, D. S. Bormane, Automatic musical instrument classification using fractional Fourier transform based-MFCC features and counter propagation neural network. J. Intell. Inf. Syst. 46(3), 425–446 (2016).

  31. E. Humphrey, S. Durand, B. McFee, in Proceedings of the International Society on Music Information Retrieval (ISMIR) Conference. OpenMIC-2018: an open dataset for multiple instrument recognition, (2018).

  32. B. McFee, E. J. Humphrey, J. Urbano, in Proceedings of the International Society on Music Information Retrieval (ISMIR) Conference. A plan for sustainable MIR evaluation, (2016).

  33. M. Defferrard, K. Benzi, P. Vandergheynst, X. Bresson, in Proceedings of the International Society on Music Information Retrieval (ISMIR) Conference. FMA: A dataset for music analysis, (2017).

  34. V. Lostanlen, C. E. Cella, in Proceedings of the International Society on Music Information Retrieval (ISMIR) Conference. Deep convolutional networks on the pitch spiral for musical instrument recognition, (2016).

  35. R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, J. P. Bello, in Proceedings of the International Society on Music Information Retrieval (ISMIR) Conference. MedleyDB: A multitrack dataset for annotation-intensive MIR research, (2014).

  36. B. McFee, E. J. Humphrey, J. P. Bello, in Proceedings of the International Society on Music Information Retrieval (ISMIR). A software framework for musical data augmentation, (2015).

  37. J. Pons, O. Slizovskaia, R. Gong, E. Gómez, X. Serra, in 25th European Signal Processing Conference (EUSIPCO). Timbre analysis of music audio signals with convolutional neural networks, (2017), pp. 2744–2748.

  38. S. Gururani, C. Summers, A. Lerch, in Proceedings of the International Society on Music Information Retrieval (ISMIR) Conference. Instrument Activity Detection in Polyphonic Music using Deep Neural Networks, (2018).

  39. M. A. Loureiro, H. B. de Paula, H. C. Yehia, in Proceedings of the International Society on Music Information Retrieval (ISMIR) Conference. Timbre Classification Of A Single Musical Instrument, (2004).

  40. Y. Han, J. Kim, K. Lee, Deep convolutional neural networks for predominant instrument recognition in polyphonic music. IEEE Trans. Audio. Speech. Lang. Process. 25(1), 208–221 (2017).

  41. S. McAdams, B. L. Giordano, in The Oxford handbook of music psychology. The perception of musical timbre, (2009), pp. 72–80.

  42. K. Siedenburg, K. Jones-Mollerup, S. McAdams, Acoustic and categorical dissimilarity of musical timbre: evidence from asymmetries between acoustic and chimeric sounds. Front. Psychol. 6, 1977 (2016).

  43. D. A. Depireux, J. Z. Simon, D. J. Klein, S. A. Shamma, Spectro-temporal response field characterization with dynamic ripples in ferret primary auditory cortex. J. Neurophysiol. 85(3), 1220–1234 (2001).

  44. A. M. H. J. Aertsen, P. I. M. Johannesma, The spectro-temporal receptive field. Biol. Cybernet. 42(2), 133–143 (1981).

  45. E. De Boer, P. Kuyper, Triggered correlation. IEEE Trans. Biomed. Eng. 3, 169–179 (1968).

  46. P. Flandrin, Time-frequency/time-scale analysis (Academic press, Salt Lake City, 1998).

  47. J. Eggermont, Wiener and Volterra analyses applied to the auditory system. Hear. Res. 66(2), 177–201 (1993).

  48. D. J. Klein, D. A. Depireux, J. Z. Simon, S. A. Shamma, Robust spectrotemporal reverse correlation for the auditory system: optimizing stimulus design. J. Comput. Neurosci. 9(1), 85–111 (2000).

  49. F. E. Theunissen, K. Sen, A. J. Doupe, Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds. J. Neurosci. 20(6), 2315–2331 (2000).

  50. T. Chi, P. Ru, S. A. Shamma, Multiresolution spectrotemporal analysis of complex sounds. J. Acoust. Soc. Am. 118(2), 887–906 (2005).

  51. K. Patil, M. Elhilali, Biomimetic spectro-temporal features for music instrument recognition in isolated notes and solo phrases. EURASIP J. Audio. Speech. Music. Process. 2015(1), 27 (2015).

  52. E. Thoret, P. Depalle, S. McAdams, Perceptually salient spectrotemporal modulations for recognition of sustained musical instruments. J. Acoust. Soc. Am. 140(6), EL478–EL483 (2016).

  53. S. Mishra, B. L. Sturm, S. Dixon, in Proceedings of the International Society on Music Information Retrieval (ISMIR) Conference. Understanding a Deep Machine Listening Model Through Feature Inversion, (2018), pp. 755–762.

  54. E. Hemery, J. J. Aucouturier, One hundred ways to process time, frequency, rate and scale in the central auditory system: a pattern-recognition meta-analysis. Front. Comput. Neurosci. 9, 80 (2015).

  55. M. Andreux, T. Angles, G. Exarchakis, R. Leonarduzzi, G. Rochette, L. Thiry, J. Zarka, S. Mallat, E. Belilovsky, J. Bruna, et al, Kymatio: Scattering Transforms in Python. J. Mach. Learn. Res. 21(60), 1–6 (2020).

  56. V. Lostanlen, F. Hecker, in Proceedings of the Digital Audio Effects Conference (DAFX). The Shape of RemiXXXes to Come: Audio texture synthesis with time–frequency scattering, (2019).

  57. S. Mallat, Understanding deep convolutional networks. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 374(2065), 20150203 (2016).

  58. M. Caetano, C. Saitis, K. Siedenburg, in Timbre: Acoustics, perception, and cognition. Audio content descriptors of timbre (Springer, New York, 2019), pp. 297–333.

  59. C.-W. Wu, C. Dittmar, C. Southall, R. Vogl, G. Widmer, J. Hockman, M. Muller, A. Lerch, A review of automatic drum transcription. IEEE Trans. Audio. Speech. Lang. Process. 26(9), 1457–1483 (2018).

  60. A. Pearce, T. Brookes, R. Mason, Modelling Timbral Hardness. Appl. Sci. 9(3), 466 (2019).

  61. B. L. Giordano, C. Guastavino, E. Murphy, M. Ogg, B. K. Smith, S. McAdams, Comparison of methods for collecting and modeling dissimilarity data: applications to complex sound stimuli. Multivar. Behav. Res. 46(5), 779–811 (2011).

  62. T. M. Elliott, L. S. Hamilton, F. E. Theunissen, Acoustic structure of the five perceptual dimensions of timbre in orchestral instrument tones. J. Acoust. Soc. Am. 133(1), 389–404 (2013).

  63. B. W. Kernighan, S. Lin, An efficient heuristic procedure for partitioning graphs. Bell Syst. Tech. J. 49(2), 291–307 (1970).

  64. E.-H. Han, G. Karypis, V. Kumar, Scalable parallel data mining for association rules. 26(2) (1997). ACM.

  65. A. Strehl, J. Ghosh, Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3(Dec), 583–617 (2002).

  66. C. Schörkhuber, A. Klapuri, in Proceedings of the Sound and Music Computing (SMC) Conference. Constant-Q transform toolbox for music processing, (2010).

  67. V. Lostanlen, S. Sridhar, A. Farnsworth, J. P. Bello, in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Learning the helix topology of musical pitch, (2020).

  68. S. Mallat, Group invariant scattering. Commun. Pure Appl. Math. 65(10), 1331–1398 (2012).

  69. A. Jansen, M. Plakal, R. Pandya, D. P. W. Ellis, S. Hershey, J. Liu, R. C. Moore, R. A. Saurous, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Unsupervised learning of semantic audio representations (IEEE, 2018), pp. 126–130.

  70. A. Bellet, A. Habrard, M. Sebban, Metric learning (Morgan & Claypool Publishers, San Rafael, 2015).

  71. S. Handel, M. L. Erickson, A rule of thumb: The bandwidth for timbre invariance is one octave. Music. Percept. 19(1), 121–126 (2001).

  72. J. Marozeau, A. de Cheveigné, S. McAdams, S. Winsberg, The dependency of timbre on fundamental frequency. J. Acoust. Soc. Am. 114(5), 2946–2957 (2003).

  73. K. M. Steele, A. K. Williams, Is the bandwidth for timbre invariance only one octave? Music. Percept. 23(3), 215–220 (2006).

  74. C. Wang, V. Lostanlen, E. Benetos, E. Chew, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Playing technique recognition by joint time–frequency scattering, (2020).

  75. M. Elhilali, T. Chi, S. A. Shamma, A spectro-temporal modulation index (STMI) for assessment of speech intelligibility. Speech Commun. 41(2-3), 331–348 (2003).

  76. A. Bellur, M. Elhilali, in Proceedings of the Annual Conference on Information Sciences and Systems (CISS). Detection of speech tokens in noise using adaptive spectrotemporal receptive fields (IEEE, 2015), pp. 1–6.

  77. D. Emmanouilidou, K. Patil, J. West, M. Elhilali, in Proceedings of the International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS). A multiresolution analysis for detection of abnormal lung sounds (IEEE, 2012), pp. 3139–3142.

  78. J. Black, N. Hashimzade, G. Myles, A dictionary of economics (Oxford university press, Oxford, 2012).

  79. C.-E. Cella, D. Ghisi, V. Lostanlen, F. Lévy, J. Fineberg, Y. Maresz, in Proceedings of the International Computer Music Conference (ICMC). OrchideaSOL: A Dataset of Extended Instrumental Techniques for Computer-aided Orchestration, (2020).

  80. K. Siedenburg, M. R. Schädler, D. Hülsmeier, Modeling the onset advantage in musical instrument recognition. J. Acoust. Soc. Am. 146(6), EL523–EL529 (2019).

  81. V. Lostanlen, in Florian Hecker: Halluzination, Perspektive, Synthese, ed. by N. Schafhausen, V. J. Müller. On Time-frequency Scattering and Computer Music (Sternberg Press, Berlin, 2019).

  82. C. Baugé, M. Lagrange, J. Andén, S. Mallat, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Representing environmental sounds using the separable scattering transform (IEEE, 2013), pp. 8667–8671.

  83. C. Wang, E. Benetos, V. Lostanlen, E. Chew, in Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference. Adaptive time–frequency scattering for periodic modulation recognition in music signals, (2019).

  84. M. R. Schädler, B. Kollmeier, Separable spectro-temporal Gabor filter bank features: reducing the complexity of robust features for automatic speech recognition. J. Acoust. Soc. Am. 137(4), 2047–2059 (2015).

  85. Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, C.-J. Lin, Training and testing low-degree polynomial data mappings via linear SVM. J. Mach. Learn. Res. 11(Apr), 1471–1490 (2010).



We wish to thank Philippe Brandeis, Étienne Graindorge, Stéphane Mallat, Adrien Mamou-Mani, and Yan Maresz for contributing to the TICEL research project. We also wish to thank the students of the Paris Conservatory and all anonymous participants to our study.


This work is partially supported by the Paris sciences et lettres (PSL) TICEL project. This work is partially supported by the European Research Council (ERC) award 320959 (InvariantClass). This work is partially supported by the National Science Foundation (NSF) award 1633259 (BIRDVOX). This work is partially supported by the Flatiron Institute, a division of the Simons Foundation.

Author information

Authors and Affiliations



VL provided guidance in the design of the computational experiments and was a major contributor in writing the manuscript. CEH conducted part of the computational experiments. MR designed and implemented the listening tests. GL designed the listening tests and provided guidance in the design of the computational experiments. JA provided guidance in the design of the computational experiments, participated to the data analysis and to the writing of the manuscript. ML designed and conducted the computational experiments, the data analysis and participated to the writing of the paper. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Mathieu Lagrange.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


About this article


Cite this article

Lostanlen, V., El-Hajj, C., Rossignol, M. et al. Time–frequency scattering accurately models auditory similarities between instrumental playing techniques. J AUDIO SPEECH MUSIC PROC. 2021, 3 (2021).
