Time–frequency scattering accurately models auditory similarities between instrumental playing techniques

Instrumental playing techniques such as vibratos, glissandos, and trills often denote musical expressivity, both in classical and folk contexts. However, most existing approaches to music similarity retrieval fail to describe timbre beyond the so-called “ordinary” technique, use instrument identity as a proxy for timbre quality, and do not allow for customization to the perceptual idiosyncrasies of a new subject. In this article, we ask 31 human participants to organize 78 isolated notes into a set of timbre clusters. Analyzing their responses suggests that timbre perception operates within a more flexible taxonomy than those provided by instruments or playing techniques alone. In addition, we propose a machine listening model to recover the cluster graph of auditory similarities across instruments, mutes, and techniques. Our model relies on joint time–frequency scattering features to extract spectrotemporal modulations as acoustic features. Furthermore, it minimizes triplet loss in the cluster graph by means of the large-margin nearest neighbor (LMNN) metric learning algorithm. Over a dataset of 9346 isolated notes, we report a state-of-the-art average precision at rank five (AP@5) of 99.0%±1. An ablation study demonstrates that removing either the joint time–frequency scattering transform or the metric learning algorithm noticeably degrades performance.


Introduction
Music information retrieval (MIR) operates at two levels: symbolic and auditory [1]. By relying on a notation system, the symbolic level allows the comparison of musical notes in terms of quantitative attributes, such as duration, pitch, and intensity at the source. Timbre, in contrast, is a qualitative attribute of music and is thus not reducible to a one-dimensional axis [2]. As a result, symbolic representations describe timbre indirectly, either via visuotactile metaphors (e.g., bright, rough, and so forth [3]) or via an instrumental playing technique (e.g., bowed or plucked) [4].
*Correspondence: mathieu.lagrange@ls2n.fr. 1 LS2N, CNRS, Centrale Nantes, Nantes University, 1, rue de la Noe, 44000 Nantes, France. Full list of author information is available at the end of the article.
Despite their widespread use, purely linguistic references to timbre fail to convey the intention of the composer. On the one hand, adjectives such as bright or rough are prone to misunderstanding, as they do not prescribe any musical gesture that is capable of achieving them [5]. On the other hand, the sole mention of a playing technique does not specify its effect in terms of auditory perception. For instance, although the term breathy alludes to a playing technique that is specific to wind instruments, a cellist may accomplish a seemingly breathy timbre by bowing near the fingerboard, i.e., sul tasto in the classical terminology. Yet, in a diverse instrumentarium, the semantic similarity between playing technique denominations does not reflect such acoustical similarity [6].
Although a notation-based study of playing techniques in music has research potential in music information retrieval [7], the prospect of modeling timbre perception necessarily exceeds the symbolic domain. Instead, it involves a cognitive process which arises from the subjective experience of listening [8]. The simulation of this cognitive process amounts to the design of a multidimensional feature space wherein some distance function evaluates pairs of stimuli. Rather than merely discriminating instruments as mutually exclusive categories, this function must reflect judgments of acoustic dissimilarity, all other parameters (duration, pitch, and intensity) being equal [9].

Use case
Beyond the overarching challenge of building a robust predictive model of human listening behavior, the main practical application of timbre similarity retrieval lies in the emerging topic of computer-assisted orchestration [10]. In this context, the composer queries the software with an arbitrary audio signal. The outcome is another audio signal which is selected from a database of instrumental samples and is perceptually similar to the query.
The advantage of this search is that, unlike the query, the retrieved sound is precisely encoded in terms of duration, pitch, intensity, instrument, and playing technique. Thus, following the esthetic tradition of spectralism in contemporary music creation, the computer serves as a bridge from the auditory level to the symbolic level, i.e., from a potentially infinite realm of timbral sensations to a musical score of predefined range [11].
Composers may personalize their search engine by manually defining a cluster graph via the Cyberlioz interface (see Section 3). This preliminary annotation lasts 30 to 60 min, which is relatively short in comparison with the duration of the N = 9346 audio samples in the database: i.e., roughly three hours of audio.

Goal
This article proposes a machine listening system which computes the dissimilarity in timbre between two audio samples. Crucially, this dissimilarity is not evaluated in terms of acoustic tags, but in terms of ad hoc clusters, as defined by a human consensus of auditory judgments. Our system consists of two stages: unsupervised feature extraction and supervised metric learning. The feature extraction stage is a nonlinear map which relies on the joint time-frequency scattering transform [12,13], followed by per-feature Gaussianization [14]. It encodes patterns of spectrotemporal modulation in the acoustic query while offering numerical guarantees of stability to local deformations [15]. The metric learning stage is a linear map, optimized via large-margin nearest neighbors (LMNN) [16]. It reweights scattering coefficients so that pairwise distances between samples more accurately reflect human judgments on a training set. These human judgments may be sourced from a single subject or from the intersubjective consensus of multiple participants [17]. Figure 1 summarizes our experimental protocol: it illustrates how visual annotations (top) can inform feature extraction (center) to produce a nearest-neighbor search engine which is consistent with human judgments of timbre similarity (bottom).
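To make the role of the linear map concrete, here is a minimal sketch of nearest-neighbor retrieval once such a map is available. It is an illustration under stated assumptions, not the system described in this article: the function name `nearest_neighbors`, the matrix name `L`, and the toy feature values are all hypothetical, and the map is set to the identity instead of being learned by LMNN.

```python
import numpy as np

def nearest_neighbors(features, L, query_index, k=5):
    """Return the indices of the k samples closest to the query under the linear map L."""
    projected = features @ L.T                  # apply the linear map to every sample
    diffs = projected - projected[query_index]
    dists = np.linalg.norm(diffs, axis=1)       # Euclidean distance in the mapped space
    dists[query_index] = np.inf                 # exclude the query itself
    return np.argsort(dists)[:k]

# Toy data (hypothetical): 6 samples described by 3 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 3))
L = np.eye(3)                                   # identity map, i.e., plain Euclidean distance
neighbors = nearest_neighbors(features, L, query_index=0, k=2)
```

Replacing the identity by a learned matrix changes which samples count as neighbors without changing the search procedure itself.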

Approach
The main contribution of this article can be formulated as the intersection between three topics. To the best of our knowledge, prior literature has addressed these topics separately, but never in combination.
First, our dataset encompasses a broad range of extended playing techniques, well beyond the so-called "ordinary" mode of acoustic production. Specifically, we fit pairwise judgments for 78 different techniques arising from 16 instruments, some of which include removable timbre-altering devices such as mutes.
Secondly, we purposefully disregard the playing technique metadata underlying each audio sample during the training phase of our model. In other words, we rely on listeners, not performers, to define and evaluate the task at hand.
Thirdly, we supplement our quantitative benchmark with visualizations of time-frequency scattering coefficients in the rate-scale domain for various typical samples of instrumental playing techniques. These visualizations are in line with visualizations of the modulation power spectrum in auditory neurophysiology [18], while offering an accelerated algorithm for scalable feature extraction.
Our paper strives to fill the gap in scholarship between MIR and music cognition in the context of extended playing techniques. From the standpoint of MIR, the model presented here offers an efficient and generic multidimensional representation for timbre similarity, alongside theoretical guarantees of robustness to elastic deformations in the time-frequency domain. Conversely, from the standpoint of music cognition, our model offers a scalable and biologically plausible surrogate for stimulus-based collection of acoustic dissimilarity judgments, which is readily tailored to subjective preferences.

Related work
Timbre involves multiple time scales in conjunction, from a few microseconds for an attack transient to several seconds for a sustained tone. Therefore, computational models of timbre perception must summarize acoustic information over a long analysis window [19]. Mapping this input to a feature space in which distances denote timbral dissimilarity requires a data-driven stage of dimensionality reduction. In this respect, the scientific literature exhibits a methodological divide as regards the collection of human-annotated data [20]: while the field of MIR mostly encodes timbre in the form of "audio tags," music psychology mostly measures timbre similarity directly from pairwise similarity judgments.

Automatic classification of musical instruments and playing techniques
On the one hand, most publications in music information retrieval cast timbre modeling as an audio classification problem [21][22][23][24][25][26][27][28][29][30]. In this context, the instrumentation of each musical excerpt serves as an unstructured set of "audio tags," encoded as binary outputs within some predefined label space. Because such tags often belong to the metadata of music releases, the process of curating a training set for musical instrument classification requires little or no human intervention. Although scraping user-generated content from online music platforms may not always reflect the true instrumentation with perfect accuracy, it offers a scalable and ecologically valid insight into the acoustic underpinnings of musical timbre. Furthermore, supplementing user-generated content with the outcome of a crowdsourced annotation campaign allows an explicit verification of instrument tags. For instance, the Open-MIC dataset [31], maintained by the Community for Open and Sustainable Music Information Research (COSMIR) [32], comprises a vast corpus of 20k polyphonic music excerpts spanning 20 instruments as a derivative of the Free Music Archive (FMA) dataset [33]. Another example is the Medley-solos-DB dataset [34], which comprises 21k monophonic excerpts from eight instruments as a derivative of the MedleyDB dataset of multitrack music [35].
Over the past decade, the availability of large digital audio collections, together with the democratization of high-performance computing on dedicated hardware, has spurred the development of deep learning architectures in music instrument recognition [36][37][38]. Notwithstanding the growing accuracy of these architectures in the large-scale data regime, it remains unclear how to extend them from musical instrument recognition to playing technique recognition, where labeled samples are considerably scarcer [39]. We refer to [40] for a recent review of the state of the art in this domain.

Spectrotemporal receptive fields (STRF) in music cognition
On the other hand, the field of music psychology investigates timbre with the aim of discovering its physiological and behavioral foundations [41]. In this setting, prior knowledge of instrumentation, however accurate, does not suffice to conduct a study on timbre perception: rather, timbre perception relies on an interplay of acoustic and categorical information [42]. Yet, collecting subjective responses to acoustic stimuli is a tedious and unscalable procedure, which restricts the size of the musical corpus under study. These small corpus sizes hamper the applicability of optimization algorithms for representation learning, such as stochastic gradient descent in deep neural networks. While training artificial neurons is prone to statistical overfitting, advanced methods in electrophysiology make it possible to observe the firing patterns of biological neurons in the presence of controlled stimuli. This observation, originally carried out on the ferret, has led to a comprehensive mapping of the primary auditory cortex in terms of its spectrotemporal receptive fields (STRFs) [43]. The STRF of a neuron is a function of time and frequency which represents the optimal predictor of its post-stimulus time histogram during exposure to a diverse range of auditory stimuli [44]. The simplest method to compute it in practice is by reverse correlation, i.e., by averaging all stimuli that trigger an action potential [45]. Historically, STRFs were defined by their Wigner-Ville distribution [46], thereby sparing the choice of a tradeoff in time-frequency localization, but eliciting cross-term interferences [47]. Since then, the STRF of a neuron has been redefined as a spectrographic representation of its spike-triggered average [48].
Although this new definition is necessarily tied to a choice of spectrogram parameters, it yields more interpretable patterns than a Wigner-Ville distribution. In particular, a substantial portion of spectrographic STRFs exhibit a ripple-like response around a given region (t, λ) of the time-frequency domain [49]. This response can be approximately described by a pair of scalar values: a temporal modulation rate α in Hertz and a frequential modulation rate β in cycles per octave.
Interestingly, both α and β appear to be arranged in a geometric series and independent from the center time t and center frequency λ. This observation has led auditory neuroscientists to formulate an idealized computational model for STRFs, known as the "full cortical model" [50], which densely covers the rate-scale domain (α, β) using geometric progressions. Because they do not require a data-driven training procedure, STRFs yield a useful form of domain-specific knowledge for downstream machine listening applications, especially when the number of annotated samples is relatively small.
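The reverse-correlation estimate mentioned earlier in this section can be sketched in a few lines. This is only an illustration on toy numbers: the function name `spike_triggered_average`, the lag window, and the input arrays are assumptions, and real electrophysiological pipelines involve further steps that are not shown here.

```python
import numpy as np

def spike_triggered_average(spectrogram, spike_frames, n_lags):
    """Average the n_lags spectrogram frames that precede each spike.

    spectrogram: array of shape (n_freqs, n_frames).
    spike_frames: frame indices at which an action potential occurred.
    """
    sta = np.zeros((spectrogram.shape[0], n_lags))
    count = 0
    for t in spike_frames:
        if t >= n_lags:                          # skip spikes without enough history
            sta += spectrogram[:, t - n_lags:t]
            count += 1
    return sta / max(count, 1)

# Toy stimulus (hypothetical): 2 frequency bins, 6 time frames.
spectrogram = np.arange(12, dtype=float).reshape(2, 6)
sta = spike_triggered_average(spectrogram, spike_frames=[3, 5], n_lags=2)
```

The result has one row per frequency bin and one column per lag, which is the time-frequency layout in which ripple-like responses are read off.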

Spectrotemporal receptive fields (STRFs) as a feature extractor
Over recent years, several publications have employed the full cortical model as a feature extractor for a task of musical instrument classification, both in isolated recordings [18] and in solo phrases [51]. These biologically inspired features outperform the state of the art, especially in the small data regime where deep learning is inapplicable. Furthermore, the confusion matrix of the full cortical model in the label space of musical instruments is strongly correlated with the confusion matrix between a human listener and the ground truth. Another appeal of the full cortical model is that the three-way tensor of frequency λ, rate α, and scale β can be segmented into contiguous regions of maximal perceptual relevance for each instrument [52]. This is unlike fully end-to-end learning architectures, whose post hoc interpretability requires advanced techniques for feature inversion [53]. Lastly, beyond the realm of supervised classification, a previous publication [54] has shown that query-by-example search with STRFs makes it possible to discriminate categories of environmental soundscapes, even after temporal integration and unsupervised dimensionality reduction.
The reasons above make STRFs an appealing feature extractor for a perceptual description of timbral similarity across instrumental playing techniques. Nonetheless, current implementations of STRFs suffer from a lack of scalability, which explains why they have found few applications in MIR thus far. Indeed, the full cortical model is usually computed via two-dimensional Fourier transforms over adjacent time-frequency regions, followed by averaging around specific rates and scales. This approach requires a uniform discretization of the scalogram, and thus an oversampling of the lower-frequency subbands to the Nyquist frequency of the original signal.
In contrast, joint time-frequency scattering offers a faster extraction of spectrotemporal modulations while preserving properties of differentiability [55] and invertibility [56]. Such acceleration is made possible by discretizing the wavelet transforms involved in time-frequency scattering according to a multirate scheme, both along the time and the log-frequency variables [13]. In this multirate scheme, every subband is discretized at its critical sample rate, i.e., in proportion to its center frequency. As a by-product, the multirate approach draws an explicit connection between scattering networks and deep convolutional networks, because both involve numerous convolutions with small kernels, pointwise rectifying nonlinearities, and pooling operations [57].
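The benefit of critical sampling along time can be gauged with a back-of-the-envelope count. The sketch below compares the relative number of samples per unit of time when every subband is sampled at the top rate versus in proportion to its own center frequency. The filterbank layout (12 bands per octave over 8 octaves) is a hypothetical choice for illustration, and the proportionality constant is ignored, so only the ratio is meaningful.

```python
import numpy as np

Q, n_octaves = 12, 8                                    # hypothetical filterbank layout
lambdas = 2.0 ** (np.arange(n_octaves * Q + 1) / Q)     # center frequencies, relative units

uniform_cost = len(lambdas) * lambdas.max()   # every subband sampled at the highest rate
multirate_cost = lambdas.sum()                # each subband sampled in proportion to lambda
savings = uniform_cost / multirate_cost       # how many times fewer samples are stored
```

Because the center frequencies follow a geometric progression, the multirate total is dominated by the top few subbands, which is why the saving grows with the number of octaves.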
Moreover, a quantitative benchmark over Medley-solos-DB has demonstrated that joint time-frequency scattering, unlike purely temporal scattering, outperforms deep convolutional networks in supervised musical instrument classification, even in a relatively large data regime with 500 to 5k samples per class [13]. However, it remains to be seen whether joint time-frequency scattering is capable of fine-grained auditory categorization, involving variability in instrument, mute, and playing technique. In addition, previous publications on joint time-frequency scattering lack a human-centric evaluation, independently from any classification task. Beyond the case of STRFs, we refer to [58] for a detailed review of the state of the art on audio descriptors of timbre.

Perceptual data collection
The philharmonic orchestra encompasses four families of instruments: strings, woodwinds, brass, and percussion. In this article, we focus on the first three, and leave the question of learning auditory similarities between percussion instruments to future research. We refer to [59] and [60] for reviews of the recent literature on the timbre modeling of percussive instruments, from the standpoints of MIR and music cognition, respectively.

Dataset
We consider a list of 16 instruments: violin (Vn), viola (Va), cello (Vc), contrabass (Cb), concert harp (Hp), Spanish guitar (Gtr), accordion (Acc), flute (Fl), soprano clarinet (BbCl), alto saxophone (ASax), oboe (Ob), bassoon (Bn), trumpet in C (TpC), French horn (Hn), tenor trombone (TTbn), and bass tuba (BBTb). Among this list, the first six are strings, the next six are woodwinds, and the last four are brass. Some of these instruments may be temporarily equipped with timbre-altering mutes, such as a rubber sordina on the bridge of a violin or an aluminum "wah-wah," also known as harmon, inside the bell of a trumpet. Once augmented with mutes, the list of 16 instruments grows to 33. Furthermore, every instrument, whether equipped with a mute or not, affords a panel of playing techniques ranging in size between 11 (for the accordion) and 41 (for the bass tuba). In the rest of this paper, we abbreviate instrument-mute-technique by means of the acronym "IMT." One example of IMT is TpC+S-ord, i.e., trumpet in C with a straight mute played in the ordinary technique. Another example of IMT is Vn-pont, i.e., violin without any mute played in the sul ponticello technique (bowing near the bridge).
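As an illustration of the IMT naming scheme, the two examples above can be split into their three components as follows. The grammar is inferred from these two labels only (an optional "+mute" suffix on the instrument, then a hyphen, then the technique); it is an assumption, the helper name `parse_imt` is hypothetical, and longer sample names carrying pitch and dynamics would need additional handling.

```python
def parse_imt(label):
    """Split an IMT label such as 'TpC+S-ord' into (instrument, mute, technique)."""
    head, _, technique = label.partition("-")    # the first hyphen introduces the technique
    instrument, _, mute = head.partition("+")    # an optional '+' introduces the mute
    return instrument, (mute or None), technique

trumpet = parse_imt("TpC+S-ord")
violin = parse_imt("Vn-pont")
```

Here `trumpet` holds the instrument, the straight mute, and the ordinary technique, while `violin` has no mute component.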
Performers can play each IMT at various pitches according to the tessitura of their instrument. This tessitura may depend on the choice of playing technique but is independent of the choice of mute. Among the 16 instruments in this study, the two instruments with the widest and narrowest tessituras, in their respective ordinary techniques, are the accordion (81 semitones) and the trumpet in C (32 semitones), respectively. Lastly, each IMT may be played at up to five intensity dynamics, ranging from quietest to loudest as pianissimo (pp), piano (p), mezzo forte (mf), forte (f), and fortissimo (ff). Resorting to a non-ordinary playing technique may restrict both the tessitura and the dynamics range of the instrument-mute pair under consideration. For example, the pitch of pedal tones in brass instruments is tied to the fundamental mode of the bore, i.e., usually B or F. Likewise, the intensity of key clicks in the oboe is necessarily pp, while the intensity of snap pizzicato à la Bartók in plucked strings is necessarily ff.
In summary, audio signals from isolated musical notes may vary across three categorical variables (instrument, mute, and technique) and two quantitative variables (intensity and pitch). The Studio On Line dataset (SOL), recorded at Ircam in 1998, offers a joint sampling of these variables. The version of SOL that we use throughout this paper, named "0.9 HQ," amounts to a total of 25444 audio signals. Beyond playing techniques, we should note that SOL erases other factors of acoustic variability, such as identity of performer, identity of instrument manufacturer, audio acquisition equipment, and room response characteristics, which are all restricted to singletons. Addressing these factors of variability is beyond the scope of this paper, which focuses on the influence of playing technique. Despite this restriction, the SOL dataset remains impractically large for collecting human similarity judgments. Our protocol addresses this problem by means of three complementary approaches: disentanglement of factors, expert pre-screening, and the use of an efficient annotation interface.

Disentanglement of factors
First, we purposefully disentangle categorical variables (IMTs) from continuous variables (pitch and intensity) in the SOL dataset. Indeed, to a first approximation, the perception of timbre is invariant to pitch and intensity. Therefore, we select auditory stimuli according to a reference pitch and a reference intensity, in our case, middle C (C4) and mf. After this selection, every IMT triplet contains a single acoustic exemplar, regarded as canonical in the following. The number of canonical stimuli for the entire SOL dataset is equal to 235. We should note, however, that the proposed pitch and intensity cannot be strictly enforced across all IMTs. Indeed, as explained above, a fraction of IMTs can only be achieved at restricted values of pitch and intensity parameters, e.g., pedal tones or key clicks. Therefore, at a small cost of consistency, we only enforce the pitch-intensity reference (i.e., C4 and mf) when practically feasible, and fall back to other pitches and intensities if necessary.
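The selection of one canonical stimulus per IMT, with a fallback when the reference is unavailable, might be sketched as follows. The fallback rule used here (prefer the sample matching the most reference attributes) is an assumption made for illustration, since the text does not specify how the alternative pitch or intensity is chosen. The record fields, the function name `pick_canonical`, and the label "Ob-key-click" are likewise hypothetical and do not reflect the SOL metadata schema.

```python
def pick_canonical(samples, ref_pitch="C4", ref_dynamics="mf"):
    """Keep one sample per IMT, preferring the reference pitch and dynamics."""
    best = {}
    for s in samples:
        # Number of reference attributes matched by this sample: 2, 1, or 0.
        score = int(s["pitch"] == ref_pitch) + int(s["dynamics"] == ref_dynamics)
        if s["imt"] not in best or score > best[s["imt"]][0]:
            best[s["imt"]] = (score, s)
    return {imt: s for imt, (_, s) in best.items()}

# Hypothetical records for illustration only.
samples = [
    {"imt": "Vn-pont", "pitch": "C4", "dynamics": "ff"},
    {"imt": "Vn-pont", "pitch": "C4", "dynamics": "mf"},
    {"imt": "Ob-key-click", "pitch": "C4", "dynamics": "pp"},
]
canonical = pick_canonical(samples)
```

In this toy input, the violin IMT resolves to its mf sample, whereas the key-click IMT falls back to pp because no mf sample exists for it.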

Expert pre-screening
Secondly, we reduce the number of IMTs in our study by focusing on those which are deemed to be most relevant.
Here, we define the relevance of an IMT as the possibility of imitating it by means of another IMT from a different instrument. One example of such imitation is the acoustic similarity between slap tonguing in reed instruments and a snap pizzicato in string instruments. To collect perceptual ratings of relevance, we recruited two professors in music composition at the Paris Conservatory (CNSMDP). Each of them inspected the entire corpus of 235 IMTs and annotated them in terms of relevance according to a Likert scale with seven ticks. In this Likert scale, the value 1 (least relevant) denotes that the IMT under consideration has an idiosyncratic timbre, so that humans are unlikely to pair it with other IMTs. Conversely, the value 7 (most relevant) denotes that the IMT under consideration bears a strong similarity with some other IMT from the corpus.
Once both experts completed their annotations, we retained all IMTs whose average score was equal to 3 or higher, thus resulting in a shortlist of N = 78 IMTs (see Tables 1 and 2). It is worth noting that, according to both experts, the timbre of the accordion was judged too idiosyncratic to be relevant for this experiment, regardless of playing technique. Indeed, the accordion is the only instrument in the aforementioned list of instruments to have free reeds, keyboard-based actuation, or handheld airflow. Consequently, regardless of mute and technique, the set of instruments I in our study contains 15 elements.

Efficient annotation interface
Thirdly, we design a graphical user interface for partitioning a corpus of short audio samples. The need for such an interface arises from the unscalability of Likert scales in the context of pairwise similarity judgments. Assuming that similarity is a symmetric quantity, collecting a dense matrix of continuously valued ratings of similarity among a dataset of N items would require $\frac{1}{2}(N^2 - N)$ Likert scales. In the case of N = 78 IMTs, the task would amount to about 3k horizontal sliders, i.e., several hours of cumbersome work for the human annotator.
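For reference, the count of pairwise ratings quoted above follows from evaluating the number of unordered pairs at N = 78:

```latex
\frac{1}{2}\left(N^2 - N\right)\Big|_{N=78}
  = \frac{1}{2}\left(6084 - 78\right)
  = \frac{6006}{2}
  = 3003
```

This is the figure rounded to "about 3k" in the text.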
Engaging as many participants as possible in our study called for a more streamlined form of human-computer interaction, even if it sacrificed the availability of continuously valued ratings. To this end, we implemented a web application, named Cyberlioz, in which the user can spontaneously listen to sounds and arrange them into clusters of timbre similarity. The name Cyberlioz is a portmanteau of the prefix cyber- and the name of the French composer Hector Berlioz. The choice is by no means coincidental: Berlioz is famous for having, in his Treatise on Orchestration (1844), placed a particular focus on the role of timbre as a parameter for musical expression.
Cyberlioz consists of a square panel displaying a collection of circular gray dots, each of them corresponding to one of the IMTs, and initially distributed uniformly at random. Hovering the screen pointer over a dot triggers the playback of a representative audio sample of this IMT, i.e., C4 and mf in most cases. Furthermore, each dot can be freely placed on the screen by clicking, dragging, and dropping. Lastly, the user can assign a color to each dot among a palette of 20 hues. The goal of the Cyberlioz interface is to form clusters of timbre similarity between IMTs, expressed by sameness of color.
Cyberlioz implements a data collection procedure known as "free sorting." In comparison with the direct collection of timbre dissimilarity ratings, free sorting is more efficient yet less accurate [61]. We refer to [62] for an example protocol in which timbre similarity judgments rely on stimuli pairs rather than on a free sorting task.

Name | Instrument | Mute
 | Trumpet in C | Cup
TpC+H-ord-C4-mf | Trumpet in C | Harmon
TpC+S-ord-C4-mf | Trumpet in C | Straight
TpC+W-ord-closed-C4-mf | Trumpet in C | Wah (closed)
Vc+SP-ord-C4-mf-1c | Cello | Piombo

In comparison with web-based forms, Cyberlioz offers a more intuitive and playful user experience, while limiting the acquisition of similarity judgments to a moderate duration of 30 to 60 min for each participant. Another advantage of Cyberlioz is that it presents all stimuli at once rather than according to a randomized sequence.
In May and June 2016, we recruited volunteers to use Cyberlioz on their own computers, via a web browser, equipped with a pair of earphones. The subjects were asked to "cluster sounds into groups by assigning the same color to the corresponding dots according to how similar the sounds are." We publicized this study on the internal mailing list of students at CNSMDP, as well as two international mailing lists for research in music audio processing: AUDITORY and ISMIR Community. Within 2 months, K = 31 participants accessed Cyberlioz and completed the task.
We did not collect personal information on the age, sex, or musical background of participants, because the goal of our perceptual study is to build a consensus of similarity judgments, rather than to compare demographic subgroups.
In particular, we leave the important question of the effect of musical training on the perception of auditory similarities between playing techniques to future work.

Hypergraph partitioning
Once the data collection campaign was complete, we analyzed the color assignments of each subject k and converted them into a cluster graph $G_k$, where the integer k is an anonymized subject index, ranging between 1 and K. For a given k, the graph $G_k$ contains N vertices, each representing a different IMT in the corpus. In $G_k$, an edge connects any two vertices m and n if the corresponding dots in Cyberlioz have the same color. Otherwise, there is no edge connecting m and n. Thus, $G_k$ contains as many connected components as the number of similarity clusters for the subject k, i.e., the number of distinct colors on the Cyberlioz interface in the response of k.
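The conversion from color assignments to a cluster graph can be sketched as an adjacency matrix. This is a minimal illustration, assuming that the response of one subject is available as a list of color indices, one per IMT; the function name `cluster_graph` and the toy response are hypothetical.

```python
import numpy as np

def cluster_graph(colors):
    """Adjacency matrix of a cluster graph: m and n are linked iff they share a color."""
    colors = np.asarray(colors)
    adjacency = colors[:, None] == colors[None, :]
    np.fill_diagonal(adjacency, False)           # no self-loops
    return adjacency

colors_k = [0, 1, 0, 2, 1]                       # hypothetical response over 5 IMTs
A_k = cluster_graph(colors_k)
n_clusters = len(set(colors_k))                  # number of distinct colors used
```

Each connected component of the resulting graph is a clique, so counting distinct colors and counting connected components give the same number.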
For a particular subject k, let us denote by $C_k$ the number of clusters in the graph $G_k$. Figure 2a shows the histogram of $C_k$ across the cohort of K = 31 participants. We observe that the number of clusters varies between 3 and 19 with a median value of 10. Accordingly, the number of samples belonging to a cluster varies between 1 (the most frequent value) and 50, as shown in Fig. 2b.
We aggregate the similarity judgments from all K participants by embedding them into a hypergraph H, that is, a graph whose edges may connect three or more vertices at once. Specifically, H contains N vertices, each representing an IMT, and each "hyperedge" in H corresponds to some connected component in one of the graphs $G_1, \ldots, G_K$. Then, we convert the hypergraph H back into a conventional graph $G_0$ by means of a combinatorial optimization algorithm known as hypergraph partitioning [63].
To construct $G_0$, we select a number of clusters that is equal to the maximal value of the $C_k$'s, that is, $C_0 = 19$. Then, we run hypergraph partitioning on H to assign each vertex i to one of the $C_0$ clusters in $G_0$. Intuitively, hypergraph partitioning optimizes a tradeoff between two objectives: first, balancing the size of all clusters in terms of their respective numbers of vertices, and secondly, keeping most hyperedges enclosed within as few distinct clusters as possible [64,65].
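The construction of the hypergraph that serves as input to this step can be sketched as an incidence matrix, with one column per cluster of each subject. The sketch below stops there: it does not implement hypergraph partitioning itself, which is left to the algorithms cited above. The function name `hypergraph_incidence` and the two toy responses are hypothetical, and the co-occurrence matrix is added only as a simple diagnostic.

```python
import numpy as np

def hypergraph_incidence(partitions):
    """Stack every cluster of every subject as one hyperedge (one column each)."""
    columns = []
    for colors in partitions:
        colors = np.asarray(colors)
        for c in np.unique(colors):
            columns.append(colors == c)          # membership indicator of this cluster
    return np.stack(columns, axis=1).astype(int)  # shape: (N, total number of hyperedges)

partitions = [[0, 0, 1, 1], [0, 1, 1, 1]]        # two hypothetical subjects, N = 4 items
H = hypergraph_incidence(partitions)
co_occurrence = H @ H.T                          # number of subjects grouping two items together
```

Every row of the incidence matrix sums to the number of subjects, because each item belongs to exactly one cluster per subject.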
While the graphs $G_1, \ldots, G_K$ encode the subjective similarity judgments of participants 1 to K, the graph $G_0$ represents a form of consensual judgment that is shared across all participants while discarding intersubjective variability. Although the rest of our paper focuses on the consensus $G_0$, it is worth pointing out that the same technical framework could apply to a single subject k, or to a subgroup of the K = 31 participants. This remark emphasizes the potential of our similarity learning method as a customizable tool for visualizing and extrapolating the timbre similarity space of a new subject.

Machine listening methods
The previous section described our protocol for collecting timbral similarity judgments between instrumental playing techniques. In this section, we aim to recover these similarity judgments from digital audio recordings according to a paradigm of supervised metric learning. To this end, we present a machine listening system combining joint time-frequency scattering and LMNN.

Joint time-frequency scattering transform
Let $\psi \in L^2(\mathbb{R}, \mathbb{C})$ be a complex-valued filter with zero average, dimensionless center frequency equal to one, and an equivalent rectangular bandwidth (ERB) equal to $1/Q$. We define a constant-Q wavelet filterbank as the family $\psi_\lambda : t \mapsto \lambda \psi(\lambda t)$. Each wavelet $\psi_\lambda$ has a center frequency of $\lambda$, an ERB of $\lambda/Q$, and an effective receptive field of $2\pi Q/\lambda$ in the time domain. In practice, we define $\psi$ as a Morlet wavelet:
$$\psi : t \mapsto \exp\left(-\frac{t^2}{2\sigma_\psi^2}\right)\left(\exp(2\pi \mathrm{i} t) - \kappa_\psi\right), \tag{1}$$
where the Gaussian width $\sigma_\psi$ grows in proportion with the quality factor $Q$ and the corrective term $\kappa_\psi$ ensures that $\psi$ has a zero average. Moreover, we discretize the frequency variable $\lambda$ according to a geometric progression of common ratio $2^{1/Q}$. Thus, the base-two logarithm of center frequency, denoted by $\log_2 \lambda$, follows an arithmetic progression. We set the constant quality factor of the wavelet filterbank $(\psi_\lambda)_\lambda$ to $Q = 12$, thus matching twelve-tone equal temperament in music.
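A minimal numerical sketch of such a filterbank is given below. Several choices are assumptions made for illustration: the Gaussian width is set to exactly Q (the text only states proportionality), the corrective term is computed numerically on the sampling grid instead of in closed form, and the three-octave range and time grid are arbitrary.

```python
import numpy as np

def morlet(t, sigma):
    """Morlet wavelet with unit center frequency; kappa enforces a zero average."""
    envelope = np.exp(-t**2 / (2 * sigma**2))
    carrier = np.exp(2j * np.pi * t)
    kappa = np.sum(envelope * carrier) / np.sum(envelope)   # numerical corrective term
    return envelope * (carrier - kappa)

Q = 12
n_octaves = 3                                    # hypothetical range for illustration
# Geometric progression of common ratio 2**(1/Q), relative to the lowest frequency.
lambdas = 2.0 ** (np.arange(n_octaves * Q + 1) / Q)

t = np.linspace(-100, 100, 40001)                # dimensionless time grid
psi = morlet(t, sigma=Q)                         # assumes the Gaussian width equals Q
lam = lambdas[Q]                                 # one octave above the lowest frequency
psi_lam = lam * morlet(lam * t, sigma=Q)         # dilation: lambda * psi(lambda * t)
```

With Q = 12, consecutive center frequencies are one equal-tempered semitone apart, so twelve steps along `lambdas` amount to a doubling of frequency.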
Convolving the wavelets in this filterbank with an input waveform $x \in L^2(\mathbb{R})$, followed by an application of the pointwise complex modulus, yields the wavelet scalogram
$$U_1 x(t, \lambda) = \left| x \ast \psi_\lambda \right|(t), \tag{2}$$
which is discretized similarly to the constant-Q transform of [66]. Then, we define a two-dimensional Morlet wavelet $\Psi$ of the form
$$\Psi : (t, u) \mapsto \exp\left(-\frac{t^2 + u^2}{2\sigma_\Psi^2}\right)\left(\exp(2\pi \mathrm{i} (t + u)) - \kappa_\Psi\right), \tag{3}$$
taking two real variables $t$ and $u$ as input. The former is the time variable while the latter is the base-two logarithm of frequency: $u = \log_2 \lambda$. Note that $u$ roughly corresponds to the human perception of relative pitch [67]. In the rest of this paper, we shall refer to $\Psi$ as a time-frequency wavelet.
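The scalogram can be approximated numerically by direct convolution, as in the following sketch. It is not the multirate implementation used in this article: every subband is computed at the full sample rate for simplicity. The sample rate, the test tone, the three center frequencies, and the truncation of each wavelet at four Gaussian widths are hypothetical choices, and the Gaussian width is again assumed to equal Q.

```python
import numpy as np

def morlet_bank(freqs, sr, Q):
    """Sampled Morlet wavelets lambda * psi(lambda * t), with t in seconds."""
    bank = []
    for lam in freqs:
        sigma_t = Q / lam                         # temporal width, assuming sigma_psi = Q
        t = np.arange(-4 * sigma_t, 4 * sigma_t, 1.0 / sr)
        envelope = np.exp(-t**2 / (2 * sigma_t**2))
        carrier = np.exp(2j * np.pi * lam * t)
        kappa = np.sum(envelope * carrier) / np.sum(envelope)   # zero-average correction
        bank.append(lam * envelope * (carrier - kappa))
    return bank

def scalogram(x, bank, sr):
    """Modulus of the convolution of x with each wavelet (integral approximated by a sum)."""
    return np.stack([np.abs(np.convolve(x, w, mode="same")) / sr for w in bank])

sr = 1000                                         # hypothetical sample rate (Hz)
time = np.arange(0, 2.0, 1.0 / sr)
x = np.sin(2 * np.pi * 100 * time)                # pure tone at 100 Hz
freqs = [50.0, 100.0, 200.0]                      # three center frequencies, one per octave
U1 = scalogram(x, morlet_bank(freqs, sr, Q=12), sr)
loudest_band = int(np.argmax(U1.mean(axis=1)))
```

For this test tone, the energy concentrates in the subband whose center frequency matches the tone, which is a quick sanity check on the filterbank.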
We choose the Gaussian width $\sigma_\Psi$ in Eq. 3 such that the quality factor of the Morlet wavelet $\Psi$ is equal to one, both over the time dimension and over the log-frequency dimension. Furthermore, the corrective term $\kappa_\Psi$ ensures that $\Psi$ has a zero average over $\mathbb{R}^2$, similarly to Eq. 1. From $\Psi$, we define a two-dimensional wavelet filterbank of the form:

$$\Psi_{\alpha,\beta}(t, u) = \alpha\beta\,\Psi(\alpha t, \beta u). \tag{4}$$

In the equation above, $\alpha$ is a temporal modulation rate and $\beta$ is a frequential modulation scale, following the terminology of spectrotemporal receptive fields (STRF, see Section 2). While $\alpha$ is measured in Hertz and is strictly positive, $\beta$ is measured in cycles per octave and may take positive as well as negative values. Both $\alpha$ and $\beta$ are discretized by geometric progressions of common ratio equal to two. Furthermore, the edge case $\beta = 0$ corresponds to $\Psi_{\alpha,\beta}$ being a Gaussian low-pass filter over the log-frequency dimension, while remaining a Morlet bandpass filter of center frequency $\alpha$ over the time dimension. We denote this low-pass filter by $\phi_F$ and set its width to $F = 2$ octaves. We now convolve the scalogram $U_1$ with time-frequency wavelets $\Psi_{\alpha,\beta}$ and apply the complex modulus, yielding the four-way tensor

$$U_2 x(t, \lambda, \alpha, \beta) = \big|U_1 x \circledast \Psi_{\alpha,\beta}\big|(t, \lambda), \tag{5}$$

where the circled asterisk operator denotes a joint convolution over time and log-frequency. In the equation above, the sample rate of time $t$ is proportional to $\alpha$. Conversely, the sample rate of log-frequency $u = \log_2 \lambda$ is proportional to $|\beta|$ if $\beta \neq 0$ and proportional to $F^{-1}$ otherwise. Let $\phi_T$ be a Gaussian low-pass filter. We define the joint time-frequency scattering coefficients of the signal $x$ as the four-way tensor

$$S_2 x(t, \lambda, \alpha, \beta) = \big(U_2 x \circledast (\phi_T \otimes \phi_F)\big)(t, \lambda, \alpha, \beta), \tag{6}$$

where the symbol $\otimes$ denotes the outer product over time and log-frequency. In the equation above, the sample rate of time $t$ is proportional to $T^{-1}$ and the sample rate of log-frequency $u = \log_2 \lambda$ is proportional to $F^{-1}$. Furthermore, the rate $\alpha$ spans a geometric progression ranging from $T^{-1}$ to $\lambda/Q$. In the following, we set the time constant to $T = 1000$ ms unless specified otherwise.
The tensor $S_2$ bears a strong resemblance to the idealized response of an STRF at the rate $\alpha$ and the scale $\beta$. Nevertheless, in comparison with the "full cortical model" [18], joint time-frequency scattering enjoys a thirtyfold reduction in dimensionality while covering a time span that is four times larger (1000 ms) and an acoustic bandwidth that is also four times larger (0-16 kHz). This is due to the multirate discretization scheme applied throughout the application of wavelet convolutions and pointwise modulus nonlinearities.
In addition to second-order scattering coefficients (Eq. 6), we compute joint time-frequency scattering at the first order by convolving the scalogram $U_1 x$ (Eq. 2) with the low-pass filter $\phi_T$ over the time dimension and with wavelets $\psi_\beta$ ($\beta \geq 0$) over the log-frequency dimension, and by applying the complex modulus:

$$S_1 x(t, \lambda, \alpha = 0, \beta) = \big|U_1 x \circledast (\phi_T \otimes \psi_\beta)\big|(t, \lambda), \tag{7}$$

where the time constant $T$ is the same as in Eq. 6, i.e., $T = 1000$ ms by default. Over the time variable, we set the modulation rate $\alpha$ of $S_1$ to zero in the equation above. Conversely, over the log-frequency variable, the edge case $\beta = 0$ corresponds to replacing the wavelet $\psi_\beta$ by the low-pass filter $\phi_F$. We refer to [13] for more details on the implementation of joint time-frequency scattering.
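We adopt the multi-index notation $p = (\lambda, \alpha, \beta)$ as a shorthand for the tuple of frequency, rate, and scale. The tuple $p$ is known as a scattering path (see [68]), and may apply to index both first-order ($S_1$) and second-order ($S_2$) coefficients. Given an input waveform $x$, we denote by $Sx$ the feature vector resulting from the concatenation of $S_1 x$ and $S_2 x$:

$$Sx(t, p) = \begin{cases} S_1 x(t, \lambda, 0, \beta) & \text{if } \alpha = 0, \\ S_2 x(t, \lambda, \alpha, \beta) & \text{otherwise.} \end{cases} \tag{8}$$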

Median-based logarithmic compression and affine standardization
Now, we apply a pointwise nonlinear transformation on averaged joint time-frequency scattering coefficients. The role of this transformation, which is adapted to the dataset in an unsupervised way, is to Gaussianize the histogram of amplitudes of each scattering path $p$. We consider a collection $\mathcal{X}$ of $N$ waveforms $x_1, \ldots, x_N$. For every path $p$ in the joint time-frequency scattering transform operator $S$, we average the response of each scattering coefficient $Sx_n$ over time and take its median value across all samples $n$ from 1 to $N$:

$$\mu(p) = \underset{1 \leq n \leq N}{\mathrm{median}} \int Sx_n(t, p)\,\mathrm{d}t. \tag{9}$$

If the collection is split between a training set and a test set (see Section 5), we compute $\mu$ on the training set only. Then, to match a decibel-like perception of loudness, we apply the following adaptive transformation, which composes a median-based renormalization and a logarithmic compression:

$$\widetilde{S}x_n(p) = \log\left(1 + \frac{\int Sx_n(t, p)\,\mathrm{d}t}{\varepsilon\,\mu(p)}\right), \tag{10}$$

where $\varepsilon$ is a predefined constant. The offset of one before the application of the pointwise logarithm ensures that the transformation is nonexpansive in the sense of Lipschitzian maps: there exists a constant $c$ such that

$$\big\|\widetilde{S}x_m - \widetilde{S}x_n\big\| \leq c\,\|x_m - x_n\| \tag{11}$$

for every pair of samples $(x_m, x_n)$. On a dataset of environmental audio recordings, a previous publication has shown empirically that Eq. 10 brings the histogram of $\widetilde{S}x_n(p)$ closer to a Gaussian distribution [14]. Since then, this finding has also been confirmed in the case of musical sounds [4]. Lastly, we standardize every feature to null mean and unit variance, across the dataset $\mathcal{X} = \{x_1, \ldots, x_N\}$, independently for each scattering path $p$. Again, if $\mathcal{X}$ is split between training and test sets, we measure means and variances over the training set only and propagate them as constants to the test set. With a slight abuse of notation, we henceforth denote by $Sx_n(p)$ the standardized log-scattering features at path $p$ for sample $n$, even though its value differs from Eq. 10 by an affine transformation.
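The compression and standardization steps can be sketched as follows, assuming that the time-averaged scattering coefficients are already stored as a nonnegative array with one row per sample and one column per path, and that all medians are nonzero. The function names and the value `eps=1e-3` are placeholders; the text only states that the constant is predefined.

```python
import numpy as np

def fit_compression(S_train, eps=1e-3):
    """Estimate medians, means, and standard deviations on the training set only."""
    mu = np.median(S_train, axis=0)            # one median per scattering path
    log_S = np.log1p(S_train / (eps * mu))     # log(1 + S / (eps * mu))
    return {"mu": mu, "eps": eps,
            "mean": log_S.mean(axis=0), "std": log_S.std(axis=0)}

def apply_compression(S, params):
    """Apply the fitted statistics, unchanged, to training or test features."""
    log_S = np.log1p(S / (params["eps"] * params["mu"]))
    return (log_S - params["mean"]) / params["std"]
```

Keeping the fitted statistics in a separate object makes it explicit that test samples never influence the medians, means, or variances.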

Metric learning with large-margin nearest neighbors (LMNN)
Let $x$ be some arbitrary audio sample in the dataset $\mathcal{X}$. Let $G$ be a cluster graph with $N = \operatorname{card} \mathcal{X}$ vertices and $C$ clusters in total. We denote by $G(x)$ the cluster to which the sample $x$ belongs. Given another sample $y$ in $\mathcal{X}$, $y$ is similar to $x$ if and only if $y$ belongs to the cluster $G(x)$. Because $G$ is a disjoint union of complete graphs, this relation is symmetric: $x \in G(y)$ is equivalent to $y \in G(x)$. In our protocol, $x$ contains the sound of an isolated musical note and the cluster graph $G$ encodes auditory similarities within the dataset $\mathcal{X}$. Moreover, we take $G$ to be equal to the "consensus" cluster graph $G_0$, i.e., arising from the partition of a hypergraph $H$ which contains the judgments of all $K$ participants from our perceptual study (see Section 3).
We denote by $Sx$ the feature vector of joint time-frequency scattering resulting from $x$. This vector includes both first-order and second-order scattering coefficients, after median-based logarithmic compression and affine standardization (see subsections above). Furthermore, we denote by $\mathcal{Y}_R(x)$ the list of the $R$ nearest neighbors of $Sx$ in feature space according to the Euclidean metric. In all of the following, we set the constant $R$ to 5; this is in accordance with our chosen evaluation metric, average precision at rank 5 (AP@5, see Section 5). Let $P$ be the number of scattering paths in the operator $S$. The LMNN algorithm learns a matrix $L$ with $P$ rows and $P$ columns by minimizing an error function of the form:

$$E(L) = \frac{1}{2} E_{\mathrm{pull}}(L) + \frac{1}{2} E_{\mathrm{push}}(L), \tag{13}$$

where, intuitively, $E_{\mathrm{pull}}$ tends to shrink local Euclidean neighborhoods in feature space while $E_{\mathrm{push}}$ tends to penalize small distances between samples that belong to different clusters in $G$.
The definition of $E_{\mathrm{pull}}$ is:

$$E_{\mathrm{pull}}(L) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}_R(x)} \big\|L(Sx - Sy)\big\|_2^2. \tag{14}$$

Note that the error term $E_{\mathrm{pull}}$ is unsupervised, in the sense that it does not depend on the cluster assignments of $x$ and $y$ in $G$. While the term $E_{\mathrm{pull}}$ operates on pairs of samples, the term $E_{\mathrm{push}}$ operates on triplets $(x, y, z) \in \mathcal{X}^3$. The first sample, $x$, is known as an "anchor." The second sample, $y$, is known as a "positive", and is assumed to belong to the Euclidean neighborhood of the anchor: $y \in \mathcal{Y}_R(x)$. The third sample, $z$, is known as a "negative" and is assumed to belong to a different similarity cluster than the anchor: $z \notin G(x)$. The term $E_{\mathrm{push}}$ penalizes $L$ unless the positive-to-anchor distance is smaller than the negative-to-anchor distance by a margin of at least 1.
The definition of $E_{\mathrm{push}}$ is:

$$E_{\mathrm{push}}(L) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}_R(x)} \sum_{z \notin G(x)} \rho\Big(1 + \big\|L(Sx - Sy)\big\|_2^2 - \big\|L(Sx - Sz)\big\|_2^2\Big), \tag{15}$$

where the function $\rho : u \mapsto \max(0, u)$ denotes the activation function of the rectified linear unit (ReLU), also known as hinge loss. The cost function described in the equation above is known in deep learning as "triplet loss" and has recently been applied to train large-vocabulary audio classifiers in an unsupervised way [69]. We refer to [70] for a review of the state of the art in metric learning.
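To make the two error terms concrete, the sketch below evaluates them for a given matrix $L$. It is an illustration under the description above and not the LMNN solver itself: neighborhoods are taken as Euclidean nearest neighbors in the input feature space, the pairwise distance matrix is dense (which would be memory-hungry for thousands of samples), and the function name `lmnn_terms` is hypothetical. The two terms are returned separately so that no weighting between them is assumed.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Dense matrix of squared Euclidean distances between rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sum(diff**2, axis=-1)

def lmnn_terms(L, S, clusters, R=5):
    """Return (E_pull, E_push) for features S (N x P) and cluster labels (N,)."""
    N = S.shape[0]
    d_in = pairwise_sq_dists(S)
    np.fill_diagonal(d_in, np.inf)                  # a sample is not its own neighbor
    neighbors = np.argsort(d_in, axis=1)[:, :R]     # Euclidean neighborhoods Y_R(x)
    d = pairwise_sq_dists(S @ L.T)                  # distances after the linear map L
    e_pull = d[np.arange(N)[:, None], neighbors].sum()
    e_push = 0.0
    for x in range(N):
        negatives = clusters != clusters[x]         # samples z outside the cluster of x
        for y in neighbors[x]:
            # Hinge penalty with a margin of 1 on every (anchor, positive, negative).
            e_push += np.maximum(0.0, 1.0 + d[x, y] - d[x, negatives]).sum()
    return e_pull, e_push
```

When clusters are already well separated relative to the unit margin, the push term vanishes and only the pull term remains.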

Extension to diverse pitches and dynamics
In order to suit the practical needs of contemporary music composers, computer-assisted orchestration must draw from a diverse realm of instruments and techniques.
Therefore, whereas our data collection procedure for timbre similarity judgments focused on a single pitch (middle C) and a single intensity level (mf ), we formulate our machine listening experiment on an expanded dataset of audio samples, containing variations in pitch and dynamics.
Given an audio stimulus $x_n$ from our perceptual study, we seek its position in the cluster graph $G_0$. Then, we identify its IMT triplet, scrape for audio samples in SOL matching this triplet, and assign them all to the same cluster as the original audio stimulus. We repeat the same procedure for all 78 nodes in $G_0$, resulting in $N = 9346$ samples in total. Thus, from a limited amount of human annotation, we curate a relatively large subset of SOL, amounting to about one third of the entire dataset (9346 out of 25444 samples).
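The propagation of cluster assignments can be sketched in a few lines of Python. The data layout is hypothetical: each annotated stimulus is a pair (IMT triplet, cluster index) and each sample of the collection is a dictionary with the assumed keys `"imt"` and `"path"`.

```python
def propagate_clusters(stimuli, collection):
    """Assign every sample to the cluster of the stimulus sharing its IMT triplet."""
    cluster_of = {imt: cluster for imt, cluster in stimuli}
    return [
        (sample["path"], cluster_of[sample["imt"]])
        for sample in collection
        if sample["imt"] in cluster_of   # samples without an annotated triplet are dropped
    ]
```

Samples whose triplet was not part of the perceptual study are left out, which is why only a subset of the full collection is retained.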
In doing so, we assume that the perception of timbre is fully invariant to frequency transposition as well as changes in dynamics. This assumption coincides with the commonplace definition of timbre as an autonomous parameter of musical sound. Previous studies on real-world musical sounds have confirmed that listeners are able to ignore salient pitch differences while rating timbre similarity [71,72], insofar as these differences do not exceed one octave. Since then, another study has shown that trained musicians can identify the similarity within pairs of notes from two instruments (horn and bassoon) spanning a range of 2.5 octaves with an accuracy of 80% [73]. In comparison, non-musicians have great difficulty identifying whether two notes were produced by the same instrument or not when the notes are separated by one octave. Thus, the influence of pitch on timbre perception appears to depend upon musical training. With this important caveat in mind, we leave as future work the question of disentangling the effects of pitch and musical training in the modeling of auditory similarities between instrumental playing techniques.

Evaluation metric
Let us denote by $x_1, \ldots, x_N$ the $N$ audio samples associated with our annotated dataset. Given a sample $n$ and a human subject $k$, we denote by $G_k$ the cluster graph associated to the subject $k$, and by $G_k(n)$ the cluster to which the sample $x_n$ belongs. Our machine listening system $\Phi$ takes the waveform $x_n$ as input and returns a ranked list of nearest neighbors: $\Phi_1(x_n)$, $\Phi_2(x_n)$, $\Phi_3(x_n)$, and so forth.
In the context of browsing an audio collection by timbre similarity, $x_n$ is a user-defined query while the function $\Phi$ plays the role of a search engine. We consider the first retrieved sample, $\Phi_1(x_n)$, to be relevant to user $k$ if and only if it belongs to the same cluster as $x_n$ in the cluster graph $G_k$, hence the Boolean condition $\Phi_1(x_n) \in G_k(n)$. Likewise, the second retrieved sample is relevant if and only if $\Phi_2(x_n) \in G_k(n)$. To evaluate $\Phi$ on the query $x_n$, we measure the relevance of all nearest neighbors $\Phi_r(x_n)$ up to some fixed rank $R$ and average the result:

$$p_\Phi(n, k, R) = \frac{1}{R} \sum_{r=1}^{R} \mathbb{1}\big(\Phi_r(x_n) \in G_k(n)\big). \tag{16}$$

In the equation above, the indicator function $\mathbb{1}$ converts Booleans to integers, i.e., $\mathbb{1}(b)$ returns one if $b$ is true and returns zero if $b$ is false. Thus, the function $p_\Phi$ takes fractional values between 0 and 1, which are typically expressed in percentage points. The precision at rank $R$ of the system $\Phi$ is defined as the average value taken by the function $p_\Phi$ over the entire corpus of $N$ audio samples:

$$P_\Phi(k, R) = \frac{1}{N} \sum_{n=1}^{N} p_\Phi(n, k, R). \tag{17}$$

Lastly, the "average precision at rank $R$" (henceforth, AP@R) is the average value of $P_\Phi$, for constant $R$, across all $K = 31$ participants from our perceptual study:

$$\mathrm{AP@}R(\Phi) = \frac{1}{K} \sum_{k=1}^{K} P_\Phi(k, R). \tag{18}$$

It appears from the above that an effective system should retrieve sounds whose IMT triplets are similar according to all of the $K$ cluster graphs $G_1, \ldots, G_K$.
In the rest of this paper, we set R to 5. This is in accordance with the protocol of [4], in which the authors trained a metric learning algorithm on the SOL dataset to search for similar instruments and playing techniques, yet without the intervention of a human subject.
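The evaluation metric can be sketched in NumPy as follows. The sketch makes two assumptions that are not stated in the text: the query itself is excluded from its own list of neighbors, and each participant's cluster graph is encoded as an integer vector of cluster indices. Function names are hypothetical, and the dense distance matrix is only practical for moderate numbers of samples.

```python
import numpy as np

def precision_at_rank(features, clusters, R=5, L=None):
    """Precision at rank R for one cluster graph, given as integer cluster labels."""
    Z = features if L is None else features @ L.T
    d = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d, np.inf)                    # assumption: exclude the query itself
    nearest = np.argsort(d, axis=1)[:, :R]         # ranked neighbors of every query
    relevant = clusters[nearest] == clusters[:, None]
    return relevant.mean()                         # average over ranks, then over queries

def average_precision_at_rank(features, cluster_graphs, R=5, L=None):
    """Mean and standard deviation of the precision at rank R across participants."""
    P = [precision_at_rank(features, g, R, L) for g in cluster_graphs]
    return float(np.mean(P)), float(np.std(P))
```

The second returned value corresponds to the spread across participants that accompanies each reported AP@5 figure.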

Results
The previous section described our methods for extracting spectrotemporal modulations in audio signals, as well as learning a non-Euclidean similarity metric between them. We now apply these methods to the problem of allocating isolated musical notes to clusters of some timbre similarity graph $G$. In practice, for training purposes, the cluster graph $G$ represents the consensus of the $K = 31$ clusterings provided by the users interacting with the Cyberlioz web application, which was described in Section 3 ($G = G_0$). However, for evaluation purposes, this cluster graph corresponds to the subjective preferences of a single user $k \geq 1$, in which case we take $G = G_k$.

Best performing system
Our best performing system comprises five computational blocks: (1) joint time-frequency scattering up to a maximal time scale of $T = 1000$ ms; (2) temporal averaging at the scale of the whole musical note; (3) median-based logarithmic compression; (4) affine standardization so that each feature has zero mean and unit variance; and (5) nearest-neighbor search according to a previously learned non-Euclidean metric.
Note that the non-Euclidean metric is learned via LMNN (see Section 4) on the "consensus" cluster graph G 0 . Therefore, the system performs timbre similarity retrieval in a user-agnostic way and can serve as a convenient default algorithm for newcoming users. That being said, it is conceivable to replicate the five-stage protocol above on the cluster graph G k of a specific user k, instead of the cluster graph G 0 . This operation would lead to a new configuration of the search engine that is better tailored to the perceptual idiosyncrasy of user k in terms of timbre similarity.
Within the default setting ($G = G_0$), our system achieves an average precision at rank five (AP@5) of 99.0%, with a standard deviation across $K = 31$ participants of the order of 1%. This favorable result suggests that joint time-frequency scattering provides a useful feature map for learning similarities between instrumental playing techniques. In doing so, it is in line with a recent publication [74], in which the authors successfully trained a supervised classifier on joint time-frequency scattering features in order to detect and classify playing techniques from the Chinese bamboo flute (dizi). However, the originality of our work is that it relies purely on auditory information (i.e., timbre similarity judgments), and does not require any supervision from the symbolic domain. In particular, it does not assume the metadata (instrument, mute, technique, pitch, dynamics, and so forth) of any musical sample $x_n$ to be observable, in part or in full, at training time.

Visualization of joint time-frequency scattering coefficients
In the second order, joint time-frequency scattering coefficients depend upon four variables: time $t$, log-frequency $\lambda$, temporal modulation rate $\alpha$ in Hertz, and frequential modulation scale $\beta$ in cycles per octave. From a data visualization standpoint, rendering the four-dimensional tensor $S_2 x(t, \lambda, \alpha, \beta)$ is impossible. To address this limitation, a recent publication has projected this tensor into a two-dimensional "slice," thus yielding an image raster [13]. In accordance with their protocol, we compute the following matrix:

$$Vx(\alpha, \beta) = \iint U_2 x(t, \lambda, \alpha, \beta)\,\mathrm{d}t\,\mathrm{d}(\log_2 \lambda).$$

Observe that the equation above is a limit case of $S_2$ (Eq. 6) in which the constants $T$ and $F$ tend towards infinity. As a result, $Vx$ depends solely upon rate $\alpha$ and scale $\beta$. In the scientific literature on STRF, the matrix $Vx$ is known as the "cortical output collapsed on the rate-scale axes" [75].
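Collapsing the tensor onto the rate-scale axes can be sketched as below. Because the sample rates in time and log-frequency depend on the rate and on the scale, the sketch assumes a hypothetical storage layout: one two-dimensional array per pair of rate and scale, together with its sampling steps.

```python
import numpy as np

def rate_scale_slice(U2, steps):
    """Integrate each time-frequency array over both axes.

    U2[(alpha, beta)] is a 2-D array indexed by (time, log-frequency);
    steps[(alpha, beta)] = (dt, du) gives its sampling steps (assumed layout).
    """
    return {
        key: float(arr.sum() * steps[key][0] * steps[key][1])
        for key, arr in U2.items()
    }
```

Weighting each sum by its sampling steps keeps the collapsed values comparable across paths that were discretized at different resolutions.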
Previous publications on STRFs have demonstrated the interest of visualizing the slice $Vx$ in the case of speech [76], lung sounds [77], and music [18]. However, the visualization of musical sounds has been restricted to some of the most common playing techniques, i.e., piano played staccato and violin played pizzicato. Furthermore, prior publications on time-frequency scattering have displayed slices of $S_2 x$ in the case of synthetic signals, but there is a gap in the literature as regards the interpretability of the scale-rate domain in the case of real-world signals.
To remedy this gap, we select twelve isolated notes from the SOL dataset from two instruments: violin (Fig. 3) and flute (Fig. 4). By and large, we find that joint time-frequency scattering produces comparable patterns in the scale-rate domain to the "cortical output" of the STRF. For example, Fig. 3a shows that a violin note played in the ordinario technique has a local energy maximum at the rate $\alpha = 6$ Hz. A visual inspection of $U_1$ demonstrates that this rate coincides with the rate of vibrato of the left hand in the violin note. As seen in Fig. 3b, this local energy maximum is absent when the playing technique is denoted as nonvibrato. Furthermore, Fig. 3c shows that the local energy maximum is displaced to a higher rate ($\alpha = 12$ Hz) when the vibrato is replaced by a tremolo.
The visual interpretation of playing techniques in terms of their joint time-frequency scattering coefficients is not restricted to periodic modulation, such as vibrato or tremolo: rather, it also encompasses the analysis of attack transients. Figure 3d, e, and f show the matrix $Vx$ for three instances of impulsive violin sounds: sforzando, pizzicato, and staccato respectively. These three techniques create ridges in the scale-rate domain $(\alpha, \beta)$, where the cutoff rate $\alpha$ is lowest with sforzando and highest with staccato. These variations in cutoff rate coincide with perceptual variations in "hardness", i.e., impulsivity, of the violin sound. Moreover, in the case of staccato, we observe a slight asymmetry in the frequential scale parameter $\beta$. This asymmetry could be due to the fact that higher-order harmonics decay faster than the fundamental, thus yielding a triangular shape in the time-frequency domain.
Figure 4 shows six playing techniques of the flute. Similarly to the violin (Fig. 3), we observe that periodic modulations, such as a trill (Fig. 4b) or a beating tone (Fig. 4c), cause local energy maxima whose rate $\alpha$ is physically interpretable. Likewise, impulsive flute sounds such as sforzando (Fig. 4d), key click (Fig. 4e), and vibrato (Fig. 4f) create ridges in the scale-rate domain of varying cutoff rates $\alpha$. We distribute the implementation of these figures as part of the MATLAB library scattering.m, which is released under the MIT license. 5

Ablation study
We now alter certain key choices in the design of the above-described computational blocks, and discuss their respective impacts on downstream performance. Figure 5 summarizes our results. Interestingly, our best performing system, denoted by $\Phi$, is not only best on average, but also best for every subject in the cohort. Specifically, replacing $\Phi$ by a simpler model $\Phi'$ (see subsections below for examples of such models) results in $P_{\Phi'}(k, 5) < P_\Phi(k, 5)$ for every $k$. Borrowing from the terminology of welfare economics, $\Phi$ can be said to be uniquely Pareto-efficient [78]. This observation suggests that the increase in performance afforded by the state-of-the-art model $\Phi$ with respect to the baseline $\Phi'$ does not come at the detriment of user fairness.

Role of dataset size
First of all, it is worth noting that the presented AP@5 figure of 99.0% ± 1 does not abide by a conventional training set vs. test set paradigm, as is most often done in machine learning research. Rather, the LMNN algorithm is trained on all available samples ($N = 9346$ isolated notes) with the "consensus" cluster graph as training objective ($G = G_0$). Then, it is evaluated on the same samples with individual cluster graphs as ground truth ($G = G_k$ for $k \geq 1$). The reason behind this choice is that, in the context of computer-assisted orchestration, the collection of audio samples has a fixed size, as it is shipped alongside the software itself. This is the case, for example, of Orchidea, 6 which comes paired with its own version of SOL, named OrchideaSOL [79]. Hence, our main goal was to evaluate the generalization ability of our metric learning algorithm beyond the restricted set of samples for which human annotations are directly available (see Section 3), that is, beyond one pitch class (middle C) and one intensity level (mf).
Despite this caveat, we may adopt a "query-by-example" (QbE) paradigm by partitioning the database of audio samples in half ($N/2 = 4673$), training the LMNN algorithm on the first half, and querying it with samples from the other half. In this evaluation framework, our system reaches an AP@5 of 96.2% ± 2. Interestingly, querying the system with samples from the training set leads to an AP@5 of 96.5% ± 2, i.e., roughly on par with the test set. Thus, it appears that the gap in performance between the evaluation presented in Section 5 (99.0% ± 1) and query-by-example evaluation (96.2% ± 2) is primarily attributable to a reduction of the size of the training set by half, whereas statistical overfitting of the training set with respect to the test set likely plays a minor role. These findings are in line with a previous publication [18] which modeled perceived dissimilarity between musical sounds by means of STRF features. Likewise, a recent publication has observed similar performance in instrument identification of a human and machine classifier using separable Gabor filterbank (GBFB) features [80].
We note that both STRF and GBFB bear a strong computational resemblance with time-frequency scattering.

Role of metric learning
Replacing the LMNN metric learning algorithm by linear discriminant analysis (LDA) leads to an AP@5 of 76.6% ± 11. Moreover, we evaluate the nearest neighbor algorithm in the absence of any metric learning at all. This corresponds to using a Euclidean distance to compare scattering coefficients, i.e., to setting $L$ to the identity matrix. Note that the runtime complexity of Euclidean nearest-neighbor search is identical to that of LMNN search. We report an average precision at rank five (AP@5) of 92.9% ± 3, which is noticeably worse than the best performing system. This gap in performance demonstrates the importance of complementing unsupervised feature extraction by supervised metric learning in the design of computational models for timbre similarity between instrumental playing techniques.
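The relationship between Euclidean and learned nearest-neighbor search can be illustrated with a short sketch, in which the function name `rank_neighbors` is hypothetical. Omitting the matrix $L$, or setting it to the identity, recovers the Euclidean baseline. Since $L$ can be applied to the whole database once and for all, the search itself costs the same in both cases.

```python
import numpy as np

def rank_neighbors(query, database, L=None, R=5):
    """Indices of the R nearest rows of `database` under d(a, b) = ||L (a - b)||^2."""
    diff = database - query
    if L is not None:
        diff = diff @ L.T              # linear map learned by metric learning
    return np.argsort(np.sum(diff**2, axis=1))[:R]
```

A non-isotropic matrix $L$ can reorder the neighbors of a query by stretching some feature dimensions and shrinking others.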

Role of temporal context T
Our best performing system operates with joint time-frequency scattering coefficients as spectrotemporal modulation features. These features are extracted within a temporal context of duration equal to $T = 1000$ ms. This value is considerably larger than the frame size of purely spectral features, such as spectral centroid, spectral flux, or mel-frequency cepstral coefficients (MFCCs). Indeed, the frame size of spectral features for machine listening is typically set to $T = 23$ ms, i.e., $2^{10} = 1024$ samples at a sampling rate of 44.1 kHz [22,23]. As a point of comparison, we set the maximum time scale of joint time-frequency scattering coefficients to $T = 25$ ms, hence a 40-fold reduction in context size. Over our cohort of $K = 31$ participants, we report an AP@5 of 90.9% ± 4, which is noticeably worse than the best performing system. This gap in performance extends the findings of a previous publication [4], which reported that metric learning with temporal scattering coefficients tends to improve with growing values of $T$, until reaching a plateau of performance around $T \sim 500$ ms.

[Figure caption fragment: The performance achieved for each reference clustering is depicted by a lozenge whose color is chosen arbitrarily but consistently across conditions. See Section 5 for details.]

Role of joint time-frequency scattering
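Let us recall the full definition of second-order joint time-frequency scattering (see Section 4):

$$S_2 x(t, \lambda, \alpha, \beta) = \Big(\big||x \ast \psi_\lambda| \circledast \Psi_{\alpha,\beta}\big| \circledast (\phi_T \otimes \phi_F)\Big)(t, \lambda), \tag{19}$$

where $\psi_\lambda$ denotes a Morlet wavelet of center frequency $\lambda$ and $\phi_T$ (resp. $\phi_F$) denotes a Gaussian low-pass filter of width $T$ (resp. $F$). Besides the joint time-frequency scattering transform, the generative grammar of scattering transforms [81] also encompasses the separable time-frequency scattering transform:

$$S_2^{\mathrm{sep}} x(t, \lambda, \alpha, \beta) = \Big|\big(\big||x \ast \psi_\lambda| \overset{t}{\ast} \psi_\alpha\big| \overset{t}{\ast} \phi_T\big) \overset{u}{\ast} \psi_\beta\Big|(t, \lambda), \tag{20}$$

where the wavelet $\psi_\lambda$ has a quality factor of $Q = 12$ whereas the wavelets $\psi_\alpha$ and $\psi_\beta$ have a quality factor of one. Previous publications have successfully applied separable time-frequency scattering in order to classify environmental sounds [82] as well as playing techniques from the Chinese bamboo flute [83].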
In comparison with its joint counterpart, separable time-frequency scattering contains about half as many coefficients. This is because the temporal wavelet transform with ψ α and the frequential wavelet transform with ψ β are separated by an operation of complex modulus. Hence, ψ β operates on a real-valued input. Because separable time-frequency scattering cannot distinguish ascending chirps from descending chirps, flipping the sign of the scale variable β is no longer necessary. Moreover, separable time-frequency scattering has a lower algorithmic complexity than joint time-frequency scattering. Indeed, in Eq. 20, the frequential wavelet transform with ψ β operates on a tensor whose time axis is subsampled at a fixed rate T −1 , thus allowing vectorized computations. Conversely, in Eq. 19, the frequential wavelet transform must operate on a multiresolution input, whose sample rate varies depending on α: it ranges between T −1 and the sample rate of x itself.
Yet, despite its greater simplicity, separable time-frequency scattering suffers from known weaknesses in its ability to represent spectrotemporal modulations. In particular, a previous publication has shown that frequency-dependent time shifts affect joint time-frequency scattering coefficients while leaving separable time-frequency scattering coefficients almost unchanged [13]. The same observation was made by [84] in the case of joint and separable Gabor filterbank features (GBFB), which bear some resemblance to joint and separable time-frequency scattering coefficients respectively. Over our cohort of $K = 31$ participants, separable time-frequency scattering achieves an AP@5 of 91.9% ± 4. This figure is noticeably worse than joint time-frequency scattering (99.0% ± 1), all other things being equal. This gap in performance, together with the theory of STRFs in auditory neuroscience (see Section 2), demonstrates the importance of joint spectrotemporal modulations in the modeling of timbre similarity across instrumental playing techniques.

Comparison with mel-frequency cepstral coefficient (MFCC) baseline
Lastly, we train a baseline system in which joint time-frequency scattering coefficients are replaced by MFCCs. Specifically, we extract a 40-band mel-frequency spectrogram by means of the RASTAMAT library, apply the pointwise logarithm, and compute a discrete cosine transform (DCT) over the mel-frequency axis. 7 This operation results in a 40-dimensional feature vector over frames of duration $T = 25$ ms. Over our cohort of $K = 31$ participants, we report an AP@5 of 81.8% ± 7. Arguably, this figure is not directly comparable with our best performing system (AP@5 of 99.0% ± 1), due to the mismatch in dimensionality between MFCC and joint time-frequency scattering coefficients. In order to clarify the role of feature dimensionality in our computational pipeline, we apply a feature engineering technique involving multiplicative combinations of MFCC. We construct the following Gram matrix:

$$Gx(\alpha, \beta) = \int \mathrm{MFCC}x(t, \alpha)\,\mathrm{MFCC}x(t, \beta)\,\mathrm{d}t, \tag{21}$$

where $\alpha$ and $\beta$ represent different dimensions ("quefrencies") of the MFCC feature vector. The symmetric matrix $Gx$ contains 40 rows and 40 columns, hence 800 unique coefficients. Concatenating these coefficients to the 40 averaged MFCC features results in a feature vector of 840 coefficients. This dimension is of the same order of magnitude as the dimension of our joint time-frequency scattering representation ($d = 1180$, see Section 4).
Training the LMNN algorithm on this 840-dimensional representation is analogous to a "kernel trick" in support vector machines [85]. In our case, the implicit similarity kernel is a homogeneous quadratic kernel. Despite this increase in representational power, we obtain an AP@5 of 81.5%±7, i.e., essentially the same as MFCC under a linear kernel. Therefore, it appears that the gap in performance between MFCC (81.8%±7) and joint time-frequency scattering (99.0%±1) is primarily attributable to the multiscale extraction of joint spectrotemporal modulations, whereas high-dimensional embedding likely plays a minor role.

Conclusion
We see from the ablation study conducted above that each step of the proposed model is necessary for its high performance. Among other things, it indicates the joint time-frequency scattering transform itself should not be used directly for similarity comparisons, but is best augmented by a learned stage. This explains the relative success of such models [4,14] over others where distances on raw scattering coefficients were used for judging audio similarity.
On the other hand, we note that the complexity of the learned model does not need to be high. Indeed, for this task, a linear model on the scattering coefficients is sufficient. There is no need for deep networks with large numbers of parameters to accurately represent the similarity information. In other words, the scattering transform parametrizes the signal structure in a way that many relevant quantities (such as timbre similarity) can be extracted through a simple linear mapping. This is in line with other classification results, where linear support vector machines applied to scattering coefficients have achieved significant success [12,13].
We also see the necessity of a fully joint time-frequency model for accessing timbre, as opposed to a purely spectral model or one that treats the two axes in a separable manner. This fact has also been observed in other contexts, such as the work of Patil et al. [18]. A related observation is the need for capturing large-scale structure. Indeed, reducing the window size to 25 ms means that we lose a great deal of time-frequency structure, bringing results closer to that of the separable model.
The success of the above system in identifying timbral similarities has immediate applications in browsing music databases. These are typically organized based on instrumental and playing techniques taxonomies, with additional keywords offering a more flexible organization. Accessing the sounds in these databases therefore requires some knowledge of the taxonomy and keywords used. Furthermore, the user needs to have some idea of the particular playing technique they are searching for.
As an alternative, content-based searches allow the user to identify sounds based on similarity to some given query sound. This query-by-example approach provides an opportunity to search for sounds without having a specific instrument or playing technique in mind, yielding a wider range of available sounds. A composer with access to such a system would therefore be able to draw on a more diverse palette of musical timbre.
The computational model proposed in this work is well suited to such a query-by-example task. We have shown that it is able to adequately approximate the timbral judgments of a wide range of participants included in our study. Not only that, but the system can be easily retrained to approximate an individual user's timbre perception by having that user perform the clustering task on the reduced set of 78 IPTs and running the LMNN training step on those clustering assignments. This model can then be applied to new data, or, alternatively, be retrained with these new examples if the existing model proves unsatisfactory.
We shall note, however, that the current model has several drawbacks. First, it is only applied to instrumental sounds. While this has the advantage of simplifying the interpretation of the results, the range of timbre under consideration is necessarily limited (although less restricted than only considering ordinario PTs). This also makes applications such as query-by-humming difficult, since we cannot guarantee that the timbral similarity measure is accurate for such sounds.
That being said, the above model is general enough to encompass a wide variety of recordings, not just instrumental sounds. Indeed, we expect that the tools used (scattering transforms and LMNN weighting matrices) do not depend strongly on the type of sound being processed. Future work will investigate whether more general classes of sounds are also well modeled. To extend the model, it is only necessary to retrain the LMNN weighting matrix by supplying it with new cluster assignments. These can again be obtained by performing a new clustering experiment with one or more human subjects.
Another aspect is the granularity of the similarity judgments. In the above method, we have used hard clustering assignments to build our model. A more nuanced similarity judgment would ask users to rate the similarity of a pair of IPTs on a more graduated scale, which would yield a finer, or soft, assignment. This, however, comes with additional difficulties in providing a consistent scale across participants, but could be feasible if the goal is only to adapt the timbral space to a single individual. An approach not based on clustering would also have to replace the LMNN algorithm with one that accounts for such soft assignments.