A computational study of auditory models in music recognition tasks for normal-hearing and hearing-impaired listeners

Friedrichs, Klaus; Bauer, Nadja; Martin, Rainer; Weihs, Claus

doi:10.1186/s13636-017-0103-7

Research
Open access
Published: 02 March 2017

Klaus Friedrichs ORCID: orcid.org/0000-0001-7062-6672¹,
Nadja Bauer¹,
Rainer Martin² &
Claus Weihs¹

EURASIP Journal on Audio, Speech, and Music Processing volume 2017, Article number: 7 (2017)

Abstract

The benefit of auditory models for solving three music recognition tasks—onset detection, pitch estimation, and instrument recognition—is analyzed. Appropriate features are introduced which enable the use of supervised classification. The auditory model-based approaches are tested in a comprehensive study and compared to state-of-the-art methods, which usually do not employ an auditory model. For this study, music data is selected according to an experimental design, which enables statements about performance differences with respect to specific music characteristics. The results confirm that the performance of music classification using the auditory model is comparable to the traditional methods. Furthermore, the auditory model is modified to exemplify the decrease of recognition rates in the presence of hearing deficits. The resulting system is a basis for estimating the intelligibility of music which in the future might be used for the automatic assessment of hearing instruments.

1 Introduction

Hearing-impaired listeners want to enjoy music just as normal-hearing listeners do, although this is impeded by a distorted perception of music signals. Recently, several listening experiments have been conducted to assess the impact of hearing loss on music perception (e.g., [1–4]). For many applications, such as the optimization of hearing instruments, it is desirable to measure this impact automatically using a simulation model. Therefore, we investigate the potential of emulating certain normal-hearing and hearing-impaired listeners by automatically assessing their ability to discriminate music attributes via an auditory model. Auditory models are computational models which mimic the human auditory process by transforming acoustic signals into neural activity of simulated auditory nerve fibers (channels). Since these models do not explain the whole listening comprehension of higher central auditory stages, a back end that relies on the output of the auditory periphery is needed. Similar ideas have already been proposed for measuring speech intelligibility in [5, 6], where this back end is an automatic speech recognition system, resulting in the word recognition rate as a natural metric. However, no such straightforward method exists to measure the corresponding “music intelligibility” in general. Unlike speech, music spectra are highly variable and have a much greater dynamic range [7]. For estimating “music intelligibility,” its constituent elements (pitch, harmony, rhythm, and timbre) have to be assessed independently [8]. Therefore, we focus on three separate music recognition tasks, i.e., onset detection, pitch estimation, and instrument recognition. Contrary to state-of-the-art methods, we extract information from the auditory output only. In fact, some recent proposals in the field of speech recognition and music data analysis use auditory models, thus exploiting the superiority of the human auditory system (e.g., [9–11]).
However, in most of these proposals, the applied auditory model is not sufficiently detailed to provide adequate options for implementing realistic hearing deficits. In the last decades, more sophisticated auditory models have been developed which can also simulate hearing deficits [12–15]. In [16, 17], it is shown that simple parameter modifications in the auditory model are sufficient to realistically emulate auditory profiles of hearing-impaired listeners.

In this study, we restrict our investigation to chamber music which includes a predominant melody instrument and one or more accompanying instruments. For further simplification, we are only interested in the melody track, which means that all accompanying instruments are regarded as interference. Consequently, the three recognition tasks are more precisely described as predominant onset detection, predominant pitch estimation, and predominant instrument recognition.

The article is organized as follows. In Section 2, related work is discussed. The contribution of this paper is summarized in Section 3. In Section 4, the applied auditory model of Meddis [18] (Section 4.1) and our proposals for the three investigated music recognition tasks are described (Sections 4.2–4.4). At the end of that section, the applied classification methods, Random Forest (RF) and linear SVM, are briefly explained (Section 4.5). Section 5 provides details about the experimental design. Plackett-Burman (PB) designs are specified for selecting the data set, which enable assessments of performance differences with respect to the type of music. In Section 6, we present the experimental results. First, the proposed approaches are compared to state-of-the-art methods, and second, performance losses due to the emulation of hearing impairments are investigated. Finally, Section 7 summarizes and concludes the paper and gives some suggestions for future research.

2 Related work

Combining predominant onset detection and predominant pitch estimation results in a task which is better known as melody detection. However, the performance of approaches in that research field is still rather poor compared to human perception [19]. In particular, onset detection is still rather error-prone for polyphonic music [20]. Hence, in this study, all three musical attributes of interest are estimated separately, which means the true onsets (and offsets) are assumed to be known for pitch estimation and instrument recognition, excluding error propagation from onset detection.

2.1 Onset detection

The majority of onset detection algorithms consist of an optional pre-processing stage, a reduction function (called onset detection function), which is derived at a lower sampling rate, and a peak-picking algorithm [21]. They can all be summarized into one algorithm with several parameters to optimize. In [22], we systematically solve this by using sequential model-based optimization. The onset detection algorithm can also be applied channel-wise to the output of the auditory model where each channel corresponds to a different frequency band. Here, the additional challenge lies in the combination of different onset predictions of several channels. In [23], a filter bank is used for pre-processing, and for each band, onsets are estimated which together build a set of onset candidates. Afterwards, a loudness value is assigned to each candidate, and a global threshold and a minimum distance between two consecutive onsets are used to filter out candidates. A similar approach, but this time for combining the estimates of different onset detection functions, is proposed in [24] where the individual estimation vectors are combined via summing and smoothing. Instead of combining the individual estimations at the end, in [25], we propose a quantile-based aggregation before peak-picking. However, the drawback of this approach is that the latency of the detection process varies for the different channels, which is difficult to compensate before peak-picking. Onset detection of the predominant voice is a task which, to the best of our knowledge, has not yet been investigated.
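The generic structure described at the start of this subsection (a detection function computed at a reduced rate, followed by peak-picking) can be illustrated with a minimal sketch. This is not the algorithm of [21] or [22]: the energy-difference detection function, the fixed threshold, and the names `detection_function` and `pick_peaks` are assumptions chosen only for illustration.

```python
def detection_function(x, frame, hop):
    """Half-wave rectified frame-energy difference (an assumed, very simple ODF)."""
    energies = [sum(v * v for v in x[i:i + frame])
                for i in range(0, len(x) - frame + 1, hop)]
    # The first frame has no predecessor, so its novelty is set to zero.
    return [0.0] + [max(0.0, e1 - e0) for e0, e1 in zip(energies, energies[1:])]


def pick_peaks(odf, threshold, min_dist):
    """Keep local maxima above a threshold that are at least min_dist frames apart."""
    peaks = []
    for i in range(1, len(odf) - 1):
        is_local_max = odf[i] > odf[i - 1] and odf[i] >= odf[i + 1]
        far_enough = not peaks or i - peaks[-1] >= min_dist
        if is_local_max and odf[i] >= threshold and far_enough:
            peaks.append(i)
    return peaks


# Toy signal: silence, a burst, silence, a second burst (1000 samples each).
signal = [0.0] * 1000 + [1.0] * 1000 + [0.0] * 1000 + [1.0] * 1000
odf = detection_function(signal, frame=100, hop=100)
onset_frames = pick_peaks(odf, threshold=50.0, min_dist=5)
```

The frame length, hop size, threshold, and minimum distance are the kind of parameters that the text refers to when it speaks of "several parameters to optimize".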

2.2 Pitch estimation

Most pitch estimation algorithms are either based on the autocorrelation function (ACF), or they work in the frequency domain by applying a spectral analysis of potential fundamental frequencies and their corresponding partials. For both approaches, one big challenge is to pick the correct peak which is particularly difficult for polyphonic music where the detection is disturbed by overlapping partials. In order to solve that issue, several improvements are implemented in the popular YIN algorithm [26] which in fact uses the difference function instead of the ACF. A further extension is the pYIN method which is introduced in [27]. It is a two-stage method which takes past estimations into account. First, for every frame, several fundamental frequency candidates are predicted, and second, the most probable temporal path is estimated, according to a hidden Markov model. In [28], a maximum-likelihood approach is introduced in the frequency domain. Another alternative is a statistical classification approach which is proposed in [29].
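To make the difference-function idea concrete, the following sketch estimates a fundamental frequency by minimizing the squared difference between a signal and its shifted copy. It covers only this first step and is not the YIN algorithm of [26], which adds further refinements; the function names and the lag-range parameters `f_min` and `f_max` are illustrative assumptions.

```python
import math


def difference_function(x, max_lag):
    """d(tau) = sum_j (x[j] - x[j + tau])^2 over a fixed window."""
    w = len(x) - max_lag
    return [sum((x[j] - x[j + tau]) ** 2 for j in range(w))
            for tau in range(max_lag + 1)]


def estimate_f0(x, fs, f_min, f_max):
    """Pick the lag with the smallest difference inside the allowed range."""
    min_lag = int(fs / f_max)
    max_lag = int(fs / f_min)
    d = difference_function(x, max_lag)
    best_lag = min(range(min_lag, max_lag + 1), key=lambda tau: d[tau])
    return fs / best_lag


# Toy input: a pure 200 Hz sine sampled at 8 kHz.
fs = 8000
tone = [math.sin(2 * math.pi * 200 * j / fs) for j in range(800)]
f0 = estimate_f0(tone, fs, f_min=120, f_max=400)
```

Restricting the lag range is what keeps this toy example from selecting a multiple of the true period; for polyphonic input, this simple minimum search runs into exactly the peak-picking difficulty described above.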

A few pitch estimation approaches using an auditory model, or at least some of its components, have also been introduced. In [11], an outer/middle ear filter is proposed for pre-processing which reduces the number of octave errors. A complete auditory model is applied in [30, 31]. In those studies, an autocorrelation method is proposed where the individual running ACFs of each channel are combined by summation (averaging) across all channels (SACF). The results of that approach are equivalent to human performance for some specific sounds. However, the approach has not yet been tested for complex music signals. Here, too, the challenge of picking the correct peak remains. All previously discussed approaches are originally designed for monophonic pitch detection. However, pitch estimation can be extended to its predominant variant by identifying the most dominant pitch, which many peak-picking methods implicitly calculate.
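The summation across channels can be sketched as follows. This is a simplified illustration, not the running ACF of [30, 31]: it uses a plain ACF over a fixed window, and half-wave rectified sines stand in for the channel outputs of an auditory model. All names are hypothetical.

```python
import math


def acf(x, max_lag):
    """Plain autocorrelation over a fixed window (not a running ACF)."""
    w = len(x) - max_lag
    return [sum(x[j] * x[j + tau] for j in range(w)) for tau in range(max_lag + 1)]


def sacf(channels, max_lag):
    """Sum the per-channel ACFs lag by lag."""
    summary = [0.0] * (max_lag + 1)
    for channel in channels:
        for tau, value in enumerate(acf(channel, max_lag)):
            summary[tau] += value
    return summary


def pick_period(summary, min_lag):
    """Return the lag of the largest SACF value at or above min_lag."""
    return max(range(min_lag, len(summary)), key=lambda tau: summary[tau])


# Stand-in channel outputs: three partials of a 200 Hz tone at 8 kHz,
# half-wave rectified as a crude proxy for firing probabilities.
fs = 8000
channels = [[max(0.0, math.sin(2 * math.pi * f * j / fs)) for j in range(400)]
            for f in (200, 400, 600)]
summary = sacf(channels, max_lag=60)
period = pick_period(summary, min_lag=15)
```

In this toy case the channels agree only at the lag of the common fundamental period, which is why the summed function peaks there even though the individual channels also peak at shorter lags.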

Approaches also exist for polyphonic pitch estimation. One approach is proposed in [10]. Instead of just picking the maximum peak of the SACF, the strength of each candidate (peak) is calculated as a weighted sum of the amplitudes of its harmonic partials. Another approach is introduced in [32], where the EM algorithm is used to estimate the relative dominance of every possible harmonic structure.

2.3 Instrument recognition

The goal of instrument recognition is the automatic detection of musical instruments playing in a given music piece. Different musical instruments have different compositions of partial tones, e.g., in the sound of a clarinet, mostly odd partials occur. This composition of partials is, however, also dependent on other factors like the pitch, the room acoustics, and the performer [33]. For building a classifier, meaningful information of each observation has to be extracted, which is achieved by appropriate features. Timbral features based on the one-dimensional acoustic waveform are the most common features for instrument recognition. However, features based on an auditory model have also been introduced in [34]. Also, biomimetic spectro-temporal features, requiring a model of higher central auditory stages, have been successfully investigated for solo music recordings in [35]. Predominant instrument recognition can be solved similarly to the monophonic variant, but is much harder due to the additional “noise” from the accompanying instruments [36]. An alternative is starting with sound source separation in order to apply monophonic instrument recognition afterwards [37]. Naturally, this concept can only work if the sources are well separated, a task which itself is still a challenge.

3 Contribution of the paper

As there exist only very few approaches for music recognition tasks using a comprehensive auditory model, in this study, new methods are proposed. For onset detection, we adapt the ideas of [23, 24] to develop a method for combining onset estimations of different channels. The main drawback of the approach in [23] is that the selection procedure of onset candidates is based on a loudness estimation and a global threshold which makes it unsuitable for music with high dynamics. Instead, in [24] and also in our approach, relative thresholds are applied. However, the proposal in [24] can only combine synchronous onset estimations, i.e., the same sampling rate has to be used for the onset detection functions of all basic estimators. Our new approach can handle asynchronous estimations which enables the use of different hop sizes. Furthermore, we propose parameter optimization to adapt the method to predominant onset detection. Sequential model-based optimization (MBO) is applied to find optimal parameter settings for three considered variants of onset detection: (1) monophonic, (2) polyphonic, and (3) predominant onset detection. For pitch estimation, inspired by [29], we propose a classification approach for peak-picking, where each channel nominates one candidate.

In [29], potential pitch periods derived from the original signal are used as features, whereas in our approach, features need to be derived using the auditory model. Our approach is applicable to temporal autocorrelations as well as to frequency domain approaches. Additionally, we test the SACF method, where we investigate two variants for peak-picking. For instrument recognition, we adapt common timbral features for instrument recognition by extracting them channel-wise from the auditory output. This is contrary to [34], where the features are defined across all channels. The channel-wise approach preserves more information, can be more easily adapted to the hearing-impaired variants, and enables assessments of the contribution of specific channels to the recognition rates.

All approaches are extensively investigated using a comprehensive experimental design. The experimental setup is visualized in Fig. 1. The capability of auditory models to discriminate the three considered music attributes is shown via the normal-hearing auditory model which is compared to the state-of-the-art methods. For instrument recognition, the approach using the auditory model output even performs distinctly better than the approach using standard features. As a prospect of future research, performance losses based on hearing deficits are exemplified using three so-called hearing dummies as introduced in [17].

4 Music classification using auditory models

4.1 Auditory models

The auditory system of humans and other mammals consists of several stages located in the ear and the brain. While the higher stages located in the brainstem and cortex are difficult to model, the auditory periphery is much better investigated. This stage performs the transformation from acoustic pressure waves to release events of the auditory nerve fibers. Out of the several models simulating the auditory periphery, we apply the popular and widely analyzed model of Meddis [18], for which simulated hearing profiles of real hearing-impaired listeners exist [17].

The auditory periphery consists of the outer ear, the middle ear, and the inner ear. The main task of the outer ear is collecting sound waves and directing them further into the ear. At the back end of the outer ear, the eardrum transmits vibrations to the stapes in the middle ear and then further to the cochlea in the inner ear. Inside the cochlea, a traveling wave deflects the basilar membrane at specific locations dependent on the stimulating frequencies. On the basilar membrane, inner hair cells are activated by the velocity of the membrane and evoke spike emissions (neuronal activity) in the auditory nerve fibers.

The auditory model of Meddis [18] is a cascade of several consecutive modules, which emulate the spike firing process of multiple auditory nerve fibers. A block diagram of this model can be seen in Fig. 2. Since auditory models use filter banks, the simulated nerve fibers are also called channels within the simulation. Each channel corresponds to a specific point on the basilar membrane. In the standard setting of the Meddis model, 41 channels are examined. As in the human auditory system, each channel has an individual best frequency (center frequency) which defines the frequency that evokes maximum excitation. The best frequencies are equally spaced on a log scale with 100 Hz for the first and 6000 Hz for the 41st channel.
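As a small illustration of the channel layout, the best frequencies can be computed as a geometric sequence between the two stated end points. This sketch assumes that "equally spaced on a log scale" means a constant frequency ratio between neighboring channels; the actual implementation of the Meddis model may place its channels slightly differently. The function name is hypothetical.

```python
import math


def best_frequencies(n_channels=41, f_low=100.0, f_high=6000.0):
    """Return n_channels best frequencies (Hz) with a constant ratio between neighbors."""
    step = math.log(f_high / f_low) / (n_channels - 1)
    return [f_low * math.exp(k * step) for k in range(n_channels)]


bfs = best_frequencies()
ratio = bfs[1] / bfs[0]  # constant factor between neighboring channels
```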

In the last plot of Fig. 3, an exemplary output of the model can be seen. The 41 channels are located on the vertical axis according to their best frequencies. The grayscale indicates the probability of spike emissions (white means high probability). The acoustic stimulus of this example is a harmonic tone which is shown in the first plot of the figure. The first module of the Meddis model corresponds to the middle ear where sound waves are converted into stapes displacement. The resulting output of the sound example is shown in the second plot. The second module emulates the basilar membrane where stapes displacement is transformed into the velocity of the basilar membrane at different locations, implemented by a dual-resonance nonlinear (DRNL) filter bank, a bank of overlapping filters [38]. The DRNL filter bank consists of two asymmetric bandpass filters which are processed in parallel: one linear path and one nonlinear path. The output of the basilar membrane for our sound example can be seen in the third plot of the figure. Next, time-dependent basilar membrane velocities are transformed into time-dependent inner hair cell cilia displacements. Afterwards, these displacements are transformed by a calcium-controlled transmitter release function into spike probabilities p(t,k), the final output of the considered model, where t is the time and k is the channel number. For details about the model equations, the reader is referred to the appendix in [18].

For the auditory model with hearing loss, we consider three examples, called “hearing dummies,” which are described in [16, 17]. These are modified versions of the Meddis auditory model. The goal of the hearing dummies is to mimic the effect of real hearing impairments [39]. In the original proposal [17], channels with best frequencies between 250 Hz and 8 kHz are considered, whereas in the normal-hearing model described above, channel frequencies between 100 Hz and 6 kHz are used. Note that this difference merely reflects the frequency range of interest to the user and is not related to any hearing damage. For a better comparison, the same best frequencies will be taken into account for all models. Since the range between 100 Hz and 6 kHz seems to be more suitable for music, we adjust the three hearing dummies accordingly.

The first hearing dummy simulates a strong mid- and high-frequency hearing loss. In the original model, this is implemented by retaining the channel with the best frequency of 250 Hz only and by disabling the nonlinear path. In our modified version of that dummy, the first ten channels are retained (all of them having best frequencies lower than or equal to 250 Hz), and the nonlinear path is disabled for all of them. The second hearing dummy simulates a mid-frequency hearing loss indicating a clear dysfunction in a frequency region between 1 and 2 kHz. Therefore, we disable 16 channels (channels 17 to 32) for the modified version of the hearing dummy. The third hearing dummy simulates a steep high-frequency loss, which is implemented by disabling all channels with best frequencies above 1750 Hz, corresponding to the last 12 channels in the model. The parameterization of the three hearing dummies is summarized in Table 1.
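The channel selections described in this paragraph can be written down as a short sketch. It encodes only the channel numbers stated in the text; the profile labels and the function name are hypothetical, and the disabling of the nonlinear path for the first dummy is noted in a comment but not modeled.

```python
N_CHANNELS = 41


def active_channels(profile):
    """Return the 1-based numbers of the channels retained for a hearing profile."""
    all_channels = list(range(1, N_CHANNELS + 1))
    if profile == "normal":
        return all_channels
    if profile == "dummy1":
        # First ten channels retained; the nonlinear DRNL path is additionally
        # disabled for these channels, which this sketch does not model.
        return all_channels[:10]
    if profile == "dummy2":
        # Channels 17 to 32 disabled (16 channels).
        return [k for k in all_channels if not 17 <= k <= 32]
    if profile == "dummy3":
        # Last 12 channels disabled.
        return all_channels[:-12]
    raise ValueError(f"unknown profile: {profile}")


retained = {name: active_channels(name)
            for name in ("normal", "dummy1", "dummy2", "dummy3")}
```

Expressing each profile as a list of retained channels fits the channel-wise feature extraction proposed in Section 3, since features of disabled channels can simply be left out.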

Table 1 Parameterization of the three considered hearing dummies and the normal hearing model
