 Research
 Open Access
Discriminative frequency filter banks learning with neural networks
 Teng Zhang^{1} and
 Ji Wu^{1}
https://doi.org/10.1186/s1363601801446
© The Author(s) 2019
 Received: 1 April 2018
 Accepted: 29 November 2018
 Published: 3 January 2019
Abstract
Filter banks on spectrums play an important role in many audio applications. Traditionally, the filters are linearly distributed on a perceptual frequency scale such as the Mel scale. To make the output smoother, these filters are often placed so that they overlap with each other. However, fixed-parameter filters are usually designed in the context of psychoacoustic experiments and selected experimentally. To make filter banks discriminative, the authors use a neural network structure to learn the frequency center, bandwidth, gain, and shape of the filters adaptively when filter banks are used as a feature extractor. This paper investigates several different constraints on discriminative frequency filter banks and the dual spectrum reconstruction problem. Experiments on audio source separation and audio scene classification tasks show performance improvements of the proposed filter banks when compared with traditional fixed-parameter triangular or Gaussian filters on the Mel scale. The classification errors on the LITIS ROUEN dataset and the DCASE2016 dataset are reduced by 13.9% and 4.6% relatively.
Keywords
 Discriminative frequency filter banks
 Networks
 Audio scene classification
 Audio source separation
1 Introduction
1.1 Related work
Filter banks are parameterized in the frequency domain with the frequency center c_{n}, bandwidth s_{n}, gain α_{n}, shape g, and frequency scale p. The result w_{n} is a continuous function defined in the frequency domain. When p is a linear function, filter banks are uniformly distributed in the frequency domain. However, there is a strong desire to analyze audio signals in a way similar to human ears, which requires a nonlinear p and leads to auditory filter banks [18–20]. Based on psychoacoustic experiments, three nonlinear mappings between the frequency and perceptual domains are commonly used: the Bark scale [21], ERB scale [22], and Mel scale [23]. The parameters α_{n}, c_{n}, and s_{n} in Eq. 1 represent the frequency properties of w_{n}, which simulate the frequency selectivity of human ears. In [16], g is selected as a Gaussian function because of its smoothness and tractability; correspondingly, the Mel filter banks use triangular filters [17]. When g is totally independent and not limited to any specific shape, w_{n} for each filter can be parameterized as a fully connected mapping from all frequency bins to a value.
Auditory filters of different shapes have been trained discriminatively for robust speech recognition [24]. Filter banks can also be trained discriminatively using the Fisher discriminant analysis (FDA) method [25]. In recent years, deep neural networks (DNNs) have achieved significant success in the field of audio processing and recognition because of their advantages in discriminative feature extraction. Standard filter banks computed in the time domain have been simulated using an unsupervised convolutional restricted Boltzmann machine (ConvRBM) [26]. The speech recognition performance of ConvRBM features is improved compared to the Mel-frequency cepstral coefficients (MFCCs), and the relative improvements are 5% on the TIMIT test set and 7% on the WSJ0 database using GMM-HMM systems. Discriminative frequency filter banks can also be learned together with the recognition error using a time-convolutional layer and a temporal pooling layer over the raw waveform [27]. The results in [27] show that the filter size and pooling operation play an important role in the performance improvement, but the temporal convolutional operation is time-consuming.
Filter banks implemented in the frequency domain have also been studied with DNNs in recent years. When g in Eq. 1 is parameterized in all frequency bins, and the parameters are restricted to be positive using the exponential function exp [28] or sigmoid [29], filter banks with multiple peaks and complicated shapes are learned for specific tasks. However, further experiments show that the positive constraint is too weak to learn smooth and robust filter banks. When g in Eq. 1 is restricted to a Gaussian shape, the gain, frequency center, and bandwidth in Eq. 1 can be learned using a neural network [30]. The triangular filter shape (commonly used to compute Mel-scale features) is not investigated since it is only piecewise differentiable and difficult to incorporate into the scheme of a backpropagation algorithm.
1.2 Contribution of this paper
In this paper, we use a neural network structure to learn the frequency center, bandwidth, gain, and shape of filter banks adaptively, and investigate several different constraints on filter banks and the dual spectrum reconstruction problem.
This condition means that there are more transformed subband coefficients per second than original data points. In this case, the filter banks are overcomplete [31] and a perfect reconstruction from the subband coefficients is possible. However, in some scenarios, audio reconstruction from incomplete information is necessary because of limited storage and computing resources, especially when the signals are sampled at rates of 44.1 kHz or higher. Speech reconstruction from MFCCs has been studied by predicting the fundamental frequency and voicing of a frame as an intermediate step [32–34]. The simplest case is that n_{i} in Eq. 2 equals the frame length N, which is equivalent to the filter banks implemented in the frequency domain in this paper.
As shown in Eq. 1, when filter banks are parameterized and learned using neural networks, a major concern is the constraint on the shape of their responses in the frequency range. When the constraint is weak [28, 29], the number of parameters is too large to learn smooth and robust filter banks in some scenarios. When the constraint is a basic shape function and this function is only piecewise differentiable, such as the triangular shape [30], the model cannot be trained using a backpropagation algorithm.
At the same time, the subband processing module in Fig. 1 may introduce distortions, particularly if the subbands are not equally processed. In this case, signal reconstruction in the frequency domain is not analytical.

Approximate continuous shape function: shape constraints play an important role in discriminative frequency filter banks. Few investigations have been conducted to compare different shape constraints, because commonly used shapes such as triangular shapes are only piecewise differentiable. We use steep sigmoid functions and other basic functions to approximate the desired shapes. This makes a further study of shape constraints possible.

Comparison of different constraints: in Eq. 1, different selections of trainable parameters can result in different implementations of filter banks. In this paper, we select six different constraints to investigate the conditions under which each is applicable. When all parameters are constant, we adopt triangular and Gaussian shapes whose frequency centers are distributed uniformly on the Mel-frequency scale. For weak constraints, we conduct experiments similar to [28, 29]. For strong constraints, both Gaussian and triangular constraints are used to train the frequency center, bandwidth, and gain in Eq. 1.

Reconstruction from incomplete filter bank coefficients: in this paper, the number of filter bank coefficients is much smaller than the number of original data points, so the reconstruction can be seen as a process of solving overdetermined linear equations. We use a neural network to implement this reconstruction process, and a well-designed regularization method is used to make sure that the filter banks are bounded-input, bounded-output (BIBO) stable.
The paper is organized as follows. The next section briefly describes the Mel-frequency scale used in this paper and introduces the uniformly distributed filter banks with constant parameters as the baseline. Section 3 introduces the analytical and experimental settings of our proposed filter bank learning framework. Then, the network structures used in our proposed methods are introduced in Section 4. Section 5 conducts several experiments to show the performance of discriminative frequency filter banks in terms of source separation and audio scene classification tasks. Finally, we conclude our paper and give directions for future work in Section 6.
2 Background
Filter banks are used to model the frequency selectivity of an auditory system in many applications. Traditionally, the design of filter banks is motivated by psychoacoustic experiments, such as the detection of tones in noise maskers [35], or by physiological experiments such as observing the mechanical responses of the cochlea when a sound reaches the ear [36, 37]. The frequency center, bandwidth, and energy gain in the frequency response of filter banks are consistent with the position and vibration patterns in the ear. In the history of auditory filter banks [35], the rounded exponential family [38] and the gammatone family [39] are the most widely used families. We use the simplest form of these two families, the triangular case for the rounded exponential family and the Gaussian case for the gammatone family, to construct our filter banks in the frequency domain. In this section, we introduce the commonly used Mel-frequency filter banks.
2.1 Mel-frequency scale
2.2 Mel-frequency filter banks
The commonly used MFCC features in the field of speech recognition are computed based on Mel-frequency filter banks. It is a common practice to construct filters distributed uniformly on the Mel-frequency scale, and the bandwidth is often 50% overlapped between neighboring filters.
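As a minimal sketch of this construction, the following pure-Python code places triangular filters uniformly on the Mel scale so that neighboring filters overlap by 50%. It assumes the common conversion mel = 2595 log10(1 + f/700), which may differ from the exact form used in Section 2.1; the function names and the 16 kHz sample rate in the example are illustrative choices.

```python
import math

def hz_to_mel(f_hz):
    """Common Mel-scale mapping (an assumed variant, not taken from the paper)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def triangular_mel_filter_bank(num_filters, num_bins, sample_rate):
    """Return num_filters rows of num_bins weights; bins span 0..sample_rate/2."""
    max_mel = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 equally spaced Mel points; filter n spans points n-1, n, n+1,
    # so each filter shares half of its support with each neighbor (50% overlap).
    edges_hz = [mel_to_hz(max_mel * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bin_hz = [k * (sample_rate / 2.0) / (num_bins - 1) for k in range(num_bins)]
    bank = []
    for n in range(1, num_filters + 1):
        left, center, right = edges_hz[n - 1], edges_hz[n], edges_hz[n + 1]
        row = []
        for f in bin_hz:
            if left < f <= center:
                row.append((f - left) / (center - left))    # rising edge
            elif center < f < right:
                row.append((right - f) / (right - center))  # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank

# Example: 64 filters over 513 FFT bins (frame length 1024) at an assumed 16 kHz rate.
bank = triangular_mel_filter_bank(num_filters=64, num_bins=513, sample_rate=16000)
```

Each row of `bank` can be applied to a magnitude spectrum as a dot product, which is the fixed-parameter baseline that the learned filter banks are compared against.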
3 Discriminative filter bank learning
The input audio signal is first transformed to a sequence of vectors using the STFT. The STFT result can be represented as X_{1...T}={x_{1},x_{2},...,x_{T}}. T is determined by the frame shift in the STFT, corresponding to the time resolution in frame theory [42]. The dimension of each vector x is denoted N, which is determined by the frame length.
The discriminative frequency filter banks in Fig. 2 can be simplified as linear transformations f_{θ}. The output of this module can be represented as Y_{1...T}={f_{θ}(x_{1}),f_{θ}(x_{2}),...,f_{θ}(x_{T})}. θ are the parameters of the filter banks, defined similarly to Eq. 1. The dimension of each y_{t}=f_{θ}(x_{t}) is equal to M, which is the number of filters.
The backend application modules in Fig. 2 vary across applications. For the audio scene classification task, they are deep convolutional neural networks followed by a softmax layer that converts the feature maps to the corresponding categories. For the audio source separation task, however, the modules are composed of a binary gating layer and some spectrogram reconstruction layers. We simplify all these situations and define the backend application modules as nonlinear functions f_{β}. The filter bank parameters θ can be trained jointly with the backend parameters β using backpropagation.

Shape constraint: in this case, the amplitude of the filter's frequency response is constrained to a specific shape, and only the frequency center, bandwidth, and gain of the filter remain to be trained. The Gaussian shape has been investigated in [16, 30]. We focus on the piecewise differentiable case, such as the triangular shape.

Positive constraint: when all the weights of filters are independent but only constrained to be positive, more complicated filter banks can be learned. Exponential functions such as exp [28] and sigmoid [29] have been used together with a bandwidth constraint for the filters. We investigate two new positive constraints ReLU and square, and discuss their performances associated with the bandwidth constraint.
3.1 Shape constraints of discriminative frequency filter banks
Triangular filters are commonly used to compute Mel-scale filter bank features in many audio applications such as speech recognition. However, when we use the triangular shape described in Eq. 4 to restrict the discriminative frequency filter banks in Fig. 2, the backward propagation process is blocked because the derivative of the triangular shape is discontinuous at its corner points.
The trainable parameters in Eq. 9 are the frequency center c_{n}, bandwidth s_{n}, and gain α_{n}. The goal of the training procedure is to minimize some objective loss ε. The derivative of an objective loss given trainable parameters can be calculated by backpropagating error gradients.
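To illustrate how a piecewise-linear triangle can be replaced by a smooth surrogate with well-defined gradients everywhere, the sketch below builds one from steep sigmoid functions. This is an assumed construction for illustration only and is not necessarily the form of Eq. 9: it uses y·sigmoid(k·y) as a smooth stand-in for max(0, y) and u·tanh(k·u/2) = u·(2·sigmoid(k·u) − 1) as a stand-in for |u|, where the steepness k is a hypothetical parameter.

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def smooth_abs(u, k):
    """Smooth approximation of |u|; identical to u * (2 * sigmoid(k * u) - 1)."""
    return u * math.tanh(0.5 * k * u)

def smooth_relu(y, k):
    """Smooth approximation of max(0, y) built from a steep sigmoid gate."""
    return y * sigmoid(k * y)

def exact_triangle(f, center, bandwidth, gain):
    """Reference triangle: peak `gain` at `center`, zero beyond bandwidth / 2."""
    u = (f - center) / (0.5 * bandwidth)
    return gain * max(0.0, 1.0 - abs(u))

def smooth_triangle(f, center, bandwidth, gain, k=50.0):
    """Differentiable surrogate of exact_triangle (assumed form, steepness k)."""
    u = (f - center) / (0.5 * bandwidth)
    return gain * smooth_relu(1.0 - smooth_abs(u, k), k)

# Largest deviation from the exact triangle on a grid of frequencies (in Hz).
grid = [i * 0.25 for i in range(0, 801)]
max_err = max(abs(smooth_triangle(f, 100.0, 40.0, 1.0)
                  - exact_triangle(f, 100.0, 40.0, 1.0)) for f in grid)
```

Larger values of k bring the surrogate closer to the exact triangle at the cost of steeper gradients near the corners, so k trades approximation accuracy against training stability.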
3.2 Positive constraint of discriminative frequency filter banks

Exponent: for every parameter w_{ij}, we make it positive by transforming it to v_{ij}=exp(w_{ij}) [28]. If w_{ij}∼N(μ,σ), v_{ij} follows the log-normal distribution, where the mean of v_{ij} is \(e^{\mu +\frac {\sigma ^{2}}{2}}\) and the variance of v_{ij} is \(\left (e^{\sigma ^{2}}-1\right)e^{2\mu +\sigma ^{2}}\).

Sigmoid: for every parameter w_{ij}, we use the sigmoid function \(v_{ij}=\frac {1}{1+{\text {exp}}(-w_{ij})} \) [29] to keep the parameters positive. If w_{ij}∼N(μ,σ), v_{ij} follows a logit-normal distribution, whose moments have no analytical form, but numerical results have been discussed in [43].

ReLU: for every parameter w_{ij}, we simply set v_{ij}=0 when w_{ij}<0 and v_{ij}=w_{ij} when w_{ij}≥0. This will lead to a folded normal distribution. When w_{ij}∼N(μ,σ), the mean of v_{ij} is \(\sigma \sqrt {\frac {2}{\pi }}e^{-\frac {\mu ^{2}}{2\sigma ^{2}}}\) and the variance of v_{ij} is μ^{2}+σ^{2}−[mean(v_{ij})]^{2}.

Square: the last option to make the parameters positive is \(v_{ij}=w_{ij}^{2}\). Then, v_{ij} is a variable following a chi-squared distribution. The mean of v_{ij} is σ^{2}(1+μ^{2}), and the variance of v_{ij} is σ^{4}(2+4μ^{2}).

Exponent: mean = 1.0, variance = 0.01.

Sigmoid: mean = 0.5, variance ≈ 0.01.

ReLU: mean ≈ 0.08, variance ≈ 0.01.

Square: mean ≈ 0.01, variance ≈ 0.0002.
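The exponent and square values listed above can be checked from the closed-form moments. The sketch below assumes that the second argument of N(0, 0.1) denotes the standard deviation σ, an interpretation inferred from the listed values rather than stated explicitly; the function names are illustrative.

```python
import math

def lognormal_moments(mu, sigma):
    """Mean and variance of exp(w) for w ~ N(mu, sigma), sigma = standard deviation."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    var = (math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2)
    return mean, var

def zero_mean_square_moments(sigma):
    """Mean and variance of w**2 for w ~ N(0, sigma).

    For zero mean, w**2 is sigma**2 times a chi-squared variable with one
    degree of freedom, which has mean 1 and variance 2.
    """
    return sigma ** 2, 2.0 * sigma ** 4

# Assumed initialization: mu = 0, sigma = 0.1.
exp_mean, exp_var = lognormal_moments(0.0, 0.1)
sq_mean, sq_var = zero_mean_square_moments(0.1)
print(exp_mean, exp_var)
print(sq_mean, sq_var)
```

Under this assumption the exponent constraint keeps almost all weights close to 1, whereas the square constraint keeps them close to 0, which is the contrast the following discussion relies on.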
In this section, we consider two variants of discriminative frequency filter banks. If the frequency center c_{n} and bandwidth s_{n} in Eq. 1 are constant and the filter weights are constrained to be positive, the filter weights are limited to the range of the bandwidth. All the above distributions can be good solutions. Another case is that the filter weights are totally independent. In this case, the resulting distributions of the exponent and sigmoid constraints mean that most filter weights are not zero, which violates the physical meaning of filter banks. In order to fulfill the physical meaning, the distribution of the transformed weights should be close to N(0.1,0.01), which is approximately calculated using the Mel-frequency triangular filter banks defined in Section 2.2. The inverse calculation of these positive transformations shows that when the parameters are initialized as w∼N(−3.0,2.0), the exponent and sigmoid constraints may result in meaningful distributions.
Thus, when the filter banks are constrained by constant bandwidths and frequency centers, all these positive constraints are suitable. But when the filter weights are totally independent, only the ReLU and square constraints are suitable, unless we perform elaborate initialization for the different positive transformations. Our experiments in Section 5.3 support this conclusion.
3.3 Reconstruction from filter bank coefficients
In the traditional design of filter banks as in Fig. 1, the completeness of filter banks is determined by the number of filters M and the channel decimation rate n_{k}. In our proposal of discriminative frequency filter banks, n_{k} is equivalent to the frame length N. In general, M is less than N in order to reduce the computational cost and extract significant features. In this case, the filter banks are incomplete, and hence perfect spectral reconstruction from the filter bank coefficients is impossible.
In Eq. 13, cond(R) means the condition number of R and ∥·∥ means the Frobenius norm of a matrix.
A large condition number implies that the linear system is ill-conditioned in the sense that small errors in the input can lead to huge errors in the output. So, we modify the reconstruction loss by adding an L2-regularization constraint to keep the linear system stable. This is also known as bounded-input, bounded-output (BIBO) stability [47].

Shape constraint: for the shape constraints in Section 3.1, parameters such as the frequency center c_{n} and bandwidth s_{n} do not contribute to the regularization. Regularization of the gain α_{n} should be added up across the bandwidth.

Positive constraint: for the positive constraints in Section 3.2, all parameters contribute to the regularization. The positive weights v_{ij} should replace the filter bank parameters w_{ij} when calculating the regularization, but the regularization of the reconstruction parameters r_{ij} remains unchanged.
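The positive-constraint case can be sketched as follows. This is a hedged illustration rather than the paper's Eq. 13: it assumes a squared reconstruction error plus an L2 (squared Frobenius norm) penalty on the positive weights v_{ij}=w_{ij}^{2} and on the reconstruction weights r_{ij}, with a hypothetical coefficient `lam`. The helper names and the toy matrices are illustrative.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def frobenius_sq(A):
    """Squared Frobenius norm of a matrix."""
    return sum(a * a for row in A for a in row)

def reconstruction_loss(W, R, x, lam=1e-4):
    """Squared reconstruction error plus an assumed L2 penalty.

    W: M x N unconstrained filter bank parameters w_ij
    R: N x M reconstruction weights r_ij
    x: N-bin spectrum
    """
    V = [[w * w for w in row] for row in W]   # square constraint: v_ij = w_ij**2
    y = matvec(V, x)                          # M filter bank coefficients
    x_hat = matvec(R, y)                      # reconstructed N-bin spectrum
    err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    # The penalty uses the positive weights V, not W, and leaves R as it is.
    return err + lam * (frobenius_sq(V) + frobenius_sq(R))

# Toy example with N = 4 bins and M = 2 filters.
W = [[1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
R = [[0.5, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 0.5]]
loss_flat = reconstruction_loss(W, R, [2.0, 2.0, 4.0, 4.0])   # reconstructed exactly
loss_rough = reconstruction_loss(W, R, [1.0, 3.0, 4.0, 4.0])  # detail within a band is lost
```

In the toy example, a spectrum that is constant within each band is reconstructed exactly, so only the penalty remains, while variation inside a band cannot be recovered from M < N coefficients.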
3.4 Reconstruction vs classification
For spectrum reconstruction-related tasks as described in Eq. 11, the output size of the reconstruction system is NT, where N is the FFT length and T is the number of frames. Thus, the number of equations in optimizing the reconstruction matrix R and filter bank matrix F is DNT, where D is the number of audio samples. Meanwhile, for positive constraints, the number of parameters in R and F is about 2NM, where M is the number of filters. For shape constraints, the number of parameters is about 3M+NM. M is usually much less than DT, so the reconstruction can usually be seen as a process of solving overdetermined linear equations.
Correspondingly, when the output of the filter banks is followed by a classifier, the number of equations in solving the classification task is DC, where C is the number of classes. The number of parameters is MN+MC for positive constraints, and 3M+MC for shape constraints. In some small-scale applications, DC is less than MN. The classification is then equivalent to solving underdetermined linear equations for positive constraints. Overfitting is a notorious issue in this scenario. This phenomenon can be seen in Section 5.5.
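The counting argument above can be made concrete with a few lines of arithmetic. The sketch below uses N = 513, T = 128, and M = 64 from the experimental setup in Section 5.2, while D = 1000 and C = 15 are hypothetical values chosen only to illustrate the two regimes; the function names are illustrative.

```python
def reconstruction_counts(D, N, T, M, constraint):
    """Return (number of equations, approximate number of parameters)."""
    equations = D * N * T
    params = 2 * N * M if constraint == "positive" else 3 * M + N * M
    return equations, params

def classification_counts(D, C, N, M, constraint):
    """Return (number of equations, approximate number of parameters)."""
    equations = D * C
    params = M * N + M * C if constraint == "positive" else 3 * M + M * C
    return equations, params

D, N, T, M, C = 1000, 513, 128, 64, 15   # D and C are assumed for illustration

rec_eq, rec_par = reconstruction_counts(D, N, T, M, "positive")
cls_eq_pos, cls_par_pos = classification_counts(D, C, N, M, "positive")
cls_eq_shape, cls_par_shape = classification_counts(D, C, N, M, "shape")

print(rec_eq, rec_par)              # 65664000 equations vs 65664 parameters
print(cls_eq_pos, cls_par_pos)      # 15000 equations vs 33792 parameters
print(cls_eq_shape, cls_par_shape)  # 15000 equations vs 1152 parameters
```

With these numbers, reconstruction is heavily overdetermined, classification with positive constraints is underdetermined, and classification with shape constraints is overdetermined again, which matches the overfitting concern raised for positive constraints.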
4 Model description
As described in Section 3, the discriminative frequency filter banks proposed here can be integrated into a neural network (NN) structure. The parameters of the models are learned jointly with the target of a specific task. In this section, we introduce two NN-based structures, one for audio source separation and one for audio scene classification.
4.1 Audio source separation
4.2 Audio scene classification
In Fig. 4b, a feature extraction structure including the discriminative frequency filter banks is proposed to systematically train the overall recognizer in a manner consistent with the minimization of recognition errors.
5 Experiments
To illustrate the properties and performance of the discriminative frequency filter banks proposed in this paper, we conduct three experiments respectively on spectrum reconstruction, audio source separation and audio scene classification tasks. In the first experiment, several groups of comparisons are made on reconstruction errors to verify the assumption and conclusion we proposed in Section 3. Moreover, we have two more experiments to test the applications of the discriminative frequency filter banks to audio source separation and audio scene classification tasks.
5.1 Filter bank settings

TriFB: frequency centers of the filters are distributed uniformly on the Mel-frequency scale, bandwidths are 50% overlapped between neighboring filters, the gain is 1, and the shape is restrained with Eq. 4.

GaussFB: frequency centers of the filters are distributed uniformly on the Mel-frequency scale, bandwidths are 4σ of a Gaussian distribution as in Eq. 5, the gain is 1, and the shape is restrained with Eq. 5.

TriFBDN: in order to achieve a fair comparison with TriFB, the initialization of the frequency centers, bandwidths, and gain of the filters are the same as TriFB, the shape is restrained with Eq. 9, and the gain and bandwidths are guaranteed to be positive with a square constraint described in Section 3.2.

GaussFBDN: in order to achieve a fair comparison with GaussFB, the initialization of the frequency centers, bandwidths, and gain of the filters are the same as GaussFB, the shape is restrained with Eq. 5. Other settings are the same as TriFBDN.

BandPosFBDN: frequency centers and bandwidths are the same as GaussFB, all parameters are initialized using N(0,0.1), and are guaranteed to be positive with the square constraint described in Section 3.2. The shape is not restrained.

PosFBDN: the parameters are initialized using N(0,0.1) and are guaranteed to be positive with the square constraint described in Section 3.2. There are no constraints for the frequency centers, bandwidths, and shape of the filters.
5.2 Dataset and experimental setup
In this section, we employ three datasets to conduct the experiments. MIR1K dataset [51] is utilized to implement the spectrum reconstruction and audio source separation experiments. LITIS ROUEN [52] and DCASE2016 [53] datasets are used for audio scene classification experiments.

MIR1K dataset: this dataset consists of 1000 song clips recorded at a sample rate of 16,000 Hz, with durations ranging from 4 to 13 s. The dataset is then utilized with four training/testing splits. In each split, 700 examples are randomly selected for training and the others for testing. We use the mean average accuracy over the four splits as the evaluation criterion.

LITIS ROUEN dataset: this is the largest publicly available dataset for ASC to the best of our knowledge. The dataset contains about 1500 min of audio scene recordings belonging to 19 classes. Each audio recording is divided into 30-s examples without overlapping, thus obtaining 3026 examples in total. The sampling frequency of the audio is 22,050 Hz. The dataset is provided with 20 training/testing splits. In each split, 80% of the examples are kept for training and the other 20% for testing. We use the mean average accuracy over the 20 splits as the evaluation criterion.

DCASE2016 dataset: the dataset is released as task 1 of the DCASE2016 challenge. We use the development data in this paper. The development data contains about 585 min of audio scene recordings belonging to 15 classes. Each audio recording is divided into 30-s examples without overlapping, thus obtaining 1170 examples in total. The sampling frequency of the audio is 44,100 Hz. The dataset is divided into four folds. Our experiments follow this setting, and the average performance is reported.
In all experiments, the audio signal is first transformed using STFT with the frame length of 1024 and the frame shift of 10 ms, so the size of audio spectrums is 513×128. The minibatch size is set to be 50, and the learning rate is initialized with 0.001.
In our audio source separation experiments, the number of discriminative filters is set to 64; other parameters are set as described in Section 4.1. When spectrum reconstruction is needed, the regularization coefficient is set to 0.0001. Training is done using the Adam [54] update method and is stopped after 500 training epochs.
In our audio scene classification experiments, the number of discriminative filters is also set to be 64. For both LITIS ROUEN and DCASE2016 datasets, we use rectified linear units; the window sizes of convolutional layers are 64×2×64, 64×3×64, and 64×4×64, and the fully connected layers are 196×128×19(15). For DCASE2016 dataset, we use the dropout rate of 0.5. Training is done using the Adam update method and is stopped after 100 training epochs.
5.3 Properties of discriminative frequency filter banks
Reconstruction SDR (dB) under different positive constraints
Initialization  Constraint  M = 32 (FB)  M = 32 (TI)  M = 64 (FB)  M = 64 (TI)
N(0, 0.1)       Exponent    8.57         6.72         13.10        6.74
N(0, 0.1)       Sigmoid     8.54         8.89         12.84        9.43
N(0, 0.1)       ReLU        8.45         14.44        12.84        18.54
N(0, 0.1)       Square      8.57         14.44        12.84        18.24
N(−3.0, 2.0)    Exponent    9.02         14.44        13.25        17.70
N(−3.0, 2.0)    Sigmoid     8.36         14.31        12.84        17.97
Audio scene classification performance under different positive constraints
Initialization  Constraint  FB Accuracy  FB MCC  TI Accuracy  TI MCC
N(0, 0.1)       Exponent    77.21        75.85   71.20        69.48
N(0, 0.1)       Sigmoid     77.87        76.60   70.75        68.97
N(0, 0.1)       ReLU        77.37        76.05   76.92        75.59
N(0, 0.1)       Square      77.89        76.63   75.96        74.62
N(−3.0, 2.0)    Exponent    77.97        76.70   73.52        72.04
N(−3.0, 2.0)    Sigmoid     77.16        75.82   71.44        69.81
Reconstruction SDR (dB) with/without regularization
Method       M = 32, R = T  M = 64, R = T  M = 64, R = F
TriFB        8.45           13.01          12.92
GaussFB      8.12           12.44          12.44
TriFBDN      10.32          14.69          13.28
GaussFBDN    9.55           12.92          12.22
BandPosFBDN  8.57           12.84          12.15
PosFBDN      14.44          18.24          17.21
5.4 Audio source separation
In this experiment, we investigate the application of discriminative frequency filter banks to audio source separation tasks using the MIR1K dataset. We separate the music from a mixture of vocals and music using the structure in Fig. 4a.
Reconstruction SDR (dB) of audio source separation. M/V represents the energy ratio between music and voice
Method       M/V = 0.1  M/V = 1  M/V = 10
TriFB        4.47       8.30     12.01
GaussFB      4.85       8.39     12.22
TriFBDN      5.19       8.51     12.92
GaussFBDN    5.13       8.45     13.01
BandPosFBDN  5.33       8.39     12.84
PosFBDN      5.70       9.14     16.99
5.5 Audio scene classification (ASC)
When filter banks are used as a feature extractor, the filter banks proposed in this paper can extract more salient features. In this section, we apply the discriminative frequency filter banks to the ASC task. The NN structure is implemented as in Fig. 4b. We employ the LITIS ROUEN and DCASE2016 datasets in our experiments.
In the data preprocessing step, we first divide a 30-s example into 1-s clips with 50% overlap. Then each clip is processed as in Fig. 2 for feature extraction. The classification results of all these clips are averaged to get an ensemble result for the 30-s example.
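The clip-level ensemble can be sketched as follows. The sketch assumes that "averaged" means averaging per-clip class probabilities, which is one plausible reading (majority voting would be another); the function names and the toy probabilities are illustrative.

```python
def clip_starts(example_len_s, clip_len_s, overlap):
    """Start times (in seconds) of fixed-length clips with fractional overlap."""
    hop = clip_len_s * (1.0 - overlap)
    starts = []
    k = 0
    # Keep every clip that fits entirely inside the example.
    while k * hop + clip_len_s <= example_len_s + 1e-9:
        starts.append(k * hop)
        k += 1
    return starts

def ensemble_predict(clip_probs):
    """Average per-clip class probabilities; return (class index, averages)."""
    n_classes = len(clip_probs[0])
    avg = [sum(p[c] for p in clip_probs) / len(clip_probs)
           for c in range(n_classes)]
    best = max(range(n_classes), key=avg.__getitem__)
    return best, avg

# A 30-s example cut into 1-s clips with 50% overlap yields 59 clips.
starts = clip_starts(30.0, 1.0, 0.5)

# Toy two-class example with three clips.
label, avg = ensemble_predict([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
```

Averaging over many overlapping clips smooths out clips that are individually ambiguous, at the cost of running the classifier 59 times per 30-s example.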
Performance comparison on LITIS ROUEN dataset
Method        Accuracy  F-measure  MCC    Error
TriFB         96.24     96.19      96.01  3.76
GaussFB       96.33     96.44      96.11  3.67
TriFBDN       96.61     96.50      96.39  3.39
GaussFBDN     96.83     96.71      96.63  3.17
BandPosFBDN   96.84     96.71      96.64  3.16
PosFBDN       96.15     96.04      95.91  3.85
CNNGam [9]    95.8      95.8       –      4.2
CNNMFCC [9]   94.0      93.7       –      6.0
CNNLog [9]    95.1      95.0       –      4.9
RNNGam [8]    96.4      96.6       –      3.6
RNNMFCC [8]   95.4      95.8       –      4.6
RNNLog [8]    95.9      96.2       –      4.1
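The relative error reduction quoted in the abstract for LITIS ROUEN appears consistent with comparing the best fixed filter bank (GaussFB, 3.67% error) with the best learned one (BandPosFBDN, 3.16% error). This pairing is an inference from the table rather than something stated explicitly, and the function name is illustrative.

```python
def relative_error_reduction(baseline_error, new_error):
    """Relative reduction of the error rate, in percent."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# GaussFB (3.67) versus BandPosFBDN (3.16); the pairing is assumed.
litis_reduction = relative_error_reduction(3.67, 3.16)
print(round(litis_reduction, 1))  # 13.9
```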
Performance comparison on DCASE2016 dataset
6 Conclusion
The construction of discriminative frequency filter banks that can be learned by neural networks has been presented in this paper. The filter banks are implemented on FFT-based spectrums and can be constrained under different conditions to express different aspects of physical meaning. For shape-related constraints, a piecewise differentiable triangular shape is approximated using several differentiable basic functions. For positive constraints, ReLU and square constraints are proposed to fulfill the demand on the probability distribution of weights. Then, a spectrum reconstruction method from incomplete filter bank coefficients is implemented using neural networks. A well-designed regularization strategy is also studied to guarantee that the filter banks are BIBO-stable. Overall, this paper provides a practical and complete framework to learn discriminative frequency filter banks for different tasks.
The discriminative frequency filter banks proposed in this paper are compared with traditional fixed-parameter filter banks in several experiments. The results show performance improvements for both music reconstruction and audio classification tasks. However, not all variants of discriminative frequency filter banks are suitable for all situations. In our experiments, positive-constrained filter banks perform best on music reconstruction tasks, and shape-constrained filter banks obtain the best results on ASC tasks.
Discriminative frequency filter banks on FFT-based spectrums are able to obtain adaptive resolution in the frequency domain. To achieve adaptive resolution in the time domain, future work will include introducing temporal information into filter banks; for example, the filter banks may span several frames. We will also perform cross-domain experiments, learning filter banks on one dataset and using them for classification tasks on another dataset, to see whether generalized filter banks can be learned as done in [55].
Declarations
Acknowledgments
Not applicable.
Funding
This work was partly funded by National Natural Science Foundation of China (Grant No. 61571266).
Availability of data and materials
The datasets analysed during the current study are available in the MIR1K repository, http://sites.google.com/site/unvoicedsoundseparation/mir1k/, LITIS ROUEN repository, https://sites.google.com/site/alainrakotomamonjy/home/audioscene, and DCASE2016 repository, http://www.cs.tut.fi/sgn/arg/dcase2016/download.
Authors’ contributions
TZ designed the core methodology of the study, carried out the implementation and experiments, and drafted the manuscript. JW participated in the study and helped to draft the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
 J. Allen, Short term spectral analysis, synthesis, and modification by discrete fourier transform. IEEE Trans. Acoust. Speech Signal Process.25(3), 235–238 (1977).View ArticleGoogle Scholar
 I. Daubechies, The wavelet transform, timefrequency localization and signal analysis. IEEE Trans. Inf. Theory. 36(5), 961–1005 (1990).MathSciNetView ArticleGoogle Scholar
 S. Akkarakaran, P. Vaidyanathan, in Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference On, vol 3. New results and open problems on nonuniform filterbanks (IEEEPiscataway, 1999), pp. 1501–1504.Google Scholar
 A. Biem, S. Katagiri, B. H. Juang, in Neural Networks for Processing [1993] III. Proceedings of the 1993 IEEESP Workshop. Discriminative feature extraction for speech recognition (IEEEPiscataway, 1993), pp. 392–401.View ArticleGoogle Scholar
 Á de la Torre, A. M. Peinado, A. J. Rubio, V. E. Sánchez, J. E. Diaz, An application of minimum classification error to feature space transformations for speech recognition. Speech Comm. 20(34), 273–290 (1996).View ArticleGoogle Scholar
 N. Chen, Y. Qian, H. Dinkel, B. Chen, K. Yu, in INTERSPEECH. Robust deep feature for spoofing detection—the sjtu system for asvspoof 2015 challenge (International Speech Communication Association (ISCA)Dresden, 2015), pp. 2097–2101.Google Scholar
 Y. Qian, N. Chen, K. Yu, Deep features for automatic spoofing detection. Speech Comm. 85:, 43–52 (2016).View ArticleGoogle Scholar
 H. Phan, P. Koch, F. Katzberg, M. Maass, R. Mazur, A. Mertins, Audio scene classification with deep recurrent neural networks. arXiv preprint arXiv:1703.04770 (2017).Google Scholar
 H. Phan, L. Hertel, M. Maass, P. Koch, R. Mazur, A. Mertins, Improved audio scene classification based on labeltree embeddings and convolutional neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 25(6), 1278–1290 (2017).View ArticleGoogle Scholar
 B. Gao, W. Woo, L. Khor, Cochleagrambased audio pattern separation using twodimensional nonnegative matrix factorization with automatic sparsity adaptation. J. Acoust. Soc. Am.135(3), 1171–1185 (2014).View ArticleGoogle Scholar
 J. Le Roux, E. Vincent, Consistent Wiener filtering for audio source separation. IEEE Signal Process Lett. 20(3), 217–220 (2013).
 P. Majdak, P. Balazs, W. Kreuzer, M. Dörfler, in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference On. A time-frequency method for increasing the signal-to-noise ratio in system identification with exponential sweeps (IEEE, Piscataway, 2011), pp. 3812–3815.
 D. L. Donoho, De-noising by soft-thresholding. IEEE Trans. Inf. Theory. 41(3), 613–627 (1995).
 R. O. Duda, P. E. Hart, D. G. Stork, Pattern classification (Wiley, New York, 1973).
 A. Biem, S. Katagiri, in Acoustics, Speech, and Signal Processing, 1993. ICASSP-93., 1993 IEEE International Conference On, vol 2. Feature extraction based on minimum classification error/generalized probabilistic descent method (IEEE, Piscataway, 1993), pp. 275–278.
 A. Biem, S. Katagiri, E. McDermott, B. H. Juang, An application of discriminative feature extraction to filter-bank-based speech recognition. IEEE Trans. Speech Audio Process. 9(2), 96–110 (2001).
 S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoustics Speech Signal Process. 28(4), 357–366 (1980).
 V. Hohmann, Frequency analysis and synthesis using a gammatone filterbank. Acta Acustica U. Acustica. 88(3), 433–442 (2002).
 T. Irino, R. D. Patterson, A dynamic compressive gammachirp auditory filterbank. IEEE Trans. Audio Speech Lang. Process. 14(6), 2222–2232 (2006).
 E. A. Lopez-Poveda, R. Meddis, A human nonlinear cochlear filterbank. J. Acoust. Soc. Am. 110(6), 3107–3118 (2001).
 E. Zwicker, E. Terhardt, Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. J. Acoust. Soc. Am. 68(5), 1523–1525 (1980).
 B. R. Glasberg, B. C. Moore, Derivation of auditory filter shapes from notched-noise data. Hear. Res. 47(1), 103–138 (1990).
 R. P. Lippmann, Speech recognition by machines and humans. Speech Commun. 22(1), 1–15 (1997).
 B. Mak, Y. C. Tam, R. Hsiao, in Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP’03). 2003 IEEE International Conference On, vol 2. Discriminative training of auditory filters of different shapes for robust speech recognition (IEEE, Piscataway, 2003), p. 45.
 T. Kobayashi, J. Ye, in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference On. Discriminatively learned filter bank for acoustic features (IEEE, Piscataway, 2016), pp. 649–653.
 H. B. Sailor, H. A. Patil, in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference On. Filterbank learning using convolutional restricted Boltzmann machine for speech recognition (IEEE, Piscataway, 2016), pp. 5895–5899.
 T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, O. Vinyals, in INTERSPEECH. Learning the speech front-end with raw waveform CLDNNs (International Speech Communication Association (ISCA), Dresden, 2015), pp. 2097–2101.
 T. N. Sainath, B. Kingsbury, A. R. Mohamed, B. Ramabhadran, in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop On. Learning filter banks within a deep neural network framework (IEEE, Piscataway, 2013), pp. 297–302.
 H. Yu, Z. H. Tan, Y. Zhang, Z. Ma, J. Guo, DNN filter bank cepstral coefficients for spoofing detection. IEEE Access. 5, 4779–4787 (2017).
 H. Seki, K. Yamamoto, S. Nakagawa, in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference On. A deep neural network integrated with filterbank learning for speech recognition (IEEE, Piscataway, 2017), pp. 5480–5484.
 S. Strahl, A. Mertins, Analysis and design of gammatone signal models. J. Acoust. Soc. Am. 126(5), 2379–2389 (2009).
 B. Milner, X. Shao, Prediction of fundamental frequency and voicing from mel-frequency cepstral coefficients for unconstrained speech reconstruction. IEEE Trans. Audio Speech Lang. Process. 15(1), 24–33 (2007).
 D. Chazan, R. Hoory, G. Cohen, M. Zibulski, in Acoustics, Speech, and Signal Processing, 2000. ICASSP’00. Proceedings. 2000 IEEE International Conference On, vol 3. Speech reconstruction from mel frequency cepstral coefficients and pitch frequency (IEEE, Piscataway, 2000), pp. 1299–1302.
 B. Milner, X. Shao, in ICSLP. Speech reconstruction from mel-frequency cepstral coefficients using a source-filter model (International Speech Communication Association (ISCA), Denver, 2002), pp. 2421–2424.
 R. F. Lyon, A. G. Katsiamis, E. M. Drakakis, in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium On. History and future of auditory filter models (IEEE, Piscataway, 2010), pp. 3809–3812.
 T. Necciari, N. Holighaus, P. Balazs, Z. Prusa, A perceptually motivated filter bank with perfect reconstruction for audio signal processing. arXiv preprint arXiv:1601.06652 (2016).
 W. A. Yost, R. R. Fay, Auditory perception of sound sources, vol 29 (Springer Science & Business Media, Berlin, 2007).
 S. Rosen, R. J. Baker, A. Darling, Auditory filter nonlinearity at 2 kHz in normal hearing listeners. J. Acoust. Soc. Am. 103(5), 2539–2550 (1998).
 R. Patterson, I. Nimmo-Smith, J. Holdsworth, P. Rice, in a Meeting of the IOC Speech Group on Auditory Modelling at RSRE, vol 2. An efficient auditory filterbank based on the gammatone function (1987).
 S. S. Stevens, J. Volkmann, The relation of pitch to frequency: A revised scale. Am. J. Psychol. 53(3), 329–353 (1940).
 S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, et al., The HTK book. Camb. Univ. Eng. Dept. 3, 175 (2002).
 P. Balazs, M. Dörfler, F. Jaillet, N. Holighaus, G. Velasco, Theory, implementation and applications of nonstationary Gabor frames. J. Comput. Appl. Math. 236(6), 1481–1496 (2011).
 P. Frederic, F. Lad, Two moments of the logit-normal distribution. Commun. Stat. Simul. Comput. 37(7), 1263–1269 (2008).
 M. James, The generalised inverse. Math. Gaz. 62(420), 109–114 (1978).
 A. Ben-Israel, T. N. Greville, Generalized Inverses: Theory and Applications, vol 15 (Springer Science & Business Media, Berlin, 2003).
 R. Hagen, S. Roch, B. Silbermann, C*-algebras and Numerical Analysis (CRC Press, Boca Raton, 2000).
 P. Varaiya, R. Liu, Bounded-input bounded-output stability of nonlinear time-varying differential systems. SIAM J. Control. 4(4), 698–704 (1966).
 X. Zhao, Y. Shao, D. Wang, CASA-based robust speaker identification. IEEE Trans. Audio Speech Lang. Process. 20(5), 1608–1616 (2012).
 Y. Kim, Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
 R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011).
 H. Chao-Ling, J. Shing, R. Jang, MIR Database (2010). http://sites.google.com/site/unvoicedsoundseparation/mir1k/. Accessed 8 Dec 2018.
 A. Rakotomamonjy, G. Gasso, Histogram of gradients of time-frequency representations for audio scene classification. IEEE/ACM Trans. Audio Speech Lang. Process. 23(1), 142–153 (2015).
 A. Mesaros, T. Heittola, T. Virtanen, in Signal Processing Conference (EUSIPCO), 2016 24th European. TUT database for acoustic scene classification and sound event detection (IEEE, Piscataway, 2016), pp. 1128–1132.
 D. Kingma, J. Ba, Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
 H. B. Sailor, H. A. Patil, Novel unsupervised auditory filterbank learning using convolutional RBM for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 24(12), 2341–2353 (2016).
 Q. Kong, I. Sobieraj, W. Wang, M. Plumbley, Deep neural network baseline for DCASE challenge 2016. Tampere University of Technology, Department of Signal Processing. Proceedings of DCASE 2016 (2016).
 D. Battaglino, L. Lepauloux, N. Evans, F. Mougins, F. Biot, Acoustic scene classification using convolutional neural networks. DCASE2016 Challenge, Tech. Rep. Tampere University of Technology, Department of Signal Processing (2016).