Learning long-term filter banks for audio source separation and audio scene classification
 Teng Zhang^{1} and
 Ji Wu^{1}
https://doi.org/10.1186/s13636-018-0127-7
© The Author(s) 2018
Received: 21 November 2017
Accepted: 30 April 2018
Published: 30 May 2018
Abstract
Filter banks on short-time Fourier transform (STFT) spectrograms have long been studied for analyzing and processing audio. The frameshift in the STFT procedure determines the temporal resolution. However, many discriminative audio applications need long-term time and frequency correlations. In this work, we use Toeplitz matrix motivated filter banks to extract long-term time and frequency information. This paper investigates the mechanism of long-term filter banks and the corresponding spectrogram reconstruction method. The time duration and shape of the filter banks are designed and learned using neural networks. We test our approach on two tasks. Compared with traditional frequency filter banks, the spectrogram reconstruction error in the audio source separation task is reduced by a relative 6.7%, and the classification error in the audio scene classification task is reduced by a relative 6.5%. The experiments also show that the time duration of long-term filter banks is much larger in the classification task than in the reconstruction task.
1 Introduction
Audio in a realistic environment is typically composed of different sound sources. Yet humans have no problem organizing the elements into their sources to recognize the acoustic environment. This process is called auditory scene analysis [1]. Studies of the central auditory system [2–4] have inspired numerous hypotheses and models concerning the separation of audio elements. One prominent hypothesis that underlies most investigations is that audio elements are segregated whenever they activate well-separated populations of auditory neurons that are selective to frequency [5, 6], which emphasizes the audio distinction on the frequency dimension. At the same time, other studies [7, 8] suggest that auditory scenes are essentially dynamic, containing many fast-changing, relatively brief acoustic events. Therefore, an essential aspect of auditory scene analysis is the linking over time [9].
1.1 Related work
For audio separation [10, 11] and recognition [12, 13] tasks, time and frequency analysis is usually implemented using well-designed filter banks.
Filter banks are traditionally composed of finite or infinite impulse response filters in principle [14], but the stability of the filters is usually difficult to guarantee. For simplicity, filter banks on the STFT spectrogram have been investigated for a long time [15]. In this case, the time resolution is determined by the frameshift in the STFT procedure and the frequency resolution is modelled by the frequency response of the filter banks. Frequency filter banks can be parameterized in the frequency domain with filter centre, bandwidth, gain and shape [16]. If these parameters are learnable, deep neural networks (DNNs) can be utilized to learn them discriminatively [17–19]. These frequency filter banks are usually used to model the frequency selectivity of the auditory system, but cannot represent the temporal coherence of audio elements.
DNNs are often used as classifiers when the inputs are dynamic acoustic features such as filter bank-based cepstral features and Mel-frequency cepstral coefficients [20, 21]. When the input to DNNs is a magnitude spectrogram, the time-frequency structure of the spectrogram can be learned. Neural networks organized into a two-dimensional space have been proposed by Wang and Chang [22] to model the time and frequency organization of audio elements. They utilized two-dimensional Gaussian lateral connectivity and global inhibition to parameterize the network, where the two dimensions correspond to frequency and time respectively. In this model, time is converted into a spatial dimension, so temporal coherence can take place in auditory organization much like in visual organization, where an object is naturally represented in spatial dimensions. However, according to our analysis, these two dimensions are not equivalent in a spectrogram. Moreover, the parameters of the network are set empirically and are not learnable, so the model still depends significantly on domain knowledge and modelling skill.
In recent years, neural networks with special structures such as the convolutional neural network (CNN) [23, 24] and long short-term memory (LSTM) [25, 26] have been used to extract long-term information from audio. But in both network structures, the temporal coherence is assumed to be the same in different frequency bins, which contradicts Fig. 1.
1.2 Contribution of this paper
As shown in Fig. 1, when a perceptual frequency scale is used to map the linear frequency domain to the nonlinear perceptual frequency domain [27], the main concern becomes how to model the energy distribution and temporal coherence in different frequency bins.
To obtain better time and frequency analysis results, we divide the audio processing procedure into two stages. In the first stage, traditional frequency filter banks are applied to the STFT spectrogram to extract frequency features. Without loss of generality, the parameters of the frequency filter banks are set experimentally. In the second stage, a novel long-term filter bank spanning several frames is constructed in each frequency bin. The long-term filter banks proposed here can be implemented by neural networks and trained jointly with the target of the specific task.

Toeplitz matrix motivated long-term filter banks: Unlike filter banks in the frequency domain, our proposed long-term filter banks spread over the time dimension. They can be parameterized with time duration and shape constraints. For each frequency bin, the time duration is different, but for each frame, the filter shape is constant. This mechanism can be implemented using a Toeplitz matrix motivated network.

Spectrogram reconstruction from filter bank coefficients: Consistent with the audio processing procedure, we also divide the reconstruction procedure into two stages. The first stage is a dual inverse process of the long-term filter banks and the second stage is a dual inverse process of the frequency filter banks. This paper investigates the spectrogram reconstruction problem using an elaborate neural network.
This paper is organized as follows. The next section describes the detailed mechanism of the long-term filter banks and the spectrogram reconstruction method. The network structures used in our proposed method are then introduced in Section 3. Section 4 presents several experiments that show the performance of long-term filter banks on source separation and audio scene classification. Finally, we conclude the paper and give directions for future work in Section 5.
2 Long-term filter banks
The input audio signal is first transformed into a sequence of vectors using the STFT [28]; the STFT result can be represented as X_{1...T} = {x_{1}, x_{2}, ..., x_{T}}. T is determined by the frame shift in the STFT, and the dimension of each vector x, labelled N, is determined by the frame length.
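The relation between frame length, frame shift, N and T can be illustrated with a minimal NumPy sketch. It uses the frame length of 1024 and frameshift of 220 reported in Section 4; the Hann window, the absence of padding, the one-second random test signal and the function name stft_magnitude are assumptions made for illustration only.

```python
import numpy as np

def stft_magnitude(signal, frame_length=1024, frame_shift=220):
    """Return a magnitude spectrogram of shape (N, T), with N = frame_length // 2 + 1."""
    window = np.hanning(frame_length)  # assumption: Hann analysis window
    # No padding (assumption): only frames that fit entirely inside the signal are kept.
    num_frames = 1 + (len(signal) - frame_length) // frame_shift
    frames = np.stack([
        signal[t * frame_shift: t * frame_shift + frame_length] * window
        for t in range(num_frames)
    ])  # shape (T, frame_length)
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape (N, T)

rng = np.random.default_rng(0)
x = rng.standard_normal(22050)  # one second of noise at 22,050 Hz (illustrative input)
X = stft_magnitude(x)
print(X.shape)  # (513, 96)
```

Here N = 513 follows from the frame length alone, while T = 96 depends on the signal length and the frame shift.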
When the number of frequency filters is m, the long-term filter banks can be parameterized by m linear transformations. The parameters are labelled θ and are discussed in detail in the rest of this section.
The back-end processing modules vary across applications. For the audio scene classification task, they are deep convolutional neural networks followed by a softmax layer that converts the feature maps to the corresponding categories. For the audio source separation task, the modules are composed of a binary gating layer and some spectrogram reconstruction layers. We define them as nonlinear functions f_{γ}. The long-term filter bank parameters θ can be trained jointly with the back-end parameters γ using the back-propagation method [33] in neural networks.
2.1 Toeplitz motivation
2.2 Shape constraint
If W is totally independent, S_{k} is a dense Toeplitz matrix, which means that the time duration of the filter in each frequency bin is T. This assumption is unreasonable, especially when T is extremely large. Intuitively, the long-term correlation should be limited to a certain range. Inspired by traditional frequency filter banks, we use a parameterized window shape to limit the time duration of long-term filter banks.
When we initialize the parameters α_{k} and σ_{k} randomly, we believe that the learning will be well behaved, which is the so-called “no bad local minima” hypothesis [36]. However, a different view presented in [37] is that the underlying easiness of optimizing deep networks is rather tightly connected to the intrinsic characteristics of the data these models are run on. Thus for us, the initialization of parameters is a tricky problem, especially when α_{k} and σ_{k} have clear physical meanings.
If σ_{k} in Eq. 3 is initialized with a value larger than 1.0, the corresponding S_{k} is approximately equal to a k-tridiagonal Toeplitz matrix [38], where k is less than 3. Thus, if the totally independent W is initialized with an identity matrix, similar results with limited time durations should be obtained. Whether the Gaussian shape-constrained algorithm of Eq. 3 or the totally independent W of Eq. 2 is used, the initialization of parameters is important and intractable when adapting to different tasks. More details will be discussed and tested in Section 4.
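The Gaussian shape constraint can be illustrated with a minimal NumPy sketch. Eq. 3 is not reproduced in this text, so the window form alpha * exp(-(sigma * lag)^2), the function names gaussian_toeplitz and apply_long_term_filters, and the convention that each row of Y holds one frequency bin's trajectory over T frames are all assumptions made for illustration, not the exact definition used in the experiments.

```python
import numpy as np

def gaussian_toeplitz(T, alpha, sigma):
    """T x T symmetric Toeplitz matrix with entries alpha * exp(-(sigma * (i - j))**2).

    Assumed window form: a larger sigma gives a narrower band around the diagonal.
    """
    lag = np.arange(T)[:, None] - np.arange(T)[None, :]
    return alpha * np.exp(-(sigma * lag) ** 2)

def apply_long_term_filters(Y, alphas, sigmas):
    """Filter each frequency-bin trajectory Y[k] (length T) with its own Toeplitz matrix."""
    m, T = Y.shape
    return np.stack([
        gaussian_toeplitz(T, alphas[k], sigmas[k]) @ Y[k] for k in range(m)
    ])

m, T = 4, 8  # toy sizes for illustration
Y = np.ones((m, T))
Z = apply_long_term_filters(Y, alphas=np.ones(m), sigmas=np.full(m, 3.0))
S = gaussian_toeplitz(T, 1.0, 3.0)
print(Z.shape, np.allclose(S, np.eye(T), atol=1e-3))
```

Under this assumed form, sigma = 3 makes every off-diagonal entry smaller than 1e-3, so S is close to an identity matrix. This is in line with the observation above that a large σ_{k} confines the filter to a narrow band around the diagonal, and that an identity initialization of W behaves similarly.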
2.3 Spectrogram reconstruction
In our proposed learning framework (Fig. 2), the STFT spectrogram is transformed into subband coefficients by the frequency filter banks and the long-term filter banks. The dimension of the subband coefficients z_{t} is usually much smaller than that of x_{t}, to reduce computational cost and extract significant features. In this case, the subband coefficients are incomplete, so perfect spectrogram reconstruction from subband coefficients is impossible.
Note that a and b can also be regarded as the solutions of two linear systems, which can be learned using a fully connected neural network layer. In this case, the number of parameters reduces from mT^{2} to 2mT.
In conclusion, the spectrogram reconstruction procedure can be implemented using a two-layer neural network. When the first layer is implemented as in Eq. 5, the total number of parameters is mN+mT^{2}; when the first layer is represented as in Eq. 6, the total number is mN+2mT. Experiments in Section 4.1 will show the difference between these two methods.
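To give a sense of scale for the two parameter counts, the short sketch below evaluates both formulas. The values m = 64, N = 513 and T = 128 are taken from the experimental settings in Section 4 and are used here purely as an illustration; the function name reconstruction_params is hypothetical.

```python
def reconstruction_params(m, N, T):
    """Parameter counts of the two-layer reconstruction network for both first-layer variants."""
    full = m * N + m * T ** 2   # first layer as in Eq. 5
    toep = m * N + 2 * m * T    # first layer as in Eq. 6
    return full, toep

full, toep = reconstruction_params(m=64, N=513, T=128)
print(full, toep)  # 1081408 49216
```

With these sizes, the Eq. 6 variant needs roughly 22 times fewer parameters than the Eq. 5 variant.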
3 Training the models
As described in Section 2, the long-term filter banks proposed here can be integrated into a neural network (NN) structure. The parameters of the models are learned jointly with the target of the specific task. In this section, we introduce two NN-based structures, one for audio source separation and one for audio scene classification.
3.1 Audio source separation
3.2 Audio scene classification
In early pattern recognition studies [42], the input is first converted into features, which are usually defined empirically by experts and believed to be relevant to the recognition targets. In Fig. 5b, a feature extraction structure including the long-term filter banks is proposed to systematically train the overall recognizer in a manner consistent with the minimization of recognition errors.
4 Experiments

Models: The models tested in this section differ from each other in two aspects. The variants of frequency filter banks include TriFB and GaussFB, as described in Section 2. For long-term filter banks, the Gaussian shape-constrained filters introduced in Section 2.2 are named GaussLTFB and the totally independent filters are named FullLTFB. The baseline of our experiments has no long-term filter banks and is labelled Null. The initials of the names are used to differentiate models. For example, when TriFB and FullLTFB are used in the model, the model is named TriFBFullLTFB.

Initialization: When we use totally independent filters as the longterm filter banks, two initialization methods discussed in Section 2.2 are tested in this section. When the parameters are initialized randomly, the method is named Random, while when the parameters are initialized using an identity matrix, the method is named Identity.

Reconstruction: When the spectrogram reconstruction is implemented as Eq. 5, the method is named Re_inv, while when the reconstruction is implemented as Eq. 6, the method is named Re_toep.
In all experiments, the audio signal is first transformed using the short-time Fourier transform with a frame length of 1024 and a frameshift of 220. The number of frequency filters is set to 64; the detailed settings of the NN structures are shown in Fig. 5. All parameters in the neural network are trained jointly using the Adam [45] optimizer; the learning rate is initialized to 0.001.
4.1 Audio source separation
In this experiment, we investigate the application of long-term filter banks to the audio source separation task using the MIR-1K dataset [46]. The dataset consists of 1000 song clips recorded at a sample rate of 16 kHz, with durations ranging from 4 to 13 s. The dataset is used with 4 training/testing splits. In each split, 700 of the examples are randomly selected for training and the others for testing. We use the mean average accuracy over the 4 splits as the evaluation criterion. To achieve a fair comparison, we use this dataset to create 3 sets of mixtures. For each clip, we mix the vocal and music tracks under various conditions, where the energy ratio between music and voice takes the values 0.1, 1 and 10.
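The mixing step can be sketched as follows. The text only states the music-to-voice energy ratios, so scaling the music track (rather than the voice track), the absence of any further normalization, the random test signals and the function name mix_at_energy_ratio are assumptions made for illustration.

```python
import numpy as np

def mix_at_energy_ratio(music, voice, ratio):
    """Scale `music` so that sum(music**2) / sum(voice**2) == ratio, then add the tracks."""
    e_music = np.sum(music ** 2)
    e_voice = np.sum(voice ** 2)
    gain = np.sqrt(ratio * e_voice / e_music)  # amplitude gain applied to the music track
    scaled = gain * music
    return scaled + voice, scaled

rng = np.random.default_rng(0)
music = rng.standard_normal(16000)  # stand-in for one second of music at 16 kHz
voice = rng.standard_normal(16000)  # stand-in for one second of singing voice
mixtures = {r: mix_at_energy_ratio(music, voice, r)[0] for r in (0.1, 1.0, 10.0)}
```

Each entry of mixtures then corresponds to one of the three M/V conditions reported in the tables of this section.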
We first test our methods on the outputs of frequency filter banks. In this case, the combination of classical frequency filter banks and our proposed temporal filter banks works as a two-dimensional filter bank on magnitude spectrograms. Classical CNN models can learn two-dimensional filters on spectrograms directly, so we introduce a one-layer CNN model as a comparison. The CNN model is implemented as in [22], but the convolutional layer here is composed of learnable parameters instead of the constant Gaussian lateral connectivity in [22]. This convolutional layer works as a two-dimensional filter whose size is set to 5×5; the outputs of this layer are then processed as in Fig. 5a. We use the NN model in [47] and the one-layer CNN model as our baseline models. For our proposed long-term filter banks, we test two variant modules, GaussLTFB and FullLTFB, which were defined at the beginning of Section 4. For FullLTFB, the two initialization methods discussed in Section 2.2 are both tested. The three variant modules GaussLTFB, FullLTFBRandom and FullLTFBIdentity can each be applied to the two types of frequency filter banks, TriFB and GaussFB, so a total of six long-term filter bank experiments are conducted in this part.
Reconstruction error of audio source separation using frequency filter banks as input
Init      Method              Re_toep                     Re_inv
                              M/V=0.1  M/V=1   M/V=10     M/V=0.1  M/V=1   M/V=10
–         TriFBNull           3.49     1.51    0.55       3.49     1.51    0.55
–         GaussFBNull         3.28     1.47    0.58       3.28     1.47    0.58
–         TriFBCNN1layer      2.85     1.51    0.61       2.85     1.51    0.61
–         GaussFBCNN1layer    2.91     1.50    0.64       2.91     1.50    0.64
–         TriFBGaussLTFB      2.66     1.38    0.50       3.65     1.80    0.74
–         GaussFBGaussLTFB    2.60     1.39    0.56       3.91     1.67    0.67
Random    TriFBFullLTFB       3.90     41.37   2.28       3.84     1.83    0.78
Random    GaussFBFullLTFB     3.55     1.99    0.86       3.85     1.64    0.66
Identity  TriFBFullLTFB       2.69     1.39    0.52       3.92     1.63    0.62
Identity  GaussFBFullLTFB     2.62     1.39    0.56       3.85     1.51    0.59
We now test our methods on magnitude spectrograms as described in [47]. In this situation, long-term filter banks are used as one-dimensional filter banks to extract temporal information. The size of the magnitude spectrograms is 513×128. The settings of the NN structures in Fig. 5a are modified correspondingly to adapt to this size. We again use the NN model in [47] and the one-layer CNN model as our baseline models. The three variant modules GaussLTFB, FullLTFBRandom and FullLTFBIdentity are applied to magnitude spectrograms directly in this part.
Reconstruction error of audio source separation using magnitude spectrograms as input
4.2 Audio scene classification
In this section, we apply the long-term filter banks to the audio scene classification (ASC) task. We employ the LITIS ROUEN dataset [48] and the DCASE2016 dataset [49] to conduct acoustic scene classification experiments.

LITIS ROUEN dataset: To the best of our knowledge, this is the largest publicly available dataset for ASC. The dataset contains about 1500 min of acoustic scene recordings belonging to 19 classes. Each audio recording is divided into 30-s examples without overlap, yielding 3026 examples in total. The sampling frequency of the audio is 22,050 Hz. The dataset is provided with 20 training/testing splits. In each split, 80% of the examples are kept for training and the other 20% for testing. We use the mean average accuracy over the 20 splits as the evaluation criterion.

DCASE2016 dataset: The dataset is released as Task 1 of the DCASE2016 challenge. We use the development data in this paper. The development data contains about 585 min of acoustic scene recordings belonging to 15 classes. Each audio recording is divided into 30-s examples without overlap, yielding 1170 examples in total. The sampling frequency of the audio is 44,100 Hz. The dataset is divided into four folds. Our experiments follow this setting, and the average performance is reported.
For both datasets, the examples are 30 s long. In the data preprocessing step, we first divide the 30-s examples into 1-s clips with 50% overlap. Each clip is then processed using neural networks as in Fig. 5b. The classification results of all these clips are averaged to obtain an ensemble result for the 30-s example. The size of the audio spectrograms is 64×128. For the CNN structure in Fig. 5b, the window sizes of the convolutional layers are 64×2×64, 64×3×64 and 64×4×64, and the fully connected layers are 196×128×19(15). For the DCASE2016 dataset, we use a dropout rate of 0.5. For all these methods, the learning rate is 0.001, the l_{2} weight is 10^{−4}, and training uses the Adam [45] update method and is stopped after 100 training epochs. To compute the results for each training-test split, we use the classification error over all classes. The final classification error is its average value over all splits.
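The clip segmentation and ensemble step described above can be sketched as follows. Averaging class probabilities (rather than logits or hard votes), dropping any incomplete trailing clip, and the function names split_into_clips and ensemble_prediction are assumptions; the text only states that the clip-level results are averaged.

```python
import numpy as np

def split_into_clips(example, sample_rate, clip_seconds=1.0, overlap=0.5):
    """Cut a 1-D waveform into fixed-length clips with the given fractional overlap."""
    clip_len = int(round(clip_seconds * sample_rate))
    hop = int(round(clip_len * (1.0 - overlap)))
    starts = range(0, len(example) - clip_len + 1, hop)  # incomplete tail is dropped
    return np.stack([example[s:s + clip_len] for s in starts])

def ensemble_prediction(clip_probs):
    """Average per-clip class probabilities and return the winning class index."""
    return int(np.argmax(clip_probs.mean(axis=0)))

sr = 22050                      # LITIS ROUEN sampling frequency
example = np.zeros(30 * sr)     # placeholder for one 30-s example
clips = split_into_clips(example, sr)
print(clips.shape)  # (59, 22050)

# Toy per-clip probabilities for three clips and two classes (illustrative values).
probs = np.array([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
print(ensemble_prediction(probs))  # 1
```

Under these assumptions, a 30-s example yields 59 one-second clips, and the example-level label is the class with the highest mean probability.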
We begin with experiments in which we train different neural network models without long-term filter banks on both datasets. As described at the beginning of Section 4, our baseline systems take the outputs of frequency filter banks as input. TriFB and GaussFB are placed in the frequency domain to integrate the frequency information. Classical CNN models can learn two-dimensional filters on the spectrum directly, so we introduce two CNN structures as a comparison. The first CNN model is implemented as in [50] and has multiple convolutional layers, pooling layers, and fully connected layers. The window size of the convolutional kernels is 5×5, the pooling size is 3, the output channels are [8, 16, 23], and the fully connected layers are 196×128×19(15). The other CNN structure is the same as the one-layer CNN model described in Section 4.1; the outputs of this model are then processed as in Fig. 5b.
Average performance comparison with related works on LITIS Rouen dataset and DCASE2016 dataset
Method               DCASE2016 (%)        LITIS Rouen (%)
                     Error   F-measure    Error   F-measure
TriFBNull            23.12   76.08        3.76    96.19
GaussFBNull          22.69   76.56        3.48    96.44
CNNmultilayer [50]   26.45   72.44        4.00    95.80
CNN1layer [22]       23.29   75.82        2.97    96.91
RNNGam [26]          –       –            3.4     –
CNNGam [24]          –       –            4.2     –
MFCCGMM [49]         27.5    –            –       –
DNNCQT [51]          –       78.1         –       96.6
DNNMel [53]          23.6    –            –       –
CNNMel [54]          24.0    –            –       –
We now test our long-term filter banks on both datasets. We again test three variant modules in this part: GaussLTFB, FullLTFBRandom and FullLTFBIdentity. These three variant modules can be inserted into the neural networks directly, as in Fig. 5b.
Average performance comparison using different configurations on LITIS Rouen dataset and DCASE2016 dataset
Init      Method              DCASE2016 (%)        LITIS Rouen (%)
                              Error   F-measure    Error   F-measure
–         TriFBNull           23.12   76.08        3.76    96.19
–         GaussFBNull         22.69   76.56        3.48    96.44
–         TriFBGaussLTFB      22.40   76.79        2.82    97.05
–         GaussFBGaussLTFB    22.15   77.11        2.97    96.91
Random    TriFBFullLTFB       22.67   76.49        3.47    96.35
Random    GaussFBFullLTFB     21.21   78.05        2.96    96.92
Identity  TriFBFullLTFB       23.35   75.69        3.67    96.18
Identity  GaussFBFullLTFB     23.13   75.83        3.21    96.61
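As a worked check of how a relative error reduction is computed from this table, the sketch below compares the DCASE2016 error of GaussFBNull (22.69) with that of GaussFBFullLTFB under random initialization (21.21). Pairing these two rows is an inference from the table; the text does not state which rows underlie the 6.5% figure quoted in the abstract. The function name relative_reduction is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline

# DCASE2016 classification error (%): GaussFBNull vs. GaussFBFullLTFB (Random init).
print(round(relative_reduction(22.69, 21.21), 1))  # 6.5
```

The result appears consistent with the relative 6.5% reduction in classification error reported in the abstract and conclusions.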
4.3 Reconstruction vs classification
In the audio source separation experiments, when the parameters of the totally independent long-term filter banks are initialized randomly, training does not appear to converge effectively. In the audio scene classification task, however, the opposite is observed.
5 Conclusions
A novel framework of filter banks that can extract long-term time and frequency correlation is proposed in this paper. The new filters are constructed after traditional frequency filters and can be implemented using Toeplitz matrix motivated neural networks. A Gaussian shape constraint is introduced to limit the time duration of the filters, especially in reconstruction-related tasks. A spectrogram reconstruction method based on Toeplitz matrix inversion is then implemented using neural networks. The spectrogram reconstruction error in the audio source separation task is reduced by a relative 6.7% and the classification error in the audio scene classification task is reduced by a relative 6.5%. This paper provides a practical and complete framework to learn long-term filter banks for different tasks.
The frequency filter banks are in some way interrelated with the long-term filter banks. Combining the ideas of these two types of filter banks, future work will investigate two-dimensional filter banks.
Declarations
Funding
This work was partly funded by National Natural Science Foundation of China (Grant No: 61571266).
Authors’ contributions
TZ designed the core methodology of the study, carried out the implementation and experiments, and drafted the manuscript. JW participated in the study and helped to draft the manuscript. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
[1] AS Bregman, Auditory scene analysis: the perceptual organization of sound (MIT Press, Cambridge, 1994).
[2] S McAdams, A Bregman, Hearing musical streams. Comput. Music J. 3(4), 26–60 (1979).
[3] AS Bregman, Auditory streaming is cumulative. J. Exp. Psychol. Hum. Percept. Perform. 4(3), 380 (1978).
[4] GA Miller, GA Heise, The trill threshold. J. Acoust. Soc. Am. 22(5), 637–638 (1950).
[5] MA Bee, GM Klump, Primitive auditory stream segregation: a neurophysiological study in the songbird forebrain. J. Neurophysiol. 92(2), 1088–1104 (2004).
[6] D Pressnitzer, M Sayles, C Micheyl, IM Winter, Perceptual organization of sound begins in the auditory periphery. Curr. Biol. 18(15), 1124–1128 (2008).
[7] H Attias, CE Schreiner, in Advances in Neural Information Processing Systems. Temporal low-order statistics of natural sounds (MIT Press, Cambridge, 1997), pp. 27–33.
[8] NC Singh, FE Theunissen, Modulation spectra of natural sounds and ethological theories of auditory processing. J. Acoust. Soc. Am. 114(6), 3394–3411 (2003).
[9] SA Shamma, M Elhilali, C Micheyl, Temporal coherence and attention in auditory scene analysis. Trends Neurosci. 34(3), 114–123 (2011).
[10] DL Donoho, De-noising by soft-thresholding. IEEE Trans. Inf. Theory 41(3), 613–627 (1995).
[11] B Gao, W Woo, L Khor, Cochleagram-based audio pattern separation using two-dimensional non-negative matrix factorization with automatic sparsity adaptation. J. Acoust. Soc. Am. 135(3), 1171–1185 (2014).
[12] A Biem, S Katagiri, BH Juang, in Neural Networks for Processing [1993] III. Proceedings of the 1993 IEEE-SP Workshop. Discriminative feature extraction for speech recognition (IEEE, 1993), pp. 392–401.
[13] Á de la Torre, AM Peinado, AJ Rubio, VE Sánchez, JE Diaz, An application of minimum classification error to feature space transformations for speech recognition. Speech Commun. 20(3–4), 273–290 (1996).
[14] S Akkarakaran, P Vaidyanathan, in Acoustics, Speech, and Signal Processing, 1999. Proceedings, 1999 IEEE International Conference On, 3. New results and open problems on nonuniform filter-banks (IEEE, 1999), pp. 1501–1504.
[15] S Davis, P Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoustics Speech Signal Process. 28(4), 357–366 (1980).
[16] A Biem, S Katagiri, E McDermott, BH Juang, An application of discriminative feature extraction to filter-bank-based speech recognition. IEEE Trans. Speech Audio Process. 9(2), 96–110 (2001).
[17] TN Sainath, B Kingsbury, AR Mohamed, B Ramabhadran, in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop On. Learning filter banks within a deep neural network framework (IEEE, 2013), pp. 297–302.
[18] H Yu, ZH Tan, Y Zhang, Z Ma, J Guo, DNN filter bank cepstral coefficients for spoofing detection. IEEE Access 5, 4779–4787 (2017).
[19] H Seki, K Yamamoto, S Nakagawa, in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference On. A deep neural network integrated with filterbank learning for speech recognition (IEEE, 2017), pp. 5480–5484.
[20] H Yu, ZH Tan, Z Ma, R Martin, J Guo, Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features. IEEE Trans. Neural Netw. Learn. Syst., 1–12 (2017).
[21] H Yu, ZH Tan, Z Ma, J Guo, Adversarial network bottleneck features for noise robust speaker verification (2017). arXiv preprint arXiv:1706.03397.
[22] D Wang, P Chang, An oscillatory correlation model of auditory streaming. Cogn. Neurodynamics 2(1), 7–19 (2008).
[23] S Lawrence, CL Giles, AC Tsoi, AD Back, Face recognition: a convolutional neural-network approach. IEEE Trans. Neural Netw. 8(1), 98–113 (1997).
[24] H Phan, L Hertel, M Maass, P Koch, R Mazur, A Mertins, Improved audio scene classification based on label-tree embeddings and convolutional neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 25(6), 1278–1290 (2017).
[25] S Hochreiter, J Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997).
[26] H Phan, P Koch, F Katzberg, M Maass, R Mazur, A Mertins, Audio scene classification with deep recurrent neural networks (2017). arXiv preprint arXiv:1703.04770.
[27] S Umesh, L Cohen, D Nelson, in Acoustics, Speech, and Signal Processing, 1999. Proceedings, 1999 IEEE International Conference On, 1. Fitting the Mel scale (IEEE, 1999), pp. 217–220.
[28] J Allen, Short term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Trans. Acoustics Speech Signal Process. 25(3), 235–238 (1977).
[29] RF Lyon, AG Katsiamis, EM Drakakis, in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium On. History and future of auditory filter models (IEEE, 2010), pp. 3809–3812.
[30] S Rosen, RJ Baker, A Darling, Auditory filter nonlinearity at 2 kHz in normal hearing listeners. J. Acoust. Soc. Am. 103(5), 2539–2550 (1998).
[31] R Patterson, I Nimmo-Smith, J Holdsworth, P Rice, in a Meeting of the IOC Speech Group on Auditory Modelling at RSRE, vol. 2. An efficient auditory filterbank based on the gammatone function (1987).
[32] S Young, G Evermann, M Gales, T Hain, D Kershaw, X Liu, G Moore, J Odell, D Ollason, D Povey, et al., The HTK book. Cambridge University Engineering Department 3, 175 (2002).
[33] DE Rumelhart, GE Hinton, RJ Williams, et al., Learning representations by back-propagating errors. Cogn. Model. 5(3), 1 (1988).
[34] EH Bareiss, Numerical solution of linear equations with Toeplitz and vector Toeplitz matrices. Numerische Mathematik 13(5), 404–424 (1969).
[35] N Deo, M Krishnamoorthy, Toeplitz networks and their properties. IEEE Trans. Circuits Syst. 36(8), 1089–1092 (1989).
[36] YN Dauphin, R Pascanu, C Gulcehre, K Cho, S Ganguli, Y Bengio, in Advances in Neural Information Processing Systems. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization (Curran Associates, Inc., 2014), pp. 2933–2941.
[37] O Shamir, Distribution-specific hardness of learning neural networks (2016). arXiv preprint arXiv:1609.01037.
[38] J Jia, T Sogabe, M El-Mikkawy, Inversion of k-tridiagonal matrices with Toeplitz structure. Comput. Math. Appl. 65(1), 116–125 (2013).
[39] A Ben-Israel, TN Greville, Generalized inverses: theory and applications, vol. 15 (Springer Science & Business Media, 2003).
[40] ST Lee, HK Pang, HW Sun, Shift-invert Arnoldi approximation to the Toeplitz matrix exponential. SIAM J. Sci. Comput. 32(2), 774–792 (2010).
[41] X Zhao, Y Shao, D Wang, CASA-based robust speaker identification. IEEE Trans. Audio Speech Lang. Process. 20(5), 1608–1616 (2012).
[42] RO Duda, PE Hart, DG Stork, Pattern classification (Wiley, New York, 1973).
[43] Y Kim, Convolutional neural networks for sentence classification (2014). arXiv preprint arXiv:1408.5882.
[44] R Collobert, J Weston, L Bottou, M Karlen, K Kavukcuoglu, P Kuksa, Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011).
[45] D Kingma, J Ba, Adam: A method for stochastic optimization (2014). arXiv preprint arXiv:1412.6980.
[46] CL Hsu, JSR Jang, MIR Database (2010). http://sites.google.com/site/unvoicedsoundseparation/mir1k/. Retrieved 10 Sept 2017.
[47] EM Grais, G Roma, AJ Simpson, MD Plumbley, Two-stage single-channel audio source separation using deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 25(9), 1773–1783 (2017).
[48] A Rakotomamonjy, G Gasso, IEEE/ACM Trans. Audio Speech Lang. Process. 23(1), 142–153 (2015).
[49] A Mesaros, T Heittola, T Virtanen, in Signal Processing Conference (EUSIPCO), 2016 24th European. TUT database for acoustic scene classification and sound event detection (IEEE, 2016), pp. 1128–1132.
[50] Y LeCun, Y Bengio, et al., Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361(10), 1995 (1995).
[51] V Bisot, R Serizel, S Essid, G Richard, Feature learning with matrix factorization applied to acoustic scene classification. IEEE/ACM Trans. Audio Speech Lang. Process. 25(6), 1216–1229 (2017).
[52] JC Brown, Calculation of a constant Q spectral transform. J. Acoust. Soc. Am. 89(1), 425–434 (1991).
[53] Q Kong, I Sobieraj, W Wang, M Plumbley, in Proceedings of DCASE 2016. Deep neural network baseline for DCASE challenge 2016 (Tampere University of Technology, Department of Signal Processing, 2016).
[54] D Battaglino, L Lepauloux, N Evans, F Mougins, F Biot, Acoustic scene classification using convolutional neural networks. DCASE2016 Challenge, Tech. Rep. (Tampere University of Technology, Department of Signal Processing, 2016).