Towards cross-modal pre-training and learning tempo-spatial characteristics for audio recognition with convolutional and recurrent neural networks

In this paper, we investigate the performance of two deep learning paradigms for the audio-based tasks of acoustic scene, environmental sound and domestic activity classification. In particular, a convolutional recurrent neural network (CRNN) and pre-trained convolutional neural networks (CNNs) are utilised. The CRNN is directly trained on Mel-spectrograms of the audio samples. For the pre-trained CNNs, the activations of one of the top layers of various architectures are extracted as feature vectors and used for training a linear support vector machine (SVM). Moreover, the predictions of the two models (the class probabilities predicted by the CRNN and the decision function of the SVM) are combined in a decision-level fusion to achieve the final prediction. For the pre-trained CNNs we use as feature extractors, we further evaluate the effects of a range of configuration options, including the choice of the pre-training corpus. The system is evaluated on the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017) workshop, ESC-50 and the multi-channel acoustic recordings from DCASE 2018, task 5. We have refrained from additional data augmentation as our primary goal is to analyse the general performance of the proposed system on different datasets. We show that using our system, it is possible to achieve competitive performance on all datasets and demonstrate the complementarity of CRNNs and ImageNet pre-trained CNNs for acoustic classification tasks. We further find that in some cases, CNNs pre-trained on ImageNet can serve as more powerful feature extractors than AudioSet models. Finally, ImageNet pre-training is complementary to more domain-specific knowledge, either in the form of the CRNN trained directly on the target data or the AudioSet pre-trained models. In this regard, our findings indicate possible benefits of applying cross-modal pre-training of large CNNs to acoustic analysis tasks.


Introduction
We are regularly surrounded by dynamic audio events, some of which are quite pleasant, such as singing birds or nice music tracks, and others less so, like the sound of a chainsaw or a siren. Even at a young age, humans have the ability to analyse and understand a large number of audio activities and the interconnections between them, whilst filtering out a wide range of distractions [1]. In the era of machine learning, computer audition systems for intelligent housing systems [2,3], recognition of acoustic scenes [4,5] and sound event detection [4,6,7] are being developed. Therefore, it is essential for such systems to perform with high accuracy in real-world conditions. Despite recent developments in the field of audio analysis, contemporary machine learning systems still face a major challenge in performing the mentioned tasks with human-like precision. Moreover, deep learning-based technologies lack a mechanism to generalise well when faced with data scarcity problems. In this regard, we follow a threefold strategy by (i) proposing a cross-modal transfer learning strategy in the form of ImageNet pre-trained convolutional neural networks (CNNs) to cope with the limited data challenges, (ii) utilising a CRNN for learning tempo-spatial characteristics of audio signals, and (iii) fusing various neural network strategies to check for further performance improvements.
In particular, we investigate the performance of our methodologies to solve a 9-class audio-based classification problem of daily activities performed in a domestic environment [8], and further evaluate the system for acoustic scene and environmental sound classification.
Recently, Vecchiotti et al. [9] demonstrated the efficacy of CNNs for the task of voice activity detection in a multipurpose domestic environment, and Vesperini et al. [10] showed that CNNs can achieve great performance when applied to the detection of rare audio events. At the same time, recurrent neural networks (RNNs) have been widely utilised in order to model the sequential nature of audio data and capture their long-term temporal dependencies [11–15]. With respect to the above-mentioned literature, we propose our hybrid CRNN approach to obtain representations from both CNNs and RNNs. It is worth mentioning that CRNNs, which were first proposed for document classification [16], are considered state-of-the-art in various audio recognition tasks, including music classification [17], acoustic event detection (AED) [18] and recognition of specific acoustic vocalisation [19]. Furthermore, they have been successfully applied for speech enhancement [20] and detection of rare audio events, for example, in smart home systems [7].
In addition to our proposed CRNN system, we investigate the efficacy of a transfer learning approach by utilising VGG16 and VGG19 [21], ResNet [22] and DenseNet [23] models for the aforementioned audio classification problem [8,24]. These models are popular CNN architectures pre-trained on the ImageNet corpus [25]. The main reason behind using pre-trained CNNs is the robust performance that such systems have shown across various audio classification and recognition tasks [26,27]. We further want to investigate whether the features learnt for the task of visual object recognition can provide additional information for acoustic scene classification that is complementary to training a deep CRNN model on the audio data from scratch. For this, we implemented a late fusion strategy based on support vector machine (SVM) classifiers which are trained on the predictions obtained from our two systems. Finally, we compare ImageNet pre-training to random weight initialisation and to models trained on large-scale audio classification tasks in the form of openl3 models [28,29] and PANNs [30].
The remainder of this paper is organised as follows. In the following section, the datasets used in our experiments are presented. Then, the structure of our proposed framework is introduced in Section 3. Afterwards, the experimental results are discussed and analysed in Section 4. Finally, conclusions and future work plans are given in Section 5.

Datasets
We evaluate our proposed systems on three datasets. The first set originates from the "Monitoring of domestic activities based on multi-channel acoustics" task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2018) [8,24]. It contains audio data labelled with the particular domestic activity occurring in the recording. The data has been recorded with 7 microphone arrays, each consisting of four linearly arranged microphones. Those microphone arrays were placed in a studio-sized holiday home and the person living there was continuously recorded for the period of 1 week. The continuous recordings were then split into 72,984 single audio segments of 10-s length and labelled with 9 different activities (absence, cooking, dish washing, eating, other, social activity, vacuum cleaning, watching TV and working). Segments containing more than one household activity were discarded. The development data of the challenge consists of audio samples recorded by four microphone arrays at different locations. For the evaluation partition, data of seven microphone arrays is used, consisting of the four microphone arrays available in the development partition and three unknown microphone arrays [8]. We use the exact setup as provided by the challenge organisers. For detailed information about this dataset, the interested reader is referred to [8,24].
Further, we show the efficacy of the proposed fusion approach on two additional datasets: the acoustic scene classification challenge (task 1) of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017) workshop [31] and the environmental sound classification dataset ESC-50 [32]. DCASE 2017 contains 4680 10-s audio samples of 15 distinct acoustic scenes in the development partition and another 1620 samples for model evaluation. Furthermore, a cross-validation setup is provided for the development partition, which we also use for our experiments. ESC-50's 2000 samples of environmental sounds are spread evenly across 50 categories. As for DCASE, a cross-validation setup is also given. In order to have a similar setup for our experiments, we use four of the five folds during training and development while setting the fifth fold aside for evaluation. This allows us to optimise model parameters using 4-fold cross-validation and afterwards test the best configurations on unseen data.
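The ESC-50 fold handling described above can be illustrated with a minimal sketch. It assumes the five folds are simply numbered 1 to 5 and that the held-out fold is the fifth one; the names ALL_FOLDS, TEST_FOLD and cross_validation_splits are hypothetical and only serve the illustration.

```python
# Hypothetical fold layout: ESC-50 ships five folds, numbered 1..5 here.
ALL_FOLDS = [1, 2, 3, 4, 5]
TEST_FOLD = 5  # assumption: the fifth fold is set aside for evaluation


def cross_validation_splits(folds, test_fold):
    """Return (train_folds, validation_fold) pairs over the development folds."""
    dev_folds = [f for f in folds if f != test_fold]
    splits = []
    for val_fold in dev_folds:
        train_folds = [f for f in dev_folds if f != val_fold]
        splits.append((train_folds, val_fold))
    return splits


splits = cross_validation_splits(ALL_FOLDS, TEST_FOLD)
for train_folds, val_fold in splits:
    print(f"train on folds {train_folds}, validate on fold {val_fold}")
```

Under these assumptions, every development fold serves as validation data exactly once, and the test fold never enters model selection.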

Methods and experimental settings
An overview of our deep learning framework is given in Fig. 1. First, Mel-spectrograms are extracted from the audio data (cf. Section 3.1). After this, the extracted spectrograms are forwarded through the CRNN (cf. Section 3.2) and DEEP SPECTRUM (cf. Section 3.3) systems. Subsequently, our CRNN is trained on these Mel-spectrograms, and deep feature representations are extracted by a range of CNNs to serve as input for SVM classification. Finally, in a decision-level fusion, the results achieved by different configurations are fused (cf. Section 3.4). We have decided to choose SVM classifiers for our experiments as they have consistently performed well on DEEP SPECTRUM features [1,26,27] and are very efficient in high-dimensional feature spaces [33].

Spectrogram extraction
To create the Mel-spectrograms from the audio data, we apply periodic Hann windows with a length of 0.32 s and an overlap of 0.16 s. From these, we then compute 128 log-scaled Mel-frequency bands. Mel-spectral features have been shown to be useful for audio tasks, such as speech processing and acoustic scene classification [14,19,27,34]. The Mel-spectra are then normalised, so that the maximum amplitude is at 0 dB. In our initial experiments on DCASE 2018, we also clip the spectrograms at different amplitudes (−30 dB, −45 dB and −60 dB) to minimise the effect of background noise and eliminate higher amplitude signals that are not correlated with the class of the audio recordings.
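The framing arithmetic and the normalisation step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the input is taken to be a power Mel-spectrogram given as nested lists, no padding is applied at the clip boundaries, and the helper names num_frames and to_clipped_db as well as the small floor value are hypothetical.

```python
import math

WINDOW_S = 0.32  # window length in seconds, from the text
HOP_S = 0.16     # hop size: window length minus the 0.16 s overlap


def num_frames(duration_s, window_s=WINDOW_S, hop_s=HOP_S):
    """Number of full windows that fit into a clip (assumes no padding)."""
    return int(math.floor((duration_s - window_s) / hop_s + 1e-9)) + 1


def to_clipped_db(power_spec, clip_db):
    """Convert a power spectrogram to dB relative to its maximum (peak at 0 dB)
    and raise every value below clip_db (e.g. -60.0) to clip_db."""
    peak = max(max(row) for row in power_spec)
    eps = 1e-10  # hypothetical floor to avoid log10(0)
    return [
        [max(10.0 * math.log10(max(v, eps) / peak), clip_db) for v in row]
        for row in power_spec
    ]


frames = num_frames(10.0)
toy_spec = [[1.0, 0.1], [1e-3, 1e-8]]
clipped = to_clipped_db(toy_spec, clip_db=-60.0)
print(frames, clipped)
```

With these settings, a 10-s segment yields 61 frames, and the toy value at −80 dB is raised to the −60 dB floor.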

CRNN framework
As indicated in Section 1, deep models based on CNNs and RNNs are suitable for AED and an array of other audio classification tasks. CNNs are trained by learning filters that are shifted in time and frequency. This automatically enables them to extract high-level features that are shift-invariant in both the frequency and time axes [35,36]. This also means that those features will mostly contain short-term temporal context. Due to the inherent nature of CNNs, the ability to extract long-term temporal context is limited. In contrast, an RNN can extract long-term temporal features but struggles to capture short-term and shift-invariant information [37].
The advantages of CNNs and RNNs can be leveraged by combining them into a CRNN, replacing a specified number of the final layers of the CNN with recurrent layers.

DCASE 2018
Our CRNN for DCASE 2018, task 5 consists of 3 convolutional blocks where each block contains one convolutional layer, batch normalisation along the channel axis [38], exponential linear units (ELUs) as activation function [39], two-dimensional max-pooling and a dropout layer with 30% dropout [40]. The convolutional layers use 5 × 5, 4 × 4 and 3 × 3 convolutional kernels, respectively. We use max-pooling with a size of 2 for the time dimension and a size of 32 for the frequency dimension in the third convolutional layer. The three convolutional blocks are followed by two gated recurrent units [41], each with 256 hidden units. We then apply a final dropout of 30% to minimise possible overfitting effects [40]. The probabilities for each class are computed by a softmax layer with 9 logits. The loss of the CRNN is calculated with the cross entropy on the logits, and the network is trained with the Adam optimiser with β1 = 0.9 and β2 = 0.999. Combinations of two learning rates, lr ∈ {0.01, 0.001}, and two batch sizes, ∈ {64, 128}, were evaluated. A learning rate decay of 0.002 was applied and the network was trained for 30 epochs.
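The training configurations listed above can be enumerated as a small grid. The following sketch only collects the hyperparameter values stated in the text; the dictionary keys and the names FIXED and configs are hypothetical, and the exact form of the learning rate decay is not specified here, so it is stored as a plain number.

```python
from itertools import product

LEARNING_RATES = [0.01, 0.001]
BATCH_SIZES = [64, 128]

# Settings shared by all runs, taken from the description of the CRNN.
FIXED = {
    "optimiser": "adam",
    "beta_1": 0.9,
    "beta_2": 0.999,
    "lr_decay": 0.002,  # decay schedule itself is an open assumption
    "epochs": 30,
    "dropout": 0.3,
    "gru_units": 256,
    "num_classes": 9,
}

# One configuration per (learning rate, batch size) combination.
configs = [
    dict(FIXED, learning_rate=lr, batch_size=bs)
    for lr, bs in product(LEARNING_RATES, BATCH_SIZES)
]

for cfg in configs:
    print(cfg["learning_rate"], cfg["batch_size"])
```

This gives four configurations per input representation, which is the search space evaluated with the cross-validation folds.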

DCASE 2017 and ESC-50
On DCASE 2017 and ESC-50, the CRNN is slightly adapted, based on initial experiments, to use 4 convolutional blocks and smaller 3 × 3 convolutional kernels throughout. We use the same optimiser with a learning rate of 0.001 and train for 50 epochs on DCASE 2017 and 100 epochs on ESC-50 due to the smaller dataset sizes. Furthermore, we refrained from using amplitude clipping at different thresholds; instead, we clip every spectrogram below −80 dB.

Pre-trained CNNs as feature extractors
In addition to CRNNs, we also employ the DEEP SPECTRUM toolkit (https://github.com/DeepSpectrum/DeepSpectrum) [42] to extract deep features from the audio samples with VGG16, VGG19 [21], 50-layer ResNet [22] and DenseNet121 [23] networks that have been pre-trained on ImageNet. In combination with differing machine learning algorithms, these features have performed well for various audio-based recognition tasks [1,26,27,43].
For the extraction of these features, Mel-spectrograms (with 128 Mel-frequency bands) are first plotted from the audio clips with the matplotlib library and the resulting images are then forwarded through the networks. For VGG16 and VGG19, we use the neuron activations of the second to last fully connected layer as representations, while for the ResNet and DenseNet networks, global average pooling is applied to the convolutional base to form the audio features. For the work presented herein, we also evaluate the ImageNet pre-training against random initialisation of weights and against features extracted from models trained on audio data with the open-source toolkits openl3 [28,29] and PANNs [30]. While openl3 uses Mel-spectrograms as input just as DEEP SPECTRUM does, PANNs employs a hybrid wavegram feature, combining a small 1D CNN trained directly on the raw audio waveform with Mel-spectrograms by concatenation along the channel axis. Both approaches make use of CNNs as feature extractors. For openl3, we further use the network trained on environmental sounds instead of the one for music recognition, as the former fits better with our target tasks. As classifier, we use a linear SVM to which we feed the DEEP SPECTRUM features after applying input standardisation. We optimise the classifier's complexity parameter on a logarithmic scale from 10^-9 to 1 to achieve the best macro averaged F1 score on the suggested 4-fold cross-validation (CV) setup. The same procedure is also applied on DCASE 2017 and ESC-50.
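The model selection criterion for the SVM can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes one candidate complexity value per decade between 10^-9 and 1 (the step size is not stated in the text), and the names C_GRID, macro_f1 and select_best_c are hypothetical.

```python
# Assumption: one candidate value per decade, 1e-9, 1e-8, ..., 1.
C_GRID = [10.0 ** exponent for exponent in range(-9, 1)]


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def select_best_c(cv_scores):
    """cv_scores maps a complexity value to its per-fold macro F1 scores;
    return the value with the highest mean score across folds."""
    return max(cv_scores, key=lambda c: sum(cv_scores[c]) / len(cv_scores[c]))


example_f1 = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
best_c = select_best_c({0.1: [0.5, 0.6, 0.55, 0.6], 1.0: [0.7, 0.8, 0.75, 0.7]})
print(example_f1, best_c)
```

Macro averaging weights every class equally, so rare classes influence the selected complexity value as much as frequent ones.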

Decision-level fusion
In order to assess whether the different systems trained in our experiments are complementary to each other, we apply a decision-level fusion approach to the predictions of the CRNN models and the SVM classifiers trained on the deep features extracted by the various CNNs. On all three datasets, we perform classifier stacking by utilising the predictions generated by all of the individual trained models, i.e. we concatenate the class probabilities or decision function values generated by the CRNN and SVM models to form a new set of features. These features are then used to train another linear SVM as a meta-model to predict the correct class labels. We use feature standardisation and optimise the SVM's complexity parameter on a logarithmic scale from 10^-9 to 1 with the official 4-fold cross-validation schemes on the development partition for both DCASE tasks, and the first four folds of ESC-50. The best performing configuration is then trained on the whole development partition and used to predict the class labels on the test set.
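How the stacked feature vectors for the meta-model can be assembled is sketched below in plain Python. The sketch covers only the concatenation and standardisation steps, not the SVM training itself; the toy scores are made up for illustration, the standardiser uses the population standard deviation as an assumption, and all function names are hypothetical.

```python
import math
from itertools import chain


def stack_predictions(*model_outputs):
    """Concatenate the per-sample score vectors of several models.
    Each argument is a list with one score vector per sample."""
    return [list(chain.from_iterable(sample)) for sample in zip(*model_outputs)]


def fit_standardiser(rows):
    """Per-feature mean and standard deviation, estimated on training rows."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
        for j in range(dims)
    ]  # constant features keep a divisor of 1.0
    return means, stds


def standardise(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


# Toy example: two samples, three classes, two base models.
crnn_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]       # class probabilities
svm_scores = [[1.2, -0.3, -0.9], [-0.5, 0.9, -0.4]]   # decision function values
stacked = stack_predictions(crnn_probs, svm_scores)
means, stds = fit_standardiser(stacked)
meta_features = standardise(stacked, means, stds)
print(meta_features)
```

In a cross-validation setting, the standardiser would be fitted on the training folds only and then applied to the held-out fold, so that no information leaks into the meta-model.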

DCASE 2018, task 5
In each fold, we train our CRNNs on the Mel-spectrogram data and evaluate them on the test partition. We perform mean fusion of the class probabilities generated by each of the four fold models to arrive at the final predictions on the test set. We further experiment with different learning rate and batch size combinations, specifically lr ∈ {0.01, 0.001} and batch size ∈ {64, 128}. The results in Table 1 demonstrate that the CRNN systems perform best when trained with a batch size of 64 and a learning rate of 0.01. We choose one model for each of the clipping values for evaluation on the test set and for decision-level fusion. On the development partition, a lone CRNN model performs best when clipping noise below −60 dB, achieving an F1 score of 78.8%. Clipping more noise (at −45 dB and −30 dB) results in worse performance on the development partition, indicating a loss of useful information found in the input signal. When looking at the results on the evaluation partition, clipping noise below −45 dB leads to the strongest result of 79.3% F1. This behaviour might be caused by the introduction of recordings from microphones which are not present in the development partition. Therefore, clipping further might counteract the influence of the unfamiliar sound characteristics of these microphones. Furthermore, noise clipping has a regularising effect on CRNN training, acting against overfitting on the recording setting of the development partition. While clipping less of the input signal allows the model to perform better on the development set, it in turn loses some of its generalisation capabilities.
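The mean fusion of the fold models can be written down in a few lines. The sketch below assumes that every fold model outputs one class probability vector per test sample; the probabilities are invented toy values and the function names mean_fusion and predict are hypothetical.

```python
def mean_fusion(fold_probabilities):
    """Average class probabilities over fold models.
    fold_probabilities[k][i][c] is the probability that model k assigns
    to class c for test sample i."""
    n_models = len(fold_probabilities)
    fused = []
    for per_sample in zip(*fold_probabilities):
        fused.append([sum(values) / n_models for values in zip(*per_sample)])
    return fused


def predict(fused):
    """Index of the class with the highest fused probability per sample."""
    return [max(range(len(p)), key=p.__getitem__) for p in fused]


# Toy example: four fold models, two test samples, three classes.
fold_probs = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
    [[0.2, 0.7, 0.1], [0.2, 0.2, 0.6]],
    [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]],
]
fused = mean_fusion(fold_probs)
predictions = predict(fused)
print(fused, predictions)
```

In the toy example, the third model alone would assign the first sample to class 1, but averaging over all four models yields class 0.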
The training procedure of the SVM models utilising various CNNs as feature extractors is as described in Section 3.3. For DCASE 2018, we also evaluated the impact on classifier performance resulting from choosing different colour maps for the plots of the Mel-spectrograms used in the DEEP SPECTRUM system. In Table 2, results with five different colour mappings for an ImageNet pre-trained 50-layer ResNet are presented. From these results, it can be seen that choosing different colour mappings only has a marginal effect on classification accuracy. Based on these findings, we do not use multiple colour maps for the remaining databases.
Of larger interest are the results achieved with different configurations of model architecture and pre-training, as can be seen in Table 3. Notably, ImageNet pre-trained DenseNet121 and ResNet50 achieve the highest performance on the test partition measured by macro average F1, with 81.1% and 80.3%, respectively. For all network architectures, pre-training on ImageNet improves the saliency of the extracted features for domestic activity classification compared to using randomly initialised weights. These performance deltas are in the range of 5 to 10 percentage points. ImageNet pre-trained CNNs further compare favourably with the two evaluated audio pre-trained CNNs. While PANN achieves a higher F1 score of 84.6% than any of the DEEP SPECTRUM systems, openl3 features perform worse than every other feature extractor, even when taking the randomly initialised image CNNs into account. When late fusion is applied to the different system configurations, [...] a cross-modal fashion by fusing DEEP SPECTRUM, openl3 and PANN shows a larger performance improvement to the highest F1 of 87.0%. This perceived complementarity of features indicates the viability of transfer learning across modalities. The confusion matrix of this best result is also displayed in Fig. 2. While this result falls slightly short of the top performing submission of the challenge, which utilises data augmentation with generative adversarial networks (GANs) and reaches 88.4%, it improves on the strong baseline of 85.0%.

DCASE 2017, task 1
In the case of DCASE 2017's acoustic scene classification task, the CRNN trained only on the corpus performs slightly below the challenge's [...] also the worst among the three tasks, indicating that for DCASE 2017, pre-training is not as effective as for the other databases, regardless of source domain. A confusion matrix for the best fusion configuration can be found in Fig. 3.

ESC-50
For ESC-50, the CRNN trained only on target data achieves a test set accuracy of 68.8%, which is worse than the dataset's official baseline at 72.4%. However, it has to be noted here that the dataset uses a 5-fold cross-validation setup whereas in this paper, we transformed this to 4-fold [...] this mix, however, has a performance degrading effect. The best result on ESC-50 is also visualised via a confusion matrix in Fig. 4.

Fig. 3 The confusion matrix (CM) of the best prediction on the test set of the DCASE 2017, task 1 database. Confusion is high for the acoustic scene "residential area", which is often mistaken for "city" or "forest_path"

Fig. 4 The confusion matrix (CM) of the best prediction on the fifth fold of the ESC-50 database. The highest confusion can be observed for the classes "frog" and "crow"

Conclusions and future work
We have proposed a deep learning framework composed of an image-to-audio transfer learning system, audio pre-trained CNNs and a CRNN. Furthermore, we evaluated various decision-level fusion strategies between the applied neural networks. We have tested our methodologies for audio-based classification of 15 acoustic scenes (DCASE 2017, task 1 [31]), 50 environmental sounds (ESC-50 [32]) and 9 domestic activities (DCASE 2018, task 5 [8]). We have demonstrated the suitability of our approaches for all of the mentioned tasks. In particular, we have shown that even though the domain gap between audio and images is considerably larger than what is usually found in the field of transfer learning, ImageNet pre-trained CNNs are powerful feature extractors when applied directly to spectrograms, oftentimes matching or outperforming specialised audio feature extraction networks. We further evaluated the ImageNet pre-training against random weight initialisation and found it to be more effective in general. Moreover, various late fusion configurations indicated a complementarity between DEEP SPECTRUM features and more domain-specific knowledge, either in the form of our proposed CRNN or audio pre-trained networks. Whilst our systems did not outperform the current state-of-the-art on the included databases, the findings presented herein motivate further exploration of cross-modal pre-training for audio classification tasks.
In future work, we want to evaluate the impact of ImageNet pre-training against AudioSet pre-training as well as training from scratch in low-data settings. Furthermore, we want to investigate traditional fine-tuning and more involved domain transfer methods, such as domain adversarial neural networks (DANNs) [45], with our DEEP SPECTRUM system.