Fig. 4
From: Towards cross-modal pre-training and learning tempo-spatial characteristics for audio recognition with convolutional and recurrent neural networks
The confusion matrix (CM) of the best prediction on the fifth fold of the ESC-50 database. The highest confusion can be observed for the classes “frog” and “crow”.