Singer identification using perceptual features and cepstral coefficients of an audio signal from Indian video songs
© Ratanpara and Patel. 2015
Received: 25 December 2014
Accepted: 17 June 2015
Published: 25 June 2015
Singer identification is a challenging topic in music information retrieval because the background instrumental music accompanying the singing voice reduces system performance. One of the main disadvantages of existing systems is that vocals and instrumentals are separated manually and only the vocals are used to build the training model. The research presented in this paper automatically recognizes a singer, without separating the instrumental and singing sounds, using audio features such as timbre coefficients, pitch classes, mel frequency cepstral coefficients (MFCC), linear predictive coding (LPC) coefficients, and loudness of an audio signal from Indian video songs (IVS). Initially, various IVS of distinct playback singers (PS) are collected. Then, 53 audio features (a 12-dimensional timbre feature vector, 12 pitch classes, 13 MFCC coefficients, 13 LPC coefficients, and a 3-dimensional loudness feature vector) are extracted from each segment. The dimensionality of the extracted audio features is reduced using the principal component analysis (PCA) method. The playback singer model (PSM) is trained using multiclass classification algorithms such as back propagation, AdaBoost.M2, the k-nearest neighbor (KNN) algorithm, the naïve Bayes classifier (NBC), and the Gaussian mixture model (GMM). The proposed approach is tested on various combinations of the dataset and different combinations of audio feature vectors with songs of various Indian male and female playback singers.
Keywords: AdaBoost.M2, Singer, Music, Timbre, Pitch, Signal, Audio, MFCC, LPC, Loudness
Indian music has become popular because of its playback singers. An Indian Hindi movie contains video songs sung by distinct playback singers. Viewers can upload and download video songs from the Internet and CDs/DVDs. Indian video songs (IVS) can be extracted from Indian movies [1, 2], which makes such collections grow rapidly. Indexing such IVS requires information along different dimensions, such as the playback singer of an IVS and the on-screen actor/actress performing in it. Currently, this information is attached manually as a textual caption to the IVS. Textual captions are highly unreliable because IVS are uploaded by ordinary users. Viewers therefore require powerful functions for browsing [2, 3], searching, and indexing video content. The singing voice is one of the most important parameters of Indian video songs; it is the element of a song that attracts listeners. Although singing is continuous speech, speech analysis and synthesis techniques do not carry over unchanged to the singing voice, and there is no efficient algorithm that works well on both speech identification and singing voice characterization. Information about the singer's voice is therefore essential to organize, extract, and classify music collections. Sometimes a viewer wants to hear Indian video songs based on personal interest, such as a favorite playback singer, actor, or actress, so there is a need for a system that provides these features.
The proposed system can identify the singing voice and recognize a singer from IVS. Significant accuracy can be achieved by extracting features from the audio portion of IVS. One use of this system is that the video songs of famous Indian playback singers can be identified in a large database. It can be useful for learning a singer's voice characteristics by listening to songs of different genres, for categorizing unlabeled IVS, and for copyright protection. IVS require information along different dimensions for efficient searching and indexing, so this system can support indexing and efficient search for query-based IVS retrieval. Here, Indian video songs are considered rather than Indian audio songs because the system can be extended to visual clues. Indian video songs are marketed by their music, actors, and actresses, so it is necessary to index an Indian video song using parameters like playback singer, actor, and actress for efficient search and retrieval.
2 Related work
A significant amount of research has been performed on speaker identification from digitized speech for applications such as identity verification. These systems use features that are common in speech recognition and speaker recognition. They are trained on data without background noise, and their performance tends to degrade in noisy environments. Because they are trained on spoken data, they produce poor results for singing voice input. Mel frequency cepstral coefficients (MFCCs) were originally developed for automatic speech recognition applications and can also be used for music modeling. Pitch and rhythm audio features have been computed. MFCC feature vectors and an artificial neural network classifier have been used to identify a playback singer from a database; that system achieved an accuracy of 70 % on a 10-artist database, and instrumental and singing sounds were not separated. Singer's vibrato-based octave frequency cepstral coefficients (OFCC) have been used for singer identification; experiments were performed using only 84 popular songs from a 12-singer database, and an average error rate of 16.2 % was achieved in segment-level identification. In one study, composite transfer function-based features were extracted and a polynomial classifier was used for classification; a self-recorded database of 12 female singers was used to build the training model, which produced 82 % accuracy. Music features have been extracted for musicological purposes using the Echo Nest API. In other work, the spectrogram, an effective time-frequency feature, was used as classification input, and several classification techniques were compared, such as the feed-forward network and k-nearest neighbor. Energy function, zero crossing rate, and harmonic coefficients have also been used for singer identification. One drawback of the above systems is that the training model is generated manually: the singer's voice is separated by removing the instrumental music from the audio songs, and only the vocals are used to build the training model.
Sometimes a self-recorded dataset is used for singer identification, which increases the complexity of a system and requires a lot of execution time. In our proposed approach, the training model is built automatically, not manually: vocals and instrumental music are not separated, and both are used to build the training model. In other systems, only audio songs of singers are considered, but here video songs are taken as input. The main advantage of this system is that it can be extended using visual clues: an actor or actress can be classified from the video portion and merged with our proposed system, because users are sometimes interested in watching IVS of their favorite actor or actress on screen while listening to their favorite singer in the background.
The rest of the paper is organized as follows: Section 3 describes the proposed approach. The experimental setup is given in Section 4. Experimental results are explained in Section 5, followed by the conclusion in Section 6.
3 Proposed approach
Algorithm 1 represents the singer recognition approach using different classifiers from IVS.
Collect N Indian video songs of M singers.
Separate the audio portion and video portion from each video song.
Compute the audio feature vectors x1, x2, x3, …, x53 for each segment of the audio portion, where x1–x12 are timbre feature vectors, x13–x24 pitch classes, x25–x27 loudness features, x28–x40 MFCC feature vectors, and x41–x53 LPC coefficients. These features are stored in the S1 structure. The size of the S1 structure is S × 53, where S is the total number of segments. The total number of segments depends on the length of the audio portion and the audio features, which are explained in Section 3.2.
The mean removal technique is applied to the S1 structure and the result is stored in the S2 structure.
The principal component analysis method is applied to the S2 structure to compute eigenvalues and eigenvectors using the singular value decomposition (SVD) technique. The resulting scores are stored in a score structure, which is divided into two parts: a training dataset (80 %) and a testing dataset (20 %).
The PSM is obtained using the back propagation and AdaBoost.M2 algorithms with a neural network; the PSM is also obtained using KNN, GMM, and the naïve Bayes classifier.
Compute the probability of each song for the M singers from the test sample song dataset.
Return the recognized singer name, i.e., the one with the maximum probability among the M singers.
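The final decision step above can be sketched as follows. This is a minimal illustration, not the paper's implementation; in particular, combining per-segment probabilities by averaging before taking the maximum is an assumption, since the paper does not specify the aggregation rule.

```python
import numpy as np

def recognize_singer(segment_probs, singer_names):
    """Pick the singer with maximum probability for a test song.

    segment_probs: (num_segments, M) per-segment probabilities for M singers.
    Aggregation by averaging over segments is an illustrative assumption.
    """
    song_probs = np.asarray(segment_probs).mean(axis=0)  # per-singer song score
    return singer_names[int(np.argmax(song_probs))]      # maximum-probability singer

# Toy example: two segments, three candidate singers.
probs = [[0.2, 0.5, 0.3],
         [0.1, 0.7, 0.2]]
print(recognize_singer(probs, ["A", "B", "C"]))  # → B
```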
3.1 Song collection
In the proposed approach, the dataset is generated by us because no standard database is available for IVS of different singers. Six famous singers are selected from the Indian Bollywood industry. For each singer, video songs are downloaded from the Internet. Then the audio and video portions are separated from the IVS. The audio portion is divided into various segments to compute the audio features, as explained in Section 3.2.
3.3 Feature extraction
Music researchers started a company named the Echo Nest in 2005. It provides free analysis of music via an API. Users can retrieve information about artists, blogs, and songs. A song uploaded by a user receives a unique song id, which is used to extract song features such as tempo, timbre, loudness, and pitches. The API can also collect socially oriented information such as blogs, social networks, and web pages. Echo Nest version 4.2 is used in this procedure. Fifty-three audio feature vectors (x1, x2, x3, …, x53) are computed from each segment. The following audio features are extracted using the Echo Nest API: (1) Echo Nest timbre (ENT), (2) Echo Nest pitches, and (3) Echo Nest loudness.
3.3.1 Echo Nest timbre
3.3.2 Echo Nest pitches
Pitch is an auditory sensation in which a listener assigns musical tones to relative positions. It is measured in hertz. Pitch [16, 17] content is given by a "chroma" vector corresponding to the 12 pitch classes C, C#, D, …, B, with values ranging from 0 to 1 that describe the relative dominance of each pitch in the chromatic scale. Twelve audio features (x13 to x24) are computed for each segment from the 12 pitch classes.
3.3.3 Echo Nest loudness
Loudness start: provides loudness level at the start of the segment
Loudness max time: time offset within the segment at which the maximum loudness occurs
Loudness max: highest loudness value within the segment
Mel frequency cepstral coefficients and linear predictive coding coefficients are computed using the following methods.
Mel frequency cepstral coefficients
The final step in computing MFCCs is to take the discrete cosine transform (DCT) of the log filter bank energies. Only 12 of the 26 DCT coefficients are kept, because the higher DCT coefficients represent fast changes in the filter bank energies, which reduce system performance; dropping them gives a small improvement. Thirteen MFCC coefficients (x28 to x40) are used in our proposed approach; they are extracted for each segment from the IVS.
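The final MFCC step described above can be sketched in numpy as follows. This is a minimal illustration, assuming 26 filter banks and 13 retained coefficients as in the text; the frame data is synthetic, and the unnormalized type-II DCT here stands in for whatever normalization the actual toolchain uses.

```python
import numpy as np

def dct2(x):
    """Type-II DCT along the last axis (unnormalized, matrix form)."""
    N = x.shape[-1]
    n = np.arange(N)
    # basis[k, n] = cos(pi * k * (2n + 1) / (2N))
    basis = np.cos(np.pi * np.outer(np.arange(N), 2 * n + 1) / (2 * N))
    return x @ basis.T

def mfcc_from_fbank(log_energies, num_ceps=13):
    """DCT of log filter-bank energies, keeping only the low-order coefficients."""
    return dct2(log_energies)[:, :num_ceps]  # drop fast-varying higher coefficients

# Synthetic log filter-bank energies: 100 frames, 26 mel filter banks.
frames = np.log(np.random.rand(100, 26) + 1e-6)
print(mfcc_from_fbank(frames).shape)  # (100, 13)
```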
Linear predictive coding coefficients
The length of an Indian video song is around 5–7 min, and 53 audio features are calculated for each segment. A typical problem in this system is that a large number of audio features are extracted from each Indian video song, which requires large computational time for training the database and more execution time for song queries. Therefore, the number of feature dimensions needs to be reduced without losing important information in the audio features. The dimension is reduced using the principal component analysis (PCA) method, which is discussed in the following section.
3.4 Dimension reduction
The principal component analysis method [21, 22] is used to compute principal components, which reduce the dimension of the extracted audio features while retaining as much of the variance in the audio features as possible. Principal components are extracted by a linear transformation to a new set of features that are uncorrelated and ordered according to their importance. The principal components are computed using the singular value decomposition algorithm.
Subtract the mean from each audio feature vector, producing feature vectors with zero mean.
Calculate the covariance matrix.
Compute the eigenvalues and eigenvectors of the covariance matrix.
Sort the eigenvalues in descending order. The number of eigenvectors retained determines the number of dimensions of the new audio feature vectors.
Derive the new audio feature vectors: take the transpose of the eigenvector matrix and multiply it on the left of the original data set.
3.5 Model generation
In the proposed approach, the playback singer model (PSM) is obtained using different classification [25, 26] algorithms. The following models are generated: (1) the Gaussian mixture model, (2) the k-nearest neighbor model, (3) the naïve Bayes classifier (NBC), (4) the back propagation algorithm using a neural network (BPNN), and (5) the AdaBoost.M2 model.
3.5.1 Gaussian mixture model
The playback singer recognition algorithm fits a Gaussian mixture model [15, 27] distribution over an audio feature space comprising loudness, timbre, MFCC, LPC, and pitch. The mean and covariance of the feature vectors are computed for each song in our training dataset. The average unnormalized negative log-likelihood (average-UNLL) of a song given a Gaussian distribution is calculated, which leads to the prediction of the singer.
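The average-UNLL score can be sketched for a single Gaussian as follows. This is an illustrative assumption about what "unnormalized" means here (only the Mahalanobis term, without the log-determinant constant); the singer whose model yields the lowest value would be predicted.

```python
import numpy as np

def average_unll(segments, mean, cov):
    """Average unnormalized negative log-likelihood of a song's segments.

    segments: (S, d) feature vectors; mean: (d,); cov: (d, d).
    Keeps only the Mahalanobis term (an assumption about "unnormalized").
    """
    inv_cov = np.linalg.inv(cov)
    diff = segments - mean
    # Per-segment Mahalanobis distance: diff_i^T * inv_cov * diff_i
    mahalanobis = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    return 0.5 * mahalanobis.mean()

rng = np.random.default_rng(0)
segs = rng.normal(size=(50, 4))                     # synthetic song segments
print(average_unll(segs, np.zeros(4), np.eye(4)))   # score under a unit Gaussian
```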
3.5.2 K-nearest neighbor model
The k-nearest neighbor model is used to predict the singer with K = 3. The training samples are audio feature vectors distributed in a multidimensional feature space, each carrying a class label. The feature vectors and class labels of the training samples are stored in the training phase. The Euclidean distance is computed for each test sample, and test samples are classified by assigning class labels using the k nearest training samples.
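The KNN prediction step above can be sketched as follows, with K = 3 and Euclidean distance as stated; the training data here is a synthetic stand-in for the stored feature vectors and labels.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, test_x, k=3):
    """Classify one test vector by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(train_X - test_x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of k closest samples
    votes = Counter(train_y[i] for i in nearest)       # majority vote on labels
    return votes.most_common(1)[0][0]

# Synthetic training samples: two well-separated singer clusters.
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])
y = ["singer_A", "singer_A", "singer_B", "singer_B"]
print(knn_predict(X, y, np.array([0.2, 0.0])))  # → singer_A
```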
3.5.3 Naïve Bayes classifier
3.5.4 Back propagation algorithm using neural network
The back propagation neural network model is a supervised learning model used in many applications.
It is based on the gradient descent method: it calculates the gradient of the error function with respect to all weights in the neural network, and the gradient is used to update the weights in an attempt to minimize the error function. The following algorithm is used for a three-layer network (one input layer, one hidden layer, and one output layer); the number of neurons in the hidden layer is 50.
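The gradient-descent update described above can be sketched in numpy for a three-layer network with 53 inputs (one per audio feature), 50 sigmoid hidden units, and 6 outputs (one per singer). The data, learning rate, and epoch count are illustrative, not the paper's actual training configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 53))                  # 32 synthetic segments, 53 features
T = np.eye(6)[rng.integers(0, 6, size=32)]     # one-hot targets for 6 singers

W1 = rng.normal(scale=0.1, size=(53, 50))      # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(50, 6))       # hidden -> output weights
lr = 0.5                                       # illustrative learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

errors = []
for _ in range(500):
    H = sigmoid(X @ W1)                        # forward pass: hidden layer
    Y = sigmoid(H @ W2)                        # forward pass: output layer
    errors.append(np.mean((Y - T) ** 2))       # squared-error function
    dY = (Y - T) * Y * (1 - Y)                 # error gradient at the output
    dH = (dY @ W2.T) * H * (1 - H)             # gradient backpropagated to hidden
    W2 -= lr * (H.T @ dY) / len(X)             # gradient-descent weight updates
    W1 -= lr * (X.T @ dH) / len(X)

print(errors[0] > errors[-1])                  # the error function decreases
```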
3.5.5 AdaBoost.M2 model
3.6 Singer recognition
In our proposed approach, singer recognition is carried out using the trained models. Indian video songs (IVS) that were not used to train the models are given for testing using the various classification algorithms, which leads to the recognition of a singer.
4 Experimental setup
List of singers used in our database
Rahat Fateh Khan
5 Experimental results
Normally, a playback singer identification system is divided into two phases: training and testing. The PSM is generated using known songs of the singers, and in the testing phase unknown songs of the singers are tested against it and a score is computed. For the query set, 20 % of the dataset from each singer is used, and the remaining 80 % is used for training. Training samples are thus selected automatically, not manually, from the dataset.
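The automatic per-singer 80/20 split described above can be sketched as follows; the song lists are illustrative placeholders, and the random shuffling is one reasonable way to make the selection automatic.

```python
import numpy as np

def split_dataset(songs_by_singer, train_frac=0.8, seed=0):
    """Split each singer's songs into train (80 %) and test (20 %) sets."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for singer, songs in songs_by_singer.items():
        order = rng.permutation(len(songs))        # shuffle this singer's songs
        cut = int(train_frac * len(songs))         # 80 % boundary
        train += [(singer, songs[i]) for i in order[:cut]]
        test += [(singer, songs[i]) for i in order[cut:]]
    return train, test

# Placeholder song ids for two singers, ten songs each.
data = {"singer_A": [f"a{i}" for i in range(10)],
        "singer_B": [f"b{i}" for i in range(10)]}
train, test = split_dataset(data)
print(len(train), len(test))  # 16 4
```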
Various types of division for database
Number of songs used
Different types of combinations of audio features
Figure 10 shows that the TPL features perform better than the TL and PL features. An accuracy of 80 % is achieved by the AdaBoost.M2 model, and 71 % by the BPNN model, using TPL features. Figure 11 shows the performance of the MLP, MT, MLPT, and MLPTPL feature sets using different classification models.
It shows that 81 % accuracy is achieved by the MLPTPL feature set with the AdaBoost.M2 model. The AdaBoost.M2 model performs better than the other playback singer models for the other combinations of feature sets as well (Table 3). It is also observed that when MFCC coefficients are combined with LPC and timbre coefficients in the AdaBoost.M2 model, an accuracy of 70 % or higher is achieved.
In this paper, a playback singer recognition technique is proposed using perceptual features and cepstral coefficients of an audio signal from Indian video songs. The proposed scheme first uses the PCA method to reduce the dimensionality of the audio feature vectors. Then five models (GMM, KNN, AdaBoost.M2, BPNN, and NBC) are generated using the extracted audio feature vectors. Experimental results show that the MLPTPL features with the AdaBoost.M2 model give higher accuracy than the other feature sets. It is observed that AdaBoost.M2 is more efficient than GMM, BPNN, NBC, and KNN. The accuracy of the AdaBoost.M2 model increases when the number of learning cycles is increased from 50 to 5000, and the redistribution error decreases as the number of learning cycles increases. Various performance measures are plotted to show the accuracy of our proposed approach. The proposed system can be extended using visual clues from the video portion to identify the actor or actress.
- T Ratanpara, M Bhatt, A novel approach to retrieve video song using continuity of audio segments from Bollywood movies, in Third International Conference on Computational Intelligence and Information Technology (CIIT), 2013, pp. 87–92
- T Ratanpara, M Bhatt, P Panchal, A novel approach for video song detection using audio clues from Bollywood movies. Emerg. Res. Comput. Inf. Commun. Appl. 1, 649–656 (2013)
- Y Fukazawa, J Ota, Automatic task-based profile representation for content-based recommendation. Int. J. Knowl. Based Intell. Eng. Syst. 16, 247–260 (2012)
- A Fanelli, L Caponetti, G Castellano, C Buscicchio, A hierarchical modular architecture for musical instrument classification. Int. J. Knowl. Based Intell. Eng. Syst. 9(3), 173–182 (2005)
- L Regnier, G Peeters, Singer verification: singer model vs. song model, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 437–440
- R Mammone, J Rechard, X Zhang, R Ramachandran, Robust speaker recognition: a feature-based approach. IEEE Signal Process. Mag. 13(5) (1996)
- B Logan, Mel frequency cepstral coefficients for music modeling, in International Symposium on Music Information Retrieval, 2000
- B Whitman, G Flake, S Lawrence, Artist detection in music with minnow match, in Proceedings of the 2001 IEEE Workshop on Neural Networks for Signal Processing, 2001, pp. 559–568
- TL Nwe, H Li, Exploring vibrato-motivated acoustic features for singer identification. IEEE Trans. Audio Speech Lang. Process. 15(2), 519–530 (2007)
- M Bartsch, G Wakefield, Singing voice identification using spectral envelope estimation. IEEE Trans. Speech Audio Process. 12(2), 100–109 (2004)
- J Andersen, Using the Echo Nest's automatically extracted music features for a musicological purpose, in 4th International Workshop on Cognitive Information Processing (CIP), 2014, pp. 1–6
- P Doungpaisan, Singer identification using time-frequency audio feature, in Advances in Neural Networks – ISNN 2011, 6676, pp. 486–495 (2011)
- L Gomez, H Sossa, R Barron, J Jimenez, A new methodology for music retrieval based on dynamic neural networks. Int. J. Hybrid Intell. Syst. 9, 1–11 (2012)
- T Zhang, System and method for automatic singer identification, in IEEE International Conference on Multimedia and Expo, 2003
- D Tingle, YE Kim, D Turnbull, Exploring automatic music annotation with "acoustically-objective" tags, in Proceedings of the International Conference on Multimedia Information Retrieval, ACM, 2010, pp. 55–62
- T Jehan, D DesRoches, The Echo Nest Analyzer documentation, 2014
- B Marshall, Aggregating music recommendation Web APIs by artist, in IEEE Conference on Information Reuse and Integration (IRI), 2010, pp. 75–79
- J Sun, H Li, L Ma, A music key detection method based on pitch class distribution theory. Int. J. Knowl. Based Intell. Eng. Syst. 15(3), 165–175 (2011)
- T Ratanpara, N Patel, Singer identification using MFCC and LPC coefficients from Indian video songs, in Emerging ICT for Bridging the Future: Proceedings of the 49th Annual Convention of the Computer Society of India (CSI), vol. 1, 337, 275–282 (2015)
- LB Jackson, Digital Filters and Signal Processing, 2nd edn. (Kluwer Academic Publishers, Boston, 1989), pp. 255–257
- V Panagiotou, N Mitianoudis, PCA summarization for audio song identification using Gaussian mixture models, in 18th International Conference on Digital Signal Processing (DSP), 2013, pp. 1–6
- MJ Zaki, W Meira, Data Mining and Analysis: Fundamental Concepts and Algorithms (Cambridge University Press, 2014)
- M Diez, A Varona, On the projection of PLLRs for unbounded feature distributions in spoken language recognition. IEEE Signal Process. Lett. 21(9), 1073–1077 (2014)
- S Shum, N Dehak, R Dehak, J Glass, Unsupervised methods for speaker diarization: an integrated and iterative approach. IEEE Trans. Audio Speech Lang. Process. 21(10), 2015–2028 (2013)
- A Khan, A Majid, A Mirza, Combination and optimization of classifiers in gender classification using genetic programming. Int. J. Knowl. Based Intell. Eng. Syst. 9, 1–11 (2005)
- A Lampropoulos, P Lampropoulou, G Tsihrintzis, Music genre classification based on ensemble of signals produced by source separation methods. Intell. Decis. Technol. 4, 229–237 (2010)
- K Rao, S Koolagudi, Recognition of emotions from video using acoustic and facial features. Signal Image Video Process., 1–17 (2013)
- H Maniya, M Hasan, Comparative study of naïve Bayes classifier and KNN for tuberculosis, in International Conference on Web Services Computing (ICWSC), 2011
- MAW Saduf, Comparative study of back propagation learning algorithms for neural networks. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 3(12), 1151–1156 (2013)
- O Martinez Mozos, C Stachniss, W Burgard, Supervised learning of places from range data using AdaBoost, in Proceedings of the 2005 IEEE International Conference on Robotics and Automation, 2005, pp. 1730–1735
- Y Jizheng, X Mao, Y Xue, Facial expression recognition based on t-SNE and AdaBoost.M2, in IEEE International Conference on Green Computing and Communications and IEEE Internet of Things and IEEE Cyber, Physical and Social Computing (GreenCom-iThings-CPSCom), 2013, pp. 1744–1749
- Y Freund, RE Schapire, A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)
- J Rodríguez, A Pérez, J Lozano, Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE Trans. Pattern Anal. Mach. Intell. 32(3), 569–575 (2010)
- I Junejo, A Bhutta, H Foroosh, Single-class SVM for dynamic scene modeling. Signal Image Video Process. 7(1), 45–52 (2011)
- G Williams, Instantaneous receiver operating characteristic (ROC) performance of multi-gain-stage APD photoreceivers. IEEE J. Electron Devices Soc. 1(6), 145–153 (2013)
- J Youden, Index for rating diagnostic tests. Cancer 3, 32–35 (1950)
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.