G. L. Trager, Paralanguage: a first approximation. Stud. Linguist. 13, 1–11 (1958).
R. Fernandez, R. Picard, Recognizing affect from speech prosody using hierarchical graphical models. Speech Commun. 53(9–10), 1088–1103 (2011).
B. Schuller, S. Reiter, R. Muller, M. Al-Hames, M. Lang, G. Rigoll, in 2005 IEEE International Conference on Multimedia and Expo. Speaker independent speech emotion recognition by ensemble classification (IEEE, Piscataway, 2005), pp. 864–867. https://doi.org/10.1109/ICME.2005.1521560.
T. L. Nwe, S. W. Foo, L. C. De Silva, Speech emotion recognition using hidden Markov models. Speech Commun. 41(4), 603–623 (2003).
H. Fletcher, Loudness, pitch and the timbre of musical tones and their relation to the intensity, the frequency and the overtone structure. J. Acoust. Soc. Am. 6(2), 59–69 (1934).
J. Kreiman, D. Vanlancker-Sidtis, B. R. Gerratt, in ISCA Tutorial and Research Workshop on Voice Quality: Functions, Analysis and Synthesis. Defining and measuring voice quality, (2003), pp. 115–120. https://www.isca-speech.org/archive_open/voqual03/voq3_115.html.
M. J. Ball, J. Esling, C. Dickson, The VoQS system for the transcription of voice quality. J. Int. Phon. Assoc. 25(2), 71–80 (1995).
M. S. De Bodt, F. L. Wuyts, P. H. Van de Heyning, C. Croux, Test-retest study of the GRBAS scale: influence of experience and professional background on perceptual rating of voice quality. J. Voice 11(1), 74–80 (1997).
B. Barsties, M. De Bodt, Assessment of voice quality: current state-of-the-art. Auris Nasus Larynx 42(3), 183–188 (2015).
T. M. Elliott, L. S. Hamilton, F. E. Theunissen, Acoustic structure of the five perceptual dimensions of timbre in orchestral instrument tones. J. Acoust. Soc. Am. 133(1), 389–404 (2013).
A. Caclin, S. McAdams, B. K. Smith, S. Winsberg, Acoustic correlates of timbre space dimensions: a confirmatory study using synthetic tones. J. Acoust. Soc. Am. 118(1), 471–482 (2005).
B. O’Connor, S. Dixon, G. Fazekas, et al., in Proceedings of The 2020 Joint Conference on AI Music Creativity. An exploratory study on perceptual spaces of the singing voice (KTH Royal Institute of Technology, Stockholm, 2020). https://doi.org/10.30746/978-91-519-5560-5.
K. Heidemann, A system for describing vocal timbre in popular song. Music Theory Online 22(1), 2 (2016). https://mtosmt.org/issues/mto.16.22.1/mto.16.22.1.heidemann.html. Accessed 10 Apr 2022.
A. W. Cox, The metaphoric logic of musical motion and space (University of Oregon Press, Eugene, 1999).
D. K. Blake, Timbre as differentiation in indie music. Music Theory Online 18(2), 1 (2012). https://www.mtosmt.org/issues/mto.12.18.2/toc.18.2.html. Accessed 10 Apr 2022.
W. Slawson, Sound color (Yank Gulch Music, Talent, 1985).
R. Pratt, P. Doak, A subjective rating scale for timbre. J. Sound Vib. 45(3), 317–328 (1976).
R. Cogan, New images of musical sound (Harvard University Press, Cambridge, 1984).
M. Lavengood, A new approach to the analysis of timbre (PhD dissertation, City University of New York, New York City, 2017).
J. Wilkins, P. Seetharaman, A. Wahl, B. Pardo, in Proc. ISMIR 2018. VocalSet: a singing voice dataset, (2018), pp. 468–474. https://doi.org/10.5281/zenodo.1492453.
P. Zwan, in Audio Engineering Society Convention 121. Expert system for automatic classification and quality assessment of singing voices (Audio Engineering Society, Warsaw, 2006).
M. Łazoryszczak, E. Półrolniczak, Audio database for the assessment of singing voice quality of choir members. Elektronika: Konstrukcje, Technol., Zastosowania 54(3), 92–96 (2013).
M. Goto, T. Nishimura, AIST humming database: music database for singing research. IPSJ SIG Notes (Tech. Rep.) (Jpn. Ed.) 2005(82), 7–12 (2005).
J. Stark, Bel canto: a history of vocal pedagogy (University of Toronto Press, Toronto, 1999).
T. Bourne, M. Garnier, D. Kenny, Music theater voice: production, physiology and pedagogy. J. Sing. 67(4), 437 (2011).
I. Titze, Why do classically trained singers widen their throat? J. Sing. 69(2), 177 (2012).
A. Vurma, J. Ross, Where is a singer’s voice if it is placed “forward”? J. Voice 16(3), 383–391 (2002).
F. Eyben, M. Wöllmer, B. Schuller, in Proceedings of the 18th ACM International Conference on Multimedia. openSMILE: the Munich versatile and fast open-source audio feature extractor, (2010), pp. 1459–1462. https://doi.org/10.1145/1873951.1874246.
B. Schuller, S. Steidl, A. Batliner, J. Hirschberg, J. K. Burgoon, A. Baird, A. Elkins, Y. Zhang, E. Coutinho, K. Evanini, et al., in 17th Annual Conference of the International Speech Communication Association (Interspeech 2016), Vols. 1–5. The Interspeech 2016 computational paralinguistics challenge: deception, sincerity & native language, (2016), pp. 2001–2005. https://doi.org/10.21437/Interspeech.2016-129.
Y.-L. Lin, G. Wei, in Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, vol. 8. Speech emotion recognition based on HMM and SVM (IEEE, Piscataway, 2005), pp. 4898–4901. https://doi.org/10.1109/ICMLC.2005.1527805.
M. Hariharan, V. Vijean, C. Fook, S. Yaacob, in Proceedings of the 2012 IEEE 8th International Colloquium on Signal Processing and Its Applications. Speech stuttering assessment using sample entropy and least square support vector machine (IEEE, Piscataway, 2012), pp. 240–245. https://doi.org/10.1109/CSPA.2012.6194726.
R. B. Lanjewar, S. Mathurkar, N. Patel, Implementation and comparison of speech emotion recognition system using Gaussian mixture model (GMM) and k-nearest neighbor (k-NN) techniques. Procedia Comput. Sci. 49, 50–57 (2015).
L. S. Chee, O. C. Ai, M. Hariharan, S. Yaacob, in 2009 International Conference for Technical Postgraduates (TECHPOS). Automatic detection of prolongations and repetitions using LPCC (IEEE, Piscataway, 2009), pp. 1–4. https://doi.org/10.1109/TECHPOS.2009.5412080.
L. S. Chee, O. C. Ai, M. Hariharan, S. Yaacob, in 2009 IEEE Student Conference on Research and Development (SCOReD). MFCC-based recognition of repetitions and prolongations in stuttered speech using k-NN and LDA (IEEE, Piscataway, 2009), pp. 146–149. https://doi.org/10.1109/SCORED.2009.5443210.
L. He, M. Lech, N. C. Maddage, N. Allen, in Proceedings of the 2009 Fifth International Conference on Natural Computation, vol. 2. Stress detection using speech spectrograms and sigma-pi neuron units (IEEE, Piscataway, 2009), pp. 260–264. https://doi.org/10.1109/ICNC.2009.59.
G. Zhou, J. H. Hansen, J. F. Kaiser, in Proc. IEEE ICASSP 1999, vol. 4. Methods for stress classification: nonlinear TEO and linear speech based features (IEEE, Piscataway, 1999), pp. 2087–2090. https://doi.org/10.1109/ICASSP.1999.758344.
T. L. Nwe, S. W. Foo, L. C. De Silva, in Fourth International Conference on Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint, vol. 3. Detection of stress and emotion in speech using traditional and FFT based log energy features (IEEE, Piscataway, 2003), pp. 1619–1623. https://doi.org/10.1109/ICICS.2003.1292741.
K. K. Kishore, P. K. Satish, in Proceedings of the 3rd IEEE International Advance Computing Conference. Emotion recognition in speech using MFCC and wavelet features (IEEE, Piscataway, 2013), pp. 842–847. https://doi.org/10.1109/IAdCC.2013.6514336.
G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, S. Zafeiriou, in Proc. IEEE ICASSP 2016. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network (IEEE, Piscataway, 2016), pp. 5200–5204. https://doi.org/10.1109/ICASSP.2016.7472669.
T. Koike, K. Qian, B. W. Schuller, Y. Yamamoto, in Proc. Interspeech 2020. Learning higher representations from pre-trained deep models with data augmentation for the ComParE 2020 Challenge Mask Task, (2020), pp. 2047–2051. https://doi.org/10.21437/Interspeech.2020-1552.
S. Amiriparian, M. Gerczuk, S. Ottl, N. Cummins, M. Freitag, S. Pugachevskiy, A. Baird, B. Schuller, in Proc. Interspeech 2017. Snore sound classification using image-based deep spectrum features, (2017), pp. 3512–3516. https://doi.org/10.21437/Interspeech.2017-434.
H. Wu, W. Wang, M. Li, in Proc. Interspeech 2019. The DKU-Lenovo systems for the Interspeech 2019 computational paralinguistic challenge, (2019), pp. 2433–2437. https://doi.org/10.21437/Interspeech.2019-1386.
J. Wagner, D. Schiller, A. Seiderer, E. André, in Proc. Interspeech 2018. Deep learning in paralinguistic recognition tasks: are hand-crafted features still relevant?, (2018), pp. 147–151. https://doi.org/10.21437/Interspeech.2018-1238.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, in Advances in Neural Information Processing Systems. Attention is all you need (Curran Associates, Red Hook, 2017), pp. 5998–6008.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., in Proc. ICLR 2021. An image is worth 16x16 words: transformers for image recognition at scale, (2021).
F.-R. Stöter, S. Uhlich, A. Liutkus, Y. Mitsufuji, Open-Unmix: a reference implementation for music source separation. J. Open Source Softw. 4(41), 1667 (2019).
R. Hennequin, A. Khlif, F. Voituret, M. Moussallam, Spleeter: a fast and efficient music source separation tool with pre-trained models. J. Open Source Softw. 5(50), 2154 (2020).
A. Défossez, N. Usunier, L. Bottou, F. Bach, Demucs: deep extractor for music sources with extra unlabeled data remixed. arXiv preprint arXiv:1909.01174 (2019).
P. Boersma, Praat: doing phonetics by computer [computer program] (2011). http://www.praat.org/. Accessed 18 Apr 2021.
D. R. Appelman, The science of vocal pedagogy: theory and application, vol. 1 (Indiana University Press, Bloomington, 1967).
J. C. McKinney, The diagnosis and correction of vocal faults: a manual for teachers of singing and for choir directors (Waveland Press, Long Grove, 2005).
J. Large, Towards an integrated physiologic-acoustic theory of vocal registers. NATS Bull. 28(3), 18–25 (1972).
G. Grove, S. Sadie, The new Grove dictionary of music and musicians, vol. 1 (MacMillan Publishing Company, London, 1980).
J. L. LoVetri, Female chest voice. J. Sing. 60(2), 161–164 (2003).
S. Krajinovic, Problems of singers in opera plays. Master’s thesis (Høgskolen i Agder, Norway, 2006).
H. Cai, in Proceedings of the 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics. Acoustic analysis of resonance characteristics of head voice and chest voice (IEEE, Piscataway, 2019), pp. 1–6. https://doi.org/10.1109/CISP-BMEI48845.2019.8966068.
M. Aura, A. Geneid, K. Bjørkøy, M. Rantanen, A.-M. Laukkanen, A nasoendoscopic study of “head resonance” and “imposto” in classical singing. J. Voice 36(1), 83–90 (2020).
J. Sundberg, T. D. Rossing, The science of the singing voice. J. Acoust. Soc. Am. 87(1), 462–463 (1990).
P. L. Debertin, Perceptual judgments of nasal resonance. MA thesis (The University of Montana, 1979).
W. B. Wooldridge, Is there nasal resonance? Bulletin 13, 128–129 (1956).
S. F. Austin, Movement of the velum during speech and singing in classically trained singers. J. Voice 11(2), 212–221 (1997).
A. Vurma, J. Ross, The perception of ‘forward’ and ‘backward placement’ of the singing voice. Logopedics Phoniatrics Vocology 28(1), 19–28 (2003).
G. Lee, C. C. Yang, T. B. Kuo, Voice low tone to high tone ratio: a new index for nasal airway assessment. Chin. J. Physiol. 46(3), 123–127 (2003).
G.-S. Lee, C.-P. Wang, S. Fu, Evaluation of hypernasality in vowels using voice low tone to high tone ratio. Cleft Palate Craniofac. J. 46(1), 47–52 (2009).
K. Wyllys, A preliminary study of the articulatory and acoustic features of forward and backward tone placement in singing. MA thesis (Western Michigan University, 2013).
R. T. Sataloff, Professional singers: the science and art of clinical care. Am. J. Otolaryngol. 2(3), 251–266 (1981).
V. L. Stoer, H. Swank, Mending misused voices. Music Educ. J. 65(4), 47–51 (1978).
H. B. Rothman, A. A. Arroyo, Acoustic variability in vibrato and its perceptual significance. J. Voice 1(2), 123–141 (1987).
S. Z. K. Khine, T. L. Nwe, H. Li, in Proceedings of the International Symposium on Computer Music Modeling and Retrieval. Exploring perceptual based timbre feature for singer identification (Springer-Verlag, Berlin, 2007), pp. 159–171. https://doi.org/10.1007/978-3-540-85035-9_10.
T. Nakano, M. Goto, Y. Hiraga, in Proceedings of the 9th International Conference on Spoken Language Processing. An automatic singing skill evaluation method for unknown melodies using pitch interval accuracy and vibrato features, (2006). https://doi.org/10.21437/Interspeech.2006-474.
T. L. Nwe, H. Li, Exploring vibrato-motivated acoustic features for singer identification. IEEE Trans. Audio Speech Lang. Process. 15(2), 519–530 (2007).
M. El Ayadi, M. S. Kamel, F. Karray, Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recognit. 44(3), 572–587 (2011).
B. Schuller, S. Steidl, A. Batliner, in Proc. Interspeech 2009. The Interspeech 2009 emotion challenge, (2009), pp. 312–315. https://doi.org/10.21437/Interspeech.2009-103.
F. Weninger, F. Eyben, B. W. Schuller, M. Mortillaro, K. R. Scherer, On the acoustics of emotion in audio: what speech, music, and sound have in common. Front. Psychol. 4, 292 (2013).
J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization. CoRR abs/1607.06450 (2016). http://arxiv.org/abs/1607.06450. Accessed 6 Jan 2021.
K. He, X. Zhang, S. Ren, J. Sun, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Deep residual learning for image recognition (IEEE, Piscataway, 2016), pp. 770–778. https://doi.org/10.1109/CVPR.2016.90.
A. Rosenberg, in Proc. Interspeech 2012. Classifying skewed data: importance weighting to optimize average recall, (2012), pp. 2242–2245. https://doi.org/10.21437/Interspeech.2012-131.
T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, in Proceedings of the IEEE International Conference on Computer Vision. Focal loss for dense object detection (IEEE, Piscataway, 2017), pp. 2980–2988. https://doi.org/10.1109/ICCV.2017.324.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., in Advances in Neural Information Processing Systems. PyTorch: an imperative style, high-performance deep learning library (Curran Associates, Red Hook, 2019), pp. 8026–8037.