Speech signal modeling using multivariate distributions
 Ali Aroudi^{1},
 Hadi Veisi^{2},
 Hossein Sameti^{3} and
 Zahra Mafakheri^{4}
https://doi.org/10.1186/s13636-015-0078-1
© Aroudi et al. 2015
Received: 14 August 2014
Accepted: 29 November 2015
Published: 30 December 2015
Abstract
Using a proper distribution function for the speech signal or for its representations is of crucial importance in statistical-based speech processing algorithms. Although the most commonly used probability density function (pdf) for speech signals is Gaussian, recent studies have shown the superiority of super-Gaussian pdfs. A large research effort has focused on the univariate case of speech signal distribution; in this paper, however, we study the multivariate distributions of the speech signal and its representations using the conventional distribution functions, e.g., multivariate Gaussian and multivariate Laplace, and the copula-based multivariate distributions as candidates. The copula-based technique is a powerful method for modeling non-Gaussian multivariate distributions with nonlinear inter-dimensional dependency. The level of similarity between the candidate pdfs and the real speech pdf in different domains is evaluated using the energy goodness-of-fit test.
In our evaluations, the best-fitted distributions for speech signal vectors with different lengths in various domains are determined. A similar experiment is performed for different classes of English phonemes (fricatives, nasals, stops, vowels, and semivowels/glides). The evaluation results demonstrate that the multivariate distribution of speech signals in different domains is mostly super-Gaussian, except for Mel-frequency cepstral coefficients. Also, the results confirm that the distribution of the different phoneme classes is better statistically modeled by a mixture of Gaussian and Laplace pdfs. The copula-based distributions provide better statistical modeling of vectors representing the discrete Fourier transform (DFT) amplitude of speech vectors with a length shorter than 500 ms.
1 Introduction
Statistical-based speech processing algorithms have attracted wide interest during the last three decades in numerous applications, e.g., speech coding [1], speech recognition [2, 3], speech synthesis [4], and speech enhancement [5]. In all statistical-based speech processing algorithms, a probability density function (pdf) is assumed for the signal or its representation. Therefore, it is not surprising that proper selection of the pdf has been one of the challenges persistently addressed in this area [6–8].
Table 1 Proposed super-Gaussian univariate distributions of speech signal in different domains
All aforementioned publications have aimed to model the univariate pdf of speech signals for algorithms that use a univariate pdf. However, many statistical-based algorithms take advantage of the multivariate distribution of speech signals, and therefore, studying the multivariate distribution of speech to find a more suitable pdf is a key issue for those speech processing algorithms too. Studying and modeling speech signals in the multivariate case typically raises several challenges, e.g., the nonlinear or linear inter-dimensional dependency, and the sparsity and complexity of the multidimensional space. These issues may explain why research during the last two decades has mostly focused on the univariate distribution, while little progress has been made in the multivariate distribution study of speech signals. The earlier studies on the multivariate distribution of speech signals, performed by Brehm et al. [19] and LeBlancin et al. [20], suggested the multivariate Gaussian pdf for speech frames with a length of 5 ms. As the frame length and the process domain may change the distribution [14], the multivariate Gaussian pdf may not be an appropriate choice for algorithms that use a frame length other than 5 ms, e.g., 10 to 35 ms, or a process domain other than the time domain, e.g., DFT or DCT. In recent studies, Gazor et al. [14] and Jensen et al. [21] used the moment test and showed that the multivariate Laplace distribution models the speech signal or its representations better than the multivariate Gaussian distribution. However, in these studies, the moment test, used as a goodness-of-fit (GOF) test, was applied to each dimension individually, and the possible contribution of inter-dimensional dependency to the multivariate distribution was not considered.
In this paper, we investigate the multivariate distribution of the speech signal in the time and transformed domains. We consider new plausible distribution candidates to tackle the challenges of multivariate distribution modeling. Among the candidates, copula-based distributions are also proposed, which are able to model high-dimensional non-Gaussian distributions with nonlinear inter-dimensional dependency [22]. Copula-based distributions have been popular over the last decade in statistical fields, e.g., climate research, econometrics, risk management [22, 23], and finance [24, 25]. The other possible pdf candidates for speech, including the multivariate Gaussian, the multivariate Laplace, the mixture of Gaussian, and the mixture of Laplace distributions, are investigated in this paper too. We employ the GOF test [15, 16, 26] to evaluate the degree of similarity between the candidate distribution and the real speech signal distribution. The GOF test is a tractable three-step approach to investigating the distribution of data. In the first step, a number of candidates are assumed as the pdf of the real data. Next, an estimator, e.g., maximum likelihood (ML), is used to fit the candidates to the real data, and finally, the GOF test is performed to quantify the level of similarity between the fitted candidates and the real data. Although a large number of GOF tests have been proposed, the most appropriate GOF test is the one that best covers the conditions of the underlying problem, which in our case study is the high dimensionality of the space. We briefly present a number of GOF tests with a summary of their strengths and deficiencies, and finally choose the one that has been reported as the most appropriate for high-dimensional spaces.
In general, speech processing algorithms using a multivariate distribution exploit different feature types to process speech signals. For instance, traditional hidden Markov model (HMM)-based speech recognition and synthesis algorithms [3, 27] exploit Mel-frequency cepstral coefficients (MFCC); HMM-based speaker recognition systems [13] exploit either linear predictive coding (LPC) or MFCC; HMM-based speech enhancement algorithms use LPC, time, DCT, MFCC, or DFT [7, 9, 10]; and codebook-driven speech enhancement algorithms [28] employ LPC. However, all these algorithms assume the multivariate Gaussian pdf for the extracted features of speech signals. As the feature type may influence the distribution [14], the multivariate distribution of the different feature types, including DFT, DCT, time, LPC, and MFCC, is studied in this paper. A number of speech processing algorithms, e.g., those proposed by Martin [6] and Shin et al. [7], model the real and imaginary parts of the DFT separately. Thus, we study the real and imaginary parts of DFT features separately. The whole study of multivariate distribution in this paper concentrates on clean speech signals.
The remainder of this paper is structured as follows. In Section 2, the copula-based distributions are presented, including their formulations and parameter estimation. In Section 3, the GOF tests are briefly reviewed and, among them, the energy test is selected as the most appropriate one for the multivariate distribution study of high-dimensional spaces. Section 4 elaborates on the candidates' formulations, their parameter estimation, and an algorithm for exploring the best-fitted candidate. Section 5 presents the evaluation setup and experimental results. Finally, Section 6 concludes the work.
2 Copula-based distribution
A copula is a multivariate probability distribution whose marginal distributions are all uniform; it is used to describe the dependency between random variables [22, 29–33]. As every multivariate joint distribution can be written in terms of a copula and univariate marginal pdfs [29], copulas are a popular statistical tool for modeling multivariate distributions. In this regard, copulas make it easy to model the distribution of multivariate random variables by estimating only the marginal pdfs and the copula. A copula-based distribution can capture important characteristics of a vector, e.g., the appropriate pdf for the margins and the appropriate correlation structure, with a possibly simple form.
The purpose of this section is to briefly review the basic definition of the copula and a number of the most commonly used estimation methods for fitting the copula to the real data.
2.1 Copula model
The two most commonly used parametric forms for the copula density \( c_{\boldsymbol{X}}(\cdot) \) are elliptical and Archimedean [22, 30]. The Archimedean-based copulas are mostly used in the bivariate form and are not practically usable for high-dimensional spaces due to their high computational cost [22, 34]. In contrast, the elliptical-based copulas, including the Gaussian and Student-t copulas, can be used for spaces with any number of dimensions [22]. We therefore briefly review the Gaussian and Student-t copulas in the following sections.
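As a reminder of what the copula density refers to, the standard decomposition from Sklar's theorem can be written as follows. The notation is generic and is an assumption here; it may differ in detail from the equations of the original article. \( F_{X_j} \) and \( f_{X_j} \) denote the marginal CDF and pdf of dimension j.

```latex
f_{\boldsymbol{X}}(\boldsymbol{x})
  = c_{\boldsymbol{X}}\bigl(F_{X_1}(x_1), \dots, F_{X_d}(x_d)\bigr)
    \prod_{j=1}^{d} f_{X_j}(x_j)
```

In this form, a copula density equal to 1 everywhere reduces the joint pdf to the product of the margins, i.e., mutually independent dimensions.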
2.2 Gaussian copulas
2.3 Student-t copulas
2.4 Fit a copula model
The ML approach estimates the parameters of marginal pdfs and copula density function jointly using numerical optimization [22]. This is the only way to estimate all the parameters consistently [22].

The ML method is used only for spaces with a small number of dimensions because of its numerical complexity.

The inference functions for margins (IFM) method yields a suboptimal solution for parameter estimation since the log-likelihood function is maximized in two separate steps [31].

The IFM and canonical maximum likelihood (CML) methods result in closed-form formulas only for the Gaussian copula case [31].

When one of the off-diagonal components of the covariance matrix Σ _{Copula} of either the Student-t or the Gaussian copula equals 1 or −1, the estimation of the copula parameters using the CML method may fail [39]. This is due to the Cholesky decomposition performed in the CML method.
3 Goodness-of-fit test
A large number of goodness-of-fit (GOF) tests have been proposed, each suited to particular conditions of the case study. In our study, high dimensionality and possible nonlinear inter-dimensional dependency are the most crucial issues. Although various GOF tests have been proposed for one-dimensional spaces, only some of them can be extended to high-dimensional spaces. As the number of space dimensions increases, the tests become inefficient [26]. For instance, the extension of the χ^{2} test [15] to higher dimensions suffers from the curse of dimensionality [40] caused by the sparsity of the space unless the sample sizes are large enough.
There are several GOF tests proposed specifically for the multivariate case, e.g., the nearest neighbor test, which exploits the nearest neighbors [41], the Mardia test, which exploits the skewness and kurtosis [42], and the Friedman–Rafsky test, which exploits the minimum spanning tree [43]. More recently, the energy test has been proposed by Zech and Aslan [44]. The energy test has been shown to outperform the Mardia, nearest neighbor, χ^{2}, and Friedman–Rafsky tests. Accordingly, the energy test is selected as the most appropriate GOF test for the conditions of this study and is used to evaluate the candidates. In the following, the energy test is discussed.
3.1 Energy test
1. The real dataset is segmented, resulting in N vectors each of length d, \( {\boldsymbol{x}}_{i=1}^N=\left\{{\boldsymbol{x}}^1, \dots,\;{\boldsymbol{x}}^i, \dots,\;{\boldsymbol{x}}^N\right\} \). Depending on the process domain, \( {\boldsymbol{x}}^i \) represents the segmented real data in that process domain, e.g., time, DFT, and DCT.
2. A possible candidate pdf \( {f}_{{\boldsymbol{X}}_0}\left(\boldsymbol{x}\right) \), e.g., either a copula-based or a conventional distribution, is hypothesized and fitted to the real data vectors \( {\boldsymbol{x}}_{i=1}^N \).
3. A number M of simulated data vectors \( {\boldsymbol{q}}_{j=1}^M \) following the fitted pdf \( {f}_{{\boldsymbol{X}}_0}\left(\boldsymbol{x}\right) \) is generated using Monte Carlo simulation.
4. The energy test statistic is computed using Eq. (14) to determine the level of similarity between the distributions of the real data vectors \( {\boldsymbol{x}}_{i=1}^N \) and the simulated data vectors \( {\boldsymbol{q}}_{j=1}^M \).
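The four steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: Eq. (14) is not reproduced in this text, so the sketch uses the Euclidean-distance form of the energy distance, whereas the statistic of the original article may weight distances differently (e.g., logarithmically). The function names and the toy Gaussian data are hypothetical stand-ins for real speech vectors and fitted candidates.

```python
import math
import random


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (one vector from a, one from b)."""
    total = 0.0
    for u in a:
        for v in b:
            total += math.dist(u, v)
    return total / (len(a) * len(b))


def energy_statistic(real, simulated):
    """Energy distance in V-statistic form: 2*E|X-Q| - E|X-X'| - E|Q-Q'|.

    Assumption: Euclidean distance; the article's Eq. (14) may differ.
    """
    return (2.0 * mean_pairwise_distance(real, simulated)
            - mean_pairwise_distance(real, real)
            - mean_pairwise_distance(simulated, simulated))


def gaussian_vectors(n, d, mean, rng):
    """Toy stand-in for data vectors: n vectors of length d, i.i.d. N(mean, 1) entries."""
    return [[rng.gauss(mean, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
real = gaussian_vectors(200, 4, 0.0, rng)       # plays the role of x^1 .. x^N
good_fit = gaussian_vectors(200, 4, 0.0, rng)   # simulated from a matching pdf, M = N
bad_fit = gaussian_vectors(200, 4, 1.5, rng)    # simulated from a mismatched pdf

stat_good = energy_statistic(real, good_fit)
stat_bad = energy_statistic(real, bad_fit)
print(f"matching candidate:   {stat_good:.4f}")
print(f"mismatched candidate: {stat_bad:.4f}")
```

A smaller value indicates that the simulated vectors are closer in distribution to the real ones, which is how the best-fitted and second best-fitted candidates are ranked in the evaluation tables.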
4 Multivariate distribution candidates
1. Copula-based Laplace distribution (CLD)
2. Copula-based Laplace distribution with mutually independent dimensions (CLID), i.e., \( c_{\boldsymbol{X}}(\cdot) = 1 \)
3. Copula-based generalized extreme value distribution (CGevD)
4. Copula-based Rayleigh distribution (CRD)
5. Copula-based Gamma distribution (CGD)
1. A marginal distribution, e.g., one of Eqs. (15)–(18), is assumed for \( {f}_{X_j}\left({x}_j\right),\kern0.24em j=1:d \), and its parameter set is estimated accordingly.
2. The corresponding CDF of \( {f}_{X_j}\left({x}_j\right) \) is derived to compute \( {u}_j={F}_{X_j}\left({x}_j\right),\kern0.24em j=1:d \) and \( {y}_j={F}_{N\left(0,1\right)}^{-1}\left({u}_j\right) \), as shown in Eq. (4).
3. The parameter of the Gaussian copula density, α _{Copula} = Σ _{Copula}, is estimated using Eq. (13).
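The three estimation steps above can be illustrated with a short standard-library Python sketch for Laplace margins (the CLD case). Several points are assumptions rather than the article's exact procedure: Eqs. (4), (13), and (15)–(18) are not reproduced in this text, so the Laplace margin is fitted with its usual ML estimates (median and mean absolute deviation), and Σ _{Copula} is taken as the sample correlation matrix of the normal scores y_j. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, median

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def fit_laplace(values):
    """ML fit of a univariate Laplace pdf: location = median, scale = mean |x - median|."""
    mu = median(values)
    b = sum(abs(v - mu) for v in values) / len(values)
    return mu, b


def laplace_cdf(x, mu, b):
    z = (x - mu) / b
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)


def laplace_inv_cdf(u, mu, b):
    return mu + b * math.log(2.0 * u) if u < 0.5 else mu - b * math.log(2.0 * (1.0 - u))


def normal_scores(column, eps=1e-9):
    """Steps 1-2: fit the margin, then map x_j -> u_j = F(x_j) -> y_j = Phi^{-1}(u_j)."""
    mu, b = fit_laplace(column)
    scores = []
    for x in column:
        u = min(max(laplace_cdf(x, mu, b), eps), 1.0 - eps)  # keep u strictly inside (0, 1)
        scores.append(STD_NORMAL.inv_cdf(u))
    return scores


def correlation_matrix(columns):
    """Sample correlation matrix of the given columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        m = sum(col) / n
        centered.append([v - m for v in col])
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    d = len(columns)
    return [[sum(p * q for p, q in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(d)] for i in range(d)]


def fit_gaussian_copula(vectors):
    """Step 3 (assumed form): Sigma_Copula as the correlation matrix of the normal scores."""
    columns = [list(col) for col in zip(*vectors)]
    return correlation_matrix([normal_scores(col) for col in columns])


# Toy data: two Laplace-distributed dimensions coupled through a Gaussian copula, rho = 0.8.
rng = random.Random(1)
vectors = []
for _ in range(2000):
    z1 = rng.gauss(0.0, 1.0)
    z2 = 0.8 * z1 + 0.6 * rng.gauss(0.0, 1.0)
    vectors.append([laplace_inv_cdf(STD_NORMAL.cdf(z), 0.0, 1.0) for z in (z1, z2)])

sigma_copula = fit_gaussian_copula(vectors)
print([[round(v, 2) for v in row] for row in sigma_copula])
```

Because the margins are fitted first and the copula parameter second, this follows the two-step IFM idea, which is simple but, as noted in Section 2.4, generally suboptimal compared with joint ML estimation.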
Table 2 Multivariate distribution candidates considered for experimental setup
PDF class  Candidate  Description
Copula-based PDF  CLD  Copula-based distribution with marginal Laplace distribution
Copula-based PDF  CLID  Copula-based distribution with mutually independent marginal Laplace distribution
Copula-based PDF  CGevD  Copula-based distribution with marginal GEV distribution
Copula-based PDF  CRD  Copula-based distribution with marginal Rayleigh distribution
Copula-based PDF  CGD  Copula-based distribution with marginal Gamma distribution
Conventional PDF  MGD  Multivariate Gaussian distribution
Conventional PDF  MLD  Multivariate Laplace distribution
Conventional PDF  MGLD, p = 0.25  Multivariate Gaussian–Laplace distribution
Conventional PDF  MGLD, p = 0.50  Multivariate Gaussian–Laplace distribution
Conventional PDF  MGLD, p = 0.75  Multivariate Gaussian–Laplace distribution
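To give an intuition for the mixing parameter p of the MGLD candidates, the following sketch evaluates a one-dimensional Gaussian–Laplace mixture. It rests on two assumptions that are not confirmed by this text: that p is the weight of the Laplace component (suggested by the results, where p = 0.75 appears alongside MLD and p = 0.25 alongside MGD), and that both components share the same mean and variance. The article's multivariate formulation is not reproduced here and may differ.

```python
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def laplace_pdf(x, mu=0.0, b=1.0):
    return math.exp(-abs(x - mu) / b) / (2.0 * b)


def gaussian_laplace_mixture_pdf(x, p, mu=0.0, sigma=1.0):
    """Assumed form: p * Laplace + (1 - p) * Gaussian, with matched mean and variance."""
    b = sigma / math.sqrt(2.0)  # Laplace scale giving the same variance: 2 * b^2 = sigma^2
    return p * laplace_pdf(x, mu, b) + (1.0 - p) * gaussian_pdf(x, mu, sigma)


# The peak at the mean grows with p, i.e., the mixture becomes more super-Gaussian.
for p in (0.25, 0.50, 0.75):
    print(p, round(gaussian_laplace_mixture_pdf(0.0, p), 4))
```

Under these assumptions, p = 0 recovers the Gaussian pdf and p = 1 the Laplace pdf, so the three tabulated values of p interpolate between the MGD and MLD candidates.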
5 Evaluation results
Table 3 The data set of the second setup of evaluations (N is the number of frames for each phoneme class at the given frame length)
File  Phoneme class  # of phonemes  Duration (s)  N (20 ms)  N (30 ms)  N (100 ms)  N (500 ms)
1  Semivowel/glide  460  29.77  1489  993  298  60
2  Vowel  1240  125.39  6270  4180  1254  251
3  Nasal  298  18.57  929  620  186  38
4  Fricative  482  46.16  2309  1539  462  93
5  Stop  1109  56.13  2870  1872  562  113
The experimental evaluations were performed for time features (T), amplitude of DFT (ADFT), real parts of DFT (RDFT), imaginary parts of DFT (IDFT), DCT, LPC, and MFCC features. For LPC and MFCC, 10 and 12 coefficients were extracted from each frame, respectively. The MFCC vectors were extracted from 23 Mel-frequency filter banks. To set up the energy test, the value of M was set equal to N. All the reported energy test values were computed at a significance level of 0.01 using a bootstrap method [44, 52].
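As an illustration of how the T, ADFT, RDFT, and IDFT feature vectors relate to one another, the sketch below splits a signal into frames and derives the four representations from each frame. It is a simplified example under assumptions: frames are non-overlapping and unwindowed, the DFT is computed directly rather than with an FFT, and the toy cosine signal stands in for speech. The function names are hypothetical, and "IDFT" denotes the imaginary part of the DFT, as in the text.

```python
import cmath
import math


def split_into_frames(signal, frame_len):
    """Non-overlapping frames of frame_len samples (any incomplete tail is dropped)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


def dft(frame):
    """Direct O(n^2) discrete Fourier transform; adequate for a short illustrative frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def frame_features(frame):
    """Time samples plus amplitude, real part, and imaginary part of the DFT."""
    spectrum = dft(frame)
    return {
        "T": list(frame),
        "ADFT": [abs(c) for c in spectrum],
        "RDFT": [c.real for c in spectrum],
        "IDFT": [c.imag for c in spectrum],
    }


# Toy signal: a cosine with exactly two cycles per 16-sample frame.
signal = [math.cos(2.0 * math.pi * 2 * t / 16) for t in range(64)]
frames = split_into_frames(signal, 16)
features = frame_features(frames[0])
print(len(frames), [round(a, 3) for a in features["ADFT"][:4]])
```

Each dictionary entry is one d-dimensional vector of the kind that is collected over all frames and passed to the candidate fitting and energy test steps.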

The best-fitted candidate in the sense of the energy test for the T, RDFT, IDFT, and DCT features with frame lengths of 20, 30, and 100 ms is MLD. This contrasts with the multivariate Gaussian distribution often assumed in speech enhancement algorithms [8–10] but is consistent with the univariate Laplace distribution proposed by Martin [6] and Gazor et al. [14].

The univariate Rayleigh distribution has been proposed for the ADFT feature with a short frame length. It was therefore expected that the multivariate (copula-based) Rayleigh distribution (CRD) would be superior in modeling the multivariate distribution of ADFT; however, the energy test results identified CLD as the best-fitted candidate for frame lengths shorter than 500 ms.

Regarding the statistical modeling of ADFT features with short frame lengths, although CLD and CLID are both Laplace-based distributions, CLD was the best-fitted candidate. As the copula density function \( c_{\boldsymbol{X}}(\cdot) \), which models inter-dimensional dependency, is non-unit for CLD and unit for CLID, the superiority of CLD over CLID shows how modeling the inter-dimensional dependency contributes to proper multivariate statistical modeling.

Increasing the frame length to 500 ms shifted the best-fitted candidate for the ADFT, RDFT, IDFT, and DCT features from either CLD or MLD toward MGLD. This finding suggests that the Gaussian distribution contributes to the actual multivariate distribution in those domains when the frame length is sufficiently long, which is also supported by the central limit theorem. Similarly, the shift of the best-fitted distribution for LPC features from MGLD (with p = 0.25) to MGD supports this contribution.

The best-fitted candidate for the MFCC with different frame lengths is MGD, consistent with the assumption of multivariate Gaussian distribution used in most speech recognition algorithms [2, 3].
Table 4 Best-fitted multivariate distribution in the sense of energy test for different features and frame lengths of speech signals
Feature (domain)  20 ms  30 ms  100 ms  500 ms
ADFT  CLD  CLD  CLD  MGLD, p = 0.50
LPC  MGLD, p = 0.25  MGLD, p = 0.25  MGLD, p = 0.25  MGD
T  MLD  MLD  MLD  MLD
MFCC  MGD  MGD  MGD  MGD
RDFT  MLD  MLD  MLD  MGLD, p = 0.75
IDFT  MLD  MLD  MLD  MGLD, p = 0.75
DCT  MLD  MLD  MLD  MGLD, p = 0.25
According to these results, frames of speech signals represented by T, RDFT, IDFT, or DCT features, which are often used in statistical model-based speech enhancement algorithms [9, 10, 28], can be better statistically modeled by MLD than by MGD. Furthermore, if the statistical-based algorithm exploits MFCC or ADFT features, the energy test proposes MGD and CLD, respectively.
Tables 5, 6, 7, 8, 9, 10, and 11 present the evaluation results of the energy test for the second experimental setup. In each table, there are 20 blocks, one for each phoneme class at a given frame length. Each block contains four cells, as shown by Fig. 3. The A and B cells show the first and the second best-fitted candidates, respectively, and the C and D cells give the energy test values corresponding to the first and the second best-fitted candidates, respectively.
Table 5 Best-fitted multivariate distribution based on the energy test for ADFT coefficients of five phoneme classes in different frame lengths
Phoneme class  Entry  20 ms (first)  20 ms (second)  30 ms (first)  30 ms (second)  100 ms (first)  100 ms (second)  500 ms (first)  500 ms (second)
Semivowel/glide  Candidate  MGLD, p = 0.75  MGLD, p = 0.50  MGLD, p = 0.75  MGLD, p = 0.50  MGLD, p = 0.50  MGLD, p = 0.25  CGevD  CGD
Semivowel/glide  Energy test value  0.05  0.06  0.03  0.04  0.02  0.02  0.00  0.01
Vowel  Candidate  MGLD, p = 0.75  MGLD, p = 0.50  MGLD, p = 0.75  MGLD, p = 0.50  CGevD  MGLD, p = 0.75  MGLD, p = 0.50  MGLD, p = 0.25
Vowel  Energy test value  0.04  0.05  0.04  0.05  0.03  0.04  0.02  0.02
Nasal  Candidate  MGLD, p = 0.50  CGevD  MGLD, p = 0.50  MGLD, p = 0.75  MGLD, p = 0.25  MGD  MGLD, p = 0.25  MGLD, p = 0.50
Nasal  Energy test value  0.04  0.04  0.02  0.02  0.03  0.03  0.00  0.00
Fricative  Candidate  MGLD, p = 0.75  CGD  MGLD, p = 0.50  MGLD, p = 0.25  MGLD, p = 0.50  MGLD, p = 0.25  MGD  MGLD, p = 0.25
Fricative  Energy test value  0.04  0.04  0.02  0.02  0.03  0.03  0.00  0.00
Stop  Candidate  CGD  CGevD  CGD  CGevD  MGLD, p = 0.50  MGLD, p = 0.75  MGLD, p = 0.25  MGLD, p = 0.50
Stop  Energy test value  0.06  0.08  0.07  0.09  0.03  0.03  0.00  0.00
Table 6 Best-fitted multivariate distribution based on the energy test for LPC coefficients of five phoneme classes in different frame lengths
Phoneme class  Entry  20 ms (first)  20 ms (second)  30 ms (first)  30 ms (second)  100 ms (first)  100 ms (second)  500 ms (first)  500 ms (second)
Semivowel/glide  Candidate  MGLD, p = 0.50  MGLD, p = 0.25  MGLD, p = 0.25  MGD  MGLD, p = 0.25  MGLD, p = 0.50  MGD  MGLD, p = 0.25
Semivowel/glide  Energy test value  0.07  0.08  0.00  0.00  0.00  0.00  0.02  0.02
Vowel  Candidate  MGLD, p = 0.25  MGLD, p = 0.50  MGLD, p = 0.25  MGLD, p = 0.50  MGLD, p = 0.25  MGD  MGLD, p = 0.25  MGD
Vowel  Energy test value  0.01  0.01  0.01  0.01  0.01  0.01  0.00  0.00
Nasal  Candidate  MGLD, p = 0.25  MGD  MGLD, p = 0.25  MGD  MGD  MGLD, p = 0.25  MGD  MGLD, p = 0.25
Nasal  Energy test value  0.00  0.00  0.00  0.00  0.00  0.01  0.00  0.00
Fricative  Candidate  MGLD, p = 0.25  MGD  MGLD, p = 0.25  MGD  MGLD, p = 0.25  MGLD, p = 0.50  MGD  CLD
Fricative  Energy test value  0.00  0.00  0.00  0.00  0.00  0.00  0.02  0.03
Stop  Candidate  MGLD, p = 0.25  MGLD, p = 0.25  MGLD, p = 0.25  MGLD, p = 0.50  MGLD, p = 0.25  MGD  MGD  MGLD, p = 0.25
Stop  Energy test value  0.02  0.02  0.01  0.01  0.00  0.00  0.00  0.00
Table 7 Best-fitted multivariate distribution based on the energy test for time coefficients of five phoneme classes in different frame lengths
Phoneme class  Entry  20 ms (first)  20 ms (second)  30 ms (first)  30 ms (second)  100 ms (first)  100 ms (second)  500 ms (first)  500 ms (second)
Semivowel/glide  Candidate  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MGLD, p = 0.25  MGLD, p = 0.50
Semivowel/glide  Energy test value  0.01  0.02  0.01  0.02  0.00  0.01  0.00  0.00
Vowel  Candidate  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MGLD, p = 0.75  MGLD, p = 0.50
Vowel  Energy test value  0.01  0.02  0.00  0.02  0.00  0.01  0.00  0.01
Nasal  Candidate  MGLD, p = 0.75  MGLD, p = 0.50  MGLD, p = 0.75  MGLD, p = 0.50  MGLD, p = 0.50  MGLD, p = 0.50  MLD  MGLD, p = 0.25
Nasal  Energy test value  0.00  0.01  0.01  0.02  0.00  0.00  0.00  0.01
Fricative  Candidate  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MGLD, p = 0.25  MGLD, p = 0.50
Fricative  Energy test value  0.00  0.01  0.00  0.01  0.00  0.00  0.00  0.00
Stop  Candidate  MLD  CLD  MLD  CLD  MGLD, p = 0.75  MLD  MLD  MGLD, p = 0.25
Stop  Energy test value  0.08  0.12  0.05  0.08  0.01  0.01  0.00  0.00
Table 8 Best-fitted multivariate distribution based on the energy test for MFCC coefficients of five phoneme classes in different frame lengths
Phoneme class  Entry  20 ms (first)  20 ms (second)  30 ms (first)  30 ms (second)  100 ms (first)  100 ms (second)  500 ms (first)  500 ms (second)
Semivowel/glide  Candidate  MGD  MGLD, p = 0.25  MGD  MGLD, p = 0.25  MGD  MGLD, p = 0.25  MGD  MGLD, p = 0.25
Semivowel/glide  Energy test value  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Vowel  Candidate  MGD  MGLD, p = 0.25  MGD  MGLD, p = 0.25  MGD  MGLD, p = 0.25  MGD  MGLD, p = 0.25
Vowel  Energy test value  0.00  0.01  0.00  0.01  0.01  0.01  0.00  0.00
Nasal  Candidate  MGD  MGLD, p = 0.25  MGD  MGLD, p = 0.25  MGD  MGLD, p = 0.25  MGLD, p = 0.75  MGLD, p = 0.50
Nasal  Energy test value  0.00  0.01  0.01  0.01  0.00  0.00  0.00  0.01
Fricative  Candidate  MGD  MGLD, p = 0.25  MGD  MGLD, p = 0.25  MGD  MGLD, p = 0.25  CLD  CLID
Fricative  Energy test value  0.00  0.01  0.00  0.00  0.00  0.00  0.03  0.03
Stop  Candidate  MGD  MGLD, p = 0.25  MGD  MGLD, p = 0.25  MGLD, p = 0.25  MGD  CLD  MGD
Stop  Energy test value  0.00  0.00  0.00  0.00  0.00  0.01  0.00  0.01
Table 9 Best-fitted multivariate distribution based on the energy test for RDFT coefficients of five phoneme classes in different frame lengths
Phoneme class  Entry  20 ms (first)  20 ms (second)  30 ms (first)  30 ms (second)  100 ms (first)  100 ms (second)  500 ms (first)  500 ms (second)
Semivowel/glide  Candidate  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MGLD, p = 0.75  CLD
Semivowel/glide  Energy test value  0.00  0.01  0.00  0.01  0.00  0.01  0.00  0.00
Vowel  Candidate  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MGLD, p = 0.75  MLD
Vowel  Energy test value  0.00  0.01  0.00  0.01  0.00  0.00  0.00  0.00
Nasal  Candidate  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MGLD, p = 0.50  MGLD, p = 0.75  MGLD, p = 0.50  MGLD, p = 0.75  CGevD
Nasal  Energy test value  0.00  0.01  0.00  0.01  0.00  0.00  0.01  0.01
Fricative  Candidate  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MGLD, p = 0.50  MGLD, p = 0.50  MGD
Fricative  Energy test value  0.01  0.01  0.00  0.01  0.00  0.00  0.00  0.00
Stop  Candidate  MLD  CLID  MLD  CLD  MGLD, p = 0.75  MLD  MGLD, p = 0.25  MLD
Stop  Energy test value  0.06  0.10  0.04  0.08  0.01  0.01  0.00  0.00
Table 10 Best-fitted multivariate distribution based on the energy test for IDFT coefficients of five phoneme classes in different frame lengths
Phoneme class  Entry  20 ms (first)  20 ms (second)  30 ms (first)  30 ms (second)  100 ms (first)  100 ms (second)  500 ms (first)  500 ms (second)
Semivowel/glide  Candidate  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MLD  CGevD
Semivowel/glide  Energy test value  0.00  0.01  0.00  0.01  0.00  0.01  0.00  0.00
Vowel  Candidate  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MGLD, p = 0.75  MLD  MGLD, p = 0.75  CLD
Vowel  Energy test value  0.00  0.01  0.01  0.01  0.00  0.01  0.01  0.01
Nasal  Candidate  MGLD, p = 0.75  MGLD, p = 0.50  MGLD, p = 0.75  MLD  MGLD, p = 0.75  CLD  MLD  CLD
Nasal  Energy test value  0.00  0.01  0.00  0.01  0.00  0.00  0.00  0.01
Fricative  Candidate  MGLD, p = 0.75  MLD  MLD  MGLD, p = 0.75  MGLD, p = 0.50  CLID  MGLD, p = 0.75  CLD
Fricative  Energy test value  0.01  0.01  0.00  0.00  0.00  0.01  0.00  0.00
Stop  Candidate  MLD  CLD  MLD  CLD  MGLD, p = 0.75  MLD  MGLD, p = 0.50  CLD
Stop  Energy test value  0.08  0.12  0.04  0.08  0.01  0.01  0.00  0.01
Table 11 Best-fitted multivariate distribution based on the energy test for DCT coefficients of five phoneme classes in different frame lengths
Phoneme class  Entry  20 ms (first)  20 ms (second)  30 ms (first)  30 ms (second)  100 ms (first)  100 ms (second)  500 ms (first)  500 ms (second)
Semivowel/glide  Candidate  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MLD
Semivowel/glide  Energy test value  0.00  0.01  0.00  0.01  0.00  0.00  0.00  0.00
Vowel  Candidate  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MLD
Vowel  Energy test value  0.00  0.01  0.00  0.01  0.01  0.01  0.01  0.01
Nasal  Candidate  MGLD, p = 0.75  MGLD, p = 0.50  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MGLD, p = 0.50  MGD  MGLD, p = 0.50
Nasal  Energy test value  0.00  0.01  0.00  0.01  0.00  0.01  0.00  0.02
Fricative  Candidate  MGLD, p = 0.75  MLD  MGLD, p = 0.75  MLD  MGLD, p = 0.50  MGLD, p = 0.75  MGLD, p = 0.75  MGLD, p = 0.50
Fricative  Energy test value  0.01  0.01  0.00  0.01  0.00  0.01  0.00  0.00
Stop  Candidate  MLD  CLD  MLD  CLD  MLD  MGLD, p = 0.75  MGLD, p = 0.75  MGLD, p = 0.50
Stop  Energy test value  0.05  0.10  0.04  0.08  0.01  0.01  0.00  0.00

The univariate Rayleigh distribution has been proposed for the statistical univariate modeling of the ADFT feature. It was therefore expected that the multivariate (copula-based) Rayleigh distribution (CRD) would also be superior in modeling the multivariate distribution of ADFT; however, the evaluation results identified MGLD, CGD, or CGevD as the best-fitted candidates.

The best-fitted candidate in the sense of the energy test for all phoneme classes in T, RDFT, IDFT, and DCT features with different frame lengths was either MLD or MGLD (with p ∈ {0.25, 0.50, 0.75}). In particular, for frame lengths of 20 and 30 ms, which are the most widely used in speech processing, either MLD or MGLD with p = 0.75 dominated. Consequently, the Laplace distribution contributes more than the Gaussian distribution to the statistical multivariate modeling of T, RDFT, IDFT, and DCT features with short frame lengths.

The best-fitted candidate for the different phoneme classes with the LPC feature was mostly MGD or MGLD with p = 0.25. Consequently, the Gaussian distribution contributes more than the Laplace distribution to the statistical multivariate modeling of the LPC feature.

As the first or second best-fitted candidates for different process domains of a phoneme class with a fixed frame length mostly varied between MLD and MGLD (with p ∈ {0.25, 0.50, 0.75}), the statistical modeling of phonemes with a mixture of Gaussian and Laplace distributions is proposed.

The best-fitted candidate for most phoneme classes with MFCC features extracted from frames shorter than 500 ms is MGD, consistent with the assumption of multivariate Gaussian distribution used in most speech recognition algorithms [2, 3].

The only copula-based distributions proposed by the energy test evaluation results were CGD for the statistical modeling of the stop phoneme class in the ADFT domain with frame lengths of 20 and 30 ms, and CGevD for semivowel/glide with a frame length of 500 ms.

Based on the evaluation results, in the sense of the energy test, the copula-based distributions fitted with the IFM method were mostly outperformed by conventional distributions in the second experimental setup. As only one parameter estimation method for copula-based distributions, the IFM method, was considered in the experimental evaluation, and the IFM method yields a suboptimal solution for parameter estimation, it is difficult to draw a general conclusion on the benefit of copula-based distributions in the statistical modeling of speech frames. One direction for future work might therefore be to study the statistical modeling power of copula-based distributions of speech frames using optimal parameter estimation methods.

In some cases, where the energy test values of the first and second best-fitted candidates are almost the same, there is almost no superiority in the sense of the energy test between the first and second best-fitted distributions, e.g., the case of the fricative phoneme class in the time domain with different frame lengths in Table 7.
6 Conclusions
In this paper, the multivariate distribution of speech features in various domains, e.g., time, DFT, DCT, MFCC, and LPC, was studied, and a framework was proposed for exploring the best-fitted distribution among different candidates. Ten plausible candidates were considered to explore the effect of feature type, phoneme class (for the English language), and frame length on the distribution: five conventional distributions, i.e., the multivariate Gaussian, the multivariate Laplace, and the mixture of Gaussian–Laplace distributions (in three forms), and five copula-based distributions with marginal Laplace, independent marginal Laplace, Rayleigh, Gamma, and generalized extreme value (GEV) distributions.
The evaluation results of the energy test showed that the multivariate Laplace distribution statistically models the time and DFT features of speech signals better than the multivariate Gaussian distribution. For the amplitude of DFT features, the copula-based distribution with marginal Laplace distribution was proposed as the best-fitted candidate. For the MFCC features, the best-fitted candidate was MGD, consistent with the assumption of multivariate Gaussian distribution used in most speech recognition algorithms. For the multivariate statistical modeling of different phoneme classes, the first or second best-fitted candidates for different domains (and also for different frame sizes) mostly varied between MLD and MGLD (with p ∈ {0.25, 0.50, 0.75}), i.e., a mixture of Gaussian and Laplace distributions. Future work can build on this study to develop statistical speech processing algorithms that exploit Laplace, mixture of Laplace and Gaussian, or copula-based multivariate distributions, depending on the feature type, phoneme class, and frame length.
Although the copula-based distribution was proposed as the best-fitted distribution for modeling the amplitude of DFT, this is not the case for other features. This means that the copula-based approach requires more investigation in a number of ways. First, the practical issues, e.g., the computational cost and the lack of a sufficient amount of data for parameter estimation of some phoneme classes, e.g., stops, need to be considered. Second, as the IFM method used for parameter estimation of the copula-based distribution yields a suboptimal estimate, an optimal parameter estimation method for large vector dimensions needs to be developed to allow a fair evaluation of the power of copula-based distributions in the statistical modeling of speech signals, e.g., compared to the optimal parameter estimation method used for MLD and MGD.
Declarations
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. DY Zhao, Model based speech enhancement and coding, Ph.D., Royal Institute of Technology, KTH (2007)
2. L Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257–286 (1989)
3. X Huang, A Acero, HW Hon, Spoken language processing: a guide to theory, algorithm, and system development, Prentice Hall PTR (2001)
4. H Zen, K Tokuda, AW Black, Statistical parametric speech synthesis. Speech Comm. 51, 1039–1064 (2009)
5. Z Xin, P Jancovic, L Ju, M Kokuer, Speech signal enhancement based on map algorithm in the ICA space. Signal Processing, IEEE Transactions on 56, 1812–1820 (2008)
6. R Martin, Speech enhancement based on minimum mean-square error estimation and super-Gaussian priors. Speech and Audio Processing, IEEE Transactions on 13, 845–856 (2005)
7. JW Shin, JH Chang, NS Kim, Statistical modeling of speech signals based on generalized Gamma distribution. Signal Processing Letters, IEEE 12(3), 258–261 (2005)
8. A Aroudi, H Veisi, H Sameti, Hidden Markov model-based speech enhancement using multivariate Laplace and Gaussian distributions. IET Signal Processing 9(2), 177–185 (2015). doi:10.1049/iet-spr.2014.0032
9. H Veisi, H Sameti, Speech enhancement using hidden Markov models in Mel-frequency domain. Speech Commun. 55, 205–220 (2013)
10. Y Ephraim, A Bayesian estimation approach for speech enhancement using hidden Markov models. Trans. Sig. Proc. 40, 725–735 (1992)
11. Y Ephraim, D Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. Acoustics, Speech and Signal Processing, IEEE Transactions on 32, 1109–1121 (1984)
12. R McAulay, M Malpass, Speech enhancement using a soft-decision noise suppression filter. Acoustics, Speech and Signal Processing, IEEE Transactions on 28, 137–145 (1980)
13. A Fazel, S Chakrabartty, An overview of statistical pattern recognition techniques for speaker verification. Circuits and Systems Magazine, IEEE 11, 62–81 (2011)
14. S Gazor, Z Wei, Speech probability distribution. Signal Processing Letters, IEEE 10, 204–207 (2003)
15. AO Allen, Probability, statistics, and queueing theory with computer science applications, Academic Press, Inc. (1978)
16. A Papoulis, Probability, random variables and stochastic processes, McGraw-Hill Companies (1991)
17. B Chen, PC Loizou, A Laplacian-based MMSE estimator for speech enhancement. Speech Commun. 49, 134–143 (2007)
18. JS Erkelens, J Jensen, R Heusdens, Speech enhancement based on Rayleigh mixture modeling of speech spectral amplitude distributions, European Signal Proc. Conf. EUSIPCO (2007)
 H Brehm, W Stammler, Description and generation of spherically invariant speechmodel signals. Signal Process. 12, 119–141 (1987)View ArticleGoogle Scholar
 JP LeBlanc, PL De Leon, Speech separation by kurtosis maximization, in Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, 1998Google Scholar
 J Jensen, I Batina, RC Hendriks, R Heusdens, A study of the distribution of timedomain speech samples and discrete fourier coefficients, Proc. IEEE First BENELUX/DSP Valley Signal Processing Symposium, 2005Google Scholar
 C Schölzel, P Friederichs, Multivariate nonnormally distributed random variables in climate research; introduction to the copula approach. Nonlin. Processes Geophys. 15, 761–772 (2008)View ArticleGoogle Scholar
 T Ané, C Kharoubi, Dependence structure and risk measure. J. Bus. 76, 411–438 (2003)View ArticleGoogle Scholar
 JC Rodriguez, Measuring financial contagion: a copula approach. Journal of Empirical Finance 14, 401–423 (2007)View ArticleGoogle Scholar
 C Genest, M Gendron, M BourdeauBrien, The advent of copulas in finance. European Journal of Finance 15, 609–618 (2009)View ArticleGoogle Scholar
 G Palombo, Multivariate goodness of fit procedures for unbinned data: an annotated bibliography, arXiv, 2011, 1102.2407v1.Google Scholar
 AW Black, H Zen, K Tokuda, Statistical parametric speech synthesis, in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference, 2007Google Scholar
 S Srinivasan, J Samuelsson, WB Kleijn, Codebookbased Bayesian speech enhancement for nonstationary environments. Audio, Speech, and Language Processing, IEEE Transactions on 15, 441–452 (2007)View ArticleGoogle Scholar
 A Sklar, Fonctions de Répartition à n Dimensions et Leurs Marges. Publications Inst. Statist. Univ. Paris 8, 229–231 (1959)MathSciNetGoogle Scholar
 RB Nelsen, An introduction to copulas (Springer, New York, 2006)MATHGoogle Scholar
 E Bouye, VD, A Nikeghbali, G Riboulet, and T Roncalli,, ‘Copulas for finance: a reading guide and some applications’, Working Paper. Groupe de Recherche Operationnelle, Credit Lyonnais, Available at http://ssrn.com/abstract=1032533, Nov. 2013.
 Joe, H, Multivariate Models and Dependence Concepts, Chapman and Hall, 1997Google Scholar
 W Hoeffding, Scale—invariant correlation theory, in The Collected Works of Wassily Hoeffding, ed. by NI Fisher, PK Sen (Springer, New York, 1994)Google Scholar
 P Embrechts, F Lindskog, and A McNeil, ‘Modelling Dependence with Copulas and Applications to Risk Management’, Handbook of Heavy Tailed Distributions in Finance, Rachev, S. (ed), Elsevier, 329–38 2001Google Scholar
 X Chen, Y Fan, Estimation of copulabased semiparametric time series models, Vanderbilt University Department of Economics, 2004.Google Scholar
 G Biau, M Wegkamp, A note on minimum distance estimation of copula densities. Statistics & Probability Letters 73, 105–114 (2005)MATHMathSciNetView ArticleGoogle Scholar
 BVM Mendes, EFL De Melo, RB Nelsen, Robust fits for copula models. Communications in Statistics: Simulation and Computation 36, 997–1017 (2007)MATHMathSciNetView ArticleGoogle Scholar
 Bowman, AW and Azzalini, A., Applied smoothing techniques for data analysis: the kernel approach with Splus illustrations, Clarendon Press, 1997Google Scholar
 J Yan, Enjoy the joy of copulas: with a package copula. J. Stat. Softw. 21, 1–21 (2007)View ArticleGoogle Scholar
 Bellman, R.E, Adaptive control processes: a guided tour, Princeton University Press, 1961Google Scholar
 PJ Clark, FC Evans, Generalization of a nearest neighbor measure of dispersion for use in K dimensions. Ecology 60, 316–317 (1979)View ArticleGoogle Scholar
 KV Mardia, Measures of multivariate skewness and kurtosis with applications. Biometrika 57, 519–530 (1970)MATHMathSciNetView ArticleGoogle Scholar
 JH Friedman, LC Rafsky, Multivariate generalizations of the WaldWolfowitz and Smirnov twosample tests. Ann. Stat. 7, 697–717 (1979)MATHMathSciNetView ArticleGoogle Scholar
 B Aslan, G Zech, Statistical energy as a tool for binningfree, multivariate goodnessoffit tests, twosample comparison and unfolding. Nuclear Instruments and Methods in Physics Research, Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 537, 626–636 (2005)View ArticleGoogle Scholar
 V Schmidt, E James, Gentle: random number generation and Monte Carlo methods. Metrika 64, 251–252 (2006)View ArticleGoogle Scholar
 G Muraleedharan, C.G.S.a.C.L., Characteristic and Moment Generating Functions of Generalized Extreme Value Distribution, Sea Level Rise, Coastal Engineering, Shorelines and Tides, Nova Science Publishers, 269–276 2011.Google Scholar
 Minka, TP, ‘Estimating a Gamma distribution’, http://research.microsoft.com/enus/um/people/minka/papers/minkagamma.pdf, Dec. 2012.
 Paul Embrechts, CK, Thomas Mikosch, Modelling extremal events: for insurance and finance, Springer, 1997Google Scholar
 JC Lagarias, JA Reeds, MH Wright, PE Wright, Convergence properties of the Nelder–Mead simplex method in low dimensions. SIAM J. on Optimization 9, 112–147 (1998)MATHMathSciNetView ArticleGoogle Scholar
 K Fragiadakis, SG Meintanis, Goodnessoffit tests for multivariate Laplace distributions. Math. Comput. Model. 53, 769–779 (2011)MATHMathSciNetView ArticleGoogle Scholar
 Garofolo, JS, Lamel, LF, Fisher, WM, Fiscus, JG, Pallett, DS, and Dahlgren, NL, Acoustic phonetic continuous speech corpus, NIST, 1993Google Scholar
 Efron, B. and Tibshirani, R.J., An introduction to the Bootstrap, CRC press, 1994Google Scholar
 B Aslan, and G Zech, A New Class of Binning Free, Multivariate GoodnessofFit Tests: The Energy Tests, arXiv preprint hepex/0203010, 2002Google Scholar