# Correlation analysis of the speech multiscale product for the open quotient estimation

- Wafa Saidi^{1},
- Aicha Bouzid^{1} and
- Noureddine Ellouze^{1}

**2011**:8

https://doi.org/10.1186/1687-4722-2011-8

© Saidi et al; licensee Springer. 2011

**Received:** 21 January 2011

**Accepted:** 10 November 2011

**Published:** 10 November 2011

## Abstract

This article proposes a multiscale product (MP)-based method for estimating the open quotient (OQ) from the speech waveform. The MP is obtained by computing the wavelet transform coefficients of the speech signal at three scales and multiplying them. The resulting MP signal shows negative peaks that mark glottal closure and positive peaks that mark glottal opening. Because the shape of the speech MP is close to that of the derivative of the electroglottographic (EGG) signal, we apply a correlation analysis to it to measure the fundamental frequency and the OQ. The approach is validated on the voiced parts of the Keele University database by calculating the absolute and relative errors between the OQ estimated from the speech signal and from the corresponding EGG signal. When the mean OQ over each voiced segment is considered, the OQ is estimated within an absolute error of 0.04 to 0.1 and a relative error of 8 to 21% for all speakers. The approach performs less well on frame-by-frame OQ measurements, where the absolute error reaches 0.12 and the relative error 30%.

## 1. Introduction

According to the source-filter theory of speech production [1], voiced speech is represented as the response of the vocal tract filter to the glottal voice source. The glottal source consists of quasi-periodic pulses created by the oscillations of the vocal folds. It is characterised by two crucial instants: the glottal closure instant (GCI) and the glottal opening instant (GOI). Accurate estimation of GCIs and GOIs is required in many speech applications, such as voice quality assessment [2], speech analysis and coding [3], speaker identification [4] and glottal source estimation [5].

A glottal source parameter closely related to the GCI and GOI is the open quotient (OQ). It is defined as the ratio between the glottal open phase duration and the speech period. The open phase is the portion of the glottal cycle during which the glottis is open, that is, the duration between one GOI and the following GCI. The speech period is the interval between two successive GCIs.
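As a small worked illustration of this definition, the sketch below computes the OQ of a single glottal cycle from three instants. The function name and the numerical values are hypothetical and are not taken from the article.

```python
def open_quotient(gci_prev, goi, gci_next):
    """OQ of one glottal cycle bounded by two successive GCIs.

    All instants are in seconds. The open phase runs from the GOI to the
    following GCI; the period runs from one GCI to the next.
    """
    if not gci_prev <= goi <= gci_next or gci_next == gci_prev:
        raise ValueError("expected gci_prev <= goi <= gci_next and a non-zero period")
    open_phase = gci_next - goi
    period = gci_next - gci_prev
    return open_phase / period


# Hypothetical cycle: 10 ms period (F0 = 100 Hz), glottis opening 4 ms after closure.
oq_example = open_quotient(gci_prev=0.000, goi=0.004, gci_next=0.010)
print(f"OQ = {oq_example:.2f}")
```

With these illustrative values the open phase lasts 6 ms out of a 10 ms period; a larger OQ would point towards a breathier voice and a smaller one towards a more pressed voice.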

OQ is of considerable interest as it has been reported to be related to voice qualities such as "breathy" and "pressed" voices [6, 7]. A breathy voice occurs when the vocal folds do not completely close during a glottal cycle, so the OQ is large. A pressed voice is produced with a constricted glottis and corresponds to a small OQ. Vocal quality is studied in more detail in [8].

In [9], the OQ changes with vocal registers were analysed using high-speed digital imaging and electroglottography (EGG). The work presented in [10] proposes the OQ measurements from the EGG signal and studies the relationship between the OQ and the perception of the speaker's age. The correlation between the OQ and the fundamental frequency has been studied for male and female speakers in [11, 12]. Henrich [13] provides an overview of the OQ variations with the vocal intensity and the fundamental frequency.

The EGG signal is the easiest way to measure the OQ, as it is a direct representation of the glottal activity. In this context, Henrich et al. [13–15] suggested a correlation-based method called DECOM for automatic measurement of the fundamental frequency (F0) and the OQ using the derivative of electroglottographic (DEGG) signals. Bouzid and Ellouze [16] used the multiscale product (MP) of the wavelet transform (WT) for detecting singularities in the speech signal caused by the opening and the closing of the vocal folds, but no quantitative results were given.

To estimate the OQ and other glottal parameters from the speech signal only, many approaches first estimate the glottal source signal. These methods are based on digital inverse filtering using linear prediction or vocal-tract deconvolution [17–19]. A recent study [20] uses the zeros of the *z*-transform with a general model of the glottal flow to compute the OQ and the asymmetry quotient on speech signals of various voice qualities.

In this article, we build on the approach presented in [14], where the OQ is estimated from the EGG signal using a correlation-based algorithm. Knowing that the speech MP provides a signal whose shape is very close to that of the DEGG signal, we apply the Henrich correlation approach to the speech MP instead of the EGG signal. We can therefore estimate the pitch period and the OQ from the speech signal over frames of a fixed length.

The rest of the article is organised as follows. Section 2 presents the MP analysis of the speech signal. Section 3 describes the proposed approach to estimate the OQ over a given frame. The method is divided into three stages. The first computes the speech MP as the product of the WT coefficients at three scales. The second windows the MP signal and splits it into positive and negative parts. The third computes the crosscorrelation function between the two parts to estimate the open phase duration, and the autocorrelation of the negative part to estimate the pitch period. Evaluation results are presented in Section 4. Conclusions are drawn in Section 5.

## 2. MP for speech analysis

WT is a multiscale analysis widely used in image and signal processing. Owing to their efficient time-frequency localisation and multiresolution characteristics, WTs are well suited to processing transient and non-stationary signals. Mallat and Zhong [21] have shown that multiscale edge detection is equivalent to finding the local maxima of a signal's wavelet representation. Several wavelet-based algorithms have been proposed to detect signal singularities [22–24]. GCIs and GOIs are such events in the speech signal. The peak marking the discontinuity in the WT is often corrupted by noise when the scale is too fine, or smoothed out when the scale is too large.

To improve edge detection using wavelet analysis, the MP method has been proposed. It consists of computing the product of the WT coefficients of the acoustic signal over three scales. It enhances the peak amplitude along the modulus maxima line and eliminates spurious peaks due to the vocal tract effect.

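The product *p*(*n*) of the WTs of the function *f*(*n*) at scales *s*_{j} is

$$p\left(n\right)=\prod_{j}{W}_{{s}_{j}}f\left(n\right)$$

where ${W}_{{s}_{j}}f\left(n\right)$ represents the WT of the function *f*(*n*) at scale *s*_{j}.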

The product *p*(*n*) shows peaks at signal edges and has relatively small values elsewhere. An odd number of terms in *p*(*n*) preserves the edge sign.
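The behaviour of the product can be illustrated with a minimal sketch. It assumes a first-derivative-of-Gaussian wavelet as a stand-in for the wavelet used in the article, and implements the WT at each scale as a plain convolution; the function names, the wavelet support and the test pulse are all hypothetical choices made for illustration only.

```python
import numpy as np


def dog_wavelet(scale, support=4.0):
    """First derivative of a Gaussian dilated by `scale` (stand-in wavelet, an assumption)."""
    half = int(np.ceil(support * scale))
    t = np.arange(-half, half + 1, dtype=float)
    psi = -t / scale**2 * np.exp(-t**2 / (2.0 * scale**2))
    return psi / np.sum(np.abs(psi))


def multiscale_product(f, scales=(2.0, 2.5, 3.0)):
    """p(n) = product over the scales s_j of W_{s_j} f(n).

    Assumes the signal is longer than the longest wavelet kernel.
    """
    f = np.asarray(f, dtype=float)
    p = np.ones_like(f)
    for s in scales:
        # Convolution with the dilated wavelet plays the role of the WT at scale s.
        p *= np.convolve(f, dog_wavelet(s), mode="same")
    return p


# Synthetic rectangular pulse: one rising edge at n = 50, one falling edge at n = 120.
pulse = np.zeros(200)
pulse[50:120] = 1.0
p = multiscale_product(pulse)
print("largest positive peak near n =", int(np.argmax(p)))
print("largest negative peak near n =", int(np.argmin(p)))
```

Because three terms are multiplied, the rising edge yields a positive peak and the falling edge a negative one, while the flat regions stay close to zero.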

The MP was first related to the edge detection problem in image processing [25, 26]. It was later proposed by Bouzid and Ellouze [16, 27] to extract crucial information concerning the vocal source, such as glottal opening and closure instants, the fundamental frequency, the OQ and the voicing decision. In previous studies, we showed that the MP is a robust and efficient method for determining the GCI from both clean and noisy acoustic signals [28, 29].

The strength of the MP of the EGG signal compared to the DEGG signal is studied in depth by Bouzid and Ellouze [16], who measure the voice source parameters using the MP of the EGG signal.

## 3. Proposed method for OQ estimation

### 3.1. Overview of the method

*First stage:* consists of computing the MP of a voiced speech signal and then dividing it into frames of a fixed length. To compute the MP, we multiply the WTs of the speech signal at scales 2, 5/2 and 3 using the quadratic spline function.

To obtain frames of length *N*, we multiply the MP by a sliding rectangular window *w*[*N*]. The MP over the window of index *i* contains the *N* samples selected by the window, where the sample index *k* is within [1, *N*] and *i* is the frame index.

*Second stage:* consists of separating the speech MP into two parts: a negative part MP^{c} which contains information concerning glottal closure peaks, and a positive part MP^{o} which contains information about glottal opening peaks. The MP^{c} signal is derived from the original signal by replacing any positive value by 0. In the same way, the MP^{o} signal is derived from the original signal by replacing any negative value by 0.

Minima of the negative part MP^{c} correspond to the GCIs, and peaks of the positive part MP^{o} correspond to the GOIs.
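The separation described in the second stage amounts to clipping the MP at zero from either side. A minimal sketch follows; the function name `split_mp` and the sample values are hypothetical.

```python
import numpy as np


def split_mp(mp):
    """Split an MP frame into its positive part (MP^o) and negative part (MP^c)."""
    mp = np.asarray(mp, dtype=float)
    mp_o = np.maximum(mp, 0.0)  # negative values replaced by 0: opening peaks
    mp_c = np.minimum(mp, 0.0)  # positive values replaced by 0: closure peaks
    return mp_o, mp_c


# Illustrative MP samples with one closure peak (-3.0) and one opening peak (+0.8).
frame = np.array([0.0, -3.0, 0.1, 0.8, -0.2, 0.0])
mp_o, mp_c = split_mp(frame)
print(mp_o)
print(mp_c)
```

The two parts sum back to the original frame, so no information is lost by the split.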

*Third stage*: concerns the calculation of the crosscorrelation function between the positive and negative parts (MP^{o} and MP^{c}) for estimating the open phase, and of the autocorrelation function of MP^{c} for estimating the fundamental frequency over each frame. The open phase and the pitch period are given, respectively, by the non-zero lag of the first maximum of the crosscorrelation and autocorrelation functions. The OQ is then deduced as the ratio between the open phase and the pitch period.

The crosscorrelation function between MP^{o} and MP^{c} and the autocorrelation function of MP^{c} are both calculated over each frame *i*.
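The third stage can be sketched on a synthetic frame as follows. This is an illustration under stated assumptions rather than the authors' implementation: the correlations are unnormalised, the sign of MP^{c} is flipped so that a good alignment yields a maximum, and the "first maximum" is simplified to the largest value at a non-zero lag. All function names and the impulse-train frame are hypothetical.

```python
import numpy as np


def lagged_correlation(x, y):
    """r[tau] = sum_k x[k] * y[k + tau] for tau = 0 .. N-1 (unnormalised)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    return np.array([np.dot(x[:n - tau], y[tau:]) for tau in range(n)])


def best_nonzero_lag(r):
    """Lag (>= 1) of the largest correlation value; lag 0 is excluded."""
    return int(np.argmax(r[1:])) + 1


def estimate_oq(mp_o, mp_c):
    """Return (open_phase, period, oq) for one frame, with lags in samples."""
    closure = -np.asarray(mp_c, dtype=float)  # assumption: flip sign so closure peaks are positive
    period = best_nonzero_lag(lagged_correlation(closure, closure))
    xcorr = lagged_correlation(mp_o, closure)
    open_phase = best_nonzero_lag(xcorr[:period])  # search lags 1 .. period-1
    return open_phase, period, open_phase / period


# Synthetic frame: closure peaks every 100 samples, opening peaks 60 samples before each closure.
n_samples = 400
mp_c = np.zeros(n_samples)
mp_c[[0, 100, 200, 300]] = -1.0
mp_o = np.zeros(n_samples)
mp_o[[40, 140, 240, 340]] = 0.5
open_phase, period, oq = estimate_oq(mp_o, mp_c)
print(open_phase, period, oq)  # prints: 60 100 0.6
```

On real speech the correlation peaks are broader and may be doubled, so a more careful peak picker than a plain `argmax` would be needed.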

### 3.2. Frame selection

Assuming that the fundamental frequency value is approximately known, the frame length is chosen to be no less than four periods and no more than eight periods. We chose these limits because, in running speech, the fundamental frequency varies by a significant amount over eight pitch periods. We therefore use a rectangular window with a fixed length of 25.6 ms for female speakers and 51.2 ms for male speakers.
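The arithmetic behind these window lengths can be checked with a short sketch. It uses the 20 kHz sampling rate of the Keele database given in Section 4.1; the function names are hypothetical, and the F0 ranges are simply what the four-to-eight-period rule implies for each window length.

```python
def frame_length_samples(duration_ms, fs_hz):
    """Number of samples in a frame of the given duration."""
    return round(duration_ms * fs_hz / 1000.0)


def f0_bounds_hz(duration_ms, min_periods=4, max_periods=8):
    """F0 range for which the frame holds between min_periods and max_periods pitch periods."""
    duration_s = duration_ms / 1000.0
    return min_periods / duration_s, max_periods / duration_s


FS_HZ = 20_000  # sampling rate of the Keele database (Section 4.1)
for label, duration_ms in (("female", 25.6), ("male", 51.2)):
    n = frame_length_samples(duration_ms, FS_HZ)
    f0_low, f0_high = f0_bounds_hz(duration_ms)
    print(f"{label}: {n} samples, 4 to 8 periods for F0 between {f0_low:.2f} and {f0_high:.2f} Hz")
```

The two windows hold 512 and 1024 samples, and the implied F0 ranges meet at about 156 Hz, which is consistent with using the shorter window for higher-pitched voices.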

### 3.3. MP autocorrelation for the fundamental frequency estimation

Autocorrelation analysis is a well-known method for fundamental frequency estimation. This technique was first used as a pitch detector by Rabiner [30]. Henrich et al. [14] applied this approach to estimate the fundamental frequency from the EGG signal.

In [14], Henrich et al. discuss the problems of double or imprecise peaks occurring in the DEGG signal at the opening and the closing of the glottis, and how to handle them. This glottal behaviour was observed by Anastaplo and Karnell [31]. These problems are overcome using the MP of the EGG signal, as proposed in [16]. For real speech, such cases are absent for closing peaks and are seldom observed for opening peaks.

### 3.4. MP crosscorrelation for open phase estimation

To calculate the glottal open phase duration of the speech signal, we first compute its MP. We then compute the crosscorrelation between the positive and negative parts of the MP. The lag of the first maximum is taken as the open phase.

However, in some cases the speech MP produces more than one positive peak during a period. This behaviour induces double peaks in the crosscorrelation function. In such cases, we take the mean of the positions of the two maxima. This solution gives the value nearest to the open phase measured on the EGG signal, which is taken as the ground truth.
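The averaging rule for double peaks can be sketched as below. The article does not state how a double peak is detected, so the relative-height threshold `rel_threshold` is a hypothetical criterion introduced only for this illustration, as are the function name and the synthetic crosscorrelation values.

```python
import numpy as np


def open_phase_lag(xcorr, period, rel_threshold=0.5):
    """Open-phase lag from a crosscorrelation, averaging the lags of a double peak.

    `rel_threshold` is an assumed criterion: a second local maximum counts as
    part of a double peak if it reaches this fraction of the highest one.
    """
    r = np.asarray(xcorr, dtype=float)
    # Candidate lags 1 .. period-1 that have a neighbour on both sides.
    lags = range(1, min(period, len(r) - 1))
    peaks = [t for t in lags if r[t] > r[t - 1] and r[t] >= r[t + 1]]
    if not peaks:
        raise ValueError("no local maximum found within one period")
    peaks.sort(key=lambda t: r[t], reverse=True)
    best = peaks[0]
    if len(peaks) > 1 and r[peaks[1]] >= rel_threshold * r[best]:
        return 0.5 * (best + peaks[1])
    return float(best)


# Synthetic crosscorrelation with a double peak at lags 55 and 65.
double = np.zeros(200)
double[55], double[65] = 1.0, 0.8
# Synthetic crosscorrelation with a single peak at lag 60.
single = np.zeros(200)
single[60] = 1.0
print(open_phase_lag(double, period=100), open_phase_lag(single, period=100))
```

Both synthetic cases return a lag of 60 samples, showing how the mean of the two maxima can recover the position that a single clean opening peak would give.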

### 3.5. OQ estimation

Since the fundamental frequency and the open phase are given, it is possible to estimate the OQ.

Observing the OQ accuracy representation in Figures 12 and 14, we conclude that the OQ estimation is more precise when considering the mean OQ value over the voiced segments.

Gross deviation of the OQ estimation is caused by the errors of the open phase estimation happening when the opening peaks are doubled or imprecise.

The OQ estimation is unbiased in all cases. The error is much larger in Figures 13 and 14 than in Figures 11 and 12, showing that the GOI localisation from the speech signal is less accurate than from the EGG signal in the second case.

## 4. Experiments and results

### 4.1. Data

To evaluate the performance of our algorithm for OQ estimation, we use the Keele University database. This database includes the acoustic speech signals and laryngograph signals (single speaker recording). Five adult female speakers (*f*_{i}) and five adult male speakers (*m*_{i}), with *i* ∈ {1,...,5}, are recorded in low ambient noise conditions in a sound-proof room. Each utterance consists of the same phonetically balanced English text: "The North Wind Story." In each case, the acoustic and laryngograph signals are time-synchronised and share the same sampling rate of 20 kHz [32]. The Keele database includes reference files containing a voiced/unvoiced segmentation and a pitch estimate for 25.6 ms segments with 10 ms overlapping. The reference files also mark uncertain pitch and voicing decisions. The database is open source and is available at [33].

### 4.2. Results

The Keele University database consists of running speech containing voiced, unvoiced and silence parts. Only voiced segments extracted from the database are handled by our algorithm.

To evaluate the performance of our approach for OQ estimation, we calculate absolute and relative errors between OQ estimated from the speech signal and the reference OQ estimated from the EGG signal.

We consider the indexes {1,...,10} corresponding to speakers {*f*_{1}, *f*_{2}, *f*_{3}, *f*_{4}, *f*_{5}, *m*_{1}, *m*_{2}, *m*_{3}, *m*_{4}, *m*_{5}}. Each speaker *k* is characterised by *N*_{k}, the number of voiced segments. Each segment is divided into *n*_{ki} frames, where *k* ∈ {1,...,10} and *i* ∈ {1,...,*N*_{k}}.

For the first case, the absolute and relative errors for a speaker *k* are computed frame by frame from oq_{nki}(*j*), the estimated OQ over a frame *j* that belongs to a voiced segment *i* uttered by a speaker *k*, and oqegg_{nki}(*j*), the reference OQ value for the same frame calculated from the EGG signal.

For the second case, the absolute and relative errors for a speaker *k* are computed from the mean values of the OQ estimated over the frames constituting each voiced segment, where OQ_{ki} is the mean value calculated over the frames constituting voiced segment *i*.
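The difference between the two evaluation modes can be illustrated with a short sketch. The averaging scheme below (mean over the frames of each segment, then mean over the segments of a speaker, with the relative error normalised by the EGG reference) is one plausible reading of the description and is an assumption, as are the function names and the OQ values.

```python
import numpy as np


def frame_level_errors(oq_est, oq_ref):
    """Absolute error and relative error (%) computed frame by frame.

    oq_est, oq_ref: one sequence of per-frame OQ values per voiced segment.
    """
    abs_err, rel_err = [], []
    for est, ref in zip(oq_est, oq_ref):
        est = np.asarray(est, dtype=float)
        ref = np.asarray(ref, dtype=float)
        diff = np.abs(est - ref)
        abs_err.append(diff.mean())
        rel_err.append((diff / ref).mean())
    return float(np.mean(abs_err)), 100.0 * float(np.mean(rel_err))


def segment_level_errors(oq_est, oq_ref):
    """Absolute error and relative error (%) computed on per-segment mean OQ values."""
    est_means = np.array([np.mean(seg) for seg in oq_est])
    ref_means = np.array([np.mean(seg) for seg in oq_ref])
    diff = np.abs(est_means - ref_means)
    return float(diff.mean()), 100.0 * float((diff / ref_means).mean())


# Hypothetical speaker with two voiced segments (two frames and one frame).
oq_speech = [[0.5, 0.7], [0.6]]
oq_egg = [[0.6, 0.6], [0.5]]
frame_abs, frame_rel = frame_level_errors(oq_speech, oq_egg)
seg_abs, seg_rel = segment_level_errors(oq_speech, oq_egg)
print(f"frames:   absolute {frame_abs:.3f}, relative {frame_rel:.1f}%")
print(f"segments: absolute {seg_abs:.3f}, relative {seg_rel:.1f}%")
```

In this toy case the frame errors of the first segment cancel in the segment mean, which is why the segment-level errors come out smaller than the frame-level ones.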

**Table 1** Performance of the MP for the OQ estimation over voiced frames of the Keele University database

Speakers | Absolute error | Relative error (%) | Speakers | Absolute error | Relative error (%) |
---|---|---|---|---|---|
*f*_{1} | 0.08 | 18 | *m*_{1} | 0.10 | 21 |
*f*_{2} | 0.07 | 16 | *m*_{2} | 0.09 | 28 |
*f*_{3} | 0.08 | 18 | *m*_{3} | 0.12 | 30 |
*f*_{4} | 0.05 | 10 | *m*_{4} | 0.08 | 21 |
*f*_{5} | 0.07 | 16 | *m*_{5} | 0.11 | 30 |

**Table 2** Performance of the MP for the OQ estimation over voiced segments of the Keele University database

Speakers | Absolute error | Relative error (%) | Speakers | Absolute error | Relative error (%) |
---|---|---|---|---|---|
*f*_{1} | 0.07 | 13 | *m*_{1} | 0.10 | 19 |
*f*_{2} | 0.04 | 9 | *m*_{2} | 0.07 | 17 |
*f*_{3} | 0.07 | 13 | *m*_{3} | 0.07 | 16 |
*f*_{4} | 0.04 | 8 | *m*_{4} | 0.06 | 15 |
*f*_{5} | 0.05 | 10 | *m*_{5} | 0.10 | 21 |

Table 1 gives the errors computed over voiced frames, whereas Table 2 gives the errors computed over voiced segments.

Overall, the results show that the estimation of the OQ with the proposed method is competitive, especially when the errors are calculated over voiced segments of the database. In this case, absolute errors are at most 0.1 for the male speakers (reached for *m*_{1} and *m*_{5}) and 0.07 for the female speakers (reached for *f*_{1} and *f*_{3}). Relative errors do not exceed 13% for female speakers and 21% for male speakers.

Given these error values and the scarcity of published work on estimating the OQ directly from the speech signal, the proposed approach can be considered effective.

This research is a first step in our broader project to give an accurate estimation of the instantaneous OQ from the speech signal. The proposed measure is important because it provides an approximate interval, shorter than the period, in which to localise the GOI. Once the GOIs are accurately located, the OQ can be re-estimated with more precision and for each period.

## 5. Conclusion

In this article, an approach for the OQ estimation from the speech signal is presented. It is based upon the correlation of the speech MP.

The MP is used to provide a simplified transformed speech signal whose shape resembles that of the derivative of the EGG signal, which represents the glottal source activity.

The OQ estimate is obtained as the ratio of the open phase to the pitch period. The open phase is taken as the non-zero lag of the first maximum of the crosscorrelation function between the positive and the negative parts of the speech MP. In the same way, the pitch period is given by the lag of the first maximum of the autocorrelation function of the negative part of the speech MP.

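The evaluation computes the absolute and relative errors between the OQ values determined from the speech signal and the OQ measured on the EGG signal, which is taken as a reference. The evaluation is done on the Keele University database. The proposed approach estimates the mean OQ over voiced segments with an absolute error of 0.04 to 0.1 and a relative error of 8 to 21%; frame by frame, the absolute error reaches 0.12 and the relative error 30%.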

## References

1. Fant G: *Acoustic Theory of Speech Production (Mouton, La Hague)*. 1960.
2. Gaubitch N, Naylor P: **Spatio-temporal averaging method for enhancement of reverberant speech.** *5th International Conference on Digital Signal Processing* 2007, 607-610.
3. Jinachitra P: **Glottal closure and opening detection for flexible parametric voice coding.** *INTERSPEECH* 2006. paper 1359-Thu2BuP.2
4. Guerchi D, Mermelstein P: **Low-rate quantization of spectral information in a 4 kb/s pitch-synchronous CELP coder.** *IEEE Workshop on speech coding* 2000, 111-113.
5. Gudnason J, Brookes M: **Voice source cepstrum coefficients for speaker identification.** *IEEE International Conference on Acoustics, Speech and Signal Processing* 2008, 4821-4824.
6. Alku P, Vilkman E: **A comparison of glottal voice source quantification parameters in breathy, normal and pressed phonation of female and male speakers.** *Folia Phoniatr (Basel)* 1996, **48:** 240-254. 10.1159/000266415
7. Klatt D, Klatt L: **Analysis, synthesis, and perception of voice quality variations among female and male talkers.** *J Acoust Soc Am* 1990, **87:** 820-857. 10.1121/1.398894
8. Keating PA, Esposito C: **Linguistic voice quality.** *11th Australasian International Conference on Speech Science and Technology, Auckland, NZ* 2006.
9. Echternach M, Dippold S, Sundberg J, Zander MF, Richter B: **High-speed imaging and electroglottography measurements of the open quotient in untrained male voices' register transitions.** *J Voice* 2010, **24**(6):644-650. 10.1016/j.jvoice.2009.05.003
10. Winkler R, Sendlmeier W: **Open quotient (EGG) measurements of young and elderly voices: results of production and perception study.** *ZAS Papers Linguistics* 2005, **40:** 213-225.
11. Hanson DG, Gerratt BR, Berke GS: **Frequency, intensity and target matching effects on photoglottographic measures of open quotient and speed quotient.** *J Speech Hear Res* 1990, **33:** 45-50.
12. Kitzing P, Sonesson B: **A photoglottographical study of the female vocal folds during phonation.** *Folia Phoniatr (Basel)* 1974, **26:** 138-149. 10.1159/000263776
13. Henrich N, d'Alessandro C, Castellengo M, Doval B: **Glottal open quotient in singing: measurements and correlation with laryngeal mechanisms, vocal intensity, and fundamental frequency.** *J Acoust Soc Am* 2005, **117**(3):1417-1430. 10.1121/1.1850031
14. Henrich N, d'Alessandro C, Castellengo M, Doval B: **On the use of the derivative of electroglottographic signals for characterization of nonpathological phonation.** *J Acoust Soc Am* 2004, **115**(3):1321-1332. 10.1121/1.1646401
15. Henrich N, Doval B, d'Alessandro C, Castellengo M: **Open quotient measurements on EGG, speech and singing signals.** *Proceedings of the 4th International Workshop on Advances in Quantitative Laryngoscopy, Voice and Speech Research, Jena* 2000.
16. Bouzid A, Ellouze N: Voice source measurement based on multiscale analysis of electroglottographic signal. Speech Commun
17. Shue YL, Kreiman J, Alwan A: **A novel codebook search technique for estimating the open quotient.** *Interspeech* 2009, 2895-2898.
18. Sturmel N, d'Alessandro C, Doval B: **A spectral method for estimation of the voice speed quotient and evaluation using electroglottography.** In *7th Conference on Advances in Quantitative Laryngology*. Groningen, The Netherlands; 2006:6.
19. Jinachitra P, Smith JO: Joint estimation of glottal source and vocal tract for vocal synthesis using Kalman smoothing and EM algorithm. WASPAA'2005, New Paltz, NY
20. Sturmel N, d'Alessandro C, Doval B: **Glottal parameters estimation on speech using the zeros of the z-transform.** *INTERSPEECH* 2010, 665-668.
21. Mallat S, Zhong S: **Characterization of signals from multiscale edges.** *IEEE Trans Pattern Anal Mach Intell* 1992, **14**(7):710-732. 10.1109/34.142909
22. Wendt C, Petropulu AP: **Pitch determination and speech segmentation using the discrete wavelet transform.** *Proceedings of ISCAS 96, Atlanta* 1996, **2:** 45-48.
23. Tuan VN, d'Alessandro C: **Robust glottal closure detection using the wavelet transform.** *Proceedings of the European Conference on Speech Technology* 1999, 2805-2808.
24. Wang JF, Shen SH: **Wavelet transforms for speech signal processing.** *J Chin Inst Eng* 1999, **22**(5):549-560. 10.1080/02533839.1999.9670493
25. Rosenfeld A: **A nonlinear edge detection.** *Proc IEEE* 1970, **58:** 814-816.
26. Xu Y, Weaver JB, Healy DM, Lu J: **Wavelet transform domain filters: a spatially selective noise filtration technique.** *IEEE Trans Image Process* 1994, **3**(6):747-758. 10.1109/83.336245
27. Bouzid A, Ellouze N: **Electroglottographic measures based on GCI and GOI detection using MP.** *Int J Comput Commun Control* 2008, **III**(1):21-32.
28. Saidi W, Bouzid A, Ellouze N: **Evaluation of multi-scale product method and DYPSA algorithm for glottal closure instant detection.** *3rd International Conference on Information and Communication Technologies: From Theory to Applications, 2008. ICTTA 2008* 2008, 1-5.
29. Saidi W, Bouzid A, Ellouze N: **MPM method and DYPSA algorithm evaluation for GCI detection in noisy speech signal.** *Int J Comput Inf Technol and Comp* 2010, **1**(1):93-105.
30. Rabiner LR: **On the use of autocorrelation analysis for pitch detection.** *IEEE Trans Acoust Speech Signal Process* 1977, **25**(1):24-33. 10.1109/TASSP.1977.1162905
31. Anastaplo S, Karnell MP: **Synchronized videoscopic and electroglottographic examination of glottal opening.** *J Acoust Soc Am* 1988, **83:** 1883-1890. 10.1121/1.396472
32. Plante F, Meyer G, Ainsworth WA: **A pitch extraction reference database.** *Proc of EUROSPEECH* 1995, 837-840.
33. Keele Pitch Database: **Psychology Home page--Human Machine Perception.** University of Liverpool; 1995. [http://www.liv.ac.uk/Psychology/hmp/projects/pitch.html]

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.