
Discriminative features based on modified log magnitude spectrum for playback speech detection

Abstract

To improve the performance of hand-crafted features for playback speech detection, two discriminative features, constant-Q variance-based octave coefficients and constant-Q mean-based octave coefficients, are proposed in this work. They rely on our findings that the variance-based modified log magnitude spectrum and the mean-based modified log magnitude spectrum can enhance the discriminative power between genuine speech and playback speech. Constant-Q variance-based octave coefficients (constant-Q mean-based octave coefficients) are then obtained by combining the variance-based modified log magnitude spectrum (mean-based modified log magnitude spectrum) with octave segmentation and the discrete cosine transform. Finally, constant-Q variance-based octave coefficients and constant-Q mean-based octave coefficients are evaluated on the ASVspoof 2017 corpus version 2.0 and on ASVspoof 2019 physical access, respectively. Experimental results show that the variance-based and mean-based modified log magnitude spectra can produce discriminative features toward playback speech. Further results on the two databases show that constant-Q variance-based octave coefficients and constant-Q mean-based octave coefficients perform better than some commonly used features, such as mel frequency cepstral coefficients and constant-Q cepstral coefficients.

1 Introduction

Replay attacks pose a serious threat to automatic speaker verification (ASV) systems because the source recordings used for playback come from legitimate clients [1, 2]. This motivates our focus on playback speech detection.

Since the ASVspoof 2017 challenge [1, 2], more and more researchers have begun to focus on playback speech detection [3–10]. Similar to many speech signal processing systems, most playback speech detection systems consist of a front-end feature extractor and a back-end classifier [11–18]. For end-to-end systems such as [19–23], the output layer of the neural network can be seen as a virtual classifier and the rest of the network as a deep feature extractor. In this paper, we mainly focus on how to extract discriminative features for playback speech detection.

1.1 Related works

Before 2017, several studies on playback speech detection had been reported. The earlier ones [24–26] were based on small-scale databases in which only a small number of playback and recording conditions were taken into account. For example, in [24, 27], three playback and recording devices were used to collect the database; in [25, 28], one recording device and one playback device were used to create the database, named the authentic and playback speech database (APSD); in [29], the database was built with four smartphones; and in [26], four devices were used to create the playback utterances of the database named AVspoof 2015 (audio-visual spoofing 2015).

Different from the above databases, the launch of the ASVspoof 2017 corpus provided a large common database, collected with 26 playback devices, 25 recording devices, and 26 environments [1, 2, 30]. The ASVspoof 2017 corpus can therefore be used to evaluate a playback speech detection algorithm fairly, because it covers more channel and acoustic conditions than any previous database [24–26]. In addition, the recently released ASVspoof 2019 physical access corpus can also be used in this study. Hence, ASVspoof 2017 and ASVspoof 2019 physical access not only support researchers in developing countermeasures, but also help protect ASV systems against replay attacks [1].

After the ASVspoof 2017 and ASVspoof 2019 physical access corpora were released in 2017 and 2019, respectively, several effective methods were proposed to detect playback speech on these two databases. According to how the features are obtained, these methods can be categorized into two types: hand-crafted features and deep features. In general, hand-crafted features are designed with mathematical formulas, while deep features are learned from the input by neural networks.

Deep feature extraction usually contains two steps: the first is to train a classifier with a neural network on the input, and the second is to remove the output layer of the classifier. Because an end-to-end system can be seen as a virtual classifier (the output layer) plus a deep feature extractor (the remaining layers), the end-to-end system becomes a deep feature extractor once the output layer is removed. So in this study, we also regard the features obtained from end-to-end systems as a special type of deep features. According to the neural networks used, there are several types of deep features. For example, a light convolutional neural network was used to learn deep features from the log power spectrum of the constant-Q transform (CQT) and the fast Fourier transform in [14, 20, 21]; a deep Siamese network, formed by two convolutional neural networks with spectrogram inputs, was used to learn Siamese embedding features in [31]; and a residual network (ResNet) was used to learn deep features from the group delay gram in [19, 32, 33].

The hand-crafted features mainly include the following categories:

  • CQT based features: which include constant-Q cepstral coefficients (CQCC) [34, 35] used in [4, 36, 37], extended CQCC [38, 39], constant-Q magnitude-phase octave coefficients [40], and constant-Q statistics-plus-principal information coefficients [41].

  • Discrete Fourier transform (DFT) based features: which include Mel frequency cepstral coefficients (MFCC) [4, 13, 36], mel filterbank slope [10], linear filterbank slope [10], and Q-log domain DFT-based mean normalized log spectra [42].

  • Variable length energy separation algorithm (VESA)-based features: which include instantaneous frequency cosine coefficients based on VESA [6] and instantaneous amplitude cosine coefficients based on VESA [43].

  • Prediction cepstral coefficients-based features: which include the residual and cepstrum parts of linear prediction cepstral coefficients [13, 19], and frequency domain linear prediction [9].

  • Spectral centroid-based features: which include subband spectral centroid frequency coefficients and subband spectral centroid magnitude coefficients [12] and spectral centroid deviation [16].

  • Phase-based features: which include instantaneous frequency cosine coefficient [44, 45] and modified group delay cepstral coefficient [15].

  • Zero time windowing-based features: zero time windowing cepstral coefficients [46, 47].

  • Single frequency filter-based features: single frequency filter cepstral coefficients [3, 47].

Among them, CQCC is the most widely used feature in playback speech detection; in the ASVspoof 2017 and ASVspoof 2019 challenges, CQCC plus a Gaussian mixture model (GMM) formed the baseline system provided by the organizers [2, 48]. The reason is that the CQT is a long-term transform and, compared with the DFT, it provides more frequency detail with which to capture playback information.

Generally speaking, deep features perform better than hand-crafted features in playback speech detection because deep learning can extract more information that is useful for discriminating playback speech from genuine speech. However, deep features rely heavily on the training data; that is, they generalize only within the scope of the training data. Furthermore, hand-crafted features are preferable to deep features when the goal is to study the properties of playback speech. Since the goal of this paper is to extract discriminative features for playback speech detection, our focus is on hand-crafted discriminative features.

Traditional hand-crafted features used in speech signal processing, such as MFCC and CQCC, were not designed for playback speech detection. To improve the performance of hand-crafted features in detecting playback speech, we focus on designing discriminative features for playback speech detection in this study, based on the following three facts:

  • The CQT is a long-term transform and, compared with the DFT, it provides more frequency detail with which to capture playback information.

  • Many hand-crafted features, such as CQCC and MFCC, are extracted from the log power spectrum, which can be obtained from the log magnitude spectrum (LMS).

  • A feature will have more discriminative power to distinguish playback speech from genuine speech if the discriminative power between genuine speech and playback speech is enhanced in the spectrum from which it is extracted.

Therefore, in this study, the LMS based on the CQT is taken as the study object to investigate how to enhance the discriminative power between genuine speech and playback speech, which yields the modified log magnitude spectrum. Finally, by combining it with octave segmentation and the discrete cosine transform (DCT), hand-crafted features with more discriminative power for playback speech detection can be obtained; we call them hand-crafted discriminative features.

1.2 Contributions of the work

The goal of this work is to extract hand-crafted discriminative features for playback speech detection by enlarging the difference between genuine speech and playback speech. There are two main contributions in this work.

We found that the discriminative power between genuine speech and playback speech can be enhanced if the variance or mean of the LMS is added to it. Based on this finding, two methods are proposed to modify the LMS, which we refer to as the variance-based modified log magnitude spectrum (VMLMS) and the mean-based modified log magnitude spectrum (MMLMS). Here, the LMS is obtained with the CQT, which converts speech from the time domain into the frequency domain. This is the first contribution of the paper.

By combining VMLMS, octave segmentation, and DCT, one new feature is obtained from VMLMS, namely, the constant-Q variance-based octave coefficients (CVOC). In the same way, the other feature is obtained by combining MMLMS, octave segmentation, and DCT, and is named the constant-Q mean-based octave coefficients (CMOC). These are the second contribution of the paper.

The remainder of the paper is organized as follows. Section 2 introduces the modified log magnitude spectrum. Section 3 introduces how to extract the discriminative features. Sections 4 and 5 give the experimental results and corresponding analysis on the ASVspoof 2017 version 2.0 and ASVspoof 2019 physical access databases, respectively. Section 6 concludes the paper.

2 Proposed method I: modified log magnitude spectrum

In this section, to enhance the discriminative power between genuine speech and playback speech, two methods to modify the LMS are proposed by analyzing the discriminative power between genuine speech and playback speech; they yield the VMLMS and the MMLMS. Here, Fisher's ratio [49], which is often used to measure the discriminative power of two classes [50], is used to measure the discriminative power between genuine speech and playback speech; it is defined as follows [11]:

$$\begin{array}{*{20}l} F_{C_{1}C_{2}}=\frac{(\overline{C_{1}} - \overline{C_{2}})^{2}}{\sigma^{2}_{C_{1}} + \sigma^{2}_{C_{2}}} \end{array} $$
(1)

where \(C_{1}\) and \(C_{2}\) represent the two classes, \(F_{C_{1}C_{2}}\) represents the Fisher ratio between \(C_{1}\) and \(C_{2}\), \(\overline{C_{1}}\) and \(\overline{C_{2}}\) represent the means of \(C_{1}\) and \(C_{2}\), respectively, and \(\sigma^{2}_{C_{1}}\) and \(\sigma^{2}_{C_{2}}\) represent the variances of \(C_{1}\) and \(C_{2}\), respectively.
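A minimal NumPy sketch of Eq. (1), treating each class as a 1-D array of observations (the function name is illustrative):

```python
import numpy as np

def fisher_ratio(c1, c2):
    """Fisher's ratio of Eq. (1) between two classes.

    c1, c2: 1-D arrays of observations for each class.
    """
    c1 = np.asarray(c1, dtype=float)
    c2 = np.asarray(c2, dtype=float)
    return (c1.mean() - c2.mean()) ** 2 / (c1.var() + c2.var())
```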

2.1 Variance-based modified log magnitude spectrum

We assume X0 and Y0 are a frame of the genuine speech magnitude spectrum and the corresponding frame of the playback speech magnitude spectrum, respectively, and K is the number of frequency bins; then we can write

$$ {X_{0}}=\bigg\{ x_{1}, x_{2},..., x_{K} \bigg\} $$
(2)
$$ {Y_{0}}=\bigg\{y_{1}, y_{2},..., y_{K} \bigg\} $$
(3)

In addition, X0 and Y0 can be expressed in the log scale, denoted as X and Y:

$$ {X}=\bigg\{ \log(x_{1}), \log(x_{2}),..., \log(x_{K}) \bigg\} $$
(4)
$$ {Y}=\bigg\{\log(y_{1}), \log(y_{2}),..., \log(y_{K}) \bigg\} $$
(5)

Supposing \({\overline {X}}\) and \({\overline {Y}}\) are the means of X and Y, respectively, we can obtain

$$\begin{array}{*{20}l} {\overline{X}}=\frac{\sum_{k=1}^{K}\log(x_{k})}{K} \end{array} $$
(6)
$$\begin{array}{*{20}l} {\overline{Y}}=\frac{\sum_{k=1}^{K}\log(y_{k})}{K} \end{array} $$
(7)

Supposing \({\sigma _{X}^{2}}\) and \({\sigma _{Y}^{2}}\) are the variances of X and Y, respectively, we can obtain

$$\begin{array}{*{20}l} {\sigma_{X}^{2}}=\frac{\sum_{k=1}^{K}(\log(x_{k})-\overline{X})^{2}}{K} \end{array} $$
(8)
$$\begin{array}{*{20}l} {\sigma_{Y}^{2}}=\frac{\sum_{k=1}^{K}(\log(y_{k})-\overline{Y})^{2}}{K} \end{array} $$
(9)

Supposing FXY is Fisher’s ratio between X and Y, according to Eq. (1), we can obtain

$$\begin{array}{*{20}l} {F_{XY}}=\frac{(\overline{X}-\overline{Y})^{2}}{\sigma_{X}^{2}+\sigma_{Y}^{2}} \end{array} $$
(10)

Supposing X′ and Y′ satisfy:

$$ {X'}=\bigg\{ \log(x_{1})+\sigma_{X}^{2}, \log(x_{2})+\sigma_{X}^{2},..., \log(x_{K})+\sigma_{X}^{2} \bigg\} $$
(11)
$$ {Y'}=\bigg\{\log(y_{1})+\sigma_{Y}^{2}, \log(y_{2})+\sigma_{Y}^{2},..., \log(y_{K})+\sigma_{Y}^{2} \bigg\} $$
(12)

The means of X′ and Y′, denoted as \({\overline {X'}}\) and \({\overline {Y'}}\), are as follows:

$$\begin{array}{*{20}l} {\overline{X'}} =\frac{\sum_{k=1}^{K}(\log(x_{k})+\sigma_{X}^{2})}{K} \notag \\ \hspace{0.5cm} =\overline{X}+\sigma_{X}^{2} \end{array} $$
(13)
$$\begin{array}{*{20}l} {\overline{Y'}} =\frac{\sum_{k=1}^{K}(\log(y_{k})+\sigma_{Y}^{2})}{K} \notag \\ \hspace{0.5cm} =\overline{Y}+\sigma_{Y}^{2} \end{array} $$
(14)

The variances of X′ and Y′, denoted as \({\sigma _{X'}^{2}}\) and \({\sigma _{Y'}^{2}}\), are as follows:

$$\begin{array}{*{20}l} {\sigma_{X'}^{2}}&=\frac{\sum_{k=1}^{K}(\log(x_{k})+\sigma_{X}^{2}-\overline{X'})^{2}}{K} \notag \\ \hspace{0.7cm} &=\frac{\sum_{k=1}^{K}(\log(x_{k})+\sigma_{X}^{2}-(\overline{X}+\sigma_{X}^{2}))^{2}}{K} \notag \\ \hspace{0.7cm} &=\frac{\sum_{k=1}^{K}(\log(x_{k})-\overline{X})^{2}}{K} \notag \\ \hspace{0.7cm} &=\sigma_{X}^{2} \end{array} $$
(15)
$$\begin{array}{*{20}l} {\sigma_{Y'}^{2}}&=\frac{\sum_{k=1}^{K}(\log(y_{k})+\sigma_{Y}^{2}-\overline{Y'})^{2}}{K} \notag \\ \hspace{0.7cm} &=\frac{\sum_{k=1}^{K}(\log(y_{k})+\sigma_{Y}^{2}-(\overline{Y}+\sigma_{Y}^{2}))^{2}}{K} \notag \\ \hspace{0.7cm} &=\frac{\sum_{k=1}^{K}(\log(y_{k})-\overline{Y})^{2}}{K} \notag \\ \hspace{0.7cm} &=\sigma_{Y}^{2} \end{array} $$
(16)

Supposing \(\phantom {\dot {i}\!}F_{X'Y'}\) is Fisher’s ratio between X′ and Y′, according to Eq. (1), we can obtain

$$\begin{array}{*{20}l} {F_{X'Y'}}&=\frac{(\overline{X'}-\overline{Y'})^{2}}{\sigma_{X'}^{2}+\sigma_{Y'}^{2}} \notag \\ \hspace{1.1cm} &=\frac{(\overline{X}+\sigma_{X}^{2}-\overline{Y}-\sigma_{Y}^{2})^{2}}{\sigma_{X}^{2}+\sigma_{Y}^{2}} \notag \\ \hspace{1.1cm} &=\frac{(\overline{X}-\overline{Y}+\sigma_{X}^{2}-\sigma_{Y}^{2})^{2}}{\sigma_{X}^{2}+\sigma_{Y}^{2}} \end{array} $$
(17)

Let

$$\begin{array}{*{20}l} {F_{mro}}&=\frac{F_{X'Y'}}{F_{XY}} \notag \\ \hspace{0.5cm} &=\frac{ \frac{(\overline{X}-\overline{Y}+\sigma_{X}^{2}-\sigma_{Y}^{2})^{2}}{\sigma_{X}^{2}+\sigma_{Y}^{2}} }{ \frac{(\overline{X}-\overline{Y})^{2}}{\sigma_{X}^{2}+\sigma_{Y}^{2}}} \notag \\ \hspace{0.5cm} &=\frac{(\overline{X}-\overline{Y}+\sigma_{X}^{2}-\sigma_{Y}^{2})^{2} }{(\overline{X}-\overline{Y})^{2}} \notag \\ \hspace{0.5cm} &=(1+\frac{\sigma_{X}^{2}-\sigma_{Y}^{2} }{\overline{X}-\overline{Y} })^{2} \end{array} $$
(18)

Let

$$\begin{array}{*{20}l} F_{vrs}=\frac{\sigma_{X}^{2}-\sigma_{Y}^{2} }{\overline{X}-\overline{Y}} \end{array} $$
(19)

From Eqs. (18) and (19), we can see that Fmro is determined by Fvrs, which in turn is determined by \(\sigma _{X}^{2}, \sigma _{Y}^{2}, \overline {X}\), and \(\overline {Y}\). However, as these parameters are unknown, the values of Fvrs and Fmro cannot be determined analytically.

Therefore, a statistical analysis is used to estimate Fvrs. To this end, APSD [25] and AVspoof 2015 [26] are used here, for two reasons. One is that they are the two largest publicly available databases of genuine-playback utterance pairs to date, with 3600 and 5600 pairs, respectively. The other is that the former was designed for replay speech detection and the latter for replay spoofing and synthetic speech detection. The 3600 genuine-playback utterance pairs from APSD and the 5600 pairs from AVspoof 2015 are used to obtain the statistics of Fvrs over different \(\sigma _{X}^{2}, \sigma _{Y}^{2}, \overline {X}\), and \(\overline {Y}\). The CQT is applied to the utterances from the two databases, and Fvrs is computed frame by frame from \(\sigma _{X}^{2}, \sigma _{Y}^{2}, \overline {X}\), and \(\overline {Y}\). Finally, the average Fvrs is obtained, denoted as \(\overline {F_{vrs}}\). In the same way, the average values of \(\sigma _{X}^{2}, \sigma _{Y}^{2}, \overline {X}\), and \(\overline {Y}\) are denoted as \(\overline {\sigma _{X}^{2}}, \overline {\sigma _{Y}^{2}}, \overline {\overline {X}}\), and \(\overline {\overline {Y}}\).
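The statistical analysis above can be sketched as follows, assuming the paired genuine and playback log-magnitude CQT frames have already been extracted and aligned (the function and variable names are illustrative):

```python
import numpy as np

def f_vrs(X, Y):
    """F_vrs of Eq. (19) for one genuine/playback pair of log-magnitude frames."""
    return (np.var(X) - np.var(Y)) / (np.mean(X) - np.mean(Y))

def average_f_vrs(genuine_frames, playback_frames):
    """Average F_vrs over paired frames (iterables of 1-D log-magnitude CQT
    frames, aligned frame by frame across the utterance pairs)."""
    values = [f_vrs(X, Y) for X, Y in zip(genuine_frames, playback_frames)]
    return float(np.mean(values))
```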

Table 1 shows the statistics of \(\overline {F_{vrs}}\) on APSD and AVspoof 2015. From Table 1, it can be observed that \(\overline {F_{vrs}}\) is above 0 for both APSD and AVspoof 2015. According to the relationship between Fmro and Fvrs in Eqs. (18) and (19), the statistic \(\overline {F_{mro}}\) is therefore above 1 on the two databases, and hence \(\phantom {\dot {i}\!}F_{X'Y'}\) is larger than FXY.

Table 1 Statistics value of \(\overline {F_{vrs}}\) on APSD and AVspoof 2015

The above discussion leads to the finding that the discriminative power between X′ and Y′ is greater than that between X and Y. In addition, based on the comparison of Eqs. (4) with (11) and Eqs. (5) with (12), a method to modify the LMS is proposed to enhance the discriminative power between genuine speech and playback speech: the VMLMS is obtained by adding the variance of the LMS to the LMS itself. Figure 1(a) shows how the VMLMS is obtained from the LMS.

Fig. 1

Schematic diagram of constant-Q variance-based octave coefficients extraction, including a VMLMS extraction and b CVOC extraction

2.2 Mean-based modified log magnitude spectrum

Supposing X′′ and Y′′ satisfy

$$ {X^{\prime\prime}}=\bigg\{ \log(x_{1})+\overline{X}, \log(x_{2})+\overline{X},..., \log(x_{K})+\overline{X} \bigg\} $$
(20)
$$ {Y^{\prime\prime}}=\bigg\{\log(y_{1})+\overline{Y}, \log(y_{2})+\overline{Y},..., \log(y_{K})+\overline{Y} \bigg\} $$
(21)

The means of X′′ and Y′′, denoted as \({\overline {X^{\prime \prime }}}\) and \({\overline {Y^{\prime \prime }}}\), are as follows:

$$\begin{array}{*{20}l} {\overline{X^{\prime\prime}}}&=\frac{\sum_{k=1}^{K}(\log(x_{k})+\overline{X})}{K} \notag \\ \hspace{0.6cm} &=\frac{\sum_{k=1}^{K}\log(x_{k})}{K} +\overline{X} \notag \\ \hspace{0.6cm} &=2\overline{X} \end{array} $$
(22)
$$\begin{array}{*{20}l} {\overline{Y^{\prime\prime}}}&=\frac{\sum_{k=1}^{K}(\log(y_{k})+\overline{Y})}{K} \notag \\ \hspace{0.6cm} &=\frac{\sum_{k=1}^{K}\log(y_{k})}{K} +\overline{Y} \notag \\ \hspace{0.6cm} &=2\overline{Y} \end{array} $$
(23)

The variances of X′′ and Y′′, denoted as \({\sigma _{X^{\prime \prime }}^{2}}\) and \({\sigma _{Y^{\prime \prime }}^{2}}\), are as follows:

$$\begin{array}{*{20}l} {\sigma_{X^{\prime\prime}}^{2}}&=\frac{\sum_{k=1}^{K}(\log(x_{k})+\overline{X}-\overline{X^{\prime\prime}})^{2}}{K} \notag \\ \hspace{0.7cm} &=\frac{\sum_{k=1}^{K}(\log(x_{k})+\overline{X}-2\overline{X})^{2}}{K} \notag \\ \hspace{0.7cm} &=\frac{\sum_{k=1}^{K}(\log(x_{k})-\overline{X})^{2}}{K} \notag \\ \hspace{0.7cm} &=\sigma_{X}^{2} \end{array} $$
(24)
$$\begin{array}{*{20}l} {\sigma_{Y^{\prime\prime}}^{2}}&=\frac{\sum_{k=1}^{K}(\log(y_{k})+\overline{Y}-\overline{Y^{\prime\prime}})^{2}}{K} \notag \\ \hspace{0.7cm} &=\frac{\sum_{k=1}^{K}(\log(y_{k})+\overline{Y}-2\overline{Y})^{2}}{K} \notag \\ \hspace{0.7cm} &=\frac{\sum_{k=1}^{K}(\log(y_{k})-\overline{Y})^{2}}{K} \notag \\ \hspace{0.7cm} &=\sigma_{Y}^{2} \end{array} $$
(25)

Supposing \(\phantom {\dot {i}\!}F_{X^{\prime \prime }Y^{\prime \prime }}\) is the Fisher’s ratio between X′′ and Y′′, according to Eq. (1), we can obtain

$$\begin{array}{*{20}l} {F_{X^{\prime\prime}Y^{\prime\prime}}}&=\frac{(\overline{X^{\prime\prime}}-\overline{Y^{\prime\prime}})^{2}}{\sigma_{X^{\prime\prime}}^{2}+\sigma_{Y^{\prime\prime}}^{2}} \notag \\ \hspace{1.2cm} &=\frac{(2\overline{X}-2\overline{Y})^{2}}{\sigma_{X}^{2}+\sigma_{Y}^{2}} \notag \\ \hspace{1.2cm} &=\frac{4(\overline{X}-\overline{Y})^{2}}{\sigma_{X}^{2}+\sigma_{Y}^{2}}\notag \\ \hspace{1.2cm} &=4F_{XY} \end{array} $$
(26)

From Eq. (26), we can see that \(\phantom {\dot {i}\!}F_{X^{\prime \prime }Y^{\prime \prime }}\) is four times FXY. In other words, the discriminative power between X′′ and Y′′ is greater than that between X and Y. Hence, the other method to modify the LMS is proposed: the MMLMS is obtained by adding the mean of the LMS to the LMS itself. Figure 2(a) shows how the MMLMS is obtained from the LMS.

Fig. 2

Schematic diagram of constant-Q mean-based octave coefficients extraction, including a MMLMS extraction and b CMOC extraction
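At the frame level, the two modifications of Sections 2.1 and 2.2 reduce to adding each frame's variance (VMLMS) or mean (MMLMS) to all of its bins. A minimal sketch, assuming lms is a (frequency bins × frames) log-magnitude CQT matrix:

```python
import numpy as np

def vmlms(lms):
    """Variance-based modified LMS: add each frame's variance to every bin
    of that frame. lms has shape (K, N) = (frequency bins, frames)."""
    return lms + np.var(lms, axis=0, keepdims=True)

def mmlms(lms):
    """Mean-based modified LMS: add each frame's mean to every bin of that frame."""
    return lms + np.mean(lms, axis=0, keepdims=True)
```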

3 Proposed method II: hand-crafted discriminative features extraction

In this section, the extraction of CVOC and CMOC is introduced. Figures 1 and 2 show the block diagrams of CVOC and CMOC extraction, respectively.

From Fig. 1, it can be seen that the procedure consists of two parts: (a) VMLMS extraction and (b) CVOC extraction, where CVOC is obtained on the basis of VMLMS. The VMLMS extraction comprises five modules: CQT, magnitude spectrum, log, variance, and add. The CVOC extraction on the basis of VMLMS comprises two modules: octave segmentation and DCT.

From Fig. 2, it can be observed that the procedure consists of two parts: (a) MMLMS extraction and (b) CMOC extraction, where CMOC is obtained on the basis of MMLMS. The MMLMS extraction comprises five modules: CQT, magnitude spectrum, log, mean, and add. The CMOC extraction on the basis of MMLMS comprises two modules: octave segmentation and DCT.

The CQT module converts speech from the time domain into the frequency domain. The magnitude spectrum module computes the magnitude spectrum of the CQT, and the log module then yields the LMS. The variance (mean) and add modules produce the VMLMS (MMLMS) from the LMS. Octave segmentation splits the frequency bins of the modified LMS into blocks according to octave, and the DCT extracts the principal information of every block. Next, the CQT, octave segmentation, and DCT are introduced in detail.

3.1 Constant-Q transform

The CQT was proposed in [51, 52]. Here, Q is defined as the ratio of the center frequency to the bandwidth, as in Eq. (27), where fm is the center frequency and δf is the bandwidth.

$$ {Q} =\frac{{f}_{m}}{\delta_{f}} $$
(27)

where fm represents the center frequency of the m-th frequency bin and obeys

$$ {f}_{m} = {f_{1}}2^{\frac{m-1}{B}} $$
(28)

where f1 is the center frequency of the lowest-frequency bin and B is the number of bins per octave.

From Eq. (28), we can see that each frequency bin has a different bandwidth: the larger m, the larger the bandwidth. This differs from the DFT, in which every frequency bin has the same bandwidth.
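A small worked example of the geometric bin spacing in Eq. (28); the value of f1 here is illustrative, not a setting taken from the paper (the CQT settings actually used are given in Section 4.2):

```python
import numpy as np

B = 96                        # bins per octave
f1 = 15.625                   # illustrative lowest center frequency (Hz)
m = np.arange(1, 9 * B + 1)   # bin indices over 9 octaves
f_m = f1 * 2.0 ** ((m - 1) / B)
# The spacing f_{m+1} - f_m grows geometrically with m, so higher bins have
# wider bandwidths, unlike the uniformly spaced bins of the DFT.
```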

For a discrete time domain signal x(n), supposing Y(m,n) is its CQT, which is defined as

$$ {Y}(m,n) =\sum_{j=n-\lfloor{\frac{N_{m}}{2}}\rfloor}^{n+\lfloor{\frac{N_{m}}{2}\rfloor}}{x(j)}{a_{m}^{\ast}}{(j-n-\frac{N_{m}}{2})} $$
(29)

where m=1,2,...,K is the frequency bin index, Nm are the variable window lengths, \({a_{m}^{\ast }}(n)\) denotes the complex conjugate of am(n), and ⌊∙⌋ denotes rounding toward negative infinity. The basis functions am(n) are complex-valued time-frequency atoms and are defined by

$$ {a_{m}}(n) = \frac{1}{C}\nu(\frac{n}{N_{m}})\exp[i(2\pi{n}\frac{f_{m}}{f_{s}}+\phi_{m})] $$
(30)

where fm is the center frequency of the m-th bin, fs is the sampling rate, ν(t) is a window function (e.g., a Hanning window), ϕm is a phase offset, and C is a scaling factor given by

$$ {C} = \sum_{n'=-\lfloor{\frac{N_{m}}{2}}\rfloor}^{\lfloor{\frac{N_{m}}{2}\rfloor}}\nu(\frac{n'+\frac{N_{m}}{2}}{N_{m}}) $$
(31)
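In practice, an off-the-shelf CQT implementation can be used to obtain the LMS. The sketch below uses librosa's CQT as a stand-in for the implementation of [34, 51, 52]; the parameter values are illustrative rather than the exact toolbox settings:

```python
import numpy as np
import librosa

def log_magnitude_cqt(wav_path, bins_per_octave=96, n_octaves=9, fmin=15.625):
    """Log magnitude spectrum (LMS) of the CQT of Eqs. (29)-(31).

    fmin plays the role of f_1 in Eq. (28); with 9 octaves the highest bin
    stays below the 8 kHz Nyquist frequency of 16 kHz speech.
    """
    y, sr = librosa.load(wav_path, sr=None)
    C = librosa.cqt(y, sr=sr, fmin=fmin,
                    n_bins=bins_per_octave * n_octaves,
                    bins_per_octave=bins_per_octave)
    return np.log(np.abs(C) + 1e-10)  # small floor avoids log(0)
```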

3.2 Octave segmentation and discrete cosine transform

In our previous work, octave segmentation was proposed to segment the magnitude-phase spectrum [40] and the octave power spectrum [53]. In this work, octave segmentation is used to segment the VMLMS (MMLMS) into non-overlapping blocks according to octave. After octave segmentation, every block has B frequency bins, and the DCT is then used to extract the principal information of every block. Next, we take the VMLMS as an example to show how the final coefficients are calculated.

For Y(m,n), after the log operation, we obtain log(|Y(m,n)|); let log(|YVMLMS(m,n)|) denote its modified version. After octave segmentation is applied to log(|YVMLMS(m,n)|), it can be written as

$$ \log(|{Y_{VMLMS}}(m,n)|)=\bigg\{block_{1}, block_{2},..., block_{R} \bigg\} $$
(32)

where R represents the total number of octaves and satisfies

$$ R = \frac{K}{B} $$
(33)

The DCT is then applied to every block. For every block, the first Z DCT coefficients are selected as the feature (Z is a positive integer), yielding the CVOC of x(n), denoted as CVOCx(n).

$$ CVOC_{x(n)}=\bigg\{\textit{C}_{\textit{1}}(\textit{0}), \textit{C}_{\textit{1}}(\textit{z}),..., \textit{C}_{R}(\textit{0}), \textit{C}_{R}(\textit{z}) \bigg\} $$
(34)

where z is from 1 to Z-1 and

$$ {\textit{C}_{\textit{1}}(\textit{0})} =\sqrt{\frac{1}{B}}\sum_{b=1}^{{B}}block_{1} $$
(35)
$$ \textit{C}_{\textit{1}}(\textit{z}) =\sqrt{\frac{2}{B}}\sum_{b=1}^{B}{{\text{block}}_{1}}\cos\bigg\{\frac{(2b-1)z{\pi}}{2B}\bigg\} $$
(36)
$$ {\textit{C}_{R}(\textit{0})} =\sqrt{\frac{1}{B}}\sum_{b=(R-1)\times B + 1}^{{R\times B}}{\text{block}}_{R} $$
(37)
$$ \textit{C}_{R}(\textit{z}) =\sqrt{\frac{2}{B}}\sum_{b=(R-1)\times B + 1}^{R\times B}{{\text{block}}_{R}}\cos\bigg\{\frac{(2b -1)z{\pi}}{2B}\bigg\} $$
(38)
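Putting Eqs. (32)-(38) together, a sketch of the coefficient extraction, assuming the modified LMS (VMLMS or MMLMS) is given as a matrix with K = R × B bins per frame; the orthonormal DCT-II matches the cosine sums above:

```python
import numpy as np
from scipy.fft import dct

def octave_coefficients(mlms, bins_per_octave=96, n_keep=12):
    """Static CVOC/CMOC coefficients from a modified LMS (Eqs. (32)-(38)).

    mlms: array of shape (K, N) with K = R * bins_per_octave.
    Each octave block of B bins is compressed to its first n_keep DCT-II
    coefficients per frame; the blocks are then concatenated.
    """
    K, N = mlms.shape
    B = bins_per_octave
    R = K // B                              # number of octaves, Eq. (33)
    feats = []
    for r in range(R):
        block = mlms[r * B:(r + 1) * B, :]  # octave segmentation, Eq. (32)
        feats.append(dct(block, type=2, norm='ortho', axis=0)[:n_keep, :])
    return np.concatenate(feats, axis=0)    # shape (R * n_keep, N)
```

With the settings used later (B = 96, R = 9, Z = 12), this yields 108 static coefficients per frame.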

4 Studies on ASVspoof 2017

In this section, CVOC and CMOC are evaluated on ASVspoof 2017 corpus version 2.0 (ASVspoof 2017 V2) for playback speech detection.

4.1 Database introduction

The ASVspoof 2017 corpus was released after the ASVspoof 2017 challenge [1, 30]. However, the organizers found that some zero-valued samples and silence in ASVspoof 2017 affect the results of playback speech detection. In 2018, they updated ASVspoof 2017 by removing those zero-valued samples and silence and named the corrected version ASVspoof 2017 V2 [30]. It consists of three subsets: training, development, and evaluation data; Table 2 gives some details of ASVspoof 2017 V2.

Table 2 Number of speakers and the corresponding number of genuine and playback utterances in the training, development, and evaluation sets in ASVspoof 2017 V2

4.2 Evaluation rule and experimental setup

In the ASVspoof 2017 challenge, participants are allowed to pool the training and development data to train a final model, and the equal error rate (EER) is used as the evaluation metric. Following the ASVspoof 2017 challenge rules, two types of models are trained: one is used to evaluate the performance of the proposed features on the evaluation set, for which 4724 utterances from the training and development sets are used; the other is used to evaluate the performance on the development set, for which 3014 utterances from the training set are used.
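For reference, a simple way to compute the EER from detection scores (a sketch; higher scores are assumed to indicate genuine speech):

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores):
    """EER: error rate at the threshold where the false acceptance rate
    (spoof accepted) equals the false rejection rate (genuine rejected)."""
    genuine_scores = np.asarray(genuine_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```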

The CQT has several important parameters that affect the final performance: the number of bins per octave (B), the number of octaves (R), the sampling period used for re-sampling the octave power spectrum into a linear power spectrum [34], and gamma. In the extraction of CMOC and CVOC, all CQT parameters are set according to [34]: B is set to 96, the sampling period to 16, gamma to 3.3026, and R to 9, which means there are 9 octaves in the CQT. In addition, in CMOC extraction, Z is set to 12, which means 12 coefficients are obtained from every block after the DCT is applied. Therefore, the static dimension of CMOC is 108. Since our previous work [38, 40] showed that static features degrade performance in playback speech detection, only dynamic features are used in this study. For the different dynamic feature combinations of CMOC and CVOC, D and A denote delta and acceleration, respectively.

In this study, as in our previous playback speech detection work [38, 40, 41], a deep neural network (DNN) is selected as the classifier because we found that DNN-based systems give better performance. The reason may be that a DNN acts both as a classifier and as a feature learner [54]. The Computational Network Toolkit [55] is used to train the DNNs that serve as classifiers in our experiments, with stochastic gradient descent. A series of four-layer DNN classifiers is trained for the different feature combinations of CMOC and CVOC; each has two hidden layers with 512 nodes per layer, an output layer with 2 nodes, and an input layer whose size is determined by an 11-frame context window of the input feature vector. In other words, the number of input nodes differs across the dynamic feature combinations of CVOC and CMOC: for example, for CVOC-A, the input size is 108 ×11 (five frames of left context and five of right), while for CMOC-DA it is 216 ×11.

All the DNNs trained in our experiments follow the same recipe: (1) the training criterion is cross-entropy with softmax; (2) sigmoid activations are used in the hidden layers; (3) mean and variance normalization is applied to the input data; (4) stochastic gradient descent is used for training; (5) the learning rate is set to 0.8 for the first epoch, 3.2 for the second to fifteenth epochs, and 0.08 for the sixteenth to twenty-fifth epochs, for a total of 25 epochs; (6) the minibatch size is 256 for the first epoch and 1024 for the remaining epochs; and (7) the momentum is set to 0.9.
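The topology and schedule described above can be summarized in a rough PyTorch sketch; the experiments themselves use the Computational Network Toolkit, so this is only an approximate equivalent, and the feature dimension shown corresponds to CVOC-A:

```python
import torch
import torch.nn as nn

FEAT_DIM, CONTEXT = 108, 11   # e.g., CVOC-A with an 11-frame context window

model = nn.Sequential(
    nn.Linear(FEAT_DIM * CONTEXT, 512), nn.Sigmoid(),  # hidden layer 1
    nn.Linear(512, 512), nn.Sigmoid(),                  # hidden layer 2
    nn.Linear(512, 2),                                  # genuine vs. playback
)
criterion = nn.CrossEntropyLoss()                       # cross-entropy with softmax
optimizer = torch.optim.SGD(model.parameters(), lr=0.8, momentum=0.9)

# Schedule described above (25 epochs in total): learning rate 0.8 for epoch 1,
# 3.2 for epochs 2-15, 0.08 for epochs 16-25; minibatch size 256 for epoch 1
# and 1024 afterwards; inputs are mean/variance normalized.
```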

4.3 Experiment results and analysis

Table 3 gives the experimental results on the ASVspoof 2017 V2 development set using dynamic features of CMOC and CVOC. From Table 3, two conclusions can be drawn: (1) for CMOC, CMOC-A gives the best performance on the ASVspoof 2017 V2 development set, followed by CMOC-DA and CMOC-D; (2) for CVOC, CVOC-DA performs better than CVOC-A, which in turn performs better than CVOC-D on the ASVspoof 2017 V2 development set.

Table 3 Experimental results (EER(%)) on ASVspoof 2017 V2 development set using dynamic features of CMOC and CVOC

Table 4 gives the experimental results on the ASVspoof 2017 V2 evaluation set using the different dynamic features of CMOC and CVOC. From Table 4, several conclusions can be drawn: (1) for CMOC, CMOC-DA gives the best performance on the ASVspoof 2017 V2 evaluation set, followed by CMOC-D and CMOC-A; (2) for CVOC, CVOC-DA performs better than CVOC-D, which in turn performs better than CVOC-A; (3) comparing Table 3 with Table 4, CMOC-A performs best on the development set while CMOC-DA performs best on the evaluation set, and CVOC-DA gives the best performance on both the development and evaluation sets; (4) CVOC-DA performs better than CMOC-DA on the ASVspoof 2017 V2 evaluation set. As mentioned above, CVOC and CMOC are obtained by applying octave segmentation plus DCT to VMLMS and MMLMS, respectively, where VMLMS is motivated by a statistical analysis and MMLMS follows directly from the mathematical derivation. Although their discriminative power cannot be compared directly with Fisher's ratio, the experimental results indicate that CVOC is more discriminative than CMOC on the ASVspoof 2017 V2 evaluation set.

Table 4 Experimental results (EER(%)) on ASVspoof 2017 V2 evaluation set using dynamic features of CMOC and CVOC

4.4 Comparison with modified log magnitude spectrum

In this subsection, the performance of the modified log magnitude spectra, namely MMLMS and VMLMS, is compared with that of the corresponding CMOC and CVOC on the ASVspoof 2017 V2 evaluation set. Table 5 gives the comparison in terms of EER, where a DNN is also used to model MMLMS and VMLMS. From Table 5, it can be seen that CMOC performs better than MMLMS and CVOC performs better than VMLMS on the ASVspoof 2017 V2 evaluation set. The reason is that more discriminative information is obtained by applying octave segmentation plus DCT to the modified spectra, which reduces the EER by 16.46% and 12.25%, respectively.

Table 5 Comparison with modified log magnitude spectrum on ASVspoof 2017 V2 evaluation set in terms of EER (%)

4.5 Comparison with Gaussian mixture model

In this subsection, the performance of CMOC-DA and CVOC-DA using the DNN is compared with that obtained using a GMM as the model for CMOC-DA and CVOC-DA on ASVspoof 2017 V2. Table 6 shows the comparison on the ASVspoof 2017 V2 evaluation set in terms of EER, where the GMM has 512 mixture components. From Table 6, several conclusions can be drawn: (1) for CMOC-DA, the EER increases from 14.16% to 31.33% when switching from the DNN to the GMM, an increase of 121.26%; (2) for CVOC-DA, the EER increases from 11.46% to 30.56%, an increase of 166.67%; (3) from this comparison, the DNN performs better than the GMM on the ASVspoof 2017 V2 evaluation set for CMOC-DA and CVOC-DA, probably because the DNN has feature learning ability in addition to classification, which also confirms that considering the DNN in our studies is useful.

Table 6 Comparison with GMM on ASVspoof 2017 V2 evaluation set in terms of EER (%)
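The GMM comparison above can be reproduced in outline with scikit-learn; this is a sketch only, and the diagonal covariance type and log-likelihood-ratio scoring are assumptions rather than details stated here:

```python
from sklearn.mixture import GaussianMixture

def train_gmms(genuine_frames, spoof_frames, n_components=512):
    """Train one GMM per class on frame-level features (rows = frames)."""
    gmm_gen = GaussianMixture(n_components, covariance_type='diag').fit(genuine_frames)
    gmm_spf = GaussianMixture(n_components, covariance_type='diag').fit(spoof_frames)
    return gmm_gen, gmm_spf

def llr_score(utt_frames, gmm_gen, gmm_spf):
    """Utterance score: average per-frame log-likelihood ratio, genuine vs. spoof."""
    return gmm_gen.score(utt_frames) - gmm_spf.score(utt_frames)
```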

4.6 Comparison with some commonly used features

In this section, some commonly used features, such as MFCC and CQCC, are compared with CVOC and CMOC on the ASVspoof 2017 V2 evaluation set. In addition, if the variance and add modules are removed from Fig. 1 (or the mean and add modules from Fig. 2), the resulting feature can be named the constant-Q octave coefficients (COC). Comparing COC with CVOC and CMOC shows the role of VMLMS and MMLMS in CVOC and CMOC.

Table 7 gives the performance comparison among MFCC-DA, CQCC-DA, COC-DA, CMOC-DA, and CVOC-DA on the ASVspoof 2017 V2 evaluation set in terms of EER, where MFCC-DA, CQCC-DA, and COC-DA each have their own DNN classifiers. From Table 7, it can be seen that (1) CQCC-DA, COC-DA, CMOC-DA, and CVOC-DA perform better than MFCC-DA. The reason is that MFCC-DA is based on the DFT, a short-term transform, while the other four features are based on the CQT, a long-term transform that provides more frequency detail. (2) Both CVOC-DA and CMOC-DA perform better than COC-DA, which means that the proposed VMLMS and MMLMS are more discriminative toward playback speech; this also confirms that our idea is correct and effective. (3) CVOC-DA and CMOC-DA perform better than CQCC-DA and COC-DA because the modified log magnitude spectrum is used in the two feature extractions.

Table 7 Comparison with some commonly used features on ASVspoof 2017 V2 evaluation set in terms of EER (%)

4.7 Comparison with some other known systems

Table 8 gives the comparison with some known systems based on hand-crafted features on the ASVspoof 2017 V2 evaluation set, where logE denotes logarithm energy, qDFTspe denotes the Q-log domain DFT-based mean normalized log spectra [42], eCQCC denotes extended CQCC [38], CMPOC denotes constant-Q magnitude-phase octave coefficients [40], and CQSPIC denotes constant-Q statistics-plus-principal information coefficients [41].

Table 8 Comparison with some known systems based on hand-crafted features on the ASVspoof 2017 V2 evaluation set in terms of EER(%)

From Table 8, it can be seen that our systems perform better than most of the other known systems, probably because discriminative features are used in our systems. However, our systems are slightly worse than the system based on qDFTspe [42]. In addition, the SDA feature combination performs best in [30] while the DA combination performs best in our systems; the reason is that cepstral mean and variance normalization (CMVN) is applied to the features in [30], which changes the feature distribution, whereas CMVN is not applied to our features. We also found that CQSPIC performs better than CVOC and CMOC, because CQSPIC is a combined feature containing spectral principal information, subband information, and short-term spectral statistical information, while CVOC and CMOC contain only spectral principal information.

5 Studies on ASVspoof 2019 physical access

5.1 Database introduction and evaluation metric

In this section, CMOC and CVOC are evaluated on ASVspoof 2019 physical access [48], which was released in 2019 for the ASVspoof 2019 challenge; some details are given in Table 9. The corpus has three subsets: training, development, and evaluation. According to the ASVspoof 2019 challenge rules, the tandem detection cost function (t-DCF) [56] and the EER are used as the primary and secondary metrics, respectively, as in previous works [57–64].

Table 9 Details of ASVspoof 2019 physical access corpus

5.2 Experimental results and analysis

Table 10 gives the experimental results on the ASVspoof 2019 physical access development set using dynamic features of CMOC and CVOC. From Table 10, according to t-DCF or EER, two conclusions can be drawn: (1) for CMOC, CMOC-A gives the best performance on the ASVspoof 2019 physical access development set, followed by CMOC-DA and CMOC-D; (2) for CVOC, CVOC-A performs better than CVOC-DA, which in turn performs better than CVOC-D on the ASVspoof 2019 physical access development set.

Table 10 Experimental results (t-DCF and EER(%)) on ASVspoof 2019 physical access development set using dynamic features of CMOC and CVOC

Table 11 gives the experimental results on the ASVspoof 2019 physical access evaluation set using the different dynamic features of CMOC and CVOC. From Table 11, several conclusions can be drawn: (1) for CMOC, CMOC-A gives the best performance on the ASVspoof 2019 physical access evaluation set, followed by CMOC-DA and CMOC-D; (2) for CVOC, CVOC-A performs better than CVOC-DA, which in turn performs better than CVOC-D; (3) comparing Table 10 with Table 11, CMOC-A and CVOC-A perform best on both the ASVspoof 2019 physical access development and evaluation sets; (4) CVOC-A performs better than CMOC-A on both sets, which again confirms that CVOC-A is more discriminative than CMOC-A, consistent with the results on the ASVspoof 2017 evaluation set.

Table 11 Experimental results (t-DCF and EER(%)) on ASVspoof 2019 physical access evaluation set using dynamic features of CMOC and CVOC

5.3 Comparison with modified log magnitude spectrum

In this subsection, the performance of the modified log magnitude spectra MMLMS and VMLMS is compared with that of the corresponding CMOC and CVOC on the ASVspoof 2019 physical access evaluation set. Table 12 gives the comparison in terms of t-DCF and EER, where a DNN is also used to model MMLMS-A and VMLMS-A. From Table 12, it can be seen that CMOC-A and CVOC-A perform much better than the corresponding MMLMS-A and VMLMS-A in terms of both t-DCF and EER. The reason is that more discriminative information is obtained by applying octave segmentation plus DCT to the modified spectra.

Table 12 Comparison with modified log magnitude spectrum on ASVspoof 2019 physical access evaluation set in terms of t-DCF and EER (%)

5.4 Comparison with Gaussian mixture model

In this subsection, the performance of CMOC-A and CVOC-A using the DNN is compared with that obtained using a GMM as the model for CMOC-A and CVOC-A on ASVspoof 2019 physical access. Table 13 shows the comparison on the ASVspoof 2019 physical access evaluation set in terms of t-DCF and EER, where the GMM has 512 mixture components. From Table 13, several conclusions can be drawn: (1) for CMOC-A, the t-DCF increases from 0.208 to 0.411 (an increase of 97.60%) and the EER increases from 11.447% to 21.128% (an increase of 84.57%); (2) for CVOC-A, the t-DCF increases from 0.178 to 0.379 (an increase of 107.30%) and the EER increases from 9.269% to 18.850% (an increase of 103.37%); (3) from this comparison, the DNN performs better than the GMM on the ASVspoof 2019 physical access evaluation set for CMOC-A and CVOC-A.

Table 13 Comparison with GMM on ASVspoof 2019 physical access evaluation set in terms of t-DCF and EER (%)

5.5 Comparison with some commonly used features

Table 14 gives the performance comparison among MFCC-A, CQCC-A, COC-A, eCQCC-A, CQSPIC-A, CMOC-A, and CVOC-A on the ASVspoof 2019 physical access evaluation set in terms of t-DCF and EER, where eCQCC denotes extended CQCC [38], CMPOC denotes constant-Q magnitude-phase octave coefficients [40], and CQSPIC denotes constant-Q statistics-plus-principal information coefficients [41]. MFCC-A, CQCC-A, eCQCC-A, CQSPIC-A, and COC-A each have their own DNN classifiers. From Table 14, according to t-DCF or EER, it can be seen that (1) CQCC-A, COC-A, eCQCC-A, CQSPIC-A, CVOC-A, and CMOC-A perform better than MFCC-A. The reason is that MFCC-A is based on the DFT, a short-term transform, while the other features are based on the CQT, a long-term transform that provides more frequency detail. (2) Both CMOC-A and CVOC-A perform better than COC-A, which again confirms that the proposed VMLMS and MMLMS are more discriminative than the LMS toward playback speech. (3) Similar to the comparison between CVOC and eCQCC on ASVspoof 2017 V2, CVOC also performs better than eCQCC on the ASVspoof 2019 physical access evaluation set, which means that CVOC is more discriminative than eCQCC on the two databases. (4) It is surprising to find that CVOC-A performs better than CQSPIC-A on the ASVspoof 2019 physical access evaluation set, unlike the comparison between them on the ASVspoof 2017 V2 evaluation set; the reason may be that CVOC extracts more discriminative information than CQSPIC on ASVspoof 2019 physical access. (5) CMOC-A and CVOC-A perform better than CQCC-A and COC-A because the modified log magnitude spectrum is used in the two feature extractions.

Table 14 Comparison with some commonly used features on ASVspoof 2019 physical access evaluation set in terms of t-DCF and EER (%)

5.6 Comparison with some other known systems

Table 15 gives the comparison with some known systems based on hand-crafted features on the ASVspoof 2019 physical access evaluation set, where LFCC denotes linear frequency cepstral coefficients. From Table 15, it can be seen that our systems perform better than the two known systems, probably because discriminative features are used in our systems.

Table 15 Comparison with some known systems on ASVspoof 2019 physical access evaluation set in terms of t-DCF and EER (%)

6 Conclusion

This paper addresses the problem of how to extract hand-crafted discriminative features for playback speech detection. Two methods to obtain the modified log magnitude spectrum are proposed by analyzing the discriminative power between genuine speech and playback speech with Fisher's ratio. CVOC and CMOC are then extracted by applying octave segmentation and the DCT to the VMLMS and MMLMS, respectively. The experimental results on the ASVspoof 2017 V2 and ASVspoof 2019 physical access databases show that both CVOC and CMOC perform better than some commonly used features because VMLMS and MMLMS enhance the discriminative power between genuine speech and playback speech. In addition, CVOC performs better than CMOC on the two databases, which indicates that CVOC is more discriminative than CMOC. The EER of CVOC on the ASVspoof 2017 V2 evaluation set reaches 11.46%, and the t-DCF on the ASVspoof 2019 physical access evaluation set reaches 0.165. It is somewhat surprising that the proposed method works so well; future work can explore how far this idea can be extended.

Availability of data and materials

The datasets used and analyzed during the current study are available online. ASVspoof 2017 V2 dataset is available from (https://datashare.is.ed.ac.uk/handle/10283/3055). ASVspoof 2019 physical access dataset is available from (https://datashare.is.ed.ac.uk/handle/10283/3336).

Abbreviations

ASV:

Automatic speaker verification

APSD:

Authentic and playback speech database

CQT:

Constant-Q transform

ResNet:

Residual network

CQCC:

Constant-Q cepstral coefficients

DFT:

Discrete Fourier transform

MFCC:

Mel frequency cepstral coefficients

VESA:

Variable length energy separation algorithm

GMM:

Gaussian mixture model

LMS:

Log magnitude spectrum

DCT:

Discrete cosine transform

VMLMS:

Variance-based modified log magnitude spectrum

MMLMS:

Mean-based modified log magnitude spectrum

CVOC:

Constant-Q variance-based octave coefficients

CMOC:

Constant-Q mean-based octave coefficients

EER:

Equal error rate

DNN:

Deep neural network

COC:

Constant-Q octave coefficients

logE:

Logarithm energy

qDFTspe:

Q-log domain DFT-based mean normalized log spectra

eCQCC:

Extended CQCC

CMPOC:

Constant-Q magnitude-phase octave coefficients

CQSPIC:

Constant-Q statistics-plus-principal information coefficients

CMVN:

Cepstral mean and variance normalization

t-DCF:

Tandem detection cost function

References

  1. T. Kinnunen, M. Sahidullah, M. Falcone, L. Costantini, R. G. Hautamaki, D. Thomsen, S. Achintya, Z. -H. Tan, H. Delgado, M. Todisco, N. Evans, V. Hautamaki, K. A. Lee, in IEEE International Conference on Acoustics, Speech, and Signal Processing. RedDots replayed: a new replay spoofing attack corpus for text-dependent speaker verification research, (2017), pp. 5395–5399. https://doi.org/10.1109/icassp.2017.7953187.

  2. T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, K. A. Lee, in Annual Conference of the International Speech Communication Association. The ASVspoof 2017 challenge: assessing the limits of replay spoofing attack detection, (2017), pp. 2–6. https://doi.org/10.21437/interspeech.2017-1111.

  3. K. N. R. K. R. Alluri, S. Achanta, S. R. Kadiri, S. V. Gangasheetty, A. K. Vuppala, in 18th Annual Conference of the International Speech Communication Association (INTERSPEECH). SFF anti-spoofer: IIIT-H submission for automatic speaker verification spoofing and countermeasures challenge 2017, (2017), pp. 107–111. https://doi.org/10.21437/interspeech.2017-676.

  4. Z. Chen, Z. Xie, W. Zhang, X. Xu, in 18th Annual Conference of the International Speech Communication Association (INTERSPEECH). ResNet and model fusion for automatic spoofing detection, (2017), pp. 102–106. https://doi.org/10.21437/interspeech.2017-1085.

  5. S. Jelil, R. K. Das, S. R. M. Prasanna, R. Sinha, in 18th Annual Conference of the International Speech Communication Association (INTERSPEECH). Spoof detection using source, instantaneous frequency and cepstral features, (2017), pp. 22–26. https://doi.org/10.21437/interspeech.2017-930.

  6. H. A. Patil, M. R. Kamble, T. B. Patel, M. Soni, in 18th Annual Conference of the International Speech Communication Association (INTERSPEECH). Novel variable length Teager energy separation based on instantaneous frequency features for replay detection, (2017), pp. 12–16. https://doi.org/10.21437/interspeech.2017-1362.

  7. A. G. Alanís, A. M. Peinado, J. A. Gonzalez, A. Gomez, in 19th Annual Conference of the International Speech Communication Association (INTERSPEECH). A deep identity representation for noise robust spoofing detection, (2018), pp. 676–680. https://doi.org/10.21437/interspeech.2018-1909.

  8. G. Suthokumar, V. Sethu, C. Wijenayake, E. Ambikairajah, in 19th Annual Conference of the International Speech Communication Association (INTERSPEECH). Modulation dynamic features for the detection of replay attacks, (2018), pp. 691–695. https://doi.org/10.21437/interspeech.2018-1846.

  9. B. Wickramasinghe, S. Irtza, E. Ambikairajah, J. Epps, in 19th Annual Conference of the International Speech Communication Association (INTERSPEECH). Frequency domain linear prediction features for replay spoofing attack detection, (2018), pp. 661–665. https://doi.org/10.21437/interspeech.2018-1574.

  10. M. S. Saranya, H. Murthy, in 19th Annual Conference of the International Speech Communication Association (INTERSPEECH). Decision-level feature switching as a paradigm for replay attack detection, (2018), pp. 686–690. https://doi.org/10.21437/interspeech.2018-1494.

  11. L. Li, Y. Chen, D. Wang, T. F. Zheng, in 18th Annual Conference of the International Speech Communication Association (INTERSPEECH). A study on replay attack and anti-spoofing for automatic speaker verification, (2017), pp. 92–96. https://doi.org/10.21437/interspeech.2017-456.

  12. R. Font, J. M. Espín, M. J. Cano, in 18th Annual Conference of the International Speech Communication Association (INTERSPEECH). Experimental analysis of features for replay attack detection-results on the ASVspoof 2017 challenge, (2017), pp. 7–11. https://doi.org/10.21437/interspeech.2017-450.

  13. M. Witkowski, S. Kacprzak, P. Zelasko, K. Kowalczyk, J. Galka, in 18th Annual Conference of the International Speech Communication Association (INTERSPEECH). Audio replay attack detection using high-frequency features, (2017), pp. 27–31. https://doi.org/10.21437/interspeech.2017-776.

  14. G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudasher, V. Shchemelinin, in 18th Annual Conference of the International Speech Communication Association (INTERSPEECH). Audio replay attack detection with deep learning framework, (2017), pp. 82–86. https://doi.org/10.21437/interspeech.2017-360.

  15. D. Li, L. Wang, J. Dang, M. Liu, Z. Oo, S. Nakagawa, H. Guan, X. Li, in 19th Annual Conference of the International Speech Communication Association (INTERSPEECH). Multiple phase information combination for replay attacks detection, (2018), pp. 656–660. https://doi.org/10.21437/interspeech.2018-2001.

  16. T. Gunendradasan, B. Wickramasinghe, N. P. Le, E. Ambikairajah, J. Epps, in 19th Annual Conference of the International Speech Communication Association (INTERSPEECH). Detection of replay-spoofing attacks using frequency modulation features, (2018), pp. 636–640. https://doi.org/10.21437/interspeech.2018-1473.

  17. H. B. Sailor, M. R. Kamble, H. A. Patil, in 19th Annual Conference of the International Speech Communication Association (INTERSPEECH). Auditory filterbank learning for temporal modulation features in replay spoof speech detection, (2018), pp. 666–670. https://doi.org/10.21437/interspeech.2018-1651.

  18. M. Kamble, H. Tak, H. A. Patil, in 19th Annual Conference of the International Speech Communication Association (INTERSPEECH). Effectiveness of speech demodulation-based features for replay detection, (2018), pp. 641–645. https://doi.org/10.21437/interspeech.2018-1675.

  19. W. Cai, H. Wu, D. Cai, M. Li, in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH). The DKU replay detection system for the ASVspoof 2019 challenge: on data augmentation, feature representation, classifier and fusion (Graz, Austria, 2019), pp. 1023–1027. https://doi.org/10.21437/interspeech.2019-1230.

  20. R. Bialobrzeski, M. Kosmider, M. Matuszewski, M. Plata, A. Rakowski, in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH). Robust Bayesian and light neural networks for voice spoofing detection (Graz, Austria, 2019), pp. 1028–1032. https://doi.org/10.21437/interspeech.2019-2676.

  21. G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, A. Kozlov, in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH). STC antispoofing systems for the ASVspoof2019 challenge (Graz, Austria, 2019), pp. 1033–1037. https://doi.org/10.21437/interspeech.2019-1768.

  22. Y. Yang, H. Wang, H. Dinkel, Z. Chen, S. Wang, Y. Qian, K. Yu, in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH). The SJTU robust anti-spoofing system for the ASVspoof 2019 challenge (Graz, Austria, 2019), pp. 1038–1042. https://doi.org/10.21437/interspeech.2019-2170.

  23. J. -w. Jung, H. -j. Shim, H. -S. Heo, H. -J. Yu, in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH). Replay attack detection with complementary high-resolution information using end-to-end for the ASVspoof 2019 challenge (Graz, Austria, 2019), pp. 1083–1087. https://doi.org/10.21437/interspeech.2019-1991.

  24. W. Shang, M. Stevenson, in Proceedings of the Canadian Conference on Electrical and Computer Engineering. A preliminary study of factors affecting the performance of a playback attack detector, (2008), pp. 459–464. https://doi.org/10.1109/ccece.2008.4564576.

  25. Z. Wang, Q. He, X. Zhang, H. Luo, Z. Su, in Journal of South China University of Technology (Natural Science Edition). Playback attack detection based on channel pattern noise, (2011), pp. 1708–1713.

  26. S. K. Ergunay, E. Khoury, A. Lazaridis, S. Marcel, in Proceedings of the Seventh IEEE International Conference on Biometric Theory, Application Ans Systems. On the vulnerability of speaker verification to realistic voice spoofing, (2015), pp. 1–6. https://doi.org/10.1109/btas.2015.7358783.

  27. W. Shang, M. Stevenson, in IEEE International Conference on Acoustic, Speech and Signal Processing. Score normalization in playback attack detection, (2010), pp. 1678–1681. https://doi.org/10.1109/icassp.2010.5495503.

  28. Z. Wang, G. Wei, Q. He, in Proceedings of the 2011 International Conference on Machine Learning and Cybernetics, vol. 39. Channel pattern noise based on playback attack detection algorithm for speaker recognition, (2011), pp. 5–12. https://doi.org/10.1109/icmlc.2011.6016982.

  29. C. Wang, Y. Zou, S. Liu, W. Zheng, in IEEE Second International Conference on Multimedia Big Data. An efficient learning based smartphone playback attack detection using GMM supervector, (2016), pp. 385–389. https://doi.org/10.1109/bigmm.2016.14.

  30. H. Delgado, M. Todisco, M. Sahidullah, N. Evans, T. Kinnunen, K. Lee, J. Yamagishi, in Speaker and Language Recognition Workshop. ASVspoof 2017 version 2.0: meta-data analysis and baseline enhancements, (2018), pp. 296–303. https://doi.org/10.21437/odyssey.2018-42.

  31. K. Sriskandaraja, V. Sethu, E. Ambikairajah, in 19th Annual Conference of the International Speech Communication Association (INTERSPEECH). Deep siamese architecture based replay detection for secure voice biometric, (2018), pp. 671–675. https://doi.org/10.21437/interspeech.2018-1819.

  32. F. Tom, M. Jain, P. Dey, in 19th Annual Conference of the International Speech Communication Association (INTERSPEECH). End-to-end audio replay attack detection using deep convolutional networks with attention, (2018), pp. 681–685. https://doi.org/10.21437/interspeech.2018-2279.

  33. X. Cheng, M. Xu, T. F. Zheng, in Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Replay detection using cqt-based modified group (Lanzhou, China, 2019), pp. 540–545. https://doi.org/10.1109/apsipaasc47483.2019.9023158.

  34. M. Todisco, H. Delgado, N. Evans, in Speaker and Language Recognition Workshop. A new feature for automatic speaker verification anti-spoofing: constant Q cepstral coefficients, (2016). https://doi.org/10.21437/odyssey.2016-41.

  35. M. Todisco, H. Delgado, N. Evans, Constant Q cepstral coefficients: a spoofing countermeasure for automatic Speaker verification. Comput. Speech. Lang.45:, 516–535 (2017).

  36. Z. Ji, Z. -Y. Li, P. Li, M. An, S. Gao, D. Wu, F. Zhao, in 18th Annual Conference of the International Speech Communication Association (INTERSPEECH). Ensemble learning for countermeasure of audio replay spoofing attack in ASVspoof 2017, (2017), pp. 87–91. https://doi.org/10.21437/interspeech.2017-1246.

  37. X. Wang, Y. Xiao, X. Zhu, in 18th Annual Conference of the International Speech Communication Association (INTERSPEECH). Feature selection based on CQCCs for automatic speaker verification spoofing, (2017), pp. 32–36. https://doi.org/10.21437/interspeech.2017-304.

  38. J. Yang, R. K. Das, H. Li, in Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Extended constant-Q cepstral coefficients for detection of spoofing attacks, (2018), pp. 1024–1029. https://doi.org/10.23919/apsipa.2018.8659537.

  39. R. K. Das, J. Yang, H. Li, in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH). Long range acoustic features for spoofed speech detection (Graz, Austria, 2019), pp. 1058–1062. https://doi.org/10.21437/interspeech.2019-1887.

  40. J. Yang, L. Liu, Playback speech detection based on magnitude-phase spectrum. Electron. Lett. 54(14), 901–903 (2018).

  41. J. Yang, C. You, Q. He, in 19th Annual Conference of the International Speech Communication Association (INTERSPEECH). Feature with complementarity of statistics and principal information for spoofing detection, (2018), pp. 651–655. https://doi.org/10.21437/interspeech.2018-1693.

  42. J. Alam, G. Bhattacharya, P. Kenny, in The Speaker and Language Recognition Workshop (ODYSSEY). Boosting the performance of spoofing detection systems of replay attacks using Q-logarithm domain feature normalization, (2018), pp. 393–398. https://doi.org/10.21437/odyssey.2018-55.

  43. M. Kamble, H. A. Patil, in 19th Annual Conference of the International Speech Communication Association (INTERSPEECH). Novel variable length energy separation algorithm using instantaneous amplitude features for replay detection, (2018), pp. 646–650. https://doi.org/10.21437/interspeech.2018-1687.

  44. S. Jelil, R. K. Das, S. R. M. Prasanna, R. Sinha, in 18th Annual Conference of the International Speech Communication Association (INTERSPEECH). Spoof detection using source, instantaneous frequency and cepstral features, (2017), pp. 22–26. https://doi.org/10.21437/interspeech.2017-930.

  45. R. K. Das, H. Li, in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Instantaneous phase and excitation source features for detection of replay attacks (Honolulu, Hawaii, 2018), pp. 1030–1037. https://doi.org/10.23919/apsipa.2018.8659789.

  46. K. N. R. K. R. Alluri, A. K. Vuppala, Replay spoofing countermeasures using high spectro-temporal resolution features. Int. J. Speech Technol., 1–11 (2019). https://doi.org/10.1007/s10772-019-09602-z.

  47. K. N. R. K. R. Alluri, A. K. Vuppala, in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH). IIIT-H spoofing countermeasures for automatic speaker verification spoofing and countermeasures challenge 2019 (Graz, Austria, 2019), pp. 1043–1047. https://doi.org/10.21437/interspeech.2019-1623.

  48. M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, K. A. Lee, in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH). ASVspoof 2019: future horizons in spoofed and fake audio detection (Graz, Austria, 2019), pp. 1008–1012.

  49. J. J. Wolf, Efficient acoustic parameters for speaker recognition. J. Acoust. Soc. Am. 51(6B), 2044–2056 (1972).

  50. D. Paul, M. Pal, G. Saha, Spectral features for synthetic speech detection. IEEE J. Sel. Top. Signal Process. 11, 605–617 (2017).

  51. J. Youngberg, S. Boll, in IEEE International Conference on Acoustics, Speech, and Signal Processing. Constant-Q signal analysis and synthesis, (1978), pp. 375–378. https://doi.org/10.1109/icassp.1978.1170547.

  52. J. C. Brown, Calculation of a constant Q spectral transform. J. Acoust. Soc. Am. 89(1), 425–434 (1991).

  53. J. Yang, C. You, Q. He, in 19th Annual Conference of the International Speech Communication Association (INTERSPEECH). Feature with complementarity of statistics and principal information for spoofing detection, (2018), pp. 651–655. https://doi.org/10.21437/interspeech.2018-1693.

  54. F. Seide, G. Li, X. Chen, D. Yu, in IEEE Workshop on Automatic Speech Recognition and Understanding. Feature engineering in context-dependent deep neural networks for conversational speech transcription, (2011), pp. 24–29. https://doi.org/10.1109/asru.2011.6163899.

  55. F. Seide, A. Agarwal, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. CNTK: Microsoft’s open-source deep learning toolkit, (2016), p. 2135. https://doi.org/10.1145/2939672.2945397.

  56. T. Kinnunen, K. A. Lee, H. Delgado, N. Evans, M. Todisco, M. Sahidullah, J. Yamagishi, D. A. Reynolds, in The Speaker and Language Recognition Workshop (ODYSSEY). t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification, (2018), pp. 312–319. https://doi.org/10.21437/odyssey.2018-44.

  57. C. -I. Lai, N. Chen, J. Villalba, N. Dehak, in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH). ASSERT: anti-spoofing with squeeze-excitation and residual networks (Graz, Austria, 2019), pp. 1013–1017. https://doi.org/10.21437/interspeech.2019-1794.

  58. B. Chettri, D. Stoller, V. Morfi, M. A. M. Ramirez, E. Benetos, B. L. Sturm, in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH). Ensemble models for spoofing detection in automatic speaker verification (Graz, Austria, 2019), pp. 1018–1022. https://doi.org/10.21437/interspeech.2019-2505.

  59. R. Li, M. Zhao, Z. Li, L. Li, Q. Hong, in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH). Anti-spoofing speaker verification system with multi-feature integration and multi-task learning (Graz, Austria, 2019), pp. 1048–1052. https://doi.org/10.21437/interspeech.2019-1698.

  60. J. Williams, J. Rownicka, in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH). Speech replay detection with x-vector attack embeddings and spectral features (Graz, Austria, 2019), pp. 1053–1057. https://doi.org/10.21437/interspeech.2019-1760.

  61. S. -Y. Chang, K. C. Wu, C. -P. Chen, in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH). Transfer-representation learning for detecting spoofing attacks with converted and synthesized speech in automatic speaker verification system (Graz, Austria, 2019), pp. 1063–1067. https://doi.org/10.21437/interspeech.2019-2014.

  62. A. Gomez-Alanis, A. M. Peinado, J. A. Gonzalez, A. M. Gomez, in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH). A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection (Graz, Austria, 2019), pp. 1068–1072. https://doi.org/10.21437/interspeech.2019-2212.

  63. M. Alzantot, Z. Wang, M. B. Srivastava, in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH). Deep residual neural networks for audio spoofing detection (Graz, Austria, 2019), pp. 1078–1082. https://doi.org/10.21437/interspeech.2019-3174.

  64. R. K. Das, J. Yang, H. Li, in Automatic Speech Recognition and Understanding Workshop (ASRU). Long range acoustic and deep features perspective on ASVspoof 2019 (Sentosa Island, Singapore, 2019), pp. 1018–1025. https://doi.org/10.1109/asru46091.2019.9003845.

Funding

This work was supported by the Programmatic Grant A1687b0033 through the Singapore Government’s Research, Innovation and Enterprise 2020 Plan (Advanced Manufacturing and Engineering domain), Shanghai Sailing Program (no. 19YF1402000), the Fundamental Research Funds for the Central Universities (no. 2232019D3-52), the National Natural Science Foundation of China (no. 61601248), and the Initial Research Funds for Young Teachers of Donghua University.

Author information

Contributions

JY and LX designed the idea. BR and YJ conducted the experiments. All authors contributed to the writing of this work. In addition, JY and LX contributed equally to this work and are joint first authors. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Longting Xu.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Yang, J., Xu, L., Ren, B. et al. Discriminative features based on modified log magnitude spectrum for playback speech detection. J AUDIO SPEECH MUSIC PROC. 2020, 6 (2020). https://doi.org/10.1186/s13636-020-00173-5

Keywords