Discriminative features based on modified log magnitude spectrum for playback speech detection

To improve the performance of hand-crafted features for playback speech detection, this work proposes two discriminative features: constant-Q variance-based octave coefficients and constant-Q mean-based octave coefficients. They rely on our findings that the variance-based modified log magnitude spectrum and the mean-based modified log magnitude spectrum can enhance the discriminative power between genuine speech and playback speech. Constant-Q variance-based octave coefficients (constant-Q mean-based octave coefficients) are obtained by combining the variance-based modified log magnitude spectrum (mean-based modified log magnitude spectrum), octave segmentation, and the discrete cosine transform. Both features are evaluated on the ASVspoof 2017 corpus version 2.0 and on ASVspoof 2019 physical access. Experimental results show that the variance-based modified log magnitude spectrum and the mean-based modified log magnitude spectrum can produce discriminative features for playback speech detection. Further results on the two databases show that constant-Q variance-based octave coefficients and constant-Q mean-based octave coefficients perform better than some common features, such as mel frequency cepstral coefficients and constant-Q cepstral coefficients.

, the output layer of the neural network can be seen as a virtual classifier and the rest of the neural network can be seen as a deep feature extractor. In this paper, we mainly focus on how to extract discriminative features for playback speech detection.

Related works
Before 2017, several studies on playback speech detection had been reported. The earlier ones [24-26] were based on small-scale databases, in which only a small number of playback and recording conditions were taken into account. For example, in [24,27], three playback and recording devices were used to collect the database; in [25,28], one recording device and one playback device were used to create the database, which is named the authentic and playback speech database (APSD); in [29], the database was built with four smartphones; and in [26], four devices were used to create the playback utterances in the database, which is named audio-visual spoofing 2015 (AVspoof 2015).
Different from the above databases, the ASVspoof 2017 corpus provided a large common database, obtained using 26 playback devices, 25 recording devices, and 26 environments [1,2,30]. The ASVspoof 2017 corpus can therefore be used to evaluate a playback speech detection algorithm fairly, because it covers more channel and acoustic conditions than any previous database [24-26]. In addition, the recently released ASVspoof 2019 physical access corpus can also be used in this study. Hence, ASVspoof 2017 and ASVspoof 2019 physical access not only support researchers in developing countermeasures but also help protect ASV systems against replay attacks [1].
After the ASVspoof 2017 and ASVspoof 2019 physical access corpora were released in 2017 and 2019, respectively, some effective methods were proposed to detect playback speech based on the two databases. According to how the features are generated, these methods can be categorized into two types: those using hand-crafted features and those using deep features. In general, hand-crafted features are designed using mathematical formulas, while deep features are learned from the input by neural networks.
Deep feature extraction usually contains two steps: the first is to train a classifier using a neural network and the input, and the second is to remove the output layer of the classifier. An end-to-end system can be seen as a virtual classifier (the output layer) plus a deep feature extractor (the remaining part); in other words, an end-to-end system becomes a deep feature extractor if its output layer is removed. So in this study, we also regard the features that can be obtained from end-to-end systems as a special type of deep features. According to the neural networks used, there are several types of deep features. For example, a light convolutional neural network was used to learn deep features from the log power spectrum of the constant-Q transform (CQT) and the fast Fourier transform in [14,20,21]; a deep Siamese network, formed by two convolutional neural networks with spectrogram input, was used to learn Siamese embedding features in [31]; and a residual network (ResNet) was used to learn deep features from the group delay gram in [19,32,33].
Among hand-crafted features, CQCC is the most widely used in playback speech detection; in the ASVspoof 2017 and ASVspoof 2019 challenges, the organizers used CQCC plus a Gaussian mixture model (GMM) as the baseline system [2,48]. The reason is that CQT is a long-term transform, and it can provide more frequency detail to capture playback information than the discrete Fourier transform (DFT). Generally speaking, deep features perform better than hand-crafted features in playback speech detection because deep learning can extract more information that is useful for discriminating playback speech from genuine speech. However, deep features rely heavily on training data; that is to say, they are suited only to the conditions covered by the training data. Further, if we want to study the properties of playback speech, hand-crafted features are preferable to deep features. Since the goal of this paper is to extract discriminative features for playback speech detection, our focus in this study is on how to extract hand-crafted discriminative features.
Traditional hand-crafted features used in speech signal processing, such as MFCC and CQCC, are not designed for playback speech detection. To improve the performance of hand-crafted features in detecting playback speech, we focus on designing discriminative features for playback speech detection in this study. We consider the following three facts:
• CQT is a long-term transform, and it can provide more frequency detail to capture the playback information compared with DFT.
• Many hand-crafted features, such as CQCC and MFCC, are extracted from the log power spectrum, which can be obtained from the log magnitude spectrum (LMS).
• A feature can have more discriminative power to distinguish playback speech from genuine speech if the discriminative power between genuine speech and playback speech can be enhanced.
Therefore, in this study, the CQT-based LMS is taken as the object of study to investigate how to enhance the discriminative power between genuine speech and playback speech, which yields the modified log magnitude spectrum. Finally, by combining it with octave segmentation and the discrete cosine transform (DCT), hand-crafted features with more discriminative power for playback speech detection are obtained; we call them hand-crafted discriminative features.

Contributions of the work
The goal of this work is to extract hand-crafted discriminative features for playback speech detection by enlarging the difference between genuine speech and playback speech. There are two main contributions. First, we found that the discriminative power between genuine speech and playback speech can be enhanced if the variance or the mean of the LMS is added to the LMS. Based on this finding, two methods are proposed to modify the LMS, which we refer to as the variance-based modified log magnitude spectrum (VMLMS) and the mean-based modified log magnitude spectrum (MMLMS). Here, the LMS is obtained using the CQT, which converts speech from the time domain into the frequency domain.
Second, by combining VMLMS, octave segmentation, and DCT, a new feature is obtained, namely constant-Q variance-based octave coefficients (CVOC). In the same way, the other feature, named constant-Q mean-based octave coefficients (CMOC), is obtained by combining MMLMS, octave segmentation, and DCT.
The remainder of the paper is organized as follows. Section 2 introduces the modified log magnitude spectrum. Section 3 introduces how to extract the discriminative features. Sections 4 and 5 give the experimental results and the corresponding analysis on the ASVspoof 2017 version 2.0 and ASVspoof 2019 physical access databases, respectively. Section 6 concludes the paper.

Proposed method I: modified log magnitude spectrum
In this section, to enhance the discriminative power between genuine speech and playback speech, two methods to modify the LMS, namely VMLMS and MMLMS, are proposed by analyzing this discriminative power. Fisher's ratio [49], which is often used to measure the discriminative power between two classes [50], is used here to measure the discriminative power between genuine speech and playback speech. It is defined as follows [11]:

$$F_{C_1 C_2} = \frac{\left(\bar{C}_1 - \bar{C}_2\right)^2}{\sigma_{C_1}^2 + \sigma_{C_2}^2} \tag{1}$$

where $C_1$ and $C_2$ denote the two classes, $F_{C_1 C_2}$ represents Fisher's ratio between $C_1$ and $C_2$, $\bar{C}_1$ and $\bar{C}_2$ represent the means of $C_1$ and $C_2$, respectively, and $\sigma_{C_1}^2$ and $\sigma_{C_2}^2$ represent the variances of $C_1$ and $C_2$, respectively.
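As a minimal sketch, Fisher's ratio in its common two-class form (the squared difference of the class means divided by the sum of the class variances) can be computed as follows. The function name `fisher_ratio`, the use of the population variance, and the toy values are assumptions made for illustration only.

```python
from statistics import fmean, pvariance


def fisher_ratio(class_1, class_2):
    """Squared difference of the class means over the sum of the class variances."""
    mean_1, mean_2 = fmean(class_1), fmean(class_2)
    # Population variance (divide by N) is an assumption of this sketch.
    var_1 = pvariance(class_1, mu=mean_1)
    var_2 = pvariance(class_2, mu=mean_2)
    return (mean_1 - mean_2) ** 2 / (var_1 + var_2)


# Toy values (hypothetical, not taken from any database).
genuine = [1.0, 2.0, 3.0]
playback = [4.0, 5.0, 6.0]
ratio = fisher_ratio(genuine, playback)
print(ratio)
```

A larger value indicates that the two classes are easier to separate, which is the sense in which the ratio is used in the rest of this section.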

Variance-based modified log magnitude spectrum
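We assume that $X_0$ and $Y_0$ are the magnitude spectrum of one frame of genuine speech and that of its corresponding playback speech, respectively, and that $K$ is the number of frequency bins. We can obtain

$$X_0 = [x_1, x_2, \ldots, x_K] \tag{2}$$

$$Y_0 = [y_1, y_2, \ldots, y_K] \tag{3}$$

In addition, $X_0$ and $Y_0$ in log scale, denoted as $X$ and $Y$, are

$$X = [\log(x_1), \log(x_2), \ldots, \log(x_K)] \tag{4}$$

$$Y = [\log(y_1), \log(y_2), \ldots, \log(y_K)] \tag{5}$$

Supposing $\bar{X}$ and $\bar{Y}$ are the means of $X$ and $Y$, respectively, we can obtain

$$\bar{X} = \frac{1}{K}\sum_{k=1}^{K}\log(x_k) \tag{6}$$

$$\bar{Y} = \frac{1}{K}\sum_{k=1}^{K}\log(y_k) \tag{7}$$

Supposing $\sigma_X^2$ and $\sigma_Y^2$ are the variances of $X$ and $Y$, respectively, we can obtain

$$\sigma_X^2 = \frac{1}{K}\sum_{k=1}^{K}\left(\log(x_k) - \bar{X}\right)^2 \tag{8}$$

$$\sigma_Y^2 = \frac{1}{K}\sum_{k=1}^{K}\left(\log(y_k) - \bar{Y}\right)^2 \tag{9}$$

Supposing $F_{XY}$ is Fisher's ratio between $X$ and $Y$, according to Eq. (1), we can obtain

$$F_{XY} = \frac{\left(\bar{X} - \bar{Y}\right)^2}{\sigma_X^2 + \sigma_Y^2} \tag{10}$$

Supposing $X'$ and $Y'$ satisfy

$$X' = [\log(x_1) + \sigma_X^2, \log(x_2) + \sigma_X^2, \ldots, \log(x_K) + \sigma_X^2] \tag{11}$$

$$Y' = [\log(y_1) + \sigma_Y^2, \log(y_2) + \sigma_Y^2, \ldots, \log(y_K) + \sigma_Y^2] \tag{12}$$

the means of $X'$ and $Y'$, denoted as $\bar{X'}$ and $\bar{Y'}$, are as follows:

$$\bar{X'} = \bar{X} + \sigma_X^2 \tag{13}$$

$$\bar{Y'} = \bar{Y} + \sigma_Y^2 \tag{14}$$

The variances of $X'$ and $Y'$, denoted as $\sigma_{X'}^2$ and $\sigma_{Y'}^2$, are unchanged by the constant offset:

$$\sigma_{X'}^2 = \sigma_X^2 \tag{15}$$

$$\sigma_{Y'}^2 = \sigma_Y^2 \tag{16}$$

Supposing $F_{X'Y'}$ is Fisher's ratio between $X'$ and $Y'$, according to Eq. (1), we can obtain

$$F_{X'Y'} = \frac{\left(\bar{X} + \sigma_X^2 - \bar{Y} - \sigma_Y^2\right)^2}{\sigma_X^2 + \sigma_Y^2} \tag{17}$$

Let

$$F_{mro} = \frac{F_{X'Y'}}{F_{XY}} = \left(1 + F_{vrs}\right)^2 \tag{18}$$

$$F_{vrs} = \frac{\sigma_X^2 - \sigma_Y^2}{\bar{X} - \bar{Y}} \tag{19}$$

From Eqs. (18) and (19), we can see that $F_{mro}$ is determined by $F_{vrs}$, which in turn is determined by $\sigma_X^2$, $\sigma_Y^2$, $\bar{X}$, and $\bar{Y}$. However, as these parameters are unknown, it is not possible to determine the values of $F_{mro}$ and $F_{vrs}$ directly.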
Therefore, statistical analysis can be used to obtain $F_{vrs}$. To this end, APSD [25] and AVspoof 2015 [26] are used here. There are two reasons for selecting these two databases. One is that they are the two largest publicly available databases of genuine-playback utterance pairs to date, with 3600 and 5600 pairs, respectively. The other is that the former is designed for replay speech detection and the latter for replay spoofing detection and synthetic speech detection. The 3600 genuine-playback utterance pairs from APSD and the 5600 genuine-playback utterance pairs from AVspoof 2015 can be used to obtain the statistics of $F_{vrs}$ for different $\sigma_X^2$, $\sigma_Y^2$, $\bar{X}$, and $\bar{Y}$. The CQT is applied to the utterances from the two databases to compute $F_{vrs}$ from $\sigma_X^2$, $\sigma_Y^2$, $\bar{X}$, and $\bar{Y}$ frame by frame. Finally, the average of $F_{vrs}$ over all frames is obtained, denoted as $\bar{F}_{vrs}$. The averages of $\sigma_X^2$, $\sigma_Y^2$, $\bar{X}$, and $\bar{Y}$ are obtained and denoted in the same way. Table 1 shows the statistics of $\bar{F}_{vrs}$ on APSD and AVspoof 2015. From Table 1, it can be observed that $\bar{F}_{vrs}$ is above 0 for both APSD and AVspoof 2015. According to the relationship between $F_{mro}$ and $F_{vrs}$ in Eqs. (18) and (19), the corresponding value of $F_{mro}$ is above 1 on the two databases, which means that $F_{X'Y'}$ is larger than $F_{XY}$.
The above discussion leads to the finding that the discriminative power between $X'$ and $Y'$ is greater than that between $X$ and $Y$. Based on the comparison between Eqs. (4) and (11) and between Eqs. (5) and (12), a method to modify the LMS is proposed to enhance the discriminative power between genuine speech and playback speech: the VMLMS is obtained by adding the variance of the LMS to the LMS. Figure 1(a) shows the framework for obtaining VMLMS on the basis of LMS.
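The effect of adding the variance can be illustrated numerically on a single toy frame pair. The sketch below is only an illustration under stated assumptions: it takes F_vrs to be the difference of the LMS variances divided by the difference of the LMS means and F_mro to be the ratio of the Fisher's ratios after and before the modification, and the magnitude values `x0` and `y0` are hypothetical rather than drawn from APSD or AVspoof 2015.

```python
import math
from statistics import fmean, pvariance


def fisher_ratio(a, b):
    """Squared mean difference over the sum of (population) variances."""
    return (fmean(a) - fmean(b)) ** 2 / (pvariance(a) + pvariance(b))


def log_magnitude(frame):
    """Log magnitude spectrum (LMS) of one frame of positive magnitudes."""
    return [math.log(v) for v in frame]


def add_variance(lms):
    """Variance-based modification: add the variance of the LMS to every bin."""
    var = pvariance(lms)
    return [v + var for v in lms]


# Hypothetical magnitude spectra of one genuine frame and one playback frame.
x0 = [1.0, 2.0, 4.0, 8.0]
y0 = [1.0, 1.5, 2.0, 2.5]
x, y = log_magnitude(x0), log_magnitude(y0)

f_xy = fisher_ratio(x, y)                               # before modification
f_mod = fisher_ratio(add_variance(x), add_variance(y))  # after modification
f_vrs = (pvariance(x) - pvariance(y)) / (fmean(x) - fmean(y))
f_mro = f_mod / f_xy
print(f_vrs, f_mro)
```

For this toy pair, F_vrs is positive and the modified Fisher's ratio is larger than the original one. Whether that holds on real data depends on the statistics of the database, which is why the text relies on the averages reported in Table 1.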

Mean-based modified log magnitude spectrum
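Supposing $X''$ and $Y''$ satisfy

$$X'' = [\log(x_1) + \bar{X}, \log(x_2) + \bar{X}, \ldots, \log(x_K) + \bar{X}] \tag{20}$$

$$Y'' = [\log(y_1) + \bar{Y}, \log(y_2) + \bar{Y}, \ldots, \log(y_K) + \bar{Y}] \tag{21}$$

the means of $X''$ and $Y''$, denoted as $\bar{X''}$ and $\bar{Y''}$, are as follows:

$$\bar{X''} = 2\bar{X} \tag{22}$$

$$\bar{Y''} = 2\bar{Y} \tag{23}$$

The variances of $X''$ and $Y''$, denoted as $\sigma_{X''}^2$ and $\sigma_{Y''}^2$, are as follows:

$$\sigma_{X''}^2 = \sigma_X^2 \tag{24}$$

$$\sigma_{Y''}^2 = \sigma_Y^2 \tag{25}$$

Supposing $F_{X''Y''}$ is Fisher's ratio between $X''$ and $Y''$, according to Eq. (1), we can obtain

$$F_{X''Y''} = \frac{\left(2\bar{X} - 2\bar{Y}\right)^2}{\sigma_X^2 + \sigma_Y^2} = 4F_{XY} \tag{26}$$

From Eq. (26), we can see that $F_{X''Y''}$ is four times $F_{XY}$. In other words, the discriminative power between $X''$ and $Y''$ is greater than that between $X$ and $Y$. Hence, the other method to modify the LMS is proposed: the MMLMS is obtained by adding the mean of the LMS to the LMS. Figure 2(a) shows the framework for obtaining MMLMS on the basis of LMS.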
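The factor of four can be checked numerically with a short sketch. Adding the mean of a frame to every bin doubles its mean and leaves its variance unchanged, so Fisher's ratio is multiplied by four whatever the input values are. The function names and the toy log-spectra below are hypothetical.

```python
from statistics import fmean, pvariance


def fisher_ratio(a, b):
    """Squared mean difference over the sum of (population) variances."""
    return (fmean(a) - fmean(b)) ** 2 / (pvariance(a) + pvariance(b))


def add_mean(lms):
    """Mean-based modification: add the mean of the LMS to every bin."""
    mean = fmean(lms)
    return [v + mean for v in lms]


# Hypothetical log magnitude spectra of a genuine frame and a playback frame.
x = [0.2, 1.1, 0.7, 1.9, 1.4]
y = [0.1, 0.6, 0.3, 0.9, 0.8]

gain = fisher_ratio(add_mean(x), add_mean(y)) / fisher_ratio(x, y)
print(gain)
```

Unlike the variance-based case, this gain does not depend on database statistics, which matches the purely analytical argument in the text.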

Proposed method II: hand-crafted discriminative features extraction
In this section, CVOC and CMOC extraction is introduced. Figures 1 and 2 show the block diagrams of CVOC and CMOC extraction, respectively. From Fig. 1, it can be seen that CVOC extraction consists of two parts: (a) VMLMS extraction and (b) CVOC extraction, in which CVOC is obtained on the basis of VMLMS. There are five modules in VMLMS extraction: CQT, magnitude spectrum, log, variance, and add. There are two modules in CVOC extraction on the basis of VMLMS: octave segmentation and DCT.
From Fig. 2, it can be observed that CMOC extraction also consists of two parts: (a) MMLMS extraction and (b) CMOC extraction, in which CMOC is obtained on the basis of MMLMS. There are five modules in MMLMS extraction: CQT, magnitude spectrum, log, mean, and add. There are two modules in CMOC extraction on the basis of MMLMS: octave segmentation and DCT.
The CQT module converts speech from the time domain into the frequency domain. The magnitude spectrum module computes the magnitude spectrum from the CQT. The log module is used to obtain the LMS. The variance (mean) and add modules are used to obtain VMLMS (MMLMS) on the basis of LMS. Octave segmentation divides the frequency bins of VMLMS (MMLMS) into blocks according to octave. The DCT extracts the principal information of every block. Next, CQT, octave segmentation, and DCT are introduced in detail.
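The last two modules can be sketched as follows for a single frame of a modified LMS. This is a minimal illustration, not the exact formulation used in the paper: the unnormalized DCT-II, the zero-based coefficient index, and all function names are assumptions, and the input frame is a synthetic stand-in. The values B = 96, R = 9, and Z = 12 are the ones quoted in the experimental setup.

```python
import math


def dct_ii(block, num_coeffs):
    """Unnormalized DCT-II of one block, keeping the first num_coeffs coefficients."""
    n = len(block)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * z / n) for i, v in enumerate(block))
        for z in range(num_coeffs)
    ]


def octave_blocks(spectrum, bins_per_octave):
    """Split the frequency bins into non-overlapping blocks, one per octave."""
    if len(spectrum) % bins_per_octave != 0:
        raise ValueError("number of bins must be a multiple of bins_per_octave")
    return [
        spectrum[start:start + bins_per_octave]
        for start in range(0, len(spectrum), bins_per_octave)
    ]


def octave_coefficients(modified_lms, bins_per_octave, num_coeffs):
    """Apply the DCT to every octave block and concatenate the kept coefficients."""
    features = []
    for block in octave_blocks(modified_lms, bins_per_octave):
        features.extend(dct_ii(block, num_coeffs))
    return features


# Synthetic stand-in for one VMLMS or MMLMS frame with R * B = 9 * 96 bins.
frame = [math.sin(0.01 * k) for k in range(9 * 96)]
coeffs = octave_coefficients(frame, bins_per_octave=96, num_coeffs=12)
print(len(coeffs))
```

With nine octaves and twelve coefficients per octave, the resulting static feature vector has 9 × 12 = 108 dimensions per frame.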

Constant-Q transform
The CQT was proposed in [51,52]. Here, $Q$ is defined as the ratio of center frequency to bandwidth, as in Eq. (27), in which $f_m$ is the center frequency and $\delta f$ is the bandwidth:

$$Q = \frac{f_m}{\delta f} \tag{27}$$

where $f_m$ represents the center frequency of the $m$th frequency bin, and it obeys

$$f_m = f_1 2^{\frac{m-1}{B}} \tag{28}$$

where $f_1$ is the center frequency of the lowest-frequency bin and $B$ is the number of bins in every octave. From Eq. (28), we can see that every frequency bin has a different bandwidth: the larger $m$ is, the wider the bandwidth. This is different from the DFT, in which every frequency bin has the same bandwidth. For a discrete time-domain signal $x(n)$, its CQT, $Y(m, n)$, is defined as

$$Y(m, n) = \sum_{n_1 = n - \lfloor N_m/2 \rfloor}^{n + \lfloor N_m/2 \rfloor} x(n_1)\, a_m^{*}\!\left(n_1 - n + \frac{N_m}{2}\right) \tag{29}$$

where $m = 1, 2, \ldots, K$ is the frequency bin index, $N_m$ are the variable window lengths, $a_m^{*}(n)$ denotes the complex conjugate of $a_m(n)$, and $\lfloor \cdot \rfloor$ denotes rounding toward negative infinity. The basis functions $a_m(n)$ are complex-valued time-frequency atoms and are defined by

$$a_m(n) = \frac{1}{C}\, \nu\!\left(\frac{n}{N_m}\right) \exp\!\left[i\left(2\pi n \frac{f_m}{f_s} + \phi_m\right)\right] \tag{30}$$

where $f_m$ is the center frequency of the $m$th bin, $f_s$ is the sampling rate, $\nu(t)$ is a window function (e.g., a Hanning window), and $\phi_m$ is a phase offset. $C$ is a scaling factor given by

$$C = \sum_{l = -\lfloor N_m/2 \rfloor}^{\lfloor N_m/2 \rfloor} \nu\!\left(\frac{l + N_m/2}{N_m}\right) \tag{31}$$
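The geometric spacing of the CQT bins, with center frequencies f_m = f_1 · 2^((m-1)/B), can be illustrated with a short sketch. The lowest center frequency of 20 Hz is a hypothetical value, and taking the bandwidth of a bin as the distance to the next center frequency is an assumption made here so that Q can be evaluated in closed form.

```python
def center_frequencies(f1, bins_per_octave, num_bins):
    """Geometrically spaced center frequencies f_m = f1 * 2 ** ((m - 1) / B)."""
    return [f1 * 2 ** ((m - 1) / bins_per_octave) for m in range(1, num_bins + 1)]


def quality_factor(bins_per_octave):
    """Q = f_m / (f_{m+1} - f_m), assuming the bandwidth equals the bin spacing."""
    return 1.0 / (2 ** (1.0 / bins_per_octave) - 1.0)


# Hypothetical lowest center frequency; B = 96 bins per octave and 9 octaves.
freqs = center_frequencies(f1=20.0, bins_per_octave=96, num_bins=9 * 96)
spacing = [hi - lo for lo, hi in zip(freqs, freqs[1:])]

print(freqs[96] / freqs[0])                    # one octave above the first bin
print(spacing[-1] > spacing[0])                # bandwidth grows with the bin index
print(freqs[0] / spacing[0], quality_factor(96))
```

The ratio of center frequency to spacing is the same for every bin, which is what makes the transform constant-Q, whereas a DFT keeps the spacing itself constant.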

Octave segmentation and discrete cosine transform
In our previous work, octave segmentation was proposed to segment the magnitude-phase spectrum [40] and the octave power spectrum [53]. In this work, octave segmentation is used to segment VMLMS (MMLMS) into non-overlapping blocks according to octave. After octave segmentation, every block has $B$ frequency bins. The DCT is then used to extract the principal information of every block. Next, we take VMLMS as an example to show how to calculate the final coefficients.
where $R$ represents the total number of octaves, and it satisfies $R = K/B$. The DCT is then applied to every block. For every block, the first $Z$ dimensions of the DCT result are selected as the feature ($Z$ is a positive integer), which gives the CVOC of $x(n)$, denoted as $\mathrm{CVOC}_{x(n)}$.
where z is from 1 to Z-1 and

Database introduction
The ASVspoof 2017 corpus was released after the ASVspoof 2017 challenge [1,30]. However, the organizers found that some zero-valued samples and silence in ASVspoof 2017 affect the results of playback speech detection. In 2018, the organizers updated ASVspoof 2017 by removing those zero-valued samples and silence, and named the corrected version ASVspoof 2017 V2 [30]. It consists of three subsets: training, development, and evaluation. Table 2 gives some details of ASVspoof 2017 V2.

Evaluation rule and experimental setup
In the ASVspoof 2017 challenge, participants are allowed to pool the training data and the development data to train a final model. Equal error rate (EER) is used as the evaluation metric. According to the ASVspoof 2017 challenge rule, two types of models are trained; the one used to evaluate the performance of the proposed features on the evaluation set is trained on the 4724 utterances pooled from the training and development sets.
In CQT, several important parameters affect the final performance: the number of bins in an octave (B), the number of octaves (R), the sampling period used for re-sampling to transform the octave power spectrum into a linear power spectrum [34], and gamma. In CMOC and CVOC extraction, all the CQT parameters are set according to [34]: B is set to 96, the sampling period to 16, gamma to 3.3026, and R to 9, which means that there are 9 octaves in CQT. In addition, in CMOC extraction, Z is set to 12, which means that 12 coefficients are obtained from every block after DCT is applied to it. Therefore, the static dimension of CMOC is 108 (9 blocks × 12 coefficients). Since our previous works [38,40] have shown that static features degrade the performance in playback speech detection, only dynamic features are used in this study. In the names of the different combinations of CMOC and CVOC dynamic features, D and A represent delta and acceleration, respectively.
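Since EER is the metric used for all results in this section, a minimal sketch of how it can be estimated from detection scores may be helpful. This is an illustrative approximation that scans the observed scores as thresholds. The scoring tools of the challenge may compute the value differently, and the function name and toy scores are hypothetical.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER: the error rate at the threshold where the false
    acceptance rate and the false rejection rate are closest to each other."""
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # A trial is accepted as genuine when its score is at least t.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical scores: higher means "more likely genuine".
genuine = [0.9, 0.8, 0.7, 0.3]
playback = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(genuine, playback)
print(eer)
```

A lower EER means that genuine and playback utterances are better separated by the scores.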
In this study, similar to our previous playback speech detection studies [38,40,41], a deep neural network (DNN) is selected as the classifier because we found that DNN-based systems give better performance. The reason may be that a DNN has both a classification function and feature-learning ability [54]. The Computational Network Toolkit [55] is used to train the DNNs, with stochastic gradient descent as the optimization method. In our experiments, a series of four-layer DNN classifiers are trained for the different feature combinations of CMOC and CVOC. Each has two hidden layers with 512 nodes per layer and an output layer with 2 nodes, and its input layer is formed by an 11-frame context window of the input feature vector (the current frame plus five frames on the left and five on the right). The number of input nodes therefore depends on the feature combination: for example, for CVOC-A, the number of input nodes is 108 × 11, while for CMOC-DA, it is 216 × 11.
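The 11-frame context window can be sketched as follows. The handling of the first and last frames of an utterance is not stated in the text, so repeating the edge frame is an assumption of this sketch, and the function name `splice` and the toy feature matrix are hypothetical.

```python
def splice(frames, left=5, right=5):
    """Concatenate every frame with its left and right neighbours.

    Frames beyond the utterance boundaries are replaced by the nearest
    edge frame (an assumption; other padding schemes are possible).
    """
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-left, right + 1):
            index = min(max(t + offset, 0), last)
            window.extend(frames[index])
        spliced.append(window)
    return spliced


# Toy utterance: 20 frames of a 108-dimensional feature such as CVOC-A.
frames = [[float(t)] * 108 for t in range(20)]
inputs = splice(frames)
print(len(inputs), len(inputs[0]))
```

Each spliced vector has 108 × 11 = 1188 values, which is the number of input nodes quoted for CVOC-A; a 216-dimensional feature such as CMOC-DA gives 216 × 11 = 2376.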
All the DNNs trained in our experiments follow the same method, which consists of the following: (1) The training criterion is cross-entropy with softmax. (2) Sigmoid units are used in the hidden layers. (3) Mean and variance normalization is applied to the input data. (4) Stochastic gradient descent is used for training. (5) There are 25 training epochs in total; the learning rate is set to 0.8 for the first epoch, 3.2 for the second to fifteenth epochs, and 0.08 for the sixteenth to twenty-fifth epochs. (6) The minibatch size is set to 256 for the first epoch and 1024 for the remaining epochs. (7) The momentum is set to 0.9.
Table 3 gives the experimental results on the ASVspoof 2017 V2 development set using the dynamic features of CMOC and CVOC. From Table 3, two conclusions can be drawn: (1) For CMOC, CMOC-A gives the best performance on the ASVspoof 2017 V2 development set, followed by CMOC-DA and CMOC-D. (2) For CVOC, CVOC-DA performs better than CVOC-A, and CVOC-A performs better than CVOC-D on the ASVspoof 2017 V2 development set.
Table 4 gives the experimental results on the ASVspoof 2017 V2 evaluation set using the different dynamic features of CMOC and CVOC. From Table 4, several conclusions can be drawn: (1) For CMOC, CMOC-DA gives the best performance on the ASVspoof 2017 V2 evaluation set, followed by CMOC-D and CMOC-A. (2) For CVOC, CVOC-DA performs better than CVOC-D, and CVOC-D performs better than CVOC-A on the ASVspoof 2017 V2 evaluation set. (3) Comparing Table 3 with Table 4, it can be seen that CMOC-A performs best on the development set while CMOC-DA performs best on the evaluation set, whereas CVOC-DA gives the best performance on both the development and evaluation sets. (4) CVOC-DA performs better than CMOC-DA on the ASVspoof 2017 V2 evaluation set.
As mentioned above, CVOC and CMOC are obtained by applying octave segmentation plus DCT to VMLMS and MMLMS, respectively. Further, VMLMS is justified by statistical analysis, while MMLMS is justified by a mathematical derivation. Though we cannot compare their discriminative power directly using Fisher's ratio, the experimental results indicate that CVOC has more discriminative power than CMOC on the ASVspoof 2017 V2 evaluation set.

Comparison with modified log magnitude spectrum
In this subsection, the performance of the modified log magnitude spectra, namely MMLMS and VMLMS, is compared with that of the corresponding CMOC and CVOC on the ASVspoof 2017 V2 evaluation set. Table 5 gives this comparison in terms of EER; a DNN is also used to model MMLMS and VMLMS. From Table 5, it can be seen that CMOC performs better than MMLMS and that CVOC performs better than VMLMS on the ASVspoof 2017 V2 evaluation set. The reason is that more discriminative information can be obtained by applying octave segmentation plus DCT to the modified spectra, which reduces the EER by 16.46% and 12.25%, respectively.

Comparison with Gaussian mixture model
In this subsection, the performance of CMOC-DA and CVOC-DA using the DNN is compared with the corresponding performance using a GMM as the model on ASVspoof 2017 V2. Table 6 shows the comparison with the GMM on the ASVspoof 2017 V2 evaluation set in terms of EER; the number of GMM mixture components is 512. From Table 6, several conclusions can be drawn: (1) For CMOC-DA, the EER increases from 14.16% with the DNN to 31.33% with the GMM, a relative increase of 121.26%.
(2) For CVOC-DA, the EER increases from 11.46% with the DNN to 30.56% with the GMM, a relative increase of 166.67%. (3) From this comparison, we can say that the DNN performs better than the GMM on the ASVspoof 2017 V2 evaluation set for CMOC-DA and CVOC-DA. The reason is that the DNN has feature-learning ability as well as classification ability, which also confirms that choosing the DNN for our studies is useful.
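The relative increases quoted above follow directly from the two pairs of EER values. The short check below reproduces them; the helper name `relative_increase` is hypothetical.

```python
def relative_increase(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


cmoc_da = relative_increase(14.16, 31.33)  # DNN EER -> GMM EER for CMOC-DA
cvoc_da = relative_increase(11.46, 30.56)  # DNN EER -> GMM EER for CVOC-DA
print(round(cmoc_da, 2), round(cvoc_da, 2))
```

Both figures are relative changes of the EER, not differences in percentage points; the absolute differences are 17.17 and 19.10 percentage points, respectively.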

Comparison with some commonly used features
In this section, some commonly used features, for example, MFCC and CQCC, are compared with CVOC and CMOC on the ASVspoof 2017 V2 evaluation set. In addition, if the variance and add modules are removed from Fig. 1, or the mean and add modules are removed from Fig. 2, the resulting feature can be named constant-Q octave coefficients (COC). It can be compared with CVOC and CMOC to show the role of VMLMS and MMLMS in CVOC and CMOC. Table 7 gives the performance comparison among MFCC-DA, CQCC-DA, COC-DA, CMOC-DA, and CVOC-DA on the ASVspoof 2017 V2 evaluation set in terms of EER. From Table 7, it can be seen that (1) the performance of CQCC-DA, COC-DA, CMOC-DA, and CVOC-DA is better than that of MFCC-DA. The reason is that MFCC-DA is based on the DFT, which is a short-term transform, while the other four features are based on the CQT, which is a long-term transform and can provide more frequency detail.
(2) Both CVOC-DA and CMOC-DA perform better than COC-DA, which means that the proposed VMLMS and MMLMS have more discriminative power toward playback speech than the unmodified LMS. It also supports the idea behind the proposed modification. (3) The performance of CVOC-DA and CMOC-DA is better than that of CQCC-DA and COC-DA because the modified log magnitude spectrum is used in the extraction of the two features.
Table 8 gives the comparison with some known systems based on hand-crafted features on the ASVspoof 2017 V2 evaluation set. Here, logE represents logarithm energy, qDFTspe represents the Q-log domain DFT-based mean normalized log spectrum [42], eCQCC represents extended CQCC [38], CMPOC represents constant-Q magnitude-phase octave coefficients [40], and CQSPIC represents constant-Q statistics-plus-principal information coefficients [41]. From Table 8, it can be seen that our systems perform better than some other known systems. The reason may be that discriminative features are used in our systems. However, our systems are a little worse than the system based on qDFTspe [42]. In addition, the feature combination SDA performs best in [30], while the feature combination DA performs best in our system. The reason is that cepstral mean and variance normalization (CMVN) is applied to the features in [30], which changes the feature distribution, while CMVN is not applied to our features. We also found that CQSPIC performs better than CVOC and CMOC. The reason is that CQSPIC is a combined feature that contains spectral principal information, subband information, and short-term spectral statistical information, while CVOC and CMOC contain only spectral principal information.

Database introduction and evaluation metric
In this section, CMOC and CVOC are evaluated on ASVspoof 2019 physical access [48], which was released in 2019 for the ASVspoof 2019 challenge; some details are given in Table 9. The corpus has three subsets: training, development, and evaluation. According to the ASVspoof 2019 challenge rule, the tandem detection cost function (t-DCF) [56] and EER are used as the primary and secondary metrics, respectively, as in previous works [57-64].
Table 11 gives the experimental results on the ASVspoof 2019 physical access evaluation set using the different dynamic features of CMOC and CVOC. From Table 11, several conclusions can be drawn: (1) For CMOC, CMOC-A gives the best performance on the ASVspoof 2019 physical access evaluation set, followed by CMOC-DA and CMOC-D. (2) For CVOC, CVOC-A performs better than CVOC-DA, and CVOC-DA performs better than CVOC-D on the ASVspoof 2019 physical access evaluation set. (3) Comparing Table 10 with Table 11, it can be seen that CMOC-A and CVOC-A perform best on both the ASVspoof 2019 physical access development and evaluation sets. (4) CVOC-A performs better than CMOC-A on the ASVspoof 2019 physical access development and evaluation sets, which also confirms that CVOC-A has more discriminative ability than CMOC-A, the same as on the ASVspoof 2017 evaluation set.

Comparison with modified log magnitude spectrum
In this subsection, the performance of the modified log magnitude spectra, MMLMS and VMLMS, is compared with that of the corresponding CMOC and CVOC on the ASVspoof 2019 physical access evaluation set. Table 12 gives this comparison; a DNN is also used to model MMLMS-A and VMLMS-A. From Table 12, it can be seen that CMOC-A and CVOC-A perform much better than the corresponding MMLMS-A and VMLMS-A on the ASVspoof 2019 physical access evaluation set in terms of both t-DCF and EER. The reason is that more discriminative information can be obtained by applying octave segmentation plus DCT to the modified spectra.

Comparison with Gaussian mixture model
In this subsection, the performance of CMOC-A and CVOC-A using the DNN is compared with the corresponding performance using a GMM as the model on ASVspoof 2019 physical access. Table 13 shows the comparison with the GMM on the ASVspoof 2019 physical access evaluation set in terms of EER; the number of GMM mixture components is 512. From Table 13, several conclusions can be obtained:

Comparison with some commonly used features
Table 14 gives the performance comparison among MFCC-A, CQCC-A, COC-A, CMOC-A, eCQCC-A, CQSPIC-A, and CVOC-A on the ASVspoof 2019 physical access evaluation set in terms of t-DCF and EER. Here, eCQCC represents extended CQCC [38], CMPOC represents constant-Q magnitude-phase octave coefficients [40], and CQSPIC represents constant-Q statistics-plus-principal information coefficients [41]. In addition, MFCC-A, CQCC-A, eCQCC-A, CQSPIC-A, and COC-A each have their own DNN classifier. From Table 14, according to t-DCF or EER, it can be seen that (1) the performance of CQCC-A, COC-A, eCQCC-A, CQSPIC-A, CVOC-A, and CMOC-A is better than that of MFCC-A. The reason is that MFCC-A is based on the DFT, which is a short-term transform, while the other features are based on the CQT, which is a long-term transform and can provide more frequency detail.
(2) Both CMOC-A and CVOC-A perform better than COC-A, which also confirms that the proposed VMLMS and MMLMS have more discriminative power than LMS toward playback speech. (3) Similar to the comparison between CVOC and eCQCC on ASVspoof 2017 V2, CVOC also gives better performance than eCQCC on the ASVspoof 2019 physical access evaluation set. This means that CVOC has more discriminative ability than eCQCC on the two databases. (4) It is surprising to find that CVOC-A performs better than CQSPIC-A on the ASVspoof 2019 physical access evaluation set, unlike the comparison between them on the ASVspoof 2017 V2 evaluation set. The reason may be that CVOC can extract more discriminative information than CQSPIC on the ASVspoof 2019 physical access evaluation set. (5) The performance of CMOC-A and CVOC-A is better than that of CQCC-A and COC-A because the modified log magnitude spectrum is used in the extraction of the two features.
Table 15 gives the comparison with some known systems based on hand-crafted features on the ASVspoof 2019 physical access evaluation set. Here, LFCC represents linear frequency cepstral coefficients. From Table 15, it can be seen that our systems perform better than the two known systems. The reason is that discriminative features are used in our systems.

Conclusion
This paper addresses the problem of how to extract hand-crafted discriminative features for playback speech detection. Two methods to obtain the modified log magnitude spectrum are proposed by analyzing the discriminative power between genuine speech and playback speech using Fisher's ratio. Then, CVOC and CMOC are extracted by applying octave segmentation and DCT to VMLMS and MMLMS, respectively. The experimental results on the ASVspoof 2017 V2 and ASVspoof 2019 physical access databases show that both CVOC and CMOC perform better than some commonly used features because VMLMS and MMLMS can enhance the discriminative power between genuine speech and playback speech. In addition, CVOC performs better than CMOC on the two databases, which means that CVOC has more discriminative power than CMOC. The EER of CVOC on the ASVspoof 2017 V2 evaluation set reaches 11.46%, and its t-DCF on the ASVspoof 2019 physical access evaluation set reaches 0.165. It is somewhat surprising to find that the proposed method works so well. Future work can explore how far this idea can be extended.