DeepDet: YAMNet with BottleNeck Attention Module (BAM) for TTS synthesis detection

Spoofed speech is becoming a serious threat to society due to advances in artificial intelligence techniques. Therefore, an automated spoofing detector that can be integrated into automatic speaker verification (ASV) systems is needed. In this study, we propose a novel and robust model, named DeepDet, based on a deep-layered architecture, to categorize speech into two classes: spoofed and bonafide. DeepDet is an improved model based on Yet Another Mobile Network (YAMNet), employing a customized MobileNet combined with a bottleneck attention module (BAM). First, we convert audio into mel-spectrograms, which are time–frequency representations on the mel scale. Second, we train our deep-layered model on the mel-spectrograms extracted from the Logical Access (LA) set of the ASVspoof 2019 dataset, which includes synthesized speech and voice conversion. Finally, we classify the audio using the trained binary classifier. More precisely, we utilize the power of a layered architecture and guided attention to discern spoofed speech from bonafide samples. The proposed model employs depth-wise separable convolutions, which make it lighter than existing techniques. Furthermore, we conducted extensive experiments to assess the performance of the proposed model using the ASVspoof 2019 corpus. We attained an equal error rate (EER) of 0.042% on Logical Access (LA) attacks and 0.43% on Physical Access (PA) attacks. These results indicate the effectiveness of DeepDet over existing spoofing detectors on the ASVspoof 2019 dataset. Additionally, the proposed model is robust enough to identify unseen spoofed audio and classifies the various attacks accurately.


Introduction
Speech is commonly used as a transmission medium in digital devices such as mobile phones and computers. Speech also has other characteristics, e.g., rhythm, genre, pitch, etc. However, with the advancement of artificial intelligence and deep learning models [1,2], it has become easy to manipulate signals and generate fake speech to deceive the listener. Moreover, various speech synthesis algorithms, i.e., GAN [3], Deepvoice [4], cotton [5], and wavelet [6], have gained importance for generating natural, human-like speech that can defeat automatic speaker verification (ASV) systems. For example, false information related to politics based on deepfakes became a major threat to the US presidential elections in 2020 [7]. Furthermore, a loss of USD 243,000 occurred when an audio deepfake [8] was employed in bank transactions. These incidents show the vulnerability of ASV systems, which are widely used in various security systems.
In the ASVspoof 2019 competition, the dataset has two partitions: logical access (LA) and physical access (PA) attacks. There are three main types of spoofing attacks: replay attack (RA), text-to-speech synthesis (TTS), and voice cloning (VC). TTS and VC attacks are covered by the logical access (LA) set of the ASVspoof 2019 dataset, whereas the physical access (PA) set includes recordings that simulate attacks on physical access systems, such as replay. These attacks may involve impersonation, replay, or other methods aimed at deceiving the verification system. Researchers have proposed different approaches [9][10][11] for spoofing detection. Some algorithms based on machine learning techniques discern the audio using data-driven and knowledge-focused countermeasures [12][13][14]. However, traditional machine learning algorithms rely on hand-crafted feature extraction, which is time-consuming and complex because feature engineering is needed to select optimal features. In deep learning models, by contrast, features are extracted and selected automatically, which makes these models more flexible.
With the advancement in the domain of convolutional neural networks (CNNs), several methods based on deep layers have been proposed. For instance, Chintha et al. proposed a recurrent CNN structure to detect spoofed (fake) audio [15]. Moreover, a lightweight convolutional neural network, namely LCNN, was employed in [16], utilizing the softmax loss function to detect spoofing attacks. Furthermore, various combinations of detection systems have been tested along with ResNet [17] and explored with other classifiers for better performance [18,19]. In [20], a model based on an end-to-end ensemble method was employed to learn the fusion of various detection systems. Even though the performance of these algorithms was satisfactory, they generalize poorly to unseen attacks. Therefore, it is necessary to introduce an efficient and robust system that can detect fake audio from any source.
The model proposed in this study focuses on the detection of synthesized speech rather than real speech due to two major concerns: (1) defining a comprehensive model for detecting all possible variations of genuine audio can be extremely complex and may lead to false positives or false negatives, and (2) adversaries can employ various techniques to generate fake audio that mimics genuine recordings, making it challenging to rely solely on detecting genuine audio. Therefore, focusing on fake audio detection allows system designers to address potential threats and vulnerabilities introduced by malicious actors attempting to deceive the system.
This study proposes a novel and robust framework, namely DeepDet, based on a deep learning model to detect spoofed voices, such as LA and PA attacks. Our proposed model is divided into three phases. First, audio features are extracted in the form of images known as mel-spectrograms. Second, a deep-layered network is trained using the ASVspoof 2019 dataset to classify the audio input as fake or real. Third, the network performs the binary classification of mel-spectrograms. To evaluate the performance and effectiveness of the proposed system, we perform our experiments on the publicly available ASVspoof 2019 dataset. More precisely, we propose a robust system that identifies and classifies the spoofed audio generated by text-to-speech and voice cloning systems. We assessed the performance of the proposed system using the PA (replay and bonafide samples) and LA (voice conversion, speech synthesis, and bonafide) sets from the ASVspoof 2019 corpus. The major contributions of the proposed model are presented below:
• We propose a novel deep learning-based model, namely DeepDet, based on an improved YAMNet architecture, for spoofed audio detection in the manner of image classification models.
• DeepDet employs an improved architecture using an attention module for feature extraction from mel-spectrograms. It employs depth-wise separable convolutions, which is why our proposed model is lightweight.
• An attention block makes our model focus on relevant parts of the input, reducing information loss and improving the model's ability to capture fine-grained patterns.
• Our proposed method is a robust speech spoofing detector that can be utilized to detect unseen synthetic voice attacks along with replay attacks and voice conversion.
• We evaluated our proposed system through extensive experiments that confirm its significance over existing techniques.
The remainder of the paper is organized as follows: Sect. 2 reviews the related work, Sect. 3 describes the methodology of the proposed technique, Sect. 4 presents the experiments performed, and Sect. 5 gives the conclusion and limitations.

Related work
With the advancement of technology and electronic devices, the processing of audio content, such as music, ambient sounds, games, and entertainment, has become a significant field for researchers. Various models have been proposed for the classification of audio based on audio features [21][22][23][24][25]. Moreover, in the last two decades, text-to-speech systems have become so powerful that they are capable of generating a realistic voice after training on limited audio samples from target speakers [26]. This is a huge threat to ASV systems, as they may be attacked through the naturalness of the generated speech [27]. The applications that protect ASV systems from such attacks are called deepfake speech detectors. Thus, various machine learning and deep learning-based works have been proposed for the detection of forged speech.
In [28], a support vector machine (SVM)-based classifier employing GMM was utilized for ASV. The authors attained equal error rates of 4.92% and 7.78% on the NIST 2006 speaker identification core test. They proposed the Gaussian Mixture Model (GMM) and a relative phase shift with an SVM for synthetic speech detection to minimize the weaknesses of speaker verification systems. Moreover, a detailed comparison of the Hidden Markov Model (HMM) and DNN was performed for the detection of spoofed speech [29]. In [30], the proposed model feeds spectrograms in image form as input to a CNN, thus forming a basis for audio processing using images. In [31], various feature descriptors were used, such as Mel Frequency Cepstral Coefficients (MFCC), spectrograms, etc., and the effect of GMM-UBM on the accuracy was analyzed. It was concluded that the combination of different feature descriptors gives better results in terms of equal error rate (EER). Chao et al. [28] utilized SVM to discern real speech from fake recordings of the claimed speaker. Similarly, in [32], Chao employed two core methods, Kernel Fisher Discriminant (KFD) and SVM, to verify speakers and attained better results than their previous work based on the GMM and UBM methods. Moreover, the computational cost of the polynomial kernel SVM was reduced by replacing the dot product between two utterances with the dot product between two i-vectors [33]. Furthermore, the authors applied a feature selection technique, attaining a 64% dimensionality reduction in features with an equal error rate of 1.7% [33]. Loughran et al. [34] overcame the issue of imbalanced data (where one class has more samples than the other) by utilizing a genetic algorithm (GA) with an adjusted cost function. Malik et al. [35] developed a system for audio forgery detection based on acoustic signatures of the environment by investigating the integrity of audio. However, these proposed models failed to address synthesized audio content with high precision.
In [36], a DNN-based classifier was proposed for detection that employs Human Log Likelihoods (HLL) as a scoring metric, which proved to be better than classical log-likelihood ratios (LLR). Additionally, the authors utilized various cepstral coefficients for the classifier's training. The works in [37,38] also employed a convolutional neural network for audio classification. An extensive comparison of DL techniques for fake audio detection was made in [39], demonstrating that CNN and recurrent neural network (RNN) based models give better results than all other employed techniques. The work in [40] explains that spectral features are significant for the detection of synthetic speech. For example, MFCC features are better than other spectral features as the model's input. Furthermore, [41] describes the challenges and limitations of spoofing detection models. In [42], a bispectral method for the analysis and detection of synthetic voices was proposed. The authors examined uncommon spectral features in fake speech synthesized using DNNs, which they called bispectral features. They also tried to find high-order polyspectral features to discern the fake audio. A capsule network-based approach was proposed in [43]. The authors enhanced the generalization of the proposed system and examined the artifacts in depth to increase the overall performance of the model. They also investigated replay attacks in audio using their network. In [44], the authors proposed a model for fake audio detection named DeepSonar. They analyzed the network layers and the activation patterns for various input audio to examine the difference between fake and real speech. They employed three datasets consisting of English and Chinese speech and attained an average accuracy of 98.1%. In [45], the authors proposed a model for fake audio detection based on micro-features such as voicing onset timing (VOT) and coarticulation. They found that VOT values are high in fake speech and attained a 23.5% error rate employing a fusion of both feature descriptors. The authors claimed that these micro-features can be used as standalone features for fake audio detection. Moreover, temporal convolutional networks (TCN) [46] have outperformed traditional algorithms such as RNNs and LSTMs for various tasks.
The latest deep learning techniques for text-to-speech synthesis systems, such as [47], clone the voice using original speech recordings. They require a few minutes of recording in the real voice and generate fake audio in seconds. Although these techniques have been improved [48], they still face the challenge of naturalness. The authors in [49] investigated the usage of RawNet2 for spoofing detection. They improved the architecture of RawNet2 and showed that their results are second best for the detection of A17 attacks. The authors of [50] proposed a novel feature extraction process for replay attack detection. They developed Cochlear Filter Cepstral Coefficients-based Instantaneous Frequency using Quadrature Energy Separation Algorithm (CFCCIF-QESA) features, with excellent temporal resolution as well as relative phase information. The authors of [51] suggested integrating orthogonal convolution into RawNet for fake audio detection, which serves to decrease the correlation between filters when optimizing Sinc-conv parameters, thereby enhancing discriminability. Additionally, they introduced temporal convolutional networks (TCN) to capture long-term dependencies in speech signals. Experimental results on the ASVspoof 2019 dataset reveal that their model, namely TO-RawNet, achieves a relative reduction of 66.09% in equal error rate (EER) on the logical access scenario compared to RawNet. This underscores the effectiveness of the approach in detecting fake audio attacks.
The existing studies have failed to fully discern fake voices, and their robustness has not been thoroughly evaluated on variously manipulated voices (changing pitch, rhythm, and resampling without changing the linguistic content). Furthermore, audio artifacts are more difficult to detect than image artifacts; transforming audio signals into a frequency representation (mel-spectrograms) facilitates pattern identification by models. The temporal and spectral information is also considered to capture the audio's frequency content. In addition, as indoor and outdoor voices contain environmental noise, it is very easy for fake voice generators to add real-world noise to the voice to fool the listener or the ASV system. Thus, an automatic fake audio detector that is robust enough to identify fake audio from various environments is still needed.

Methodology
Deep learning architectures are made of various layers, such as input, hidden, and classification layers, as shown in Fig. 1. The hidden layers are of various types, i.e., convolutional, batch normalization, pooling, activation, etc. Deep learning models extract features using various filters convolved over the input images. When a filter has been convolved over all the data, a feature map is formed. The dimensions of these feature maps are reduced by pooling layers, lowering the computational cost of the system. The feature maps can then be fed to further convolution layers, repeating the above steps.
Numerous applications exist for various purposes, such as facial feature recognition [52], speech identification [45], and emotion recognition [53]. The presented system consists of three main phases, i.e., (1) feature extraction, (2) training, and (3) classification. Feature extraction is performed by a feature extraction layer that generates mel-spectrograms and passes them to an improved MobileNet architecture [54], guided by the attention module, which serves as the base network in YAMNet [55]. The audio is sampled at 16,000 Hz to capture the essential characteristics of speech. This rate provides a balance between capturing the frequency content necessary for clear speech communication and minimizing bandwidth requirements. Then, the audio is transformed into mel-spectrograms of size 96 × 64. Secondly, we trained the improved network, i.e., DeepDet, on the generated mel-spectrograms belonging to two classes: Bonafide (Real) and Spoofed (Fake). The mel-spectrograms of fake audio differ from those of real audio. Therefore, the proposed system learns the patterns of the two classes precisely. Thirdly, we classified various input audio using the trained classifier. Moreover, MobileNet employs depth-wise separable convolutions, which makes our proposed model lightweight. The design of the proposed system is shown in Fig. 2.
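The framing arithmetic behind the 96 × 64 patches can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the 16 kHz rate, the 64 mel bands, and the 96-frame patch length come from the text, whereas the 25 ms window, 10 ms hop, and 50% patch overlap are assumptions borrowed from the public YAMNet reference configuration. All function names are hypothetical.

```python
# Sketch only: window, hop and patch-overlap settings are ASSUMPTIONS taken
# from the public YAMNet reference configuration, not values from the paper.
SAMPLE_RATE_HZ = 16_000    # stated in the text
MEL_BANDS = 64             # stated in the text
PATCH_FRAMES = 96          # stated in the text (96 x 64 patches)
WINDOW_SECONDS = 0.025     # assumed STFT window length
HOP_SECONDS = 0.010        # assumed STFT hop length
PATCH_HOP_FRAMES = 48      # assumed 50% overlap between consecutive patches


def num_stft_frames(num_samples: int) -> int:
    """Count the full analysis windows that fit in the signal (no padding)."""
    window = round(SAMPLE_RATE_HZ * WINDOW_SECONDS)  # 400 samples
    hop = round(SAMPLE_RATE_HZ * HOP_SECONDS)        # 160 samples
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


def num_patches(num_samples: int) -> int:
    """Count the 96-frame spectrogram patches produced for one recording."""
    frames = num_stft_frames(num_samples)
    if frames < PATCH_FRAMES:
        return 0
    return 1 + (frames - PATCH_FRAMES) // PATCH_HOP_FRAMES


def patch_shape() -> tuple:
    """Shape (time frames, mel bands) of each patch fed to the network."""
    return (PATCH_FRAMES, MEL_BANDS)
```

Under these assumed settings, one second of audio yields a single patch, and longer utterances yield several overlapping patches that can be classified individually.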
In the improved version, we customized the base network MobileNetV1 by adding three layers, namely a grouped 2D convolutional layer, an instance normalization layer, and an activation layer, before the fully connected layer. Moreover, we replaced all the batch normalization layers in the original network with instance normalization, which improves the convergence of the network. We increased the depth of the proposed network while decreasing the sensitivity of the model to the hyperparameters. Our proposed model extracts the most representative features from the mel-spectrograms generated from audio. Additionally, our model is lightweight due to depth-wise separable convolutions.

YAMNet architecture
Transfer learning is a well-known deep learning technique in which a model reuses the features learned by another trained model. The key aspect of transfer learning is to minimize the computational cost by utilizing previously learned patterns. It is preferred to employ transfer learning when a large amount of unlabeled data is available to train a model. The pre-trained model thus utilizes its previously learned features to reduce time and effort. YAMNet employs MobileNetV1 as the base network and is pre-trained on the Google AudioSet dataset for 521 audio events. Therefore, in our work, we train YAMNet on unbalanced data and achieve significant performance. Before the feature extraction phase, the audio is resampled to 16,000 Hz with one channel. Moreover, YAMNet is a DL-based model and therefore extracts the audio features automatically through its feature extraction layer. The feature extraction layer extracts the audio features in the form of spectrograms, and these spectrograms are then fed to the improved MobileNet layers for classification. The layered architecture of the original YAMNet is shown in Fig. 3.

MobileNet
MobileNet is built from depth-wise separable convolutions, which factorize a standard convolution into a depth-wise convolution and a 1 × 1 convolution known as a point-wise convolution. The depth-wise convolution applies one filter to each channel, so the standard convolution is split into two separate layers, one for filtering and the other for combining. The point-wise convolution, of size 1 × 1, combines the outputs of the depth-wise convolution. Due to this factorization, the model size and computational complexity are decreased significantly. The standard convolution, along with the depth-wise and point-wise convolutions, is shown in Fig. 4. An input feature map K of size $D_k \times D_k \times M$ is fed to the standard convolutional layer, and an output feature map G of size $D_G \times D_G \times N$ is generated. Here, $D_k$ represents the spatial height and width of the input feature map, M represents the total number of input channels, $D_G$ refers to the spatial height and width of the output feature map, and N represents the total number of output channels.
The standard convolutional layer is parameterized by a convolution kernel L of size $D_L \times D_L \times M \times N$. Here, $D_L$ refers to the spatial dimension of the square kernel, M represents the total input channels, and N refers to the total output channels. Assuming stride 1 and padding, the output feature map is calculated as below:

$$G_{k,l,n} = \sum_{i,j,m} L_{i,j,m,n} \cdot K_{k+i-1,\,l+j-1,\,m} \tag{1}$$

Moreover, the standard convolutional operation involves the cost as below:

$$D_L \cdot D_L \cdot M \cdot N \cdot D_k \cdot D_k \tag{2}$$

Fig. 2 Flow diagram of the proposed system
Here, the computational cost depends multiplicatively on the number of input channels M, the number of output channels N, the kernel size $D_L \times D_L$, and the feature map size $D_k \times D_k$. The mobile network addresses each of these terms and their interactions. It uses depth-wise separable convolutions to break the interaction between the number of output channels and the kernel size. A standard convolution filters and combines features in a single step using the convolutional kernels, producing a new feature representation. These two steps, filtering and combining, can be split into two separate processes via depth-wise separable convolutions to reduce the computational cost. A depth-wise separable convolution consists of two layers, i.e., a depth-wise convolution and a point-wise convolution. The depth-wise convolution applies a single filter to each input channel, whereas the point-wise convolution uses a 1 × 1 convolution to form a linear combination of the outputs of the depth-wise layer. Moreover, mobile networks use two more layers, batch normalization and ReLU non-linearity, after both types of layers. Batch normalization layers are commonly employed in convolutional neural networks. The input and output of batch normalization (BN) are 4-D tensors, referred to as $I_{b,c,x,y}$ and $F_{b,c,x,y}$, respectively, where b indexes the batch, c the channel, and x and y the two spatial dimensions. When the input consists of images, the channels correspond to the RGB channels. BN layers apply the same normalization to all activations in a given channel.
In Eq. 3, BN subtracts the mean activation $\mu_c$ from all input activations in channel c:

$$F_{b,c,x,y} = \gamma_c \, \frac{I_{b,c,x,y} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}} + \beta_c \tag{3}$$

Here, $\mu_c$ is the mean over the set ∅, which comprises all activations of channel c across the mini-batch samples b and spatial locations x, y. The centered activation is divided by the standard deviation $\sigma_c$, with $\epsilon$ added for numerical stability. The normalization in BN is followed by a channel-wise affine transformation parameterized by $\gamma_c$ and $\beta_c$, which are learned during training. The significance of employing BN is that it prevents activation explosion by normalizing all the activations to zero mean. Because the means and variances are normalized, it becomes possible to train a network with large learning rates without the activations growing uncontrollably. Furthermore, large learning rates allow the algorithm to reach the convergence point quickly. Small learning rates make only slight progress in flat directions of the optimization landscape and may converge to a sharp local minimum, which exhibits poorer generalization performance [56].
Rectified linear units (ReLU), an activation function, first gained importance in acoustic models [57], exhibiting mathematical and biological characteristics. ReLU is described as a source of improved training of deep learning models. It thresholds its input at 0, i.e., f(x) = max(0, x): it outputs 0 when x is less than 0 and is linear when x is greater than or equal to 0, i.e., x ≥ 0. The improved YAMNet, i.e., DeepDet, is shown in Fig. 5.
The rectified linear unit (ReLU), which is an activation function, yields 0 as an output where x < 0 and a linear output with a slope of 1 where x > 0.
We employ the ReLU activation function in all hidden layers of the deep neural network and as the classification function in the output layer of the proposed network.
The depth-wise convolution, with one filter per input channel, is computed as shown in the equation below.

$$\hat{G}_{k,l,m} = \sum_{i,j} \hat{L}_{i,j,m} \cdot K_{k+i-1,\,l+j-1,\,m}$$

Here, $\hat{L}$ represents the depth-wise convolutional kernel of size $D_L \times D_L \times M$; the m-th filter in $\hat{L}$ is applied to the m-th channel in K to form the m-th channel of the output feature map $\hat{G}$. The cost of computing the depth-wise convolution is as below:

$$D_L \cdot D_L \cdot M \cdot D_k \cdot D_k$$

The depth-wise convolution is more efficient than the standard convolution. However, it only filters the input channels and does not combine them to form new features. Therefore, an additional layer is required that combines the output of the depth-wise convolution using a 1 × 1 convolution to form new features. The combination of depth-wise and point-wise convolution is known as depth-wise separable convolution, which was first introduced by [58]. The total cost of the depth-wise and point-wise 1 × 1 convolutions can be represented mathematically as below:

$$D_L \cdot D_L \cdot M \cdot D_k \cdot D_k + M \cdot N \cdot D_k \cdot D_k$$

Expressing convolution as two steps, i.e., filtering and combining, reduces the computation by the factor given in Eq. 8.

$$\frac{D_L \cdot D_L \cdot M \cdot D_k \cdot D_k + M \cdot N \cdot D_k \cdot D_k}{D_L \cdot D_L \cdot M \cdot N \cdot D_k \cdot D_k} = \frac{1}{N} + \frac{1}{D_L^2} \tag{8}$$

The mobile network employs 3 × 3 depth-wise separable convolutions, which require approximately eight to nine times less computation than standard convolutions.
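The cost comparison between standard and depth-wise separable convolutions can be checked numerically. The sketch below simply counts multiplications using the symbols defined in this section ($D_L$ kernel size, M input channels, N output channels, $D_k$ feature map size); the function names and the example layer sizes are illustrative assumptions, not values taken from the paper.

```python
def standard_conv_cost(d_l: int, m: int, n: int, d_k: int) -> int:
    """Multiplications of a standard convolution: D_L*D_L*M*N*D_k*D_k."""
    return d_l * d_l * m * n * d_k * d_k


def depthwise_separable_cost(d_l: int, m: int, n: int, d_k: int) -> int:
    """Multiplications of a depth-wise plus a point-wise (1 x 1) convolution."""
    depthwise = d_l * d_l * m * d_k * d_k  # one D_L x D_L filter per input channel
    pointwise = m * n * d_k * d_k          # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


def cost_ratio(d_l: int, m: int, n: int, d_k: int) -> float:
    """Separable cost divided by standard cost; equals 1/N + 1/D_L**2."""
    return depthwise_separable_cost(d_l, m, n, d_k) / standard_conv_cost(d_l, m, n, d_k)


if __name__ == "__main__":
    # Hypothetical layer: 3 x 3 kernel, 32 -> 64 channels, 48 x 48 feature map.
    ratio = cost_ratio(d_l=3, m=32, n=64, d_k=48)
    print(f"separable / standard = {ratio:.4f}, saving factor = {1 / ratio:.2f}")
```

For a 3 × 3 kernel the ratio approaches 1/9 as N grows, which is where the roughly eight- to nine-fold saving comes from.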

Improved architecture
MobileNet is built from depth-wise separable convolutions, as described in the previous section, except for the first layer, which is a full convolution. The architecture of our improved MobileNet is shown in Table 1. As shown there, we added three extra layers before the global average pooling layer: a grouped 2D convolutional layer, an instance normalization layer, and a ReLU layer. We added this block of three layers to increase the efficiency of the model by extracting the most representative features from the mel-spectrograms. Moreover, we replaced all the batch normalization layers with instance normalization layers [59]. The two operate differently on input data. While instance normalization (IN) transforms an individual training sample, batch normalization (BN) applies the transformation to the entire mini-batch of samples. This makes BN reliant on the batch size, as a larger batch size is necessary to obtain a statistically more accurate mean and variance. Implementing a large batch size can be challenging due to memory constraints. Consequently, when memory is limited, smaller batch sizes may be chosen, which can pose problems in certain situations. A very small batch size can introduce training errors because the mean and variance become more prone to noise. IN outperforms BN in scenarios where a small batch size is employed.
Moreover, because BN's effectiveness is tied to the batch size, it cannot be applied in the same manner during test time as it is during training.This limitation arises because, typically, only one example is processed during the testing phase, making it impossible to compute mean and variance in the same way as during training.Instead, BN utilizes moving averages and variance for inference during test time.In contrast, IN is independent of the batch size, ensuring consistent implementation for both training and testing.Besides this, batch normalization introduces additional noise during training since the outcome for a specific instance is influenced by neighboring instances.Interestingly, this type of noise can have both positive and negative effects on the network.
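The batch-size dependence discussed above can be illustrated with a small sketch. It is a simplified illustration under stated assumptions: each channel is flattened to a list of spatial activations, the learnable scale and offset are omitted (equivalent to a scale of 1 and an offset of 0), and BN uses training-mode batch statistics. The function names are hypothetical.

```python
import math

EPS = 1e-5  # small constant for numerical stability (assumed value)


def _mean_var(values):
    """Mean and (biased) variance of a flat list of activations."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def _normalize(values, mean, var):
    return [(v - mean) / math.sqrt(var + EPS) for v in values]


def batch_norm(batch):
    """batch[b][c] is a flat list of spatial activations.

    Statistics are pooled per channel over ALL samples and locations.
    """
    num_channels = len(batch[0])
    out = [[None] * num_channels for _ in batch]
    for c in range(num_channels):
        pooled = [v for sample in batch for v in sample[c]]
        mean, var = _mean_var(pooled)
        for b, sample in enumerate(batch):
            out[b][c] = _normalize(sample[c], mean, var)
    return out


def instance_norm(batch):
    """Statistics are computed per sample and per channel only."""
    return [[_normalize(ch, *_mean_var(ch)) for ch in sample] for sample in batch]


if __name__ == "__main__":
    sample = [[1.0, 2.0, 3.0, 4.0]]  # one channel, four spatial locations
    batch_a = [sample, [[10.0, 20.0, 30.0, 40.0]]]
    batch_b = [sample, [[-5.0, 0.0, 5.0, 100.0]]]
    # IN output for `sample` ignores its batch mates; BN output does not.
    print(instance_norm(batch_a)[0] == instance_norm(batch_b)[0])
    print(batch_norm(batch_a)[0] == batch_norm(batch_b)[0])
```

Because the IN result for a sample is the same whatever else is in the batch, the same computation can be used at training and test time, including with a batch of one.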
More precisely, a mel-spectrogram in the form of a 2D image with dimensions 96 × 64 × 1 is fed from the image input layer to the first 2D convolutional layer. This convolutional layer, with stride two and same padding, gives an output of 48 × 32 × 32, i.e., 32 channels. In the 3rd step, instance normalization is applied over the 32 channels with offset 1 × 1 × 32 and scale 1 × 1 × 32. In the 4th step, the ReLU activation function is applied, which gives 48 × 32 × 32 activations. In the 5th step, a depth-wise 2D convolution is applied as 32 groups of one 3 × 3 × 1 convolution each, with learnable weights of size 3 × 3 × 1 × 1 × 32 and bias of size 1 × 1 × 1 × 32, stride one, and same padding. In the 6th and 7th steps, instance normalization and the ReLU activation function are applied, giving activations of size 48 × 32 × 32. This sequence is repeated up to the last activation function at step 85, which gives an output of 3 × 2 × 1024. As described before, when audio files are converted into mel-spectrograms, the number of bands is 64.
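The way the activation size shrinks from 96 × 64 to 3 × 2 can be traced with a short sketch. With same padding, a convolution with stride s maps a spatial size n to ceil(n / s). The stride and channel schedule below is an assumption based on the standard MobileNetV1 layout used by YAMNet (Table 1 of the paper is the authoritative source), and the names are illustrative.

```python
import math

INPUT_HW = (96, 64)  # mel-spectrogram patch size stated in the text

# ASSUMED (stride, output_channels) schedule: the first full convolution
# followed by 13 depth-wise separable blocks, as in standard MobileNetV1.
LAYERS = [
    (2, 32),
    (1, 64), (2, 128), (1, 128), (2, 256), (1, 256), (2, 512),
    (1, 512), (1, 512), (1, 512), (1, 512), (1, 512),
    (2, 1024), (1, 1024),
]


def same_padding_output(size: int, stride: int) -> int:
    """Spatial output size of a convolution with 'same' padding."""
    return math.ceil(size / stride)


def trace_shapes(input_hw=INPUT_HW, layers=LAYERS):
    """Return the (height, width, channels) after each convolutional block."""
    h, w = input_hw
    shapes = []
    for stride, channels in layers:
        h = same_padding_output(h, stride)
        w = same_padding_output(w, stride)
        shapes.append((h, w, channels))
    return shapes


if __name__ == "__main__":
    for index, shape in enumerate(trace_shapes(), start=1):
        print(index, shape)
```

Under this assumed schedule, the first entry reproduces the 48 × 32 × 32 output of the first convolution and the last entry reproduces the 3 × 2 × 1024 output quoted in the text.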

BAM attention module
Exploiting channel dependencies is a significant way of improving the performance of CNN models. To increase performance over state-of-the-art models at negligible computational cost, we used an attention block, i.e., the BottleNeck Attention Module (BAM) [60], in our improved YAMNet model, as shown in Fig. 5. BAM's architecture consists of two pathways: spatial and channel. It is trained end-to-end together with our proposed DeepDet. The channel weights are trained using the cost function, so the weight coefficients of the feature channels are obtained automatically. The attention module helps the model refine intermediate features more effectively. The proposed attention module's architecture is shown in Fig. 6. Here, F represents the feature map, whereas M(F) is the attention map computed by the module using two attention branches: spatial, $M_s$, and channel, $M_c$. There are two hyperparameters, i.e., the reduction ratio r and the dilation value d. More specifically, r controls the capacity and overhead of both attention branches, and d determines the receptive field size, which helps aggregate contextual information in the spatial branch.
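How the two branches are merged can be sketched as follows. In the BAM formulation of [60], the attention map is $M(F) = \sigma(M_c(F) + M_s(F))$ and the refined feature is $F' = F + F \otimes M(F)$. The sketch below shows only this combination step: the branch outputs are passed in as plain numbers (in the real module they come from a small MLP and dilated convolutions, which are omitted here), and the function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bam_refine(feature, channel_att, spatial_att):
    """Combine channel and spatial attention as in BAM.

    feature[c][h][w] : input feature map F
    channel_att[c]   : channel branch output M_c(F), one value per channel
    spatial_att[h][w]: spatial branch output M_s(F), one value per location
    Returns F' = F + F * sigmoid(M_c + M_s), with the two branch outputs
    broadcast to the full C x H x W shape before the sigmoid.
    """
    refined = []
    for c, plane in enumerate(feature):
        refined.append([
            [value * (1.0 + sigmoid(channel_att[c] + spatial_att[h][w]))
             for w, value in enumerate(row)]
            for h, row in enumerate(plane)
        ])
    return refined


if __name__ == "__main__":
    # Hypothetical 2-channel, 1 x 2 feature map with neutral attention logits.
    feature = [[[1.0, 2.0]], [[3.0, 4.0]]]
    print(bam_refine(feature, [0.0, 0.0], [[0.0, 0.0]]))
```

Because the attention map is added residually, each activation is scaled by a factor between 1 and 2, so the module can emphasize informative regions without suppressing the original features.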
The incorporation of the attention mechanism in BAM aids in acquiring more informative and distinctive representations. This, in turn, improves the extraction of features and fosters a deeper comprehension of intricate patterns in the data, ultimately augmenting the network's capacity for learning and generalization. Additionally, BAM plays a role in enhancing the robustness of a neural network by empowering it to dynamically adjust its attention in response to the input context. This adaptability equips the network to effectively manage variations and shifts in the input data, bolstering its resilience to diverse conditions and scenarios. Attention mechanisms, exemplified by BAM, have the potential to mitigate computational overhead by allowing the network to concentrate on pertinent segments of the input, thereby conserving computational resources.

Experimental evaluation

Dataset
The spoofed voice detection challenge was first held in 2015, with the ASVspoof 2015 corpus [61]. The aim was to develop systems that detect synthesized or cloned speech and to analyze their performance using the dataset samples. Two years later, the ASVspoof 2017 corpus [41] was introduced for the evaluation of replay detection systems. A large and diverse dataset was introduced in 2019, known as ASVspoof 2019 [62], comprising both logical access (LA) and physical access (PA) attacks. The former contains voice conversion and synthesized speech samples along with bonafide audio. The latter consists of replay and bonafide audio samples. Real speech data is sourced from 107 speakers (46 male, 61 female) without notable channel or background noise influences. Spoofed speech is then created from the genuine data using various spoofing algorithms. Furthermore, both parts are split into three subsets, namely the training, development, and evaluation sets. The logical access dataset comprises seventeen different text-to-speech and voice cloning systems. Moreover, these systems are trained using the voice cloning toolkit VCTK [63]. Among these systems, six are labeled as known attacks, whereas the other 11 are labeled as unknown attacks. The training and development audio samples are taken from the known attacking systems, and the evaluation samples are collected from the 11 unknown and two known attacks. The six known systems of the LA set comprise 2 VC systems that utilize spectral filtering and artificial neural network-based approaches, and 4 TTS systems that utilize artificial neural networks or waveform concatenation employing vocoders based on a source-filter vocoder [64] or a WaveNet vocoder [65]. The 11 unknown spoofing systems consist of 2 VC, 6 TTS, and 3 hybrid VC and TTS systems utilizing various waveform-based methods such as Griffin-Lim [66], neural waveform techniques [67], generative adversarial networks (GAN) [68], and combinations of waveform and spectral filtering. The statistics of the ASVspoof 2019 dataset are shown in Table 2, whereas a detailed summary of the LA set is shown in Table 3. Moreover, ASVspoof 2017 [41] comprises real replay speech, while ASVspoof 2019 comprises synthesized replay recordings made under an acoustic environment to improve the reliability of ASV systems. Training and development recordings are produced according to 9 replay and 27 acoustic configurations. The rooms are categorized by size as large, medium, and small. All speech is generated in various zones, namely A, B, and C, exhibiting varying distances (Da) between the talker and the zone. The voice quality in zone A is better than in zones B and C. Moreover, the eval recordings were gathered in the same way as the train and dev sets.
To evaluate the model on the LA set, we utilized 25,380 training samples, including 2580 bonafide samples and 22,800 spoofed samples, to train DeepDet. We tested our model using both remaining sets, that is, the eval and dev sets. The eval set consists of 71,237 samples, including 63,882 spoofed and 7355 bonafide samples, while the dev set consists of 24,844 samples, including 22,296 spoofed and 2548 bonafide samples. Furthermore, we evaluated our proposed model using the PA dataset; we employed 54,000 samples, including 48,600 spoofed and 5400 bonafide samples, for the model's training. Additionally, we evaluated DeepDet using both remaining sets, that is, the eval and dev sets of the PA database. The eval set comprises 134,730 samples, including 116,640 spoofed and 18,090 bonafide samples, and the dev set comprises 29,700 audios, including 24,300 spoofed and 5400 bonafide samples.
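The split sizes quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original experiments (which were run in MATLAB); the dictionary layout and the helper name `summarize` are illustrative assumptions, and the LA eval counts follow the breakdown of 7355 bonafide and 63,882 spoofed utterances given for the Logical Access scenario.

```python
# Split sizes as (bonafide, spoofed) pairs, copied from the text.
# The dictionary layout is an illustrative assumption.
SPLITS = {
    ("LA", "train"): (2580, 22800),
    ("LA", "dev"): (2548, 22296),
    ("LA", "eval"): (7355, 63882),
    ("PA", "train"): (5400, 48600),
    ("PA", "dev"): (5400, 24300),
    ("PA", "eval"): (18090, 116640),
}


def summarize(bonafide, spoofed):
    """Return the total sample count and the fraction of spoofed samples."""
    total = bonafide + spoofed
    return total, spoofed / total


if __name__ == "__main__":
    for (subset, split), (bona, spoof) in SPLITS.items():
        total, spoof_fraction = summarize(bona, spoof)
        print(f"{subset} {split}: total={total}, spoofed fraction={spoof_fraction:.3f}")
```

The spoofed fraction is roughly 0.8 to 0.9 in every split, which is one reason threshold-independent measures such as EER are reported alongside accuracy.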

Environment
We performed the experiments on an NVIDIA GeForce GTX GPU with 4 GB of memory. The details of the employed hardware are shown in Table 3. The operating system was Windows 10, running on a machine with 16 GB of RAM. The experiments were performed in MATLAB 2021a.

Metrics
For the performance evaluation of the proposed model, we utilized various metrics, namely precision, recall, accuracy, equal error rate (EER), and the tandem detection cost function (t-DCF). These metrics rely on the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). TP refers to the number of spoofed audios correctly classified as spoofed by our proposed model, FP refers to the number of bonafide audios incorrectly classified as spoofed, FN denotes the number of spoofed audios incorrectly classified as negative, i.e., bonafide, and TN refers to the number of bonafide audios correctly classified as the negative class. Furthermore, precision refers to the fraction of TP over all audios (mel-spectrograms) classified as positive. The mathematical equation is given below.
$$\text{Precision} = \frac{TP}{TP + FP}$$
The accuracy of the system indicates the fraction of all audios that the proposed system classifies correctly. The equation is presented below.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Recall is the fraction of spoofed audios correctly classified as positive out of all spoofed audios, including those that the system misclassified as the real class. A recall value closer to 1 indicates a better model. The equation of recall is given below.
$$\text{Recall} = \frac{TP}{TP + FN}$$
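The three count-based metrics can be sketched in a few lines. This is an illustrative example using the standard definitions (with spoofed as the positive class, as in the text); the function name and the confusion counts in the usage example are hypothetical and are not results from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, and accuracy from confusion counts.

    Spoofed audio is treated as the positive class, following the text.
    Ratios with an empty denominator are reported as 0.0.
    """
    predicted_positive = tp + fp
    actual_positive = tp + fn
    total = tp + fp + tn + fn
    return {
        "precision": tp / predicted_positive if predicted_positive else 0.0,
        "recall": tp / actual_positive if actual_positive else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    metrics = classification_metrics(tp=90, fp=10, tn=95, fn=5)
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```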
Moreover, we employed the equal error rate (EER) and t-DCF to analyze the performance of the proposed spoof detector. Let FAR(θ) and FRR(θ) denote the false acceptance and false rejection rates at threshold θ, respectively. FRR and FAR are monotonic in θ, one decreasing while the other increases; therefore, there is an operating point θ_EER at which the two rates become equal, and the EER is their common value at that point.
$$\text{EER} = \text{FAR}(\theta_{\text{EER}}) = \text{FRR}(\theta_{\text{EER}})$$
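The EER definition can be illustrated with a simple threshold sweep. The sketch below assumes that a higher detector score means "more likely bonafide", so that a spoofed sample scoring at or above the threshold counts as a false acceptance; this score convention, the function name `compute_eer`, and the toy scores are assumptions for illustration. On finite data the two rates rarely coincide exactly, so the sketch takes the threshold where they are closest and averages them.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumed convention: higher score = more likely bonafide.
    FAR(t) = fraction of spoofed samples with score >= t (accepted).
    FRR(t) = fraction of bonafide samples with score < t (rejected).
    Returns (eer, threshold) at the point where |FAR - FRR| is smallest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    # Extra threshold above all scores, so "reject everything" is also tested.
    thresholds.append(thresholds[-1] + 1.0)

    best_gap, best_eer, best_threshold = None, None, None
    for t in thresholds:
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_threshold = gap, (far + frr) / 2.0, t
    return best_eer, best_threshold


if __name__ == "__main__":
    # Toy scores with one overlapping pair (0.4 bonafide, 0.6 spoofed).
    bonafide = [0.9, 0.8, 0.7, 0.4]
    spoofed = [0.1, 0.2, 0.3, 0.6]
    eer, threshold = compute_eer(bonafide, spoofed)
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

Evaluation toolkits typically interpolate between adjacent thresholds rather than picking the nearest one, so values from this sketch may differ slightly from officially reported EERs.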
The formula for t-DCF is given below.
$$\text{t-DCF}(s) = \beta \, P^{\text{cm}}_{\text{miss}}(s) + P^{\text{cm}}_{\text{fa}}(s)$$
where the value of β depends on application-specific parameters, such as priors and costs, as well as on the performance of the ASV system. P^cm_miss(s) and P^cm_fa(s) represent the miss and false alarm rates of the countermeasure system at the threshold s.
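A minimum t-DCF of the kind reported in the result tables can be sketched as a sweep over the threshold s. This example assumes the normalized form β·P_miss + P_fa for the countermeasure, as used in the ASVspoof 2019 evaluation, and treats β as a given constant; in practice β is derived from the cost and prior parameters and from the ASV error rates, which are not reproduced here. The function name, the score convention (higher = more likely bonafide), and the value β = 2.0 are illustrative assumptions.

```python
def min_tdcf(bonafide_scores, spoof_scores, beta):
    """Minimum of beta * P_miss(s) + P_fa(s) over candidate thresholds s.

    Assumed convention: higher score = more likely bonafide.
    P_miss(s) = fraction of bonafide samples with score < s.
    P_fa(s)   = fraction of spoofed samples with score >= s.
    Returns (minimum cost, threshold achieving it).
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)  # also test "reject everything"

    best_cost, best_s = None, None
    for s in thresholds:
        p_miss = sum(x < s for x in bonafide_scores) / len(bonafide_scores)
        p_fa = sum(x >= s for x in spoof_scores) / len(spoof_scores)
        cost = beta * p_miss + p_fa
        if best_cost is None or cost < best_cost:
            best_cost, best_s = cost, s
    return best_cost, best_s


if __name__ == "__main__":
    bonafide = [0.9, 0.8, 0.7, 0.4]
    spoofed = [0.1, 0.2, 0.3, 0.6]
    # beta = 2.0 is a hypothetical value, not one taken from the paper.
    cost, s = min_tdcf(bonafide, spoofed, beta=2.0)
    print(f"min t-DCF = {cost:.3f} at s = {s}")
```

Because β weights misses more heavily than false alarms whenever β > 1, the cost-minimizing threshold generally differs from the EER threshold.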

Performance over synthesized speech and voice conversion
The eval set of LA includes 13 spoofing systems that are used to generate the spoofed speeches: 7 text-to-speech (TTS) systems, i.e., A07-A12 and A16; 3 hybrid TTS-VC systems, i.e., A13, A14, and A15; and 3 VC systems, i.e., A17-A19. We conducted a three-phase experiment to assess the effectiveness of DeepDet for VC and TTS systems.
In the Logical Access (LA) scenario, the training set comprises 2580 bonafide utterances and 22,800 spoofed utterances.The development set consists of 2548 bonafide and 22,296 spoofed utterances, while the evaluation set includes 7355 bonafide and 63,882 spoofed utterances.
More precisely, first, we utilized the spoofed and bonafide TTS samples from the train set of the logical access dataset for the training of our DL model. The results are shown in Table 4. We attained an EER of 0.50% and a t-DCF of 0.005. Second, we utilized the VC samples from the train set of the LA dataset for the training of our proposed model to analyze the performance. We attained an EER of 0.90% and a t-DCF of 0.06. The results indicate that DeepDet performs better at detecting TTS spoofed speech than VC spoofed speech. The reason is that the voice generated by VC systems is based on the periodic characteristics of the original audio samples, whereas TTS systems lack this property. Third, we performed an experiment using the general LA dataset to analyze the performance of the proposed model and achieved an EER of 0.042%. The overall performance of the proposed system is significant on the LA set. Therefore, we can say that our model, i.e., DeepDet, effectively detects fake audio. Similarly, the experiments were performed for the dev set as well, and the results are reported in Table 4.

Ablation study
In this experiment, we analyzed the performance of our proposed DeepDet using varying schemes. First, we assessed the results with the original YAMNet. Then, we attached the BAM module with YAMNet without

Performance analysis over physical access attacks
In this experiment, we aim to examine the performance of our spoofing audio detector on physical access attacks. Therefore, we transformed the audio samples of the PA set into mel-spectrograms and passed them to the improved YAMNet's base network, i.e., the customized MobileNet, for classification into bonafide and replay samples. We achieved an EER of 0.43% and 3.11% for the eval and dev sets, respectively. Moreover, a min-tDCF of 0.0021 and 0.05 is achieved for the eval and dev sets, respectively, as reported in Table 6.
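The mel scale underlying the mel-spectrogram input can be illustrated with a short sketch. The conversion below uses the common HTK-style formula, and the band layout (64 bands between 125 Hz and 7500 Hz) is an assumption chosen for illustration; the paper does not state which mel variant or band edges were used, and the function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (HTK-style formula, an assumption)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands, f_min, f_max):
    """Center frequencies (Hz) of n_bands bands equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]


if __name__ == "__main__":
    # 64 bands over 125-7500 Hz: illustrative values, not taken from the paper.
    centers = mel_band_centers(64, 125.0, 7500.0)
    print(f"first centers (Hz): {[round(c, 1) for c in centers[:3]]}")
    print(f"last centers (Hz):  {[round(c, 1) for c in centers[-3:]]}")
```

Bands that are evenly spaced in mels are narrow at low frequencies and wide at high frequencies, which is why a mel-spectrogram allocates more resolution to the lower part of the spectrum.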
The results show that our suggested spoofing detector attained significant performance compared to the existing models. In particular, our proposed model is based on an improved MobileNet, which utilizes depth-wise separable convolutional layers to extract the most representative features from the mel-spectrograms generated from audio. As a result, we attained an EER lower than that achieved by existing systems on the eval set, such as in [69]. Based on this experiment, we believe that our improved MobileNet is capable of effectively extracting features from replay samples to detect physical access attacks.
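The weight savings of depth-wise separable convolutions can be made concrete with a parameter count. The sketch below compares a standard convolution with a depth-wise convolution followed by a 1x1 point-wise convolution, ignoring biases; the layer shape in the usage example (3x3 kernel, 64 input channels, 128 output channels) is hypothetical and is not taken from Table 1.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    """Weights in a standard 2-D convolution (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_params(kernel_size, in_channels, out_channels):
    """Weights in a depth-wise conv plus a 1x1 point-wise conv (biases ignored)."""
    depthwise = kernel_size * kernel_size * in_channels  # one filter per input channel
    pointwise = in_channels * out_channels               # 1x1 channel mixing
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer shape, for illustration only.
    k, c_in, c_out = 3, 64, 128
    standard = standard_conv_params(k, c_in, c_out)
    separable = depthwise_separable_params(k, c_in, c_out)
    print(f"standard:  {standard} weights")
    print(f"separable: {separable} weights")
    print(f"reduction: {standard / separable:.2f}x")
```

For a k x k kernel the reduction factor is 1 / (1/c_out + 1/k^2), so it approaches k^2 (9x for 3x3 kernels) as the number of output channels grows.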

Performance comparison with existing techniques
In this experiment, we compared DeepDet with the existing models for voice spoofing detection. The comparative results are reported in Table 7, which considers the evaluation and development sets of the ASVspoof 2019 LA corpus. It can be seen that our proposed spoofing detector attains the lowest EER, 0.0015 and 0.042 for the dev and eval sets, respectively, outperforming the existing systems.
The second lowest EER for the eval set, 0.045, is achieved by EDL-Det, followed by W2V2-light-DARTS with 1.08. Moreover, the second lowest EER for the dev set, 0.02, is achieved by LFCC-PC-DARTS. However, that system attained the highest EER for the eval set, which was 4.87. From this analysis, it is concluded that our spoofing detector can effectively detect various spoofing attacks and voices generated by cloning algorithms. More precisely, our proposed algorithm outperforms the existing techniques.

Conclusion
In this paper, we have presented a voice spoofing detector, i.e., DeepDet, employing an improved deep learning model, YAMNet, to detect synthetic attacks. We employed an improved MobileNet along with the BAM attention module as the base network for feature extraction and classification of mel-spectrograms into bonafide and spoofed samples. The improved MobileNet with BAM effectively captures the sample dynamics, the artifacts of cloning algorithms, and the environment and microphone variations of replay attacks. Moreover, the significance of utilizing MobileNet lies in its use of depth-wise separable convolutional layers, which make it lightweight. The BAM module guides the overall network in extracting key features from mel-spectrograms. We assessed the performance of the proposed model using a diverse and large-scale dataset, i.e., the ASVspoof 2019 corpus, and concluded that our system is applicable to the detection of several types of spoofing attacks. More precisely, our model attained an EER of 0.43% and 0.042% for PA and LA attacks, respectively.
Our system effectively distinguishes the various cloning algorithms employed for the generation of speech. Additionally, our comparative assessment with existing models reveals that DeepDet outperforms them for various forms of speech spoofing detection, such as cloning-based, text-to-speech, and replay attacks. Furthermore, it is worth mentioning that the evaluation samples of the dataset include speeches from unseen speakers, and our proposed system attained excellent results on the ASVspoof 2019 evaluation set. Therefore, we believe that DeepDet is a robust spoofing detector due to its effectiveness on the evaluation set of ASVspoof 2019. In the future, we aim to cross-validate our model on other voice spoofing datasets as well and further improve the performance.

Fig. 1
General architecture of deep learning model

Table 1
Layer-wise details of our proposed MobileNet

Table 2
Statistics of ASVSpoof 2019 LA and PA sets

Table 3
System specifications for the employed model

Table 4
Results for synthesized speech and voice conversion

Table 5
Comparison with a base model
Precision (%), Recall (%), EER (%)

In the end, we changed the BN layers with IN layers to improve the performance further. The results are reported in Table 5. It is clearly visible from the results that DeepDet attains more remarkable results than the first two schemes.

Table 6
Results on PA set of ASVSpoof 2019

Table 7
Comparison with existing spoofing detection systems