Table 1 Selected attacking techniques in the ASV literature

From: Vulnerability issues in Automatic Speaker Verification (ASV) systems

Basis of attack: Choosing the closest target based on FAR, using GMM [27]
Corpus used: YOHO
Observations:
\(\bullet\) The more sessions in which the attacker has listened to the target voice, the higher the verification error rate obtained
\(\bullet\) The highest FAR achieved was \(35\%\), by imitator 2

Basis of attack: Choosing the closest target using the attacker's ASV, on the basis of EER [28] (a target-selection sketch is given after this table)
Corpus used: VoxCeleb1 and VoxCeleb2
Observations:
\(\bullet\) Transferability from the attacker's ASV to the attacked ASV is observed, in the order of the closest, median, and farthest speakers
\(\bullet\) Contrary to intuition, if the target speaker's voice is already similar to the impersonator's voice, impersonation lowers the verification error score. For targets that are not close to the attacker, however, impersonation increases the verification error, thereby strengthening the attack

Basis of attack: Training a feedback-controlled voice conversion (VC) system, with feedback coming from the black box target ASV. The VC method used is phonetic posteriorgram (PPG)-based [29]
Corpus used: Subset of the ASVspoof 2019 LA dataset
Observations:
\(\bullet\) A higher EER indicates a more successful attack (an EER computation sketch is given after this table). The overall EER achieved by the PPG-based VC attack with feedback was \(30.73\%\), versus \(29.25\%\) without feedback
\(\bullet\) Male speakers were observed to be more vulnerable to the PPG-based VC attack, with EERs of \(32.90\%\) and \(31.60\%\) with and without feedback, respectively
\(\bullet\) Female speakers, on the other hand, were comparatively less vulnerable, with lower EERs of \(26.67\%\) and \(25\%\) with and without feedback, respectively

Basis of attack: Crafting adversarial examples at the acoustic feature level, i.e., MFCC and log power magnitude spectrum (LPMS). The fast gradient sign method (FGSM) is used to solve the perturbation-generation optimization problem [30] (an FGSM sketch is given after this table)
Corpus used: VoxCeleb1
Observations:
\(\bullet\) In the black box setting, for perturbation budget \(\epsilon = 20\), an EER of \(74.62\%\) was achieved
\(\bullet\) In the white box setting, the LPMS i-vector-based system was found to be more vulnerable than the MFCC i-vector one. For \(\epsilon = 1\), the FAR and EER obtained against the LPMS i-vector system were \(99.99\%\) and \(99.95\%\), respectively

Basis of attack: Crafting adversarial examples using FGSM and the local distributional smoothness (LDS) method [31]
Corpus used: TIMIT
Observations:
\(\bullet\) Using the regularized model, EER improves by (i) \(+18.89\%\) and (ii) \(+5.54\%\) on the original test set
\(\bullet\) Furthermore, EER improves on the adversarial-example test set by (i) \(+30.11\%\) and (ii) \(+22.12\%\)

Basis of attack: Developing an audio-agnostic universal adversarial perturbation that is estimated once and then added to any input to distort it. Robustness is improved by modeling the room impulse response (RIR) [32]
Corpus used: Multi-speaker corpus from the Voice Cloning Toolkit (VCTK)
Observations:
\(\bullet\) A \(90\%\) attack success rate is achieved on both x-vector- and d-vector-based ASVs
\(\bullet\) Attack generation is sped up by a factor of 100. Both results were achieved in white box scenarios

Basis of attack: Crafting adversarial examples using the adversarial biometrics transformation network (ABTN), which jointly processes the losses of the PAD and ASV systems to generate black box and white box adversarial examples [33]
Corpus used: ASVspoof 2019
Observations:
\(\bullet\) ABTN outperforms other adversarial attacks, obtaining \(10.28\%\) and \(10.14\%\) higher joint EER w.r.t. PGD (\(\epsilon = 1.0\)) on the LA and PA test sets, respectively

Basis of attack: Voice conversion using weighted frequency warping (WFW) [34]
Corpus used: TIMIT and CMU-ARCTIC
Observations:
\(\bullet\) The WFW-based attack failed against speaker identification systems, as the source voice and its corresponding speaker could still be identified in numerous cases

Basis of attack: Text-to-speech (TTS) system comprising a speaker encoder network, a sequence-to-sequence synthesis network, and an auto-regressive WaveNet-based vocoder network that converts the Mel spectrogram into a time-domain signal [35]
Corpus used: VCTK and LibriSpeech
Observations:
\(\bullet\) The synthesized speech is demonstrated to be reasonably natural sounding, similar to real speech, even for unseen speakers
\(\bullet\) Human-level naturalness is not achieved despite the use of a WaveNet vocoder

Basis of attack: An autoencoder-based voice conversion system [36]
Corpus used: VCTK
Observations:
\(\bullet\) Generalizes well to unseen speakers
\(\bullet\) Speaker characteristics are disentangled from the linguistic content by the encoder bottleneck
\(\bullet\) Like [35], it also uses a WaveNet vocoder

Basis of attack: SV2TTS [27]
Corpus used: Customized data
Observations:
\(\bullet\) Azure and WeChat can each be made to accept at least one synthesized attack utterance

Basis of attack: ASV trained under unconstrained recording and speaking conditions [37]
Corpus used: Collected Impersonation Dataset (CID)
Observations:
\(\bullet\) Attacks using deepfake speech are more likely to succeed than other attacking techniques, including speech synthesis and impersonation by professionals
\(\bullet\) It is established that the fine structures present in speech due to the human speech production mechanism capture acoustic cues that discriminate natural speech from machine-generated speech, such as deepfake speech

Basis of attack: DolphinAttack: inaudible voice commands on ultrasonic carriers [38]
Corpus used: –
Observations:
\(\bullet\) Even though inaudible, DolphinAttack voice commands can successfully activate the voice-controlled systems of devices, such as Siri, Alexa, and Google Now
\(\bullet\) The attack enables various exploits, such as visiting a malicious website, spying, injecting fake information, and denial of service

Basis of attack: Targeted adversarial attack, called FAKEBOB, under the black box setting [50]
Corpus used: LibriSpeech and VoxCeleb
Observations:
\(\bullet\) An attack success rate of \(99\%\) is achieved on both open source and commercial systems
\(\bullet\) It is concluded that the speakers of the original voices are difficult to differentiate from those of the generated adversarial voices

Basis of attack: Two attacking setups: a different-speaker attack setup and a conversion attack setup [39]
Corpus used: MOBIO and VoxForge
Observations:
\(\bullet\) A statistically significant difference between the mean FARs of the two attacking methods on the ASV system is observed, with p-value \(= 0\) for males and p-value \(= 0.0015\) for females
\(\bullet\) The conversion attack is significantly more successful than the different-speaker attack

Basis of attack: SIRENATTACK: based on particle swarm optimization (PSO) and the fooling-gradient method [40]
Corpus used: Common Voice dataset and VCTK
Observations:
\(\bullet\) The attack threat is evaluated on the DeepSpeech model, in both black box and white box scenarios
\(\bullet\) In particular, on ASV systems, average success rates ranging from \(91.65\%\) to \(99.45\%\) are achieved across various models in the black box scenario

Basis of attack: Professional Swedish impersonator (male) [41]
Corpus used: –
Observations:
\(\bullet\) A low correlation between human perception and speaker verification scores is observed
\(\bullet\) Human listeners perceive prosodic features in addition to other speech characteristics, whereas machine-based ASV systems do not take prosodic features into account

Basis of attack: Voice identity morphing [42]
Corpus used: LibriSpeech
Observations:
\(\bullet\) A morph attack success rate of over \(80\%\) is observed on two popular speaker recognition systems (ECAPA-TDNN and x-vector)

Basis of attack: Voice synthesis and deepfake attacks [43]
Corpus used: Customized data
Observations:
\(\bullet\) More than \(30\%\) of the deepfake attacks were successful, and there was at least one successful attack for more than half of the participants
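
For the closest-target attacks of [27] and [28], the core step is ranking enrolled speakers by their similarity to the attacker's voice. Below is a minimal sketch, assuming speaker embeddings (e.g., x-vectors) have already been extracted and that cosine similarity is the scoring back-end; all function and variable names are illustrative, not taken from the cited papers:

```python
import numpy as np

def rank_targets_by_similarity(attacker_emb, enrolled_embs):
    """Rank enrolled speakers by cosine similarity to the attacker's
    embedding; the top-ranked (closest) speaker is the easiest target."""
    a = attacker_emb / np.linalg.norm(attacker_emb)
    E = enrolled_embs / np.linalg.norm(enrolled_embs, axis=1, keepdims=True)
    sims = E @ a               # cosine similarity per enrolled speaker
    order = np.argsort(-sims)  # closest first, farthest last
    return order, sims[order]

# Toy usage: 5 enrolled speakers with 256-dimensional embeddings
rng = np.random.default_rng(1)
order, sims = rank_targets_by_similarity(rng.normal(size=256),
                                         rng.normal(size=(5, 256)))
print("closest -> farthest:", order, sims.round(3))
```

This ordering is exactly what [28] probes when it reports transferability for the closest, median, and farthest speakers.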
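Since most rows above report attack success through EER and FAR, a worked example of how these quantities are computed may help. The sketch below estimates the EER by sweeping a decision threshold over genuine and impostor scores; the scores here are synthetic, for illustration only:

```python
import numpy as np

def compute_eer(genuine, impostor):
    """EER: the operating point where the false acceptance rate (FAR)
    equals the false rejection rate (FRR)."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))  # threshold where FAR and FRR cross
    return (far[i] + frr[i]) / 2, thresholds[i]

# Synthetic verification scores: target trials score higher on average
rng = np.random.default_rng(0)
eer, thr = compute_eer(rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000))
print(f"EER = {eer:.2%} at threshold {thr:.3f}")
```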
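The feature-level attacks of [30] and [31] rest on FGSM, which perturbs each feature value by \(\pm\epsilon\) in the direction that moves the ASV score the attacker's way. Here is a minimal targeted-impersonation sketch in PyTorch, assuming a differentiable embedding model and cosine scoring; these assumptions and all names are ours, not implementation details from [30]:

```python
import torch
import torch.nn.functional as F

def fgsm_impersonate(features, model, target_emb, epsilon):
    """One FGSM step on an acoustic feature matrix (e.g., MFCCs):
    ascend the gradient of the similarity to the target speaker,
    keeping the perturbation in an L-infinity ball of radius epsilon."""
    x = features.clone().detach().requires_grad_(True)
    score = F.cosine_similarity(model(x), target_emb, dim=-1).sum()
    score.backward()
    return (x + epsilon * x.grad.sign()).detach()  # adversarial features
```

The budget \(\epsilon\) controls the trade-off seen in the table: [30] needs \(\epsilon = 20\) for its black box numbers, while \(\epsilon = 1\) already suffices in the white box LPMS case.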