Published by Ashlie Collins. Modified over 5 years ago.
1
Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems
NDSS 2019 Hadi Abdullah, Washington Garcia, Christian Peeters, Patrick Traynor, Kevin R.B. Butler, and Joseph Wilson University of Florida
2
Voice as an Interface
Florida Institute for Cybersecurity Research
3
Injecting Commands
4
Injecting Commands (Demonstration)
5
Injecting Commands (Demonstration)
This ad is controversial; Google banned this voice command.
6
You might be thinking...
Benign?
Easy to defend?
7
Our Work!
Is there a generic, transferable way to produce audio that:
· sounds like noise to humans,
· sounds like a valid command to the system,
· works against both speech and speaker recognition systems,
· requires only black-box access to the target system?
8
Structure
· Modern Speech Recognition Systems
· Feature Extraction
· How the Attack Works
· Demo
· Takeaway
9
Modern Speech Recognition Systems
Audio Sample -> Preprocessing -> Feature Extraction -> Inference
10
Modern Speech Recognition Systems
Audio Sample -> Preprocessing -> Feature Extraction -> Inference
Audio Sample: a microphone records the voice, and the analog signal is converted to a digital signal.
11
Modern Speech Recognition Systems
Audio Sample -> Preprocessing -> Feature Extraction -> Inference
Preprocessing: low-pass filter
12
Modern Speech Recognition Systems
Audio Sample -> Preprocessing -> Feature Extraction -> Inference
Feature Extraction: voice -> MFCC
13
Modern Speech Recognition Systems
Audio Sample -> Preprocessing -> Feature Extraction -> Inference
14
Modern Speech Recognition Systems
Audio Sample -> Preprocessing -> Feature Extraction -> Inference
Most attacks* target the Inference stage.
* N. Carlini and D. Wagner, “Audio Adversarial Examples: Targeted Attacks on Speech-to-Text,” IEEE Deep Learning and Security Workshop, 2018.
* N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou, “Hidden Voice Commands,” USENIX Security Symposium, 2016.
* X. Yuan, Y. Chen, Y. Zhao, Y. Long, X. Liu, K. Chen, S. Zhang, H. Huang, X. Wang, and C. A. Gunter, “CommanderSong: A Systematic Approach for Practical Adversarial Voice Recognition,” USENIX Security Symposium, 2018.
* M. Alzantot, B. Balaji, and M. Srivastava, “Did you hear that? Adversarial Examples Against Automatic Speech Recognition,” NIPS 2017 Machine Deception Workshop.
15
Modern Speech Recognition Systems
Audio Sample -> Preprocessing -> Feature Extraction -> Inference
Most attacks* target Inference.
Our attack targets Feature Extraction.
MODEL DOES NOT MATTER!!
16
Structure
· Modern Speech Recognition Systems
· Feature Extraction
· How the Attack Works
· Demo
· Takeaway
17
Feature Extraction
· Designed to approximate the human ear
· Retains the most important features
· Uses the Magnitude Fast Fourier Transform (mFFT)
voice -> FFT -> feature (MFCC)
18
Feature Extraction (mFFT)
· Converts the signal from the time domain to the frequency domain
· Multiple inputs can have the same output
· mFFT is lossy
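The claim that multiple inputs can share one mFFT output can be checked numerically. The sketch below is an illustration only, not the authors' code: it assumes NumPy, uses random samples as a stand-in for one audio frame, and the helper name `magnitude_fft` is hypothetical. It compares a frame with a circularly shifted copy of itself.

```python
import numpy as np

def magnitude_fft(signal):
    """Return |FFT(signal)|. The phase is discarded, which is what makes mFFT lossy."""
    return np.abs(np.fft.rfft(signal))

rng = np.random.default_rng(0)
original = rng.standard_normal(1024)   # stand-in for one audio frame (assumption)
shifted = np.roll(original, 100)       # same samples, circularly delayed by 100

# The two waveforms differ sample by sample ...
waveforms_differ = not np.allclose(original, shifted)
# ... but a circular shift only changes phase, so the magnitudes agree.
same_magnitude = np.allclose(magnitude_fft(original), magnitude_fft(shifted))

print("waveforms differ:", waveforms_differ)
print("same magnitude spectrum:", same_magnitude)
```

Any transformation that changes only the phase of the spectrum is invisible to a feature extractor that keeps magnitudes alone.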
19
Structure
· Modern Speech Recognition Systems
· Feature Extraction
· How the Attack Works
· Demo
· Takeaway
20
How the Attack Works (Types)
The perturbations are grouped into 4 types.
Type 1: Time Domain Inversion (TDI)
· Split the audio into segments
· Reverse the samples within each segment
21
How the Attack Works (Types)
Example of TDI: “Alexa, tell me the weather”, sample rate: 48 kHz
voice -> MFCC; the TDI audio yields the same MFCC
Window sizes shown: TDI with 25 ms, TDI with 12.5 ms, TDI with 2 ms
Alexa can be triggered, but cannot understand the command
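A minimal sketch of time domain inversion, under stated assumptions: NumPy, random samples standing in for the recording, non-overlapping analysis windows that line up with the inversion windows, and no tapering window. The function names are hypothetical and this is not the authors' implementation. The sample rate and window sizes are the ones on the slide.

```python
import numpy as np

def time_domain_inversion(signal, sample_rate, window_ms):
    """Reverse the samples inside each consecutive window of window_ms milliseconds."""
    src = np.asarray(signal, dtype=float)
    window = max(1, int(round(sample_rate * window_ms / 1000)))
    out = np.empty_like(src)
    for start in range(0, len(src), window):
        out[start:start + window] = src[start:start + window][::-1]
    return out

def window_magnitudes(signal, window):
    """Magnitude FFT of each non-overlapping window (hypothetical helper)."""
    return [np.abs(np.fft.rfft(signal[s:s + window]))
            for s in range(0, len(signal), window)]

sample_rate = 48_000                      # 48 kHz, as on the slide
rng = np.random.default_rng(1)
audio = rng.standard_normal(4_800)        # 0.1 s stand-in for the voice command

results = {}
for window_ms in (25, 12.5, 2):           # window sizes from the slide
    window = int(round(sample_rate * window_ms / 1000))
    inverted = time_domain_inversion(audio, sample_rate, window_ms)
    results[window_ms] = all(
        np.allclose(a, b)
        for a, b in zip(window_magnitudes(audio, window),
                        window_magnitudes(inverted, window))
    )
    print(f"TDI with {window_ms} ms: same per-window magnitudes = {results[window_ms]}")
```

Reversing a real-valued window conjugates its spectrum up to a phase factor, so each window's magnitude is unchanged while the waveform itself is scrambled.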
22
How the Attack Works (Types)
Type 2: Random Phase Generation (RPG)
· Add random phase shifts
Examples shown: RPG, 25 ms; RPG, 2 ms
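One way to picture random phase generation is to keep each frequency bin's magnitude and replace its phase with a random value. The sketch below is an assumption about how this could be done with NumPy, not the authors' code; the function name is hypothetical. The DC bin (and the Nyquist bin for even-length frames) keeps its original phase, because those bins must stay real for a real-valued signal.

```python
import numpy as np

def random_phase_generation(frame, rng):
    """Return a real frame with the same magnitude spectrum but randomized phases."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.fft.rfft(frame)
    magnitude = np.abs(spectrum)
    phase = rng.uniform(-np.pi, np.pi, size=spectrum.shape)
    # Keep the DC (and Nyquist, if present) phase so those bins remain real.
    phase[0] = np.angle(spectrum[0])
    if len(frame) % 2 == 0:
        phase[-1] = np.angle(spectrum[-1])
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=len(frame))

rng = np.random.default_rng(2)
frame = rng.standard_normal(1_200)             # 25 ms at 48 kHz (stand-in data)
perturbed = random_phase_generation(frame, rng)

magnitudes_match = np.allclose(np.abs(np.fft.rfft(perturbed)),
                               np.abs(np.fft.rfft(frame)))
print("magnitudes preserved:", magnitudes_match)
print("waveform changed:", not np.allclose(perturbed, frame))
```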
23
How the Attack Works (Types)
Type 3: High Frequency Addition (HFA)
· Add noise at a fixed high frequency
· The noise will be removed by preprocessing
Example shown: noise at 12 kHz
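The idea that added high-frequency noise is removed by preprocessing can be sketched as follows. Everything beyond the 12 kHz tone is an assumption made for illustration: NumPy, an idealized brick-wall low-pass filter, an 8 kHz cutoff, two low-frequency tones as a stand-in for speech, and hypothetical function names. Real systems use other filters and cutoffs.

```python
import numpy as np

def add_high_frequency(signal, sample_rate, freq_hz, amplitude):
    """Add a fixed-frequency sine tone to the signal."""
    t = np.arange(len(signal)) / sample_rate
    return signal + amplitude * np.sin(2 * np.pi * freq_hz * t)

def low_pass(signal, sample_rate, cutoff_hz):
    """Idealized brick-wall low-pass filter: zero every bin above cutoff_hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0
    return np.fft.irfft(spectrum, n=len(signal))

sample_rate = 48_000
t = np.arange(sample_rate) / sample_rate                 # 1 second of audio
voice = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 1_000 * t)

noisy = add_high_frequency(voice, sample_rate, freq_hz=12_000, amplitude=0.5)

# What a listener hears differs, but the filtered signals agree.
heard_differs = not np.allclose(noisy, voice)
filtered_same = np.allclose(low_pass(noisy, sample_rate, 8_000),
                            low_pass(voice, sample_rate, 8_000))
print("played audio differs:", heard_differs)
print("identical after low-pass:", filtered_same)
```

The tone changes what a human hears, while the system's own filter discards it before feature extraction.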
24
How the Attack Works (Types)
Type 4: Time Scaling (TS)
· Adjust the playback speed, i.e. play the audio at a multiple of its normal speed
In the attack, a combination of these four types is used.
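Time scaling can be sketched as resampling the waveform while leaving the sample rate unchanged. The slide does not say how the scaling is done, so the linear-interpolation approach below is an assumption, written with NumPy and a hypothetical function name. This naive form also shifts pitch.

```python
import numpy as np

def time_scale(signal, factor):
    """Speed up (factor > 1) or slow down (factor < 1) by linear-interpolation resampling."""
    signal = np.asarray(signal, dtype=float)
    new_len = max(1, int(round(len(signal) / factor)))
    old_positions = np.arange(len(signal))
    new_positions = np.linspace(0, len(signal) - 1, new_len)
    return np.interp(new_positions, old_positions, signal)

sample_rate = 48_000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440 * t)        # 1 second stand-in tone

double_speed = time_scale(audio, 2.0)      # plays in half the time
print(len(audio) / sample_rate, "s ->", len(double_speed) / sample_rate, "s")
```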
25
How the Attack Works (Psychoacoustics)
· Intelligibility is hard to measure
· Fundamentals of psychoacoustics
· Spread energy across the spectrum
26
How the Attack Works (Evaluation)
Task: Speech: noise -> text; Speaker: noise -> user
Data: > 20,000 successful attack samples; 22 speakers
Queries: < 10 queries to the model (a few seconds!)
Models: x2
28
Structure
· Modern Speech Recognition Systems
· Feature Extraction
· How the Attack Works
· Demo
· Takeaway
29
Demo! (1/2)
Command: “Turn on the computer”
Method: time domain inversion + high frequency addition
More at: https://sites.google.com/view/practicalhiddenvoice/home
30
Demo! (2/2)
Command: “Make a call to mom”
Method: time domain inversion + high frequency addition
More at: https://sites.google.com/view/practicalhiddenvoice/home
31
Demo! (2/2)
More at: https://sites.google.com/view/practicalhiddenvoice/home
32
Structure
· Modern Speech Recognition Systems
· Feature Extraction
· How the Attack Works
· Demo
· Takeaway
33
Takeaway
· Simple, efficient audio transformations yield “noise” that is understood as commands by speech systems
· The model is irrelevant
· All systems we tested are vulnerable
· Achieves the same goals as traditional adversarial ML
Project webpage: sites.google.com/view/practicalhiddenvoice/home
hadiabdullah.github.io
hadiabdullah1
34
Defenses
· Defenses must be implemented at or before feature extraction
· Adversarial training?
· Voice activity detection?
· Environmental noise
· Liveness detection (Blue et al.*)
* L. Blue, L. Vargas, and P. Traynor, “Hello, is it me you're looking for? Differentiating between human and electronic speakers for voice interface security,” in 11th ACM Conference on Security and Privacy in Wireless and Mobile Networks, 2018.