Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems
NDSS 2019
Hadi Abdullah, Washington Garcia, Christian Peeters, Patrick Traynor, Kevin R.B. Butler, and Joseph Wilson
University of Florida

Voice as an Interface
Florida Institute for Cybersecurity Research

Injecting Commands

Injecting Commands (Demonstration)
This ad is controversial: Google banned this voice command

You might be thinking...
· Benign?
· Easy to defend?

Our Work!
Is there a generic, transferable way to produce audio that:
· sounds like noise to humans,
· sounds like a valid command to the system,
· works against both speech and speaker recognition systems,
· needs only black-box access to the target system?

Structure
· Modern Speech Recognition Systems
· Feature Extraction
· How the Attack Works
· Demo
· Takeaway

Modern Speech Recognition Systems
Audio Sample -> Preprocessing -> Feature Extraction -> Inference
· Audio Sample: a microphone records the voice, and the analog signal is converted to a digital signal
· Preprocessing: low-pass filter
· Feature Extraction: voice -> MFCC
· Inference

Modern Speech Recognition Systems
Audio Sample -> Preprocessing -> Feature Extraction -> Inference
· Inference: most attacks*

* N. Carlini and D. Wagner, “Audio Adversarial Examples: Targeted Attacks on Speech-to-Text,” in IEEE Deep Learning and Security Workshop, 2018.
* N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou, “Hidden Voice Commands,” in USENIX Security Symposium, 2016.
* X. Yuan, Y. Chen, Y. Zhao, Y. Long, X. Liu, K. Chen, S. Zhang, H. Huang, X. Wang, and C. A. Gunter, “CommanderSong: A systematic approach for practical adversarial voice recognition,” in USENIX Security Symposium, 2018.
* M. Alzantot, B. Balaji, and M. Srivastava, “Did you hear that? Adversarial examples against automatic speech recognition,” in NIPS 2017 Machine Deception Workshop.

Modern Speech Recognition Systems
Audio Sample -> Preprocessing -> Feature Extraction -> Inference
· Inference: most attacks*
· Feature Extraction: our attack
MODEL DOES NOT MATTER!!

Feature Extraction
· Designed to approximate the human ear
· Retains the most important features
· Magnitude Fast Fourier Transform (mFFT)
voice -> FFT -> Feature (MFCC)

Feature Extraction (mFFT)
· Converts the time domain to the frequency domain
· Multiple inputs can have the same output
· mFFT is lossy
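The many-to-one property can be checked on a toy frame. The sketch below is an illustration only, not the authors' code: it uses a naive DFT to stay dependency-free, and the frame contents, the frame length of 32, and the 5-sample circular shift are all assumptions chosen for the example. Two different sample sequences end up with the same magnitude spectrum because the shift only changes the phase that the magnitude step throws away.

```python
import cmath
import math


def magnitude_dft(x):
    """Return |DFT(x)| using a naive O(n^2) transform (fine for toy sizes)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


# Hypothetical toy frame: two sinusoids with different phases.
n = 32
frame = [
    math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.cos(2 * math.pi * 5 * t / n + 0.7)
    for t in range(n)
]

# A circular shift changes every sample but only rotates the phase of each bin.
shifted = frame[5:] + frame[:5]

mag_frame = magnitude_dft(frame)
mag_shifted = magnitude_dft(shifted)
max_diff = max(abs(a - b) for a, b in zip(mag_frame, mag_shifted))

print("inputs differ:", frame != shifted)
print("magnitude spectra match:", max_diff < 1e-9)
```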

How the Attack Works (Types)
The perturbations are grouped into 4 types.
Type 1: Time Domain Inversion (TDI)
· Split the audio into segments
· Reverse the samples within each segment

Example of TDI
Command: “Alexa, tell me the weather”, sample rate: 48 kHz
voice -> MFCC; TDI audio -> same MFCC
· TDI with 25 ms segments
· TDI with 12.5 ms segments
· TDI with 2 ms segments
Alexa can be triggered, but cannot understand the command.
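A minimal sketch of time domain inversion, assuming the attack segments line up exactly with the frames the feature extractor analyzes (a real system need not cooperate this way). The function name `time_domain_inversion`, the synthetic test signal, and the naive DFT helper are all assumptions made for illustration. The 48 kHz sample rate and 2 ms segment length come from the slide. Reversing a real-valued segment conjugates its spectrum up to a phase factor, so the per-segment magnitude spectrum is unchanged.

```python
import cmath
import math


def magnitude_dft(x):
    """Return |DFT(x)| using a naive O(n^2) transform (fine for toy sizes)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def time_domain_inversion(samples, window_len):
    """Reverse the samples inside each consecutive window of `window_len`."""
    out = []
    for start in range(0, len(samples), window_len):
        out.extend(reversed(samples[start:start + window_len]))
    return out


sample_rate = 48_000                           # Hz, as on the slide
window_ms = 2                                  # one of the segment sizes on the slide
window_len = sample_rate * window_ms // 1000   # 96 samples

# Hypothetical stand-in for speech: a decaying, rising tone, 3 windows long.
samples = [
    math.exp(-t / 200) * math.sin(2 * math.pi * (400 + 5 * t) * t / sample_rate)
    for t in range(3 * window_len)
]
attacked = time_domain_inversion(samples, window_len)

# Compare the magnitude spectrum window by window.
worst = 0.0
for start in range(0, len(samples), window_len):
    a = magnitude_dft(samples[start:start + window_len])
    b = magnitude_dft(attacked[start:start + window_len])
    worst = max(worst, max(abs(p - q) for p, q in zip(a, b)))

print("audio changed:", attacked != samples)
print("per-window magnitudes match:", worst < 1e-9)
```

Shorter segments scramble the waveform at a finer grain while leaving each segment's magnitudes intact.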

Type 2: Random Phase Generation (RPG)
· Add random phase shifts
· Examples: RPG, 25 ms; RPG, 2 ms
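One way to picture random phase generation is to keep the magnitude of every frequency bin and draw a fresh phase for it. The sketch below is an assumption about how that could be done on a single frame, not the authors' implementation: the helper names, the frame length of 64, the toy signal, and the fixed random seed are hypothetical. Conjugate symmetry is kept so that the output stays real-valued.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft_real(spectrum):
    """Naive inverse DFT; returns the real part of each output sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def random_phase_generation(frame, rng):
    """Keep each bin's magnitude, randomize its phase (frame length must be even)."""
    n = len(frame)
    spectrum = dft(frame)
    new_spectrum = list(spectrum)  # bins 0 and n/2 are left untouched
    for k in range(1, n // 2):
        phase = rng.uniform(-math.pi, math.pi)
        new_spectrum[k] = cmath.rect(abs(spectrum[k]), phase)
        new_spectrum[n - k] = new_spectrum[k].conjugate()  # keeps the signal real
    return idft_real(new_spectrum)


# Hypothetical toy frame: two tones plus a slow ramp.
n = 64
frame = [
    math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 7 * t / n + 1.0) + 0.1 * t / n
    for t in range(n)
]
attacked = random_phase_generation(frame, random.Random(0))

mag_before = [abs(c) for c in dft(frame)]
mag_after = [abs(c) for c in dft(attacked)]
max_diff = max(abs(a - b) for a, b in zip(mag_before, mag_after))

print("waveform changed:", attacked != frame)
print("magnitude spectrum preserved:", max_diff < 1e-9)
```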

Type 3: High Frequency Addition
· Add noise at a fixed frequency
· The noise will be removed by preprocessing
· Example: noise at 12 kHz
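The effect of high frequency addition can be sketched with a toy low-pass filter. Everything here is an assumption made for illustration: the target system's real filter is not described on the slide, so a 4-tap moving average stands in for it. At a 48 kHz sample rate that filter has a null at exactly 12 kHz, which is why it removes the added tone completely in this example. A real filter would attenuate rather than perfectly cancel it.

```python
import math


def moving_average(x, taps):
    """Crude FIR low-pass: mean of each run of `taps` consecutive samples."""
    return [sum(x[i:i + taps]) / taps for i in range(len(x) - taps + 1)]


sample_rate = 48_000   # Hz
noise_hz = 12_000      # Hz, the example frequency from the slide
n = 480                # 10 ms of audio

# Hypothetical stand-in for the voice command: a 500 Hz tone.
voice = [math.sin(2 * math.pi * 500 * t / sample_rate) for t in range(n)]
noise = [math.sin(2 * math.pi * noise_hz * t / sample_rate + 0.3) for t in range(n)]
attack = [v + w for v, w in zip(voice, noise)]

# A moving average over one full period of the noise cancels it.
taps = sample_rate // noise_hz
clean_filtered = moving_average(voice, taps)
attack_filtered = moving_average(attack, taps)

added = max(abs(a - v) for a, v in zip(attack, voice))
residual = max(abs(a - c) for a, c in zip(attack_filtered, clean_filtered))

print("peak of added noise before filtering:", round(added, 3))
print("noise gone after filtering:", residual < 1e-9)
```

What the listener hears contains the loud tone, while the filtered signal passed on to feature extraction is the same as for the clean audio.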

Type 4: Time Scaling
· Adjust the playback speed (i.e., play the audio at a multiple of normal speed)
In the attack: use a combination of these four types.
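Time scaling (Type 4) can be sketched as resampling at a fixed sample rate. The function below is a hypothetical illustration that uses linear interpolation. The slide does not say how the speed change is implemented, so the method and the name `time_scale` are assumptions.

```python
def time_scale(samples, speed):
    """Return `samples` played `speed` times faster, via linear interpolation."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    out_len = int(len(samples) / speed)
    out = []
    for i in range(out_len):
        pos = i * speed                        # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the last sample
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


ramp = list(range(8))
fast = time_scale(ramp, 2.0)      # twice as fast: half as many samples
faster = time_scale(ramp, 1.5)    # non-integer speeds interpolate between samples

print(fast)
print(faster)
```

A speed above 1 shortens the audio. Combining it with the other three perturbation types is left out of this sketch.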

How the Attack Works (Psychoacoustics)
· Intelligibility is hard to measure
· Fundamentals of psychoacoustics
· Spread energy across the spectrum

How the Attack Works (Evaluation)
Targets: speech recognition and speaker recognition
· Task: Noise -> text (speech); Noise -> user (speaker)
· Data: > 20,000 successful attack samples; 22 speakers
· Queries: < 10 queries to the model (a few seconds!)
· Models: x2

Demo! (1/2)
Command: “Turn on the computer”
Method: time domain inversion + high frequency addition
More at: https://sites.google.com/view/practicalhiddenvoice/home

Demo! (2/2)
Command: “Make a call to mom”
Method: time domain inversion + high frequency addition
More at: https://sites.google.com/view/practicalhiddenvoice/home

Takeaway
· Simple, efficient audio transformations yield “noise” that is understood as commands by speech systems
· The model is irrelevant
· All systems we tested are vulnerable
· Achieves the same goals as traditional adversarial ML
Project webpage: sites.google.com/view/practicalhiddenvoice/home
hadiabdullah.github.io
hadiabdullah1

Defenses
· Must be implemented at or before feature extraction
· Adversarial training?
· Voice activity detection?
· Environmental noise
· Liveness detection (Blue et al.*)

* L. Blue, L. Vargas, and P. Traynor, “Hello, is it me you’re looking for? Differentiating between human and electronic speakers for voice interface security,” in 11th ACM Conference on Security and Privacy in Wireless and Mobile Networks, 2018.
