Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems NDSS 2019 Hadi Abdullah, Washington Garcia, Christian Peeters, Patrick.

Presentation transcript:

Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems NDSS 2019 Hadi Abdullah, Washington Garcia, Christian Peeters, Patrick Traynor, Kevin R.B. Butler, and Joseph Wilson University of Florida

Voice as an Interface. Florida Institute for Cybersecurity Research.

Injecting Commands

Injecting Commands (Demonstration)

Injecting Commands (Demonstration): This ad is controversial; Google banned this voice command.

You might be thinking... Benign? Easy to defend?

Our Work! Is there a generic, transferrable way to produce audio that: sounds like noise to humans; sounds like a valid command to the system; works against both speech and speaker recognition systems; works with black-box access to the target system?

Structure: Modern Speech Recognition Systems; Feature Extraction; How the Attack Works; Demo; Takeaway

Modern Speech Recognition Systems: Audio Sample -> Preprocessing -> Feature Extraction -> Inference

Modern Speech Recognition Systems (Audio Sample): A microphone records the voice, and the analog signal is converted to a digital signal.

Modern Speech Recognition Systems (Preprocessing): Low-pass filter.

Modern Speech Recognition Systems (Feature Extraction): voice -> MFCC

Modern Speech Recognition Systems (Inference): Most attacks* target this stage. * N. Carlini and D. Wagner, “Audio Adversarial Examples: Targeted Attacks on Speech-to-Text,” in IEEE Deep Learning and Security Workshop, 2018; N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou, “Hidden voice commands,” in USENIX Security Symposium, 2016; X. Yuan, Y. Chen, Y. Zhao, Y. Long, X. Liu, K. Chen, S. Zhang, H. Huang, X. Wang, and C. A. Gunter, “Commandersong: A systematic approach for practical adversarial voice recognition,” in Proceedings of the USENIX Security Symposium, 2018; M. Alzantot, B. Balaji, and M. Srivastava, “Did you hear that? Adversarial examples against automatic speech recognition,” in NIPS 2017 Machine Deception Workshop.

Modern Speech Recognition Systems: Most attacks* target Inference. Our attack targets Feature Extraction. MODEL DOES NOT MATTER!!

Feature Extraction: Designed to approximate the human ear. Retains the most important features. Uses the Magnitude Fast Fourier Transform (mFFT). Pipeline: voice -> FFT -> feature (MFCC).

Feature Extraction (mFFT): Converts the time domain to the frequency domain. Multiple inputs can have the same output. The mFFT is lossy.
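
To make the "multiple inputs, same output" point concrete, here is a minimal sketch that is not taken from the slides. It computes a magnitude spectrum with a naive DFT in plain Python and checks that a window and its time-reversed copy give the same magnitudes. The function name `magnitude_spectrum` and the sample values are assumptions chosen for illustration.

```python
import cmath
import math

def magnitude_spectrum(window):
    """Return |DFT| of a real-valued window (naive O(n^2) DFT, for illustration)."""
    n = len(window)
    return [
        abs(sum(window[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]

# Hypothetical samples standing in for one short window of audio.
window = [0.0, 0.9, 0.3, -0.5, -0.8, 0.1, 0.6, 0.2]
reversed_window = window[::-1]

original = magnitude_spectrum(window)
flipped = magnitude_spectrum(reversed_window)

# Different inputs, same magnitudes: the phase that tells them apart is discarded.
same_magnitudes = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(original, flipped)
)
print(window != reversed_window, same_magnitudes)
```

Taking the magnitude throws away the phase of every frequency bin, which is why two different waveforms can be indistinguishable after this step.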

How the Attack Works (Types): The perturbations are grouped into 4 types. Type 1: Time Domain Inversion (TDI). Split the audio into segments, then reverse the samples within each segment.

How the Attack Works (Types): TDI example. Command: “Alexa, tell me the weather”, sample rate 48 kHz. The voice and its TDI version yield the same MFCC. Samples: TDI with 25 ms, TDI with 12.5 ms, TDI with 2 ms. Alexa can be triggered, but cannot understand the command.
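
The segment-and-reverse step can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the function name `time_domain_inversion`, its parameters, and the choice to reverse a short trailing segment on its own are all assumptions.

```python
def time_domain_inversion(samples, sample_rate_hz, window_ms):
    """Reverse the samples inside each consecutive window of window_ms milliseconds."""
    window_len = max(1, int(sample_rate_hz * window_ms / 1000))
    out = []
    for start in range(0, len(samples), window_len):
        # A shorter final window is reversed on its own (an assumption).
        out.extend(reversed(samples[start:start + window_len]))
    return out

# At 48 kHz, the window lengths named on the slide map to these sample counts.
for ms in (25, 12.5, 2):
    print(ms, "ms ->", int(48000 * ms / 1000), "samples")

# Toy signal: 1 kHz sample rate and 3 ms windows, so each window holds 3 samples.
toy = [1, 2, 3, 4, 5, 6, 7]
print(time_domain_inversion(toy, 1000, 3))
```

Shorter windows keep each reversed piece closer to the original, while longer windows scramble more of the signal at once.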

How the Attack Works (Types): Type 2: Random Phase Generation (RPG). Add phase shifts. Samples: RPG with 25 ms, RPG with 2 ms.
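
One way to read "add phase shifts" is to keep the magnitude of every frequency bin and draw a new random phase for it. The sketch below assumes that reading; it uses a naive DFT in plain Python, and the names `dft`, `idft`, and `randomize_phase` are hypothetical helpers, not part of the presented work.

```python
import cmath
import math
import random

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def randomize_phase(window, rng):
    """Keep each bin's magnitude but draw a new random phase for it."""
    n = len(window)
    spec = dft(window)
    for k in range(1, (n + 1) // 2):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        spec[k] = abs(spec[k]) * cmath.exp(1j * phase)
        # The mirror bin gets the conjugate so the inverse transform stays real.
        spec[n - k] = spec[k].conjugate()
    return [value.real for value in idft(spec)]

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
window = [math.sin(0.7 * t) + 0.5 * math.cos(2.1 * t) for t in range(16)]
shifted = randomize_phase(window, rng)

before = [abs(v) for v in dft(window)]
after = [abs(v) for v in dft(shifted)]
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(before, after)))
```

The DC bin (and the Nyquist bin for even lengths) is left untouched here because it must stay real for the output to be a real-valued signal.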

How the Attack Works (Types): Type 3: High Frequency Addition. Add noise at a fixed frequency; the noise is removed by preprocessing. Sample: noise at 12 kHz.
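
A small sketch of why an added high-frequency tone can disappear in preprocessing. The 4-sample moving average below is only a stand-in for a low-pass filter and is an assumption; real systems use other filters. The names `add_tone` and `moving_average` and the 300 Hz "speech" signal are likewise hypothetical.

```python
import math

def add_tone(samples, sample_rate_hz, tone_hz, amplitude):
    """Add a fixed-frequency sine wave to the samples."""
    return [
        s + amplitude * math.sin(2.0 * math.pi * tone_hz * i / sample_rate_hz)
        for i, s in enumerate(samples)
    ]

def moving_average(samples, width):
    """Crude low-pass filter: average each run of `width` consecutive samples."""
    return [sum(samples[i:i + width]) / width
            for i in range(len(samples) - width + 1)]

SAMPLE_RATE_HZ = 48000
# Hypothetical low-frequency stand-in for speech: a 300 Hz sine.
speech = [math.sin(2.0 * math.pi * 300 * i / SAMPLE_RATE_HZ) for i in range(480)]
noisy = add_tone(speech, SAMPLE_RATE_HZ, 12000, 0.5)

# A 4-sample average has a null at 48000 / 4 = 12000 Hz, so the tone cancels.
clean_filtered = moving_average(speech, 4)
noisy_filtered = moving_average(noisy, 4)
largest_gap = max(abs(a - b) for a, b in zip(clean_filtered, noisy_filtered))
print(largest_gap < 1e-9)
```

The tone is clearly present in `noisy`, yet after filtering the two signals agree, so the system sees the same input while a listener hears extra noise.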

How the Attack Works (Types): Type 4: Time Scaling. Adjust the playback speed, that is, play the audio at a multiple of normal speed. The attack uses a combination of these four types.
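
Changing the playback speed can be sketched as simple resampling. The version below uses linear interpolation, which also shifts pitch; whether the presented attack scales time this way is an assumption, and `time_scale` is a hypothetical helper.

```python
def time_scale(samples, speed):
    """Resample with linear interpolation so playback is `speed` times faster."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    n = len(samples)
    out_len = int(n / speed)
    out = []
    for i in range(out_len):
        pos = i * speed                 # fractional position in the input
        lo = min(int(pos), n - 1)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out

ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(time_scale(ramp, 2))    # twice as fast: half as many samples
print(time_scale(ramp, 1.5))
```

A speed above 1 shortens the audio and a speed below 1 stretches it; in this sketch the output length is the input length divided by `speed`, rounded down.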

How the Attack Works (Psychoacoustics): Intelligibility is hard to measure. Fundamentals of psychoacoustics. Spread energy across the spectrum.

How the Attack Works (Evaluation), for Speech and Speaker systems. Task: Noise -> text (Speech); Noise -> user (Speaker). Data: > 20,000 successful attack samples; 22 speakers. Queries: < 10 queries to model (a few seconds!). Models: x2.

Demo! (1/2): Turn on the computer. Method: time domain inversion + high frequency addition. More at: https://sites.google.com/view/practicalhiddenvoice/home

Demo! (2/2): Make a call to mom. Method: time domain inversion + high frequency addition. More at: https://sites.google.com/view/practicalhiddenvoice/home

Takeaway: Simple, efficient audio transformations yield “noise” that is understood as commands by speech systems. The model is irrelevant. All systems we tested are vulnerable. Achieve the same goals as traditional Adversarial ML. Project webpage: sites.google.com/view/practicalhiddenvoice/home. Contact: hadi10102@ufl.edu, hadiabdullah.github.io, hadiabdullah1

Defenses: Must be implemented at or before feature extraction. Adversarial Training? Voice Activity Detection? Environmental Noise. Liveness Detection (Blue et al.*). * L. Blue, L. Vargas, and P. Traynor, “Hello, is it me you're looking for? Differentiating between human and electronic speakers for voice interface security,” in 11th ACM Conference on Security and Privacy in Wireless and Mobile Networks, 2018.