Florian Tramèr (joint work with Dan Boneh)

Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware
Florian Tramèr (joint work with Dan Boneh)
Stanford security lunch – June 13th

Trusted execution of ML: 3 motivating scenarios
Growing software stack; ML used in security-critical tasks => need for trust => 3 motivating scenarios
Outsourced ML:
- Data privacy
- Integrity: model "downgrade", disparate impact, other malicious tampering

Trusted execution of ML: 3 motivating scenarios
Federated Learning:
- Integrity: need integrity of computation and data collection
- Data privacy

Trusted execution of ML: 3 motivating scenarios
Infected Hosts:
- Integrity: malware, Trojaned hardware

Solutions
Cryptography:
- Outsourced ML: FHE, MPC, (ZK) proof systems
- Federated learning: no countermeasure for poisoning…
- Infected hosts: verifiable computation + some root of trust
Trusted Execution Environments (TEEs):
- Outsourced ML: isolated enclaves
- Federated learning: trusted sensors + isolated enclaves
- Infected hosts: isolated enclaves / hardware from trusted manufacturer

Trusted Execution: At what cost?
Trusted ASICs (Wahby et al.): ~10^8× worse than SOTA
Intel SGX:
- GPU: Nvidia TITAN XP
- SGX: Intel Core i7-6700 Skylake, single core @ 3.40GHz
- Paging kicks in at ~90MB
https://medium.com/@danny_harnik/impressions-of-intel-sgx-performance-22442093595a

“How do we efficiently leverage TEEs for secure machine learning computations?”
Idea: outsource work to a collocated, faster but untrusted device and verify the results: the TEE sends x and gets back F(x) together with a proof.
The bottleneck will be verifier complexity, not prover overhead.
Comparison:
- Verifiable ASICs (Wahby et al., 2016): arithmetic circuits; required gap ~8 orders of magnitude; privacy: no
- Slalom: DNN inference; required gap ~1-2 orders; privacy: “yes”

Goal + threat model
- The model is known to the adversary (but not necessarily to the client)
- The adversary controls the rest of the software / hardware stack outside the TEE
- The user has a secure communication channel with the TEE
Goal: efficiently run DNN inference F(x)
- Integrity: the user obtains F(x) or aborts
- Privacy: the adversary learns nothing about x

Bottlenecks in deep neural networks
- Non-linear operations: cheap
- Matrix multiplication: dominates, ~97% of VGG16 inference time on 1 CPU core

Outsourcing matrix multiplication: Freivalds' algorithm
Input: X ∈ 𝔽^(n×n), W ∈ 𝔽^(n×n) (W: the DNN weights, fixed at inference time)
Direct compute: Z = X * W, ≈ n^3 multiplications (or O(n^2.81) with Strassen)
Outsource + verify:
- Sample r ← 𝔽^n uniformly at random
- Check: Z*r = X * (W * r)
- Complexity: ≈ 3n^2 multiplications
- Soundness: 1 / |𝔽| (boost by repeating)
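
As a concrete illustration, here is a minimal NumPy sketch of the Freivalds check; the modulus P and the matrix size are illustrative assumptions, not Slalom's actual parameters.

```python
import numpy as np

P = 1_000_003  # illustrative prime modulus below 2^24 (not Slalom's exact field)

def freivalds_check(X, W, Z, reps=1):
    """Probabilistically verify Z == X @ W (mod P) with ~3n^2 work per repetition.
    A wrong Z passes a single repetition with probability at most 1/P."""
    n = W.shape[1]
    for _ in range(reps):
        r = np.random.randint(0, P, size=n, dtype=np.int64)
        lhs = Z @ r % P             # Z * r: n^2 multiplications
        rhs = X @ (W @ r % P) % P   # X * (W * r): 2n^2 multiplications
        if not np.array_equal(lhs, rhs):
            return False
    return True

n = 256
X = np.random.randint(0, P, size=(n, n), dtype=np.int64)
W = np.random.randint(0, P, size=(n, n), dtype=np.int64)
Z = X @ W % P                       # claimed product from the untrusted device
assert freivalds_check(X, W, Z)     # cheap check inside the TEE
```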

Batched and preprocessed verification
Some DNN layers are *not* matrix-matrix multiplications. E.g., a dense layer on a single input is a vector-matrix product, x*W:
- Compute: ≈ n^2
- Freivalds: ≈ 3n^2 (no savings)
Verify a batch of inputs instead: Z = [x1, x2, …, xB] * W
- Compute: ≈ Bn^2
- Freivalds: ≈ Bn + 2n^2
Preprocess the learned weights: W' = W*r
- Freivalds: ≈ Bn + n^2
The same randomness r can be reused for multiple checks if r is kept secret from the adversary.
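
A minimal sketch of the batched check with preprocessed weights, under the same illustrative modulus and sizes:

```python
import numpy as np

P = 1_000_003          # illustrative prime modulus
n, B = 1024, 8

W = np.random.randint(0, P, size=(n, n), dtype=np.int64)   # fixed DNN weights
X = np.random.randint(0, P, size=(B, n), dtype=np.int64)   # batch of B inputs

# One-time preprocessing while W is fixed; r must stay secret inside the TEE.
r = np.random.randint(0, P, size=n, dtype=np.int64)
W_r = W @ r % P        # W' = W*r, about n^2 multiplications, done once

Z = X @ W % P          # claimed batch of results from the untrusted device

# Online check: about B*n multiplications on each side.
lhs = Z @ r % P
rhs = X @ W_r % P
assert np.array_equal(lhs, rhs)
```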

Handling convolutions
VGG16 dimensions: K = 3, 3 ≤ C ≤ 512, 64 ≤ D ≤ 512, 14^2 ≤ N ≤ 224^2
Multiplications per operation:
- Compute Z = im2col([x1, …, xB]) * W: B*N*K^2*C*D
- Batched verify r1 * Z * r2 = im2col(r1 * X) * (W * r2): B*N*D + B*N*C + K^2*C*D + N*K^2*C
Savings even if B = 1. Soundness: 2 / |𝔽|
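
The same two-sided check written out naively for a convolution in NumPy. The im2col helper, shapes, and modulus are illustrative; the point is that the verifier only needs r1 * Z, im2col(r1 * X), and W * r2.

```python
import numpy as np

P = 1_000_003   # illustrative prime modulus

def im2col(x, K):
    """Naive im2col for one image x of shape (H, W, C): stride 1, zero padding
    so the output keeps the same spatial size. Returns (H*W, K*K*C)."""
    H, Wd, C = x.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    cols = np.empty((H * Wd, K * K * C), dtype=np.int64)
    for i in range(H):
        for j in range(Wd):
            cols[i * Wd + j] = xp[i:i + K, j:j + K, :].reshape(-1)
    return cols

B, H, Wd, C, D, K = 2, 16, 16, 8, 16, 3
N = H * Wd
X = np.random.randint(0, P, size=(B, H, Wd, C), dtype=np.int64)
Wk = np.random.randint(0, P, size=(K * K * C, D), dtype=np.int64)  # flattened kernel

# Untrusted device: convolution as im2col + matmul; Z has shape (B, N, D).
Z = np.stack([im2col(X[b], K) @ Wk % P for b in range(B)])

# TEE: two-sided batched check with soundness 2/|F|.
r1 = np.random.randint(0, P, size=B, dtype=np.int64)   # random over the batch
r2 = np.random.randint(0, P, size=D, dtype=np.int64)   # random over output channels
lhs = (np.tensordot(r1, Z, axes=1) % P) @ r2 % P       # r1 * Z * r2, a length-N vector
rhs = im2col(np.tensordot(r1, X, axes=1) % P, K) @ (Wk @ r2 % P) % P
assert np.array_equal(lhs, rhs)
```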

Preprocessing for convolutions (or arbitrary linear ops)
Linear operator: z = F_A(x) = x * A, with A a matrix of size |x| × |z|, x a vector of size |x|, z a vector of size |z|
Precompute: A' = A * r = (∇_x F)(r), easy to compute without making A explicit!
Check: z * r = x * A', just 2 inner products!
Complexity: |z| + |x| multiplications
For convolutions, |x| = B*N*C and |z| = B*N*D. Multiplications:
- Compute: B*N*K^2*C*D
- Batched verify: B*N*D + B*N*C + K^2*C*D + N*K^2*C
- Preprocessed: B*N*D + B*N*C
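
To illustrate how A' = (∇_x F)(r) can be computed with one backward pass instead of materializing A, here is a small float-valued sketch using JAX's vjp; the convolution, shapes, and tolerances are illustrative assumptions (Slalom itself works over a field mod P, not over floats).

```python
import jax
import jax.numpy as jnp

B, H, W, C, D, K = 2, 16, 16, 8, 16, 3
kernel = jax.random.normal(jax.random.PRNGKey(0), (K, K, C, D))

def F(x):
    # the linear operator z = F_A(x): a stride-1, same-padded convolution
    return jax.lax.conv_general_dilated(
        x, kernel, window_strides=(1, 1), padding="SAME",
        dimension_numbers=("NHWC", "HWIO", "NHWC"))

x = jax.random.normal(jax.random.PRNGKey(1), (B, H, W, C))
z = F(x)                                   # claimed output from the untrusted device
r = jax.random.normal(jax.random.PRNGKey(2), z.shape)

# Preprocessing: A' = A*r = (grad_x F)(r), one backward pass, A never materialized.
_, vjp_F = jax.vjp(F, x)
(A_r,) = vjp_F(r)

# Online check: two inner products, |z| + |x| multiplications.
assert jnp.allclose(jnp.vdot(z, r), jnp.vdot(x, A_r), rtol=1e-4, atol=1e-2)
```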

Preserving privacy: offline precomputation + online blinding
Without blinding, the TEE would send X in the clear and receive X * W.
Instead, blind with a "one-time pad" over 𝔽:
- Offline: precompute and store R, R*W
- Online: the TEE sends X+R; the untrusted device returns (X+R) * W; the TEE unblinds using R*W
Secret sharing instead? Can these devices be "collocated" yet "non-colluding"?
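
A minimal sketch of the blinding step, with an illustrative modulus and sizes (in Slalom the factors R are regenerated from a PRG and R*W is stored encrypted, as detailed two slides below):

```python
import numpy as np

P = 1_000_003   # illustrative prime modulus
n = 512

W = np.random.randint(0, P, size=(n, n), dtype=np.int64)   # model weights (known to the adversary)
X = np.random.randint(0, P, size=(1, n), dtype=np.int64)   # private input, held by the TEE

# Offline phase (inside the TEE): one-time blinding factor and its product with W.
R = np.random.randint(0, P, size=(1, n), dtype=np.int64)
R_W = R @ W % P                 # stored for the online phase (encrypted, in Slalom)

# Online phase:
X_blind = (X + R) % P           # TEE sends only the blinded input ("one-time pad" over F_P)
Y_blind = X_blind @ W % P       # untrusted device computes on the blinded input
Y = (Y_blind - R_W) % P         # TEE unblinds using the precomputed R*W

assert np.array_equal(Y, X @ W % P)
```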

Slalom
[Diagram: the network F alternates linear layers (weights W1, W2, …) with non-linearities σ; for each linear layer the TEE sends the input Xi to the untrusted device, receives Zi = Xi * Wi, and evaluates the non-linearities itself.]
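
Below is a schematic, simplified sketch of how blinding, outsourcing, and the preprocessed Freivalds check might compose per layer for a toy two-layer network; it is not Slalom's actual implementation, which additionally handles quantization, convolutions, encrypted storage of R*W, and batching.

```python
import numpy as np

P = 1_000_003   # illustrative prime modulus

def untrusted_linear(x, W):
    # stand-in for the fast, untrusted device (the GPU)
    return x @ W % P

def relu_mod(x):
    # non-linearity evaluated inside the TEE on centered representatives mod P
    signed = np.where(x > P // 2, x - P, x)
    return np.maximum(signed, 0) % P

def slalom_layer(x, W, W_r, r, R, R_W):
    """One linear layer: blind, outsource, unblind, then check with Freivalds.
    (r, W_r = W*r) are preprocessed verification values and (R, R_W = R*W)
    are precomputed blinding values; all stay secret inside the TEE."""
    z_blind = untrusted_linear((x + R) % P, W)
    z = (z_blind - R_W) % P
    assert np.array_equal(z @ r % P, x @ W_r % P)   # cheap integrity check
    return z

# Toy two-layer network F(x) = relu(x * W1) * W2, evaluated over F_P.
n = 128
W1 = np.random.randint(0, P, size=(n, n), dtype=np.int64)
W2 = np.random.randint(0, P, size=(n, n), dtype=np.int64)
x = np.random.randint(0, P, size=(1, n), dtype=np.int64)

# Offline preprocessing inside the TEE.
r = np.random.randint(0, P, size=n, dtype=np.int64)
R1 = np.random.randint(0, P, size=(1, n), dtype=np.int64)
R2 = np.random.randint(0, P, size=(1, n), dtype=np.int64)
layers = [(W1, W1 @ r % P, r, R1, R1 @ W1 % P),
          (W2, W2 @ r % P, r, R2, R2 @ W2 % P)]

# Online inference.
h = relu_mod(slalom_layer(x, *layers[0]))
y = slalom_layer(h, *layers[1])
```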

Slalom (some details)
Quantization:
- DNNs are typically trained / evaluated in floating point
- Freivalds / blinding require working over a ring/field 𝔽
- Quantize inputs & weights and work mod P (P < 2^24)
Integrity checks:
- Eval the DNN on the fast device and store the inputs/outputs of all linear ops ⟹ close to no prover overhead
- Sample r from 𝔽 and do the Freivalds check in double precision ⟹ verifier complexity is at least |x| + |z| double muls per linear layer
Blinding:
- Store the unblinding factors R*W encrypted in untrusted memory
- In the online phase, decrypt (and authenticate) R*W to unblind
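
A small sketch of the quantization step; the scale factor and modulus are illustrative assumptions rather than Slalom's exact fixed-point parameters.

```python
import numpy as np

P = 1_000_003    # illustrative prime modulus (< 2^24)
SCALE = 2**8     # illustrative fixed-point scale for inputs and weights

def quantize(a):
    """Map floats to fixed-point residues mod P (centered representation)."""
    return np.round(a * SCALE).astype(np.int64) % P

def dequantize(z, factors=2):
    """Map residues back to floats; 'factors' is how many quantized values were
    multiplied per term (2 for a linear layer: input times weight)."""
    signed = np.where(z > P // 2, z - P, z)
    return signed / float(SCALE ** factors)

x = np.random.randn(4, 256).astype(np.float32)
w = (np.random.randn(256, 64) * 0.05).astype(np.float32)

z_q = quantize(x) @ quantize(w) % P    # linear layer evaluated over F_P
z_f = dequantize(z_q)                  # back to an approximate float result

print(np.max(np.abs(z_f - x @ w)))     # max quantization error vs. the float computation
```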

Design & Evaluation
Implementation:
- TEE: Intel SGX, "desktop" CPU (single thread)
- Untrusted device: Nvidia Tesla GPU
- Port of the Eigen linear algebra C++ library to SGX (used in e.g. TensorFlow)
Workloads:
- Microbenchmarks (see paper)
- VGG16 ("beefy" canonical feedforward neural network)
- MobileNet (resource-efficient DNN tailored for low-compute devices)
  - Variant 1: standard MobileNet (see paper)
  - Variant 2: no intermediate ReLU in separable convolutions (this talk)

Verifiable inference
- MobileNet's weights are only ~10MB, so they fit in the SGX cache
- Difficult to get faster batched verification due to SGX memory limits
- VGG16's weights take 500MB, so SGX has to page weights in and out of memory => ~2-3x slowdown
- Preprocessed weights W*r take up less memory and enable faster checks!

Verifiable and private inference
Extra costs:
- The GPU has to operate in double precision
- Decrypt all unblinding factors R*W (AES-GCM)
- Regenerate all blinding factors R (PRG using AES)

Summary
- Large savings (6x – 20x) in outsourcing DNN inference while preserving integrity; sufficient for some use cases!
- More modest savings (3.5x – 10x) with input privacy; requires preprocessing

Open questions
What other problems are (concretely) easier to verify than to compute?
- All NP-complete problems (but are those really outsourced?)
- What about something in P? Convex optimization, other uses of matrix multiplication, many graph problems (e.g., perfect matching)
What about Slalom for verifiable / private training?
- Quantization at training time is hard
- Weights change, so we can't preprocess W*r for the Freivalds check
- We assume the model is public