CHET: An Optimizing Compiler for Fully-Homomorphic Neural Network Inferencing

Presentation transcript:

CHET: An Optimizing Compiler for Fully-Homomorphic Neural Network Inferencing
Roshan Dathathri, Olli Saarikivi, Hao Chen, Kim Laine, Kristin Lauter, Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz

Neural Network Inference on Sensitive Data (image credits: ScienceSoft, MIT News)

Motivation: Fully-Homomorphic Encryption (FHE)
+ Compute on encrypted data → offload storage and compute to the cloud
+ The computation overhead of FHE is rapidly reducing
+ Simple trust model: only the client keeps the secret key
+ No need to trust hardware/software vendors like Intel or Microsoft
- Programming FHE applications is hard
- It requires cryptographic domain expertise
- It is akin to writing assembly for custom hardware, given the limited instruction set of FHE
All of these difficulties are addressed by CHET, the first FHE compiler.

Fully-Homomorphic DNN Inference Latency

Network           | CHET generated | Expert tuned
LeNet5-small      | 3 s            |
LeNet5-medium     | 11 s           |
LeNet5-large      | 35 s           |
Industrial        | 56 s           |
SqueezeNet-CIFAR  | 165 s          | 922 s

Fully-Homomorphic Encryption (FHE)

FHE provides homomorphic operators for addition and multiplication: instead of computing a + b on plaintexts, the client encrypts a and b, the server applies the homomorphic + to the ciphertexts, and the client decrypts the result to obtain a + b. The server-side computation stays entirely behind the privacy barrier.

Performance overhead of FHE over unencrypted computation

[Plot: slowdown over unencrypted execution by year, for CPU, GPU, FPGA, and ASIC implementations.]
Overheads have been reduced due to improvements in theory and implementation, and GPUs and FPGAs can reduce the slowdown further. The systems and compiler community can help reduce the slowdown as well.

Example Neural Network Inference

The customer encrypts an image and sends only the ciphertext to the server. The server runs the network homomorphically and returns an encrypted prediction that only the customer can decrypt. The server is not privy to the image, the prediction, or any intermediate state.

Compiling Tensor Programs for FHE Libraries

DNN inference is expressed as a tensor program. The CHET compiler provides automation and optimizations, producing FHE programs that run on the CHET runtime, which supplies FHE-optimized tensor kernels. Below the runtime sit FHE libraries for integer/fixed-point arithmetic (HEAAN, SEAL), targeting CPU, GPU, or FPGA.

Compiler for Homomorphic Evaluation of Tensor programs (CHET)

Rest of this talk:
- How to ensure generated code is correct, secure, and performant?
- How to map tensor operations onto the limited instruction set of FHE?
Check our paper for other optimizations.

Encryption Parameters

Ciphertexts are polynomials with two parameters:
- N: degree of the polynomial
- Q: modulus of each coefficient

a_N x^N + a_{N-1} x^{N-1} + … + a_1 x + a_0, where a_i = c_i mod Q for all i.

Encryption Parameters

Whether a choice of N and Q is correct is program dependent; whether it is secure is dictated by the FHE standard. A smaller Q (relative to N) gives better performance, while a larger Q or N means a larger cost per homomorphic operation and larger ciphertexts. CHET selects N and Q automatically so that the program is both correct and secure.

FHE for Approximate Fixed-Point Arithmetic

Maintain an explicit scaling factor and "re-scale" after multiplication. For example, 1.1 and 2.2 are encoded at scale 10 as the integers 11 and 22. Their homomorphic product is 242 at scale 100; rescaling divides both the encoding and the remaining modulus Q by 10, giving 24 at scale 10, which decodes to 2.4.
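The scale bookkeeping can be sketched in plain Python. This models only the scaling logic, not encryption; the `ScaledValue` class and its fields are hypothetical, not the HEAAN/SEAL API.

```python
class ScaledValue:
    """An integer 'encoding' of a real number at an explicit scale."""

    def __init__(self, real, scale=10):
        self.scale = scale
        self.encoded = round(real * scale)   # encode: 1.1 at scale 10 -> 11

    def __mul__(self, other):
        out = ScaledValue(0, self.scale * other.scale)   # scales multiply
        out.encoded = self.encoded * other.encoded       # 11 * 22 = 242
        return out

    def rescale(self, factor):
        # Dividing both the encoding and the scale by `factor` keeps the
        # decoded value (approximately) the same while shrinking the encoding.
        self.encoded = round(self.encoded / factor)
        self.scale //= factor
        return self

    def decode(self):
        return self.encoded / self.scale


x = ScaledValue(1.1) * ScaledValue(2.2)   # encoded 242 at scale 100
x.rescale(10)                             # encoded 24 at scale 10
print(x.decode())                         # -> 2.4 (approximately 1.1 * 2.2)
```

In the real scheme the rescale also consumes part of the ciphertext modulus Q, which is what drives the parameter selection on the next slide.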

Encryption Parameters Selection

Dataflow analysis:
- Re-scale primitives consume modulus; other primitives propagate it.
- Compute the total modulus consumed → set Q.
- Given Q → set N, using a pre-populated table.
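The analysis can be sketched as follows, under assumed per-rescale bit costs and an illustrative N-lookup table (the real costs and tables come from the FHE scheme and the security standard; the circuit encoding is ours):

```python
RESCALE_BITS = 40          # assumed bits of modulus consumed per rescale
BASE_BITS = 60             # assumed budget for the final decryption level

def select_parameters(circuit, inputs):
    """Walk the circuit in topological order, tracking modulus consumption."""
    consumed = {name: 0 for name in inputs}   # modulus bits consumed so far
    for node, op, args in circuit:            # topologically ordered DAG
        worst = max(consumed[a] for a in args)
        # rescale consumes modulus; add/multiply just propagate the maximum
        consumed[node] = worst + (RESCALE_BITS if op == "rescale" else 0)
    q_bits = BASE_BITS + max(consumed.values())
    # given Q, pick the smallest N deemed secure (illustrative table)
    for n, max_q_bits in [(2**13, 109), (2**14, 218), (2**15, 438)]:
        if q_bits <= max_q_bits:
            return n, q_bits
    raise ValueError("circuit too deep for supported parameters")

# x*x, rescale, multiply by x again, rescale: two rescales on the critical path
circuit = [
    ("t1", "mul", ["x", "x"]),
    ("t2", "rescale", ["t1"]),
    ("t3", "mul", ["t2", "x"]),
    ("t4", "rescale", ["t3"]),
]
print(select_parameters(circuit, ["x"]))   # -> (16384, 140)
```

Because the graph is a DAG and is visited once, the analysis runs in time linear in the circuit size.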

FHE Packing

Pack many plaintext scalars into a single ciphertext vector, with a fixed width of N/2 slots (N is 2^10 or larger).

FHE Vectorization

FHE offers a limited set of SIMD instructions:
- Element-wise addition
- Element-wise subtraction
- Element-wise multiplication
- Rescale (all elements)
- Rotation

Random access of a vector element is not supported. For example, with V = [11, 22, 33, 44, 55, 66], we can compute V = V · 2 = [22, 44, 66, 88, 110, 132] element-wise, rotate the slots cyclically with V = Rot(V, k), and isolate a single slot with V = V · Mask.

Example of Matrix Multiplication: C = A x B

Pack A = [a11, a12, a21, a22] into a vector with an empty slot after each element, and B = [b11, b12, b21, b22] into the first half of a vector. Rotations and additions then replicate the operands: A'' = A + Rot(A, 1) places each a_ij in two adjacent slots, giving [a11, a11, a12, a12, a21, a21, a22, a22], while B'' = B + Rot(B, 4) repeats B in the second half, giving [b11, b12, b21, b22, b11, b12, b21, b22]. Note the utilization of slots in the vectors: packing into a vector is a data layout choice.

Example of Matrix Multiplication: C = A x B (continued)

C' = A'' · B'' computes all the partial products element-wise: [c111, c112, c121, c122, c211, c212, c221, c222]. C'' = C' + Rot(C', 6) adds the two partial products that make up each entry, and C = C'' · Mask zeroes the junk slots (##), leaving [c11, c12, ##, ##, c21, c22, ##, ##]. CHET chooses such data layouts automatically.
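The rotate-multiply-mask sequence from the slides can be simulated in plain Python (no encryption) to check the slot arithmetic; the helper names and the concrete 2x2 matrices are ours:

```python
def rot(v, k):
    """Cyclic rotation by k slots (the slides' Rot(V, k))."""
    k %= len(v)
    return v[-k:] + v[:-k]

def add(u, v): return [a + b for a, b in zip(u, v)]
def mul(u, v): return [a * b for a, b in zip(u, v)]

# A packed with an empty slot after each element, B packed contiguously
A = [1, 0, 2, 0, 3, 0, 4, 0]        # a11, a12, a21, a22 of [[1, 2], [3, 4]]
B = [5, 6, 7, 8, 0, 0, 0, 0]        # b11, b12, b21, b22 of [[5, 6], [7, 8]]

A2 = add(A, rot(A, 1))              # [a11, a11, a12, a12, a21, a21, a22, a22]
B2 = add(B, rot(B, 4))              # [b11, b12, b21, b22, b11, b12, b21, b22]
C1 = mul(A2, B2)                    # all partial products a_ik * b_kj
C2 = add(C1, rot(C1, 6))            # sum the two partial products per c_ij
C = mul(C2, [1, 1, 0, 0, 1, 1, 0, 0])   # mask out the junk slots
print(C)                            # -> [19, 22, 0, 0, 43, 50, 0, 0]
```

The surviving slots match [[1, 2], [3, 4]] x [[5, 6], [7, 8]] = [[19, 22], [43, 50]], using only the element-wise and rotation primitives that FHE supports.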

Mapping Tensors to Vectors

There are many ways to lay out tensors into vectors, with different tradeoffs:
- HW (Height-Width) layout: easier convolutions because channels are aligned, but wasted space.
- CHW (Channel-Height-Width) layout: more efficient space usage, but convolutions require more rotations.
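To make the tradeoff concrete, here is an illustrative flattening of a 2-channel 2x2 tensor under each layout (this is a simplification of CHET's actual layouts, and the padding width is an assumption):

```python
def hw_layout(tensor, padded_hw):
    """HW layout: each channel starts at a fixed, padded offset, so the same
    pixel of every channel sits at an aligned slot (easier convolutions),
    at the cost of wasted slots."""
    slots = []
    for channel in tensor:
        flat = [x for row in channel for x in row]
        slots += flat + [0] * (padded_hw - len(flat))   # wasted space
    return slots

def chw_layout(tensor):
    """CHW layout: channels packed back-to-back with no padding (better slot
    utilization, but convolutions need more rotations to reach channels)."""
    return [x for channel in tensor for row in channel for x in row]


# two 2x2 channels
t = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(hw_layout(t, 8))   # -> [1, 2, 3, 4, 0, 0, 0, 0, 5, 6, 7, 8, 0, 0, 0, 0]
print(chw_layout(t))     # -> [1, 2, 3, 4, 5, 6, 7, 8]
```

In the HW layout, corresponding pixels of the two channels are exactly `padded_hw` slots apart, which is what makes convolutions cheaper there.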

Data Layout Selection

- Search space: explore possible data layouts.
- Cost estimation: estimate the cost of each search point.
- Pick the best-performing one.

Data Layout Selection: Search Space

The search space is exponential: 2 choices (HW or CHW) per tensor operation (convolution, activation, pooling, matrix multiplication). CHET prunes the search space using domain knowledge, limiting it to only 4 choices for the whole circuit:
- Convolution is faster in HW, while the rest are faster in CHW.
- Matrix multiplication is faster if its output is in CHW.

Data Layout Selection: Cost Estimation

Cost model for FHE primitives:
- Asymptotic complexity (specific to the FHE scheme)
- Microbenchmarking to determine the constants
The cost of a circuit is the sum of the costs of all its operations.
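A minimal sketch of such a cost model, with made-up constants standing in for the microbenchmarked ones and an assumed asymptotic form (roughly N log N for NTT-based primitives, N for addition):

```python
import math

# constants (seconds) as if fit by microbenchmarks on one machine
K = {"mul": 5e-8, "add": 1e-9, "rot": 6e-8, "rescale": 2e-8}

def op_cost(op, n):
    """Assumed asymptotic cost of one FHE primitive at polynomial degree n."""
    if op == "add":
        return K[op] * n
    return K[op] * n * math.log2(n)

def circuit_cost(ops, n):
    """Cost of a circuit: the sum of the costs of all its operations."""
    return sum(op_cost(op, n) for op in ops)

# two hypothetical circuits for the same computation under different layouts
layout_a = ["mul", "rescale", "rot", "rot", "add"]
layout_b = ["mul", "rescale", "rot", "add", "add"]
n = 2**14
best = min([layout_a, layout_b], key=lambda ops: circuit_cost(ops, n))
print(best is layout_b)   # -> True: layout_b trades a rotation for a cheap add
```

The model only needs to rank layouts correctly, not predict absolute runtimes, which is why simple asymptotic forms with fitted constants suffice.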

Experimental Setup

Systems:
- Hand-written HEAAN
- CHET with HEAAN
- CHET with SEAL

Machine: dual-socket Intel Xeon E5-2667v3, 16 cores, 224 GB of memory.

FHE-compatible Deep Neural Networks (DNNs):

DNN              | Dataset  | # Layers | # FP ops (M) | Accuracy
LeNet-5-small    | MNIST    | 8        | 0.2          | 98.5%
LeNet-5-medium   | MNIST    |          | 5.8          | 99.0%
LeNet-5-large    | MNIST    |          | 8.7          | 99.3%
Industrial       | -        | 13       |              |
SqueezeNet-CIFAR | CIFAR-10 | 19       | 37.8         | 81.5%

Evaluation: latency of image inference (batch size = 1).

It was difficult to write code for SEAL by hand; we could target it because we had a compiler. There is also an accuracy trade-off: a loss of precision when making a program FHE-compatible.

CHET outperforms hand-written implementations

Manual HEAAN implementations are difficult to write for large networks. For the Industrial network, latency drops from 40 minutes (hand-written) to 1 minute (CHET).

Best data layout depends on the FHE library and the DNN

[Plots: slowdown of each data layout relative to the best one, for each network and FHE library.]

Prior Compilers for Deep Learning

Optimizing deep learning compilers target varied backends; our perspective is that FHE is just another backend. CHET is the first optimizing compiler for deep learning on FHE, and there are many more optimizations from the literature to incorporate. In order of increasing automation: hand-written models → automatic layout selection (CHET) → scheduling languages and automatic tuning (Halide, TVM) → fully automatic kernel generation (TC, PlaidML).

Conclusions

- FHE programming is a "systems and compiler" problem.
- CHET outperforms hand-written FHE implementations.
- CHET enables homomorphic inference using large Deep Neural Networks (DNNs).
- CHET makes it easy to port to new FHE schemes.

Future work:
- Approximate non-linear functions automatically.
- Generalize to other application domains.

Rotation keys selection improves performance

CHET selects the best-performing data layout

The server is not privy to the image, the prediction, or any intermediate state: the program can be seen, but the input cannot. (Demo in the MSR booth.)

Rotation Keys Selection

- A distinct public key is needed for each distinct rotation amount, and these keys consume significant memory.
- By default, FHE schemes generate power-of-2 rotation keys — O(log(N)) keys — which is too conservative.
- CHET analyzes the circuit to capture all the distinct rotations used; in practice, very few rotations are used, not O(N).
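The idea can be sketched as a single scan over the circuit, with a hypothetical circuit encoding (ours, not CHET's):

```python
def default_keys(n):
    """Power-of-two rotation amounts below n (the conservative default)."""
    return {1 << i for i in range(n.bit_length() - 1)}

def used_rotations(circuit):
    """Distinct rotation amounts actually appearing in the circuit."""
    return {amount for op, amount in circuit if op == "rot"}


circuit = [("mul", None), ("rot", 1), ("add", None),
           ("rot", 4), ("mul", None), ("rot", 6), ("rot", 4)]
n = 2**14
print(sorted(default_keys(n)))         # 14 power-of-two keys: 1, 2, ..., 8192
print(sorted(used_rotations(circuit))) # only 3 keys needed: [1, 4, 6]
```

Generating keys only for the rotations the circuit uses saves the memory that the unused power-of-two keys would have consumed.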

Overview of CHET at compile-time

The CHET compiler takes a tensor circuit, the schema of its inputs, and the desired output precision, and produces an optimized homomorphic tensor circuit along with an encryptor & decryptor.

Overview of CHET at runtime

The client encrypts an image using the encryptor & decryptor with its private key, and sends the encrypted image and the public keys to the server. The server runs the optimized homomorphic tensor circuit with its weights on the CHET runtime, atop HEAAN/SEAL on CPU/GPU/FPGA, and returns an encrypted prediction that only the client can decrypt. The server is not privy to the image, the prediction, or any intermediate state.

Tensor Programs

A tensor is a multi-dimensional array. Common tensor operations:
- Convolution
- Matrix multiplication
- Pooling
- Element-wise operations
- Reshaping

Linear-time Dataflow Analysis Framework

Dataflow equations are defined for the FHE primitives. Properties of tensor circuits:
- The dataflow graph is a Directed Acyclic Graph (DAG).
- Tensor dimensions are known at compile-time.
Analysis: execute the dataflow equations using the CHET runtime, which dynamically unrolls the dataflow graph on-the-fly and interprets ciphertexts/plaintexts as dataflow information.

Semi-Honest Threat Model

- Hardware and software execute the requested computation faithfully.
- Hardware and software are curious about the data.
- Client or user data must remain confidential.