Model Compression
Joseph E. Gonzalez
Co-director of the RISE Lab
jegonzal@cs.berkeley.edu
What is the Problem Being Solved?
Large neural networks are costly to deploy: gigaflops of computation, hundreds of MB of storage.
Why are they costly?
- Added computation adversely affects throughput, latency, and energy.
- Added memory adversely affects:
  - download and storage of model parameters (OTA updates)
  - throughput and latency (through caching)
  - energy: 5 pJ for an SRAM cache read, 640 pJ for a DRAM read, vs. 0.9 pJ for a FLOP
Approaches to “Compressing” Models
- Architectural Compression
  - Layer Design: typically uses factorization techniques to reduce storage and computation
  - Pruning: eliminates weights, layers, or channels of large pre-trained models to reduce storage and computation
- Weight Compression
  - Low-Bit-Precision Arithmetic: weights and activations are stored and computed using low bit precision
  - Quantized Weight Encoding: weights are quantized and stored using dictionary encodings
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices (2017)
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun (Megvii Inc., Face++)
Related Work (at the time)
- SqueezeNet (2016): aggressively leverages 1x1 (pointwise) convolutions to reduce the inputs to 3x3 convolutions.
  - 57.5% accuracy (comparable to AlexNet)
  - 1.2M parameters, compressed down to 0.47 MB
- MobileNetV1 (2017): aggressively leverages depthwise-separable convolutions to achieve 70.6% accuracy on ImageNet.
  - 569M mult-adds
  - 4.2M parameters
Background
Regular Convolution Computation (for 3x3 kernel)
[Figure: a 3x3 kernel, with depth equal to the number of input channels, slides over a Width x Height x Depth input to produce each output channel.]
Computational complexity: W × H × C_out × C_in × 3 × 3 (width × height × output channels × input channels × filter size).
Combines information across space and across channels.
1x1 Convolution (Pointwise Convolution) Computation
[Figure: a 1x1 kernel, with depth equal to the number of input channels, mixes channels at each spatial position.]
Computational complexity: W × H × C_out × C_in.
Combines information across channels only.
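To make the two formulas above concrete, here is a small multiply-add-counting helper (a sketch; the function name and feature-map sizes are illustrative, not from the paper):

```python
# Multiply-adds for a KxK convolution over a W x H feature map,
# following the W * H * C_out * C_in * K * K formula above.
def conv_flops(w, h, c_in, c_out, k):
    return w * h * c_out * c_in * k * k

# Illustrative sizes: a 56x56 feature map with 128 channels in and out.
print(conv_flops(56, 56, 128, 128, k=3))  # 462,422,016 (regular 3x3)
print(conv_flops(56, 56, 128, 128, k=1))  # 51,380,224  (pointwise 1x1, 9x cheaper)
```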
Depthwise Convolution Computation (for 3x3 kernel)
[Figure: each input channel is convolved with its own 3x3 filter; output channels = input channels.]
Computational complexity: W × H × C × 3 × 3.
Combines information across space only.
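In PyTorch this corresponds to setting `groups` equal to the channel count, so each channel gets its own filter (a sketch with illustrative sizes):

```python
import torch
import torch.nn as nn

# Depthwise convolution: groups == channels means no channel mixing.
depthwise = nn.Conv2d(in_channels=128, out_channels=128,
                      kernel_size=3, padding=1, groups=128)
x = torch.randn(1, 128, 56, 56)
print(depthwise(x).shape)  # torch.Size([1, 128, 56, 56])
# Cost: W * H * C * 3 * 3 = 56 * 56 * 128 * 9 ~= 3.6M multiply-adds,
# versus ~462M for the regular 3x3 convolution above.
```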
MobileNet Layer Architecture
[Figure: a 3x3 depthwise convolution (spatial aggregation) followed by a 1x1 pointwise convolution (channel aggregation).]
Computational cost: W × H × C_in × (3 × 3 + C_out).
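A minimal sketch of one depthwise-separable layer in this style (BatchNorm/ReLU placement follows MobileNetV1; the function name and sizes are illustrative):

```python
import torch.nn as nn

def separable_conv(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in),  # spatial aggregation
        nn.BatchNorm2d(c_in),
        nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, kernel_size=1),                          # channel aggregation
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

# Cost: W * H * C_in * (3*3 + C_out) multiply-adds,
# vs. W * H * C_in * C_out * 3 * 3 for a regular 3x3 convolution.
layer = separable_conv(128, 256)
```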
Observation from the MobileNet Paper
“MobileNet spends 95% of it’s computation time in 1x1 convolutions which also has 75% of the parameters as can be seen in Table 2.”
Idea: eliminate the dense 1x1 convolution but still mix channel information, via two mechanisms:
- Pointwise (1x1) group convolution
- Channel shuffle
Group Convolution
Used in AlexNet to partition the model across machines.
[Figure: the input channels are split into g groups, and each group is convolved with its own 3x3 kernels independently.]
Computational complexity: W × H × C_in × C_out × 3 × 3 / g.
Combines some information across space and across channels.
Can we apply this to 1x1 (pointwise) convolutions?
Pointwise (1x1) Group Convolution
[Figure: a 1x1 group convolution mixes channels only within each group.]
Computational complexity: W × H × C_in × C_out / g.
Combines some information across channels.
Issue: if applied repeatedly, the channel groups remain independent; outputs in one group never see inputs from another (demonstrated in the sketch below).
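A quick PyTorch check of this independence issue (a sketch; sizes and names are illustrative): in PyTorch a pointwise group convolution is just `kernel_size=1` plus `groups=g`, and with bias disabled, output channels in one group respond only to inputs from the same group.

```python
import torch
import torch.nn as nn

g, c = 4, 128
pw_group = nn.Conv2d(c, c, kernel_size=1, groups=g, bias=False)

x = torch.zeros(1, c, 56, 56)
x[:, : c // g] = 1.0                        # activate only the first group's inputs
y = pw_group(x)

print(y[:, : c // g].abs().sum() != 0)      # tensor(True): first group responds
print(y[:, c // g :].abs().sum() == 0)      # tensor(True): other groups see nothing
# Cost: W * H * C_in * C_out / g -- 1/4 of the dense 1x1 cost here.
```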
Channel Shuffle
Permute channels between group-convolution stages:
- Each group should get a channel from each of the groups in the previous stage.
- Requires no arithmetic operations, but does require data movement.
- Good or bad for hardware?
A minimal implementation is sketched below.
[Figure: two group-convolution stages with a channel shuffle in between.]
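A minimal channel-shuffle sketch in PyTorch (the standard reshape/transpose trick, equivalent in effect to ShuffleNet's operation; the function name and sizes are illustrative):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    n, c, h, w = x.shape
    # (N, g, C/g, H, W) -> swap the group and channel axes -> flatten back
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

# With 8 channels in 2 groups, each output pair draws from both groups:
x = torch.arange(8).float().view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())
# [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0]
```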
ShuffleNet Architecture
[Figure: the ShuffleNet unit vs. the MobileNet unit, in stride = 1 and stride = 2 variants.]
Based on residual networks: the stride-1 unit adds an identity shortcut (elementwise add), while the stride-2 unit average-pools the shortcut and concatenates channels.
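A rough sketch of the stride-1 unit under the structure described above (the 1/4 bottleneck ratio follows the paper, but the class and variable names are illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn

def _shuffle(x, g):  # channel shuffle, as in the previous sketch
    n, c, h, w = x.shape
    return x.view(n, g, c // g, h, w).transpose(1, 2).reshape(n, c, h, w)

class ShuffleUnit(nn.Module):
    """Stride-1 ShuffleNet unit: 1x1 gconv -> shuffle -> 3x3 dwconv -> 1x1 gconv + residual."""
    def __init__(self, channels: int, groups: int):
        super().__init__()
        mid = channels // 4                      # bottleneck width
        self.groups = groups
        self.gconv1 = nn.Conv2d(channels, mid, 1, groups=groups, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.dwconv = nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False)
        self.bn2 = nn.BatchNorm2d(mid)
        self.gconv2 = nn.Conv2d(mid, channels, 1, groups=groups, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.gconv1(x)))
        out = _shuffle(out, self.groups)
        out = self.bn2(self.dwconv(out))         # no ReLU after depthwise conv
        out = self.bn3(self.gconv2(out))
        return torch.relu(out + x)               # identity shortcut

unit = ShuffleUnit(channels=128, groups=4)
print(unit(torch.randn(1, 128, 28, 28)).shape)   # torch.Size([1, 128, 28, 28])
```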
Alternative Visualization
[Figure: the ShuffleNet unit vs. the MobileNet unit, drawn with the spatial dimension on one axis and the channel dimension on the other. Notice the cost of the 1x1 convolution in the MobileNet unit.]
Images from https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d
ShuffleNet Architecture
Width (number of channels) is increased as the number of groups grows, keeping FLOPs constant.
Observation: shallow networks need more channels (width) to maintain accuracy.
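For intuition on the constant-FLOPs trade, a small illustrative calculation (not the paper's actual channel counts): with pointwise group convolution costing W × H × C_in × C_out / g, using g = 4 groups lets the network double its channels at the same multiply-add budget.

```python
# Pointwise group-conv multiply-adds: W * H * C_in * C_out / g.
def pw_group_flops(w, h, c_in, c_out, g):
    return w * h * c_in * c_out // g

print(pw_group_flops(28, 28, 240, 240, g=1))  # 45,158,400
print(pw_group_flops(28, 28, 480, 480, g=4))  # 45,158,400 -- 2x the width, same cost
```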
What are the Metrics of Success?
- Reduction in network size (parameters)
- Reduction in computation (FLOPs)
- Accuracy
- Runtime (latency)
Comparisons to Other Architectures
- Less computation and more accurate than MobileNet
- Can be configured to match the accuracy of other models while using less compute
Runtime Performance on a Mobile Processor (Qualcomm Snapdragon 820)
Faster and more accurate than MobileNet.
Caveats:
- Evaluated using a single thread
- Unclear how this would perform on a GPU; no numbers reported
Model Size
The paper does not report the memory footprint of the model.
- The ONNX implementation is 5.6 MB (~1.4M parameters).
- MobileNet reports 4.2M parameters (~16 MB).
Either way, the model is relatively small.
Limitations and Future Impact
Limitations:
- Decreases arithmetic intensity (more data movement per FLOP)
- The authors disable pointwise group convolution on smaller input layers due to “performance issues”
Future impact:
- Not yet as widely used as MobileNet
- Discussion: could it potentially benefit from hardware optimization?