Network Compression and Speedup


1 Network Compression and Speedup
Shuochao Yao, Yiwen Xu, Daniel Calzada

2 Network Compression and Speedup

3 Why smaller models?
| Operation | Energy [pJ] | Relative Cost |
| 32 bit int ADD | 0.1 | 1 |
| 32 bit float ADD | 0.9 | 9 |
| 32 bit Register File | 1 | 10 |
| 32 bit int MULT | 3.1 | 31 |
| 32 bit float MULT | 3.7 | 37 |
| 32 bit SRAM Cache | 5 | 50 |
| 32 bit DRAM Memory | 640 | 6400 |

4 Outline
- Matrix Factorization
- Weight Pruning
- Quantization method
- Pruning + Quantization + Encoding
- Design small architecture: SqueezeNet

5 Outline
- Matrix Factorization
  - Singular Value Decomposition (SVD)
  - Flattened Convolutions
- Weight Pruning
- Quantization method
- Pruning + Quantization + Encoding
- Design small architecture: SqueezeNet
A typical CNN (e.g. VGGNet) first has some convolutional layers and then some fully connected layers. We'll explore how to compress these layers and/or speed them up.

6 Fully Connected Layers: Singular Value Decomposition
Most weights are in the fully connected layers (according to Denton et al.).
$W = U S V^\top$, with $W \in \mathbb{R}^{m \times k}$, $U \in \mathbb{R}^{m \times m}$, $S \in \mathbb{R}^{m \times k}$, $V^\top \in \mathbb{R}^{k \times k}$.
$S$ is diagonal, with decreasing magnitudes along the diagonal.
According to Denton et al., the majority of the weights reside in the FC layers, so compressing those is key.

7 Singular Value Decomposition
By keeping only the $t$ singular values with largest magnitude:
$\tilde{W} = \tilde{U} \tilde{S} \tilde{V}^\top$, with $\tilde{W} \in \mathbb{R}^{m \times k}$, $\tilde{U} \in \mathbb{R}^{m \times t}$, $\tilde{S} \in \mathbb{R}^{t \times t}$, $\tilde{V}^\top \in \mathbb{R}^{t \times k}$, and $\mathrm{rank}(\tilde{W}) = t$.
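A minimal NumPy sketch of this truncation; the layer size (1024 x 4096) and the rank t = 64 are illustrative values, not taken from the paper:

```python
import numpy as np

m, k, t = 1024, 4096, 64                   # hypothetical FC layer shape and target rank
W = np.random.randn(m, k).astype(np.float32)

U, s, Vt = np.linalg.svd(W, full_matrices=False)   # s holds singular values in decreasing order
U_t, s_t, Vt_t = U[:, :t], s[:t], Vt[:t, :]        # keep the t largest singular values

W_approx = (U_t * s_t) @ Vt_t                      # rank-t approximation of W

orig_params = m * k                        # O(mk) storage for W
compressed_params = t * (m + k + 1)        # O(mt + t + tk) storage for the factors
print(f"compression rate: {orig_params / compressed_params:.1f}x")
print(f"relative error: {np.linalg.norm(W - W_approx) / np.linalg.norm(W):.3f}")
```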

8 SVD: Compression
$W = U S V^\top$, $W \in \mathbb{R}^{m \times k}$, $U \in \mathbb{R}^{m \times m}$, $S \in \mathbb{R}^{m \times k}$, $V^\top \in \mathbb{R}^{k \times k}$
$\tilde{W} = \tilde{U} \tilde{S} \tilde{V}^\top$, $\tilde{W} \in \mathbb{R}^{m \times k}$, $\tilde{U} \in \mathbb{R}^{m \times t}$, $\tilde{S} \in \mathbb{R}^{t \times t}$, $\tilde{V}^\top \in \mathbb{R}^{t \times k}$
Storage for $W$: $O(mk)$. Storage for $\tilde{W}$: $O(mt + t + tk)$. Compression rate: $O\left(\frac{mk}{t(m + k + 1)}\right)$.
Theoretical error: $\|A\tilde{W} - AW\|_F \le s_{t+1} \|A\|_F$.
We can take advantage of the fact that $S$ is diagonal and store only the $t$ largest singular values as a list. As expected, smaller $t$ values yield higher compression rates.
Gong, Yunchao, et al. "Compressing deep convolutional networks using vector quantization." arXiv preprint, 2014.

9 SVD: Compression Results
Trained on the ImageNet 2012 database, then compressed. Architecture: 5 convolutional layers, 3 fully connected layers, and a softmax output layer.
Observations: for the FC layers, we can obtain a 2-3x compression without affecting accuracy much, or we can be more aggressive and lose some accuracy. The paper doesn't report the number of elements in the FC layers.
(Monochromatic approximation makes all colors of a first-layer convolution lie on the same line; C' is the number of color channels to use. Outer product decomposition is discussed later. Biclustering first clusters the weight matrices by C and F (C = input channel, F = output feature) and then performs SVD or outer product decomposition on each cluster, which compresses well since the cluster members are already similar; G = input clusters, H = output clusters.)
K refers to the rank of the approximation, i.e. t in the previous slides.
Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for efficient evaluation." Advances in Neural Information Processing Systems, 2014.

10 SVD: Side Benefits
Reduced memory footprint: reduced in the dense layers by 5-13x.
Speedup: $A\tilde{W}$ with $A \in \mathbb{R}^{n \times m}$ is computed in $O(nmt + nt^2 + ntk)$ instead of $O(nmk)$. The speedup factor is $O\left(\frac{mk}{t(m + t + k)}\right)$.
Regularization: "Low-rank projections effectively decrease number of learnable parameters, suggesting that they might improve generalization ability."
The paper applies SVD after training.
Premise behind the speedup: to compute $A\tilde{W}$, first compute $A\tilde{U}$, then $(A\tilde{U})\tilde{S}$, and then $((A\tilde{U})\tilde{S})\tilde{V}^\top$.
Naturally, the speedup is considerable for small $t$. How small can $t$ be? (The error was discussed in the previous slides.) Most research is centered around speeding up convolutional layers (not with SVD), so experimental speedup results are hard to find.
Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for efficient evaluation." Advances in Neural Information Processing Systems, 2014.
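A small self-contained sketch of that factored forward pass; all shapes are hypothetical, and the point is only that the two orderings give the same result at different cost:

```python
import numpy as np

n, m, k, t = 128, 1024, 4096, 64                 # batch size and illustrative layer/rank sizes
A = np.random.randn(n, m).astype(np.float32)
U = np.random.randn(m, t).astype(np.float32)     # stand-ins for the truncated SVD factors
s = np.random.rand(t).astype(np.float32)
Vt = np.random.randn(t, k).astype(np.float32)

dense = A @ ((U * s) @ Vt)          # materializes the m x k matrix first: O(nmk) product
factored = ((A @ U) * s) @ Vt       # O(nmt + nt + ntk), cheaper when t << m, k

print(np.allclose(dense, factored, rtol=1e-3, atol=1e-3))   # True up to float32 round-off
```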

11 Convolutions: Matrix Multiplication
Most time is spent in the convolutional layers: $F(x, y) = I * W$.

12 Flattened Convolutions
Replace $c \times y \times x$ convolutions with $c \times 1 \times 1$, $1 \times y \times 1$, and $1 \times 1 \times x$ convolutions.
Denton et al.: "The majority of forward propagation time is spent on the first two convolutional layers." If this is the case in their model, it may be the case in many models, and it would be wise to optimize it. (Discussed in the first week.)
Jin, Jonghoon, Aysegul Dundar, and Eugenio Culurciello. "Flattened convolutional neural networks for feedforward acceleration." arXiv preprint, 2014.

13 Flattened Convolutions
𝐹 π‘₯,𝑦 =πΌβˆ— π‘Š = π‘₯ β€² =1 𝑋 𝑦 β€² =1 π‘Œ 𝑐=1 𝐢 𝐼 𝑐, π‘₯βˆ’ π‘₯ β€² , π‘¦βˆ’ 𝑦 β€² 𝛼 𝑐 𝛽 𝑦 β€² 𝛾 π‘₯ β€² π›Όβˆˆ ℝ 𝐢 , π›½βˆˆ ℝ π‘Œ , π›Ύβˆˆ ℝ 𝑋 Compression and Speedup: Parameter reduction: O(π‘‹π‘ŒπΆ) to O 𝑋+π‘Œ+𝐢 Operation reduction: 𝑂(π‘šπ‘›πΆπ‘‹π‘Œ) to 𝑂 π‘šπ‘› 𝐢+𝑋+π‘Œ (where W f ∈ ℝ π‘šΓ—π‘› ) Is that a mistake in the paper in equation 3? xβ€˜ and y’ seem to be swapped Alpha, beta, and gamma each correspond to a convolution in each dimension Jin, Jonghoon, Aysegul Dundar, and Eugenio Culurciello. "Flattened convolutional neural networks for feedforward acceleration."Β arXiv preprint arXiv: Β (2014). Network Compression and Speedup

14 Flattening = MF
$F(x, y) = \sum_{x'=1}^{X} \sum_{y'=1}^{Y} \sum_{c=1}^{C} I(c, x - x', y - y')\, \alpha_c \beta_{y'} \gamma_{x'} = \sum_{x'=1}^{X} \sum_{y'=1}^{Y} \sum_{c=1}^{C} I(c, x - x', y - y')\, \tilde{W}(c, x', y')$
$\tilde{W} = \alpha \otimes \beta \otimes \gamma$, $\mathrm{rank}(\tilde{W}) = 1$
$\tilde{W}_S = \sum_{k=1}^{K} \alpha_k \otimes \beta_k \otimes \gamma_k$, rank $K$
SVD: we can reconstruct the original matrix as $A = \sum_{k=1}^{K} \sigma_k\, u_k \otimes v_k$.
Alpha, beta, and gamma each correspond to a convolution in one dimension. $\mathrm{rank}(\tilde{W}) = 1$ can be seen (at least in the matrix case) because each row is a scalar multiple of the others.
Combining: in 2D this is equivalent to SVD, since a matrix can be reconstructed by summing the outer products of its singular vector pairs weighted by the singular values, so $\alpha_k \otimes \beta_k = \sigma_k\, u_k \otimes v_k$ (page 6). Denton's paper proposes combining these.
Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for efficient evaluation." Advances in Neural Information Processing Systems, 2014.
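A tiny NumPy check of the rank statements in the 2-D case (sizes are arbitrary): a single outer product has rank 1, and a sum of K such terms has rank at most K.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = rng.normal(size=8), rng.normal(size=5)

W1 = np.outer(alpha, beta)                     # alpha (x) beta: every row is a multiple of beta
print(np.linalg.matrix_rank(W1))               # 1

K = 3
W_sum = sum(np.outer(rng.normal(size=8), rng.normal(size=5)) for _ in range(K))
print(np.linalg.matrix_rank(W_sum))            # K (generically), mirroring the rank-K sum above
```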

15 Flattening: Speedup Results
3 convolutional layers (5x5 filters) with 96, 128, and ... channels. Used stacks of 2 rank-1 convolutions.
The accuracy plot is on the test data of the CIFAR-10 dataset. However, when using a stack of 2 of these, the network is vulnerable to the vanishing gradient problem. The speedup on the GPU is not as significant but still noticeable.
Jin, Jonghoon, Aysegul Dundar, and Eugenio Culurciello. "Flattened convolutional neural networks for feedforward acceleration." arXiv preprint, 2014.

16 Outline
- Matrix Factorization
- Weight Pruning
  - Magnitude-based method
    - Iterative pruning + Retraining
    - Pruning with rehabilitation
  - Hessian-based method
- Quantization method
- Pruning + Quantization + Encoding
- Design small architecture: SqueezeNet

17 Magnitude-based method: Iterative Pruning + Retraining
Prune connections with small magnitude. Iterative pruning and re-training.
| Operation | Energy [pJ] | Relative Cost |
| 32 bit int ADD | 0.1 | 1 |
| 32 bit float ADD | 0.9 | 9 |
| 32 bit Register File | 1 | 10 |
| 32 bit int MULT | 3.1 | 31 |
| 32 bit float MULT | 3.7 | 37 |
| 32 bit SRAM Cache | 5 | 50 |
| 32 bit DRAM Memory | 640 | 6400 |
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS 2015.

18 Magnitude-based method: Iterative Pruning + Retraining
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS 2015.

19 Magnitude-based method: Iterative Pruning + Retraining (Algorithm)
1. Choose a neural network architecture.
2. Train the network until a reasonable solution is obtained.
3. Prune the weights whose magnitudes are less than a threshold $\tau$.
4. Train the network until a reasonable solution is obtained.
5. Iterate to step 3.
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS 2015.
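A minimal PyTorch sketch of one prune/retrain round from this algorithm. The layer size, the threshold tau, and the stand-in loss are all hypothetical; the point is that a 0/1 mask zeroes small weights and keeps them at zero while the survivors are retrained.

```python
import torch

def magnitude_mask(weight: torch.Tensor, tau: float) -> torch.Tensor:
    """Return a 0/1 mask that keeps only weights with |w| >= tau."""
    return (weight.abs() >= tau).float()

w = torch.randn(256, 512, requires_grad=True)   # hypothetical layer
tau = 0.5
mask = magnitude_mask(w.data, tau)
w.data.mul_(mask)                               # prune: zero out small-magnitude weights

loss = (w.sum() - 1.0) ** 2                     # stand-in loss for illustration
loss.backward()
w.grad.mul_(mask)                               # pruned weights receive no gradient
with torch.no_grad():
    w -= 1e-3 * w.grad                          # one retraining step on the surviving weights
```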

20 Magnitude-based method: Iterative Pruning + Retraining (Experiment: Overall)
| Network | Top-1 Error | Top-5 Error | Parameters | Compression Rate |
| LeNet Ref | 1.64% | - | 267K | |
| LeNet Pruned | 1.59% | - | 22K | 12X |
| LeNet-5 Ref | 0.80% | - | 431K | |
| LeNet-5 Pruned | 0.77% | - | 36K | 12X |
| AlexNet Ref | 42.78% | 19.73% | 61M | |
| AlexNet Pruned | 42.77% | 19.67% | 6.7M | 9X |
| VGG-16 Ref | 31.50% | 11.32% | 138M | |
| VGG-16 Pruned | 31.34% | 10.88% | 10.3M | 13X |
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS 2015.

21 Magnitude-based method: Iterative Pruning + Retraining (Experiment: LeNet)
LeNet:
| Layer | Weights | FLOP | Act% | Weights% | FLOP% |
| fc1 | 235K | 470K | 38% | 8% | |
| fc2 | 30K | 60K | 65% | 9% | 4% |
| fc3 | 1K | 2K | 100% | 26% | 17% |
| Total | 266K | 532K | 46% | | |
LeNet-5:
| Layer | Weights | FLOP | Act% | Weights% | FLOP% |
| conv1 | 0.5K | 576K | 82% | 66% | |
| conv2 | 25K | 3200K | 72% | 12% | 10% |
| fc1 | 400K | 800K | 55% | 8% | 6% |
| fc2 | 5K | 10K | 100% | 19% | |
| Total | 431K | 4586K | 77% | | 16% |
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS 2015.

22 Magnitude-based method: Iterative Pruning + Retraining (Experiment: AlexNet)
| Layer | Weights | FLOP | Act% | Weights% | FLOP% |
| conv1 | 35K | 211M | 88% | 84% | |
| conv2 | 307K | 448M | 52% | 38% | 33% |
| conv3 | 885K | 299M | 37% | 35% | 18% |
| conv4 | 663K | 224M | 40% | | 14% |
| conv5 | 442K | 150M | 34% | | |
| fc1 | 38M | 75M | 36% | 9% | 3% |
| fc2 | 17M | 34M | | | |
| fc3 | 4M | 8M | 100% | 25% | 10% |
| Total | 61M | 1.5B | 54% | 11% | 30% |
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS 2015.

23 Magnitude-based method: Iterative Pruning + Retraining (Experiment: Tradeoff)
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS 2015.

24 Pruning with rehabilitation: Dynamic Network Surgery (Motivation)
With plain pruning, pruned connections have no chance to come back, and incorrect pruning may cause severe accuracy loss. Goals: avoid the risk of irretrievable network damage and improve the learning efficiency.
Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS 2016.

25 Pruning with rehabilitation: Dynamic Network Surgery (Formulation)
π‘Š π‘˜ denotes the weights, and 𝑇 π‘˜ denotes the corresponding 0/1 masks. min π‘Š π‘˜ , 𝑇 π‘˜ 𝐿 π‘Š π‘˜ ⨀ 𝑇 π‘˜ 𝑠.𝑑. 𝑇 π‘˜ (𝑖,𝑗) = β„Ž π‘˜ π‘Š π‘˜ (𝑖,𝑗) , βˆ€ 𝑖,𝑗 βˆˆπ”— ⨀ is the element-wise product. 𝐿 βˆ™ is the loss function. Dynamic network surgery updates only π‘Š π‘˜ . 𝑇 π‘˜ is updated based on β„Ž π‘˜ βˆ™ . β„Ž π‘˜ π‘Š π‘˜ (𝑖,𝑗) = 0 π‘Ž π‘˜ β‰₯ π‘Š π‘˜ (𝑖,𝑗) 𝑇 π‘˜ (𝑖,𝑗) π‘Ž π‘˜ ≀ π‘Š π‘˜ (𝑖,𝑗) ≀ 𝑏 π‘˜ 1 𝑏 π‘˜ ≀ π‘Š π‘˜ (𝑖,𝑗) π‘Ž π‘˜ is the pruning threshold. 𝑏 π‘˜ = π‘Ž π‘˜ +𝑑, where 𝑑 is a pre-defined small margin. Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS Network Compression and Speedup

26 Pruning with rehabilitation: Dynamic Network Surgery (Algorithm)
1. Choose a neural network architecture.
2. Train the network until a reasonable solution is obtained.
3. Update $T_k$ based on $h_k(\cdot)$.
4. Update $W_k$ based on back-propagation.
5. Iterate to step 3.
Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS 2016.
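A minimal PyTorch sketch of the mask update h_k (the threshold a and the margin are hypothetical): weights below a are pruned, weights above a + margin are spliced back in, and weights in between keep their current mask entry.

```python
import torch

def surgery_mask_update(W: torch.Tensor, T: torch.Tensor, a: float, margin: float) -> torch.Tensor:
    b = a + margin
    mag = W.abs()
    T_new = T.clone()
    T_new[mag < a] = 0.0      # prune
    T_new[mag > b] = 1.0      # rehabilitate (splice the connection back in)
    return T_new              # entries with a <= |w| <= b keep their previous value

W = torch.randn(128, 256)     # full-precision weights, always updated by the optimizer
T = torch.ones_like(W)
T = surgery_mask_update(W, T, a=0.8, margin=0.1)
# forward/backward passes use W * T, while gradient updates are applied to all of W,
# so a pruned weight can grow back above a + margin and be re-activated later.
```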

27 Pruning with rehabilitation: Dynamic Network Surgery (Experiment on LeNet)
| Model | Layer | Parameters | Parameters (DNS) |
| LeNet-5 | conv1 | 0.5K | 14.2% |
| | conv2 | 25K | 3.1% |
| | fc1 | 400K | 0.7% |
| | fc2 | 5K | 4.3% |
| | Total | 431K | 0.9% |
| LeNet | fc1 | 236K | 1.8% |
| | fc2 | 30K | |
| | fc3 | 1K | 5.5% |
| | Total | 267K | |

28 Pruning with rehabilitation: Dynamic Network Surgery (Experiment on AlexNet)
| Layer | Parameters | Parameters (Han et al. 2015) | Parameters (DNS) |
| conv1 | 35K | 84% | 53.8% |
| conv2 | 307K | 38% | 40.6% |
| conv3 | 885K | 35% | 29.0% |
| conv4 | 664K | 37% | 32.3% |
| conv5 | 443K | | 32.5% |
| fc1 | 38M | 9% | 3.7% |
| fc2 | 17M | | 6.6% |
| fc3 | 4M | 25% | 4.6% |
| Total | 61M | 11% | 5.7% |
Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS 2016.

29 Outline
- Matrix Factorization
- Weight Pruning
  - Magnitude-based method
  - Hessian-based method
    - Diagonal Hessian-based method
    - Full Hessian-based method
- Quantization method
- Pruning + Quantization + Encoding
- Design small architecture: SqueezeNet

30 Diagonal Hessian-based method: Optimal Brain Damage
The idea of model compression & speedup can be traced back to 1990. OBD is actually theoretically more "optimal" than the current state of the art, but much more computationally inefficient.
Delete parameters with small "saliency". Saliency: the effect of a parameter's deletion on the training error. OBD proposes a theoretically justified saliency measure.

31 Diagonal Hessian-based method: Optimal Brain Damage (Formulation)
Approximate the objective function $E$ with a Taylor series:
$\delta E = \sum_i \frac{\partial E}{\partial u_i} \delta u_i + \frac{1}{2} \sum_i \frac{\partial^2 E}{\partial u_i^2} \delta u_i^2 + \frac{1}{2} \sum_{i \ne j} \frac{\partial^2 E}{\partial u_i \partial u_j} \delta u_i \delta u_j + O(\|\delta U\|^3)$
Deletion happens after training has converged: at a local minimum the gradients equal 0. Neglecting the cross terms:
$\delta E = \frac{1}{2} \sum_i \frac{\partial^2 E}{\partial u_i^2} \delta u_i^2$
LeCun, Yann, et al. "Optimal brain damage." NIPS, 1990.

32 Diagonal Hessian-based method: Optimal Brain Damage (Algorithm)
1. Choose a neural network architecture.
2. Train the network until a reasonable solution is obtained.
3. Compute the second derivatives for each parameter.
4. Compute the saliency of each parameter: $s_k = \frac{1}{2} \frac{\partial^2 E}{\partial u_k^2} u_k^2$.
5. Sort the parameters by saliency and delete some low-saliency parameters.
6. Iterate to step 2.
LeCun, Yann, et al. "Optimal brain damage." NIPS, 1990.
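A minimal NumPy sketch of steps 4-5, assuming the diagonal of the Hessian has already been computed (here it is a random stand-in) and pruning a hypothetical 30% of parameters:

```python
import numpy as np

def obd_saliencies(params: np.ndarray, hessian_diag: np.ndarray) -> np.ndarray:
    # s_k = 0.5 * (d^2 E / d u_k^2) * u_k^2
    return 0.5 * hessian_diag * params ** 2

rng = np.random.default_rng(0)
params = rng.normal(size=1000)
hessian_diag = np.abs(rng.normal(size=1000))             # stand-in for the true second derivatives

s = obd_saliencies(params, hessian_diag)
prune_fraction = 0.3
lowest = np.argsort(s)[: int(prune_fraction * len(s))]   # lowest-saliency parameters
params[lowest] = 0.0                                     # delete them, then retrain
```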

33 Diagonal Hessian-based method: Optimal Brain Damage (Experiment: OBD vs. Magnitude)
OBD vs. magnitude-based pruning: deletion based on saliency performs better.
LeCun, Yann, et al. "Optimal brain damage." NIPS, 1990.

34 Diagonal Hessian-based method: Optimal Brain Damage (Experiment: Retraining)
How does retraining help? (figure comparing pruning with and without retraining)
LeCun, Yann, et al. "Optimal brain damage." NIPS, 1990.

35 Full Hessian-based method: Optimal Brain Surgeon
Motivation: a more accurate estimate of saliency and optimal weight updates.
Advantages: more accurate saliency estimation; directly provides the weight updates that minimize the change in the objective function.
Disadvantages: more computation compared with OBD; the weight updates are not based on minimizing the objective function itself.
Hassibi, Babak, and David G. Stork. "Second order derivatives for network pruning: Optimal brain surgeon." NIPS, 1993.

36 Full Hessian-based method: Optimal Brain Surgeon (Formulation)
Approximate the objective function $E$ with a Taylor series:
$\delta E = \left(\frac{\partial E}{\partial w}\right)^{T} \delta w + \frac{1}{2} \delta w^T H \, \delta w + O(\|\delta w\|^3)$, with the constraint $e_q^T \delta w + w_q = 0$.
We assume the network is trained to a local minimum and ignore higher-order terms. Solving the Lagrangian form gives
$\delta w = -\frac{w_q}{[H^{-1}]_{qq}} H^{-1} e_q$ and $L_q = \frac{w_q^2}{2 [H^{-1}]_{qq}}$,
where $L_q$ is the saliency for weight $w_q$.
Hassibi, Babak, and David G. Stork. "Second order derivatives for network pruning: Optimal brain surgeon." NIPS, 1993.

37 Full Hessian-based method: Optimal Brain Surgeon (Algorithm)
1. Choose a neural network architecture.
2. Train the network until a reasonable solution is obtained.
3. Find the $q$ that gives the smallest saliency $L_q$, and decide whether to delete $q$ or stop pruning.
4. Update all weights based on the calculated $\delta w$.
5. Iterate to step 3.
Hassibi, Babak, and David G. Stork. "Second order derivatives for network pruning: Optimal brain surgeon." NIPS, 1993.
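A minimal NumPy sketch of one such step, assuming the inverse Hessian of the converged network is already available (here a random positive-definite stand-in):

```python
import numpy as np

def obs_step(w: np.ndarray, H_inv: np.ndarray):
    L = w ** 2 / (2.0 * np.diag(H_inv))            # saliency L_q of deleting each weight
    q = int(np.argmin(L))                          # weight whose removal hurts least
    delta_w = -(w[q] / H_inv[q, q]) * H_inv[:, q]  # closed-form update for all weights
    return w + delta_w, q, L[q]                    # w[q] becomes exactly 0

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 10))
H_inv = np.linalg.inv(A @ A.T + 10 * np.eye(10))   # stand-in inverse Hessian
w = rng.normal(size=10)

w, q, Lq = obs_step(w, H_inv)
print(q, Lq, w[q])                                 # the selected weight is now zero
```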

38 Full Hessian-based method: Optimal Brain Surgeon
Hassibi, Babak, and David G. Stork. "Second order derivatives for network pruning: Optimal brain surgeon." NIPS, 1993.

39 Outline
- Matrix Factorization
- Weight Pruning
- Quantization method
  - Full Quantization
    - Fixed-point format
    - Code book
  - Quantization with full-precision copy
- Pruning + Quantization + Encoding
- Design small architecture: SqueezeNet

40 Full Quantization : Fixed-point format
Limited precision arithmetic: $[QI.QF]$, where $QI$ and $QF$ correspond to the integer and the fractional part of the number.
The number of integer bits (IL) plus the number of fractional bits (FL) yields the total number of bits used to represent the number: WL = IL + FL. The format can be written as $\langle IL, FL \rangle$.
$\langle IL, FL \rangle$ limits the precision to FL bits and sets the range to $\big[-2^{IL-1},\; 2^{IL-1} - 2^{-FL}\big]$.
Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML 2015.

41 Full Quantization : Fixed-point format (Rounding Modes)
Define $\lfloor x \rfloor$ as the largest integer multiple of $\epsilon = 2^{-FL}$ less than or equal to $x$.
Round-to-nearest:
$\mathrm{Round}(x, \langle IL, FL \rangle) = \begin{cases} \lfloor x \rfloor & \text{if } \lfloor x \rfloor \le x \le \lfloor x \rfloor + \frac{\epsilon}{2} \\ \lfloor x \rfloor + \epsilon & \text{if } \lfloor x \rfloor + \frac{\epsilon}{2} < x \le \lfloor x \rfloor + \epsilon \end{cases}$
Stochastic rounding (unbiased):
$\mathrm{Round}(x, \langle IL, FL \rangle) = \begin{cases} \lfloor x \rfloor & \text{w.p. } 1 - \frac{x - \lfloor x \rfloor}{\epsilon} \\ \lfloor x \rfloor + \epsilon & \text{w.p. } \frac{x - \lfloor x \rfloor}{\epsilon} \end{cases}$
If $x$ lies outside the range of $\langle IL, FL \rangle$, we saturate the result to either the lower or the upper limit of $\langle IL, FL \rangle$.
Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML 2015.
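A minimal NumPy sketch of both rounding modes followed by saturation; the format <IL=4, FL=12> and the inputs are arbitrary examples:

```python
import numpy as np

def to_fixed_point(x, IL, FL, stochastic=False):
    eps = 2.0 ** (-FL)                                # smallest representable step
    lo, hi = -2.0 ** (IL - 1), 2.0 ** (IL - 1) - eps  # representable range of <IL, FL>
    floor = np.floor(x / eps) * eps                   # largest multiple of eps <= x
    if stochastic:
        p = (x - floor) / eps                         # probability of rounding up (unbiased)
        rounded = floor + eps * (np.random.rand(*np.shape(x)) < p)
    else:
        rounded = floor + eps * ((x - floor) >= eps / 2)
    return np.clip(rounded, lo, hi)                   # saturate values outside the range

x = np.random.randn(5)
print(to_fixed_point(x, IL=4, FL=12))
print(to_fixed_point(x, IL=4, FL=12, stochastic=True))
```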

42 Multiply and accumulate (MACC) operation
During training:
1. $a$ and $b$ are two vectors in fixed-point format $\langle IL, FL \rangle$.
2. Compute $z = \sum_{i=1}^{d} a_i b_i$. The result is a fixed-point number with format $\langle 2 \times IL, 2 \times FL \rangle$.
3. Convert and round $z$ back to fixed-point format $\langle IL, FL \rangle$.
During testing: everything uses fixed-point format $\langle IL, FL \rangle$.
Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML 2015.

43 Full Quantization: Fixed-point format (Experiment on MNIST with fully connected DNNs)

44 Full Quantization: Fixed-point format (Experiment on MNIST with CNNs)
Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML 2015.

45 Full Quantization: Fixed-point format (Experiment on CIFAR10 with fully connected DNNs)
Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML 2015.

46 Full Quantization: Code book
Quantization using k-means:
Perform k-means to find $k$ centers $c_z$ for the weights $W$. Set $W_{ij} = c_z$ where $z = \arg\min_z |W_{ij} - c_z|$.
Compression rate: $32 / \log_2 k$ (the codebook itself is negligible).
Product quantization:
Partition $W \in \mathbb{R}^{m \times n}$ column-wise into $s$ submatrices $W = [W^1, W^2, \cdots, W^s]$. Perform k-means for the elements in each $W^i$ to find $k$ centers $c_z^i$. Set $W_j^i = c_z^i$ where $z = \arg\min_z \|W_j^i - c_z^i\|$.
Compression rate: $32mn / (32kn + \log_2(k)\, ms)$.
Residual quantization:
Quantize the vectors into $k$ centers, then recursively quantize the residuals for $t$ iterations.
Compression rate: $m / (tk + \log_2(k)\, tn)$.
Gong, Yunchao, et al. "Compressing deep convolutional networks using vector quantization." arXiv preprint, 2014.
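A minimal sketch of the scalar k-means variant using scikit-learn; the weight matrix size and k = 16 (i.e. 4-bit indices) are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

W = np.random.randn(256, 512).astype(np.float32)   # hypothetical weight matrix
k = 16                                              # codebook size -> log2(16) = 4 bits per weight

km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(W.reshape(-1, 1))
codebook = km.cluster_centers_.ravel()              # k shared center values
indices = km.labels_.reshape(W.shape)               # per-weight index into the codebook
W_quant = codebook[indices]                         # reconstructed (quantized) weights

print(f"compression rate ~ {32 / np.log2(k):.0f}x (ignoring the tiny codebook)")
print(f"mean abs error: {np.abs(W - W_quant).mean():.4f}")
```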

47 Full Quantization: Code book (Experiment on PQ)

48 Full Quantization: Code book
Gong, Yunchao, et al. "Compressing deep convolutional networks using vector quantization." arXiv preprint, 2014.

49 Outline
- Matrix Factorization
- Weight Pruning
- Quantization method
  - Full quantization
  - Quantization with full-precision copy
    - BinaryConnect
    - BNN
- Design small architecture: SqueezeNet

50 Quantization with full-precision copy: Binaryconnect (Motivation)
Use only two possible values (e.g. +1 or -1) for the weights. This replaces many multiply-accumulate operations with simple accumulations; fixed-point adders are much less expensive, both in terms of area and energy, than fixed-point multiply-accumulators.
Courbariaux, et al. "BinaryConnect: Training deep neural networks with binary weights during propagations." NIPS 2015.

51 Quantization with full-precision copy: Binaryconnect (Binarization)
Deterministic binarization:
$w_b = \begin{cases} +1 & \text{if } w \ge 0 \\ -1 & \text{otherwise} \end{cases}$
Stochastic binarization:
$w_b = \begin{cases} +1 & \text{with probability } p = \sigma(w) \\ -1 & \text{with probability } 1 - p \end{cases}$, where $\sigma(x) = \mathrm{clip}\left(\frac{x+1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{x+1}{2}\right)\right)$.
Stochastic binarization is more theoretically appealing than the deterministic one, but harder to implement as it requires the hardware to generate random bits when quantizing.
Courbariaux, et al. "BinaryConnect: Training deep neural networks with binary weights during propagations." NIPS 2015.
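A minimal PyTorch sketch of both rules; the "hard sigmoid" below is the clipped (x+1)/2 defined above:

```python
import torch

def binarize_deterministic(w: torch.Tensor) -> torch.Tensor:
    return torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

def hard_sigmoid(x: torch.Tensor) -> torch.Tensor:
    return torch.clamp((x + 1.0) / 2.0, 0.0, 1.0)

def binarize_stochastic(w: torch.Tensor) -> torch.Tensor:
    p = hard_sigmoid(w)                             # probability of +1
    return torch.where(torch.rand_like(w) < p, torch.ones_like(w), -torch.ones_like(w))

w = torch.randn(4, 4)
print(binarize_deterministic(w))
print(binarize_stochastic(w))
```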

52 Quantization with full-precision copy: Binaryconnect
1. Given the DNN input, compute the unit activations layer by layer, leading to the top layer, which is the output of the DNN given its input. This step is referred to as the forward propagation.
2. Given the DNN target, compute the training objective's gradient w.r.t. each layer's activations, starting from the top layer and going down layer by layer until the first hidden layer. This step is referred to as the backward propagation or backward phase of back-propagation.
3. Compute the gradient w.r.t. each layer's parameters and then update the parameters using their computed gradients and their previous values. This step is referred to as the parameter update.
Courbariaux, et al. "BinaryConnect: Training deep neural networks with binary weights during propagations." NIPS 2015.

53 Quantization with full-precision copy: Binaryconnect
BinaryConnect only binarizes the weights during the forward and backward propagations (steps 1 and 2), not during the parameter update (step 3).
Courbariaux, et al. "BinaryConnect: Training deep neural networks with binary weights during propagations." NIPS 2015.

54 Quantization with full-precision copy: Binaryconnect
1. Binarize the weights and perform the forward pass.
2. Back-propagate the gradients based on the binarized weights.
3. Update the full-precision weights.
4. Iterate to step 1.
Courbariaux, et al. "BinaryConnect: Training deep neural networks with binary weights during propagations." NIPS 2015.
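A minimal PyTorch sketch of one such training step for a single linear layer; the model, data, and hyper-parameters are all illustrative, and the bias is left full-precision:

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 10)
x, y = torch.randn(32, 100), torch.randint(0, 10, (32,))
loss_fn, lr = nn.CrossEntropyLoss(), 0.01

w_full = model.weight.data.clone()                          # full-precision copy

# 1-2. forward and backward passes use the binarized weights
model.weight.data = torch.where(w_full >= 0, torch.ones_like(w_full), -torch.ones_like(w_full))
loss = loss_fn(model(x), y)
loss.backward()

# 3. the gradient (computed through the binary weights) updates the full-precision copy
with torch.no_grad():
    w_full -= lr * model.weight.grad
    w_full.clamp_(-1.0, 1.0)                                # keep the copy in [-1, 1]
model.weight.data = w_full.clone()                          # 4. ready for the next iteration
```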

55 Quantization with full-precision copy: Binaryconnect
Courbariaux, et al. "BinaryConnect: Training deep neural networks with binary weights during propagations." NIPS 2015.

56 Quantization with full-precision copy: Binaryconnect
Courbariaux, et al. "BinaryConnect: Training deep neural networks with binary weights during propagations." NIPS 2015.

57 Quantization with full-precision copy: Binarized Neural Networks (Motivation)
Neural networks with both binary weights and activations, at run-time and when computing the parameters' gradients at train time.
Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint, 2016.

58 Quantization with full-precision copy: Binarized Neural Networks
Propagating gradients through discretization (the "straight-through estimator"):
$q = \mathrm{Sign}(r)$, with $g_q$ an estimator of the gradient $\frac{\partial C}{\partial q}$.
Straight-through estimator of $\frac{\partial C}{\partial r}$: $g_r = g_q \, 1_{|r| \le 1}$.
This can be viewed as propagating the gradient through a hard tanh.
Other tricks: replace multiplications with bit-shifts, replace batch normalization with shift-based batch normalization, and replace ADAM with shift-based AdaMax.
Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint, 2016.
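A minimal PyTorch sketch of the straight-through estimator written as a custom autograd function (for illustration only):

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, r):
        ctx.save_for_backward(r)
        return torch.sign(r)                       # q = Sign(r)

    @staticmethod
    def backward(ctx, grad_q):
        (r,) = ctx.saved_tensors
        return grad_q * (r.abs() <= 1.0).float()   # g_r = g_q * 1_{|r| <= 1}

r = torch.randn(5, requires_grad=True)
q = BinarizeSTE.apply(r)
q.sum().backward()
print(r.grad)   # 1 where |r| <= 1, 0 elsewhere (the gradient of hard tanh)
```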

59 Quantization with full-precision copy: Binarized Neural Networks
Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint, 2016.

60 Quantization with full-precision copy: Binarized Neural Networks
| Operation | MUL | ADD |
| 8-bit Integer | 0.2 pJ | 0.03 pJ |
| 32-bit Integer | 3.1 pJ | 0.1 pJ |
| 16-bit Floating Point | 1.1 pJ | 0.4 pJ |
| 32-bit Floating Point | 3.7 pJ | 0.9 pJ |

| Memory size | 64-bit memory access |
| 8K | 10 pJ |
| 32K | 20 pJ |
| 1M | 100 pJ |
| DRAM | nJ scale |
Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint, 2016.

61 Quantization with full-precision copy: Binarized Neural Networks
Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint, 2016.

62 Outline
- Matrix Factorization
- Weight Pruning
- Quantization method
- Pruning + Quantization + Encoding
  - Deep Compression
- Design small architecture: SqueezeNet

63 Pruning + Quantization + Encoding: Deep Compression
Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding." arXiv preprint, 2015.

64 Pruning + Quantization + Encoding: Deep Compression
1. Choose a neural network architecture.
2. Train the network until a reasonable solution is obtained.
3. Prune the network with the magnitude-based method until a reasonable solution is obtained.
4. Quantize the network with the k-means-based method until a reasonable solution is obtained.
5. Further compress the network with Huffman coding.
Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding." arXiv preprint, 2015.

65 Pruning + Quantization + Encoding: Deep Compression

66 Pruning + Quantization + Encoding: Deep Compression

67 Outline
- Matrix Factorization
- Weight Pruning
- Quantization method
- Pruning + Quantization + Encoding
- Design small architecture: SqueezeNet

68 Design small architecture: SqueezeNet
Compression schemes on a pre-trained model vs. designing a small CNN architecture from scratch (can it also preserve accuracy?). So far we have been compressing pre-trained models.

69 SqueezeNet Design Strategies
Strategy 1: Replace 3x3 filters with 1x1 filters. Parameters per filter: (3x3 filter) = 9 * (1x1 filter).
Strategy 2: Decrease the number of input channels to 3x3 filters. Total # of parameters: (# of input channels) * (# of filters) * (# of parameters per filter).
Strategy 3: Downsample late in the network so that convolution layers have large activation maps. The size of the activation maps is determined by the size of the input data and by the choice of layers in which to downsample in the CNN architecture.
S1: 3x3 filters have 9x the parameters of 1x1 filters. Given a budget of a certain number of filters, prefer 1x1 over 3x3 (expand layer).
S2: Apart from limiting the number of 3x3 filters, we can also decrease the number of input channels to the 3x3 filters (in the expand layer).
S3: Large activation maps (due to delayed downsampling) can lead to higher classification accuracy, with all else held equal. Most commonly, downsampling is engineered into CNN architectures by setting stride > 1 in some of the convolution or pooling (max-pool) layers. If early layers in the network have large strides, then most layers will have small activation maps; conversely, if most layers have a stride of 1 and the strides greater than 1 are concentrated toward the end of the network, then many layers will have large activation maps.
Next we describe the Fire module, the building block for CNN architectures that enables us to employ Strategies 1, 2, and 3.
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint, 2016.

70 Microarchitecture – Fire Module
A Fire module consists of:
- a squeeze convolution layer made up of s1x1 1x1 filters, and
- an expand layer with a mixture of e1x1 1x1 filters and e3x3 3x3 filters.
Fire modules are the building blocks of SqueezeNet; a sketch of one module follows below.
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint, 2016.
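A minimal PyTorch sketch of the module; the sizes in the example roughly follow the paper's fire2 configuration, but treat them as illustrative:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    def __init__(self, in_ch, s1x1, e1x1, e3x3):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, s1x1, kernel_size=1)               # squeeze: 1x1 only
        self.expand1x1 = nn.Conv2d(s1x1, e1x1, kernel_size=1)              # expand: 1x1 branch
        self.expand3x3 = nn.Conv2d(s1x1, e3x3, kernel_size=3, padding=1)   # expand: 3x3 branch
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)            # concatenate channels

fire = Fire(in_ch=96, s1x1=16, e1x1=64, e3x3=64)
out = fire(torch.randn(1, 96, 55, 55))
print(out.shape)   # torch.Size([1, 128, 55, 55])
```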

71 Microarchitecture – Fire Module
Strategy 2: decrease the number of input channels to 3x3 filters. Total # of parameters: (# of input channels) * (# of filters) * (# of parameters per filter).
Squeeze layer: setting s1x1 < (e1x1 + e3x3) limits the number of input channels to the 3x3 filters. How much can we limit s1x1?
By reducing the number of filters in the "squeeze" layer feeding into the "expand" layer, we reduce the number of connections entering the 3x3 filters and thus the total number of parameters. Strategies 1 and 2 are about judiciously decreasing the quantity of parameters in a CNN while attempting to preserve accuracy.
Strategy 1: replace 3x3 filters with 1x1 filters. Parameters per filter: (3x3 filter) = 9 * (1x1 filter). How much can we replace 3x3 with 1x1 (e1x1 vs. e3x3)?
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint, 2016.

72 Parameters in Fire Module
The number of expand filters (ei): ei = ei,1x1 + ei,3x3.
The percentage of 3x3 filters in the expand layer (pct3x3): ei,3x3 = pct3x3 * ei.
The squeeze ratio (SR): si,1x1 = SR * ei.
The smaller SR is, the smaller the model size. From the figure, increasing SR beyond the baseline can further increase ImageNet top-5 accuracy from 80.3% (i.e. AlexNet-level) with a 4.8MB model to 86.0% with a 19MB model. Accuracy plateaus at 86.0% with SR = 0.75 (a 19MB model), and setting SR = 1.0 further increases model size without improving accuracy.
A larger percentage of 3x3 filters gives more accuracy and a bigger model size, up to a point: in Figure 3(b), top-5 accuracy plateaus at 85.6% using 50% 3x3 filters, and further increasing the percentage of 3x3 filters leads to a larger model size but provides no improvement in accuracy on ImageNet. A worked parameter count for one module is sketched below.
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint, 2016.
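A small worked example of how SR and pct3x3 translate into parameters for one Fire module; the input channel count and filter budget are hypothetical:

```python
def fire_params(in_ch, e, SR=0.125, pct3x3=0.5):
    """Parameter count of one Fire module given the definitions above (biases ignored)."""
    s1x1 = int(SR * e)            # squeeze filters
    e3x3 = int(pct3x3 * e)        # 3x3 expand filters
    e1x1 = e - e3x3               # 1x1 expand filters
    return (in_ch * s1x1          # squeeze 1x1 weights
            + s1x1 * e1x1         # expand 1x1 weights
            + s1x1 * e3x3 * 9)    # expand 3x3 weights

print(fire_params(96, 128))              # small SR -> few parameters
print(fire_params(96, 128, SR=0.75))     # larger SR -> more parameters for the same e
```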

73 Macroarchitecture
Strategy 3: downsample late in the network so that convolution layers have large activation maps. The size of the activation maps depends on the size of the input data and on the choice of layers in which to downsample in the CNN architecture. The relatively late placement of pooling keeps activation maps large through the later phases of the network to preserve higher accuracy. Strategy 3 is about maximizing accuracy on a limited budget of parameters.
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint, 2016.

74 Macroarchitecture
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint, 2016.

75 Evaluation of Results
With SqueezeNet, we achieve a 50x reduction in model size compared to AlexNet, while meeting or exceeding the top-1 and top-5 accuracy of AlexNet. All of the aforementioned results are summarized in Table 2 of the paper.
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint, 2016.

76 Further Compression on the 4.8MB Model?
Deep Compression + quantization.
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint, 2016.

77 Takeaway Points
Compress pre-trained networks, on a single layer:
- Fully connected layer: SVD
- Convolutional layer: flattened convolutions
Weight pruning: the magnitude-based pruning method is simple and effective, and is the first choice for weight pruning. Retraining is important for model compression.
Quantization: weight quantization with a full-precision copy can prevent the gradient from vanishing.
Weight pruning, quantization, and encoding are independent; we can use all three methods together for a better compression ratio.
Design a smaller CNN architecture. Example: SqueezeNet, which uses Fire modules and delays pooling to later stages.

78 Reading List
- Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for efficient evaluation." Advances in Neural Information Processing Systems, 2014.
- Jin, Jonghoon, Aysegul Dundar, and Eugenio Culurciello. "Flattened convolutional neural networks for feedforward acceleration." arXiv preprint, 2014.
- Gong, Yunchao, et al. "Compressing deep convolutional networks using vector quantization." arXiv preprint, 2014.
- Han, Song, et al. "Learning both weights and connections for efficient neural network." Advances in Neural Information Processing Systems, 2015.
- Guo, Yiwen, Anbang Yao, and Yurong Chen. "Dynamic Network Surgery for Efficient DNNs." Advances in Neural Information Processing Systems, 2016.
- Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML 2015.
- Courbariaux, Matthieu, Yoshua Bengio, and Jean-Pierre David. "BinaryConnect: Training deep neural networks with binary weights during propagations." Advances in Neural Information Processing Systems, 2015.
- Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint, 2016.
- Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding." arXiv preprint, 2015.
- Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint, 2016.

