1
Network Compression and Speedup
Shuochao Yao, Yiwen Xu, Daniel Calzada Network Compression and Speedup
2
Network Compression and Speedup
Source: Network Compression and Speedup
3
Network Compression and Speedup
Why smaller models? Energy cost per operation:

Operation              Energy [pJ]   Relative Cost
32-bit int ADD         0.1           1
32-bit float ADD       0.9           9
32-bit Register File   1             10
32-bit int MULT        3.1           31
32-bit float MULT      3.7           37
32-bit SRAM Cache      5             50
32-bit DRAM Memory     640           6400
4
Outline
Matrix Factorization
Weight Pruning
Quantization method
Pruning + Quantization + Encoding
Design small architecture: SqueezeNet
5
Outline
Matrix Factorization
  Singular Value Decomposition (SVD)
  Flattened Convolutions
Weight Pruning
Quantization method
Pruning + Quantization + Encoding
Design small architecture: SqueezeNet
(Give introduction to MF here.) A typical CNN (e.g. VGGNet) first has some convolutional layers and then some fully connected layers. We'll explore how to compress these layers and/or speed them up.
6
Fully Connected Layers: Singular Value Decomposition
Most weights are in the fully connected layers (according to Denton et al.).
M = U S V^⊤, with M ∈ ℝ^{m×k}, U ∈ ℝ^{m×m}, S ∈ ℝ^{m×k}, V^⊤ ∈ ℝ^{k×k}
S is diagonal, with magnitudes decreasing along the diagonal.
According to Denton, the majority of the weights reside in the FC layers, so compressing those is key.
7
Singular Value Decomposition
By only keeping the t singular values with largest magnitude:
M̃ = Ũ S̃ Ṽ^⊤, with M̃ ∈ ℝ^{m×k}, Ũ ∈ ℝ^{m×t}, S̃ ∈ ℝ^{t×t}, Ṽ^⊤ ∈ ℝ^{t×k}
rank(M̃) = t
8
Network Compression and Speedup
SVD: Compression
M = U S V^⊤, with M ∈ ℝ^{m×k}, U ∈ ℝ^{m×m}, S ∈ ℝ^{m×k}, V^⊤ ∈ ℝ^{k×k}
M̃ = Ũ S̃ Ṽ^⊤, with M̃ ∈ ℝ^{m×k}, Ũ ∈ ℝ^{m×t}, S̃ ∈ ℝ^{t×t}, Ṽ^⊤ ∈ ℝ^{t×k}
Storage for M: O(mk)
Storage for M̃: O(mt + t + tk)
Compression rate: O(mk / (t(m + k + 1)))
Theoretical error: ||A M̃ − A M||_F ≤ s_{t+1} ||A||_F
We can take advantage of the fact that S is diagonal to store only the t largest singular values as a list. As expected, smaller t values yield higher compression rates.
Gong, Yunchao, et al. "Compressing deep convolutional networks using vector quantization." arXiv preprint (2014).
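As a concrete illustration, here is a minimal NumPy sketch (shapes, variable names, and the rank t are assumed for illustration, not taken from the paper's code) of compressing a fully connected weight matrix by truncating its SVD and comparing storage and reconstruction error:

```python
# Minimal sketch: keep only the t largest singular values of a weight matrix M.
import numpy as np

def svd_compress(M, t):
    """Return the rank-t factors (U_t, s_t, Vt_t) of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = U @ diag(s) @ Vt
    return U[:, :t], s[:t], Vt[:t, :]

m, k, t = 1024, 4096, 64
M = np.random.randn(m, k)
U_t, s_t, Vt_t = svd_compress(M, t)

M_approx = (U_t * s_t) @ Vt_t                          # rank-t reconstruction
storage_full = M.size                                  # O(mk) floats
storage_trunc = U_t.size + s_t.size + Vt_t.size        # O(mt + t + tk) floats
print("compression rate:", storage_full / storage_trunc)
# Random weights are nearly full rank, so this error is large; the paper's point is
# that trained FC layers are far more compressible.
print("relative error:", np.linalg.norm(M - M_approx) / np.linalg.norm(M))
```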
9
SVD: Compression Results
Trained on the ImageNet 2012 database, then compressed: 5 convolutional layers, 3 fully connected layers, softmax output layer.
Observations: For the FC layers, we can obtain a 2-3x compression without affecting accuracy much, or we can be more aggressive and lose some accuracy. The paper didn't mention the number of elements in the FC layers.
(Monochromatic approximation: make all colors of the convolution lie on the same line; C′ is the number of color channels to use.)
(Outer product decomposition is discussed later.)
(Biclustering: first cluster the weight matrix by C and F (C = channel, F = output feature), then perform SVD or outer product decomposition on each cluster; since the weights within a cluster are already similar, the layers compress better. G = input clusters, H = output clusters.)
K refers to the rank of the approximation, t in the previous slides.
Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for efficient evaluation." Advances in Neural Information Processing Systems.
10
Network Compression and Speedup
SVD: Side Benefits
Reduced memory footprint: reduced in the dense layers by 5-13x.
Speedup: A M̃, with A ∈ ℝ^{n×m}, computed in O(nmt + nt² + ntk) instead of O(nmk). Speedup factor is O(mk / (t(m + t + k))).
Regularization: "Low-rank projections effectively decrease number of learnable parameters, suggesting that they might improve generalization ability." The paper applies SVD after training.
Premise behind the speedup: to compute A M̃, first compute A Ũ, then (A Ũ) S̃, and then ((A Ũ) S̃) Ṽ^⊤.
Naturally, the speedup is considerable for small t. How small can t be? (The error bound was discussed in previous slides.)
Most research is centered around speeding up convolutional layers (not with SVD), so I couldn't find experimental results for speedup.
Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for efficient evaluation." Advances in Neural Information Processing Systems.
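A short sketch of the speedup idea (assumed shapes, random data): apply the factors one at a time instead of forming the m×k matrix, so the cost scales as O(nmt + nt² + ntk), and even less here since S̃ is diagonal, rather than O(nmk):

```python
# Minimal sketch: apply the factored layer as ((A @ U_t) * s_t) @ Vt_t.
import numpy as np

n, m, k, t = 256, 1024, 4096, 64
A = np.random.randn(n, m)
M = np.random.randn(m, k)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U_t, s_t, Vt_t = U[:, :t], s[:t], Vt[:t, :]

out_fast = ((A @ U_t) * s_t) @ Vt_t     # never materializes an m x k matrix
out_slow = A @ ((U_t * s_t) @ Vt_t)     # same rank-t layer, via the dense m x k matrix
print(np.allclose(out_fast, out_slow))  # True (up to floating point)
```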
11
Convolutions: Matrix Multiplication
Most time is spent in the convolutional layers.
F(x, y) = (I ∗ W)(x, y)
12
Flattened Convolutions
Replace C×Y×X convolutions with C×1×1, 1×Y×1, and 1×1×X convolutions.
Denton, et al.: "The majority of forward propagation time is spent on the first two convolutional layers." If this is the case in their model, it may be the case in many models, and it would be wise to optimize it. (Discussed in the first week.)
Jin, Jonghoon, Aysegul Dundar, and Eugenio Culurciello. "Flattened convolutional neural networks for feedforward acceleration." arXiv preprint (2014).
13
Flattened Convolutions
F(x, y) = (I ∗ W_f)(x, y) = Σ_{x′=1}^{X} Σ_{y′=1}^{Y} Σ_{c=1}^{C} I(c, x − x′, y − y′) α_c β_{y′} γ_{x′}, with α ∈ ℝ^C, β ∈ ℝ^Y, γ ∈ ℝ^X
Compression and Speedup:
Parameter reduction: O(XYC) to O(X + Y + C)
Operation reduction: O(mnCXY) to O(mn(C + X + Y)) (where the output feature map is of size m×n)
Is that a mistake in the paper in equation 3? x′ and y′ seem to be swapped.
Alpha, beta, and gamma each correspond to a convolution in one dimension.
Jin, Jonghoon, Aysegul Dundar, and Eugenio Culurciello. "Flattened convolutional neural networks for feedforward acceleration." arXiv preprint (2014).
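The factorization can be checked numerically. Below is a minimal NumPy sketch (random data, correlation form, and the names alpha/beta/gamma are assumptions following the slide; requires NumPy ≥ 1.20 for sliding_window_view) showing that a rank-1 filter applied as three 1-D convolutions matches the full C×Y×X convolution:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

C, H, W = 3, 16, 16          # input channels and spatial size
Y, X = 5, 5                  # filter spatial size
I = np.random.randn(C, H, W)
alpha = np.random.randn(C)   # per-channel factor
beta = np.random.randn(Y)    # vertical factor
gamma = np.random.randn(X)   # horizontal factor

# Full convolution with the rank-1 filter W_f[c, y, x] = alpha_c * beta_y * gamma_x
# (written as correlation; the filter flip does not affect the equivalence).
W_f = np.einsum("c,y,x->cyx", alpha, beta, gamma)
patches = sliding_window_view(I, (Y, X), axis=(1, 2))        # (C, H-Y+1, W-X+1, Y, X)
full = np.einsum("chwyx,cyx->hw", patches, W_f)              # C*Y*X multiplies per output

# Flattened version: three 1-D convolutions (channels, then rows, then columns).
t = np.einsum("chw,c->hw", I, alpha)                                      # C x 1 x 1
t = np.einsum("hwy,y->hw", sliding_window_view(t, Y, axis=0), beta)       # 1 x Y x 1
flat = np.einsum("hwx,x->hw", sliding_window_view(t, X, axis=1), gamma)   # 1 x 1 x X

print(np.allclose(full, flat))   # True: only C + Y + X multiplies per output pixel
```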
14
Network Compression and Speedup
Flattening = MF
F(x, y) = Σ_{x′=1}^{X} Σ_{y′=1}^{Y} Σ_{c=1}^{C} I(c, x − x′, y − y′) α_c β_{y′} γ_{x′} = Σ_{x′=1}^{X} Σ_{y′=1}^{Y} Σ_{c=1}^{C} I(c, x − x′, y − y′) W_f(c, x′, y′)
W_f = α ⊗ β ⊗ γ, rank(W_f) = 1
W_f = Σ_{k=1}^{K} α_k ⊗ β_k ⊗ γ_k gives rank K
SVD: can reconstruct the original matrix as A = Σ_{k=1}^{K} s_k u_k ⊗ v_k
Alpha, beta, and gamma each correspond to a convolution in one dimension.
rank(W_f) = 1 can be seen (at least in the case of a matrix) because each row is a scalar multiple of the others.
Combining: in 2D, it's actually equivalent to SVD, since a matrix can be reconstructed by adding up the outer products of its singular vectors scaled by the singular values. Thus α_k ⊗ β_k = s_k u_k ⊗ v_k (page 6). Denton's paper proposes combining these.
Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for efficient evaluation." Advances in Neural Information Processing Systems.
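A tiny sketch of the SVD connection (a generic random 2-D filter, not from the paper): the separable rank-1 pieces can be read directly off the SVD, W ≈ Σ_k s_k u_k ⊗ v_k, which is exactly a stack of K "flattened" filters:

```python
import numpy as np

W = np.random.randn(5, 5)                 # a generic (full-rank) 2-D filter
U, s, Vt = np.linalg.svd(W)

K = 2                                     # keep the two largest singular values
rank1_terms = [s[k] * np.outer(U[:, k], Vt[k, :]) for k in range(K)]
W_approx = sum(rank1_terms)               # stack of K separable filters

print("rank-%d approximation error: %.3f"
      % (K, np.linalg.norm(W - W_approx) / np.linalg.norm(W)))
```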
15
Flattening: Speedup Results
3 convolutional layers (5x5 filters) with 96, 128, and channels. Used stacks of 2 rank-1 convolutions. The accuracy plot is on the test data for the CIFAR-10 dataset. However, when using a stack of 2 of these, it is vulnerable to the vanishing gradient problem. The speedup on the GPU is not as significant but still noticeable.
Jin, Jonghoon, Aysegul Dundar, and Eugenio Culurciello. "Flattened convolutional neural networks for feedforward acceleration." arXiv preprint (2014).
16
Outline
Matrix Factorization
Weight Pruning
  Magnitude-based method
    Iterative pruning + Retraining
  Pruning with rehabilitation
  Hessian-based method
Quantization method
Pruning + Quantization + Encoding
Design small architecture: SqueezeNet
17
Magnitude-based method: Iterative Pruning + Retraining
Prune connections with small magnitude. Iterate pruning and re-training.

Operation              Energy [pJ]   Relative Cost
32-bit int ADD         0.1           1
32-bit float ADD       0.9           9
32-bit Register File   1             10
32-bit int MULT        3.1           31
32-bit float MULT      3.7           37
32-bit SRAM Cache      5             50
32-bit DRAM Memory     640           6400

Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS.
18
Magnitude-based method: Iterative Pruning + Retraining
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS Network Compression and Speedup
19
Magnitude-based method: Iterative Pruning + Retraining (Algorithm)
1. Choose a neural network architecture.
2. Train the network until a reasonable solution is obtained.
3. Prune the weights whose magnitudes are less than a threshold τ.
4. Retrain the network until a reasonable solution is obtained.
5. Iterate from step 3.
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS.
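A minimal PyTorch sketch of this loop (the model, data loader, loss function, and threshold are assumptions, and this is not Han et al.'s code). The binary mask is re-applied after each optimizer step so pruned weights stay at zero during retraining:

```python
import torch

def prune_and_retrain(model, loss_fn, data_loader, threshold=1e-2,
                      rounds=3, epochs_per_round=2, lr=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    masks = {}
    for _ in range(rounds):
        # Step 3: prune weights whose magnitude is below the threshold
        # (a real setup would typically exclude biases).
        for name, p in model.named_parameters():
            masks[name] = (p.detach().abs() >= threshold).float()
            p.data.mul_(masks[name])
        # Step 4: retrain the surviving connections.
        for _ in range(epochs_per_round):
            for x, y in data_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
                for name, p in model.named_parameters():
                    p.data.mul_(masks[name])      # keep pruned weights at zero
    return model
```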
20
Network Compression and Speedup
Magnitude-based method: Iterative Pruning + Retraining (Experiment: Overall)

Network          Top-1 Error   Top-5 Error   Parameters   Compression Rate
LeNet Ref        1.64%         -             267K         12X
LeNet Pruned     1.59%                       22K
LeNet-5 Ref      0.80%                       431K
LeNet-5 Pruned   0.77%                       36K
AlexNet Ref      42.78%        19.73%        61M          9X
AlexNet Pruned   42.77%        19.67%        6.7M
VGG-16 Ref       31.50%        11.32%        138M         13X
VGG-16 Pruned    31.34%        10.88%        10.3M

Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS.
21
Network Compression and Speedup
Magnitude-based method: Iterative Pruning + Retraining (Experiment: LeNet)

LeNet:
Layer   Weights   FLOP   Act%   Weights%   FLOP%
fc1     235K      470K   38%    8%
fc2     30K       60K    65%    9%         4%
fc3     1K        2K     100%   26%        17%
Total   266K      532K   46%

LeNet-5:
Layer   Weights   FLOP    Act%   Weights%   FLOP%
conv1   0.5K      576K    82%    66%
conv2   25K       3200K   72%    12%        10%
fc1     400K      800K    55%    8%         6%
fc2     5K        10K     100%   19%
Total   431K      4586K   77%               16%

Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS.
22
Network Compression and Speedup
Magnitude-based method: Iterative Pruning + Retraining (Experiment: AlexNet)

Layer   Weights   FLOP   Act%   Weights%   FLOP%
conv1   35K       211M   88%    84%
conv2   307K      448M   52%    38%        33%
conv3   885K      299M   37%    35%        18%
conv4   663K      224M   40%               14%
conv5   442K      150M   34%
fc1     38M       75M    36%    9%         3%
fc2     17M       34M
fc3     4M        8M     100%   25%        10%
Total   61M       1.5B   54%    11%        30%

Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS.
23
Network Compression and Speedup
Magnitude-based method: Iterative Pruning + Retraining (Experiment: Tradeoff) Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS Network Compression and Speedup
24
Pruning with rehabilitation: Dynamic Network Surgery (Motivation)
Pruned connections have no chance to come back, and incorrect pruning may cause severe accuracy loss.
Goals: avoid the risk of irretrievable network damage, and improve the learning efficiency.
Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS.
25
Pruning with rehabilitation: Dynamic Network Surgery (Formulation)
W_k denotes the weights, and T_k denotes the corresponding 0/1 masks.
min_{W_k, T_k} L(W_k ⊙ T_k)   s.t.   T_k^(i,j) = h_k(W_k^(i,j)), ∀(i,j)
⊙ is the element-wise product. L(·) is the loss function.
Dynamic network surgery updates only W_k. T_k is updated based on h_k(·):
h_k(W_k^(i,j)) = 0 if |W_k^(i,j)| < a_k;  T_k^(i,j) if a_k ≤ |W_k^(i,j)| ≤ b_k;  1 if |W_k^(i,j)| > b_k
a_k is the pruning threshold. b_k = a_k + t, where t is a pre-defined small margin.
Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS.
26
Pruning with rehabilitation: Dynamic Network Surgery (Algorithm)
1. Choose a neural network architecture.
2. Train the network until a reasonable solution is obtained.
3. Update T_k based on h_k(·).
4. Update W_k based on back-propagation.
5. Iterate from step 3.
Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS.
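A minimal NumPy sketch of the surgery step (the thresholds a_k and margin t are made up for illustration): the mask is recomputed from the full-precision weights at every iteration, so previously pruned connections can be spliced back in:

```python
import numpy as np

def update_mask(W, T, a_k, t_margin):
    b_k = a_k + t_margin
    mag = np.abs(W)
    T_new = T.copy()
    T_new[mag < a_k] = 0.0      # prune entries that have become clearly small
    T_new[mag > b_k] = 1.0      # splice back entries that have grown large again
    return T_new                # entries with a_k <= |W| <= b_k keep their previous mask

W = np.random.randn(4, 4)
T = np.ones_like(W)
T = update_mask(W, T, a_k=0.5, t_margin=0.2)

# Forward/backward use W * T, but the gradient step updates W itself,
# including entries whose mask is currently 0 (so they can recover).
grad = np.random.randn(4, 4)    # stand-in for dL/dW evaluated at W * T
W -= 0.01 * grad
```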
27
Network Compression and Speedup
Pruning with rehabilitation: Dynamic Network Surgery (Experiment on LeNet)

Model     Layer   Parameters   Parameters (DNS)
LeNet-5   conv1   0.5K         14.2%
LeNet-5   conv2   25K          3.1%
LeNet-5   fc1     400K         0.7%
LeNet-5   fc2     5K           4.3%
LeNet-5   Total   431K         0.9%
LeNet     fc1     236K         1.8%
LeNet     fc2     30K
LeNet     fc3     1K           5.5%
LeNet     Total   267K
28
Network Compression and Speedup
Pruning with rehabilitation: Dynamic Network Surgery (Experiment on AlexNet)

Layer   Parameters   Parameters (Han et al. 2015)   Parameters (DNS)
conv1   35K          84%                            53.8%
conv2   307K         38%                            40.6%
conv3   885K         35%                            29.0%
conv4   664K         37%                            32.3%
conv5   443K                                        32.5%
fc1     38M          9%                             3.7%
fc2     17M                                         6.6%
fc3     4M           25%                            4.6%
Total   61M          11%                            5.7%

Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS.
29
Outline
Matrix Factorization
Weight Pruning
  Magnitude-based method
  Hessian-based method
    Diagonal Hessian-based method
    Full Hessian-based method
Quantization method
Pruning + Quantization + Encoding
Design small architecture: SqueezeNet
30
Diagonal Hessian-based method: Optimal Brain Damage
The idea of model compression & speedup traces back to 1990.
Actually theoretically more "optimal" than the current state of the art, but much more computationally inefficient.
Delete parameters with small "saliency".
Saliency: the effect of a parameter on the training error.
Proposes a theoretically justified saliency measure.
31
Diagonal Hessian-based method: Optimal Brain Damage (Formulation)
Approximate the objective function E with a Taylor series:
δE = Σ_i (∂E/∂u_i) δu_i + ½ Σ_i (∂²E/∂u_i²) δu_i² + ½ Σ_{i≠j} (∂²E/∂u_i∂u_j) δu_i δu_j + O(||δu||³)
Deletion is done after training has converged: at a local minimum the gradients equal 0. Neglecting the cross terms:
δE = ½ Σ_i (∂²E/∂u_i²) δu_i²
LeCun, Yann, et al. "Optimal brain damage." NIPS.
32
Diagonal Hessian-based method: Optimal Brain Damage (Algorithm)
1. Choose a neural network architecture.
2. Train the network until a reasonable solution is obtained.
3. Compute the second derivatives h_kk = ∂²E/∂u_k² for each parameter.
4. Compute the saliency of each parameter: s_k = h_kk u_k² / 2.
5. Sort the parameters by saliency and delete some low-saliency parameters.
6. Iterate from step 2.
LeCun, Yann, et al. "Optimal brain damage." NIPS.
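A minimal NumPy sketch of the saliency step (the diagonal Hessian values are random stand-ins here; OBD computes them with a backprop-like pass):

```python
import numpy as np

def obd_prune(params, hessian_diag, frac=0.2):
    """Zero out the fraction `frac` of parameters with the smallest OBD saliency."""
    saliency = 0.5 * hessian_diag * params ** 2      # s_k = h_kk * u_k^2 / 2
    n_drop = int(frac * params.size)
    drop = np.argsort(saliency)[:n_drop]             # lowest-saliency parameters
    pruned = params.copy()
    pruned[drop] = 0.0
    return pruned, drop

u = np.random.randn(1000)                # flattened network parameters
h = np.abs(np.random.randn(1000))        # stand-in for the diagonal second derivatives
pruned_u, dropped = obd_prune(u, h, frac=0.3)
print("removed", dropped.size, "parameters")
```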
33
Network Compression and Speedup
Diagonal Hessian-based method: Optimal Brain Damage (Experiment: OBD vs. Magnitude)
Deletion based on saliency performs better than deletion based on magnitude.
LeCun, Yann, et al. "Optimal brain damage." NIPS.
34
Network Compression and Speedup
Diagonal Hessian-based method: Optimal Brain Damage (Experiment: Retraining)
How does retraining help? (Plots compare pruning with and without retraining.)
LeCun, Yann, et al. "Optimal brain damage." NIPS.
35
Full Hessian-based method: Optimal Brain Surgeon
Motivation: a more accurate estimation of saliency, and optimal weight updates.
Advantages: more accurate estimation of saliency; directly provides the weight updates that minimize the change in the objective function.
Disadvantages: more computation than OBD; the weight updates minimize the approximated change in the objective, not the objective function itself.
Hassibi, Babak, and David G. Stork. "Second order derivatives for network pruning: Optimal brain surgeon." NIPS, 1993.
36
Full Hessian-based method: Optimal Brain Surgeon (Formulation)
Approximate the objective function E with a Taylor series:
δE = (∂E/∂w)^⊤ δw + ½ δw^⊤ H δw + O(||δw||³), with constraint e_q^⊤ δw + w_q = 0
We assume the network is trained to a local minimum and ignore the higher-order terms.
Solve it through the Lagrangian form:
δw = −(w_q / [H⁻¹]_qq) H⁻¹ e_q   and   L_q = w_q² / (2 [H⁻¹]_qq)
L_q is the saliency of weight w_q.
Hassibi, Babak, and David G. Stork. "Second order derivatives for network pruning: Optimal brain surgeon." NIPS, 1993.
37
Full Hessian-based method: Optimal Brain Surgeon (Algorithm)
1. Choose a neural network architecture.
2. Train the network until a reasonable solution is obtained.
3. Find the q that gives the smallest saliency L_q, and decide whether to delete w_q or to stop pruning.
4. Update all weights using the calculated δw.
5. Iterate from step 3.
Hassibi, Babak, and David G. Stork. "Second order derivatives for network pruning: Optimal brain surgeon." NIPS, 1993.
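A minimal NumPy sketch of one OBS step, assuming a small dense inverse Hessian is available (the paper maintains H⁻¹ iteratively, which is omitted here):

```python
import numpy as np

def obs_step(w, H_inv):
    """Pick the weight with the smallest saliency L_q = w_q^2 / (2 [H^-1]_qq),
    delete it, and return the compensating update for all remaining weights."""
    saliency = w ** 2 / (2.0 * np.diag(H_inv))
    q = int(np.argmin(saliency))
    delta_w = -(w[q] / H_inv[q, q]) * H_inv[:, q]   # second-order correction
    w_new = w + delta_w                              # drives w_new[q] to zero
    return q, saliency[q], w_new

n = 6
A = np.random.randn(n, n)
H = A @ A.T + np.eye(n)                              # positive-definite stand-in Hessian
q, L_q, w_new = obs_step(np.random.randn(n), np.linalg.inv(H))
print("deleted weight", q, "with saliency", L_q, "-> new value", w_new[q])
```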
38
Full Hessian-based method: Optimal Brain Surgeon
Hassibi, Babak, and David G. Stork. "Second order derivatives for network pruning: Optimal brain surgeon." NIPS, 1993.
39
Outline
Matrix Factorization
Weight Pruning
Quantization method
  Full Quantization
    Fixed-point format
    Code book
  Quantization with full-precision copy
Pruning + Quantization + Encoding
Design small architecture: SqueezeNet
40
Full Quantization : Fixed-point format
Limited Precision Arithmetic: [QI.QF], where QI and QF correspond to the integer and the fractional part of the number.
The number of integer bits (IL) plus the number of fractional bits (FL) yields the total number of bits used to represent the number: WL = IL + FL.
Can be represented as ⟨IL, FL⟩.
⟨IL, FL⟩ limits the precision to FL bits.
⟨IL, FL⟩ sets the range to [−2^{IL−1}, 2^{IL−1} − 2^{−FL}].
Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML.
41
Full Quantization : Fixed-point format (Rounding Modes)
Define ⌊x⌋ as the largest integer multiple of ε = 2^{−FL} less than or equal to x.
Round-to-nearest:
Round(x, ⟨IL, FL⟩) = ⌊x⌋ if ⌊x⌋ ≤ x ≤ ⌊x⌋ + ε/2;  ⌊x⌋ + ε if ⌊x⌋ + ε/2 < x ≤ ⌊x⌋ + ε
Stochastic rounding (unbiased):
Round(x, ⟨IL, FL⟩) = ⌊x⌋ with probability 1 − (x − ⌊x⌋)/ε;  ⌊x⌋ + ε with probability (x − ⌊x⌋)/ε
If x lies outside the range of ⟨IL, FL⟩, we saturate the result to either the lower or the upper limit of ⟨IL, FL⟩.
Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML.
42
Multiply and accumulate (MACC) operation
During training:
1. a and b are two vectors in fixed-point format ⟨IL, FL⟩.
2. Compute z = Σ_i a_i · b_i. The result is a fixed-point number with format ⟨2·IL, 2·FL⟩.
3. Convert and round z back to fixed-point format ⟨IL, FL⟩.
During testing: everything stays in fixed-point format ⟨IL, FL⟩.
Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML.
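A minimal NumPy sketch of ⟨IL, FL⟩ rounding (round-to-nearest and stochastic, with saturation) and of the MACC pattern of accumulating in a wider format before converting back; the word-length choices are arbitrary:

```python
import numpy as np

def to_fixed(x, IL, FL, stochastic=False):
    eps = 2.0 ** -FL
    lo, hi = -2.0 ** (IL - 1), 2.0 ** (IL - 1) - eps
    scaled = x / eps
    if stochastic:
        floor = np.floor(scaled)
        q = floor + (np.random.rand(*np.shape(x)) < (scaled - floor))  # unbiased rounding
    else:
        q = np.round(scaled)           # round-to-nearest (NumPy rounds ties to even)
    return np.clip(q * eps, lo, hi)    # saturate to the <IL, FL> range

IL, FL = 4, 12
a = to_fixed(np.random.randn(256), IL, FL)
b = to_fixed(np.random.randn(256), IL, FL)
z = np.dot(a, b)                             # exact in a wider <2*IL, 2*FL>-style accumulator
z_q = to_fixed(z, IL, FL, stochastic=True)   # convert and round back to <IL, FL>
print(z, "->", z_q)
```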
43
Network Compression and Speedup
Full Quantization: Fixed-point format (Experiment on MNIST with fully connected DNNs) Network Compression and Speedup
44
Full Quantization: Fixed-point format (Experiment on MNIST with CNNs)
Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML Network Compression and Speedup
45
Network Compression and Speedup
Full Quantization: Fixed-point format (Experiment on CIFAR10 with fully connected DNNs) Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML Network Compression and Speedup
46
Full Quantization: Code book
Quantization using k-means:
Perform k-means to find k centers c_z for the weights W. Ŵ_ij = c_z, where z = argmin_z |W_ij − c_z|.
Compression ratio: 32 / log₂(k) (the codebook itself is negligible).
Product Quantization:
Partition W ∈ ℝ^{m×n} column-wise into s submatrices W = [W¹, W², ⋯, Wˢ]. Perform k-means on the sub-vectors of Wⁱ to find k centers c_zⁱ. Ŵ_jⁱ = c_zⁱ, where z = argmin_z ||W_jⁱ − c_zⁱ||.
Compression ratio: 32mn / (32kn + log₂(k)·ms).
Residual Quantization:
Quantize the vectors into k centers, then recursively quantize the residuals for t iterations.
Compression ratio: 32mn / (32tkm + log₂(k)·tn).
Gong, Yunchao, et al. "Compressing deep convolutional networks using vector quantization." arXiv preprint (2014).
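A minimal NumPy sketch of the plain k-means codebook variant (Lloyd iterations written out by hand; the cluster count and matrix shape are arbitrary), with the resulting roughly 32/log₂(k) compression estimate:

```python
import numpy as np

def kmeans_quantize(W, k=16, iters=20):
    """Cluster all weights into k shared values; return (codebook, index matrix)."""
    w = W.ravel()
    centers = np.random.choice(w, size=k, replace=False)   # initialize from the weights
    for _ in range(iters):
        assign = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
        for z in range(k):
            if np.any(assign == z):
                centers[z] = w[assign == z].mean()
    return centers, assign.reshape(W.shape)

W = np.random.randn(256, 512).astype(np.float32)
centers, codes = kmeans_quantize(W, k=16)
W_hat = centers[codes]                                     # decoded (shared-value) weights

bits_full = 32 * W.size
bits_quant = W.size * np.log2(len(centers)) + 32 * len(centers)   # indices + codebook
print("compression ratio ~", bits_full / bits_quant)              # close to 32 / log2(k)
```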
47
Full Quantization: Code book (Experiment on PQ)
Network Compression and Speedup
48
Full Quantization: Code book
Gong, Yunchao, et al. "Compressing deep convolutional networks using vector quantization." arXiv preprint arXiv: (2014). Network Compression and Speedup
49
Outline
Matrix Factorization
Weight Pruning
Quantization method
  Full quantization
  Quantization with full-precision copy
    BinaryConnect
    BNN
Design small architecture: SqueezeNet
50
Quantization with full-precision copy: Binaryconnect (Motivation)
Use only two possible values (e.g. +1 or -1) for weights.
Replace many multiply-accumulate operations by simple accumulations.
Fixed-point adders are much less expensive in terms of both area and energy than fixed-point multiply-accumulators.
Courbariaux, et al. "Binaryconnect: Training deep neural networks with binary weights during propagations." NIPS 2015.
51
Quantization with full-precision copy: Binaryconnect (Binarization)
Deterministic binarization:
w_b = +1 if w ≥ 0;  −1 otherwise
Stochastic binarization:
w_b = +1 with probability p = σ(w);  −1 with probability 1 − p
σ(x) = clip((x + 1)/2, 0, 1) = max(0, min(1, (x + 1)/2))
Stochastic binarization is more theoretically appealing than the deterministic one, but harder to implement as it requires the hardware to generate random bits when quantizing.
Courbariaux, et al. "Binaryconnect: Training deep neural networks with binary weights during propagations." NIPS 2015.
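A minimal NumPy sketch of the two binarization rules using the hard sigmoid above (tensor shapes are arbitrary):

```python
import numpy as np

def hard_sigmoid(x):
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def binarize(w, stochastic=False):
    if stochastic:
        p = hard_sigmoid(w)                            # P(w_b = +1)
        return np.where(np.random.rand(*w.shape) < p, 1.0, -1.0)
    return np.where(w >= 0, 1.0, -1.0)                 # deterministic: sign(w)

w = np.random.uniform(-1, 1, size=(4, 4))
print(binarize(w))
print(binarize(w, stochastic=True))
```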
52
Quantization with full-precision copy: Binaryconnect
1. Given the DNN input, compute the unit activations layer by layer, leading to the top layer, which is the output of the DNN given its input. This step is referred to as the forward propagation.
2. Given the DNN target, compute the training objective's gradient w.r.t. each layer's activations, starting from the top layer and going down layer by layer until the first hidden layer. This step is referred to as the backward propagation or backward phase of back-propagation.
3. Compute the gradient w.r.t. each layer's parameters and then update the parameters using their computed gradients and their previous values. This step is referred to as the parameter update.
Courbariaux, et al. "Binaryconnect: Training deep neural networks with binary weights during propagations." NIPS 2015.
53
Quantization with full-precision copy: Binaryconnect
BinaryConnect only binarizes the weights during the forward and backward propagations (steps 1 and 2), but not during the parameter update (step 3).
Courbariaux, et al. "Binaryconnect: Training deep neural networks with binary weights during propagations." NIPS 2015.
54
Quantization with full-precision copy: Binaryconnect
1. Binarize the weights and perform the forward pass.
2. Back-propagate the gradients based on the binarized weights.
3. Update the full-precision weights.
4. Iterate from step 1.
Courbariaux, et al. "Binaryconnect: Training deep neural networks with binary weights during propagations." NIPS 2015.
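A minimal PyTorch sketch of one training step following these four steps (the model, loss function, and optimizer are assumptions; this is not the authors' code): binary weights are used for the forward and backward passes, while the stored full-precision weights receive the update and are clipped to [-1, 1]:

```python
import torch

def binaryconnect_step(model, loss_fn, x, y, optimizer):
    real_weights = [p.detach().clone() for p in model.parameters()]
    for p in model.parameters():                       # 1. binarize weights
        p.data = torch.where(p.data >= 0,
                             torch.ones_like(p.data), -torch.ones_like(p.data))
    loss = loss_fn(model(x), y)                        # forward with binary weights
    optimizer.zero_grad()
    loss.backward()                                    # 2. gradients w.r.t. binary weights
    for p, w in zip(model.parameters(), real_weights):
        p.data = w                                     # restore the full-precision copy
    optimizer.step()                                   # 3. update full-precision weights
    for p in model.parameters():
        p.data.clamp_(-1.0, 1.0)                       # keep real-valued weights in [-1, 1]
    return loss.item()
```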
55
Quantization with full-precision copy: Binaryconnect
Courbariaux, et al. "Binaryconnect: Training deep neural networks with binary weights during propagations." NIPS. 2015 Network Compression and Speedup
56
Quantization with full-precision copy: Binaryconnect
Courbariaux, et al. "Binaryconnect: Training deep neural networks with binary weights during propagations." NIPS. 2015 Network Compression and Speedup
57
Network Compression and Speedup
Quantization with full-precision copy: Binarized Neural Networks (Motivation)
Neural networks with binary weights and activations both at run-time and when computing the parameters' gradient at train time.
Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint (2016).
58
Quantization with full-precision copy: Binarized Neural Networks
Propagating gradients through discretization ("straight-through estimator"):
q = sign(r)
Estimator g_q of the gradient ∂C/∂q.
Straight-through estimator of ∂C/∂r: g_r = g_q · 1_{|r| ≤ 1}
Can be viewed as propagating the gradient through hard tanh.
Replace multiplications with bit-shifts.
Replace batch normalization with shift-based batch normalization.
Replace ADAM with shift-based AdaMax.
Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint (2016).
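A minimal PyTorch sketch of the straight-through estimator as a custom autograd function (a generic stand-in, not the authors' implementation): the forward pass is sign(r), and the backward pass lets g_q through wherever |r| ≤ 1 and zeroes it elsewhere:

```python
import torch

class SignSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, r):
        ctx.save_for_backward(r)
        return torch.where(r >= 0, torch.ones_like(r), -torch.ones_like(r))

    @staticmethod
    def backward(ctx, grad_output):
        (r,) = ctx.saved_tensors
        return grad_output * (r.abs() <= 1).to(grad_output.dtype)   # g_r = g_q * 1_{|r|<=1}

r = torch.randn(5, requires_grad=True)
q = SignSTE.apply(r)
q.sum().backward()
print(r.grad)      # 1 where |r| <= 1, else 0 (the hard-tanh gradient)
```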
59
Quantization with full-precision copy: Binarized Neural Networks
Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1." arXiv preprint arXiv: (2016). Network Compression and Speedup
60
Quantization with full-precision copy: Binarized Neural Networks
Operation               MUL      ADD
8-bit Integer           0.2 pJ   0.03 pJ
32-bit Integer          3.1 pJ   0.1 pJ
16-bit Floating Point   1.1 pJ   0.4 pJ
32-bit Floating Point   3.7 pJ   0.9 pJ

Memory size   64-bit memory access
8K            10 pJ
32K           20 pJ
1M            100 pJ
DRAM          ~nJ

Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint (2016).
61
Quantization with full-precision copy: Binarized Neural Networks
Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1." arXiv preprint arXiv: (2016). Network Compression and Speedup
62
Outline
Matrix Factorization
Weight Pruning
Quantization method
Pruning + Quantization + Encoding
  Deep Compression
Design small architecture: SqueezeNet
63
Pruning + Quantization + Encoding: Deep Compression
Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding." arXiv preprint (2015).
64
Pruning + Quantization + Encoding: Deep Compression
1. Choose a neural network architecture.
2. Train the network until a reasonable solution is obtained.
3. Prune the network with the magnitude-based method until a reasonable solution is obtained.
4. Quantize the network with the k-means based method until a reasonable solution is obtained.
5. Further compress the network with Huffman coding.
Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding." arXiv preprint (2015).
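A minimal sketch of the three stages on a single random weight matrix (the threshold, codebook size, and uniform-binning quantizer are stand-ins; the paper uses k-means). Huffman code lengths are computed here only to estimate the encoded size:

```python
import heapq
from collections import Counter
import numpy as np

def huffman_code_lengths(counts):
    """Return {symbol: code length in bits} for the given symbol counts."""
    heap = [(c, i, [s]) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in counts}
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, i, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, i, s1 + s2))
    return lengths

W = np.random.randn(256, 256).astype(np.float32)

# 1. Prune: drop small-magnitude weights.
nz = W[np.abs(W) > 0.5]

# 2. Quantize: bin the surviving weights into 32 shared values (k-means in the paper).
edges = np.linspace(nz.min(), nz.max(), 33)
codes = np.clip(np.digitize(nz, edges) - 1, 0, 31)

# 3. Huffman-encode the quantization indices.
lengths = huffman_code_lengths(Counter(codes.tolist()))
encoded_bits = sum(lengths[c] for c in codes.tolist())
print("dense:", 32 * W.size, "bits -> pruned+quantized+encoded: ~", encoded_bits, "bits")
```

This estimate ignores the codebook and the sparse-index storage that the real format also needs; it is only meant to show how the three stages compose.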
65
Pruning + Quantization + Encoding: Deep Compression
Network Compression and Speedup
66
Pruning + Quantization + Encoding: Deep Compression
Network Compression and Speedup
67
Outline
Matrix Factorization
Weight Pruning
Quantization method
Pruning + Quantization + Encoding
Design small architecture: SqueezeNet
68
Design small architecture: SqueezeNet
Compression schemes on a pre-trained model vs. designing a small CNN architecture from scratch (can it also preserve accuracy?).
So far, we have been doing compression on pre-trained models.
69
SqueezeNet Design Strategies
Strategy 1. Replace 3x3 filters with 1x1 filters. Parameters per filter: (3x3 filter) = 9 * (1x1 filter).
Strategy 2. Decrease the number of input channels to the 3x3 filters. Total # of parameters: (# of input channels) * (# of filters) * (# of parameters per filter).
Strategy 3. Downsample late in the network so that convolution layers have large activation maps. The size of the activation maps is determined by the size of the input data and the choice of layers in which to downsample in the CNN architecture.
S1: 3x3 filters have 9x more parameters than 1x1 filters. Given a budget of a certain # of filters, we should choose 1x1 over 3x3 (expand layer).
S2: Apart from limiting the # of 3x3 filters, we can also decrease the number of input channels to the 3x3 filters (expand layer).
S3: Large activation maps (due to delayed downsampling) can lead to higher classification accuracy, with all else held equal. Most commonly, downsampling is engineered into CNN architectures by setting the stride > 1 in some of the convolution or pooling (max-pool) layers. If early layers in the network have large strides, then most layers will have small activation maps. Conversely, if most layers in the network have a stride of 1, and the strides greater than 1 are concentrated toward the end of the network, then many layers in the network will have large activation maps.
Next, we describe the Fire module, which is our building block for CNN architectures that enables us to successfully employ Strategies 1, 2, and 3.
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size."
70
Microarchitecture β Fire Module
A Fire module consists of:
A squeeze convolution layer with s1x1 1x1 filters.
An expand layer with a mixture of e1x1 1x1 filters and e3x3 3x3 filters.
Fire modules are the building blocks of SqueezeNet.
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size."
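A minimal PyTorch sketch of a Fire module (the class name and the fire2-like hyperparameters are assumptions based on the slide and paper, not the official code):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    def __init__(self, in_channels, s1x1, e1x1, e3x3):
        super().__init__()
        self.squeeze = nn.Conv2d(in_channels, s1x1, kernel_size=1)
        self.expand1x1 = nn.Conv2d(s1x1, e1x1, kernel_size=1)
        self.expand3x3 = nn.Conv2d(s1x1, e3x3, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))                 # few channels feed the 3x3 filters
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

fire = Fire(in_channels=96, s1x1=16, e1x1=64, e3x3=64)   # roughly the fire2 configuration
out = fire(torch.randn(1, 96, 55, 55))
print(out.shape)                                         # (1, 128, 55, 55)
```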
71
Microarchitecture β Fire Module
Strategy 2. Decrease the number of input channels to the 3x3 filters. Total # of parameters: (# of input channels) * (# of filters) * (# of parameters per filter).
Squeeze layer: setting s1x1 < (e1x1 + e3x3) limits the # of input channels to the 3x3 filters. How much can we limit s1x1?
By reducing the number of filters in the "squeeze" layer feeding into the "expand" layer, we reduce the number of connections entering the 3x3 filters, and thus the total number of parameters. Strategies 1 and 2 are about judiciously decreasing the quantity of parameters in a CNN while attempting to preserve accuracy.
Strategy 1. Replace 3x3 filters with 1x1 filters. Parameters per filter: (3x3 filter) = 9 * (1x1 filter). How much can we replace 3x3 with 1x1 (e1x1 vs. e3x3)?
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size."
72
Parameters in Fire Module
The # of expand filters: e_i = e_i,1x1 + e_i,3x3
The % of 3x3 filters in the expand layer (pct3x3): e_i,3x3 = pct3x3 * e_i
The squeeze ratio (SR): s_i,1x1 = SR * e_i
The smaller the SR, the smaller the model size. From the figure, we learn that increasing SR can further increase ImageNet top-5 accuracy from 80.3% (i.e. AlexNet-level) with a 4.8MB model to 86.0% with a 19MB model. Accuracy plateaus at 86.0% with SR=0.75 (a 19MB model), and setting SR=1.0 further increases model size without improving accuracy.
A higher percentage of 3x3 filters means more accuracy and a bigger model size. We see in Figure 3(b) that the top-5 accuracy plateaus at 85.6% using 50% 3x3 filters, and further increasing the percentage of 3x3 filters leads to a larger model size but provides no improvement in accuracy on ImageNet.
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size."
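A small sketch of the resulting parameter count of one Fire module as a function of SR and pct3x3 (biases ignored; the default values 0.125 and 0.5 are used only as illustrative base settings), showing how the two knobs trade model size for accuracy:

```python
def fire_params(in_channels, e_i, SR=0.125, pct3x3=0.5):
    """Weight count of one Fire module given the slide's hyperparameters."""
    s1x1 = int(SR * e_i)                    # squeeze filters
    e3x3 = int(pct3x3 * e_i)                # expand 3x3 filters
    e1x1 = e_i - e3x3                       # expand 1x1 filters
    squeeze = in_channels * s1x1 * 1 * 1
    expand = s1x1 * e1x1 * 1 * 1 + s1x1 * e3x3 * 3 * 3
    return squeeze + expand

print(fire_params(96, 128))                 # base settings
print(fire_params(96, 128, SR=0.75))        # larger squeeze ratio -> more parameters
print(fire_params(96, 128, pct3x3=0.99))    # more 3x3 filters -> more parameters
```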
73
Network Compression and Speedup
Macroarchitecture
Strategy 3. Downsample late in the network so that convolution layers have large activation maps. The size of the activation maps is determined by the size of the input data and the choice of layers in which to downsample in the CNN architecture.
The relatively late placement of pooling keeps the activation maps large in the later phase of the network, which helps preserve accuracy. Strategy 3 is about maximizing accuracy on a limited budget of parameters.
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size."
74
Network Compression and Speedup
Macroarchitecture Iandola, Forrest N., et al.Β "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and< 0.5 MB model size."Β Network Compression and Speedup
75
Network Compression and Speedup
Evaluation of Results Now, with SqueezeNet, we achieve a 50X reduction in model size compared to AlexNet, while meeting or exceeding the top-1 and top-5 accuracy of AlexNet. We summarize all of the aforementioned results in Table 2. Iandola, Forrest N., et al.Β "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and< 0.5 MB model size."Β Network Compression and Speedup
76
Further Compression on 4.8M?
Deep Compression + Quantization Iandola, Forrest N., et al.Β "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and< 0.5 MB model size."Β Network Compression and Speedup
77
Takeaway Points
Compress pre-trained networks:
  On a single layer:
    Fully connected layer: SVD
    Convolutional layer: flattened convolutions
  Weight pruning: the magnitude-based pruning method is simple and effective, and is the first choice for weight pruning. Retraining is important for model compression.
  Weight quantization with a full-precision copy can prevent the gradients from vanishing.
  Weight pruning, quantization, and encoding are independent; we can use all three methods together for a better compression ratio.
Design a smaller CNN architecture:
  Example: SqueezeNet (use of the Fire module, delaying pooling to a later stage).
78
Network Compression and Speedup
Reading List
Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for efficient evaluation." Advances in Neural Information Processing Systems.
Jin, Jonghoon, Aysegul Dundar, and Eugenio Culurciello. "Flattened convolutional neural networks for feedforward acceleration." arXiv preprint (2014).
Gong, Yunchao, et al. "Compressing deep convolutional networks using vector quantization." arXiv preprint (2014).
Han, Song, et al. "Learning both weights and connections for efficient neural network." Advances in Neural Information Processing Systems.
Guo, Yiwen, Anbang Yao, and Yurong Chen. "Dynamic Network Surgery for Efficient DNNs." Advances in Neural Information Processing Systems.
Gupta, Suyog, et al. "Deep Learning with Limited Numerical Precision." ICML.
Courbariaux, Matthieu, Yoshua Bengio, and Jean-Pierre David. "Binaryconnect: Training deep neural networks with binary weights during propagations." Advances in Neural Information Processing Systems.
Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint (2016).
Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding." arXiv preprint (2015).
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint (2016).