Mimicking deep neural networks with shallow and narrow ones

Presentation transcript:

Mimicking deep neural networks with shallow and narrow ones ISHAY BE’ERY ELAD KNOLL

OUTLINE Motivation. Model compression - mimicking large networks: FITNETS: HINTS FOR THIN DEEP NETS (A. Romero et al., 2014); DO DEEP NETS REALLY NEED TO BE DEEP? (Lei Jimmy Ba & Rich Caruana, 2014); DO DEEP CONVOLUTIONAL NETS REALLY NEED TO BE DEEP AND CONVOLUTIONAL? (Rich Caruana et al., 2016)

Motivation - why compress? Deep and wide networks are: Memory demanding - not suitable for deployment to a large number of users. Computationally expensive - slower inference time, which is crucial for real-time tasks.

Motivation - why deep and wide? Going deep: the ability to detect more complex features; better generalization of the network. Going wide: improved performance; parallelization friendly (GPU).

Motivation - how can we compress? Split the network into two - one for training, one for deployment. Training: demand higher performance; time and resources are less crucial. Deployment: faster inference time; fewer computation resources required. (Speaker note: In the training stage, training time and the required resources matter less. In contrast, we want the recognition or classification stage to be as fast as possible and to consume as few resources as possible.)

Motivation - how can we compress? An analogy to the insect world: Larva form - optimized for extracting energy and nutrients from the surrounding environment. Adult form - optimized for traveling, reproduction, etc. (Speaker note: Borrowing from the caterpillar and the butterfly - a neural network has two different stages, training and deployment (execution), and they can have different structures and different properties.)

Mimicking a DNN with a narrow one FITNETS: HINTS FOR THIN DEEP NETS (Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta & Yoshua Bengio, 2014) Teacher-student networks (Model Compression, Bucila et al., 2006). Knowledge distillation (Distilling the Knowledge in a Neural Network, Hinton et al., 2014). Hint-based training. (Speaker note: This paper uses a thinner, deeper network to compress the model. Its building blocks are taken from an earlier paper by Hinton and Dean. We will discuss the teacher-student paradigm and what knowledge distillation actually is.)

Teacher - Student Teacher - a large network or an ensemble of networks used for training (state of the art). Student - a small, compact network used to mimic and approximate the teacher (the slower but better-performing model). The student's outputs should match the teacher's outputs on a transfer set of unlabeled data. Why should this work? Can we describe the same model with fewer parameters? Only if the training is different.

Knowledge Distillation - soften outputs Soften the teacher's softmax outputs by raising the temperature (i.e., distillation):
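The equation itself did not survive the transcript; as a sketch, this is the standard softened softmax from Hinton et al., where a_i are the pre-softmax activations (logits) and \tau > 1 is the temperature:

P^{\tau}_i = \frac{\exp(a_i / \tau)}{\sum_j \exp(a_j / \tau)}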

Knowledge Distillation - soften outputs Softened outputs provide more information about the relations between the non-target classes.

Knowledge Distillation - student training The following cross-entropy-based loss function is minimized in order to obtain the weights of the student (sketched below): P_T, P_S - the teacher's and student's softened outputs on the transfer set. y_true - the true labels of the original training set. lambda - a tunable parameter that balances the two cross-entropy terms. (Speaker note: The transfer set can be made up of the large network's training data (with labels), in which case the objective optimizes against the labels together with the large network's softened output; or it can be new, unlabeled data, in which case we train only against the large network's softened output.)
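As a sketch of the loss, roughly following the FitNets formulation (H denotes cross-entropy, \tau the distillation temperature, and W_S the student's weights):

\mathcal{L}_{KD}(W_S) = \mathcal{H}(y_{true}, P_S) + \lambda \, \mathcal{H}(P^{\tau}_T, P^{\tau}_S)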

Knowledge Distillation - summary The student network mimics the teacher's architecture and performance. It generalizes as well as the teacher while using less data. So why not try to deepen the student? KD suffers from the difficulty of optimizing very deep nets. Let's try to improve the student's training process.

Hint based training Hint - the output of one of the teacher's hidden layers, used to guide a chosen hidden layer of the student, called the guided layer. The student's parameters up to the guided layer are trained so that its deep nested function matches the teacher's (see the loss sketched below): (Speaker note: The layer chosen for the student is the middle layer, in order not to over-regularize the network.)
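As a sketch of the FitNets hint loss (u_h is the teacher's nested function up to the hint layer with weights W_Hint, v_g the student's nested function up to the guided layer with weights W_Guided, and r a small trainable regressor with weights W_r that maps the guided layer's output to the hint's shape):

\mathcal{L}_{HT}(W_{Guided}, W_r) = \tfrac{1}{2} \left\| u_h(x; W_{Hint}) - r\big(v_g(x; W_{Guided}); W_r\big) \right\|^2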

Hint based training Middle layers are chosen for both teacher and student, to avoid over-regularization. The student's outputs at the guided layer are regressed to match the teacher's outputs at the hint layer.

Hint based training What is it good for? It accelerates the student's training convergence. It acts as a form of regularization (avoids overfitting).

FitNets - training pipeline Start from a trained teacher net and a randomly initialized student net. Hint-based training up to the guided layer. Knowledge-distillation training of the entire student. (A code sketch of the two training stages follows below.)
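A minimal PyTorch-style sketch of the two loss terms used in the pipeline, assuming the teacher and student expose their hint/guided intermediate features; this is an illustration rather than the authors' code, and the soft cross-entropy term is replaced by a KL divergence, which differs from it only by a constant:

import torch.nn.functional as F

def hint_loss(student_guided, teacher_hint, regressor):
    # Stage 1: regress the student's guided-layer output onto the teacher's
    # hint-layer output (squared-error matching); `regressor` is a small
    # trainable module mapping the guided layer's shape to the hint's shape.
    return 0.5 * F.mse_loss(regressor(student_guided), teacher_hint)

def kd_loss(student_logits, teacher_logits, labels, tau=4.0, lam=0.5):
    # Stage 2: cross-entropy on the true labels plus a softened-output term.
    # (Hinton et al. also scale the soft term by tau**2; omitted for brevity.)
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / tau, dim=1),
                    F.softmax(teacher_logits / tau, dim=1),
                    reduction="batchmean")
    return hard + lam * soft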

FitNets - Results - Accuracy, compression

FitNets - Results - Accuracy, compression, speed up

FitNets - summary Very deep student nets can be trained with fewer parameters. Hint-based training combined with KD can even outperform the teacher network. The result is a faster and lighter network.

DO DEEP NETS REALLY NEED TO BE DEEP? Lei Jimmy Ba & Rich Caruana 2014 ISHAY BE’ERY ELAD KNOLL

OUTLINE Why is deep better? Model compression - mimicking networks. Results. Conclusions.

Deep nets have achieved better results in nearly every application over the last few years.

So how does this magic work?

More parameters. The ability to learn more complicated functions with the same number of parameters. Deep = hierarchical learning = better learning. Convolution makes life easier. Learning algorithms and regularization work better in deeper nets. All of the above are correct. None of the above is correct.

Hard to argue with the facts: deep is better under a fixed parameter budget, and a shallow net is hard to train when it is big and has a large number of parameters. So how can we advance from here?

MODEL COMPRESSION – MIMIC NET Assumption – if a shallow net with the same number of parameters as the deep teacher net can mimic the deep net -> networks do not have to be deep.

MODEL COMPRESSION METHOD Train a state-of-the-art deep model. Pass unlabeled data through the deep network and collect the output scores (logits). Train the shallow network on this data with a special target function (sketched below):
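As a sketch of the mimic objective from Ba & Caruana, the student's logits are regressed onto the teacher's logits z over the T transfer-set examples (g is the shallow student with weights W and \beta):

\mathcal{L}(W, \beta) = \frac{1}{2T} \sum_{t} \left\| g(x^{(t)}; W, \beta) - z^{(t)} \right\|_2^2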


To match the number of parameters between the deep and shallow nets, the shallow net must be very wide -> the weight matrix W becomes very large. Multiplying the input vector by this weight matrix is very costly, and gradient descent converges very slowly. Solution - factor the matrix W into two smaller matrices U and V, which speeds up convergence and reduces the memory needed for training (see the sketch below).
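As a sketch of the factorization (the dimension convention here is an assumption: D input features, H hidden units, bottleneck size k):

W \approx U V, \quad U \in \mathbb{R}^{H \times k}, \; V \in \mathbb{R}^{k \times D}, \; k \ll \min(D, H),

which cuts the layer's parameter count from H \cdot D to k(H + D) and acts as a linear bottleneck between the input and the wide hidden layer.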

The shallow net cannot reach this accuracy when trained directly on the original data set! This suggests that how you learn a model and the size and complexity of the model's representation are two different things.

RESULTS The mimic method was evaluated on two data sets: TIMIT and CIFAR-10.

So how can a shallow net mimic a deep net? If some of the training examples are hard to learn and predict, the teacher provides softer labels to the student. The teacher gives the student much more information than hard 0/1 labels. The mimic model also keeps improving as the teacher improves -> there is no lack of capacity for learning. Mimicking can be seen as a form of regularization.

CONCLUSIONS Shallow nets can be trained to near state-of-the-art results. Deep nets are more successful due to the good match between network architecture and current learning procedures. Maybe future learning procedures will make directly training shallow networks possible.

DO DEEP CONVOLUTIONAL NETS REALLY NEED TO BE DEEP AND CONVOLUTIONAL? Rich Caruana et al. 2016 ISHAY BE’ERY ELAD KNOLL

Yes, they do! (Speaker note: this is the first sentence of the paper's abstract.)

CRITICISM OF THE FIRST PAPER The TIMIT data set is not a good example – there is no big improvement from adding convolutional layers. In the CIFAR-10 example, one convolutional layer was needed to get good results. The number of parameters in the mimic model is very large. The mimic model was not as accurate as the state-of-the-art results, nor as the teacher model.

NEW RESEARCH METHOD Target: be as accurate as possible (compared to the teacher) with the same number of parameters. Problem: with a negative result – how can we be sure that the outcome of the experiments is not the result of wrong steps in the process? Solution: be very careful at every step.

NEW RESEARCH METHOD Hyperparameter optimization (by Bayesian optimization). Use image augmentation to enlarge the data set and obtain more accurate results. Create the best possible “teacher” model.

RESULTS There is a big gap between the non-convolutional nets and the convolutional nets. There is a big gap between the mimic model and the normally trained model, and it gets smaller as the mimic model gets deeper. For nets with 1 and 2 convolutional layers, the best mimic net is worse than the worst net of the next deeper family – demonstrating the power of convolutional layers on image data sets.

CONCLUSIONS For image data sets it is clear that convolutional layers are a must. Model compression and knowledge distillation improve the ability of thinner-and-deeper or shallower networks to reach the accuracy of deeper or wider nets. For other data sets, such as acoustic or text, it is not clear that depth and convolution are needed when a mimic model is used.

CONCLUSIONS Deep nets probably contain intrinsic redundancy (as suggested by knowledge distillation, ResNet and FitNets), which, with smart enough methods, can be exploited to compress the networks and reduce memory consumption and computational resources.