1 Mimicking deep neural networks with shallow and narrow ones
ISHAY BE’ERY ELAD KNOLL

2 OUTLINE
Motivation
Model compression: mimicking large networks:
FITNETS: HINTS FOR THIN DEEP NETS (A. Romero, 2014)
DO DEEP NETS REALLY NEED TO BE DEEP? (Rich Caruana & Lei Jimmy Ba, 2014)
DO DEEP CONVOLUTIONAL NETS REALLY NEED TO BE DEEP AND CONVOLUTIONAL? (Rich Caruana et al., 2016)

3 Motivation - why compress?
Deep and wide networks are:
Memory demanding - not suitable for a large number of users.
Computationally expensive - slower inference time, which is critical for real-time tasks.

4 Motivation - why deep and wide?
Going deep:
Ability to detect more complex features.
Better generalization of the network.
Going wide:
Improved performance.
Parallelization friendly (GPU).

5 Motivation - how can we compress?
Split the network in two - one for training, one for deployment.
Training: higher performance is demanded; time and resources are less crucial.
Deployment: faster inference time; fewer computation resources required.
During training, training time and the required resources matter less. At recognition or classification time, however, we want the network to be as fast as possible and to consume as few resources as possible.

6 Motivation - how can we compress?
An analogy to the insect world:
Larva form - optimized for extracting energy and nutrients from the surrounding environment.
Adult form - optimized for traveling, reproduction, etc.
Borrowing from the caterpillar and the butterfly: a neural network has two distinct stages, training and deployment (execution), and they can have different structures and different properties.

7 Mimicking a DNN with a narrow one
FITNETS: HINTS FOR THIN DEEP NETS (Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta & Yoshua Bengio, 2014)
Teacher-student networks (Model Compression, Bucila et al., 2006).
Knowledge distillation (Distilling the Knowledge in a Neural Network, Hinton et al., 2014).
Hint-based training.
This paper uses a narrower, deeper network to compress the model. Its building blocks come from an earlier paper by Hinton and Dean. We will cover the teacher-student paradigm and what knowledge distillation actually is.

8 Teacher - Student
Teacher - a large network or an ensemble of networks used for training (state of the art; the slower, better-performing model).
Student - a small, compact network trained to mimic and approximate the teacher.
The student's outputs should match the teacher's outputs on a transfer set of unlabeled data.
Why should this work? Can we describe the same model with fewer parameters? Only if the training is different.

9 Knowledge Distillation - soften outputs
Soften the teacher's softmax outputs by raising the softmax temperature (i.e., distillation); a minimal sketch follows:
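A minimal NumPy sketch of the temperature-scaled softmax, p_i = exp(z_i / T) / sum_j exp(z_j / T); the temperature values used here are illustrative:

```python
import numpy as np

def softened_softmax(logits, T=1.0):
    # Temperature-scaled softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T).
    # T = 1 is the ordinary softmax; larger T flattens the distribution and
    # exposes the relative similarities between non-target classes.
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# A confident teacher output becomes much softer at a higher temperature.
print(softened_softmax([10.0, 5.0, 1.0], T=1.0))   # ~[0.99, 0.007, 0.0001]
print(softened_softmax([10.0, 5.0, 1.0], T=5.0))   # ~[0.65, 0.24, 0.11]
```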

10 Knowledge Distillation - soften outputs
The softened outputs provide more information about the relations between the non-target classes.

11 Knowledge Distillation - student training
The following cross-entropy-based loss function is minimized to obtain the student's weights (a sketch follows the definitions below):
P_T, P_S - the teacher's and student's softened outputs on the unlabeled data (transfer set).
y_true - the true labels of the original training set.
lambda - a tunable parameter balancing the two cross-entropy terms.
The transfer set can be the large network's training data (with labels), in which case the objective optimizes against the true labels together with the large network's softened outputs; or it can be new, unlabeled data, in which case only the large network's softened outputs are used.
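A minimal sketch of this loss; the temperature and lambda values are illustrative, not taken from the slide:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, y_true_onehot, T=4.0, lam=0.5):
    # Hard-label cross entropy H(y_true, P_S) plus lambda times the cross
    # entropy between the softened teacher and student outputs at
    # temperature T.  With unlabeled transfer data, only the softened
    # term is used.
    p_s      = softmax(student_logits)       # student, ordinary softmax
    p_s_soft = softmax(student_logits, T)    # student, softened
    p_t_soft = softmax(teacher_logits, T)    # teacher, softened
    hard = -np.sum(y_true_onehot * np.log(p_s + 1e-12))
    soft = -np.sum(p_t_soft * np.log(p_s_soft + 1e-12))
    return hard + lam * soft

# Example usage on a 3-class toy problem.
print(kd_loss([2.0, 0.5, -1.0], [3.0, 1.0, -2.0], [1, 0, 0]))
```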

12 Knowledge Distillation - summary
The student network mimics the teacher's architecture and performance.
It generalizes as well as the teacher using less data.
So why not try to deepen the student? KD suffers from the difficulty of optimizing very deep nets.
Let's try to improve the student's training process.

13 Hint based training
Hint - the output of one of the teacher's hidden layers, used to guide one of the student's hidden layers, called the guided layer.
The student's parameters up to the guided layer are trained so that its deep nested function matches the teacher's:
The layer chosen for the student is the middle layer, so as not to over-regularize the network.

14 Hint based training
Middle layers are chosen for both the teacher and the student, to avoid over-regularization.
The student's outputs at the guided layer are regressed (through a small regressor) to match the teacher's outputs at the hint layer (see the sketch below).
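A minimal sketch of the hint loss, assuming a plain linear regressor and illustrative layer widths (the paper uses a convolutional regressor for convolutional nets):

```python
import numpy as np

rng = np.random.default_rng(0)

def hint_loss(teacher_hint, student_guided, W_r):
    # L_HT = 1/2 * || u_hint(x) - r(v_guided(x)) ||^2, where r is a small
    # regressor mapping the (narrower) student representation onto the
    # (wider) teacher representation.
    regressed = student_guided @ W_r          # r(v_guided(x)); linear here
    diff = teacher_hint - regressed
    return 0.5 * np.sum(diff * diff)

# Toy example: teacher hint layer has 128 units, student guided layer has 32.
teacher_hint   = rng.standard_normal(128)
student_guided = rng.standard_normal(32)
W_r            = rng.standard_normal((32, 128)) * 0.1   # regressor weights

print(hint_loss(teacher_hint, student_guided, W_r))
```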

15 Hint based training What is it good for?
Accelerates the convergence of the student's training.
Acts as a form of regularization (helps avoid overfitting).

16 FitNets - training pipeline
Start from a trained teacher net and a randomly initialized student net.
Stage 1: hint-based training of the student up to the guided layer.
Stage 2: knowledge-distillation training of the entire student.
(A schematic sketch follows.)
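A schematic, runnable toy sketch of the two-stage pipeline; tiny random one-hidden-layer nets stand in for the real teacher and student, only the objectives are shown, and the backprop/optimizer steps are omitted:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy one-hidden-layer "networks": the teacher is wider than the student.
x = rng.standard_normal(20)
W1, W2 = rng.standard_normal((20, 64)), rng.standard_normal((64, 10))   # teacher
V1, V2 = rng.standard_normal((20, 16)), rng.standard_normal((16, 10))   # student
W_r    = rng.standard_normal((16, 64)) * 0.1                            # regressor

h = np.maximum(0.0, x @ W1)        # teacher hint layer
g = np.maximum(0.0, x @ V1)        # student guided layer
t_logits, s_logits = h @ W2, g @ V2

# Stage 1: hint-based training - minimize the hint loss w.r.t. V1 and W_r only.
hint_loss = 0.5 * np.sum((h - g @ W_r) ** 2)

# Stage 2: knowledge distillation - minimize the KD loss w.r.t. all of V1, V2.
T_dist = 4.0
kd_term = -np.sum(softmax(t_logits, T_dist) * np.log(softmax(s_logits, T_dist) + 1e-12))

print("stage-1 hint loss:", hint_loss)
print("stage-2 softened cross-entropy term:", kd_term)
```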

17 FitNets - Results - Accuracy, compression

18 FitNets - Results - Accuracy, compression, speed up

19 FitNets - summary
Very deep student nets are trained with fewer parameters.
Hint-based training combined with KD outperforms the teacher network.
The result is a faster and lighter network.

20 DO DEEP NETS REALLY NEED TO BE DEEP?
Rich Caruana & Lei Jimmy Ba 2014 ISHAY BE’ERY ELAD KNOLL

21

22 OUTLINE
Why is deep better?
Model compression - mimicking networks
Results
Conclusions

23 Deep nets have achieved better results in every application over the last few years.

24 So how does this magic work?

25 More parameters.
The ability to learn more complicated functions with the same number of parameters.
Deep = hierarchical learning = better learning.
Convolution makes life easier.
Learning algorithms and regularization work better in deeper nets.
All of the above are correct.
None of the above are correct.

26 Hard to argue with facts:
Deep is better under a parameter budget.
Shallow nets are hard to train when they are big and have a large number of parameters.
So how can we advance from here?

27 MODEL COMPRESSION – MIMIC NET
Assumption - if a shallow net with the same number of parameters as the deep teacher net can mimic the deep net, then networks do not have to be deep.

28 MODEL COMPRESSION METHOD
Train a state-of-the-art deep model.
Pass unlabeled data through the deep network and collect the output scores.
Train the shallow network on this data with a special target function (sketched below):
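In Ba & Caruana's method, the "special target function" is a regression of the student's logits (pre-softmax scores) onto the teacher's logits with a squared-error objective. A minimal sketch with illustrative batch shapes:

```python
import numpy as np

def logit_regression_loss(student_logits, teacher_logits):
    # L = 1/(2N) * sum_n || g(x_n) - z_n ||^2, where z_n are the teacher's
    # logits on the unlabeled transfer data and g(x_n) are the student's
    # logits for the same inputs.
    s = np.asarray(student_logits, dtype=np.float64)
    t = np.asarray(teacher_logits, dtype=np.float64)
    return 0.5 * np.mean(np.sum((s - t) ** 2, axis=-1))

# Example: a batch of 4 examples, 10 classes.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 10))
student = teacher + 0.1 * rng.standard_normal((4, 10))   # close to the teacher
print(logit_regression_loss(student, teacher))
```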

29

30 To match the number of parameters of the deep net, the shallow net must be very wide, so the weight matrix W becomes very large. Multiplying the input vector by this weight matrix is very costly, and gradient descent converges very slowly. Solution - factor W into two smaller matrices U and V, which speeds up convergence and reduces the memory needed for training (see the sketch below).
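A minimal sketch of the factorization, with illustrative dimensions: the single d x H weight matrix W is replaced by U (d x k) and V (k x H) with a small bottleneck k, which cuts both the parameter count and the cost of the matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)

d, H, k = 1000, 2000, 50      # input dim, hidden width, bottleneck (illustrative)

# Direct wide layer: x @ W stores d*H weights and costs d*H multiplications.
W = rng.standard_normal((d, H)) * 0.01

# Factored layer: x @ U @ V stores d*k + k*H weights and costs d*k + k*H.
U = rng.standard_normal((d, k)) * 0.01
V = rng.standard_normal((k, H)) * 0.01

x = rng.standard_normal(d)
h_direct   = x @ W            # original formulation
h_factored = (x @ U) @ V      # low-rank bottleneck approximation

print("parameters, direct:  ", d * H)          # 2,000,000
print("parameters, factored:", d * k + k * H)  #   150,000
```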

31 The shallow net can't be trained directly from the data set!
This suggests that how you learn the model and the size and complexity of the model's representation are two different things.

32 RESULTS The mimic method was evaluated on two data sets: TIMIT and CIFAR-10.

33

34

35 So how can a shallow net mimic a deep net?
If some training examples are hard to learn and predict, the teacher provides softer labels to the student.
The teacher gives the student much more information than hard 0/1 labels.
The mimic model also keeps improving as the teacher improves, so there is no lack of capacity for learning.
Mimicking can be seen as a form of regularization.

36 CONCLUSIONS Shallow nets can be trained to near state-of-the-art results.
Deep nets are more successful because of the good match between deep architectures and current learning procedures. Maybe future learning procedures will make directly training shallow networks possible.

37 DO DEEP CONVOLUTIONAL NETS REALLY NEED TO BE DEEP AND CONVOLUTIONAL? Rich Caruana et al. 2016
ISHAY BE’ERY ELAD KNOLL

38 Yes, they do! (The first sentence of the abstract.)

39 CRITICISM OF THE FIRST PAPER
The TIMIT data set is not a good example - there is no big improvement from adding convolutional layers.
In the CIFAR-10 experiment, one convolutional layer was needed to get good results.
The mimic model's number of parameters is very large.
The mimic model was not as accurate as the state-of-the-art results, nor as the teacher model.

40 NEW RESEARCH METHOD
Target: be as accurate as possible (compared to the teacher) with the same number of parameters.
Problem: with a negative result, how can we be sure the outcome of the experiments is not due to wrong steps in the process?
Solution: be very cautious at every step.

41 NEW RESEARCH METHOD
Hyperparameter optimization (by Bayesian optimization).
Image augmentation to enlarge the data set and obtain more accurate results (a minimal sketch follows this list).
Create the best possible "teacher" model.
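A minimal sketch of the kind of image augmentation referred to; random horizontal flips and padded random crops are common choices for CIFAR-10, but the specific transforms and sizes here are assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, pad=4):
    # img: H x W x C array. Randomly flip horizontally, then zero-pad and
    # take a random crop back to the original size.
    h, w, _ = img.shape
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                       # horizontal flip
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top  = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]    # random crop

# Example: one fake 32x32 RGB image -> several augmented variants.
image = rng.random((32, 32, 3))
batch = np.stack([augment(image) for _ in range(8)])
print(batch.shape)   # (8, 32, 32, 3)
```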

42 RESULTS

43 RESULTS

44 RESULTS

45 RESULTS
There is a big gap between non-convolutional and convolutional nets.
There is a big gap between the mimic model and the normally trained model, and it shrinks as the mimic model gets deeper.
For the 1- and 2-convolutional-layer nets, the best mimic net is worse than the worst net of the next deeper family - demonstrating the power of convolutional layers on image data sets.

46 CONCLUSIONS For image data sets it is clear that convolutional layers are a must. Model compression and knowledge distillation improve the ability of thinner-and-deeper or shallower networks to reach the accuracy of deeper or wider nets. For other data sets, such as acoustic or text data, it is not clear that depth and convolution are needed when a mimic model is used.

47 CONCLUSIONS Deep nets probably contain intrinsic redundancy (as suggested by knowledge distillation, ResNet, FitNets) that, with smart enough methods, can be exploited to compress the networks and reduce memory consumption and computational resources.

