Learning and Memorization

Learning and Memorization Alan Mishchenko University of California, Berkeley (original work by Sat Chatterjee, Google Research)

Outline Introduction and motivation Proposed architecture Proposed training procedure Experimental results Conclusions and future work

Outline Introduction and motivation Proposed architecture Proposed training procedure Experimental results Conclusions and future work

Memorization vs Generalization Memorization: the ability to remember the training data; poor classification on testing data. Generalization: the ability to learn from the training data; good classification on testing data.

Observation and Question Neural networks (NNs) memorize training data quite well, as shown by reaching zero training error on ImageNet images with permuted labels (C. Zhang et al., “Understanding deep learning requires rethinking generalization”, Proc. ICLR’17). Question: if NNs can memorize random training data, why do they generalize on real data?

One Answer and More Questions Generalization and memorization depend not only on the network architecture and the training procedure, but also on the dataset itself (D. Arpit et al., “A closer look at memorization in deep networks”, Proc. ICML’17). Question 1: is it possible that memorization and generalization do not contradict each other? Question 2: if so, how well can we learn when memorization is all we can do? Is generalization even possible in this setting?

Outline Introduction and motivation Proposed architecture Proposed training procedure Experimental results Conclusions and future work

The Proposed Architecture A naïve approach would be to use one memory block to remember input data and output classes; this does not scale well and results in overfitting, so a smarter method is needed. Instead, we build a network of memory blocks, called k-input lookup tables (LUTs), arranged in layers. Unlike a neural network, training is done through memorization, without such key features as back-propagation, gradient descent, or any explicit search.

A Network of LUTs Each LUT is connected to k random outputs of the previous layer, i.e., it maps a k-bit vector to 0 or 1; k is typically less than 16 (k=2 in this example). [Figure: input layer feeding a first and second layer of lookup tables, followed by the output.]
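To make the random wiring concrete, below is a minimal Python sketch (an illustration, not the original code): each LUT in a layer reads k randomly chosen outputs of the previous layer, and the function name and shapes are assumptions.

import numpy as np

def make_connections(rng, n_prev, n_luts, k):
    # For each LUT, pick k random indices into the previous layer's outputs
    return rng.integers(0, n_prev, size=(n_luts, k))

# Example: 784 binarized MNIST pixels feeding a layer of 1024 two-input LUTs
rng = np.random.default_rng(0)
conns = make_connections(rng, n_prev=784, n_luts=1024, k=2)
print(conns.shape)  # (1024, 2)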

Similarity with Neural Networks A convolutional filter is support-limited A fully connected layer is not support-limited but limited in expressive power Learned weight matrices are often sparse, or can be made so, with no loss in accuracy

Simplified Experimental Setup (1) Consider only binary classification problems. (2) Consider only discrete signals: a typical LUT has several binary inputs and one binary output, so inputs, intermediate signals, and outputs in the proposed LUT network are all binary. This restriction is not as extreme as it may appear: research on quantized and binary neural networks (BNNs) shows that limited precision is often sufficient (M. Rastegari et al., “ImageNet classification using binary convolutional neural networks”, Proc. ECCV’16).

Outline Introduction and motivation Proposed architecture Proposed training procedure Experimental results Conclusions and future work

Formal Description Consider the problem of learning a function f : B^k → B from a list of training pairs (x, y), where x ∈ B^k and y ∈ B. To learn by memorizing, first construct a table of 2^k rows (one for each pattern p ∈ B^k) and two columns, y0 and y1. The y0 entry for row p (denoted c_p^0) counts how many times p leads to output 0, i.e., the number of times (p, 0) occurs in the training set. Similarly, the y1 entry for row p (denoted c_p^1) counts how many times p leads to output 1 in the training set. Associate a function f : B^k → B with the table as follows: f(p) = 1 if c_p^1 > c_p^0; f(p) = 0 if c_p^1 < c_p^0; f(p) = b if c_p^1 = c_p^0, where b ∈ B is picked uniformly at random.
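As a sketch only (names and array layout are assumptions, not the authors' implementation), the counting rule above for a single k-input LUT can be written in Python as follows:

import numpy as np

def learn_lut(x_bits, y, k, rng):
    # x_bits: (n, k) array of 0/1 inputs; y: (n,) integer array of 0/1 labels
    patterns = x_bits @ (1 << np.arange(k))         # encode each k-bit row as a pattern index p
    counts = np.zeros((2 ** k, 2), dtype=np.int64)  # counts[p, b] = number of times (p, b) occurs
    np.add.at(counts, (patterns, y), 1)
    table = (counts[:, 1] > counts[:, 0]).astype(np.uint8)  # f(p) = 1 if c_p^1 > c_p^0
    ties = counts[:, 1] == counts[:, 0]
    table[ties] = rng.integers(0, 2, size=ties.sum())       # break ties uniformly at random
    return table                                            # truth table of length 2^k

def eval_lut(table, x_bits, k):
    # Apply a learned LUT to a batch of k-bit input patterns
    return table[x_bits @ (1 << np.arange(k))]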

Example 1: Single LUT [Figure: a single LUT computing f from inputs x0, x1, x2.]

Example 2: Two LUT Layers [Figure: two LUT layers, with LUTs f10 and f11 in the first layer feeding LUT f20 in the second.]

Analysis of Training Procedure The procedure is linear in the size of the training data, with two passes over the data for each layer: first, counting input patterns; second, comparing counters and assigning the LUT outputs. It is efficient, since it involves only counting and LUT evaluation: no floating point, only memory lookups and integer addition. It is easily parallelizable: each LUT in a layer is independent, and the occurrence counts can be computed for disjoint subsets of the training data and added together.
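A rough Python sketch of this layer-by-layer procedure, reusing learn_lut and eval_lut from the earlier sketch (layer sizes, names, and wiring are illustrative assumptions, not the original code):

import numpy as np

def train_lut_network(x_bits, y, n_layers, n_luts, k, seed=0):
    # x_bits: (n_samples, n_inputs) 0/1 matrix; y: (n_samples,) 0/1 labels
    rng = np.random.default_rng(seed)
    layers, acts = [], x_bits
    for _ in range(n_layers):
        conns = rng.integers(0, acts.shape[1], size=(n_luts, k))  # random wiring for this layer
        # Two passes per LUT: count patterns, then assign the truth table by majority vote
        tables = np.stack([learn_lut(acts[:, c], y, k, rng) for c in conns])
        # Forward pass: this layer's outputs become the next layer's inputs
        acts = np.stack([eval_lut(t, acts[:, c], k) for t, c in zip(tables, conns)], axis=1)
        layers.append((conns, tables))
    return layers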

Outline Introduction and motivation Proposed architecture Proposed training procedure Experimental results Conclusions and future work

Experimental Setup Implemented and applied to MNIST and CIFAR-10. Binarize the MNIST problem (CIFAR-10 is similar): input pixel values map to 0 = [0, 127] and 1 = [128, 255]; distinguish digits 0 = {0,1,2,3,4} vs 1 = {5,6,7,8,9}. Training takes two passes for each layer: count the patterns, then compare counters and assign LUT functions. Evaluation takes one pass for each layer: compute the output value of each LUT. Training time is close to 1 minute for MNIST; memory used is close to 100 MB.
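For illustration, the binarization described above could be done as follows (assuming 8-bit grayscale images and digit labels 0-9; this is a sketch, not the original setup code):

import numpy as np

def binarize_mnist(images, labels):
    # Pixels: 0 = [0, 127], 1 = [128, 255]; flatten each image into a bit vector
    x_bits = (images.reshape(len(images), -1) >= 128).astype(np.uint8)
    # Labels: 0 = {0,1,2,3,4}, 1 = {5,6,7,8,9}
    y = (labels >= 5).astype(np.uint8)
    return x_bits, y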

Feasibility Check on MNIST Considered a network with 5 hidden layers, each with 1024 LUTs (k=8). Training accuracy = 0.89, test accuracy = 0.87; for reference, a well-trained CNN = 0.98 and random chance = 0.50.

Accuracy as Function of Depth

Accuracy as Function of LUT Size LUT size (k) controls the “degree” of memorization: the larger k is, the better the fit on random data. When k = 14, the network behaves much like a neural network: it memorizes random data yet generalizes on real data!

Comparison With Other Methods [Tables: accuracy comparison on MNIST and CIFAR-10.] Conclusion: not state-of-the-art, but much better than chance and close to other methods.

Pairwise MNIST Profiling 45 tasks that distinguish digit pairs (e.g., “1” vs “2”). [Plots: accuracy profiles for 5 layers of 1024 LUTs/layer and for k=2 with 1024 LUTs/layer.] Increasing LUT size leads to overfitting, while increasing depth with k=2 does not. Conclusion: k controls memorization; small k generalizes well.

Pairwise CIFAR-10

Outline Introduction and motivation Proposed architecture Proposed training procedure Experimental results Conclusions and future work

Conclusion Pure memorization can lead to generalization! The proposed learning model is very simple, yet it replicates some interesting features of neural networks: increasing depth helps; random data is memorized while real data is generalized; and memorizing random data is harder than memorizing real data. Interestingly, small values of k (including k=2) lead to good results without overfitting. Other logic synthesis methods that produce networks composed of two-input gates could be of interest.

Future Work Is this approach to ML useful in practice? How to extend beyond binary classification? Can the accuracy be improved? Need better theoretical understanding

Abstract In the machine learning research community, it is generally believed that there is a tension between memorization and generalization. In this work, we examine to what extent this tension exists, by exploring if it is possible to generalize by memorizing alone. Although direct memorization with one lookup table obviously does not generalize, we find that introducing depth in the form of a network of support-limited lookup tables leads to generalization that is significantly above chance and closer to those obtained by standard learning algorithms on tasks derived from MNIST and CIFAR-10. Furthermore, we demonstrate through a series of empirical results that our approach allows for a smooth tradeoff between memorization and generalization and exhibits the most salient characteristics of neural networks: depth improves performance; random data can be memorized and yet there is generalization on real data; and memorizing random data is harder than memorizing real data. The extreme simplicity of the algorithm and potential connections with generalization theory point to interesting directions for future work.