1
Learning and Memorization
Alan Mishchenko, University of California, Berkeley (original work by Sat Chatterjee, Google Research)
2
Outline
Introduction and motivation
Proposed architecture
Proposed training procedure
Experimental results
Conclusions and future work
3
Outline
Introduction and motivation
Proposed architecture
Proposed training procedure
Experimental results
Conclusions and future work
4
Memorization vs Generalization
Memorization
- Ability to remember training data
- Poor classification on testing data
Generalization
- Ability to learn from training data
- Good classification on testing data
5
Observation and Question
Neural networks (NNs) memorize training data quite well, as shown by getting zero error on ImageNet images with permuted labels.
C. Zhang et al., "Understanding deep learning requires rethinking generalization". Proc. ICLR'17
Question: If NNs can memorize random training data, why do they generalize on real data?
6
One Answer and More Questions
Generalization and memorization depend not only on the network architecture and the training procedure, but also on the dataset itself.
D. Arpit et al., "A closer look at memorization in deep networks". Proc. ICML'17
Question 1: Is it possible that memorization and generalization do not contradict each other?
Question 2: If this is true, how well can we learn if memorization is all we can do? Is generalization even possible in this setting?
7
Outline
Introduction and motivation
Proposed architecture
Proposed training procedure
Experimental results
Conclusions and future work
8
The Proposed Architecture
A naïve approach would be to use one memory block to remember input data and output classes:
- does not scale well and results in overfitting
- need a smarter method
We build a network of memory blocks, called k-input lookup tables (LUTs), arranged in layers.
Unlike a neural network, training is done through memorization, without such key features as:
- back-propagation
- gradient descent
- any explicit search
9
A Network of LUTs
Each LUT is connected to k random outputs of the previous layer (i.e. maps a k-bit vector to 0 or 1)
- k is typically less than 16
- k=2 in this example
(Figure: the input layer, followed by a first and a second layer of lookup tables, ending in the output.)
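A minimal Python sketch of how such random connectivity might be set up (the function and parameter names below are my own, not from the slides): each LUT in a layer is wired to k indices chosen at random from the previous layer's outputs.

```python
import random

def build_connections(num_inputs, luts_per_layer, num_layers, k, seed=0):
    """For every LUT, pick k random indices into the previous layer's outputs.

    Returns a list of layers; each layer is a list of LUTs; each LUT is a
    tuple of k input indices. (All names here are illustrative.)
    """
    rng = random.Random(seed)
    connections = []
    prev_width = num_inputs            # the first layer reads the raw input bits
    for _ in range(num_layers):
        layer = [tuple(rng.randrange(prev_width) for _ in range(k))
                 for _ in range(luts_per_layer)]
        connections.append(layer)
        prev_width = luts_per_layer    # the next layer reads this layer's outputs
    return connections

# Example in the spirit of the slide: k=2 LUTs reading 8 input bits
conns = build_connections(num_inputs=8, luts_per_layer=4, num_layers=2, k=2)
```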
10
Similarity with Neural Networks
- A convolutional filter is support-limited
- A fully connected layer is not support-limited, but is limited in expressive power
- Learned weight matrices are often sparse, or can be made so with no loss in accuracy
11
Simplified Experimental Setup
(1) Consider only binary classification problems
(2) Consider only discrete signals
- A typical LUT has several binary inputs and one binary output
- As a result, inputs, intermediate signals, and outputs in the proposed LUT network are all binary
This restriction is not as extreme as it may appear: research in quantized and binary neural networks (BNNs) shows that limited precision is often sufficient.
M. Rastegari et al., "XNOR-Net: ImageNet classification using binary convolutional neural networks". Proc. ECCV'16
12
Outline Introduction and motivation Proposed architecture
Proposed training procedure Experimental results Conclusions and future work
13
Formal Description
Consider the problem of learning a function f : B^k → B from a list of training pairs (x, y), where x ∈ B^k and y ∈ B.
To learn by memorizing, first construct a table of 2^k rows (one for each pattern p ∈ B^k) and two columns, y0 and y1.
The y0 entry for row p (denoted c_p^0) counts how many times p leads to output 0, i.e., the number of times (p, 0) occurs in the training set.
Similarly, the y1 entry for row p (denoted c_p^1) counts how many times p leads to output 1 in the training set.
Associate the function f : B^k → B with the table as follows:
f(p) = 1 if c_p^1 > c_p^0, 0 if c_p^1 < c_p^0, and b if c_p^1 = c_p^0, where b ∈ B is picked uniformly at random.
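A small Python sketch of this memorization rule for a single k-input LUT (names are illustrative): one pass counts, for every pattern p, how often it co-occurs with label 0 and with label 1; the output for p is then fixed by comparing the two counters, with ties broken uniformly at random.

```python
import random
from collections import defaultdict

def learn_lut(training_pairs, rng=None):
    """Learn f : B^k -> B by memorization, following the rule above.

    training_pairs: iterable of (pattern, label), where pattern is a tuple
    of k bits and label is 0 or 1.
    """
    rng = rng or random.Random(0)
    counts = defaultdict(lambda: [0, 0])      # pattern p -> [c_p^0, c_p^1]
    for pattern, label in training_pairs:     # count occurrences
        counts[pattern][label] += 1

    lut = {}
    for pattern, (c0, c1) in counts.items():  # compare counters
        if c1 != c0:
            lut[pattern] = int(c1 > c0)
        else:
            lut[pattern] = rng.randint(0, 1)  # tie: b picked uniformly at random
    return lut

# Tiny usage example: a noisy 2-input AND
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1), ((1, 1), 1), ((1, 1), 0)]
f = learn_lut(data)
assert f[(1, 1)] == 1                         # c^1 = 2 > c^0 = 1 for pattern (1, 1)
```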
14
Example 1: Single LUT
(Figure: a single LUT computing f from inputs x0, x1, x2.)
15
Example 2: Two LUT Layers
(Figure: two first-layer LUTs, f10 and f11, feeding a second-layer LUT f20.)
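Since the diagram itself is not reproduced here, the following toy Python snippet illustrates what evaluating such a two-layer example looks like; the contents of f10, f11, and f20 are hypothetical, chosen only for illustration.

```python
# Hypothetical 2-input LUTs, written as dicts from bit patterns to bits.
f10 = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}   # behaves like OR
f11 = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}   # behaves like AND
f20 = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}   # behaves like XOR

def evaluate(x):                       # x = (x0, x1, x2, x3), all bits
    h0 = f10[(x[0], x[1])]             # first layer: one lookup per LUT
    h1 = f11[(x[2], x[3])]
    return f20[(h0, h1)]               # second layer: lookup on layer-1 outputs

print(evaluate((1, 0, 1, 1)))          # OR -> 1, AND -> 1, XOR -> 0
```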
16
Analysis of Training Procedure
The procedure is linear in the size of the training data, with two passes over the data for each layer:
- first, counting input patterns
- second, comparing counters and assigning the outputs
It is efficient, since it involves only counting and LUT evaluation:
- does not use floating point
- uses only memory lookups and integer addition
It is easily parallelizable:
- each LUT in a layer is independent
- the occurrence counts can be computed for disjoint subsets of the training data and added together
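A hedged end-to-end sketch of this layer-by-layer procedure in Python (the function names, the connectivity format, and the random default for patterns never seen in training are my own assumptions): for each layer, the first pass counts the k-bit patterns seen by every LUT against the class label, the second pass turns the counters into LUT contents, and the training set is then pushed through the layer to provide inputs for the next one.

```python
import random
from collections import defaultdict

def train_lut_network(inputs, labels, connections, seed=0):
    """Train a network of k-input LUTs by memorization, one layer at a time.

    inputs:      list of bit tuples (binarized training examples)
    labels:      list of 0/1 class labels
    connections: per layer, a list of LUTs, each a tuple of input indices
    Returns, per layer, one dict per LUT mapping k-bit patterns to 0/1.
    """
    rng = random.Random(seed)
    luts = []
    activations = inputs
    for layer in connections:
        # Pass 1: count how often each pattern co-occurs with label 0 / label 1.
        counters = [defaultdict(lambda: [0, 0]) for _ in layer]
        for x, y in zip(activations, labels):
            for lut_counts, idx in zip(counters, layer):
                lut_counts[tuple(x[i] for i in idx)][y] += 1
        # Pass 2: compare counters and fix each LUT's output per pattern.
        layer_luts = [{p: (int(c1 > c0) if c1 != c0 else rng.randint(0, 1))
                       for p, (c0, c1) in lut_counts.items()}
                      for lut_counts in counters]
        luts.append(layer_luts)
        # Push the training data through this layer to feed the next one.
        activations = [evaluate_layer(x, layer, layer_luts, rng)
                       for x in activations]
    return luts

def evaluate_layer(x, layer, layer_luts, rng):
    """One inference pass over a layer: a single memory lookup per LUT.
    Patterns never seen during training default to a random bit here."""
    return tuple(table.get(tuple(x[i] for i in idx), rng.randint(0, 1))
                 for idx, table in zip(layer, layer_luts))
```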
17
Outline Introduction and motivation Proposed architecture
Proposed training procedure Experimental results Conclusions and future work
18
Experimental Setup
Implemented and applied to MNIST and CIFAR-10
Binarize the MNIST problem (CIFAR is similar):
- Input pixel value: 0 = [0, 127] and 1 = [128, 255]
- Distinguish digits: 0 = {0,1,2,3,4} and 1 = {5,6,7,8,9}
Training in two passes for each layer:
- Count the patterns
- Compare counters and assign LUT functions
Evaluation in one pass for each layer:
- Compute the output value of each LUT
The training time is close to 1 minute (for MNIST)
The memory used is close to 100 MB
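A small sketch of the binarization step described above, using NumPy (the array names and shapes are assumptions): pixels in [0, 127] map to 0, pixels in [128, 255] map to 1, and digits {0,...,4} versus {5,...,9} become the two classes.

```python
import numpy as np

def binarize_mnist(images, digit_labels):
    """Binarize MNIST as described on this slide.

    images:       uint8 array of shape (N, 784) with pixel values in [0, 255]
    digit_labels: integer array of shape (N,) with digits 0..9
    """
    x = (images >= 128).astype(np.uint8)       # pixel in [128, 255] -> 1, else 0
    y = (digit_labels >= 5).astype(np.uint8)   # digits {5,...,9} -> class 1
    return x, y
```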
19
Feasibility Check on MNIST
Considered a network with 5 hidden layers and 1024 LUTs (k=8) in each layer:
- well-trained CNN = 0.98
- training accuracy = 0.89
- test accuracy = 0.87
- random chance = 0.50
20
Accuracy as Function of Depth
21
Accuracy as Function of LUT Size
- LUT size (k) controls the "degree" of memorization
- The larger k is, the better it is for random data
- When k = 14, it is close to a neural network: memorizes random data yet generalizes on real data!
22
Comparison With Other Methods
(Tables: accuracy on MNIST and CIFAR-10 compared to other methods.)
Conclusion: Not state-of-the-art, but much better than chance and close to other methods
23
Pairwise MNIST
Profiling 45 tasks that distinguish digit pairs (e.g. "1" vs "2")
(Plots: varying LUT size with 5 layers of 1024 LUTs/layer; varying depth with k=2 and 1024 LUTs/layer.)
- Increasing LUT size leads to overfitting
- Increasing depth with k=2 does not lead to overfitting
Conclusion: k controls memorization; small k generalizes well
24
Pairwise CIFAR-10
25
Outline
Introduction and motivation
Proposed architecture
Proposed training procedure
Experimental results
Conclusions and future work
26
Conclusion
Pure memorization can lead to generalization!
The proposed learning model is very simple, yet it replicates some interesting features of neural networks:
- Increasing depth helps
- Memorizing random data; generalizing on real data
- Memorizing random data is harder than real data
Interestingly, small values of k (including k=2) lead to good results without overfitting.
Other logic synthesis methods that produce networks composed of two-input gates could be of interest.
27
Future Work
- Is this approach to ML useful in practice?
- How to extend beyond binary classification?
- Can the accuracy be improved?
- Need better theoretical understanding
28
Abstract In the machine learning research community, it is generally believed that there is a tension between memorization and generalization. In this work, we examine to what extent this tension exists, by exploring if it is possible to generalize by memorizing alone. Although direct memorization with one lookup table obviously does not generalize, we find that introducing depth in the form of a network of support-limited lookup tables leads to generalization that is significantly above chance and closer to those obtained by standard learning algorithms on tasks derived from MNIST and CIFAR-10. Furthermore, we demonstrate through a series of empirical results that our approach allows for a smooth tradeoff between memorization and generalization and exhibits the most salient characteristics of neural networks: depth improves performance; random data can be memorized and yet there is generalization on real data; and memorizing random data is harder than memorizing real data. The extreme simplicity of the algorithm and potential connections with generalization theory point to interesting directions for future work.