Model-Agnostic Circuit-Based Intrinsic Methods to Detect Overfitting

Model-Agnostic Circuit-Based Intrinsic Methods to Detect Overfitting
Sat Chatterjee (Google AI) and Alan Mishchenko (UC Berkeley)

Outline
Introduction to machine learning
Intrinsic vs extrinsic methods
Counterfactual simulation (CFS)
Experiments and discussion
Conclusions and future work

Machine Learning (ML)
ML learns useful information from application data
Data is composed of data samples
Data samples are divided into two categories:
  the training set is used for training
  the validation set is used for evaluating the quality of training
An ML model is one specific way to do machine learning
  neural networks, random forests, etc.

Accuracy of an ML Model
In a typical ML scenario, training data is collected and used to train the ML model in several iterations
The more training, the better the result (hopefully)
A trained ML model takes an input data sample and produces a classification result (correct or incorrect)
Accuracy is the percentage of correct answers
(Figure: a typical learning curve)

Overfitting
Overfitting occurs when more training improves accuracy on the training set while reducing accuracy on the validation set
The opposite of overfitting is the ability to generalize
The less overfitting, the better the generalization, and vice versa
Generalization is measured by the generalization gap: the difference between training-set accuracy and validation-set accuracy
  for example, a model with 99.90% training accuracy and 98.24% validation accuracy has a 1.66% gap

Intrinsic vs Extrinsic Methods
Intrinsic methods detect overfitting of a model based only on the model and the training data
Extrinsic methods rely on additional knowledge:
  the performance of the model on the validation set
  the details of the training process
  the size of the parameter space
  the limitations of the ML model
  etc.

Converting an ML Model into a Circuit
ML models perform various computations on data
Computations can be expressed using operations on floating-point or fixed-point numbers (*, +, >, !=, etc.)
Each operation can be represented as a bit-level circuit
As a result, we can build a bit-level circuit representing the function of the ML model
  the circuit takes bit-level inputs representing a data sample and produces bit-level outputs representing the classification result
  the circuit is composed of simple primitives (e.g. AND/INV gates)
This circuit can be very large (~1 trillion AND gates); we deal with this later
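For instance, a fixed-point adder decomposes into two-input gates. Below is a minimal Python sketch of building a ripple-carry adder out of AND/XOR/OR nodes; the node representation is an assumption for illustration, not the authors' tool flow (in an AIG, the OR and XOR would be further expressed via AND gates and inverters):

    # Sketch: build an n-bit ripple-carry adder from 2-input gate nodes.
    # Gates are recorded as nested tuples; this is illustrative only.

    def full_adder(a, b, cin, gates):
        # sum = a XOR b XOR cin; carry = (a AND b) OR (cin AND (a XOR b))
        axb = ("XOR", a, b); gates.append(axb)
        s = ("XOR", axb, cin); gates.append(s)
        g1 = ("AND", a, b); gates.append(g1)
        g2 = ("AND", cin, axb); gates.append(g2)
        cout = ("OR", g1, g2); gates.append(cout)
        return s, cout

    def ripple_adder(xs, ys, gates):
        # xs, ys: LSB-first lists of input bit nodes
        carry, sums = "const0", []
        for a, b in zip(xs, ys):
            s, carry = full_adder(a, b, carry, gates)
            sums.append(s)
        return sums, carry

    gates = []
    sums, cout = ripple_adder(["x0", "x1"], ["y0", "y1"], gates)
    print(len(gates), "gate nodes for a 2-bit adder")

Repeating this decomposition for every multiplier, adder, and comparator in the model is what makes the resulting circuit so large.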

Benefits of Circuit Representation
If we use circuits of one type (e.g. AIGs), we can handle all ML models uniformly
In fact, we can find useful information about an ML model using its circuit representation
  without knowing the model type
  without knowing how it was trained
This allows us to develop model-agnostic intrinsic methods to detect overfitting

Counterfactual Simulation (CFS)
A value is k-rare if it appears no more than k times during simulation of the training set
Idea 1: presence of k-rare patterns suggests overfitting
  the model uses special logic to handle specific examples
  simply counting rare patterns does not work well, though
Idea 2: perturbed simulation of the training data
  simulate an example through the model as usual
  when a k-rare pattern is encountered, instead of propagating it to the fanouts, simulate the fanouts with a perturbed value
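To make the k-rarity test concrete, here is a minimal Python sketch, assuming we have already counted how often each node evaluates to 1 over the training set (the counting interface is an assumption, not the paper's code):

    # Sketch of the k-rarity test. ones[n] = number of training examples
    # on which node n evaluates to 1; total = number of training examples.

    def rare_value(node, ones, total, k):
        """Return the k-rare value (0 or 1) at node, or None if neither
        value appears no more than k times."""
        if ones[node] <= k:
            return 1
        if total - ones[node] <= k:
            return 0
        return None

    ones = {"s0": 1, "s1": 5000}
    print(rare_value("s0", ones, total=10000, k=1))  # -> 1 (the value 1 is 1-rare)
    print(rare_value("s1", ones, total=10000, k=1))  # -> None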

Example
Assign values at the primary inputs according to the training set
Multiple simulation patterns are packed into 32- or 64-bit strings
Perform bitwise simulation in topological order
Find nodes that have k-rare patterns (few 0s or 1s)
In the second round of simulation, complement these rare values
Compare the accuracy of the perturbed simulation against the original simulation
(Figure: a small circuit with inputs a, b, c, d and output F, simulated on four packed training patterns)
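The packing and bitwise simulation steps can be sketched as follows. This illustrative Python version uses arbitrary-width integers as bit-vectors (one bit per training example), whereas the real implementation packs patterns into 32- or 64-bit machine words; the names pack and simulate are assumptions:

    # Bit-parallel simulation sketch: each node's value over all training
    # examples is packed into one integer, one bit per example.

    def pack(column):                      # column: list of 0/1 values
        word = 0
        for i, bit in enumerate(column):
            word |= bit << i
        return word

    def simulate(nodes, input_words, n):
        """nodes: list of (name, op, in0, in1) in topological order."""
        mask = (1 << n) - 1                # n = number of training examples
        val = dict(input_words)
        for name, op, a, b in nodes:
            if op == "AND":
                val[name] = val[a] & val[b]
            elif op == "XOR":
                val[name] = (val[a] ^ val[b]) & mask
            else:                          # NOT (b is unused)
                val[name] = ~val[a] & mask
        return val

    inputs = {"a": pack([1, 1, 0, 1]), "b": pack([1, 0, 0, 1])}
    vals = simulate([("f", "AND", "a", "b")], inputs, n=4)
    print(bin(vals["f"]))                  # 0b1001: f is 1 on examples 0 and 3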

Example
Consider a LUT built to detect n specific training examples
  the model is extremely overfitted, with 100% accuracy on the training set
Observe that all of the internal match signals s0, s1, ..., s(n-1) are 1-rare
  when their value is changed to the opposite during CFS, accuracy drops to 0%
Loss of accuracy during CFS can be used as a measure of overfitting
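A toy Python sketch of this argument: each internal match signal is 1 on exactly one training example, so it is 1-rare, and flipping those rare 1s (as CFS does) drops training accuracy from 100% to 0%. The data and helper names here are made up:

    # A lookup table that memorizes n training examples: each match
    # signal s_i is 1 on exactly one example, hence 1-rare.

    train = [((0, 1, 1), "cat"), ((1, 0, 0), "dog"), ((1, 1, 0), "cat")]

    def lut_predict(x, perturb_rare=False):
        for example, label in train:
            s_i = (x == example)       # true for exactly one example: 1-rare
            if perturb_rare and s_i:
                s_i = False            # CFS-style flip of the rare value 1
            if s_i:
                return label
        return None                    # no match

    acc = sum(lut_predict(x) == y for x, y in train) / len(train)
    acc_cfs = sum(lut_predict(x, True) == y for x, y in train) / len(train)
    print(acc, acc_cfs)                # 1.0 on the training set, 0.0 under CFS

Note that only the rare value (the single 1) is perturbed; the common value 0 occurs n-1 times and is left alone.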

CFS Implementation
Simulation is performed in two passes over the network
  first, the counts of different patterns in the circuit are computed
  second, the counts are used to perturb the k-rare patterns
The accuracy with and without CFS is compared
CFS is linear-time in the size of the graph and the training data
Several tricks are used to improve efficiency
  simulation is bit-parallel for all training examples
  reference counting is used to recycle simulation info
It takes about 10 min and 2 GB for a neural network with 300K MACC operations on a 3.7 GHz Xeon CPU
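Building on the simulate() sketch above, here is a hedged sketch of the two-pass scheme for the bit-level special case, where each node's "pattern" is a single 0/1 value; this is not the authors' code:

    # Two-pass CFS sketch. Pass 1 counts each node's 1s over the training
    # set; pass 2 re-simulates, replacing a node's k-rare value wherever
    # it occurs in the perturbed run.

    def cfs(nodes, input_words, n, k):
        mask = (1 << n) - 1
        val = simulate(nodes, input_words, n)         # pass 1: count patterns
        ones = {name: bin(v).count("1") for name, v in val.items()}
        pert = dict(input_words)                      # pass 2: perturb
        for name, op, a, b in nodes:
            if op == "AND":
                w = pert[a] & pert[b]
            elif op == "XOR":
                w = (pert[a] ^ pert[b]) & mask
            else:                                     # NOT
                w = ~pert[a] & mask
            if ones[name] <= k:         # value 1 is k-rare: flip its 1s to 0s
                w = 0
            elif n - ones[name] <= k:   # value 0 is k-rare: flip its 0s to 1s
                w = mask
            pert[name] = w
        return val, pert                # compare accuracies of the two runs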

Deriving Circuits
Train a model (e.g. a neural network) on the MNIST data set
Quantize floating-point values down to 6-bit fixed-point
Decompose multipliers, adders, MUXes, ReLUs, etc. into two-input AND/XOR nodes
Each MACC unit multiplies a signed 8-bit constant (the weight) by a signed 16-bit input (the activation) and accumulates the result in 24 bits with saturation
The resulting logic circuit has the following parameters:
  the inputs are individual bits of the pixel data (for MNIST, 28*28*8 = 6272 inputs)
  the outputs are signed 16-bit activations before the softmax (for MNIST, 16*10 = 160 outputs)
  the node count is about 40M for an NN with 300K MACCs
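As a concrete illustration of the MACC unit just described, here is a small Python sketch of saturating fixed-point accumulation; the clamp bounds follow from a signed 24-bit accumulator, and everything else (names, test values) is assumed:

    # Saturating MACC sketch: signed 8-bit weight x signed 16-bit
    # activation, accumulated in a signed 24-bit register with saturation.

    ACC_MAX = 2**23 - 1
    ACC_MIN = -2**23

    def saturate(x):
        return max(ACC_MIN, min(ACC_MAX, x))

    def macc(weights, activations):
        acc = 0
        for w, a in zip(weights, activations):
            assert -128 <= w <= 127 and -32768 <= a <= 32767
            acc = saturate(acc + w * a)    # saturating accumulation
        return acc

    print(macc([127, -128], [32767, 100]))  # stays within 24 signed bits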

Benchmark Problems
3 neural networks and 2 random forests were trained
The first two networks (nn-real-2 and nn-real-100) are trained on the MNIST training set for 2 epochs and 100 epochs, respectively
  training accuracies are 97% and 99.90%, respectively
  validation accuracies are 97% (0% gap) and 98.24% (1.66% gap)
The third network (nn-random) is trained for 300 epochs on a variant of MNIST where the training labels are permuted pseudo-randomly
  training accuracy is 91.27%
  validation accuracy is 9.73% (i.e., close to chance)
The forests (rf-real and rf-random) have 10 trees each and are trained with the default settings of Scikit-learn version 0.19.1; rf-random is trained on MNIST with permuted labels
  training accuracy is 100% for both
  validation accuracies are 95.58% and 10%, respectively
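The forest setup can be reproduced roughly as follows. This sketch uses fetch_openml to load MNIST purely for convenience (that loader postdates scikit-learn 0.19.1, where 10 trees was the default n_estimators); the seed is an assumption:

    # Sketch of the rf-real / rf-random setup: 10-tree forests with
    # default settings, the second trained on permuted labels.

    import numpy as np
    from sklearn.datasets import fetch_openml
    from sklearn.ensemble import RandomForestClassifier

    X, y = fetch_openml("mnist_784", return_X_y=True, as_frame=False)
    X_train, y_train = X[:60000], y[:60000]   # standard MNIST split
    X_val, y_val = X[60000:], y[60000:]

    rf_real = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)

    rng = np.random.RandomState(0)
    y_perm = rng.permutation(y_train)          # destroy the input-label link
    rf_random = RandomForestClassifier(n_estimators=10).fit(X_train, y_perm)

    print("rf-real:  ", rf_real.score(X_train, y_train), rf_real.score(X_val, y_val))
    print("rf-random:", rf_random.score(X_train, y_perm), rf_random.score(X_val, y_val))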

Effect of Simple CFS (figure slide)

Impact of Circuit Structure
(Figures: results for a different multiplier architecture and for breaking XORs into ANDs)

Counting Rare Patterns (figure slide)

Random Forests (figure slide)

Using Blanket Noise
(Figures: CFS curves for NNs and forests; noise curves for NNs and forests)
Blanket noise is created by simulating the training set while randomly flipping node values with probability p ranging from 2^-30 to 2^-5
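A sketch of blanket noise in the bit-parallel simulation style used earlier: every node value is flipped independently with probability p. The random_mask helper is an assumption for illustration:

    # Blanket-noise sketch: flip each node's value on each example
    # independently with probability p (here p would range 2^-30..2^-5).

    import random

    def random_mask(n, p):
        """Bit-vector of length n where each bit is 1 with probability p."""
        m = 0
        for i in range(n):
            if random.random() < p:
                m |= 1 << i
        return m

    def noisy_simulate(nodes, input_words, n, p):
        mask = (1 << n) - 1
        val = dict(input_words)
        for name, op, a, b in nodes:
            if op == "AND":
                w = val[a] & val[b]
            elif op == "XOR":
                w = (val[a] ^ val[b]) & mask
            else:                              # NOT
                w = ~val[a] & mask
            val[name] = w ^ random_mask(n, p)  # random bit flips
        return val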

Sensitivity to Perturbation
(Figures: CFS curves; noise curves)
A family of NNs was trained for different numbers of epochs on input data that is half-real and half-random (other randomness ratios led to similar results)
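The half-real/half-random labels might be constructed along these lines; this is a sketch, and the seed and exact mixing procedure are assumptions:

    # Sketch: permute the labels of a random half of the training
    # examples, keep the other half untouched.

    import numpy as np

    def half_random_labels(y_train, seed=0):
        rng = np.random.RandomState(seed)
        y = y_train.copy()
        idx = rng.choice(len(y), size=len(y) // 2, replace=False)
        y[idx] = rng.permutation(y[idx])    # randomize half of the labels
        return y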

Discussion Topics
Dependence of CFS on circuit structure
Possibility of adversarial attacks on CFS
CFS compared with blanket noise
Generalization in deep learning

CFS Depends on Circuit Structure
We observed that circuit structure impacts CFS
This is bad news in the adversarial setup
  a good model with a poor implementation may show steeper degradation under CFS than a more overfit model with a better implementation
Ideally, a variant of CFS is needed that depends not on the structure but only on the function

Adversarial Attack on CFS
A poorly trained model can be more resilient under CFS than a well-trained model
  for example, the overfit model rf-random, trained on random labels, falls off more slowly than the well-trained nn-real-2
The reason: although each tree in rf-random is overfit, the circuit nodes have few rare patterns due to observability don't cares (ODCs)
  experimentally, the MUX circuits have 10x more ODCs than the adder trees derived from the neural networks

Comparison with Blanket Noise
Blanket noise is less sensitive than CFS
  forests are 1000x more fault-tolerant to bit flips than neural networks
Noise-based intrinsic methods can be easily fooled by an adversary adding redundancy

Generalization in Deep Learning
CFS on nn-random and rare-pattern counts provide direct evidence that, even on random data, nets do not "brute-force memorize" but identify common patterns
This supports the claim in Arpit et al. [2017, §1] that "SGD learns simpler patterns first before memorizing"

Conclusion
The main result: CFS, based on adding small amounts of targeted noise at the logic-circuit level, detects overfitting
This is remarkable because the circuit representation is uniform across learning models (a neural network, a random forest, or a lookup table)
CFS is naturally free of hyper-parameters
By studying rare patterns, we find that SGD does not lead to "brute force" memorization
  instead, SGD finds common patterns when trained on data with both real and random labels
  neural networks are similar to random forests in this regard

Future Work
Make CFS hierarchical and apply it to larger models
Apply CFS at higher levels of abstraction
  although at those levels there are more degrees of freedom
Continue searching for an intrinsic method that is independent of circuit structure and/or adversarially robust
  or show that none exists, at least for practical models
Can learning produce a certificate of generalization?
  similar to SAT solvers producing a certificate of (un)satisfiability

Abstract
The focus of this paper is on intrinsic methods to detect overfitting. These rely only on the model and the training data, as opposed to traditional extrinsic methods that rely on performance on a test set or on bounds from model complexity. We propose a family of intrinsic methods called Counterfactual Simulation (CFS) which analyze the flow of training examples through the model by identifying and perturbing rare patterns. By applying CFS to logic circuits, we get a method that has no hyper-parameters and works uniformly across different types of models, such as neural networks, random forests, and lookup tables. Experimentally, CFS can separate models with different levels of overfit using only their logic circuit representations, without any access to the high-level structure. By comparing lookup tables, neural networks, and random forests using CFS, we gain insight into why neural networks generalize. In particular, we find that stochastic gradient descent in neural nets does not lead to "brute force" memorization, but finds common patterns (whether we train with actual or randomized labels), and that neural networks are not unlike forests in this regard. Finally, we identify a limitation with our proposal that makes it unsuitable in an adversarial setting, but one that points the way to future work on robust intrinsic methods.