Model-Agnostic Circuit-Based Intrinsic Methods to Detect Overfitting


1 Model-Agnostic Circuit-Based Intrinsic Methods to Detect Overfitting
Sat Chatterjee (Google AI)    Alan Mishchenko (UC Berkeley)

2 Outline
Introduction to machine learning
Intrinsic vs. extrinsic methods
Counterfactual simulation (CFS)
Experiments and discussion
Conclusions and future work

3 Machine Learning (ML)
ML learns useful information from application data
Data is composed of data samples
Data samples are divided into two categories (a minimal split sketch follows):
  the training set is used for training
  the validation set is used to evaluate the quality of training
An ML model is one specific way to do machine learning
  neural networks, random forests, etc.
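As a minimal illustration (not from the slides), a random training/validation split might look as follows in Python; the 80/20 ratio, seed, and function name are arbitrary choices:

```python
import random

def split_samples(samples, train_fraction=0.8, seed=0):
    """Shuffle the data samples and split them into the two categories."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]   # (training set, validation set)
```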

4 Accuracy of an ML Model
In a typical ML scenario, training data is collected and used to train the ML model in several iterations
The more training, the better the result (hopefully)
A trained ML model takes an input data sample and produces a classification result (correct or incorrect)
Accuracy is determined by counting the percentage of correct answers
(Figure: a typical learning curve.)

5 Overfitting
Overfitting occurs when more training improves accuracy on the training set but reduces accuracy on the validation set
The opposite of overfitting is the ability to generalize
  the less overfitting, the better the generalization, and vice versa
Generalization is measured by the generalization gap (see the sketch below)
  the difference between the accuracy on the training set and that on the validation set
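An illustrative sketch of these two quantities, assuming `model` is any callable classifier and each set is a (samples, labels) pair; this is not the authors' code:

```python
def accuracy(model, samples, labels):
    """Fraction of samples the model classifies correctly."""
    correct = sum(model(x) == y for x, y in zip(samples, labels))
    return correct / len(samples)

def generalization_gap(model, train, val):
    """Training accuracy minus validation accuracy."""
    return accuracy(model, *train) - accuracy(model, *val)
```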

6 Intrinsic vs Extrinsic Methods
Intrinsic methods detect overfitting of a model based only on the model and the training data
Extrinsic methods rely on additional knowledge:
  the performance of the model on the validation set
  the details of the training process
  the size of the parameter space
  the limitations of the ML model
  etc.

7 Converting ML Model into a Circuit
ML models perform various computations on data
Computations can be expressed using operations on floating-point or fixed-point numbers (*, +, >, !=, etc.)
Each operation can be represented as a bit-level circuit (see the sketch below)
As a result, we can build a bit-level circuit representing the function of the ML model
  the circuit takes bit-level inputs representing a data sample and produces bit-level outputs representing the classification result
  the circuit is composed of simple primitives (e.g. AND/INV gates)
This circuit can be very large (~1 trillion AND gates); we will deal with this later
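To make the decomposition concrete, here is a hedged sketch of a 1-bit full adder built from AND/INV primitives only (XOR and OR are expanded into AND/INV); chaining one such adder per bit yields a ripple-carry adder for fixed-point addition. The helper names are ours:

```python
def AND(a, b): return a & b
def INV(a):    return a ^ 1

def XOR(a, b):                       # XOR expressed with AND/INV only
    return INV(AND(INV(AND(a, INV(b))), INV(AND(INV(a), b))))

def full_adder(a, b, cin):
    s = XOR(XOR(a, b), cin)          # sum bit
    # carry = (a AND b) OR (cin AND (a XOR b)), with OR built from AND/INV
    carry = INV(AND(INV(AND(a, b)), INV(AND(cin, XOR(a, b)))))
    return s, carry                  # e.g. full_adder(1, 1, 1) -> (1, 1)
```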

8 Benefits of Circuit Representation
If we use circuits of one type (e.g. AIGs), we can handle all ML models uniformly
In fact, we can extract useful information about an ML model from its circuit representation
  without knowing the model type
  without knowing how it was trained
This allows us to develop model-agnostic intrinsic methods to detect overfitting

9 Counterfactual Simulation (CFS)
A value is k-rare if it appears no more than k times during simulation of the training set
Idea 1: the presence of k-rare patterns suggests overfitting (sketched below)
  the model uses special logic to handle specific examples
  simply counting rare patterns does not work well, though
Idea 2: perturbed simulation of the training data
  simulate an example through the model as usual
  when a k-rare pattern is encountered, instead of propagating it to the fanouts, simulate the fanouts with a perturbed value
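A sketch of the rare-pattern count (Idea 1), under an assumed circuit representation: a topologically ordered list of (node, fn, fanins) tuples, with each training example given as a dict of primary-input values. This is illustrative, not the authors' implementation:

```python
from collections import defaultdict

def find_k_rare(circuit, training_inputs, k):
    """Return the set of (node, value) pairs seen at most k times."""
    counts = defaultdict(lambda: [0, 0])        # node -> [#zeros, #ones]
    for inputs in training_inputs:              # one example per dict
        values = dict(inputs)
        for node, fn, fanins in circuit:        # topological order
            values[node] = fn(*(values[f] for f in fanins))
            counts[node][values[node]] += 1
    return {(node, v) for node, c in counts.items()
            for v in (0, 1) if 0 < c[v] <= k}   # values seen, but rarely
```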

10 Example
Assign values at the primary inputs according to the training set
Multiple simulation patterns are packed into 32- or 64-bit strings (see the sketch below)
Perform bitwise simulation in topological order
Find nodes that have k-rare patterns (few 0s or 1s)
In the second round of simulation, complement these rare values
Compare the accuracy of the perturbed simulation against the original simulation
(Figure: simulation patterns on signals a, b, c, d, F.)
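A hedged sketch of the bit-parallel packing: with one machine word per signal, bit i of each word holds that signal's value in pattern i, so a single bitwise operation simulates 64 training patterns at once. The helper names are illustrative:

```python
MASK = (1 << 64) - 1        # one 64-bit word = 64 packed patterns

def sim_and(a, b):
    return a & b            # simulates an AND gate on all 64 patterns

def sim_inv(a):
    return ~a & MASK        # mask keeps the word at 64 bits

def count_ones(word):
    # A node's 1-value is k-rare within this word if this count is <= k
    return bin(word).count("1")
```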

11 Example
Consider a LUT built to detect n specific training examples (a toy version is sketched below)
  the model is extremely overfitted, with 100% accuracy on the training set
Observe that all of the internal signals s0, s1, ..., s(n-1) are 1-rare
  when their values are changed to the opposite during CFS, accuracy drops to 0%
Loss of accuracy during CFS can be used as a measure of overfitting
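A toy version of this argument (our construction, not from the paper): a "model" that memorizes n examples through one selector signal per example. Each selector is 1 for exactly one training example, hence 1-rare, and complementing the selectors misclassifies every memorized example:

```python
def make_lut_model(train_examples, train_labels):
    def model(x, flip_selectors=False):
        for xi, yi in zip(train_examples, train_labels):
            s_i = int(x == xi)          # 1-rare internal selector signal
            if flip_selectors:
                s_i ^= 1                # CFS-style perturbation
            if s_i:
                return yi
        return None                     # default / reject
    return model
```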

12 CFS Implementation
Simulation is performed in two passes over the network (sketched below)
  first, the counts of different patterns in the circuit are computed
  second, the counts are used to perturb the k-rare patterns
The accuracy with and without CFS is compared
CFS is linear-time in the size of the graph and the training data
Several tricks are used to improve efficiency
  simulation is bit-parallel across all training examples
  reference counting is used to recycle simulation info
It takes about 10 min and 2 GB for a neural network with 300K MACC operations on a 3.7 GHz Xeon CPU
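A toy rendition of the two-pass scheme, reusing find_k_rare from the earlier sketch; `classify` is an assumed function that maps the final node values to a predicted class. This is not the authors' bit-parallel implementation:

```python
def cfs_accuracy(circuit, training_inputs, labels, classify, k):
    rare = find_k_rare(circuit, training_inputs, k)       # pass 1: counts
    correct = 0
    for inputs, label in zip(training_inputs, labels):    # pass 2: perturb
        values = dict(inputs)
        for node, fn, fanins in circuit:
            v = fn(*(values[f] for f in fanins))
            # Complement any k-rare value before it reaches the fanouts
            values[node] = v ^ 1 if (node, v) in rare else v
        correct += classify(values) == label
    return correct / len(labels)      # compare against unperturbed accuracy
```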

13 Deriving Circuits
Train a model (e.g. a neural network) on the MNIST data set
Quantize floating-point values down to 6-bit fixed-point (see the sketch below)
Decompose multipliers, adders, MUXes, ReLUs, etc. into two-input AND/XOR nodes
Each MACC unit multiplies a signed 8-bit constant (the weight) by a signed 16-bit input (the activation) and accumulates the result in 24 bits with saturation
The resulting logic circuit has the following parameters:
  the inputs of the circuit are individual bits of the pixel data (for MNIST, there are 28*28*8 = 6272 inputs)
  the outputs are signed 16-bit activations before the softmax (for MNIST, there are 16*10 = 160 outputs)
  the node count is about 40M for an NN with 300K MACCs
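A hedged sketch of the quantization step; the choice of 4 fractional bits and the saturation convention are assumptions, not taken from the paper:

```python
def quantize_6bit(x, frac=4):
    """Round a float to a signed 6-bit fixed-point code, with saturation."""
    q = round(x * (1 << frac))        # scale and round to an integer grid
    q = max(-32, min(31, q))          # saturate to the signed 6-bit range
    return q                          # represented value is q / 2**frac
```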

14 Benchmark Problems
3 neural networks and 2 random forests were trained
The first two networks (nn-real-2 and nn-real-100) are trained on the MNIST training set for 2 and 100 epochs, respectively
  training accuracies are 97% and 99.90%, respectively
  validation accuracies are 97% (0% gap) and 98.24% (1.66% gap)
The third network (nn-random) is trained for 300 epochs on a variant of MNIST where the output labels in the training set are permuted pseudo-randomly
  training accuracy is 91.27%
  validation accuracy is 9.73% (i.e., close to chance)
The forests (rf-real and rf-random) have 10 trees each and are trained using the default settings of Scikit-learn (the second on MNIST with permuted labels)
  training accuracy is 100% for both
  validation accuracies are 95.58% and 10%, respectively

15 Effect of Simple CFS

16 Impact of Circuit Structure
Different multiplier architecture
Breaking XORs into ANDs

17 Counting Rare Patterns

18 Random Forests

19 Using Blanket Noise
CFS curves for NNs and forests
Noise curves for NNs and forests
Blanket noise is created by simulating the training set while randomly flipping node values with probability p ranging from 2^-30 to 2^-5 (sketched below)
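For contrast with CFS, a sketch of blanket noise under the same assumed circuit representation as the earlier sketches: every node value is flipped independently with probability p, with no use of pattern counts:

```python
import random

def noisy_simulate(circuit, inputs, p, rng=None):
    """Simulate one example, flipping each node value with probability p."""
    rng = rng or random.Random(0)
    values = dict(inputs)
    for node, fn, fanins in circuit:          # topological order
        v = fn(*(values[f] for f in fanins))
        values[node] = v ^ 1 if rng.random() < p else v
    return values
```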

20 Sensitivity to Perturbation
CFS curves
Noise curves
A family of NNs was trained for different numbers of epochs on input data that is half real and half random (other randomness ratios led to similar results)

21 Discussion Topics
Dependence of CFS on circuit structure
Possibility of adversarial attacks on CFS
CFS compared with blanket noise
Generalization in deep learning

22 CFS Depends on Circuit Structure
We observed that circuit structure impacts CFS
This is bad news in the adversarial setup
  a good model with a poor implementation may show steeper degradation under CFS than a more overfit model with a better implementation
Ideally, a variant of CFS is needed that depends not on the structure but only on the function

23 Adversarial Attack on CFS
A poorly trained model can be more resilient under CFS than a well-trained model
  for example, the overfit model rf-random, trained on random labels, falls off more slowly than the well-trained nn-real-2
The reason: although each tree in rf-random is overfit, the circuit nodes have few rare patterns due to observability don't-cares (ODCs)
  experimentally, the MUX circuits have 10x more ODCs than the adder trees derived from the neural networks

24 Comparison with Blanket Noise
Blanket noise is less sensitive than CFS
  forests are 1000x more fault-tolerant to bit flips than neural networks
Noise-based intrinsic methods can be easily fooled by an adversary who adds redundancy

25 Generalization in Deep Learning
CFS on nn-random and the rare-pattern counts provide direct evidence that, even on random data, networks do not "brute-force memorize" but identify common patterns
This supports the claim of Arpit et al. [2017, §1] that "SGD learns simpler patterns first before memorizing"

26 Conclusion
The main result: CFS, which adds small amounts of targeted noise at the logic-circuit level, detects overfitting
This is remarkable because the circuit structure is uniform for any learning model (a neural network, a random forest, or a lookup table)
CFS is naturally free of hyper-parameters
By studying rare patterns, we find that SGD does not lead to "brute force" memorization
  instead, SGD finds common patterns when trained on data with either real or random labels
  neural networks are similar to random forests in this regard

27 Future Work
Make CFS hierarchical and apply it to larger models
Apply CFS at higher levels of abstraction
  although at those levels there are more degrees of freedom
Continue searching for an intrinsic method that is independent of circuit structure and/or adversarially robust
  or show that no such method exists, at least for practical models
Can learning produce a certificate of generalization?
  similar to SAT solvers producing a certificate of (un)satisfiability

28 Abstract
The focus of this paper is on intrinsic methods to detect overfitting. These rely only on the model and the training data, as opposed to traditional extrinsic methods that rely on performance on a test set or on bounds from model complexity. We propose a family of intrinsic methods called Counterfactual Simulation (CFS) which analyze the flow of training examples through the model by identifying and perturbing rare patterns. By applying CFS to logic circuits we get a method that has no hyper-parameters and works uniformly across different types of models such as neural networks, random forests, and lookup tables. Experimentally, CFS can separate models with different levels of overfit using only their logic circuit representations without any access to the high-level structure. By comparing lookup tables, neural networks, and random forests using CFS, we get insight into why neural networks generalize. In particular, we find that stochastic gradient descent in neural nets does not lead to "brute force" memorization, but finds common patterns (whether we train with actual or randomized labels), and neural networks are not unlike forests in this regard. Finally, we identify a limitation with our proposal that makes it unsuitable in an adversarial setting, but points the way to future work on robust intrinsic methods.

