Christoph Rosemann and Helge Voss DESY FLC

Presentation transcript:

Naïve Bayesian Classifier / Data Preprocessing / Linear Discriminator / Artificial Neural Networks
Christoph Rosemann and Helge Voss, DESY FLC
Christoph Rosemann, MVA methods, Terascale Statistics School, Mainz, 6 April 2011

Naïve Bayesian Classifier, aka "Projective Likelihood"
- Kernel methods and Nearest Neighbour estimate the joint pdfs in the full D-dimensional space.
- If correlations are weak or non-existent, the problem can be factorized into a product of one-dimensional pdfs.
- This introduces another problem: how to model the pdfs?
  - Histogramming (event counting): + automatic; - less than optimal
  - Parametric fitting: - difficult to automate
  - Non-parametric fitting (e.g. splines): + automatable; - possible artefacts, information loss
(Plot: example pdfs generated from Gaussians.)
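
The factorization can be sketched in a few lines of numpy; the two Gaussian toy variables, the binning and the candidate point below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy training data: two (approximately uncorrelated) Gaussian
# variables per class, rows = events, columns = variables.
signal = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(10000, 2))
background = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(10000, 2))

def histogram_pdfs(data, bins=50, range_=(-5.0, 5.0)):
    """Estimate one 1-D pdf per variable by event counting."""
    pdfs = []
    for k in range(data.shape[1]):
        counts, edges = np.histogram(data[:, k], bins=bins,
                                     range=range_, density=True)
        pdfs.append((counts, edges))
    return pdfs

def factorized_likelihood(x, pdfs):
    """Product of the 1-D pdfs: valid only if correlations are weak."""
    p = 1.0
    for xk, (counts, edges) in zip(x, pdfs):
        i = np.clip(np.searchsorted(edges, xk) - 1, 0, len(counts) - 1)
        p *= counts[i]
    return p

sig_pdfs = histogram_pdfs(signal)
bkg_pdfs = histogram_pdfs(background)

x = np.array([0.8, 1.2])                       # candidate event
L_s = factorized_likelihood(x, sig_pdfs)
L_b = factorized_likelihood(x, bkg_pdfs)
print(L_s > L_b)                               # the point is signal-like
```

Histogramming is the "automatic" option from the slide; a parametric or spline fit would replace the np.histogram call.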

Likelihood functions and ratios
- The individual p(x) is usually called the likelihood function.
- It is class dependent: signal or background(s).
- For each variable and each class a pdf description is needed.
- The classifier function is the likelihood ratio (per class).

Likelihood Example
Example: electron identification with two classes (electron, pion) and two variables (track momentum over calorimeter energy, cluster shape).
1. Take a candidate and evaluate each variable in each class.
2. Determine the likelihood ratio, e.g. for the electron.
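
The two steps can be sketched directly; the Gaussian pdf parameters for the electron and pion classes below are invented for illustration only:

```python
import math

# Hypothetical (mean, sigma) per variable and class:
# variable 1 = E/p, variable 2 = cluster shape.
PDFS = {
    "electron": [(1.0, 0.10), (0.2, 0.05)],
    "pion":     [(0.4, 0.20), (0.5, 0.10)],
}

def gauss(x, mu, sigma):
    """Normalised 1-D Gaussian pdf."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood(x, params):
    """Step 1: class likelihood as a product of the per-variable pdfs."""
    p = 1.0
    for xk, (mu, sigma) in zip(x, params):
        p *= gauss(xk, mu, sigma)
    return p

def likelihood_ratio(x, cls="electron"):
    """Step 2: likelihood ratio for one class."""
    L = {c: likelihood(x, p) for c, p in PDFS.items()}
    return L[cls] / sum(L.values())

candidate = (0.95, 0.22)     # E/p close to 1, narrow cluster
y = likelihood_ratio(candidate)
print(round(y, 3))
```

For this electron-like candidate the ratio comes out very close to 1.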

Projective Likelihood
- One of the most popular MVA methods in HEP.
- A well-performing, fast, robust and simple classifier.
- If the prior probabilities are known, an estimate of the class probability can be made.
- Big improvement with respect to a simple cut approach: background likeness in one variable can be overcome.
- Problematic if significant (>10%) correlations exist:
  - The classification performance is usually decreased.
  - The factorization approach doesn't hold.
  - The probability estimation is lost.
(Plot: can you tell the correlation?)

Linear Correlations
- Reminder: (standard TMVA example).
- Solution: apply a linear transformation to the input variables so that they become uncorrelated.
- Two main methods are in use.

SQRT Decorrelation
- Determine the square root C' of the correlation matrix C, i.e. C = C'C'.
- Compute C' by diagonalising C.
- Transformation prescription for x: x' = C'^-1 x.
(Plot: standard TMVA example before decorrelation.)

SQRT Decorrelation
- Determine the square root C' of the correlation matrix C, i.e. C = C'C'.
- Compute C' by diagonalising C.
- Transformation prescription for x: x' = C'^-1 x.
(Plot: after decorrelation.)
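
The diagonalisation recipe can be sketched with numpy; the correlated two-variable Gaussian sample below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly correlated toy data (rows = events, columns = variables).
cov_true = np.array([[1.0, 0.8],
                     [0.8, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov_true, size=50000)

# Square root of the sample covariance via diagonalisation:
# C = S D S^T  =>  C^(-1/2) = S D^(-1/2) S^T
C = np.cov(X, rowvar=False)
eigval, S = np.linalg.eigh(C)
C_inv_sqrt = S @ np.diag(1.0 / np.sqrt(eigval)) @ S.T

# Transformation prescription: x' = C^(-1/2) x for every event.
X_dec = X @ C_inv_sqrt.T

print(np.round(np.cov(X_dec, rowvar=False), 2))  # ~ identity matrix
```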

Principal Component Analysis
- Eigenvalue problem of the correlation matrix: the largest eigenvalue corresponds to the largest correlation, and its eigenvector points along the axis with the largest variance.
- PCA is typically used to reduce the dimensionality of a problem and to find its most dominant features.
- Transformation rule: use the eigenvectors as basis (k components) and express the variables in terms of the new basis (for each class).
- The matrix of eigenvectors V diagonalises the correlation matrix, so PCA eliminates (linear) correlations!
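
A minimal PCA sketch along these lines; the three-variable toy data and the choice k = 2 are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated toy data in 3 variables.
X = rng.multivariate_normal(
    [0.0, 0.0, 0.0],
    [[3.0, 1.0, 0.5],
     [1.0, 2.0, 0.3],
     [0.5, 0.3, 1.0]],
    size=20000)

# Eigenvalue problem of the (co)variance matrix.
C = np.cov(X, rowvar=False)
eigval, V = np.linalg.eigh(C)          # columns of V are eigenvectors
order = np.argsort(eigval)[::-1]       # largest eigenvalue first
eigval, V = eigval[order], V[:, order]

# Express the data in the eigenvector basis, keeping the k most
# dominant components (dimensionality reduction).
k = 2
X_pca = (X - X.mean(axis=0)) @ V[:, :k]

# The transformed components are uncorrelated.
print(np.round(np.cov(X_pca, rowvar=False), 2))
```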

How is the Transformation applied?
- The correlations are usually different in signal and background; which one is applied?
- Two cases (in general): explicit difference in the pdfs (e.g. projective likelihood), or no differentiation in the pdfs.
- Options: use either the signal or the background decorrelation, or use a mixture of both.

Classifier Decorrelation example
- Example: linearly correlated Gaussians, where decorrelation works to 100%.
- A 1-D likelihood on the decorrelated sample gives the best possible performance.
- Compare also the effect on the MVA output variable!
(Plots: correlated variables; the same variables with decorrelation. Note the different scale on the y-axis.)

Decorrelation Caveats
What if the correlations are different for signal and background?
(Plots: original signal and background correlations; after SQRT decorrelation.)

Decorrelation Caveats
What happens if the correlations are non-linear?
(Plots: original correlations; after SQRT decorrelation.)
Use decorrelation with care!

Gaussianisation
- Decorrelation works best if the variables are Gauss-distributed, so perform another transformation first: "Gaussianisation".
- A two-step procedure:
  1. Rarity transformation (create a uniform distribution).
  2. Make the distribution Gaussian via the inverse error function.
- Then decorrelate the Gauss-shaped variable distributions.
- Optional: do several iterations of Gaussianisation and decorrelation.
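
The two steps can be sketched with the empirical CDF as the rarity transformation and the standard-normal inverse CDF (equivalent to the inverse error function up to scaling and shift); the exponential toy variable is an assumption:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)

# A strongly non-Gaussian input variable.
x = rng.exponential(scale=2.0, size=10000)

# Step 1: rarity transformation. The empirical CDF (ranks) maps the
# variable to a uniform distribution on (0, 1).
ranks = x.argsort().argsort()
u = (ranks + 0.5) / len(x)

# Step 2: the inverse Gaussian CDF, i.e. sqrt(2) * erfinv(2u - 1),
# turns the uniform distribution into a standard Gaussian.
norm = NormalDist()
x_gauss = np.array([norm.inv_cdf(ui) for ui in u])

print(round(float(x_gauss.mean()), 2), round(float(x_gauss.std()), 2))
```

The result has mean close to 0 and width close to 1, as a standard Gaussian should.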

Example of Gaussianisation
(Plot: original distributions.)

Example of Gaussianisation
(Plot: signal gaussianised.)

Example of Gaussianisation
(Plot: background gaussianised.)
Note: no simultaneous Gaussianisation of signal and background is possible.

Modeling decision boundaries
- Nearest Neighbour, kernel estimators and the likelihood estimate the underlying pdfs; now move on to determining the decision boundary instead.
- This requires a model: a specific parametrization and a specific determination process. Most MVA methods are distinguished by the model.
- In general the parametrization can be expressed as a weighted sum of basis functions, y(x) = sum_i w_i h_i(x).
- Next, specific examples: the Linear Discriminator and Neural Networks.

Linear/Fisher Discriminant
- A linear model of the input variables, resulting in linear decision boundaries.
- Note: the lines are not the functions! They are given by y(x) = const.
- How are the weights determined? By maximal class separation:
  - Maximize the distance between the class mean values.
  - Minimize the variance within each class.
(Figure: hyperplanes H0 and H1 in the (x1, x2) plane; discriminant outputs y(x), yS(x), yB(x).)

Linear Discriminant
- Maximise the "between variance", minimise the "within variance".
- Decompose the covariance matrix C into a within-class part W and a between-class part B.
- Determining the maximum yields the Fisher coefficients Fk.
- All quantities are known from the training data.
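
A sketch of the coefficient computation on invented Gaussian toy classes; building W as the sum of the per-class covariances and taking F proportional to W^-1 (mu_S - mu_B) is a simplified stand-in for the full conventions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two Gaussian toy classes sharing the same covariance.
cov = [[1.0, 0.5], [0.5, 1.0]]
sig = rng.multivariate_normal([1.5, 1.0], cov, size=5000)
bkg = rng.multivariate_normal([-1.0, -0.5], cov, size=5000)

# Within-class matrix W and Fisher coefficients F ~ W^-1 (mu_S - mu_B);
# all quantities come from the training data.
mu_s, mu_b = sig.mean(axis=0), bkg.mean(axis=0)
W = np.cov(sig, rowvar=False) + np.cov(bkg, rowvar=False)
F = np.linalg.solve(W, mu_s - mu_b)

# The discriminant y(x) = F . x; the decision boundary is y(x) = const.
y_sig, y_bkg = sig @ F, bkg @ F
cut = 0.5 * (float(y_sig.mean()) + float(y_bkg.mean()))
print(round(float((y_sig > cut).mean()), 2),   # signal efficiency
      round(float((y_bkg <= cut).mean()), 2))  # background rejection
```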

Nonlinear correlations in LD
- Assume the following non-linearly correlated data: the linear discriminant doesn't give a good result.

Nonlinear correlations in LD
- Assume the following non-linearly correlated data: the linear discriminant doesn't give a good result.
- Use a non-linear decorrelation (here: polar coordinates); the decorrelated data can be separated well.
- Note: linear discrimination doesn't separate data with identical means.

Artificial Neural Networks
- ANNs (short: neural networks) are directional, weighted graphs. Evolution in steps: retrace some of the ideas.
- Statistical concept: to extend the decision boundaries to arbitrary functions, y(x), and in turn the h_i(x), need to be non-linear.
- (Extension of the) Weierstrass theorem: every continuous function defined on an interval [a, b] can be uniformly approximated as closely as desired by a polynomial function. Neural networks choose a particular set of functions.
- Biggest breakthrough: adaptability / (machine) learning.

Nodes and Connections
- Neural networks consist of nodes and connections.
- The activation function depends on the input; the output function depends on the activation. Usual choices: sigmoid, tanh, binary.
- Build a network of nodes in layers:
  - The input layer has no input nodes.
  - The output layer has no next nodes.
  - In between are the hidden layers.
(Figure: sigmoid and binary output functions; input layer, hidden layer and output layer, with connected nodes i and j.)

Example: AND
- Theorem: any Boolean function can be represented by (a certain class of) neural network(s).
- Consider constructing the logical function AND (similarly for OR).
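
A one-node sketch of AND with a binary threshold function; the weights and threshold (w1 = w2 = 1, theta = 1.5) are one of many valid choices:

```python
# A single binary-threshold node: fires (outputs 1) if the weighted
# input sum reaches the threshold theta, else outputs 0.
def perceptron(x1, x2, w1=1.0, w2=1.0, theta=1.5):
    return 1 if w1 * x1 + w2 * x2 >= theta else 0

# Truth table of the node: reproduces logical AND.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron(x1, x2))
# 0 0 0
# 0 1 0
# 1 0 0
# 1 1 1
```

OR follows from the same node by lowering the threshold, e.g. theta = 0.5.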

Example: XOR
- Try to build XOR with the same network: another layer in between is needed.
- This increases the power of description (keep this in mind!).

ANN Adaptability: Hebb's Rule
- In words: if a node receives an input from another node, and if both are highly active (have the same sign), the connection weight between the nodes should be strengthened.
- Mathematically: delta w_ij = eta * o_i * o_j.
- (For realistic learning this has to be modified; e.g. the weights could otherwise grow without limits.)
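
The rule is a one-line update; the learning rate eta = 0.1 and the node activities below are arbitrary illustrative choices:

```python
# Hebb's rule: delta_w = eta * o_i * o_j. The weight grows when both
# connected nodes are active with the same sign.
def hebb_update(w, out_i, out_j, eta=0.1):
    return w + eta * out_i * out_j

w = 0.0
for _ in range(5):                 # repeated co-activation of two nodes
    w = hebb_update(w, 1.0, 1.0)
print(round(w, 1))                 # 0.5

# With opposite-sign activity the weight weakens instead; and applied
# unmodified, the weight would keep growing without limit.
```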

The ancestor: the Perceptron
- Two layers (input and output), one output node.
- Adaptive, modifiable weights; binary transfer function.
- Perceptron convergence theorem: the learning algorithm converges in finite time; the perceptron can learn everything it can represent in a finite amount of time.
- Now take a closer look at this representability.

Linear Separability
- With the binary transfer function, the decision condition w1 x1 + w2 x2 = theta is, for a constant threshold value theta, a straight line in the plane.
- Consider the XOR problem: separating A0/A1 from B0/B1 can't be done with a straight line.
- The boundary type depends on the number of inputs: n = 2 gives straight lines, n = 3 planes, n > 3 hyperplanes.

Hidden Layers
- The solution shown before: add a hidden layer!
- Mathematically: from flat hyperplanes to convex polygons (connected e.g. with AND-type functions).
- Build a two-layered perceptron. (Detail: assume w3,6 = w4,6 = w5,6 = 1/3 and theta6 = 0.9.)
- Still limited to convex and connected areas.

Three-layered Perceptron
- Add another hidden layer: a subtractive polygon overlay becomes possible (e.g. with XOR).
- No further increase by adding more layers.

Multi Layer Perceptron
- The standard network for HEP.
- Network topology: feedforward network with four layers: input, output and two hidden.
- Node properties: the transfer function is usually sigmoidal; bias nodes* are usually present.
- Learning algorithm: backpropagation, usually with online learning (apply the experience after every event).
- *Bias node: use a static threshold of 0 and add the bias node as a substitute threshold for all nodes (except in the output layer); the learning algorithm may modify its value, thus creating a variable threshold.

Backpropagation
Backpropagation is the standard method for supervised learning. It is an iterative process:
- Forward pass: compute the network answer for a given input vector by successively computing the activation and output of each node.
- Compute the error with the error function: measure the current network performance by comparing to the right answer. (This is also used to determine the generalization power of the net.)
- Backward pass: move backward from the output to the input layer, modifying the weights according to the learning rule. (Different choices are possible for applying the weight changes.)
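
One forward/backward iteration can be sketched for a tiny 2-2-1 network; the XOR truth table as training data, the learning rate and the omission of bias nodes are simplifying assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# XOR truth table as training data (targets t).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 2))   # input -> hidden weights
W2 = rng.normal(size=(2, 1))   # hidden -> output weights
eta = 0.1                      # learning rate

def forward(X):
    h = sigmoid(X @ W1)        # hidden-layer outputs
    y = sigmoid(h @ W2)        # network answer
    return h, y

def error(y):                  # mean squared error
    return float(((y - t) ** 2).mean())

# Forward pass and error before the update.
h, y = forward(X)
err_before = error(y)

# Backward pass: propagate the error derivative layer by layer
# (chain rule through the sigmoid: y * (1 - y)).
delta_out = (y - t) * y * (1 - y)             # output-layer deltas
delta_hid = (delta_out @ W2.T) * h * (1 - h)  # hidden-layer deltas
W2 -= eta * h.T @ delta_out
W1 -= eta * X.T @ delta_hid

_, y = forward(X)
print(error(y) < err_before)   # the small gradient step should reduce the error
```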

Error and Loss function
- Fundamental concept for training: a metric for the difference between the right answer and the method's result, the error or loss function.
- Properties: the error exists (the series converges) and is continuous and differentiable in the weights.
- Mean Absolute Error: "classification error".
- Mean Squared Error: "regression error".
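
Both error functions are one-liners; the toy network outputs and targets below are invented:

```python
import numpy as np

def mean_absolute_error(y, t):
    """'Classification error': mean |y - t|."""
    return float(np.abs(y - t).mean())

def mean_squared_error(y, t):
    """'Regression error': mean (y - t)^2."""
    return float(((y - t) ** 2).mean())

t = np.array([0.0, 1.0, 1.0, 0.0])   # right answers
y = np.array([0.1, 0.8, 0.9, 0.3])   # method results

print(round(mean_absolute_error(y, t), 4))   # 0.175
print(round(mean_squared_error(y, t), 4))    # 0.0375
```

Note how the squaring in the MSE punishes the largest deviation (0.3) more strongly than the others.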

Error space
- Define W as the vector space of all weights and search for the minimal error.
- It is impossible to determine the full error surface; the error function defines a surface in W.
- The negative gradient points to the next valley and is determined by the chain rule.
- The output function plays a crucial role.

Bias versus Variance
(Figure: loss function versus model complexity for the training and test samples, ranging from low bias/high variance to high bias/low variance; the overtraining region; S/B decision boundaries in the (x1, x2) plane.)
- A common topic; here (for ANNs) it is the choice of the network topology.
- The training error always decreases.
- The common problem is to choose the right complexity:
  - Avoid learning special features of the training sample.
  - Find the best generalization.

Short Summary
- If you have a set of uncorrelated variables, use the projective Likelihood.
- If you want to apply pre-processing, be careful:
  - Only Gaussian-distributed, linearly correlated data can be (properly) decorrelated.
  - Gaussianisation is hard to achieve.
  - In case of correlated data, also consider other methods.
- The Fisher discriminant is simple and powerful, but limited.
- Neural networks are very powerful, but come with many options and pitfalls:
  - Choose the right network complexity, nothing in addition.
  - Validate the results.
- Mantra: use your brain; inspect and understand the data.