MVA Methods: Naïve Bayesian Classifier, Data Preprocessing, Linear Discriminator, Artificial Neural Networks
Christoph Rosemann and Helge Voss, DESY FLC
Terascale Statistics School, Mainz, 6 April 2011
Naïve Bayesian Classifier, aka “Projective Likelihood”
Kernel methods and nearest-neighbour estimators model the joint pdf in the full D-dimensional space.
If the correlations are weak or non-existent, the problem can be factorized:
p(x) = prod_{k=1..D} p_k(x_k)
This introduces another problem: how to describe the one-dimensional pdfs?
- Histogramming (event counting): + automatic, - less than optimal
- Parametric fitting: - difficult to automate
- Non-parametric fitting (e.g. splines): + automatable, - possible artefacts, information loss
(Figure: example distributions generated from a Gaussian)
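As an illustration of the factorization above, here is a minimal Python/numpy sketch (not part of the original slides; all function names are my own) that estimates each one-dimensional pdf by histogramming the training sample and evaluates the product for a candidate:

```python
import numpy as np

def fit_factorized_pdf(sample, n_bins=40, ranges=None):
    """Estimate a factorized pdf from a (n_events, n_vars) training sample
    by histogramming each variable separately (event counting)."""
    n_vars = sample.shape[1]
    pdfs = []
    for k in range(n_vars):
        rng = None if ranges is None else ranges[k]
        counts, edges = np.histogram(sample[:, k], bins=n_bins,
                                     range=rng, density=True)
        pdfs.append((counts, edges))
    return pdfs

def eval_factorized_pdf(pdfs, x):
    """Evaluate p(x) = prod_k p_k(x_k) for a single candidate x."""
    p = 1.0
    for (counts, edges), xk in zip(pdfs, x):
        # find the histogram bin containing x_k (clamped to the valid range)
        idx = np.clip(np.searchsorted(edges, xk) - 1, 0, len(counts) - 1)
        p *= counts[idx]
    return p
```

A spline or parametric fit would replace the histogram lookup with a smooth function, trading automation against possible artefacts as listed above.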
Likelihood functions and ratios
The individual p(x) is usually called the likelihood function.
It is class dependent: signal or background(s).
For each variable and each class a pdf description is needed.
The classifier function is the likelihood ratio (per class), e.g. for the signal class:
y_S(x) = L_S(x) / ( L_S(x) + L_B(x) ),  with  L_C(x) = prod_k p_{C,k}(x_k)
Likelihood Example
Example: electron identification with two classes (electron, pion) and two variables (track momentum to calorimeter energy ratio, cluster shape).
1. Take a candidate and evaluate each variable in each class.
2. Determine the likelihood ratio, e.g. for the electron:
y_e(x) = L_e(x) / ( L_e(x) + L_pi(x) )
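Continuing the sketch from the previous code block (again illustrative only; names such as `electrons` and `pions` are hypothetical), the likelihood-ratio step for the electron hypothesis could look like this:

```python
def likelihood_ratio(pdfs_signal, pdfs_background, x):
    """y(x) = L_S(x) / (L_S(x) + L_B(x)); here signal = electron, background = pion."""
    l_s = eval_factorized_pdf(pdfs_signal, x)
    l_b = eval_factorized_pdf(pdfs_background, x)
    denom = l_s + l_b
    return l_s / denom if denom > 0 else 0.5

# Usage, assuming training arrays with columns [E/p, cluster shape]:
# e_pdfs  = fit_factorized_pdf(electrons)
# pi_pdfs = fit_factorized_pdf(pions)
# y_e = likelihood_ratio(e_pdfs, pi_pdfs, candidate)
```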
Projective Likelihood
- One of the most popular MVA methods in HEP.
- A well-performing, fast, robust and simple classifier.
- If the prior probabilities are known, an estimate of the class probability can be made.
- Big improvement with respect to a simple cut approach: background-likeness in one variable can be overcome.
Problematic if (significant, >10%) correlations exist:
- the classification performance is usually decreased
- the factorization approach doesn't hold
- the probability estimation is lost
(Figure: can you tell the correlation?)
Linear Correlations
Reminder: (standard TMVA example with linearly correlated variables)
Solution: apply a linear transformation to the input variables so that the transformed variables are uncorrelated (diagonal covariance matrix).
Two main methods are in use.
SQRT Decorrelation
Determine the square root C' of the covariance matrix C, i.e. C = C' C'.
Compute C' by diagonalising C:
D = S^T C S  =>  C' = S sqrt(D) S^T
Transformation prescription for the input variables:
x -> (C')^(-1) x
(Figure: standard TMVA example before decorrelation)
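A minimal numpy sketch of this prescription (illustrative only, not TMVA's implementation; the function name is my own):

```python
import numpy as np

def sqrt_decorrelate(sample):
    """Decorrelate variables with the inverse square root of their covariance matrix.
    sample: (n_events, n_vars). Returns the transformed sample and C'^{-1}."""
    cov = np.cov(sample, rowvar=False)                       # C
    eigval, eigvec = np.linalg.eigh(cov)                     # C = S D S^T
    c_sqrt = eigvec @ np.diag(np.sqrt(eigval)) @ eigvec.T    # C' = S sqrt(D) S^T
    c_sqrt_inv = np.linalg.inv(c_sqrt)
    return sample @ c_sqrt_inv.T, c_sqrt_inv                 # x -> C'^{-1} x per event
```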
SQRT Decorrelation (continued)
Same transformation as on the previous slide.
(Figure: the standard TMVA example after decorrelation)
Principal Component Analysis
Eigenvalue problem of the correlation matrix: C v_k = lambda_k v_k.
The largest eigenvalue corresponds to the largest correlation; the corresponding eigenvector points along the axis with the largest variance.
PCA is typically used to:
- reduce the dimensionality of a problem
- find the most dominant features
Transformation rule: use the eigenvectors as the new basis (k components) and express the variables in terms of the new basis (for each class).
The matrix of eigenvectors V obeys the relation V^T C V = D (diagonal), so PCA eliminates linear correlations!
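The same idea as a short numpy sketch (illustrative only; the function name is my own): diagonalise the covariance matrix, order the eigenvectors by decreasing eigenvalue, and project the data onto the leading components.

```python
import numpy as np

def pca_transform(sample, n_components=None):
    """Project a (n_events, n_vars) sample onto the eigenvectors of its covariance
    matrix, ordered by decreasing eigenvalue (variance along that axis)."""
    centered = sample - sample.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]              # largest variance first
    basis = eigvec[:, order][:, :n_components]    # keep the leading k components
    return centered @ basis, eigval[order]
```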
How is the Transformation applied?
Usually the correlations are different in signal and background; which one is applied?
Two cases (in general):
- explicit difference in the pdfs (e.g. projective likelihood)
- no differentiation in the pdfs
Use either the signal or the background decorrelation, or a mixture of both.
Classifier Decorrelation Example
Example: linearly correlated Gaussians; here the decorrelation works to 100%.
A 1-D likelihood on the decorrelated sample gives the best possible performance.
Compare also the effect on the MVA output variable!
(Figures: correlated variables; the same variables with decorrelation; note the different scale on the y-axis)
Decorrelation Caveats
What if the correlations are different for signal and background?
(Figure: original correlations and after SQRT decorrelation, for signal and background)
Decorrelation Caveats
What happens if the correlations are non-linear?
(Figure: original correlations and after SQRT decorrelation)
Use decorrelation with care!
Gaussianisation
Decorrelation works best if the variables are Gauss-distributed.
So perform another transformation first: “Gaussianisation”, a two-step procedure:
1. Rarity transformation (create a uniform distribution via the cumulative distribution of each variable)
2. Make it Gaussian via the inverse error function:
x_k -> sqrt(2) erf^(-1)( 2 F_k(x_k) - 1 )
Then decorrelate the Gauss-shaped variable distributions.
Optional: do several iterations of Gaussianisation and decorrelation.
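A hedged sketch of the two-step procedure in Python, using an empirical CDF for the rarity transform and `scipy.special.erfinv` for the second step (function and argument names are my own, not TMVA's):

```python
import numpy as np
from scipy.special import erfinv

def gaussianise(sample, reference=None):
    """Two-step Gaussianisation of each column of `sample`:
    1. rarity transform: map x_k to its empirical CDF value in (0, 1)
    2. inverse error function: turn the uniform distribution into a Gaussian."""
    if reference is None:
        reference = sample                      # use the sample itself as reference
    out = np.empty_like(sample, dtype=float)
    n = len(reference)
    for k in range(sample.shape[1]):
        ref_sorted = np.sort(reference[:, k])
        # empirical CDF, kept strictly inside (0, 1) to avoid erfinv(+-1) = inf
        u = (np.searchsorted(ref_sorted, sample[:, k], side='right') + 0.5) / (n + 1)
        out[:, k] = np.sqrt(2.0) * erfinv(2.0 * u - 1.0)
    return out
```

The `reference` argument reflects the point of the following slides: the transform is built from one class (e.g. signal), so the other class will not come out Gaussian at the same time.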
Example of Gaussianisation
Original distributions
Example of Gaussianisation
Signal gaussianised
Example of Gaussianisation
Background gaussianised
Note: no simultaneous Gaussianisation of signal and background is possible.
Modeling Decision Boundaries
Nearest neighbour, kernel estimators and the likelihood method estimate the underlying pdfs.
Now move on to determining the decision boundary directly instead.
This requires a model:
- a specific parametrization
- a specific determination process
Most MVA methods are distinguished by this model. In general the parametrization can be expressed as
y(x) = sum_i w_i h_i(x)
Next, specific examples: the linear discriminator and neural networks.
Linear/Fisher Discriminant
Linear model in the input variables:
y(x) = w_0 + sum_k w_k x_k
This results in linear decision boundaries.
Note: the lines are not the functions themselves! They are given by y(x) = const.
How are the weights determined? By maximal class separation:
- maximize the distance between the class mean values
- minimize the variance within each class
(Figure: classes H0 and H1 in the (x1, x2) plane, with y(x) and the projected distributions yS(x), yB(x))
Linear Discriminant
Maximise the “between-class variance”, minimise the “within-class variance”.
Decompose the covariance matrix C into a within-class part W and a between-class part B: C = W + B.
Determining the maximum separation yields the Fisher coefficients F_k, up to a normalisation:
F_k ∝ sum_l (W^(-1))_{kl} ( mean_{S,l} - mean_{B,l} )
All quantities are known from the training data.
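An illustrative numpy sketch of these coefficients (conventions for the overall normalisation and the exact definition of W differ between implementations; this is only the proportional form given above, with names of my own choosing):

```python
import numpy as np

def fisher_coefficients(signal, background):
    """Fisher coefficients F ~ W^{-1} (mean_S - mean_B), up to normalisation.
    signal and background are (n_events, n_vars) training arrays."""
    mean_s = signal.mean(axis=0)
    mean_b = background.mean(axis=0)
    # within-class matrix W: sum of the two class covariance matrices
    w = np.cov(signal, rowvar=False) + np.cov(background, rowvar=False)
    return np.linalg.solve(w, mean_s - mean_b)

def fisher_response(x, coeffs, offset=0.0):
    """Linear discriminant value y(x) = w0 + F . x."""
    return offset + np.dot(x, coeffs)
```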
Nonlinear Correlations in LD
Assume the following non-linearly correlated data:
The linear discriminant doesn't give a good result.
Nonlinear Correlations in LD
Assume the following non-linearly correlated data: the linear discriminant doesn't give a good result.
Use a non-linear decorrelation instead (here: polar coordinates).
The decorrelated data can be separated well.
Note: linear discrimination doesn't separate data with identical means.
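A small sketch of the non-linear decorrelation used here, i.e. replacing (x1, x2) by polar coordinates (the function name is my own):

```python
import numpy as np

def to_polar(sample):
    """Non-linear 'decorrelation' for circularly correlated data:
    replace (x1, x2) by (r, phi). Classes that differ only in r can then
    be separated by a linear (even one-dimensional) cut."""
    x1, x2 = sample[:, 0], sample[:, 1]
    r = np.hypot(x1, x2)
    phi = np.arctan2(x2, x1)
    return np.column_stack([r, phi])
```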
Artificial Neural Networks
ANNs (short: neural networks) are directed, weighted graphs.
Evolution in steps: retrace some of the ideas.
Statistical concept: to extend the decision boundaries to arbitrary functions, y(x) and in turn the h_i(x) need to be non-linear.
(Extension of the) Weierstrass theorem: every continuous function defined on an interval [a,b] can be uniformly approximated as closely as desired by a polynomial function.
Neural networks choose a particular set of functions.
Biggest breakthrough: adaptability / (machine) learning.
Nodes and Connections
Neural networks consist of nodes and connections.
The activation function of a node depends on its input; the output function depends on the activation. Usual choices: sigmoid, tanh, binary.
Build a network of nodes in layers:
- the input layer has no preceding nodes
- the output layer has no following nodes
- in between are the hidden layers
(Figures: sigmoid and binary output functions; network sketch with input layer, hidden layer, output layer and nodes i, j)
Example: AND
Theorem: any Boolean function can be represented by (a certain class of) neural network(s).
Consider constructing the logical function AND (similarly for OR).
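A toy sketch of a single node with a binary output function, wired up to give AND and OR (the weights and thresholds shown are one possible choice, not the only one):

```python
import numpy as np

def binary_node(inputs, weights, threshold):
    """Single node with a binary output: fires if the weighted input sum
    reaches the threshold."""
    return int(np.dot(inputs, weights) >= threshold)

def logic_and(a, b):
    # both weights 1, threshold 2 -> output 1 only for input (1, 1)
    return binary_node([a, b], weights=[1.0, 1.0], threshold=2.0)

def logic_or(a, b):
    # same structure, threshold 1 -> output 1 if at least one input is 1
    return binary_node([a, b], weights=[1.0, 1.0], threshold=1.0)

# for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
#     print(a, b, logic_and(a, b), logic_or(a, b))
```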
Example: XOR
Try to build XOR with the same network: another layer in between is needed.
This increases the descriptive power (keep this in mind!).
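Reusing the toy `binary_node`, `logic_and` and `logic_or` from the previous sketch, one possible hidden-layer construction of XOR is:

```python
def logic_xor(a, b):
    """XOR needs an extra (hidden) layer of binary nodes:
    XOR(a, b) = OR(a, b) AND NOT AND(a, b)."""
    h1 = logic_or(a, b)      # hidden node 1
    h2 = logic_and(a, b)     # hidden node 2
    # output node: fires if h1 is active and h2 is not (weights +1/-1, threshold 1)
    return binary_node([h1, h2], weights=[1.0, -1.0], threshold=1.0)
```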
ANN Adaptability: Hebb's Rule
In words: if a node receives an input from another node, and if both are highly active (have the same sign), the connection weight between the nodes should be strengthened.
Mathematically:
delta w_ij = eta * o_i * o_j
(For realistic learning this has to be modified; e.g. otherwise the weights can grow without limits.)
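As a one-line sketch of the unmodified rule (the learning rate `eta` is a hypothetical parameter of this sketch):

```python
def hebb_update(weight, out_i, out_j, eta=0.1):
    """Hebb's rule: strengthen the connection i -> j in proportion to the
    product of the two node outputs. Unmodified, the weights can grow
    without limits, as noted on the slide."""
    return weight + eta * out_i * out_j
```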
The Ancestor: the Perceptron
- Two layers: input and output
- One output node
- Adaptive, modifiable weights
- Binary transfer function
Perceptron convergence theorem: the learning algorithm converges in finite time; the perceptron can learn everything it can represent in a finite amount of time.
Now take a closer look at this representability.
Linear Separability
The output node fires when sum_i w_i x_i >= theta; with two inputs and a constant threshold value theta this is equivalent to w_1 x_1 + w_2 x_2 = theta: a straight line in the plane.
Consider the XOR problem: separate A0/A1 from B0/B1 -- this can't be done with a straight line.
The type of boundary depends on the number of inputs:
n = 2: straight lines, n = 3: planes, n > 3: hyperplanes
Hidden Layers
Solution shown before: add a hidden layer!
Mathematically: from flat hyperplanes to convex polygons (connect the half-planes e.g. with AND-type functions).
Build a two-layered perceptron.
(Figure detail: assume w_{3,6} = w_{4,6} = w_{5,6} = 1/3 and theta_6 = 0.9; outputs o1, o2)
Still limited to convex and connected areas.
Three-Layered Perceptron
Add another hidden layer: a subtractive overlay of polygons becomes possible (e.g. with XOR).
There is no further increase by adding more layers.
Multi-Layer Perceptron
The standard network for HEP.
Network topology: feedforward network with four layers: input, output, two hidden.
Node properties: the transfer function is usually sigmoidal; bias nodes* are usually present.
Learning algorithm: backpropagation, usually with online learning (apply the experience after every event).
*Bias node: use a static threshold of 0 and instead add a bias node as a substitute threshold for all nodes (except in the output layer); the learning algorithm is allowed to modify its weights, thus creating a variable threshold.
Backpropagation
Backpropagation is the standard method for supervised learning. It is an iterative process:
1. Forward pass: compute the network answer for a given input vector by successively computing the activation and output of each node.
2. Compute the error with the error function: measure the current network performance by comparing to the right answer. This is also used to determine the generalization power of the network.
3. Backward pass: move backward from the output to the input layer, modifying the weights according to the learning rule. (Different choices are possible for applying the weight changes.)
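For concreteness, a compact and purely illustrative Python sketch of online backpropagation for a one-hidden-layer network with sigmoid nodes, bias nodes and a squared-error loss (this is not TMVA's MLP; all names are my own):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class TinyMLP:
    """Minimal one-hidden-layer network trained with online backpropagation
    on the squared error. Purely illustrative."""

    def __init__(self, n_in, n_hidden, eta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.5, size=(n_hidden, n_in + 1))  # +1: bias node
        self.w2 = rng.normal(0.0, 0.5, size=(n_hidden + 1,))       # +1: bias node
        self.eta = eta

    def forward(self, x):
        """Forward pass: compute activation and output for each node."""
        x_b = np.append(x, 1.0)                    # append bias input
        self.hidden = sigmoid(self.w1 @ x_b)       # hidden-layer outputs
        h_b = np.append(self.hidden, 1.0)
        self.output = sigmoid(self.w2 @ h_b)
        self.x_b, self.h_b = x_b, h_b
        return self.output

    def backward(self, target):
        """Backward pass: propagate the error derivative (chain rule) and
        update the weights after this single event (online learning)."""
        y = self.output
        delta_out = (y - target) * y * (1.0 - y)                       # output node
        delta_hid = delta_out * self.w2[:-1] * self.hidden * (1.0 - self.hidden)
        self.w2 -= self.eta * delta_out * self.h_b
        self.w1 -= self.eta * np.outer(delta_hid, self.x_b)

    def train_online(self, X, targets, n_epochs=100):
        for _ in range(n_epochs):
            for x, t in zip(X, targets):
                self.forward(x)      # forward pass
                self.backward(t)     # backward pass
```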
Error and Loss Function
Fundamental concept for training: a metric for the difference between the right answer and the method's result, the error or loss function E.
Properties: E exists (the series converges) and is continuous and differentiable in the weights.
- Mean absolute error, “classification error”: E = 1/N sum_i | y_i - y_hat_i |
- Mean squared error, “regression error”: E = 1/N sum_i ( y_i - y_hat_i )^2
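The two loss functions named on the slide, as short numpy helpers (illustrative; the names are my own):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """'Classification error' on the slide: average absolute deviation."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mean_squared_error(y_true, y_pred):
    """'Regression error' on the slide: average squared deviation."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
```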
Error Space
Define W as the vector space of all weights, and search for the minimal error.
It is impossible to determine the full error surface; E(w) defines a surface in W.
The negative gradient points towards the next valley; it is determined via the chain rule.
The output function plays a crucial role here.
Bias versus Variance
(Figure: loss function on the training and test samples versus model complexity; low bias / high variance vs. high bias / low variance; the overtraining region. Sketches of decision boundaries between S and B in the (x1, x2) plane.)
A common topic; here (for ANNs) it is the choice of the network topology.
The training error always decreases. The common problem is to choose the right complexity:
- avoid learning special features of the training sample
- find the best generalization
Short Summary
- If you have a set of uncorrelated variables, use the projective likelihood.
- If you want to apply pre-processing, be careful: only Gaussian-distributed, linearly correlated data can be (properly) decorrelated, and Gaussianisation is hard to achieve. In case of correlated data, also consider other methods.
- The Fisher discriminant is simple and powerful, but limited.
- Neural networks are very powerful, but come with many options and pitfalls: choose the right network complexity, nothing in addition, and validate the results.
Mantra: use your brain, inspect and understand the data.