
1 Machine Learning
Bill Murray, RAL, CCLRC
w.murray@rl.ac.uk
home.cern.ch/~murray
IoP Half Day Meeting on Statistics in High Energy Physics, University of Manchester, 16 November 2005
Topics: Likelihood, Neural nets, Support vector machines, Optimal observables, Boosted trees
www.physics.ox.ac.uk/phystat05

2 Overview
● The state of knowledge in HEP is pretty good
● But we are bad at importing ideas from other disciplines
● I present some techniques, some familiar and some not, to encourage follow-up

3 What do we want to achieve?
● We aim to make optimal use of the data collected
  – Data are expensive: use powerful techniques
  – Data processing is also expensive: power is not the only criterion
  – Systematic errors may well dominate
● We need to be able to justify our results.

4 The right answer
● Consider separating a dataset into 2 classes
  – Call them background and signal
● A simple cut is not optimal
[Figure: signal and background populations in an n-dimensional space of observables, e.g. missing E_T, number of leptons]

5 The right answer II
● What is optimal?
● Maybe something like this might be...
[Figure: signal and background regions separated by a non-linear boundary]

6 The right answer III
● For a given efficiency, we want to minimize the background
● Sort the space by the s/b ratio in a small box around each point
● Accept all areas with s/b above some threshold
● This local s/b is also known as the Likelihood Ratio

7 The right answer
● For optimal separation, order events by L_s / L_b
  – Identical to ordering by L_{s+b} / L_b
● More powerful still is to fit in the 1-D space defined by this ratio. This is the right answer.
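A minimal sketch (not from the talk) of the idea: estimate the signal and background densities from training samples, order events by the estimated L_s / L_b, and keep those above a threshold. The toy Gaussian samples and the KDE density estimates are my assumptions, standing in for real simulation.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sig_train = rng.normal(loc=[1.0, 1.0], scale=0.5, size=(5000, 2))
bkg_train = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(5000, 2))

# Estimate the densities from the training samples (here with a Gaussian KDE).
p_s = gaussian_kde(sig_train.T)
p_b = gaussian_kde(bkg_train.T)

def likelihood_ratio(events):
    """Estimated L_s / L_b for each event (rows of `events`)."""
    return p_s(events.T) / p_b(events.T)

# Accept events whose local s/b exceeds a chosen threshold.
data = rng.normal(loc=[0.5, 0.5], scale=0.8, size=(1000, 2))
ratio = likelihood_ratio(data)
selected = data[ratio > 1.0]      # the threshold sets the efficiency/purity trade-off
print(f"kept {len(selected)} of {len(data)} events")
```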

8 Determination of s, b densities
● We may know the matrix elements
  – But not for e.g. a b-tag
  – And in any case there are detector effects
● Usually the densities are taken from simulation

9 Using MC to calculate the density
● Brute force:
  – Divide our n-D space into hypercubes, with m divisions of each axis
  – That gives m^n elements; we need ~100 m^n events for a 10% estimate in each
  – e.g. 1,000,000,000 events for 7 dimensions with 10 bins in each
● This assumed a uniform density – in practice we need far more
  – The purpose was to separate different distributions, so the density is anything but uniform
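A minimal sketch (not from the talk) of the arithmetic behind the brute-force estimate quoted above; the "100 events per bin for ~10% precision" rule of thumb is taken from the slide.

```python
def brute_force_cost(n_dims, m_bins, events_per_bin=100):
    """MC events needed for ~10% relative precision (≈100 entries) in every bin."""
    n_cells = m_bins ** n_dims
    return n_cells, events_per_bin * n_cells

for n in (3, 5, 7):
    cells, events = brute_force_cost(n, 10)
    print(f"{n} dimensions, 10 bins each: {cells:,} cells, ~{events:,} MC events")
# 7 dimensions, 10 bins each: 10,000,000 cells, ~1,000,000,000 MC events
```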

10 Better likelihood estimation
● Clever binning
  – Starts to lead to tree techniques
● Kernel density estimators
  – Size of kernel grows with dimensions
  – Edges are an issue
● Ignore correlations between variables (a sketch follows below)
  – Very commonly done: 'I used likelihood' often means this
● Pretend measured = true, correct later
  – Linked to optimal-observable techniques and bias correction
  – (See Zech, later)
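A minimal sketch (not from the talk) of the "ignore correlations" option: approximate the n-D density by a product of 1-D densities, one per variable. The toy samples and the use of 1-D KDEs are my assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

def product_likelihood(train, events):
    """Approximate the density of `events` as a product of per-variable 1-D KDEs
    built from `train` (both arrays of shape (n_events, n_vars))."""
    kdes = [gaussian_kde(train[:, j]) for j in range(train.shape[1])]
    like = np.ones(len(events))
    for j, kde in enumerate(kdes):
        like *= kde(events[:, j])        # correlations between variables are ignored
    return like

rng = np.random.default_rng(1)
sig = rng.normal(size=(2000, 3)) + 1.0
bkg = rng.normal(size=(2000, 3))
data = rng.normal(size=(500, 3))
ratio = product_likelihood(sig, data) / product_likelihood(bkg, data)
```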

11 Alternative approaches
● Cloud of cuts
  – Use a random cut generator to optimise cuts
● Neural nets
  – Well known, good for high dimensions
● Support vector machines
  – Computationally easier than kernel likelihoods
● Decision trees
  – Boosted or not?

12 Digression: Cosmology
● Astronomers have a few hundred TB now
  – 1 pixel (byte) per sq arc second ~ 4 TB
  – Multi-spectral, temporal, … → 1 PB
● They mine it looking for:
  – New (kinds of) objects, or more of interesting ones (quasars)
  – Density variations in 400-D space
  – Correlations in 400-D space
● Data doubles every year
● Data is public after 1 year
● Same access for everyone
● But: how long can this continue?
(Alex Szalay)

13 Next-Generation Data Analysis
● Looking for
  – Needles in haystacks – the Higgs particle
  – Haystacks: dark matter, dark energy
● Needles are easier than haystacks
● 'Optimal' statistics have poor scaling
  – Correlation functions are N^2, likelihood techniques N^3
  – For large data sets the main errors are not statistical
● As data and computers grow with Moore's Law, we can only keep up with N log N
● A way out?
  – Discard the notion of optimal (data is fuzzy, answers are approximate)
  – Don't assume infinite computational resources or memory
● Requires a combination of statistics & computer science

14 Organization & Algorithms
● Use of clever data structures (trees, cubes):
  – Up-front creation cost, but only N log N access cost
  – Large speedup during the analysis
  – Tree codes for correlations (A. Moore et al. 2001)
  – Data cubes for OLAP (all vendors)
● Fast, approximate heuristic algorithms
  – No need to be more accurate than cosmic variance
  – Fast CMB analysis by Szapudi et al. (2001): N log N instead of N^3 => 1 day instead of 10 million years
● Take the cost of computation into account
  – Controlled level of accuracy
  – Best result in a given time, given our computing resources
(A tree-based neighbour count is sketched below.)
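A minimal sketch (not from the talk) of how a tree structure avoids the naive N^2 pair count, which is the idea behind the tree codes mentioned above; the k-d tree and the toy catalogue are my assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
points = rng.uniform(size=(100_000, 3))    # toy 3-D "catalogue"

tree = cKDTree(points)                     # roughly N log N to build
r = 0.01
# Count pairs closer than r without ever looping over all N^2 pairs.
n_pairs = tree.count_neighbors(tree, r) - len(points)   # remove self-pairs
n_pairs //= 2                              # each pair was counted twice
print(f"pairs within r={r}: {n_pairs}")
```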

15 Terabytes for Cosmology
● Sloan DSS costs were about 1/3 software; future projects are estimating 50%.
● Problem of scaling: data sets grow with Moore's law, but analyses can grow with n^2
  – Clever code: tree structures scale as n log n
  – Abandon optimal algorithms.
● The cost of data analysis will have to be directly included in setting the statistical precision.
  – Rene Brun: Moore's law is broken – but grid networks may continue to accelerate.

16 How to calculate the likelihood

17 Kernel Likelihoods
● Directly estimate the PDF of the distributions from the training sample events.
● Some kernel, usually Gaussian, smears the sample – and increases the widths
● Fully optimal iff the MC statistics are infinite
● The metric of the kernel (size, aspect ratio) is hard to optimize
  – Watch the dependence of the kernel size on the statistics
● Kernel size must grow with the number of dimensions
  – Lose precision if unnecessary dimensions are added
● Big storage/computational requirements
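A minimal sketch (not from the talk) of a Gaussian kernel likelihood, in which the bandwidth plays the role of the "kernel size" discussed above and has to be tuned against the available statistics; the toy 4-D sample and the bandwidth values are my assumptions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(3)
train = rng.normal(size=(5000, 4))                # 4-D training sample

for bandwidth in (0.1, 0.3, 1.0):                 # the kernel size must be tuned
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(train)
    # score_samples returns the log density; evaluate at the origin as an example.
    log_p = kde.score_samples(np.zeros((1, 4)))[0]
    print(f"bandwidth={bandwidth}: estimated density at origin = {np.exp(log_p):.4f}")
```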

18 Zech: Reduction of Variables
● Multi-dimensional problem where the theory is well known
● He aims to reduce the dimensionality of the problem
● In the case of linear dependence on the parameter the dimensionality can easily be reduced
  – If we expand about a point, all is linear...
  – Thus reduce to as many dimensions as there are unknowns
  – Method of Optimal Observables? [WJM]
● Or do the fit on the bare likelihood and correct the bias with MC

19 Zech: Taming m^n
Simple case: 2 random variables, 1 linear parameter.
Define new variables: [formulas lost in transcription]
We get: [formula lost in transcription]
The only relevant variable is u.
(The analytic expression of g(u,v|θ) is not required!)
The generalization to more than 2 variables is trivial.
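The slide's formulas were images and did not survive transcription; the LaTeX below is a hedged reconstruction of the standard construction for a density that is linear in the parameter, in my own notation (it may differ from Zech's).

```latex
% Hedged reconstruction, not taken from the slide.
% Density linear in the parameter \theta:
f(x,y \mid \theta) = f_0(x,y) + \theta\, f_1(x,y).
% Define new variables
u = \frac{f_1(x,y)}{f_0(x,y)}, \qquad v = y.
% Then
f(x,y \mid \theta) = f_0(x,y)\,\bigl(1 + \theta\, u\bigr),
% so \theta enters the event likelihood only through u: u is the only
% relevant variable, and the analytic expression of g(u,v \mid \theta)
% is never required.
```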

20 Zech: Taming m^n
Method 2: Use an approximately sufficient statistic or likelihood estimate.
● No large resolution or acceptance effects: perform the fit with uncorrected data and the undistorted likelihood function.
● Acceptance losses but small distortions: compute the global acceptance by MC and include it in the likelihood function.
● Strong resolution effects: perform a crude unfolding.
All approximations are corrected by the Monte Carlo simulation. The loss in precision introduced by the approximations is usually completely negligible.

21 Neural Nets
Well known in HEP

22 Multi-Layer Perceptron NN
● A popular and powerful neural network model
[Figure: network diagram with input layer i, hidden layer j and output layer k, connected by weights w_ji and w_kj]
● Need to find the weights and thresholds, the free parameters of the model
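A minimal sketch (not from the talk) of fitting the free parameters of a small multi-layer perceptron on a toy signal/background sample; the single hidden layer of 10 nodes and the toy data are my assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
sig = rng.normal(loc=1.0, size=(2000, 5))
bkg = rng.normal(loc=0.0, size=(2000, 5))
X = np.vstack([sig, bkg])
y = np.concatenate([np.ones(len(sig)), np.zeros(len(bkg))])

# One hidden layer of 10 nodes; training adjusts all weights and biases.
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500)
net.fit(X, y)
score = net.predict_proba(X)[:, 1]   # the network output approximates P(signal | x)
```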

23 Neural Network Features
● Easy to use – packages exist everywhere
● Performance is good
  – Especially at handling higher dimensionality
  – No need to define a metric
  – But not optimal (we are aiming to approximate the likelihood)
  – And only trained at one point
● Training is an issue
  – Optimization of nodes/layers can be difficult
  – Can over-focus on fluctuations (overtraining)
  – This is a problem for all machine learning
● Often worth a try

24 Support Vector Machines
● Simplify the storage/computation of the separation by storing only the 'support vectors'
● The separating straight line is defined by the closest points – the support vectors
(Vapnik 1996)

25 Straight lines???
● Straight lines are not adequate in general. The trick is to project from the observed space into a higher (even infinite) dimensional space, such that a simple hyperplane defines the separating surface
● The projection is done implicitly by the choice of kernel
● Only inner products are ever evaluated, and these are metric independent, so they can be calculated in the normal space
● We never need to explicitly define the higher-dimensional space

26 Support Vector Machines
● Storing just those points lying closest to the line is much easier than storing the entire space
● The defect is that the cut is only well defined near the line
● Computationally much easier than a kernel likelihood (a sketch follows below)
● Gaussian/NN kernels are both popular
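A minimal sketch (not from the talk) of a support vector machine with a Gaussian (RBF) kernel: only the support vectors are retained to define the separation. The toy data and the C and gamma settings are my assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
sig = rng.normal(loc=1.0, size=(1000, 2))
bkg = rng.normal(loc=0.0, size=(1000, 2))
X = np.vstack([sig, bkg])
y = np.concatenate([np.ones(len(sig)), np.zeros(len(bkg))])

svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X, y)
print(f"{len(svm.support_vectors_)} support vectors kept out of {len(X)} training points")
decision = svm.decision_function(X)   # signed distance from the separating surface
```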

27 Decision Trees
A standard decision tree divides the problem up in a series of steps.
The signal/background content is evaluated in each box.

28 Decision Trees II
Very clear: each decision is binary, and the whole classifier can be drawn as a tree.
[Figure: example tree with cuts y > 0.3, x > 0.6 and x > 0.7 at successive nodes]
This display works for any dimensionality of problem.
But how much do we value clarity?
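A minimal sketch (not from the talk) of a single decision tree on toy data, where each internal node applies one binary cut of the kind shown in the figure; the toy sample and the depth limit are my assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(6)
sig = rng.normal(loc=0.7, scale=0.2, size=(1000, 2))
bkg = rng.uniform(size=(1000, 2))
X = np.vstack([sig, bkg])
y = np.concatenate([np.ones(len(sig)), np.zeros(len(bkg))])

tree = DecisionTreeClassifier(max_depth=3)   # the stopping rule matters
tree.fit(X, y)
print(export_text(tree, feature_names=["x", "y"]))   # textual view of the cuts
```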

29 Decision Trees III
The downside is a lack of stability: the first 'cut' affects all later ones, so the classification can vary widely with a different training set.
The power is somewhat below NN/kernel likelihood for typical HEP problems.
The stopping rule is important, and is affected by the sample size.

30 Boosted Decision Trees
A first tree is made.
More trees are added, constrained not to mimic the existing trees.
The final s/b is an average (in some sense) over all trees.
[Figure: performance vs number of trees; black, cyan, purple and green curves reflect increasing down-weighting of new trees]

31 Boosted Decision Trees
The lack of stability is removed by the averaging.
Computer intensive – but not N^3.
The power is very good.
The trees are individually small, but the whole data set is used in each – a good use of statistics.
Fairly fast.
Breiman: boosted trees are the best off-the-shelf classifier in the world.
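A minimal sketch (not from the talk) of boosting many small trees with AdaBoost, where the final score is a weighted average over all the trees; the toy data, the number of trees and the learning rate are my assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(7)
sig = rng.normal(loc=0.5, scale=0.5, size=(2000, 4))
bkg = rng.normal(loc=0.0, scale=1.0, size=(2000, 4))
X = np.vstack([sig, bkg])
y = np.concatenate([np.ones(len(sig)), np.zeros(len(bkg))])

# Each boosted tree is individually small (the default base learner is a
# depth-1 tree); later trees are down-weighted via the learning rate.
bdt = AdaBoostClassifier(n_estimators=200, learning_rate=0.5)
bdt.fit(X, y)
score = bdt.decision_function(X)   # combined, weighted output of all trees
```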

32 Friedman: Machine learning
● Ensemble learning:
  – Basis functions are generated from the data
  – Many ways exist of doing that: an Ensemble Generation Procedure, EGP
● Example EGPs:
  – Bagging, random forests, MART, etc.
● Decision trees are often used with ensemble learning
● Use lasso selection to generate the weights (a sketch follows below)
  – The total is a sum of weighted observations
  – The lasso gives weight 0 if the importance is small
(See his talk; also Byron Roe)
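A minimal sketch (not from the talk, and not Friedman's exact procedure): generate an ensemble of trees by bagging, then use a lasso fit to weight their outputs, so that unimportant ensemble members get exactly zero weight. The toy data, tree depth, ensemble size and alpha are all my assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Lasso

rng = np.random.default_rng(8)
sig = rng.normal(loc=0.5, size=(2000, 4))
bkg = rng.normal(loc=0.0, size=(2000, 4))
X = np.vstack([sig, bkg])
y = np.concatenate([np.ones(len(sig)), np.zeros(len(bkg))])

# Ensemble generation: trees trained on bootstrap resamples of the data.
ensemble = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))
    ensemble.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

# Each tree's output is a basis function; the lasso chooses sparse weights.
basis = np.column_stack([t.predict_proba(X)[:, 1] for t in ensemble])
lasso = Lasso(alpha=0.01).fit(basis, y)
print(f"{np.sum(lasso.coef_ != 0)} of {len(ensemble)} trees kept (non-zero weight)")
```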

33 Method Comparison
Ricardo Vilalta, Puneet Sarda, Gordon Mutchler, Paul Padley
Looked at interaction classification, e.g. K*+ production, using the 5 best variables from a set of 45 kinematic variables.

Algorithm              Accuracy (%)   False Positive Rate
Random Forest          90.02          0.116
C4.5                   89.23          0.127
ADTree                 88.90          0.115
Multilayer Perceptron  88.57          0.143
Support Vector         87.69          0.187
Naive Bayes            85.59          0.201

34 Conclusions
● The likelihood ratio underpins everything
  – Consider using it
● Machine learning brings significant power
  – Boosted trees very interesting?
● Cost of computing becoming important
  – Optimal methods are not necessarily optimal
  – But the data are very expensive too.
● Need to see statistics as a tool

