
1 W.Murray PPD 1 Summary of conference Emphasis on best-practice Bill Murray Thanks to Barlow, Cowan and Lyons

2 W.Murray PPD 2 Bayesian and Frequentist From: Roger Barlow, Manchester IoP meeting, 16th November 2005

3 W.Murray PPD 3 Probability Probability as the limit of frequency: P(A) = lim N_A / N_total (as N_total → ∞). The usual definition taught to students. Makes sense, and works well most of the time – but not always.

4 W.Murray PPD 4 Frequentist probability A frequentist cannot say “It will probably rain tomorrow” or “M_t = 174.3±5.1 GeV means the top quark mass lies between 169.2 and 179.4, with 68% probability” – the mass is a fixed (if unknown) number, not a random variable. The frequentist rephrasings: “The statement ‘It will rain tomorrow’ is probably true” and “M_t = 174.3±5.1 GeV means: the top quark mass lies between 169.2 and 179.4, at 68% confidence.”

5 W.Murray PPD 5 Bayesian Probability P(A) expresses my belief that A is true. Limits: 0 (impossible) and 1 (certain). Calibrated off clear-cut instances (coins, dice, urns).

6 W.Murray PPD 6 Frequentist versus Bayesian? Two sorts of probability – totally different. (Bayesian probability is also known as Inverse Probability.) Rivals? Religious differences? Particle physicists tend to be frequentists; cosmologists tend to be Bayesians. No: two different tools for practitioners. Important to: ● be aware of the limits and pitfalls of both ● always be aware which you’re using

7 W.Murray PPD 7 Bayes Theorem (1763) P(A|B) P(B) = P(A and B) = P(B|A) P(A), so P(A|B) = P(B|A) P(A) / P(B). Frequentist use, e.g. a Čerenkov counter: P(π | signal) = P(signal | π) P(π) / P(signal). Bayesian use: P(theory | data) = P(data | theory) P(theory) / P(data)
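A quick numeric illustration of the frequentist use above (a minimal sketch; the pion fraction and counter efficiencies are invented for the example):

```python
# Bayes' theorem for particle ID with a Cherenkov counter.
# All numbers below are invented for illustration.
p_pi = 0.9                 # prior: fraction of pions in the beam, P(pi)
p_k = 1.0 - p_pi           # fraction of kaons, P(K)
p_sig_pi = 0.95            # P(signal | pi): counter fires for a pion
p_sig_k = 0.05             # P(signal | K): counter fires for a kaon

p_sig = p_sig_pi * p_pi + p_sig_k * p_k    # P(signal), total probability
p_pi_sig = p_sig_pi * p_pi / p_sig         # P(pi | signal), Bayes' theorem
print(f"P(pi | signal) = {p_pi_sig:.3f}")
```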

8 W.Murray PPD 8 Bayesian Prior P(theory) is the Prior. It expresses the prior belief that the theory is true, and can be a function of a parameter: P(M_top), P(M_H), P(α,β,γ). Bayes’ Theorem describes the way prior belief is modified by experimental data. But what do you take as the initial prior?

9 W.Murray PPD 9 Uniform Prior General usage: choose P(a) uniform in a (principle of insufficient reason). Often ‘improper’: ∫P(a)da = ∞, though the posterior P(a|x) comes out sensible. BUT! If P(a) is uniform, P(a²), P(ln a), P(√a)… are not. Insufficient reason is not valid (unless a is ‘most fundamental’ – whatever that means). Statisticians handle this: check results for ‘robustness’ under different priors.
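The non-invariance of a uniform prior is easy to demonstrate by sampling (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(0.0, 1.0, 100_000)   # prior uniform in a

# The implied densities of a**2 and sqrt(a) are far from uniform:
for name, x in [("a", a), ("a^2", a**2), ("sqrt(a)", np.sqrt(a))]:
    hist, _ = np.histogram(x, bins=10, range=(0, 1), density=True)
    print(f"{name:8s} density per bin: {np.round(hist, 2)}")
```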

10 W.Murray PPD 10 Example – Le Diberder’s sad story Fitting the CKM angle α from B → ππ: 6 observables, 3 amplitudes, hence 6 unknown parameters (magnitudes and phases); α is the fundamentally interesting one.

11 W.Murray PPD 11 Results (frequentist vs Bayesian plots) Set one phase to zero; uniform priors in the other two phases and the 3 magnitudes.

12 W.Murray PPD 12 More Results (Bayesian) Parametrise the tree and penguin amplitudes: 3 amplitudes give 3 real parts and 3 imaginary parts.

13 W.Murray PPD 13 Interpretation ● The other B channel shows the same (mis)behaviour ● Removing all experimental info gives a similar P(α) ● The curse of high dimensions is at work: uniformity in x, y, z makes P(r) peak at large r ● This result is not robust under changes of prior
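The geometric effect behind the curse is easy to check: points uniform in a cube pile up at large radius, increasingly so in higher dimension (a sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
for dim in (1, 3, 10):
    x = rng.uniform(-1, 1, size=(100_000, dim))   # uniform in a hypercube
    r = np.linalg.norm(x, axis=1)
    print(f"dim={dim:2d}: mean r = {r.mean():.2f}, "
          f"fraction with r < 0.5*max = {(r < 0.5 * r.max()).mean():.3f}")
```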

14 W.Murray PPD 14 Happy ending? Jeffreys’ priors instead of uniform priors for ε and b: not uniform, but like 1/ε, 1/b. R. Barlow argued for non-informative priors: these are designed to bias the result least, but seem hard to construct. Coverage (a very frequentist concept) is a useful tool for Bayesians.

15 W.Murray PPD 15 Do’s and Don’ts with Likelihoods Louis Lyons (Oxford / CDF), Manchester, 16th November 2005

16 W.Murray PPD 16 NORMALISATION FOR LIKELIHOOD ∫ P(data | param) d(data) MUST be independent of the parameter. E.g. a lifetime fit to t_1, t_2, …, t_n: P(t|τ) = (1/τ) e^(−t/τ) is correct; P(t|τ) = e^(−t/τ), with the 1/τ missing, is INCORRECT.
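A sketch of why the missing 1/τ matters: with the unnormalised “likelihood” the maximum runs away to large τ, while the properly normalised one recovers the true lifetime.

```python
import numpy as np

rng = np.random.default_rng(3)
t = rng.exponential(scale=2.0, size=1000)   # toy data, true lifetime tau = 2

taus = np.linspace(0.5, 10, 500)
lnL_right = -len(t) * np.log(taus) - t.sum() / taus   # (1/tau) e^(-t/tau)
lnL_wrong = -t.sum() / taus                           # 1/tau factor omitted

print("right L peaks at tau =", taus[lnL_right.argmax()])  # ~ mean(t) ~ 2
print("wrong L peaks at tau =", taus[lnL_wrong.argmax()])  # runs to the edge
```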

17 W.Murray PPD 17 Upper Limits “We observed no significant signal, and our 90% conf upper limit is …” Need to specify the method, e.g.: chi-squared (data or theory error); frequentist (central or upper limit); Feldman-Cousins; Bayes with prior = const. “Show your L”: 1) not always practical; 2) not sufficient for frequentist methods.

18 W.Murray PPD 18 Δln L = −1/2 rule If L(μ) is Gaussian, the following definitions of σ are equivalent: 1) RMS of L(µ); 2) 1/√(−d² ln L/dµ²); 3) ln L(μ±σ) = ln L(μ₀) − 1/2. If L(μ) is non-Gaussian, these are no longer the same. “Procedure 3) above still gives an interval that contains the true value of the parameter μ with 68% probability” (Barlow, Phystat05).
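Definition 3) in code, for the exponential lifetime fit above (a sketch: scan ln L and take the points where it drops by 1/2 from its maximum):

```python
import numpy as np

rng = np.random.default_rng(4)
t = rng.exponential(scale=2.0, size=200)   # toy lifetime data, true tau = 2

taus = np.linspace(1.0, 4.0, 4000)
lnL = -len(t) * np.log(taus) - t.sum() / taus
inside = lnL > lnL.max() - 0.5             # the Delta lnL = -1/2 region
lo, hi = taus[inside][0], taus[inside][-1]
print(f"tau = {taus[lnL.argmax()]:.3f}, 68% interval [{lo:.3f}, {hi:.3f}]")
```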

19 W.Murray PPD 19 Coverage How often does the quoted range for a parameter include the parameter’s true value? N.B. coverage is a property of the METHOD, not of a particular experimental result, and it can vary with the true value of the parameter. Study the coverage of different methods for a Poisson parameter, from observation of the number of events n. Hope for: coverage equal to the nominal value. (Plots: coverage vs the true parameter, on a 0–100% scale.)

20 W.Murray PPD 20 COVERAGE Let P(μ) be the probability that the quoted interval contains the true μ, for nominal confidence level α. If P(μ) = α for all μ: “correct coverage”. P(μ) < α for some μ: “undercoverage” (this is serious!). P(μ) > α for some μ: “overcoverage” – conservative, with loss of rejection power.

21 W.Murray PPD 21 Coverage: L approach (not frequentist) P(n,μ) = e^(−μ) μ^n / n! (Joel Heinrich, CDF note 6438). Interval from −2 ln λ < 1, with λ = P(n,μ)/P(n,μ_best). UNDERCOVERS.
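The undercoverage can be reproduced with a few lines of Monte Carlo (a sketch; for a Poisson mean the ML estimate is μ_best = n):

```python
import numpy as np
from scipy.special import gammaln

def ln_pois(n, mu):
    # log of the Poisson pmf, with the convention 0*log(0) = 0 for n = 0
    mu = np.asarray(mu, dtype=float)
    return (np.where(n > 0, n * np.log(np.where(mu > 0, mu, 1.0)), 0.0)
            - mu - gammaln(n + 1))

def coverage(mu_true, n_exp=100_000, seed=5):
    n = np.random.default_rng(seed).poisson(mu_true, n_exp)
    # interval: -2 ln lambda < 1, with mu_best = n (the ML estimate)
    m2lnlam = -2.0 * (ln_pois(n, mu_true) - ln_pois(n, n))
    return (m2lnlam < 1.0).mean()

for mu in (0.5, 1.0, 3.0, 10.0):
    print(f"mu = {mu:5.1f}: coverage = {coverage(mu):.3f}  (nominal 0.683)")
```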

22 W.Murray PPD 22 Frequentist central intervals NEVER undercover (conservative at both ends)

23 W.Murray PPD 23 Feldman-Cousins Unified intervals Frequentist, so NEVER undercovers

24 W.Murray PPD 24 Probability ordering Frequentist, so NEVER undercovers

25 W.Murray PPD 25 L_max and goodness of fit We find parameters by maximising L, so a larger L is better than a smaller L – so does L_max give a goodness of fit?? Idea: compare the observed value with the Monte Carlo distribution of the unbinned L_max. (Plot: frequency of L_max, with regions marked Great? Good? Bad.)

26 W.Murray PPD 26 Why L_max is useless Fit an exponential to times t_1, t_2, t_3, … [Joel Heinrich, CDF 5639]: L = Π_i (1/τ) e^(−t_i/τ), and ln L_max = −N(1 + ln t_av), i.e. it depends only on the AVERAGE t but is INDEPENDENT OF THE DISTRIBUTION OF t (except for……..). (The average t is a sufficient statistic.) The variation of L_max in Monte Carlo is due to variations in the samples’ average t, but NOT TO BETTER OR WORSE FIT. (Plot: pdfs with the same average t give the same L_max.)
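The insensitivity is easy to verify (a sketch): two samples with the same mean but completely different shapes give identical ln L_max.

```python
import numpy as np

def lnL_max(t):
    # ln L at the ML point tau = mean(t), for pdf (1/tau) exp(-t/tau):
    # lnL_max = -N (1 + ln t_av)
    return -len(t) * (1.0 + np.log(t.mean()))

rng = np.random.default_rng(6)
t_exp = rng.exponential(2.0, 1000)         # genuinely exponential sample
t_bad = np.full(1000, t_exp.mean())        # all times identical: terrible fit
print(lnL_max(t_exp), lnL_max(t_bad))      # identical values
```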

27 W.Murray PPD 27 Why L_max is nearly useless Example 2: pdf(cosθ) ∝ 1 + cos²θ. The pdf (and likelihood) depends only on cos²θ_i, so it is insensitive to the sign of cosθ_i. The data can therefore be in very bad agreement with the expected distribution (e.g. all data with cosθ < 0) and L_max does not know about it. An example of a general principle.

28 W.Murray PPD 28 L_max and Goodness of Fit Conclusion: L has sensible properties with respect to parameters, NOT with respect to data. L_max lying within the Monte Carlo peak is NECESSARY, not SUFFICIENT. (‘Necessary’ doesn’t mean that you have to do it!)

29 W.Murray PPD 29 Compare Likelihood and pdf
pdf(t;λ): the value of the function changes when the observable is transformed; the integral of the function is INVARIANT wrt transformation of the observable. Conclusion: maximum probability density is not very sensible.
L(λ;t): the value of the function is INVARIANT wrt transformation of the parameter; the integral of the function changes when the parameter is transformed. Conclusion: integrating L is not very sensible.

30 W.Murray PPD 30 Why talk about integrating the likelihood? ● In Bayesian statistics, the integral of the posterior p.d.f. is very useful: P(theory | data) = P(data | theory) P(theory) / P(data) ● Ignoring P(data), this is: posterior PDF ∝ L × prior ● If you use a constant (uniform) prior then posterior PDF ∝ L ● This is of course wrong.

31 W.Murray PPD 31 Getting L wrong: the Punzi effect Giovanni Punzi @ PHYSTAT2003, “Comments on L fits with variable resolution”. Separating two close signals when the resolution σ varies event by event and is different for the two signals. E.g. 1) signal 1 ∝ 1 + cos²θ, signal 2 isotropic, and different parts of the detector give different σ; 2) for M (or τ), different numbers of tracks give different σ_M (or σ_τ).

32 W.Murray PPD 32 Events are characterised by x_i and σ_i. A events are centred on x = 0, B events on x = 1.
L(f)_wrong = Π [ f G(x_i,0,σ_i) + (1−f) G(x_i,1,σ_i) ]
L(f)_right = Π [ f p(x_i,σ_i|A) + (1−f) p(x_i,σ_i|B) ]
Since p(S,T) = p(S|T) p(T): p(x_i,σ_i|A) = p(x_i|σ_i,A) p(σ_i|A) = G(x_i,0,σ_i) p(σ_i|A).
So L(f)_right = Π [ f G(x_i,0,σ_i) p(σ_i|A) + (1−f) G(x_i,1,σ_i) p(σ_i|B) ].
If p(σ|A) = p(σ|B), L_right = L_wrong, but NOT otherwise.

33 W.Murray PPD 33 Giovanni’s Monte Carlo A: G(x,0,σ_A), B: G(x,1,σ_B), true f_A = 1/3.

σ_A    σ_B     f_A (L_wrong)  σ_f     f_A (L_right)  σ_f
1.0    1.0     0.336(3)       0.08    same           same
1.0    1.1     0.374(4)       0.08    0.333(0)       …
1.0    2.0     0.645(6)       0.12    0.333(0)       …
1–2    1.5–3   0.514(7)       0.14    0.335(2)       0.03
1.0    1–2     0.482(9)       0.09    0.333(0)       …

1) L_wrong is OK for p(σ|A) = p(σ|B), but otherwise BIASSED. 2) L_right is unbiassed, but L_wrong is biassed (enormously)! 3) L_right gives a smaller σ_f than L_wrong.
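A toy Monte Carlo along the same lines (a sketch of the σ_A = 1, σ_B = 2 case; the sample sizes and grid-scan fit are illustrative choices, not Giovanni’s implementation):

```python
import numpy as np

rng = np.random.default_rng(7)
nA, nB = 5000, 10_000                        # true f_A = 1/3
x = np.concatenate([rng.normal(0, 1, nA),    # A: centred on 0, sigma_A = 1
                    rng.normal(1, 2, nB)])   # B: centred on 1, sigma_B = 2
sig = np.concatenate([np.ones(nA), 2 * np.ones(nB)])

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def fit(pA, pB):
    # grid-scan maximum-likelihood fit for the fraction f
    fs = np.linspace(0.01, 0.99, 197)
    lnL = [np.log(f * pA + (1 - f) * pB).sum() for f in fs]
    return fs[int(np.argmax(lnL))]

# L_wrong: each event's own sigma is used for both hypotheses
f_wrong = fit(gauss(x, 0, sig), gauss(x, 1, sig))
# L_right: includes p(sigma|A), p(sigma|B); here p(sigma=1|A) = p(sigma=2|B) = 1
f_right = fit(gauss(x, 0, 1) * (sig == 1), gauss(x, 1, 2) * (sig == 2))
print(f"f_A: L_wrong gives {f_wrong:.3f}, L_right gives {f_right:.3f} (true 1/3)")
```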

34 W.Murray PPD 34 Explanation of the Punzi bias σ_A = 1, σ_B = 2: A events have σ = 1, B events have σ = 2. (Plots: actual distribution vs fitting function in x.) [N_A/N_B variable, but the same for A and B events.] The fit gives an upward bias for N_A/N_B because (i) that is much better for the A events; and (ii) it does not hurt too much for the B events.

35 W.Murray PPD 35 Avoiding the Punzi Bias Include p(σ|A) and p(σ|B) in the fit (but then, for example, particle identification may be determined more by the momentum distribution than by PID). OR fit each range of σ_i separately and add: Σ_i (N_A)_i = (N_A)_total, and similarly for B. The incorrect method using L_wrong uses a weighted average of the (f_A)_j, assumed to be independent of j. Talk by Catastini at PHYSTAT05.

36 W.Murray PPD 36 Nuisance parameters and systematic uncertainties Stolen from: Glen Cowan, Royal Holloway, University of London. Statistics in HEP, IoP Half Day Meeting, 16 November 2005, Manchester.

37 W.Murray PPD 37 Nuisance parameters Suppose the outcome of the experiment is some set of data values x (here shorthand for e.g. x_1, ..., x_n). We want to determine a parameter θ (could be a vector of parameters θ_1, ..., θ_n). The probability law for the data x depends on θ: L(x|θ) (the likelihood function). E.g. maximize L to find the estimator. Now suppose, however, that the vector of parameters contains some that are of interest and others that are not of interest; symbolically, θ = (θ, ν), say. The ν are called nuisance parameters.

38 W.Murray PPD 38 Example #1: fitting a straight line Data: (x_i, y_i, σ_i), i = 1, ..., n. Model: the measured y_i are independent and Gaussian about the line θ_0 + θ_1 x_i, with known x_i and σ_i. Goal: estimate θ_0 (we don’t care about θ_1).

39 W.Murray PPD 39 Case #1: θ_1 known a priori For Gaussian y_i, ML is the same as least squares: minimize χ²(θ_0) → estimator. Come up one unit from χ²_min to find the standard deviation.

40 W.Murray PPD 40 Case #2: both θ_0 and θ_1 unknown The correlation between the estimators causes the errors to increase. Standard deviations from tangent lines to the χ² contour.

41 W.Murray PPD 41 Case #3: we have a measurement t_1 of θ_1 The information on θ_1 improves the accuracy of the θ_0 estimate.

42 W.Murray PPD 42 The ‘tangent plane’ method is a special case of using the profile likelihood: the profile likelihood is found by maximizing L(θ_0, θ_1) over θ_1 for each θ_0. Equivalently use −2 ln of the profile likelihood. The interval obtained from it is the same as what is obtained from the tangents to the contour. Well known in HEP as the ‘MINOS’ method in MINUIT. The profile likelihood is one of several ‘pseudo-likelihoods’ used in problems with nuisance parameters; see e.g. the talk by Rolke at PHYSTAT05.
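A numeric sketch of the profile likelihood for the straight-line example (the data values are hypothetical; for Gaussian errors χ² = −2 ln L up to a constant, so the Δln L = −1/2 rule becomes Δχ² = 1):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.9, 2.4, 3.2, 3.6, 4.4])     # hypothetical measurements
s = 0.2 * np.ones_like(y)                   # known errors sigma_i

def chi2(th0, th1):
    return (((y - th0 - th1 * x) / s) ** 2).sum()

th0_grid = np.linspace(0.5, 2.5, 2001)
# profile over theta1: for fixed theta0 the minimum is a 1-D least square
th1_hat = [((y - t0) * x / s**2).sum() / (x**2 / s**2).sum() for t0 in th0_grid]
prof = np.array([chi2(t0, t1) for t0, t1 in zip(th0_grid, th1_hat)])

inside = prof < prof.min() + 1.0            # Delta chi2 = 1 <-> Delta lnL = -1/2
print(f"theta0 = {th0_grid[prof.argmin()]:.3f} "
      f"in [{th0_grid[inside][0]:.3f}, {th0_grid[inside][-1]:.3f}]")
```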

43 W.Murray PPD 43 The Bayesian approach In Bayesian statistics we can associate a probability with a hypothesis, e.g., a parameter value θ. Interpret the probability of θ as ‘degree of belief’ (subjective). Need to start with a ‘prior pdf’ π(θ); this reflects the degree of belief about θ before doing the experiment. Our experiment has data x → likelihood function L(x|θ). Bayes’ theorem tells how our beliefs should be updated in light of the data x: p(θ|x) ∝ L(x|θ) π(θ). The posterior pdf p(θ|x) contains all our knowledge about θ.

44 W.Murray PPD 44 Case #4: Bayesian method We need to associate prior probabilities with θ_0 and θ_1, e.g., a prior for θ_1 based on the previous measurement, and a prior for θ_0 reflecting ‘prior ignorance’ – in any case much broader than the likelihood. Putting this into Bayes’ theorem gives: posterior ∝ likelihood × prior.

45 W.Murray PPD 45 Bayesian method (continued) The ability to marginalize over nuisance parameters is an important feature of Bayesian statistics. We integrate (marginalize) p(θ_0, θ_1 | x) to find p(θ_0 | x) = ∫ p(θ_0, θ_1 | x) dθ_1. In this example we can do the integral analytically (rare).
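When the integral cannot be done analytically, marginalization on a grid works; a sketch continuing the hypothetical straight-line data above, with flat priors over a wide grid standing in for the broad priors discussed:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # same hypothetical data as above
y = np.array([1.9, 2.4, 3.2, 3.6, 4.4])
s = 0.2 * np.ones_like(y)

th0 = np.linspace(0.5, 2.5, 400)
th1 = np.linspace(0.0, 1.2, 400)
T0, T1 = np.meshgrid(th0, th1, indexing="ij")
chi2 = (((y - T0[..., None] - T1[..., None] * x) / s) ** 2).sum(axis=-1)

post = np.exp(-0.5 * (chi2 - chi2.min()))  # posterior, flat priors on the grid
post /= post.sum()
p_th0 = post.sum(axis=1)                   # marginalize over theta1

mean = (th0 * p_th0).sum()
sd = np.sqrt(((th0 - mean) ** 2 * p_th0).sum())
print(f"theta0 = {mean:.3f} +- {sd:.3f}")
```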

46 W.Murray PPD 46 Likelihood ratio limits (Feldman-Cousins) Define the likelihood ratio for a hypothesized parameter value s: λ(s) = L(s) / L(ŝ), where ŝ is the ML estimator (note λ ≤ 1). The critical region is defined by low values of the likelihood ratio. The resulting intervals can be one- or two-sided (depending on n). (Re)discovered for HEP by Feldman and Cousins, Phys. Rev. D 57 (1998) 3873.

47 W.Murray PPD 47 Nuisance parameters and limits In general we don’t know the background b perfectly. Suppose we have a measurement of b, e.g., b_meas ~ N(b, σ_b). So the data are really: n events and the value b_meas. In principle the confidence-interval recipe can be generalized to two measurements and two parameters. Difficult and rarely attempted; see e.g. the talk by G. Punzi at PHYSTAT05.

48 W.Murray PPD 48 Bayesian limits with uncertainty on b The uncertainty on b goes into the prior, e.g., a Gaussian π_b(b) centred on b_meas with width σ_b. Put this into Bayes’ theorem, marginalize over b, then use p(s|n) to find intervals for s with any desired probability content. The controversial part here is the prior for the signal, π_s(s) (the treatment of the nuisance parameters is easy).
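A numeric sketch of this recipe (the observation n_obs and the prior parameters b0, sigma_b are invented; flat prior in s, Gaussian prior in b truncated at b ≥ 0 by the grid):

```python
import numpy as np
from scipy.stats import norm, poisson

n_obs, b0, sigma_b = 3, 2.0, 0.5             # invented observation and b prior

s = np.linspace(0.0, 15.0, 1501)
b = np.linspace(0.0, b0 + 5 * sigma_b, 400)  # prior support, truncated at b >= 0
S, B = np.meshgrid(s, b, indexing="ij")

# posterior ~ Poisson(n | s+b) x Gaussian prior on b x flat prior on s
post = poisson.pmf(n_obs, S + B) * norm.pdf(B, b0, sigma_b)
p_s = post.sum(axis=1)                       # marginalize over b
p_s /= p_s.sum()

upper = s[np.searchsorted(np.cumsum(p_s), 0.90)]
print(f"90% credibility upper limit: s < {upper:.2f}")
```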

49 W.Murray PPD 49 Cousins-Highland method Regard b as ‘random’, characterized by a pdf π(b). This makes sense in the Bayesian approach, but in the frequentist model b is constant (although unknown). A measurement b_meas is random, but it is not the mean number of background events; rather, b is. Compute anyway P(n;s) = ∫ P(n|s+b) π_b(b) db. This would be the probability for n if Nature were to generate a new value of b upon repetition of the experiment with π_b(b). Now e.g. use this P(n;s) in the classical recipe for an upper limit at CL = 1 − α. The result has a hybrid Bayesian/frequentist character.

50 W.Murray PPD 50 ‘Integrated likelihoods’ Consider again signal s and background b; suppose we have uncertainty in b characterized by a prior pdf π_b(b). Define the integrated likelihood as L′(s) = ∫ L(s,b) π_b(b) db – also called the modified profile likelihood; in any case not a real likelihood. Now use this to construct a likelihood-ratio test and invert it to obtain confidence intervals. Feldman-Cousins & Cousins-Highland (FHC²); see e.g. J. Conrad et al., Phys. Rev. D67 (2003) 012002 and the Conrad/Tegenfeldt PHYSTAT05 talk. Calculators are available (Conrad, Tegenfeldt, Barlow).

51 W.Murray PPD 51 Interval from inverting a profile LR test Suppose we have a measurement b_meas of b. Build the likelihood-ratio test with the profile likelihood, maximizing over b for each fixed s, and use this to construct confidence intervals. See the PHYSTAT05 talks by Cranmer, Feldman, Cousins, Reid.

52 W.Murray PPD 52 Wrapping up I’ve shown a few ways of treating nuisance parameters in two examples (fitting a line, Poisson mean with background). No guarantee this will bear any relation to the problem you need to solve... At recent PHYSTAT meetings the statisticians have encouraged physicists to: learn Bayesian methods; not get too fixated on coverage; try to see statistics as a ‘way of thinking’ rather than a collection of recipes. I tend to prefer the Bayesian methods for systematics, but this is still a very open area of discussion.

53 W.Murray PPD 53 Machine Learning Bill Murray, RAL, CCLRC. Statistics in HEP, IoP Half Day Meeting, 16 November 2005, Manchester. Topics: likelihood, neural nets, support vector machines, optimal observables, boosted trees.

54 W.Murray PPD 54 What do we want to achieve? ● We aim to make optimal use of the data collected – data are expensive, so use powerful techniques. ● Data processing is also expensive, so power is not the only criterion. ● Systematic errors may well dominate. ● We need to be able to justify our results.

55 W.Murray PPD 55 The right answer ● Consider separating a dataset into 2 classes – call them background and signal. ● A simple cut is not optimal. (Plot: background and signal populations in an n-dimensional space of observables, e.g. E_T^miss, number of leptons.)

56 W.Murray PPD 56 The right answer II ● What is optimal? ● Maybe something like this might be... (Plot: a curved decision boundary between background and signal.)

57 W.Murray PPD 57 The right answer III ● For a given efficiency, we want to minimize the background. ● Sort the space by the s/b ratio in a small box; accept all regions with s/b above some threshold. ● Also known as the Likelihood Ratio.

58 W.Murray PPD 58 The right answer ● For optimal separation, order events by L_s / L_b – identical to ordering by L_{s+b} / L_b. ● More powerful still is to fit in the 1-D space defined above. This is the right answer.

59 W.Murray PPD 59 Determination of s, b densities ● We may know the matrix elements – but not for e.g. a b-tag – and anyway there are detector effects. ● Usually taken from simulation.

60 W.Murray PPD 60 Using MC to calculate density ● Brute force: divide our n-D space into hypercubes, with m divisions of each axis – m^n elements, needing ~100 m^n events for a 10% estimate – e.g. 1,000,000,000 events for 7 dimensions with 10 bins in each. ● This assumed a uniform density – actually we need far more, since the purpose was to separate different distributions.

61 W.Murray PPD 61 Better likelihood estimation ● Clever binning – starts to lead to tree techniques. ● Kernel density estimators – the size of the kernel grows with dimensions, and edges are an issue. ● Ignore correlations in variables – very commonly done: ‘I used likelihood’. ● Pretend measured = true, correct later – linked to Optimal Observable techniques and bias correction (see Zech later).

62 W.Murray PPD 62 Alternative approaches ● Cloud of cuts – use a random cut generator to optimise cuts. ● Neural nets – well known, good for high dimensions. ● Support vector machines – computationally easier than kernel methods. ● Decision trees – boosted or not?

63 W.Murray PPD 63 Digression: Cosmology ● Astronomers have a few hundred TB now – 1 pixel (byte) / sq arc second ~ 4 TB – multi-spectral, temporal, … → 1 PB. ● They mine it looking for new (kinds of) objects, or more of interesting ones (quasars); density variations and correlations in 400-D space. ● Data double every year; data are public after 1 year; same access for everyone. ● But: how long can this continue? – Alex Szalay

64 W.Murray PPD 64 Next-Generation Data Analysis ● Looking for: needles in haystacks (the Higgs particle); haystacks (dark matter, dark energy). Needles are easier than haystacks. ● ‘Optimal’ statistics have poor scaling – correlation functions are N², likelihood techniques N³ – and for large data sets the main errors are not statistical. ● As data and computers grow with Moore’s Law, we can only keep up with N log N. ● A way out? Discard the notion of optimal (data are fuzzy, answers are approximate); don’t assume infinite computational resources or memory. ● Requires a combination of statistics & computer science.

65 W.Murray PPD 65 Organization & Algorithms ● Use clever data structures (trees, cubes): an up-front creation cost, but only N log N access cost – a large speedup during the analysis – tree-codes for correlations (A. Moore et al 2001) – data cubes for OLAP (all vendors). ● Fast, approximate heuristic algorithms – no need to be more accurate than cosmic variance – fast CMB analysis by Szapudi et al (2001): N log N instead of N³ ⇒ 1 day instead of 10 million years. ● Take the cost of computation into account – a controlled level of accuracy; the best result in a given time, given our computing resources.

66 W.Murray PPD 66 Terabytes for Cosmology ● Sloan DSS costs were about 1/3 software; future projects are estimating 50%. ● Problem of scaling: data sets grow with Moore’s law, but analyses can grow with n² – clever code and tree structures give n log n – abandon optimal algorithms. ● The cost of data analysis will have to be directly included in setting statistical precision. – Rene Brun: Moore’s law is broken – but grid networks may continue to accelerate.

67 W.Murray PPD 67 How to calculate likelihood

68 W.Murray PPD 68 Kernel Likelihoods ● Directly estimate the PDF of the distributions based upon training-sample events. ● Some kernel, usually Gaussian, smears the sample – this increases the widths. ● Fully optimal iff MC statistics are infinite. ● The metric of the kernel (size, aspect ratio) is hard to optimize – watch the dependence of kernel size on statistics. ● Kernel size must grow with the number of dimensions; precision is lost if unnecessary ones are added. ● Big storage/computational requirements.
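A minimal kernel-density sketch using scipy’s gaussian_kde: estimate the signal and background densities from toy training samples, then order events by the estimated L_s / L_b (the sample shapes and sizes are invented for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(8)
# 2-D toy training samples standing in for simulated signal and background
sig_train = rng.normal([1.0, 1.0], 0.8, size=(5000, 2)).T
bkg_train = rng.normal([0.0, 0.0], 1.0, size=(5000, 2)).T

kde_s = gaussian_kde(sig_train)     # bandwidth chosen by Scott's rule
kde_b = gaussian_kde(bkg_train)

test = rng.normal([1.0, 1.0], 0.8, size=(5, 2)).T   # a few signal-like events
ratio = kde_s(test) / kde_b(test)   # estimated L_s / L_b per event
print(np.round(ratio, 2))           # accept events above some threshold
```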

69 W.Murray PPD 69 Kernel Likelihoods II

70 W.Murray PPD 70 Zech: Reduction of Variables ● A multi-dimensional problem where the theory is well known. ● He aims to reduce the dimensionality of the problem. ● In the case of linear dependence upon the parameter one can easily reduce the dimensionality – if we expand about a point, all is linear... – thus reduce to n-D = n unknowns – the Method of Optimal Observables? (WJM) ● Or do the fit in the bare likelihood and correct the bias with MC.

71 W.Murray PPD 71 Zech: Taming m^n Simple case: 2 random variables, 1 linear parameter. Define new variables u and v (formulas on the slide); the only relevant variable is u. (The analytic expression of g(u,v|θ) is not required!) The generalization to more than 2 variables is trivial.

72 W.Murray PPD 72 Zech: Taming m^n Method 2: use an approximately sufficient statistic or likelihood estimate. ● No large resolution and acceptance effects: perform the fit with uncorrected data and the undistorted likelihood function. ● Acceptance losses but small distortions: compute the global acceptance by MC and include it in the likelihood function. ● Strong resolution effects: perform a crude unfolding. All approximations are corrected by the Monte Carlo simulation; the loss in precision introduced by the approximations is usually completely negligible.

73 W.Murray PPD 73 Neural Nets Well known in HEP

74 W.Murray PPD 74 Multi-Layer Perceptron NN ● A popular and powerful neural network model, with layers i → j → k connected by weights ω_ji and ω_kj. Need to find the ω’s and α’s, the free parameters of the model.
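A minimal sketch using scikit-learn’s MLPClassifier as a stand-in for such a network (toy data and hyperparameters are invented; a well-trained classifier’s output approximates s/(s+b)):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(10)
# toy 2-class sample standing in for signal (1) vs background (0)
X = np.vstack([rng.normal(0, 1, (2000, 3)), rng.normal(1, 1, (2000, 3))])
y = np.array([0] * 2000 + [1] * 2000)

# one hidden layer; its weights and biases are the free parameters above
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500)
net.fit(X, y)
print(net.predict_proba(X[:3])[:, 1])   # output approximates s/(s+b)
```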

75 W.Murray PPD 75 Neural Network Features ● Easy to use – packages exist everywhere. ● Performance is good – especially at handling higher dimensionality, with no need to define a metric – but not optimal (we aim to approximate the likelihood), and only trained at one point. ● Training is an issue – optimization of nodes/layers can be difficult, and the net can over-focus on fluctuations (a problem for all machine learning). ● Often worth a try.

76 W.Murray PPD 76 Support Vector Machines ● Simplify the storage and computation of the separation by storing only the ‘support vectors’. ● The straight line is defined by the closest points – the support vectors. (Vapnik 1996)

77 W.Murray PPD 77 Straight lines??? ● Straight lines are not adequate in general. The trick is to project from the observed space into a higher- (infinite-?) dimensional space, such that a simple hyperplane defines the surfaces. ● The projection is done implicitly by the choice of kernel. ● Only inner products are ever evaluated, and these are metric independent, so they can be calculated in the normal space. ● You never need to explicitly define the higher-dimensional space.

78 W.Murray PPD 78 Support Vector Machines ● Storing just those points lying closest to the boundary is much easier than storing the entire space. ● The defect is that the cut is only well defined near the boundary. ● Computationally much easier than kernel methods. ● Gaussian/NN kernels are both popular.
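A minimal sketch with scikit-learn’s SVC, showing the Gaussian (RBF) kernel and the stored support vectors (toy data invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(11)
X = np.vstack([rng.normal(0, 1, (1000, 2)), rng.normal(1.5, 1, (1000, 2))])
y = np.array([0] * 1000 + [1] * 1000)

# RBF kernel: the implicit projection into a higher-dimensional space
svm = SVC(kernel="rbf", C=1.0)
svm.fit(X, y)
print("support vectors stored:", svm.support_vectors_.shape[0], "of", len(X))
```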

79 W.Murray PPD 79 Decision Trees A standard decision tree divides the problem in a series of steps; signal/background is evaluated in each box.

80 W.Murray PPD 80 Decision Trees II Very clear: each decision is binary, and the whole classifier can be drawn as a tree (e.g. nodes y > 0.3, x > 0.6, x > 0.7). This display works for any dimensionality of the problem. But how much do we value clarity?

81 W.Murray PPD 81 Decision Trees III The downside is a lack of stability: the first ‘cut’ affects all later ones, so the classification can vary widely with a different training set. Power is somewhat below NN/kernel likelihood for typical HEP problems. The stopping rule is important, and is affected by the sample size.

82 W.Murray PPD 82 Boosted Decision Trees A first tree is made; more are added, constrained not to mimic the existing trees. The final s/b is an average (in some sense) over all trees. (Plot: performance vs number of trees; black, cyan, purple and green curves reflect increasing down-weighting of new trees.)

83 W.Murray PPD 83 Boosted Decision Trees The lack of stability is removed by the averaging. Computer intensive – but not N³. Power is very good. The trees are individually small, but the whole data set is in each – good use of statistics. Fairly fast. Breiman: boosted trees are the best off-the-shelf classifier in the world.
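A minimal sketch using scikit-learn’s GradientBoostingClassifier as a stand-in for the boosted trees described here (the toy data and hyperparameters are illustrative choices):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(9)
# toy 2-class problem standing in for signal vs background
X = np.vstack([rng.normal(0, 1, (5000, 4)), rng.normal(0.5, 1, (5000, 4))])
y = np.array([0] * 5000 + [1] * 5000)

bdt = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                 learning_rate=0.1)   # many small trees
bdt.fit(X, y)
# per-event score, monotonic in the estimated s/b, used like L_s/L_b
score = bdt.predict_proba(X[:5])[:, 1]
print(np.round(score, 3))
```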

84 W.Murray PPD 84 Friedman: Machine learning ● Ensemble learning: basis functions are generated from the data; many ways exist of doing that (Ensemble Generation Procedures, EGPs). ● Example EGPs: bagging, random forests, MART, etc. ● Decision trees are often used with ensemble learning. ● Use lasso selection to generate weights – the total is a sum of weighted observations, and the lasso gives 0 if the importance is small. See his talk; also Byron Roe.

85 W.Murray PPD 85 Method Comparison (Ricardo Vilalta, Puneet Sarda, Gordon Mutchler, Paul Padley) Looked at interaction classification, e.g. reactions with a K*+ in the final state; the 5 best variables were chosen from a set of 45 kinematic quantities.

Algorithm              Accuracy (%)  False-positive rate
Random Forest          90.02         0.116
C4.5                   89.23         0.127
ADTree                 88.90         0.115
Multilayer Perceptron  88.57         0.143
Support Vector         87.69         0.187
Naive Bayes            85.59         0.201

86 W.Murray PPD 86 Conclusions ● The likelihood ratio underpins everything – use it, but do it right. ● Bayesian and frequentist methods are both needed today – if in doubt, check your method with the other. ● The cost of computing is becoming important – ‘optimal’ methods are not necessarily optimal – but the data are very expensive too.

