Multivariate Analysis
Past, Present and Future
Harrison B. Prosper
Florida State University
PHYSTAT 2003, 10 September 2003
Multivariate Analysis Harrison B. Prosper Durham, UK 2002
Outline
  Introduction
  Historical Note
  Current Practice
  Issues
  Summary
Introduction
  Data are invariably multivariate
    Particle physics (η, φ, E, f)
    Astrophysics (θ, φ, E, t)
Introduction – II
A Textbook Example

  Objects      Number of variables
  Jet 1 (b)    3
  Jet 2        3
  Jet 3        3
  Jet 4 (b)    3
  Positron     3
  Neutrino     2
  Total        17
Introduction – III
  Astrophysics/Particle physics: Similarities
    Events
    Interesting events occur at random
    Poisson processes
    Backgrounds are important
    Experimental response functions
    Huge datasets
Introduction – IV
  Differences
    In particle physics we control when events occur and under what conditions
    We have detailed predictions of the relative frequency of various outcomes
Introduction – V
  All we do is Count!
  Our experiments are ideal Bernoulli trials
  At Fermilab, each collision (that is, each trial) is conducted the same way every 400 ns
  de Finetti’s analysis of exchangeable trials is an accurate model of what we do
  [Figure: axis labeled "Time →"]
Introduction – VI
  Typical analysis tasks
    Data Compression
    Clustering and cluster characterization
    Classification/Discrimination
    Estimation
    Model selection/Hypothesis testing
    Optimization
Historical Note
  Karl Pearson (1857 – 1936)
  R.A. Fisher (1890 – 1962)
  P.C. Mahalanobis (1893 – 1972)
Historical Note – Iris Data
  Iris Versicolor
  Iris Setosa
  R.A. Fisher, The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics, v. 7, p. 179-188 (1936)
Iris Data
  Variables
    X1  Sepal length
    X2  Sepal width
    X3  Petal length
    X4  Petal width
  “What linear function of the four measurements will maximize the ratio of the difference between the specific means to the standard deviations within species?” R.A. Fisher
Fisher Linear Discriminant (1936)
  Solution: [equation not preserved]
  Which is the same, within a constant, as [equation not preserved]
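The formulas on this slide did not survive, so the following is only a minimal sketch of the standard textbook form of Fisher's solution, in which the weight vector is proportional to S_W^{-1} (m_a - m_b), with S_W the summed within-class scatter matrix and m_a, m_b the class means. The two-variable toy data and all function names are assumptions made for illustration; they are not taken from the talk or from the iris data.

```python
# Minimal sketch of a two-class Fisher linear discriminant in two dimensions.
# Toy data and function names are illustrative assumptions.

def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """2x2 within-class scatter matrix: sum over points of (p - m)(p - m)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    """Return w proportional to S_W^{-1} (m_a - m_b)."""
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    dm = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * dm[0] + inv[0][1] * dm[1],
            inv[1][0] * dm[0] + inv[1][1] * dm[1]]

def project(w, p):
    """Value of the linear discriminant w . p for one point."""
    return w[0] * p[0] + w[1] * p[1]

class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.5)]
class_b = [(5.0, 5.5), (6.0, 7.0), (7.0, 6.0), (6.0, 5.0)]
w = fisher_direction(class_a, class_b)
scores_a = [project(w, p) for p in class_a]
scores_b = [project(w, p) for p in class_b]
```

For these well-separated toy classes every projected score of class_a lies above every score of class_b, so a single threshold on w . x separates them.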
Current Practice in Particle Physics
  Reducing number of variables
    Principal Component Analysis (PCA)
  Discrimination/Classification
    Fisher Linear Discriminant (FLD)
    Random Grid Search (RGS)
    Feedforward Neural Network (FNN)
    Kernel Density Estimation (KDE)
Current Practice – II
  Parameter Estimation
    Maximum Likelihood (ML)
    Bayesian (KDE and analytical methods), e.g., see talk by Florencia Canelli (12A)
  Weighting
    Usually 0, 1, referred to as “cuts”
    Sometimes use the R. Barlow method
Cuts (0, 1 weights)
  [Figure: plot of S and B points with cuts applied]
  Points that lie below the cuts are “cut out”
  We refer to (x0, y0) as a cut-point
Grid Search
  [Figure: plot of S and B points with a grid of cut-points]
  Apply cuts at each grid point, compute some measure of their effectiveness, and choose the most effective cuts
  Curse of dimensionality: number of cut-points ~ N_bin^N_dim
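The grid search described on this slide can be sketched as follows. This is a hedged illustration, not the procedure used in any particular analysis: the toy events, the grid values, and the figure of merit s / sqrt(s + b) are all assumptions (the slide only says "some measure of their effectiveness").

```python
import itertools
import math

def passes(event, cut_point):
    """An event survives if every variable is at or above its cut value."""
    return all(v >= c for v, c in zip(event, cut_point))

def count_passing(events, cut_point):
    return sum(1 for e in events if passes(e, cut_point))

def grid_search(signal, background, axes):
    """Try every grid point; return (best_measure, best_cut_point)."""
    best = (-1.0, None)
    for cut_point in itertools.product(*axes):
        s = count_passing(signal, cut_point)
        b = count_passing(background, cut_point)
        # Assumed figure of merit; other choices are equally possible.
        measure = s / math.sqrt(s + b) if s + b > 0 else 0.0
        if measure > best[0]:
            best = (measure, cut_point)
    return best

signal = [(3.0, 3.5), (4.0, 2.5), (3.5, 4.0), (2.5, 3.0)]
background = [(1.0, 1.5), (0.5, 3.0), (2.0, 0.5), (3.0, 1.0)]
axes = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]  # N_bin = 4, N_dim = 2

n_cut_points = len(axes[0]) ** len(axes)  # N_bin ** N_dim = 16
best_measure, best_cut = grid_search(signal, background, axes)
```

With 4 bins in each of 2 dimensions there are only 16 cut-points, but the same 4 bins in 17 dimensions would already give 4**17 cut-points, which is the curse of dimensionality the slide refers to.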
Random Grid Search
  Take each point of the signal class as a cut-point
  [Figure: axes x, y; Signal fraction, Background fraction]
  n = # events in sample
  k = # events after cuts
  fraction = k/n
  H.B.P. et al, Proceedings, CHEP 1995
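A minimal sketch of the random grid search idea, assuming two variables and cuts of the form "keep events at or above the cut-point in every variable". The toy events and function names are hypothetical. Each signal event is used in turn as a cut-point, and the surviving fraction (events after cuts divided by events in the sample) is computed for both classes.

```python
def passes(event, cut_point):
    """An event survives if every variable is at or above its cut value."""
    return all(v >= c for v, c in zip(event, cut_point))

def fraction_passing(events, cut_point):
    n = len(events)                                     # events in sample
    k = sum(1 for e in events if passes(e, cut_point))  # events after cuts
    return k / n

def random_grid_search(signal, background):
    """Use each signal event as a cut-point.

    Returns a list of (cut_point, signal_fraction, background_fraction).
    """
    return [(cut,
             fraction_passing(signal, cut),
             fraction_passing(background, cut))
            for cut in signal]

signal = [(3.0, 3.5), (4.0, 2.5), (3.5, 4.0), (2.5, 3.0)]
background = [(1.0, 1.5), (0.5, 3.0), (2.0, 0.5), (3.0, 1.0), (3.0, 3.2)]
results = random_grid_search(signal, background)
```

Plotting signal fraction against background fraction for all entries of results gives the kind of scatter from which a working point can be chosen. The number of cut-points equals the number of signal events, independent of the number of variables.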
Example: DØ Top Discovery (1995)
Optimal Discrimination
  r(x,y) = constant defines the optimal decision boundary
  Bayes Discriminant: [equation not preserved]
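The slide's formulas were not preserved. A standard way to write the quantities it names is given below, assuming that r(x, y) denotes the ratio of the signal density to the background density, and that p(S) and p(B) are the prior probabilities of the two classes (these symbols are assumptions, not taken from the slide).

```latex
r(x, y) = \frac{p(x, y \mid S)}{p(x, y \mid B)},
\qquad
D(x, y) \equiv p(S \mid x, y)
  = \frac{p(x, y \mid S)\, p(S)}{p(x, y \mid S)\, p(S) + p(x, y \mid B)\, p(B)}
  = \frac{r(x, y)}{r(x, y) + p(B)/p(S)} .
```

Because D is a monotonically increasing function of r, a contour of constant r is also a contour of constant D, so cutting on either quantity defines the same family of decision boundaries.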
Feedforward Neural Networks
  Applications
    Discrimination
    Parameter estimation
    Function and density estimation
  Basic Idea
    Encode a mapping (Kolmogorov, 1950s) using a set of 1-D functions
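As a rough illustration of building a multivariate mapping out of 1-D functions, here is a sketch of the forward pass of a network with one hidden layer. The architecture (tanh hidden nodes, a logistic output), the parameter names, and the numerical values are assumptions chosen for illustration; the slide does not specify them.

```python
import math

def feedforward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Output of a one-hidden-layer network for the input vector x.

    Each hidden node applies the 1-D function tanh to a weighted sum of the
    inputs; the output node applies a logistic function to a weighted sum of
    the hidden-node values, so the result lies strictly between 0 and 1.
    """
    hidden = [math.tanh(b + sum(w_i * x_i for w_i, x_i in zip(w, x)))
              for w, b in zip(hidden_weights, hidden_biases)]
    z = output_bias + sum(v * h for v, h in zip(output_weights, hidden))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters for a network with 2 inputs and 3 hidden nodes.
hidden_weights = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]]
hidden_biases = [0.1, -0.2, 0.0]
output_weights = [1.0, -2.0, 0.5]
output_bias = 0.3

y = feedforward([0.4, 1.2], hidden_weights, hidden_biases,
                output_weights, output_bias)
```

In practice the weights would be obtained by minimizing a risk function over a training sample rather than set by hand.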
Example: DØ Search for Leptoquarks
  [Figure: diagram with labels LQ, g, q, LQ]
Issues
  Method choice
    Life is short and data finite; so how should one choose a method?
  Model complexity
    How to reduce dimensionality of data, while minimizing loss of “information”?
    How many model parameters?
    How should one avoid over-fitting?
Issues – II
  Model robustness
    Is a cut on a multivariate discriminant necessarily more sensitive to modeling errors than a cut on each of its input variables?
    What is a practical, but useful, way to assess sensitivity to modeling errors and robustness with respect to assumptions?
Issues – III
  Accuracy of predictions
    How should one place “error bars” on multivariate-based results?
    Is a Bayesian approach useful?
  Goodness of fit
    How can this be done in multiple dimensions?
Summary
  After ~80 years of effort we have many powerful methods of analysis
  A few of these are now used routinely in physics analyses
  The most pressing need is to understand some issues better so that when the data tsunami strikes we can respond sensibly
FNN – Probabilistic Interpretation
  Minimize the empirical risk function [equation not preserved] with respect to w
  Solution (for large N): [equation not preserved]
  If t(x) = kδ[1-I(x)], where I(x) = 1 if x is of class k, 0 otherwise: [equation not preserved]
  D.W. Ruck et al., IEEE Trans. Neural Networks 1(4), 296-298 (1990)
  E.A. Wan, IEEE Trans. Neural Networks 1(4), 303-305 (1990)
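The equations on this slide were lost. The block below is a hedged reconstruction of the standard argument, assuming a squared-error risk, a network function written n(x, w), training pairs (x_i, t_i) with i = 1, ..., N, and a 0/1 target (t = 1 if x is of class k, 0 otherwise). The notation is an assumption and may differ from the original slide.

```latex
R(w) = \frac{1}{N} \sum_{i=1}^{N} \bigl[ t_i - n(x_i, w) \bigr]^2
\;\xrightarrow[N \to \infty]{}\;
\int\!\!\int \bigl[ t - n(x, w) \bigr]^2 \, p(t, x) \, dt \, dx ,
\qquad
n(x, w^{*}) \approx \int t \, p(t \mid x) \, dt = p(k \mid x) .
```

The first approximation holds to the extent that the network is flexible enough to represent the conditional mean of t given x; the final equality uses the assumed 0/1 target, for which that conditional mean is the probability of class k given x.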
Self Organizing Map
  Basic Idea (Kohonen, 1988)
    Map each of K feature vectors X = (x1, ..., xN)^T into one of M regions of interest defined by the vector w_m, so that all X mapped to a given w_m are closer to it than to any other w_m.
    Basically, perform a coarse-graining of the feature space.
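The mapping step described above (assign each feature vector to its closest w_m) can be sketched as follows. This shows only the assignment, assuming Euclidean distance and vectors w_m that are already given; the training procedure that positions the w_m is not shown. The data and function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest_region(x, prototypes):
    """Index m of the vector w_m closest to the feature vector x."""
    return min(range(len(prototypes)),
               key=lambda m: squared_distance(x, prototypes[m]))

def coarse_grain(vectors, prototypes):
    """Group feature vectors by the region they are mapped to."""
    regions = {m: [] for m in range(len(prototypes))}
    for x in vectors:
        regions[nearest_region(x, prototypes)].append(x)
    return regions

prototypes = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]   # M = 3 vectors w_m
vectors = [(0.5, 0.2), (4.8, 5.1), (0.3, 4.4), (1.0, 1.0), (4.0, 4.5)]  # K = 5
regions = coarse_grain(vectors, prototypes)
```

Each of the K = 5 feature vectors ends up in exactly one of the M = 3 regions, which is the coarse-graining of the feature space mentioned on the slide.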
Support Vector Machines
  Basic Idea
    Data that are non-separable in N dimensions have a higher chance of being separable if mapped into a space of higher dimension.
    Use a linear discriminant to partition the high-dimensional feature space.
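A toy illustration of the mapping idea only: four XOR-like points cannot be separated by a line in two dimensions, but become linearly separable once a third feature x1*x2 is added. This is not a support vector machine training algorithm (no margin is maximized); the feature map, the hand-picked weights, and the data are assumptions made for illustration.

```python
def feature_map(x):
    """Map a 2-D point into 3-D by appending the product of its coordinates."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def linear_discriminant(z, w, b):
    return sum(wi * zi for wi, zi in zip(w, z)) + b

points = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
labels = [1, -1, -1, 1]          # XOR-like pattern, not separable in 2-D

w, b = (0.0, 0.0, 1.0), 0.0      # hand-picked plane in the 3-D feature space
predictions = [1 if linear_discriminant(feature_map(p), w, b) > 0 else -1
               for p in points]
```

Here the plane z3 = 0 in the mapped space classifies all four points correctly, although no straight line in the original (x1, x2) plane can.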
Independent Component Analysis
  Basic Idea
    Assume X = (x1, ..., xN)^T is a linear sum X = AS of independent sources S = (s1, ..., sN)^T.
    Both A, the mixing matrix, and S are unknown.
    Find a de-mixing matrix T such that the components of U = TX are statistically independent.
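A short worked statement of what the de-mixing matrix can and cannot recover, written under the usual assumptions (not stated on the slide) that A is square and invertible and that at most one source is Gaussian. The symbols P and D are introduced here for illustration.

```latex
U = T X = (T A)\, S ,
\qquad
T A = P D
\;\Longleftrightarrow\;
T = P D A^{-1} ,
```

where P is a permutation matrix and D is an invertible diagonal matrix. In other words, requiring the components of U to be statistically independent determines the sources only up to their order and their scale.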