Machine Learning for Data Certification at CMS
by Humza Khan
Mentors: Dr. Nural Akchurin, Federico De Guio
Overview
- Compact Muon Solenoid (CMS)
- Data Certification
- Machine Learning Introduction
- Preprocessing
- Supervised Learning: Support Vector Machines, Boosted Decision Trees, Stochastic Gradient Descent
- Unsupervised Learning: One-class SVM, Isolation Forest, Autoencoders
- Further Steps
Compact Muon Solenoid (CMS)
- One of the major experiments at the LHC, along with ATLAS
- The LHC smashes particles together
- CMS is like a giant camera for collisions
- Multiple layers for tracking different particles
Data Certification
- The LHC produces approximately 25 petabytes of data per year
- Not all of it is interesting for physics
- “Good” vs. “bad” data
- Preliminary filters sort out a lot of the data
- This still leaves a large amount of unclassified data
- That data needs to be manually checked by detector experts
Machine Learning Introduction
- Make computers learn a problem without being explicitly programmed
- Clever uses of statistics and computer science to analyze patterns in data
- Classification vs. regression
  - Both assign numerical values to data samples
  - Regression assigns from a continuous set
  - Classification assigns from a discrete set
- Given 𝑛 data samples
  - Each has 𝑞 features (values that describe the object)
  - Each sample (usually) has a label
Machine Learning Introduction
- Split the data into training and testing sets
- Train the model with the training set
- Feed the testing set to the model, then see how accurate the predictions are
- Supervised learning has labels attached to samples
- Unsupervised learning does not have labels attached to samples
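The split, train, and test steps above can be sketched in a few lines of scikit-learn. This is only an illustration: the data is synthetic, the "good"/"bad" labeling rule is made up, and `LogisticRegression` is a placeholder classifier, not a model named in these slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for real data: n samples, q features each.
n, q = 1000, 5
X = rng.normal(size=(n, q))
# Hypothetical labels (1 = "good", 0 = "bad") from a made-up rule.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 25% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train on the training set only.
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate on samples the model has never seen.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Keeping the test set out of training is what makes the reported accuracy an estimate of performance on new data.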
Preprocessing
- Feature scaling
- Feature selection
- Dimensionality reduction
- Given 43 features
  - Represent Pt, Eta, Phi, MetPt, MetPhi, Vertices, Cross Section
  - Mean, RMS, Q1, Q2, Q3, Q4, Q5
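One possible way to chain the three preprocessing steps is a scikit-learn pipeline. The sketch below is an assumption, not the procedure used in this work: the data is random, and `VarianceThreshold`, `StandardScaler`, `PCA`, and the choice of 10 components are illustrative stand-ins. Only the count of 43 input features comes from the slide.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 500 samples, 43 features on very different scales.
X = rng.normal(size=(500, 43)) * rng.uniform(0.1, 100.0, size=43)
X[:, 0] = 1.0  # a constant feature carries no information

preprocess = make_pipeline(
    VarianceThreshold(threshold=0.0),  # feature selection: drop constant columns
    StandardScaler(),                  # feature scaling: zero mean, unit variance
    PCA(n_components=10),              # dimensionality reduction (10 is arbitrary)
)

X_reduced = preprocess.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```

Scaling before PCA matters here because PCA is sensitive to the units of each feature; without it the largest-scale features would dominate the components.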
Support Vector Machines (SVM)
- Data with 𝑛 features lives in 𝑛-dimensional space
- Find an (𝑛−1)-dimensional hyperplane to divide the data
- Maximize the distance between the hyperplane and the nearest points
- Not all data is linearly separable
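The last point can be illustrated with a toy data set of two concentric rings, which no straight line can separate. The sketch compares a linear SVM with one using an RBF kernel; the kernel is a common remedy that is assumed here for illustration and is not named on the slide. The data comes from scikit-learn's `make_circles`, not from CMS.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: a classic non-linearly-separable toy problem.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A linear SVM looks for a separating line in the original 2-D space.
linear_svm = SVC(kernel="linear").fit(X_train, y_train)

# An RBF-kernel SVM implicitly works in a higher-dimensional space,
# where a separating hyperplane can exist.
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

linear_acc = linear_svm.score(X_test, y_test)
rbf_acc = rbf_svm.score(X_test, y_test)
print(f"linear kernel: {linear_acc:.2f}, RBF kernel: {rbf_acc:.2f}")
```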
Stochastic Gradient Descent (SGD)
- Gradient descent computes the gradient at the current point and steps in the direction of steepest descent
- Good at finding minima quickly
- Can get stuck at a local minimum instead of the global minimum
- SGD updates the parameters using only one sample at a time instead of all samples
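A minimal sketch of the one-sample-at-a-time update, written in plain NumPy for a least-squares linear model. The data, the true weights, the learning rate, and the number of epochs are all made up for illustration; the loss here is convex, so the local-minimum issue mentioned above does not arise in this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem with known weights.
n, q = 500, 3
X = rng.normal(size=(n, q))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=n)

w = np.zeros(q)          # start from an arbitrary point
learning_rate = 0.01     # step size (illustrative value)

for epoch in range(20):
    # Visit the samples in a fresh random order each epoch.
    for i in rng.permutation(n):
        error = X[i] @ w - y[i]
        gradient = error * X[i]        # gradient of 0.5 * error**2 w.r.t. w
        w -= learning_rate * gradient  # step against the gradient

print(w)
```

Each update touches a single sample, so one step is cheap, at the cost of a noisier path toward the minimum than full-batch gradient descent.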
One-class SVM
- Novelty detection
- Train only on good data, test on both
- Useful when classes are extremely different in size
- Only have 5% background
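The train-on-good, test-on-both idea can be sketched with scikit-learn's `OneClassSVM`. Everything in the example is an assumption for illustration: the two-dimensional Gaussian data, the position of the "bad" cluster, and the value `nu=0.05`, which caps the fraction of training points treated as outliers.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-ins: "good" samples cluster near the origin,
# "bad" samples sit far away from them.
good_train = rng.normal(0.0, 1.0, size=(950, 2))
good_test = rng.normal(0.0, 1.0, size=(200, 2))
bad_test = rng.normal(6.0, 1.0, size=(50, 2))

# Fit on good data only; no labels are needed.
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
detector.fit(good_train)

# predict() returns +1 for inliers and -1 for outliers.
good_accept = np.mean(detector.predict(good_test) == 1)
bad_reject = np.mean(detector.predict(bad_test) == -1)
print(f"good accepted: {good_accept:.2f}, bad rejected: {bad_reject:.2f}")
```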
Autoencoders
- Neural network
  - Computational representation of human neurons
- Dimensionality reducer
  - Finds non-linear correlations within features
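A rough sketch of the autoencoder idea: a network is trained to reproduce its own input through a narrow middle layer, and samples it reconstructs poorly are flagged as unusual. The slides do not say which framework was used, so scikit-learn's `MLPRegressor` stands in here as a minimal substitute for a deep-learning library. The data, the layer sizes, and the helper `make_good` are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def make_good(n):
    """Hypothetical 'good' data: 8 features driven by 2 hidden factors."""
    a = rng.uniform(-1.0, 1.0, size=n)
    b = rng.uniform(-1.0, 1.0, size=n)
    feats = np.column_stack(
        [a, b, a * b, a**2, b**2, np.sin(a), np.cos(b), a + b]
    )
    return feats + 0.01 * rng.normal(size=feats.shape)

scaler = StandardScaler()
good_train = scaler.fit_transform(make_good(1000))
good_test = scaler.transform(make_good(200))
# Hypothetical 'bad' data: wide noise that ignores the feature correlations.
bad_test = 2.0 * rng.normal(size=(200, 8))

# Encoder 8 -> 16 -> 2, decoder 2 -> 16 -> 8; the 2-unit layer is the bottleneck.
autoencoder = MLPRegressor(
    hidden_layer_sizes=(16, 2, 16), activation="tanh",
    max_iter=500, random_state=0,
)
autoencoder.fit(good_train, good_train)  # the target is the input itself

def reconstruction_error(X):
    # Mean squared difference between each sample and its reconstruction.
    return np.mean((autoencoder.predict(X) - X) ** 2, axis=1)

good_error = reconstruction_error(good_test)
bad_error = reconstruction_error(bad_test)
print(f"good: {good_error.mean():.3f}, bad: {bad_error.mean():.3f}")
```

Because the bottleneck has only two units, the network can reproduce its input well only when the features follow the correlations it saw during training.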
Further Steps
- Inspect labels
- Further explore deep learning
- Image recognition
Sources
- http://cms.web.cern.ch/sites/cms.web.cern.ch/files/styles/large/public/field/image/0611042_01-A4-at-140001.jpg?itok=NaAYCj1Z
- http://cms.web.cern.ch/news/what-cms
- http://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html#sphx-glr-auto-examples-svm-plot-oneclass-py
- https://twiki.cern.ch/twiki/bin/viewauth/CMS/ML4DC
- https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png
- http://nghiaho.com/wp-content/uploads/2012/12/autoencoder_network1.png
Thanks!
- Jean Kirsch, Steven Goldfarb, and Thomas Schwarz
- Lounsberry Foundation
- Federico de Guio
- Nural Akchurin
- Giovanni Franconi
- Filip Sîroky