Machine Learning for Data Certification at CMS

by Humza Khan
Mentors: Dr. Nural Akchurin, Federico De Guio

Overview
Compact Muon Solenoid (CMS)
Data Certification
Machine Learning Introduction
Preprocessing
Supervised Learning
  Support Vector Machines, Boosted Decision Trees, Stochastic Gradient Descent
Unsupervised Learning
  One-class SVM, Isolation Forest, Autoencoders
Further Steps

Compact Muon Solenoid (CMS)
Major experiment, along with ATLAS
LHC smashes particles together
CMS is like a giant camera for collisions
Multiple layers for tracking different particles

Data Certification
LHC produces approximately 25 petabytes of data per year
Not all of it is interesting for physics
  “good” vs. “bad” data
Preliminary filters sort out a lot of the data
Still leaves a large amount of unclassified data
  Needs to be manually checked by detector experts

Machine Learning Introduction
Make computers learn a problem without being explicitly programmed
Clever uses of statistics and computer science to analyze patterns in data
Classification vs. Regression
  Both assign numerical values to data samples
  Regression assigns from a continuous set
  Classification assigns from a discrete set
Given 𝑛 data samples
  Each has 𝑞 features (values that describe the object)
  Each sample (usually) has a label

Machine Learning Introduction
Split data into training and testing sets
  Train model with training set
  Feed testing set to model, then see how accurate predictions are
Supervised learning has labels attached to samples
Unsupervised learning does not have labels attached to samples
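The split, train, and test workflow on this slide can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the two-feature data is synthetic, the "good"/"bad" clusters are made up, and the nearest-centroid model is a stand-in chosen for brevity, not one of the models used in the talk. All function names are hypothetical.

```python
import random

def train_test_split(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle indices, then hold out a fraction of the samples for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(samples) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    pick = lambda seq, idx: [seq[i] for i in idx]
    return (pick(samples, train_idx), pick(labels, train_idx),
            pick(samples, test_idx), pick(labels, test_idx))

def fit_centroids(samples, labels):
    """Toy supervised model: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Assign the label of the nearest class centroid."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

def accuracy(centroids, samples, labels):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(centroids, x) == y for x, y in zip(samples, labels))
    return hits / len(labels)

# Synthetic data (assumption): label 1 ("good") near (0, 0), label 0 ("bad") near (5, 5).
rng = random.Random(42)
samples = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)] +
           [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(100)])
labels = [1] * 100 + [0] * 100

x_train, y_train, x_test, y_test = train_test_split(samples, labels)
model = fit_centroids(x_train, y_train)          # train on the training set only
test_accuracy = accuracy(model, x_test, y_test)  # evaluate on unseen samples
```

Because the test samples never enter `fit_centroids`, `test_accuracy` estimates how the model behaves on data it has not seen.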

Preprocessing
Feature scaling
Feature selection
Dimensionality reduction
Given 43 features
  Represent Pt, Eta, Phi, MetPt, MetPhi, Vertices, Cross Section
  Mean, RMS, Q1, Q2, Q3, Q4, Q5
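Feature scaling is the easiest of these steps to show concretely. The sketch below standardizes each feature to zero mean and unit variance, which is one common choice; the slides do not say which scaling was used. The numbers and function names are hypothetical.

```python
import math

def fit_scaler(samples):
    """Compute the per-feature mean and standard deviation from training samples."""
    n = len(samples)
    n_features = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        variance = sum((row[j] - means[j]) ** 2 for row in samples) / n
        stds.append(math.sqrt(variance) or 1.0)  # avoid dividing by zero for constant features
    return means, stds

def transform(samples, means, stds):
    """Rescale every feature to zero mean and unit variance."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in samples]

# Hypothetical values for two features that live on very different scales.
train = [[30.0, 0.1], [50.0, -0.4], [70.0, 0.3]]
means, stds = fit_scaler(train)
scaled = transform(train, means, stds)
```

The scaler is fitted on the training set and then reused unchanged on the testing set, so no information from the test samples leaks into training.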

Support Vector Machines (SVM)
Data with 𝑛 features is in 𝑛-dimensional space
Find an (𝑛−1)-dimensional hyperplane to divide the data
Maximize the distance between the hyperplane and the nearest points (the margin)
Not all data is linearly separable
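The margin idea can be written as an optimization problem. What follows is the standard textbook formulation, not something taken from the slides, and the symbols are assumptions: N labeled samples x_i in n-dimensional space, labels y_i of +1 or -1, hyperplane normal w, offset b, slack variables ξ_i, and penalty C.

```latex
% Hard margin: the hyperplane is w \cdot x + b = 0 and the margin width is 2 / \lVert w \rVert,
% so maximizing the margin is the same as minimizing \lVert w \rVert^2.
\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, N

% Soft margin: slack \xi_i lets some points violate the margin,
% which is needed when the data is not linearly separable.
\min_{w,\,b,\,\xi} \ \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0
```

A larger C penalizes margin violations more heavily, and a smaller C accepts more violations in exchange for a wider margin.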

Stochastic Gradient Descent (SGD)
Gradient descent finds the direction of steepest descent (the negative gradient) at a point and moves that way
Good at finding minima quickly
Can get stuck at a local minimum instead of the global one
SGD updates based on one sample at a time instead of all samples
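A minimal sketch of the one-sample update rule, assuming a toy problem not taken from the talk: fitting a straight line y ≈ w·x + b to noise-free points by minimizing squared error. The learning rate, epoch count, and function name are illustrative choices.

```python
import random

def sgd_linear_fit(xs, ys, learning_rate=0.05, epochs=500, seed=0):
    """Fit y ≈ w*x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)              # visit the samples in a new random order
        for i in order:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b, from this one sample.
            w -= learning_rate * error * xs[i]
            b -= learning_rate * error
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]          # points on the line y = 2x + 1
w, b = sgd_linear_fit(xs, ys)
```

Each update uses a single sample, so one step is cheap. Full-batch gradient descent would instead average the gradient over all five points before moving.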

One-class SVM
Novelty detection
  Train only on good data, test on both
Useful when classes are extremely different in size
  Only have 5% background
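The train-on-good-only workflow can be illustrated without an SVM library. The sketch below is not a one-class SVM: it swaps in a much simpler detector (distance from the centre of the good training data, with a threshold taken from a quantile of the training distances) purely to show the novelty-detection pattern. The synthetic data, the 95% quantile, and the reading of the 5% figure as the share of bad samples in the test set are all assumptions.

```python
import math
import random

def fit_novelty_detector(good_samples, quantile=0.95):
    """Learn the centre of the good data and a distance threshold, from good data only."""
    n_features = len(good_samples[0])
    centre = [sum(row[j] for row in good_samples) / len(good_samples)
              for j in range(n_features)]
    distances = sorted(math.dist(row, centre) for row in good_samples)
    threshold = distances[int(quantile * (len(distances) - 1))]
    return centre, threshold

def is_good(sample, centre, threshold):
    """A sample is accepted if it lies within the learned distance of the centre."""
    return math.dist(sample, centre) <= threshold

rng = random.Random(1)
good_train = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
good_test = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(95)]
bad_test = [[rng.gauss(6, 1), rng.gauss(6, 1)] for _ in range(5)]  # 5% of the test set

centre, threshold = fit_novelty_detector(good_train)   # no bad samples seen in training
good_accepted = sum(is_good(s, centre, threshold) for s in good_test) / len(good_test)
bad_rejected = sum(not is_good(s, centre, threshold) for s in bad_test) / len(bad_test)
```

A one-class SVM plays the same role as `fit_novelty_detector` here, but learns a flexible boundary around the good data instead of a fixed-radius sphere.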

Autoencoders
Neural network
  Computational model loosely inspired by human neurons
Dimensionality reducer
  Finds non-linear correlations within features
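The dimensionality-reduction idea can be sketched with the smallest possible autoencoder: two input features squeezed through a one-number bottleneck and decoded back to two features. This is a deliberate simplification, with no activation functions, so it can only capture linear correlations; the non-linear behaviour mentioned on the slide needs non-linear layers. The data, starting weights, and hyperparameters are hypothetical.

```python
import random

def reconstruct(x, w, v):
    """Encode a 2-feature sample to one number, then decode it back to 2 features."""
    code = w[0] * x[0] + w[1] * x[1]           # encoder: 2 features -> 1 code
    return [v[0] * code, v[1] * code], code    # decoder: 1 code -> 2 features

def mean_squared_error(data, w, v):
    """Average squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        x_hat, _ = reconstruct(x, w, v)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)

def train(data, w, v, learning_rate=0.05, epochs=200, seed=0):
    """Minimize the reconstruction error with one-sample gradient updates."""
    rng = random.Random(seed)
    w, v = list(w), list(v)
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x = data[i]
            x_hat, code = reconstruct(x, w, v)
            err = [x_hat[0] - x[0], x_hat[1] - x[1]]
            back = err[0] * v[0] + err[1] * v[1]   # error propagated back to the code
            for j in range(2):
                v[j] -= learning_rate * err[j] * code   # decoder weight update
                w[j] -= learning_rate * back * x[j]     # encoder weight update
    return w, v

# Synthetic data (assumption): the second feature is always twice the first,
# so one number is enough to describe each sample.
rng = random.Random(7)
data = [[t, 2 * t] for t in (rng.uniform(-1, 1) for _ in range(50))]

w0, v0 = [0.1, 0.1], [0.1, 0.1]
initial_error = mean_squared_error(data, w0, v0)
w, v = train(data, w0, v0)
final_error = mean_squared_error(data, w, v)
```

After training, the single `code` value carries all the information needed to rebuild both features, which is what makes the bottleneck a dimensionality reducer.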

Further Steps
Inspect labels
Further explore deep learning
  Image recognition

Thanks!
Jean Kirsch, Steven Goldfarb, and Thomas Schwarz
Lounsberry Foundation
Federico De Guio
Nural Akchurin
Giovanni Franconi
Filip Sîroky
