Published by Gloria Marshall. Modified over 9 years ago.
1
Finding τ → μ−μ−μ+ Decays at LHCb with Data Mining Algorithms
Yinghua Zhang Huangxun CHEN
2
Outline Background Proposed Methods Evaluation
Learning from Winning Solutions Conclusion
3
Background
- Imperfections of the Standard Model of particle physics
  - Matter-antimatter asymmetry in the Universe
  - The existence of dark matter
- Indication of new physics
  - The τ → μ−μ−μ+ decay, which is forbidden in the Standard Model
  - LHCb experiment: search for the τ → μ−μ−μ+ decay
- A data mining challenge on Kaggle
  - Data sets: from the largest particle accelerator in the world
  - Goal: a classifier that predicts whether a τ → μ−μ−μ+ decay happened, given a list of collision events and their properties
4
Data Description
- Training set
  - Labelled dataset (attribute ‘signal’: 1 for signal events)
  - Signal events are simulated; background events are real data
  - 67,553 training samples, 49 attributes
- Test set
  - Non-labelled dataset (i.e., without the attribute ‘signal’)
  - 855,819 test samples, 46 attributes (all attributes in the training set except ‘mass’, ‘production’ and ‘minANNmuon’)
- Check agreement data set
- Check correlation data set
5
Exploratory Data Analysis
Feature importance: xgboost
6
Three most important features
- p0_track_Chi2Dof
- IPSig
- p2_track_Chi2Dof
[Figures: box plot and histogram of each feature]
7
Three least important features
- dira
- isolatione
- isolationf
[Figures: box plot and histogram of each feature]
8
Outline Background Proposed Methods Evaluation
Learning from Winning Solutions Conclusion
9
Proposed method
- Baseline model
  - Logistic Regression
  - Random Forest
- Boosted Decision Tree + Logistic Regression
- Ensemble method
  - Voting method
  - Stacked generalization
10
Logistic Regression
- A linear classifier
- Problem: given a binary output variable $Y$, model the conditional probability $\Pr(Y = 1 \mid X = x)$ as a function of $x$.
- Logistic regression model:
  $$\log \frac{p(x)}{1 - p(x)} = \beta_0 + x \cdot \beta \;\Rightarrow\; p(x) = \frac{1}{1 + e^{-(\beta_0 + x \cdot \beta)}}$$
- Predict $Y = 1$ when $p \ge 0.5$ and $Y = 0$ when $p < 0.5$.
- Decision boundary: the solution of $\beta_0 + x \cdot \beta = 0$
- Likelihood function:
  $$L(\beta_0, \beta) = \prod_{i=1}^{n} p(x_i)^{y_i} \, (1 - p(x_i))^{1 - y_i}$$
- The unknown parameters $\beta_0$ and $\beta$ are estimated by maximum likelihood.
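The logistic model, the 0.5 decision rule and the (log of the) likelihood function above can be written out directly. The following is a minimal sketch in plain Python; the function names are illustrative and not taken from the presentation, and no fitting procedure is shown.

```python
import math

def predict_proba(x, beta0, beta):
    """p(x) = 1 / (1 + exp(-(beta0 + x . beta)))."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, beta0, beta):
    """Predict Y = 1 when p >= 0.5, otherwise Y = 0."""
    return 1 if predict_proba(x, beta0, beta) >= 0.5 else 0

def log_likelihood(X, y, beta0, beta):
    """Logarithm of L(beta0, beta); maximizing it maximizes the likelihood."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(xi, beta0, beta)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total
```

Maximum likelihood estimation would search for the `beta0` and `beta` that make `log_likelihood` as large as possible on the training set.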
11
Random forest
- An ensemble learning method (Leo Breiman, Adele Cutler)
- Training: construct a multitude of decision trees. Each tree is grown as follows:
  - If the training set has N cases, sample N cases at random, with replacement, from the original data. This sample is the training set for growing the tree.
  - If there are M input variables, at each node m (m << M) variables are selected at random out of the M, and the best split on these m is used to split the node. m is held constant while the forest is grown.
  - Each tree is grown to the largest extent possible. There is no pruning.
- Test: output the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees
- Advantage: counters decision trees' habit of overfitting to the training set
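The randomization and aggregation steps above can be sketched without a tree learner. This is an illustrative sketch only: the helper names are hypothetical, and growing the individual trees is left out.

```python
import random
from collections import Counter

def bootstrap_sample(n_cases, rng):
    """Indices of N cases drawn at random with replacement."""
    return [rng.randrange(n_cases) for _ in range(n_cases)]

def candidate_features(n_features, m, rng):
    """m distinct feature indices (m << M) considered for one node split."""
    return rng.sample(range(n_features), m)

def forest_classify(tree_predictions):
    """Classification: the mode of the individual trees' classes."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_regress(tree_predictions):
    """Regression: the mean of the individual trees' predictions."""
    return sum(tree_predictions) / len(tree_predictions)

rng = random.Random(0)                      # fixed seed for repeatability
rows = bootstrap_sample(10, rng)            # training rows for one tree
split_candidates = candidate_features(46, 6, rng)
```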
12
Proposed method
- Baseline model
  - Logistic Regression
  - Random Forest
- Boosted Decision Tree + Logistic Regression
- Ensemble method
  - Voting method
  - Stacked generalization
13
GBDT+LR
- The most important thing is to have the right features.
- Widely used in industry (Facebook and Tencent)
- He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., ... & Candela, J. Q. (2014, August). Practical lessons from predicting clicks on ads at Facebook. In Proceedings of 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 1-9). ACM.
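In the approach described by He et al. (2014), each boosted tree acts as a feature transform: the index of the leaf a sample falls into is one-hot encoded, and the concatenated encodings are fed to a linear classifier. The sketch below illustrates only that encoding step. The two "trees" are hypothetical hand-written rules standing in for a trained boosted model, and the logistic regression weights are assumed to be given.

```python
import math

# Hypothetical stand-ins for trained trees: each maps a sample to a leaf index.
def tree_a(x):                  # 3 leaves
    if x[0] < 0.5:
        return 0
    return 1 if x[1] < 2.0 else 2

def tree_b(x):                  # 2 leaves
    return 0 if x[1] < 1.0 else 1

TREES = [(tree_a, 3), (tree_b, 2)]

def leaf_one_hot(x):
    """Concatenate the one-hot encoded leaf index of every tree."""
    features = []
    for tree, n_leaves in TREES:
        one_hot = [0] * n_leaves
        one_hot[tree(x)] = 1
        features.extend(one_hot)
    return features

def lr_score(x, beta0, beta):
    """Logistic regression applied to the leaf encoding instead of raw x."""
    z = beta0 + sum(b * f for b, f in zip(beta, leaf_one_hot(x)))
    return 1.0 / (1.0 + math.exp(-z))
```

For example, `leaf_one_hot([0.7, 1.5])` lands in leaf 1 of both trees, so the linear model sees a 5-dimensional binary vector rather than the two raw inputs.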
14
Proposed method
- Baseline model
  - Logistic Regression
  - Random Forest
- Boosted Decision Tree + Logistic Regression
- Ensemble method
  - Voting method
  - Stacked generalization
15
Voting
- Ensemble existing model predictions
- Properties
  - No need to retrain a model
  - Works better when the ensembled model predictions have low correlation
  - Usually improves when more ensemble members are added
- Types of voting
  - Majority vote
  - Weighted majority vote: give a better model more weight in the vote
  - Averaging (bagging): take the mean of the individual model predictions
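The three types of voting listed above can each be expressed in a few lines. This is a minimal sketch with illustrative function names; the inputs are the per-model predictions for a single event.

```python
from collections import Counter

def majority_vote(labels):
    """Return the class predicted by the most models."""
    return Counter(labels).most_common(1)[0][0]

def weighted_majority_vote(labels, weights):
    """Each model's vote counts with its weight; better models weigh more."""
    totals = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

def average(probabilities):
    """Mean of the individual models' predicted probabilities."""
    return sum(probabilities) / len(probabilities)
```

With labels `[1, 0, 0]`, a plain majority vote returns 0, while weights `[3.0, 1.0, 1.0]` let the first, more trusted model win the weighted vote.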
16
Stacked Generalization
- Introduced by Wolpert in 1992
- Basic idea
  - Use a pool of base classifiers
  - Use another classifier to combine their predictions
- A stacker model gets more information on the problem space by using the first-stage predictions as features
- Goal: reduce the generalization error
- 2-fold stacking
  - Split the train set in 2 parts: train_a and train_b
  - Fit a first-stage model on train_a and create predictions for train_b
  - Fit the same model on train_b and create predictions for train_a
  - Finally fit the model on the entire train set and create predictions for the test set
  - Now train a second-stage stacker model on the probabilities from the first-stage model(s)
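The 2-fold procedure above can be sketched as follows. The first-stage model here is a deliberately trivial, hypothetical one (a one-dimensional threshold at the midpoint of the class means) so that the example stays self-contained; any classifier that outputs probabilities could take its place.

```python
import math

def fit_toy_model(xs, ys):
    """Hypothetical first-stage model: threshold at the midpoint of class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def predict_proba(threshold, xs):
    """Probability of class 1 grows as x moves above the threshold."""
    return [1.0 / (1.0 + math.exp(-(x - threshold))) for x in xs]

def two_fold_stacking_features(train_x, train_y, test_x):
    half = len(train_x) // 2
    a_x, a_y = train_x[:half], train_y[:half]      # train_a
    b_x, b_y = train_x[half:], train_y[half:]      # train_b
    preds_b = predict_proba(fit_toy_model(a_x, a_y), b_x)  # fit on a, predict b
    preds_a = predict_proba(fit_toy_model(b_x, b_y), a_x)  # fit on b, predict a
    # Refit on the entire train set to create the test-set predictions.
    test_preds = predict_proba(fit_toy_model(train_x, train_y), test_x)
    return preds_a + preds_b, test_preds

train_x = [0.0, 3.0, 1.0, 4.0]
train_y = [0, 1, 0, 1]
oof_preds, test_preds = two_fold_stacking_features(train_x, train_y, [2.0])
```

`oof_preds` holds one out-of-fold probability per training row, in the original row order. A second-stage stacker would be trained on these values (one column per first-stage model) and then applied to `test_preds`.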
17
Outline Background Proposed Methods Evaluation
Learning from Winning Solutions Conclusion
18
Metric: Weighted AUC
19
Parameter Tuning
20
Ensemble
21
Outline Background Proposed Methods Evaluation
Learning from Winning Solutions AUC=1.0000 Public repository Conclusion
22
Feature Engineering
- The significant improvement: discovering the mass calculation
- Projection of the momentum onto the z-axis for each small particle:
  $$p0_{pz} = \sqrt{p0_{p}^{2} - p0_{pt}^{2}}, \quad p1_{pz} = \sqrt{p1_{p}^{2} - p1_{pt}^{2}}, \quad p2_{pz} = \sqrt{p2_{p}^{2} - p2_{pt}^{2}}$$
- Sum them:
  $$pz = p0_{pz} + p1_{pz} + p2_{pz}$$
- Find the full momentum $p$:
  $$p = \sqrt{pz^{2} + pt^{2}}$$
- Calculate the velocity:
  $$speed = \frac{FlightDistance}{LifeTime}$$
- Calculate the mass:
  $$new\ mass = \frac{p}{speed}$$
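The chain of formulas on this slide maps to a short function. This is a sketch under assumptions: the argument names are illustrative, the square roots follow from the usual relation p² = pz² + pt², and units are taken as given in the data set.

```python
import math

def new_mass(p0_p, p0_pt, p1_p, p1_pt, p2_p, p2_pt,
             pt, flight_distance, life_time):
    """Mass estimate from momenta, flight distance and lifetime."""
    # z-projection of the momentum for each of the three particles, summed
    pz = sum(math.sqrt(p * p - t * t)
             for p, t in ((p0_p, p0_pt), (p1_p, p1_pt), (p2_p, p2_pt)))
    p = math.sqrt(pz * pz + pt * pt)        # full momentum
    speed = flight_distance / life_time     # velocity
    return p / speed                        # mass = momentum / velocity

# Worked example with round numbers: pz = 4 + 3 + 5 = 12, p = 13, speed = 2
example_mass = new_mass(5.0, 3.0, 5.0, 4.0, 13.0, 12.0,
                        pt=5.0, flight_distance=2.0, life_time=1.0)
```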
23
Classifiers
- XGB1
  - Small binary logistic xgboost (five trees)
  - Good AUC, medium KS error, not-so-bad Cramer-von Mises error
- XGB2
  - Satisfactory AUC, medium KS error
  - Very low Cramer-von Mises error (uses the geometric mean of the models)
- XGB3
  - Simple XGBoost with 700 trees and bagging
  - Good AUC, high KS error, very low Cramer-von Mises error
- XGB4
  - Small forest (three trees)
  - Bagging and cutting-edge parameters
24
Corrected mass and new feature
- ‘new mass’ problem: it correlates poorly with the real mass and generates both false-positive and false-negative errors near the signal/background border
- Predict the new mass error: XGBoost with almost three thousand trees and all features
- Calculate two new features:
  - new_mass_delta = new_mass - new_mass2
  - new_mass_ratio = new_mass / new_mass2
25
More classifiers
- XGB5
  - Heavy XGB with 1500 trees
  - With all new features: new_mass2, new_mass_delta, new_mass_ratio
  - Bagging
  - Very high AUC, high KS error, high Cramer-von Mises error
- Neural Network
  - One DenseLayer with 8 neurons
- Final combination:
  𝐹𝑖𝑟𝑠𝑡= 𝑋𝐺𝐵 𝑋𝐺𝐵2 2 ∗0.5
  𝑆𝑒𝑐𝑜𝑛𝑑=(𝑋𝐺𝐵1 ∗ 𝑋𝐺𝐵 0.85 ∗ 𝑋𝐺𝐵 𝑋𝐺𝐵2 ∗ 𝑋𝐺𝐵4 900 ∗0.85+ 𝑋𝐺𝐵 ∗2)/3.85
  $$Final = 0.5 \cdot second^{3.9} + 0.2 \cdot first^{0.6} + 0.0001 \cdot second^{0.2} \cdot first^{0.01}$$
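The Final expression on this slide is a fixed blend of powers of the two intermediate scores. A minimal sketch, assuming `first` and `second` are already-computed scores in [0, 1]; how they are built from the individual classifiers is not reproduced here.

```python
def final_blend(first, second):
    """Final = 0.5*second^3.9 + 0.2*first^0.6 + 0.0001*second^0.2*first^0.01."""
    return (0.5 * second ** 3.9
            + 0.2 * first ** 0.6
            + 0.0001 * second ** 0.2 * first ** 0.01)
```

The large exponent on `second` pushes all but its most confident scores toward zero, while the small exponents on the other terms keep them nearly flat, so the ranking near the top is driven mostly by `second`.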
26
Outline Background Proposed Methods Evaluation
Learning from Winning Solutions Conclusion
27
Conclusion
- Learn data mining tools
  - Data visualization: ggplot2 in R
  - Machine learning packages: sklearn, pandas, xgboost
- A better understanding of classification algorithms
  - Logistic Regression
  - Random Forest
  - Gradient Boosting Decision Tree
  - Ensemble methods
- Learn from the winning solution
  - Domain knowledge and feature engineering
28
Thanks for your attention!
Check out our code here