1
Using decision trees and their ensembles for analysis of NIR spectroscopic data
WSC-11, Saint Petersburg, 2018. In the light of the morning session on superresolution
2
Outline
3
Outline: Why decision trees? · What decision trees are? · Decision trees ensembles · Cases (Tecator, Olives) · Conclusions
4
Why decision trees? Why not?
5
But why decision trees? Kaggle CEO and Founder Anthony Goldbloom:
“…in the history of Kaggle competitions, there are only two Machine Learning approaches that win competitions: Handcrafted and Neural Networks.” “…It used to be random forest that was the big winner, but over the last six months a new algorithm called XGBoost has cropped up, and it’s winning practically every competition in the structured data category.”
6
Why NIR spectroscopic data?
When can a linear regression be better than decision-tree methods? When the relationship between X and y is fully linear; when there is a very large number of features with a low S/N ratio; when covariate shift is likely.
7
Outline: Why decision trees? · What decision trees are? · Decision trees ensembles · Cases (Tecator, Olives) · Conclusions
8
What decision trees are?
(Figure: a toy decision tree with the splits "Drinks beer?", "Knows statistics?" and "Steals ideas from statisticians?" leading to the leaves "Chemometrician" / "Not chemometrician".)
9
Decision trees for numeric variables
10
Decision trees for numeric variables
Where are the other variables? At every split the best variable is used; the number of splits (tree depth) is limited; the efficiency of a split is the reduction of misclassification error.
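To make the split criterion concrete, here is a minimal sketch (not the presenter's code) of the exhaustive search behind a single split: every variable and every observed value are tried as a threshold, and the "efficiency" is the reduction of misclassification error. It assumes integer-coded class labels.

```python
import numpy as np

def misclassification(y):
    """Error of a node that predicts its majority class (y: integer class labels)."""
    return 0.0 if len(y) == 0 else 1.0 - np.bincount(y).max() / len(y)

def best_split(X, y):
    """Find the (variable, threshold) pair with the largest error reduction."""
    parent, n = misclassification(y), len(y)
    best = (None, None, 0.0)                      # (variable index, threshold, gain)
    for j in range(X.shape[1]):                   # every variable is a candidate...
        for t in np.unique(X[:, j]):              # ...and every observed value a threshold
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            child = (len(left) * misclassification(left) +
                     len(right) * misclassification(right)) / n
            if parent - child > best[2]:          # "efficiency" = error reduction
                best = (j, t, parent - child)
    return best
```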
11
Decision trees for numeric variables
How many splits? Limit the minimum number of objects in each bucket; limit the maximum tree size (depth / number of splits); or grow a big tree and prune all inefficient splits.
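The three strategies on this slide have direct counterparts in scikit-learn's tree parameters; the values below are illustrative, not settings from the talk.

```python
from sklearn.tree import DecisionTreeClassifier

small_buckets = DecisionTreeClassifier(min_samples_leaf=5)   # limit objects per bucket
limited_depth = DecisionTreeClassifier(max_depth=3)          # limit the tree size directly
pruned_tree   = DecisionTreeClassifier(ccp_alpha=0.01)       # grow big, then cost-complexity pruning
```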
16
Decision trees for numeric variables
How many splits? Limit the minimum number of objects in each bucket; limit the maximum tree size (depth / number of splits); or grow a big tree and prune all inefficient splits. Use cross-validation to calculate the errors. (Figure: the successive splits reduce the error by 50%, 44% and 2%, down to 4%.)
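One way to follow the cross-validation advice is to tune the pruning strength by CV error; a sketch on placeholder data (the real calibration set would replace make_classification):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # placeholder data

alphas = np.logspace(-4, -1, 20)                  # candidate pruning strengths
cv_error = [1 - cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                                X, y, cv=10).mean() for a in alphas]
best_alpha = alphas[int(np.argmin(cv_error))]     # alpha with the lowest CV misclassification
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```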
17
Decision trees regression
Variable importance: is calculated for each variable individually; takes into account the role of a variable in different splits; is accumulated across all splits and normalized.
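scikit-learn's impurity-based importances follow the same recipe (accumulated over all splits where a variable is used, then normalized to sum to one); a sketch on placeholder regression data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=50, n_informative=5, random_state=0)  # placeholder

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
importance = tree.feature_importances_        # accumulated across splits, sums to 1
top10 = np.argsort(importance)[::-1][:10]     # indices of the ten most important variables
```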
18
Decision trees regression
The response variable is split into several bins; splits are chosen to minimize the variance in each node.
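For regression the same machinery applies with a variance (squared-error) split criterion; a minimal sketch, again on placeholder data:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=50, noise=5.0, random_state=0)  # placeholder data

# The default split criterion minimizes the squared error (variance) within the child
# nodes; each terminal node ("bin") predicts the mean response of its objects.
reg = DecisionTreeRegressor(min_samples_leaf=10, random_state=0).fit(X, y)
y_hat = reg.predict(X)
```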
19
Outline: Why decision trees? · What decision trees are? · Decision trees ensembles · Cases (Tecator, Olives) · Conclusions
20
Decision trees ensembles
Ensemble learning: combine several models together. A group of weak learners can perform better together; ensembles decrease variance and make predictions more stable and reliable.
21
Decision trees ensembles
Bagging: create N random subsets (sampling with replacement); train a model on every subset (in parallel); use a simple average for prediction. Example: random forest. Boosting: train a model on a random subset; make N better models by using new subsets, drawn randomly with replacement (sequential); use a weighted average for prediction. Example: gradient boosting.
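Both recipes are available off the shelf; a sketch contrasting them on placeholder data (random forest for bagging, gradient boosting for boosting):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=50, noise=5.0, random_state=0)  # placeholder data

# Bagging: N trees fitted independently on bootstrap subsets, predictions averaged
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

# Boosting: trees fitted sequentially, each one correcting the current ensemble,
# combined with weights (the learning rate)
gb = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, random_state=0).fit(X, y)
```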
22
Outline: Why decision trees? · What decision trees are? · Decision trees ensembles · Cases (Tecator, Olives) · Conclusions
23
Prediction of fat content in chopped meat samples by NIR spectra
Tecator: 100 predictors (NIR spectra from a Tecator Infratec Food and Feed Analyzer, 850–1050 nm); 215 measurements (172 for calibration and 43 for test).
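A possible way to reproduce the single-tree model on this data; the file name and column layout are assumptions, not details given in the talk:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical file: 100 absorbance columns followed by the fat content, one row per sample
data = np.loadtxt("tecator.csv", delimiter=",")
X, y = data[:, :100], data[:, 100]
X_cal, y_cal = X[:172], y[:172]               # 172 calibration samples
X_test, y_test = X[172:], y[172:]             # 43 test samples

tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X_cal, y_cal)
rmsep = np.sqrt(np.mean((tree.predict(X_test) - y_test) ** 2))
print(f"RMSEP: {rmsep:.2f}")
```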
24
Single tree — predictions
25
Single tree — the tree
26
Single tree — variable importance
27
Single tree — variable selection
29
Random forest — predictions
30
Random forests — importance of variables
31
Random forests — variable selection
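Variable selection with a forest usually means ranking the wavelengths by importance and refitting on the top ones; a sketch continuing from the Tecator example above (the cut-off of 20 wavelengths is arbitrary):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# X_cal, y_cal, X_test, y_test come from the Tecator sketch above
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_cal, y_cal)
keep = np.argsort(rf.feature_importances_)[::-1][:20]      # 20 most important wavelengths
rf_selected = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_cal[:, keep], y_cal)
rmsep = np.sqrt(np.mean((rf_selected.predict(X_test[:, keep]) - y_test) ** 2))
```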
32
Outline: Why decision trees? · What decision trees are? · Decision trees ensembles · Cases (Tecator, Olives) · Conclusions
33
Olives
34
Single tree — the tree and splits
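For the olives case the same tools apply in classification mode; a minimal sketch with placeholder data standing in for the olive spectra and their class labels (the actual data loading is not shown in the talk):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=120, n_features=100, n_classes=3,
                           n_informative=10, random_state=0)   # placeholder for the olive spectra

single_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)   # the tree and splits
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)   # ensemble counterpart
```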
35
Single tree — classification
36
Random forest — classification
37
Variable importance
38
Outline: Why decision trees? · What decision trees are? · Decision trees ensembles · Cases (Tecator, Olives) · Conclusions
39
Conclusions: “The bottom line is: You can spend 3 hours playing with the data, generating features and interaction variables and get a 77% r-squared; and I can ‘from sklearn.ensemble import RandomForestRegressor’ and in 3 minutes get an 82% r-squared.”
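The quoted baseline really is only a few lines; a sketch assuming a calibration/test split is already at hand (the variable names are placeholders):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# X_cal, y_cal, X_test, y_test: any calibration/test split of tabular data (assumed available)
rf = RandomForestRegressor(random_state=0).fit(X_cal, y_cal)
print("test R2:", r2_score(y_test, rf.predict(X_test)))
```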
40
IASIM-2016
41
IASIM-2018, June 17-20 2018, Seattle, WA, USA
Deadlines: March 12 for student scholarships, April 5 for abstracts.