Machine Learning practical

Presentation transcript:

Machine Learning practical with Kaggle Alan Chalk

Disclaimer The views expressed in this presentation are those of the author, Alan Chalk, and not necessarily of the Staple Inn Actuarial Society

Over the next 45 minutes? Loss functions, greedy algorithms, performance measurement, feature engineering, generalisation error, bias and variance, penalisation, training and validation curves, hyperparameter tuning, decision trees, random forests, gradient boosting, basis functions, adaptive basis functions (hands-on in R and Python).

Renthop (Two Sigma Connect)

Example renthop posting

What is our task?

Evaluation (loss function)

Example (i) | True class (j) | Prob. high | Prob. medium | Prob. low
     1      | medium         |    0.1     |     0.6      |    0.3
     2      | high           |    0.0     |     0.5      |

Why not use accuracy? Is it possible to do machine learning without performance measurement?
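The Renthop competition is scored with multi-class log loss: the average negative log of the probability assigned to the true class. A minimal sketch of that metric in R, assuming probs is a matrix of predicted probabilities with columns named high/medium/low and y_true holds the true labels (both names are illustrative):

# Multi-class log loss: average negative log probability of the true class,
# with clipping so that log(0) never occurs.
mlogloss <- function(probs, y_true, eps = 1e-15) {
  probs  <- pmin(pmax(probs, eps), 1 - eps)   # clip probabilities away from 0 and 1
  p_true <- probs[cbind(seq_len(nrow(probs)), match(y_true, colnames(probs)))]
  -mean(log(p_true))
}

probs <- matrix(c(0.1, 0.6, 0.3), nrow = 1,
                dimnames = list(NULL, c("high", "medium", "low")))
mlogloss(probs, "medium")   # -log(0.6), about 0.51

Accuracy would ignore how confident the wrong (or right) predictions were, which is why a probability-based loss is used instead.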

Process… Create pipeline, baseline guess, improve guess.

Create pipeline and baseline: read raw data, clean data, create guess, submit on Kaggle (baseline guess). Now go through code 00a, 00b, 01a, 02a, 04__ and 04a.

First R code 00a_Packages.R 00b_Working Directories.R 01a_ReadRawData.R 01b_CleanData.R 04__LoadAndPrepareData.R 04a_BaselinePredictions.R

Bottom of the leaderboard Any comments from people? Someone should notice I got 26.5, NOT 0.79. (The columns were in the wrong order!)
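The 26.5-instead-of-0.79 score above came from writing the probability columns in the wrong order. A minimal sketch of building the submission file in R, assuming pred is a matrix of class probabilities with named columns, listing_id holds the identifiers, and the sample submission expects the columns listing_id, high, medium, low (all of these are assumptions for illustration):

# Name the columns explicitly so a column-order mix-up is impossible.
submission <- data.frame(listing_id = listing_id,
                         high   = pred[, "high"],
                         medium = pred[, "medium"],
                         low    = pred[, "low"])
write.csv(submission, "baseline_submission.csv", row.names = FALSE)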

Decision trees Features: bathrooms, bedrooms, latitude, longitude, price, listing_id. Given the above list of features and starting with all the rentals, which subset would you choose to separate out those of high and low interest? Why did you choose it? You must have had some “inherent loss function”. If we get the computer to go exhaustively through all possible splits to find the best improvement to the loss – which loss function should we choose?

Some decision tree vocab: CART (rpart), C5.0 etc.; split rule (loss function); NP-hard; greedy; over-fitting; complexity parameter.

R code: formulas In an R formula, ~ means “is allowed to depend on”. Our first formula is: interest_level ~ bathrooms + bedrooms + latitude + longitude + price + listing_id (In our code, you will see that this formula is saved in a variable called “fmla_”)
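A minimal sketch of building that formula programmatically in R (the vector name vars_ is illustrative; fmla_ matches the variable name used in the code):

# Build "interest_level ~ bathrooms + bedrooms + ..." from a character vector
vars_ <- c("bathrooms", "bedrooms", "latitude", "longitude", "price", "listing_id")
fmla_ <- as.formula(paste("interest_level ~", paste(vars_, collapse = " + ")))
fmla_
# interest_level ~ bathrooms + bedrooms + latitude + longitude + price + listing_id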

R code: rpart code

library(rpart)

# Grow a deliberately deep classification tree (cp near zero); it will be pruned later
rpart_1 <- rpart(fmla_,
                 data   = dt_all[idx_train1, ],
                 method = "class",
                 cp     = 1e-8)

R code 04b01_rpart.R Go through the R code up to creating a first simple tree and interpreting it.

A first tree for renthop Ask someone to describe the node with highest interest (2 or more bedrooms for less than $1,829 per month)

How can we do better? Better techniques More features (feature engineering) Anything else? Ask audience – we are looking for 2 things – better techniques and more features.

Feature engineering? Based only on the data and files provided: bathrooms, bedrooms, building_id, created, description, display_address, features, latitude, listing_id, longitude, manager_id, photos, price, street_address, interest_level Note: You also have loads and loads of photos. “description” is free format. “features” is a list of words.

Feature engineering? Simple features: price per bedroom, bathroom/bedroom ratio, created hour or day of week. Simplifications of complex features: number of photos, number of words in description. Presence of each listed feature or not: e.g. laundry yes or no. Good value rental.
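A minimal sketch of a few of these features in R, assuming dt_all is the data.table used elsewhere in the code and the column names are the raw fields listed above; whether photos is stored as a list column depends on how the JSON was read, so treat this as illustrative:

library(data.table)

dt_all[, price_per_bedroom := price / pmax(bedrooms, 1)]              # avoid divide-by-zero for studios
dt_all[, bath_bed_ratio    := bathrooms / pmax(bedrooms, 1)]
dt_all[, created_hour      := hour(as.POSIXct(created))]              # hour of day the listing was created
dt_all[, n_photos          := lengths(photos)]                        # photos assumed to be a list column
dt_all[, n_desc_words      := lengths(strsplit(description, "\\s+"))] # word count of the free-format description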

High cardinality features? manager_id, building_id Simplifications: “size” of manager or building, turn into numeric. What else? Talk about approach to credibility – ask audience how many examples for one manager or building – before we rely on that experience rather than the overall model. Statistics – 1000? And what is the shape of the curve? ML says – use the data to find out.
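A minimal sketch of the “size” simplification above, again assuming the data.table dt_all: replace each high-cardinality id with a count of how often it occurs.

# Numeric "size" features: number of listings per manager / per building
dt_all[, manager_size  := .N, by = manager_id]
dt_all[, building_size := .N, by = building_id]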

Leakage features? A key aspect of winning Machine Learning competitions. Where might there be leakage in the data we have been given? Paper: Leakage in Data Mining: Formulation, Detection, and Avoidance (Kaufman, Rosset and Perlich).

Now what? We have loads of features – good. But there is every chance that our decision tree will pick up random noise in the training data (called “variance”). How can we control for this? Cost complexity pruning.
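A minimal sketch of cost complexity pruning with rpart, assuming rpart_1 is the deep tree grown earlier: pick the complexity parameter with the lowest cross-validated error and prune back to it.

# rpart records a cross-validated error (xerror) for each value of cp
cp_table <- rpart_1$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

rpart_pruned <- prune(rpart_1, cp = best_cp)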

R code 02b_FeatureCreation_1/2/3.R 04b_01_rpart.R 04b_02_VariableImportance.R

Training and validation curves

Variable importance

Random Forest Introduce randomness. Why? Bootstrapping and then aggregating the results (“bagging”) How else can we create randomness? Sample the features available to each split OOB error What are our hyper-parameters? number of trees? nodesize? mtry?

R code: random forest code

library(randomForest)

rf_1 <- randomForest(x = dt_train,            # (object name rf_1 is illustrative)
                     y = as.factor(y_train),
                     ntree = 300,
                     nodesize = 1,
                     mtry = 6,
                     keep.forest = TRUE)
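Because each tree is fit on a bootstrap sample, every tree has rows it never saw; those rows give the out-of-bag (OOB) error mentioned on the previous slide. A minimal sketch of reading it from the fitted object (continuing the illustrative rf_1):

rf_1$confusion           # OOB confusion matrix
tail(rf_1$err.rate, 1)   # OOB error after all 300 trees: overall, then one column per class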

R code 04c_RandomForest.R

Gradient boosting Add lots of “weak learners” Create new weak learners by focusing on examples which are incorrectly classified Combine the weak learners using weights which are higher for the better weak learners The weak learners are “adaptive basis functions”

Hyperparameters: learning rate, depth of trees, min child weight, data subsampling, column subsampling. Tuning: grid search, random search, hyperopt.
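Grid search tries every combination; random search simply samples settings, which often explores the important dimensions more efficiently. The deck does its tuning in Python, but here is a minimal random-search sketch in R; evaluate_mlogloss is a hypothetical cross-validation helper, not part of any library:

set.seed(1)
n_trials <- 20
results  <- vector("list", n_trials)

for (i in seq_len(n_trials)) {
  params <- list(eta              = runif(1, 0.01, 0.3),   # learning rate
                 max_depth        = sample(3:10, 1),       # depth of trees
                 min_child_weight = sample(1:10, 1),
                 subsample        = runif(1, 0.5, 1),      # data subsampling
                 colsample_bytree = runif(1, 0.5, 1))      # column subsampling
  results[[i]] <- c(params, loss = evaluate_mlogloss(params))  # hypothetical CV routine
}

best <- results[[which.min(sapply(results, `[[`, "loss"))]]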

Python code

xgb.train(params = param,
          dtrain = xg_train,
          num_boost_round = num_rounds,
          evals = watchlist,
          early_stopping_rounds = 20,
          verbose_eval = False)

Gradient boosting (machines) 04d_GradientBoosting_presentation.ipynb

Over the next 45 minutes? Loss functions, greedy algorithms, performance measurement, feature engineering, generalisation error, bias and variance, penalisation, training and validation curves, hyperparameter tuning, decision trees, random forests, gradient boosting, basis functions, adaptive basis functions (hands-on in R and Python).