Reduce Instrumentation Predictors Using Random Forests. Presented by Bin Zhao, Department of Computer Science, University of Maryland, May 3, 2005.

2 Motivation
Crash report – too late to collect program information until the program crashes
Testing – large number of test cases. Can we focus on the failing cases?

3 Motivation – failure prediction
Instrument the program to monitor its behavior
Predict whether the program is going to fail
Collect program data if the program is predicted to be likely to fail
Stop running the test if the program is not likely to fail

4 The problem
Large number of instrumentation predictors
Which instrumentation predictors should be picked?

5 The questions to answer
Can a good model be found for predicting failing runs based on all available data?
Can an equally good model be created based on a random selection of k% of the predictors?

6 Experiment
Instrumentation on a calculator program: 295 predictors
Instrumentation data collected every 50 milliseconds
100 runs – 81 successes, 19 failures
Predictor-subset sizes: 275, 250, 225, 200, 175, 150, 125, 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10

7 Sample data
Columns: Run, Run Res, Rec0x40bda0, DataItem3, MSF-0x40bda0, MSF-DataItem3
Pass runs: runs 1 – 5, all marked pass
Failure runs: runs 10 – 14, all marked fail
(the numeric predictor values in the table did not survive the transcript)

8 Background – Random Forests
Many classification trees
Each tree gives a classification – a vote
The classification with the most votes wins
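The voting scheme can be sketched in a few lines. The talk's experiments used R's randomForest; this pure-Python snippet, with stand-in "trees", only illustrates majority voting:

```python
from collections import Counter

def forest_predict(trees, x):
    """Classify x by majority vote over the individual trees' predictions."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Stand-in "trees": each is just a function mapping a run's data to a label.
trees = [lambda x: "pass", lambda x: "fail", lambda x: "pass"]
print(forest_predict(trees, {"DataItem3": 0.7}))  # prints "pass" (2 of 3 votes)
```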

9 Background – Random Forests
A training set is needed to grow the forest
M predictors are randomly selected at each node to split the node (mtry)
About one-third of the training data is left out of each tree's sample (the out-of-bag, or OOB, data) and is used to estimate the error
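Growing each tree on a bootstrap sample is what leaves roughly one third of the runs out of bag. A minimal sketch of one bootstrap draw (function name and seed are illustrative):

```python
import random

def bootstrap_split(n_runs, seed=0):
    """Draw a bootstrap sample (with replacement) of run indices;
    runs never drawn are the out-of-bag (OOB) set for this tree."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_runs) for _ in range(n_runs)]
    oob = sorted(set(range(n_runs)) - set(in_bag))
    return in_bag, oob

in_bag, oob = bootstrap_split(100)
# In expectation about 1/e ≈ 37% of runs end up out of bag.
print(len(in_bag), len(oob))
```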

10 Background – Random Forests
To classify a test run as pass or fail
Sample model estimation: an OOB error rate plus a confusion matrix with "fail", "pass" and "class.error" columns (the numeric values did not survive the transcript)

11 Background – R
Software for data manipulation, analysis, and calculation
Provides scripting capability
Provides an implementation of Random Forests (the randomForest package)

12 Experiment steps
1. Determine which slice of the data to use for modeling and testing
2. Find which parameters (ntree, mtry) affect the model
3. Find the optimal parameter values for all the random models
4. Build the random models by randomly picking N predictors
5. Verify the random models by prediction

13 Find the good data

14 Influential parameters in Random Forests
Two candidate parameters – ntree and mtry
Build models fixing one of ntree and mtry and varying the other
ntree: 200 – 1000
mtry: 10 – 295
Only mtry matters
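This one-factor-at-a-time sweep can be sketched as follows; `estimate_error` is a hypothetical stand-in for "grow a forest, read its OOB error":

```python
def sweep(values, fixed, estimate_error):
    """Vary one parameter while holding the other fixed, recording the error."""
    return {v: estimate_error(v, fixed) for v in values}

# Toy error surface that depends only on mtry, mirroring the finding that
# ntree barely matters in the tested range.
errors_by_mtry = sweep(range(10, 296, 50), fixed=500,
                       estimate_error=lambda mtry, ntree: 1.0 / mtry)
best_mtry = min(errors_by_mtry, key=errors_by_mtry.get)
print(best_mtry)  # 260: the largest mtry tried has the lowest toy error
```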

15 Optimal mtry
Need to decide the optimal mtry for each number of predictors (N)
The default mtry is the square root of N
For each number of predictors (295 – 10), candidate values N/2 – 3N were tried

16 Random model
Randomly pick the predictors from the full set of predictors
Generate 5 sets of data for each number of predictors
Use the 5 data sets to build random forest models and average the results
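Drawing the 5 random predictor subsets per size might look like this (function name, predictor names, and seed are illustrative; the actual experiment was done in R):

```python
import random

def random_predictor_sets(predictors, n, n_sets=5, seed=42):
    """Draw n_sets independent random subsets of n predictors
    (no repeats within a subset), one per model to build and average."""
    rng = random.Random(seed)
    return [rng.sample(predictors, n) for _ in range(n_sets)]

all_predictors = ["pred%d" % i for i in range(295)]
subsets = random_predictor_sets(all_predictors, 50)
print(len(subsets), len(subsets[0]))  # 5 subsets of 50 predictors each
```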

17 Random prediction
For each trained random forest, predict on an entirely separate set of test data (records 401 – 450)

18 Random prediction result

19 Analysis of the random model
Why not linear?

20 Important predictors
Random Forests can assign an importance to each predictor – the number of correct votes involving the predictor
Top 20 important predictors:
DataItem11, RT-DataItem11, PC-DataItem11, MSF-DataItem11, AC-DataItem11,
RT-DataItem9, RT-DataItem6, PC-DataItem6, AC-DataItem6, MSF-DataItem9,
MSF-DataItem6, PC-DataItem9, DataItem9, AC-DataItem9, DataItem6,
DataItem12, MSF-DataItem12, AC-DataItem12, RT-DataItem12, PC-DataItem12
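A simplified reading of this vote-based importance – crediting each predictor with the correct votes of every tree that uses it – can be sketched as follows (the real randomForest importance measures are more involved; the counts below are toy numbers):

```python
from collections import Counter

def vote_importance(tree_predictor_sets, correct_votes):
    """Credit every predictor used by a tree with that tree's correct votes."""
    score = Counter()
    for predictors, votes in zip(tree_predictor_sets, correct_votes):
        for p in predictors:
            score[p] += votes
    return score

# Two toy trees: the first uses two predictors and casts 30 correct votes,
# the second uses one predictor and casts 25.
score = vote_importance([["DataItem11", "DataItem6"], ["DataItem11"]], [30, 25])
print(score.most_common(1))  # [('DataItem11', 55)]
```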

21 Top model
Pick the top important predictors from the full set to build the model (top 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10)

22 Top model prediction result

23 Observation and analysis
The fail error rate is still high (> 30%)
Not all the runs fail at the same time
Fail : Success = 19 : 81 (too few fail cases to build a good model)
Some predictors are raw, while others are derived – MSF, AC, PC, RT

24 Improvements
Use only the last N records of each run
For a data set, randomly drop some pass data and duplicate the fail data
Randomly pick the raw predictors, then include all of their derived predictors
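The second improvement – dropping pass runs and duplicating fail runs – can be sketched as follows (the drop fraction, duplication factor, and seed are illustrative choices, not values from the talk):

```python
import random

def rebalance(runs, labels, drop_pass=0.5, dup_fail=2, seed=7):
    """Randomly drop a fraction of passing runs and duplicate each failing
    run, offsetting the 19:81 fail/pass imbalance."""
    rng = random.Random(seed)
    out = []
    for run, label in zip(runs, labels):
        if label == "pass":
            if rng.random() >= drop_pass:      # keep roughly half of the passes
                out.append((run, label))
        else:
            out.extend([(run, label)] * dup_fail)  # every failure counted twice
    return out

labels = ["fail"] * 19 + ["pass"] * 81
balanced = rebalance(range(100), labels)
print(sum(1 for _, l in balanced if l == "fail"))  # 38: each of 19 failures doubled
```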

25 Improved random prediction result

26 Improved top prediction result

27 Conclusions so far
Random selection does not achieve a good error rate
Some predictors have stronger prediction power
A small set of important predictors can achieve a good error rate

28 Future work
Why do some predictors have stronger prediction power?
Is there any pattern among the important predictors?
How many important predictors should we pick?
How soon can we predict a failing run before it actually fails?

29

30 Random model estimation result

31 Top model estimation result

32 Improved random model