Reduce Instrumentation Predictors Using Random Forests
Presented by Bin Zhao, Department of Computer Science, University of Maryland, May 3, 2005
2 Motivation
Crash report – program information is only collected after the program crashes, which is too late.
Testing – a large number of test cases; can we focus on the failing ones?
3 Motivation – failure prediction
Instrument the program to monitor its behavior.
Predict whether the program is going to fail.
Collect program data if the program is predicted to be likely to fail.
Stop running the test if the program is not likely to fail.
4 The problem
Large number of instrumentation predictors.
Which instrumentation predictors should be picked?
5 The questions to answer
Can a good model be found for predicting failing runs based on all available data?
Can an equally good model be created based on a random selection of k% of the predictors?
6 Experiment
Instrumentation on a calculator program: 295 predictors.
Instrumentation data collected every 50 milliseconds.
100 runs – 81 successes, 19 failures.
Predictor subset sizes: 275, 250, 225, 200, 175, 150, 125, 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10
7 Sample data

Pass run
Run  Res   Rec  0x40bda0  DataItem3  MSF-0x40bda0  MSF-DataItem3
1    pass  1    3244      0          3244          0
1    pass  2    3206      0          3206          0
1    pass  3    3232      0          3232          0
1    pass  4    3203      0          3203          0
1    pass  5    3243      0          3243          0

Failure run
Run  Res   Rec  0x40bda0  DataItem3  MSF-0x40bda0  MSF-DataItem3
10   fail  1    3200      0          3200          0
10   fail  2    3200      0          3200          0
10   fail  3    3251      0          3251          0
10   fail  4    3251      0          3251          0
10   fail  5    3248      0          3248          0
8 Background – Random Forests
A random forest consists of many classification trees.
Each tree gives a classification – a vote.
The final classification is the one with the most votes.
9 Background – Random Forests
A training set is needed to grow the forest.
At each node, mtry predictors are randomly selected to split the node.
About one-third of the training data is left out of each tree (out-of-bag, OOB) and used to estimate the error.
10 Background – Random Forests
To classify a test run as pass or fail.
Sample model estimation:

OOB error rate: 0.0044
        fail   pass   class.error
fail     933     17   0.0178947368421053
pass       5   4045   0.00123456790123455
11 Background – R
Software for data manipulation, analysis and calculation.
Provides scripting capability.
Provides an implementation of Random Forests.
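To make the R connection concrete, a minimal sketch of fitting such a pass/fail model with the randomForest package follows. The data file name and the treatment of the Run/Res/Rec columns are assumptions based on the sample-data slide, not the author's actual script.

    # Minimal sketch: fit a pass/fail random forest on the instrumentation records.
    library(randomForest)

    runs <- read.csv("calculator_instrumentation.csv")       # hypothetical file name
    runs$Res <- factor(runs$Res, levels = c("fail", "pass"))  # pass/fail label

    # Everything except the run/record identifiers is treated as a predictor.
    predictors <- setdiff(names(runs), c("Run", "Res", "Rec"))

    rf <- randomForest(x = runs[, predictors], y = runs$Res,
                       ntree = 500, importance = TRUE)
    print(rf)   # reports the OOB error rate and confusion matrix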
12 Experiment steps
1. Determine which slice of the data to use for modeling and testing
2. Find which parameters (ntree, mtry) affect the model
3. Find the optimal parameter values for all the random models
4. Build the random models by randomly picking N predictors
5. Verify the random models by prediction
13 Find the good data
14 Influential parameters in Random Forests
Two possible parameters – ntree and mtry.
Build models by fixing either ntree or mtry and varying the other.
ntree: 200 – 1000; mtry: 10 – 295.
Only mtry matters.
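A sketch of that sweep, reusing the runs data frame and predictors vector from the earlier sketch; the helper below and the specific grid points are illustrative rather than the author's script.

    # Compare settings by the OOB error of the fitted forest.
    oob_error <- function(ntree, mtry) {
      rf <- randomForest(x = runs[, predictors], y = runs$Res,
                         ntree = ntree, mtry = mtry)
      rf$err.rate[ntree, "OOB"]   # OOB error after the last tree
    }

    # Fix mtry at the default (square root of the predictor count) and vary ntree,
    # then fix ntree and vary mtry.
    sapply(seq(200, 1000, by = 200),
           function(n) oob_error(n, floor(sqrt(length(predictors)))))
    sapply(c(10, 50, 100, 200, 295), function(m) oob_error(500, m))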
15 Optimal mtry
Need to decide the optimal mtry for each number of predictors (N).
The default mtry is the square root of N.
For the different numbers of predictors (295 – 10): N/2 – 3N.
16 Random model
Randomly pick the predictors from the full set of predictors.
Generate 5 sets of data for each number of predictors.
Use the 5 sets of data to build random forest models and average the results.
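A sketch of this procedure under the same assumptions as the earlier sketches; the subset sizes are those listed on the experiment slide.

    # For a subset size k: draw 5 random predictor subsets, fit a forest on each,
    # and average the OOB error rates.
    random_subset_error <- function(k, repeats = 5) {
      errs <- replicate(repeats, {
        subset <- sample(predictors, k)
        rf <- randomForest(x = runs[, subset], y = runs$Res, ntree = 500)
        rf$err.rate[nrow(rf$err.rate), "OOB"]
      })
      mean(errs)
    }

    sizes <- c(275, 250, 225, 200, 175, 150, 125, 100, 90, 80,
               70, 60, 50, 40, 30, 25, 20, 15, 10)
    avg_oob <- sapply(sizes, random_subset_error)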
17 Random prediction
For each trained random forest, run prediction on a completely different set of test data (records 401 – 450).
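A sketch of scoring that held-out slice; the row range comes from the slide, and it is assumed the data frame is ordered by record number.

    test_slice <- runs[401:450, ]
    pred <- predict(rf, newdata = test_slice[, predictors])
    table(predicted = pred, actual = test_slice$Res)   # confusion matrix on the test slice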
18 Random Prediction Result
19 Analysis of the random model
Why not linear?
20 Important predictors
Random Forests can assign an importance to each predictor – the number of correct votes involving the predictor.
Top 20 important predictors:
DataItem11, RT-DataItem11, PC-DataItem11, MSF-DataItem11, AC-DataItem11, RT-DataItem9, RT-DataItem6, PC-DataItem6, AC-DataItem6, MSF-DataItem9, MSF-DataItem6, PC-DataItem9, DataItem9, AC-DataItem9, DataItem6, DataItem12, MSF-DataItem12, AC-DataItem12, RT-DataItem12, PC-DataItem12
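A sketch of extracting such a ranking from a fitted forest; note that the randomForest package reports permutation-based measures (mean decrease in accuracy is used here), so the exact importance definition is an assumption rather than the slide's description.

    # importance = TRUE must have been set when the forest was grown.
    imp <- importance(rf, type = 1)   # mean decrease in accuracy
    top20 <- head(rownames(imp)[order(imp, decreasing = TRUE)], 20)
    top20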
21 Top model
Pick the most important predictors from the full set of predictors to build the model (top 100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10).
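A sketch of rebuilding the forest from only the top-k predictors, reusing the importance ranking from the previous sketch.

    top_model_error <- function(k) {
      top_k <- head(rownames(imp)[order(imp, decreasing = TRUE)], k)
      rf_top <- randomForest(x = runs[, top_k], y = runs$Res, ntree = 500)
      rf_top$err.rate[nrow(rf_top$err.rate), "OOB"]
    }
    sapply(c(100, 90, 80, 70, 60, 50, 40, 30, 25, 20, 15, 10), top_model_error)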
22 Top model prediction result
23 Observation and analysis
The fail error rate is still high (> 30%).
Not all the runs fail at the same time.
Fail:Success = 19:81 (too few fail cases to build a good model).
Some predictors are raw, while others are derived – MSF, AC, PC, RT.
24 Improvements
Get the last N records for a particular run.
For a set of data, randomly drop some pass data and duplicate the fail data.
Randomly pick the raw predictors, then include all of their derived predictors.
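A sketch of the first two improvements, assuming the same runs data frame as before; the value of N and the resampling proportions are illustrative, not values given on the slides.

    # Keep only the last N records of each run.
    N <- 10                                              # illustrative choice
    last_n <- do.call(rbind, lapply(split(runs, runs$Run), function(r) tail(r, N)))

    # Down-sample pass records and duplicate fail records to rebalance the classes.
    pass_rows <- last_n[last_n$Res == "pass", ]
    fail_rows <- last_n[last_n$Res == "fail", ]
    balanced <- rbind(pass_rows[sample(nrow(pass_rows), nrow(pass_rows) %/% 2), ],
                      fail_rows[rep(seq_len(nrow(fail_rows)), times = 2), ])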
25 Improved random prediction result
26 Improved top prediction result
27 Conclusion so far
Random selection does not achieve a good error rate.
Some predictors have stronger prediction power.
A small set of important predictors can achieve a good error rate.
28 Future work
Why do some predictors have stronger prediction power?
Is there any pattern among the important predictors?
How many important predictors should we pick?
How soon can we predict a failing run before it actually fails?
30 Random model estimation result
31 Top model estimation result
32 Improved random model