2 Predicting Unix Commands With Decision Tables and Decision Trees Kathleen Durant Third International Conference on Data Mining Methods and Databases September 25, 2002 Bologna, Italy

3 How Predictable Are a User’s Computer Interactions? Command sequences The time of day The type of computer you’re using Clusters of command sequences Command typos

4 Characteristics of the Problem Time sequenced problem with dependent variables Not a standard classification problem Predicting a nominal value rather than a Boolean value Concept shift

5 Dataset Davison and Hirsh – Rutgers University Collected history sessions of 77 different users for 2 – 6 months Three categories of users: professor, graduate, undergraduate Average number of commands per session: 2184 Average number of distinct commands per session: 77

6 Rutgers Study 5 different algorithms were implemented: C4.5, a decision-tree learner; an omniscient predictor; the most recent command just issued; the most frequently used command of the training set; and the longest matching prefix to the current command. Most successful: C4.5, with a predictive accuracy of 38%.
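A minimal sketch (my own, not code from the Rutgers study) of how the three simple baseline predictors above could work, assuming a history is just a list of command strings; all function names are hypothetical:

```python
from collections import Counter

def predict_most_recent(history):
    # Predict the command that was just issued.
    return history[-1] if history else None

def predict_most_frequent(training):
    # Predict the single most frequently used command of the training set.
    return Counter(training).most_common(1)[0][0] if training else None

def predict_longest_prefix(training, current_prefix):
    # Predict the training command sharing the longest prefix with what the
    # user has typed so far (ties broken by frequency).
    best, best_len = None, -1
    for cmd, _count in Counter(training).most_common():
        common = 0
        for a, b in zip(cmd, current_prefix):
            if a != b:
                break
            common += 1
        if common > best_len:
            best, best_len = cmd, common
    return best
```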

7 Typical History Session
961007 20:13:31 green-486 vs100 BLANK
961007 20:13:31 green-486 vs100 vi
961007 20:13:31 green-486 vs100 ls
961007 20:13:47 green-486 vs100 lpr
961007 20:13:57 green-486 vs100 vi
961007 20:14:10 green-486 vs100 make
961007 20:14:33 green-486 vs100 vis
961007 20:14:46 green-486 vs100 vi

8 WEKA System Provides: learning algorithms; a simple format for importing data (the ARFF format); a graphical user interface

9 History Session in ARFF Format
@relation user10
@attribute ct-2 {BLANK,vi,ls,lpr,make,vis}
@attribute ct-1 {BLANK,vi,ls,lpr,make,vis}
@attribute ct0 {vi,ls,lpr,make,vis}
@data
BLANK,vi,ls
vi,ls,lpr
ls,lpr,make
lpr,make,vis
make,vis,vi
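A hedged sketch of how a session could be converted into the windows shown above (the relation and attribute names come from the slide; the helper itself is an assumption about how the preprocessing might be done):

```python
def history_to_arff(commands, relation="user10"):
    """Turn a command sequence into 3-command ARFF windows:
    (command at t-2, command at t-1) -> command at t."""
    values = ",".join(dict.fromkeys(commands))        # attribute values, first-seen order
    classes = ",".join(dict.fromkeys(commands[2:]))   # class values actually seen at t
    lines = [f"@relation {relation}",
             f"@attribute ct-2 {{{values}}}",
             f"@attribute ct-1 {{{values}}}",
             f"@attribute ct0 {{{classes}}}",
             "@data"]
    lines += [f"{commands[i-2]},{commands[i-1]},{commands[i]}"
              for i in range(2, len(commands))]
    return "\n".join(lines)

# An example sequence in the form the slide's @data implies (BLANK marks session start):
print(history_to_arff(["BLANK", "vi", "ls", "lpr", "make", "vis", "vi"]))
```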

10 Learning Techniques
Decision tree using the 2 previous commands as attributes: minimize the size of the tree, maximize information gain
Boosted decision trees: AdaBoost
Decision table: match determined by k nearest neighbors (IBk) or by majority; verification by 10-fold cross validation or by splitting the data into training/test sets
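The study used WEKA's implementations; the following is a rough scikit-learn stand-in (an assumption, not the original setup) showing the same families of learners on toy command windows:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OrdinalEncoder

# Toy data: the two previous commands (ct-2, ct-1) predict the next command (ct0).
session = ["vi", "ls", "lpr", "vi", "make", "vis",
           "vi", "ls", "lpr", "vi", "make", "vis", "vi"]
X_raw = [[session[i - 2], session[i - 1]] for i in range(2, len(session))]
y = [session[i] for i in range(2, len(session))]

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X = enc.fit_transform(X_raw)                 # encode command names as numbers

models = {
    "decision tree": DecisionTreeClassifier(),
    "boosted decision trees (AdaBoost)": AdaBoostClassifier(),
    "1-NN (stand-in for the IBk decision table)": KNeighborsClassifier(n_neighbors=1),
}
for name, model in models.items():
    model.fit(X, y)
    # On real-sized logs, sklearn.model_selection.cross_val_score(model, X, y, cv=10)
    # would mirror the study's 10-fold cross validation.
    nxt = model.predict(enc.transform([["vis", "vi"]]))   # what follows "vis", "vi"?
    print(f"{name}: predicts {nxt[0]}")
```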

11 Learning a Decision Tree
(Diagram: a decision tree over the command values at time = -2 and time = -1, with values such as ls, emacs, pwd, more, make, pine, man, gcc, vi, and dir, predicting the command at time = 0.)

12 Boosting a Decision Tree
(Diagram: a decision tree being combined into a boosted solution set.)

13 Example: Learning a Decision Table with k-Nearest Neighbors (IBk)
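For illustration, a tiny decision-table-style classifier (assumed behaviour only; WEKA's DecisionTable is considerably more elaborate, and the class name here is hypothetical) that looks up the two previous commands exactly and falls back to the nearest stored key, IBk-style, when there is no match:

```python
from collections import Counter, defaultdict

class TinyDecisionTable:
    """Look up (ct-2, ct-1) exactly; if unseen, fall back to the nearest
    stored key (most attribute values in common), IBk-style."""
    def fit(self, X, y):
        self.table = defaultdict(Counter)
        for key, label in zip(map(tuple, X), y):
            self.table[key][label] += 1
        self.majority = Counter(y).most_common(1)[0][0]
        return self

    def predict_one(self, key):
        key = tuple(key)
        if key in self.table:                        # exact table hit
            return self.table[key].most_common(1)[0][0]
        # Fallback: nearest stored key by number of matching positions.
        best = max(self.table, key=lambda k: sum(a == b for a, b in zip(k, key)))
        overlap = sum(a == b for a, b in zip(best, key))
        return self.table[best].most_common(1)[0][0] if overlap else self.majority

table = TinyDecisionTable().fit([["vi", "ls"], ["ls", "lpr"], ["lpr", "make"]],
                                ["lpr", "make", "vis"])
print(table.predict_one(["vi", "ls"]))    # exact match -> lpr
print(table.predict_one(["vi", "lpr"]))   # nearest-key fallback
```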

14 Prediction Metrics Macro-average – average predictive accuracy per person: what was the average predictive accuracy for the users in the study? Micro-average – average predictive accuracy for the commands in the study: what percentage of the commands in the study did we predict correctly?
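In concrete terms (a sketch with hypothetical numbers): the macro-average is the mean of each user's accuracy, while the micro-average pools all commands before dividing.

```python
def macro_micro(per_user):
    """per_user: list of (correct_predictions, total_commands), one entry per user."""
    macro = sum(c / t for c, t in per_user) / len(per_user)            # mean of per-user accuracies
    micro = sum(c for c, _ in per_user) / sum(t for _, t in per_user)  # pooled accuracy over commands
    return macro, micro

# Hypothetical figures: a light user predicted well, a heavy user predicted poorly.
users = [(90, 100), (1000, 4000)]
print(macro_micro(users))   # macro = 0.575, micro is roughly 0.266
```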

15 Macro-average Results

16 Micro-average Results

17 Results: Decision Trees Decision trees – expected results: a compute-intensive algorithm whose predictability results are similar to simpler algorithms; no interesting findings; duplicated the Rutgers study's results

18 Results: AdaBoost AdaBoost – very disappointing: few or no boosting iterations were performed, and only 12 decision trees were boosted. Boosted trees' predictability only increased by 2.4% on average; they correctly predicted 115 more commands than decision trees (out of 118,409 wrongly predicted commands). Very compute-intensive, with no substantial increase in predictability.

19 Results: Decision Tables Decision table – satisfactory results: good predictability, relatively speedy, and validation is done incrementally; a potential candidate for an online system

20 Summary of Prediction Results The IBk decision table produced the highest micro-average; boosted decision trees produced the highest macro-average. The difference was negligible: 1.37% for the micro-average, 2.21% for the macro-average.

21 Findings IBk decision tables can be used in an online system: not a compute-intensive algorithm, and predictability is as good as or better than decision trees. Consistent results were achieved on fairly small log sessions (> 100 commands); no improvement in prediction for larger log sessions (> 1000 commands), due to concept shift.

22 Summary of Benefits Automatic typo correction. Savings in keystrokes is on average 30%, given that the average command length is 3.77 characters and a predicted command can be issued with 1 keystroke.
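As a back-of-envelope reconstruction of that figure (my own assumption, plugging in the roughly 38% accuracy quoted earlier rather than anything stated on this slide):

```python
avg_len = 3.77           # average command length in characters (from the slide)
accuracy = 0.38          # approximate predictive accuracy (from the Rutgers comparison)
keystrokes_if_right = 1  # a correctly predicted command is issued with one keystroke

expected_keystrokes = accuracy * keystrokes_if_right + (1 - accuracy) * avg_len
savings = 1 - expected_keystrokes / avg_len
print(f"expected savings is about {savings:.0%}")   # about 28%, near the ~30% quoted
```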

23 Questions

24 AdaBoost Description
The algorithm. Let D_t(i) denote the weight of example i in round t.
Initialization: assign each example (x_i, y_i) ∈ E the weight D_1(i) := 1/n.
For t = 1 to T: call the weak learning algorithm with example set E and weights given by D_t; get a weak hypothesis h_t : X → Y; update the weights of all examples.
Output the final hypothesis, generated from the hypotheses of rounds 1 to T.
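A compact sketch of the loop described above, written for the binary-label case for clarity (the study's multi-class command prediction would need a multi-class variant such as AdaBoost.M1; the function names are hypothetical):

```python
import math

def adaboost(examples, labels, weak_learner, T):
    """examples: list of x_i; labels: list of y_i in {-1, +1};
    weak_learner(examples, labels, D) must return a callable hypothesis h."""
    n = len(examples)
    D = [1.0 / n] * n                                  # D_1(i) := 1/n
    hypotheses = []
    for _ in range(T):
        h = weak_learner(examples, labels, D)          # weak hypothesis h_t
        err = sum(D[i] for i in range(n) if h(examples[i]) != labels[i])
        err = min(max(err, 1e-10), 1 - 1e-10)          # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)        # weight of hypothesis h_t
        # Update and renormalise the example weights to get D_{t+1}.
        D = [D[i] * math.exp(-alpha * labels[i] * h(examples[i])) for i in range(n)]
        Z = sum(D)
        D = [d / Z for d in D]
        hypotheses.append((alpha, h))

    def final_hypothesis(x):
        # Weighted vote of the hypotheses from rounds 1 to T.
        return 1 if sum(a * h(x) for a, h in hypotheses) >= 0 else -1
    return final_hypothesis
```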

25 Complete Set of Results
(Chart: macro-average and micro-average predictive accuracy, on a 35–43% scale, for the decision table using IBk, the decision table using majority match, the decision table using percentage split, decision trees, and AdaBoost.)

26 Learning a Decision Tree
(Diagram: a decision tree branching on the command at time t-2 (e.g. ls, make) and then on the command at t-1 (e.g. make, dir, grep, ls, pwd), leading to predicted commands at time t such as emacs, ls, pwd, and grep.)

