Weka: Free and Open Source ML Suite
Ian Witten & Eibe Frank
University of Waikato
Overview
- Classifiers, regressors, and clusterers
- Multiple evaluation schemes
- Bagging and boosting
- Feature selection
- Experimenter
- Visualizer
- The text is not up to date; the authors welcome additions.
Learning Tasks
- Classification: given examples labelled from a finite domain, generate a procedure for labelling unseen examples.
- Regression: given examples labelled with a real value, generate a procedure for labelling unseen examples.
- Clustering: given a set of examples, partition them into "interesting" groups. What scientists want.
Data Format: IRIS
    @RELATION iris
    @ATTRIBUTE sepallength REAL
    @ATTRIBUTE sepalwidth REAL
    @ATTRIBUTE petallength REAL
    @ATTRIBUTE petalwidth REAL
    @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
    @DATA
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    etc.
General form: @ATTRIBUTE attribute-name followed by REAL or a {list of values}
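To make the file layout concrete, here is a minimal Python sketch of a reader for the subset of the format shown on this slide. It is not Weka's own loader: the function name `parse_arff` and the sample string `IRIS_SAMPLE` are made up for illustration, and it assumes dense, comma-separated rows with only numeric and nominal attributes (no quoting, sparse rows, or missing values).

```python
# Hypothetical helper, not part of Weka: reads only the ARFF subset shown above.
NUMERIC_TYPES = {"REAL", "NUMERIC", "INTEGER"}

IRIS_SAMPLE = """@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""


def parse_arff(text):
    """Return (relation, attributes, rows) for a simple dense ARFF text."""
    relation = None
    attributes = []  # (name, kind): kind is a type name or a list of nominal values
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # blank line or ARFF comment
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for (_, kind), value in zip(attributes, values):
                is_numeric = isinstance(kind, str) and kind in NUMERIC_TYPES
                row.append(float(value) if is_numeric else value)
            rows.append(row)
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.upper()
        if keyword == "@RELATION":
            relation = rest.strip()
        elif keyword == "@ATTRIBUTE":
            name, _, spec = rest.strip().partition(" ")
            spec = spec.strip()
            if spec.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.upper()
            attributes.append((name, kind))
        elif keyword == "@DATA":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(IRIS_SAMPLE)
```

Each parsed row is a list of four floats followed by the class label as a string.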
J48 = Decision Tree
    petalwidth <= 0.6: Iris-setosa (50.0)
    petalwidth > 0.6
    |   petalwidth <= 1.7
    |   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
    |   |   petallength > 4.9
    |   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
    |   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
    |   petalwidth > 1.7: Iris-virginica (46.0/1.0)
Reading the leaves: the first number is the number of instances under the node; the number after the slash, if any, is how many of those are wrong.
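The tree above is just a nested set of threshold tests, which can be written out by hand. The Python sketch below transcribes this particular tree; the function name `classify_iris` is made up for illustration and is not produced by Weka.

```python
def classify_iris(petallength, petalwidth):
    """Hand transcription of the J48 tree shown above (illustrative only)."""
    if petalwidth <= 0.6:
        return "Iris-setosa"
    # petalwidth > 0.6 from here on
    if petalwidth <= 1.7:
        if petallength <= 4.9:
            return "Iris-versicolor"
        # petallength > 4.9
        if petalwidth <= 1.5:
            return "Iris-virginica"
        return "Iris-versicolor"
    # petalwidth > 1.7
    return "Iris-virginica"


# First data row from the IRIS file: petallength 1.4, petalwidth 0.2
first_row_label = classify_iris(1.4, 0.2)
```

Note that the tree uses only two of the four attributes; sepal length and width never appear in a test.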
Cross-validation
    Correctly Classified Instances      143    95.3%
    Incorrectly Classified Instances      7    4.67%
- Default: 10-fold cross-validation, i.e.
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
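The procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Weka's implementation: the names `k_fold_indices`, `cross_validate`, `train_fn`, and `predict_fn` are hypothetical, the folds are formed by a simple shuffle without stratifying by class, and correct predictions are pooled over all folds (which equals the average of the per-fold accuracies when the folds are the same size).

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(examples, labels, train_fn, predict_fn, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    folds = k_fold_indices(len(examples), k, seed)
    correct = 0
    for held_out, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        model = train_fn([examples[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        correct += sum(predict_fn(model, examples[j]) == labels[j] for j in test_idx)
    return correct / len(examples)


# A trivial baseline learner, used only to exercise the loop.
def train_majority(examples, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, example):
    return model


toy_examples = list(range(30))
toy_labels = ["a"] * 20 + ["b"] * 10
baseline_accuracy = cross_validate(toy_examples, toy_labels,
                                   train_majority, predict_majority, k=10)
```

Every instance is used for testing exactly once, so the estimate uses all of the data while never testing on an instance the model was trained on.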
J48 Confusion Matrix
Old data set from statistics: 50 of each class
      a   b   c   <-- classified as
     49   1   0 | a = Iris-setosa
      0  47   3 | b = Iris-versicolor
      0   3  47 | c = Iris-virginica
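The summary figures can be recomputed from the matrix itself: rows are the actual classes, columns are the predicted ones, and the diagonal holds the correct predictions. The Python sketch below does that arithmetic; the function names are made up for illustration.

```python
LABELS = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows = actual class, columns = class assigned by the classifier.
CONFUSION = [
    [49, 1, 0],
    [0, 47, 3],
    [0, 3, 47],
]


def accuracy(matrix):
    """Fraction of all instances that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def recall(matrix, i):
    """Of the instances that really are class i, the fraction labelled i."""
    return matrix[i][i] / sum(matrix[i])


def precision(matrix, i):
    """Of the instances labelled i, the fraction that really are class i."""
    return matrix[i][i] / sum(row[i] for row in matrix)


overall = accuracy(CONFUSION)  # 143 correct out of 150
per_class = {
    name: (precision(CONFUSION, i), recall(CONFUSION, i))
    for i, name in enumerate(LABELS)
}
```

The diagonal sums to 49 + 47 + 47 = 143, which agrees with the 143 correctly classified instances reported by cross-validation.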
Other Evaluation Schemes
- Leave-one-out cross-validation
  - Cross-validation where n = number of training instances
- Specific train and test set
  - Allows for exact replication
  - OK if the train and test sets are large, e.g. in the 10,000 range
Bootstrap sampling
- Randomly select n instances, with replacement, from the n available
- Expect about 2/3 of the distinct instances to be chosen for training
  - Probability that a given instance is not chosen = (1 - 1/n)^n ≈ 1/e ≈ 0.368
- Test on the remainder
- Repeat about 30 times and average
- Avoids partition bias
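A short Python sketch makes the 1/e figure easy to check numerically. The names `bootstrap_split` and `prob_not_chosen` are hypothetical, and the choice of n = 1000 and a fixed seed is only for the demonstration.

```python
import math
import random


def prob_not_chosen(n):
    """Probability that one given instance is missed by all n draws."""
    return (1 - 1 / n) ** n


def bootstrap_split(n, rng):
    """Draw n indices with replacement for training; test on the ones never drawn."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


rng = random.Random(0)  # fixed seed so the demonstration is repeatable
n = 1000
repeats = 30
fractions = []
for _ in range(repeats):
    train, test = bootstrap_split(n, rng)
    fractions.append(len(set(train)) / n)  # share of distinct instances used for training

mean_fraction_in_training = sum(fractions) / repeats
limit = 1 / math.e  # value that prob_not_chosen(n) approaches as n grows
```

With these settings the mean training fraction should come out near 1 - 1/e, about 0.632, which is where the "about 2/3" rule of thumb comes from. The training list keeps duplicates, so it still has n entries even though only about 63% of the instances are distinct.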