Weka: Free and Open Source ML Suite (Ian Witten & Eibe Frank)

Presentation transcript:

Weka: Free and Open Source ML Suite
Ian Witten & Eibe Frank, University of Waikato

Overview
- Classifiers, regressors, and clusterers
- Multiple evaluation schemes
- Bagging and boosting
- Feature selection
- Experimenter
- Visualizer
The accompanying text is not fully up to date; the authors welcome additions.

Learning Tasks
- Classification: given examples labelled from a finite domain, generate a procedure for labelling unseen examples.
- Regression: given examples labelled with a real value, generate a procedure for labelling unseen examples.
- Clustering: given a set of examples, partition them into "interesting" groups.
This is what scientists want.

Data Format: IRIS

@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
etc.

General form: @ATTRIBUTE attribute-name REAL, or a list of nominal values in braces.
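
A minimal sketch of loading such an ARFF file through the Weka Java API; the file name iris.arff is an assumption (the dataset ships with the Weka distribution):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadIris {
    public static void main(String[] args) throws Exception {
        // DataSource reads ARFF (and several other formats) into an Instances object.
        Instances data = new DataSource("iris.arff").getDataSet();
        // ARFF does not mark the class attribute; by convention it is the last one.
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Loaded " + data.numInstances() + " instances, "
                + data.numAttributes() + " attributes.");
    }
}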

J48 = Decision Tree

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

The first number in parentheses is the count of training instances reaching that leaf; the number after the slash is how many of them the leaf misclassifies.
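
A hedged sketch of producing a tree like this with J48, Weka's C4.5 implementation, continuing from the loading example above:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Tree {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        J48 tree = new J48();         // C4.5 decision tree learner
        tree.buildClassifier(data);   // induce the tree from all 150 instances
        System.out.println(tree);     // prints a text tree in the format shown above
    }
}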

Cross-validation

Correctly Classified Instances     143    95.33 %
Incorrectly Classified Instances     7     4.67 %

The default is 10-fold cross-validation:
- Split the data into 10 equal-sized pieces
- Train on 9 pieces and test on the remainder
- Repeat for all 10 choices of held-out piece and average the results
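
A minimal sketch of running 10-fold cross-validation in Weka (the random seed 1 is an arbitrary choice):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        Evaluation eval = new Evaluation(data);
        // 10 folds: train on 9/10 of the data, test on the held-out tenth, repeat.
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());  // counts and percentages as above
    }
}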

J48 Confusion Matrix

Old data set from statistics: 50 of each class.

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  3 47 |  c = Iris-virginica
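
Continuing the cross-validation sketch above, the matrix can be printed from the same Evaluation object:

// After eval.crossValidateModel(...) in the sketch above:
System.out.println(eval.toMatrixString());  // rows = true class, columns = predicted class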

Other Evaluation Schemes
- Leave-one-out cross-validation: cross-validation where the number of folds equals the number of training instances.
- Specific train and test sets: allows exact replication of results; fine when both sets are large, e.g. on the order of 10,000 instances.
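
Both schemes map directly onto the Weka API. A sketch, assuming a separate test file named iris-test.arff (a hypothetical file name):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OtherSchemes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Leave-one-out: cross-validation with as many folds as instances.
        Evaluation loo = new Evaluation(data);
        loo.crossValidateModel(new J48(), data, data.numInstances(), new Random(1));
        System.out.println("Leave-one-out accuracy: " + loo.pctCorrect() + " %");

        // Dedicated test set: train once, test once; exactly replicable.
        Instances test = new DataSource("iris-test.arff").getDataSet();
        test.setClassIndex(test.numAttributes() - 1);
        J48 tree = new J48();
        tree.buildClassifier(data);
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, test);
        System.out.println("Test-set accuracy: " + eval.pctCorrect() + " %");
    }
}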

Bootstrap sampling
- Randomly select n instances with replacement from the n available.
- Expect about 2/3 to be chosen for training: the probability that a given instance is never chosen is (1 - 1/n)^n ≈ 1/e.
- Test on the remaining (out-of-bag) instances.
- Repeat about 30 times and average.
- Avoids partition bias.
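
A hand-rolled sketch of one bootstrap round; Weka has no single call for this exact out-of-bag procedure, so the sampling loop below is an illustrative assumption built on the standard Instances API:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BootstrapRound {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        int n = data.numInstances();
        Random rnd = new Random(1);

        // Draw n instances with replacement; track which ones were ever drawn.
        boolean[] drawn = new boolean[n];
        Instances train = new Instances(data, n);
        for (int i = 0; i < n; i++) {
            int idx = rnd.nextInt(n);
            drawn[idx] = true;
            train.add(data.instance(idx));
        }
        // The never-drawn (out-of-bag) instances, roughly n/e of them, form the test set.
        Instances test = new Instances(data, n);
        for (int i = 0; i < n; i++)
            if (!drawn[i]) test.add(data.instance(i));

        J48 tree = new J48();
        tree.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println("Out-of-bag accuracy: " + eval.pctCorrect() + " %");
        // Repeat ~30 times with different seeds and average, as the slide suggests.
    }
}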