Statistical Learning: Introduction to Weka
Michel Galley
Artificial Intelligence class, November 2, 2006

Machine Learning with Weka
Comprehensive set of tools:
- Pre-processing and data analysis
- Learning algorithms (for classification, clustering, etc.)
- Evaluation metrics
Three modes of operation:
- GUI
- command line (not discussed today)
- Java API (not discussed today)
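Although the Java API is not discussed today, here is a minimal sketch of what that mode looks like (assuming a Weka 3.x jar on the classpath and the adult.train.arff file used later in these slides):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class ApiExample {
    public static void main(String[] args) throws Exception {
        // Load an ARFF file and mark the last attribute as the class.
        Instances data = new Instances(
            new BufferedReader(new FileReader("adult.train.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Train a J48 decision tree, evaluated by 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}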

Weka Resources
Web page: http://www.cs.waikato.ac.nz/ml/weka/
- Extensive documentation (tutorials, trouble-shooting guide, wiki, etc.)
At Columbia:
- Installed locally at: ~mg2016/weka (CUNIX network) and ~galley/weka (CS network)
- Downloads for Windows or UNIX: http://www1.cs.columbia.edu/~galley/weka/downloads

Attribute-Relation File Format (ARFF)
Weka reads ARFF files, which consist of a header (the @relation and @attribute declarations) followed by the data in comma-separated values (CSV) form:

@relation adult
@attribute age numeric
@attribute name string
@attribute education {College, Masters, Doctorate}
@attribute class {>50K,<=50K}
@data
50,Leslie,Masters,>50K
?,Morgan,College,<=50K

Supported attribute types: numeric, nominal, string, date
Details at: http://www.cs.waikato.ac.nz/~ml/weka/arff.html
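Since the @data section is plain CSV, Weka's converter classes can also turn a standalone CSV file (with a header row of attribute names) into ARFF. A sketch, assuming weka.jar is on the classpath and an adult.csv file exists:

> java -cp weka.jar weka.core.converters.CSVLoader adult.csv > adult.arff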

Sample database: the census data ("adult")
Binary classification task: predict whether a person earns more than $50K a year.
- Attributes: age, education level, race, gender, etc.
- Attribute types: nominal and numeric
- Training/test instances: 32,000/16,300
Original UCI data available at: ftp.ics.uci.edu/pub/machine-learning-databases/adult
Data already converted to ARFF: http://www1.cs.columbia.edu/~galley/weka/datasets/

Starting the GUI
CS accounts:
> java -Xmx128M -jar ~galley/weka/weka.jar
> java -Xmx512M -jar ~galley/weka/weka.jar (with more memory)
CUNIX accounts:
> java -Xmx128M -jar ~mg2016/weka/weka.jar
Then start the "Explorer".

Weka Explorer
What we will use today in Weka:
- Pre-process: load, analyze, and filter data
- Visualize: compare pairs of attributes; plot matrices
- Classify: all algorithms seen in class (Naive Bayes, etc.)
- Feature selection: forward feature subset selection, etc.

[Screenshot: the Pre-process pane, with callouts marking where to load, filter, and analyze data]

[Screenshot: the attribute visualization panel]

Demo #1: J48 decision trees (= C4.5)
Steps:
1. Load the data from URL: http://www1.cs.columbia.edu/~galley/weka/datasets/adult.train.arff
2. Select only three attributes (age, education-num, class):
   weka.filters.unsupervised.attribute.Remove -V -R 1,5,last
3. Visualize the age/education-num matrix (find this in the Visualize pane).
4. Classify with decision trees, percentage split of 66%:
   weka.classifiers.trees.J48
5. Visualize the decision tree: (right-)click on the entry in the result list and select "Visualize tree".
6. Compare the matrix with the decision tree: does it make sense to you?
Try it for yourself after the class!
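The same demo can be scripted from the command line. A sketch, assuming weka.jar and the downloaded adult.train.arff sit in the current directory (adult.small.arff is a made-up name for the filtered output, and -x 10 substitutes 10-fold cross-validation for the GUI's percentage split):

> java -cp weka.jar weka.filters.unsupervised.attribute.Remove -V -R 1,5,last -i adult.train.arff -o adult.small.arff
> java -cp weka.jar weka.classifiers.trees.J48 -t adult.small.arff -x 10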

Demo #1: J48 decision trees
[Scatter plot: education-num (y-axis) vs. age (x-axis), with instances labeled >50K and <=50K]

Demo #1: J48 decision trees
[The same plot, with regions of the space annotated + (>50K) and - (<=50K)]

Demo #1: J48 decision trees
[The same plot overlaid with the learned decision boundaries: a split on education-num at 13, and splits on age at 31, 34, 36, and 60]

Demo #1: J48 result analysis

Comparing classifiers
Classifiers allowed in the assignment:
- decision trees (seen)
- naive Bayes (seen)
- linear classifiers (next week)
Repeating many experiments in Weka:
- The previous experiment is easy to reproduce with other classifiers and parameters (e.g., inside the "Weka Experimenter").
- Less time coding and experimenting means more time for analyzing the intrinsic differences between classifiers.

Linear classifiers
- The prediction is a linear function of the input.
- In the case of binary predictions, a linear classifier splits a high-dimensional input space with a hyperplane (i.e., a plane in 3D, or a straight line in 2D).
- Many popular, effective classifiers are linear: the perceptron, linear SVMs, and logistic regression (a.k.a. maximum entropy, exponential model).
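For intuition, the decision rule that all of these models share fits in a few lines. This is a generic sketch, not Weka code; the weights w and bias b stand for whatever values the training algorithm produced, and the toy numbers below are made up:

class LinearClassifier {
    // A binary linear classifier predicts by the sign of the score w.x + b;
    // the hyperplane score = 0 is the decision boundary.
    static String predict(double[] w, double b, double[] x) {
        double score = b;
        for (int i = 0; i < x.length; i++)
            score += w[i] * x[i];  // dot product w.x
        return score >= 0 ? ">50K" : "<=50K";
    }

    public static void main(String[] args) {
        // Toy 2D example: features (age, education-num) with made-up weights.
        double[] w = {0.02, 0.3};
        double[] x = {45, 13};                   // age 45, education-num 13
        System.out.println(predict(w, -4.0, x)); // prints ">50K" (score 0.8)
    }
}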

Comparing classifiers
Results on the "adult" data:
- Majority-class baseline: 76.51% (always predict <=50K) [weka.classifiers.rules.ZeroR]
- Naive Bayes: 79.91% [weka.classifiers.bayes.NaiveBayes]
- Linear classifier: 78.88% [weka.classifiers.functions.Logistic]
- Decision trees: 79.97% [weka.classifiers.trees.J48]
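These runs can be reproduced from the command line (a sketch: adult.test.arff is an assumed name for the test file from the course page, and the exact figures may differ slightly depending on the evaluation split):

> java -cp weka.jar weka.classifiers.rules.ZeroR -t adult.train.arff -T adult.test.arff
> java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t adult.train.arff -T adult.test.arff
> java -cp weka.jar weka.classifiers.functions.Logistic -t adult.train.arff -T adult.test.arff
> java -cp weka.jar weka.classifiers.trees.J48 -t adult.train.arff -T adult.test.arff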

Why this difference?
- A linear classifier in a 2D space can classify correctly ("shatter") any set of 3 points in general position, but not every set of 4 points; we therefore say that 2D linear classifiers have capacity 3.
- A decision tree in a 2D space can shatter as many points as there are leaves in the tree: potentially unbounded capacity! (e.g., if there is no tree pruning)

Demo #2: Logistic Regression
Can we improve upon the logistic regression results?
Steps:
1. Use the same data as before (3 attributes).
2. Discretize and binarize the data (numeric -> binary):
   weka.filters.unsupervised.attribute.Discretize -D -F -B 10
   (discretize with 10 bins of equal frequency, and create binary variables)
3. Classify with logistic regression, percentage split of 66%:
   weka.classifiers.functions.Logistic
4. Compare the result with the decision tree: your conclusion?
5. Repeat the classification experiment with all features, comparing the three classifiers (J48, Logistic, and Logistic with binarization): your conclusion?
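From the command line, steps 2 and 3 look roughly like this (a sketch, reusing the filtered adult.small.arff from Demo #1; adult.bin.arff is a made-up output name, and -x 10 again replaces the percentage split):

> java -cp weka.jar weka.filters.unsupervised.attribute.Discretize -D -F -B 10 -i adult.small.arff -o adult.bin.arff
> java -cp weka.jar weka.classifiers.functions.Logistic -t adult.bin.arff -x 10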

Demo #2: Results
Two features (age, education-num):
- decision tree: 79.97%
- logistic regression: 78.88%
- logistic regression with feature binarization: 79.97%
All features:
- decision tree: 84.38%
- logistic regression: 85.03%
- logistic regression with feature binarization: 85.82%
Number of binary/numeric features in these experiments: 18, 104, 152.
Questions:
- How could a low-capacity classifier outperform a high-capacity one? (We actually increased the capacity of the linear classifier, so that it is close to decision trees.)
- Is high capacity always better? If not, under what circumstances is it better to have low capacity? Discuss the amount of training data, the number of features, and generalization error.
- What is the capacity of a naive Bayes classifier?
- Is it sound to do feature selection with low-capacity classifiers? Why?

Feature Selection
Feature selection: find a feature subset that is a good substitute for all the features
- good for knowing which features are actually useful
- often gives better accuracy (especially on new data)
Forward feature selection (FFS) [John et al., 1994]:
- wrapper feature selection: uses a classifier to determine the goodness of feature sets
- greedy search: fast, but prone to search errors

Feature Selection in Weka
Forward feature selection:
- attribute evaluator: WrapperSubsetEval
  - select a classifier (e.g., NaiveBayes)
  - number of folds in cross-validation (default: 5)
- search method: GreedyStepwise
  - generateRanking: true
  - numToSelect (default: maximum)
  - startSet: good features you previously identified
- attribute selection mode: full training data or cross-validation
Notes:
- This is a double cross-validation: the wrapper cross-validates each candidate feature set that GreedyStepwise proposes.
- Change the number of folds to achieve the desired trade-off between selection accuracy and running time.
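The same setup can also be scripted against the Java API. A sketch, assuming Weka 3.x class names and the adult.train.arff file from the demos above:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GreedyStepwise;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

public class ForwardSelection {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(
            new BufferedReader(new FileReader("adult.train.arff")));
        data.setClassIndex(data.numAttributes() - 1);  // class is the last attribute

        WrapperSubsetEval eval = new WrapperSubsetEval();  // wrapper evaluation
        eval.setClassifier(new NaiveBayes());              // classifier that judges subsets
        eval.setFolds(5);                                  // internal cross-validation

        GreedyStepwise search = new GreedyStepwise();      // greedy forward search
        search.setGenerateRanking(true);

        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(eval);
        sel.setSearch(search);
        sel.SelectAttributes(data);  // capital S: Weka's actual method name
        for (int i : sel.selectedAttributes())
            System.out.println(data.attribute(i).name());
    }
}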

Weka Experimenter
If you need to perform many experiments:
- The Experimenter makes it easy to compare the performance of different learning schemes.
- Results can be written to a file or database.
- Evaluation options: cross-validation, learning curve, etc.
- It can also iterate over different parameter settings.
- Significance testing is built in.

Beyond the GUI
How to reproduce experiments with the command line/API:
- The GUI, API, and command line all rely on the same set of Java classes.
- It is generally easy to determine which classes and parameters were used in the GUI: tree displays in Weka reflect its Java class hierarchy.

> java -cp ~galley/weka/weka.jar weka.classifiers.trees.J48 -C 0.25 -M 2 -t <train_arff> -T <test_arff>

Important command-line parameters

> java -cp ~galley/weka/weka.jar weka.classifiers.<classifier_name> [classifier_options] [options]

where options are:
Create/load/save a classification model:
-t <file> : training set
-l <file> : load model file
-d <file> : save model file
Testing:
-x <N> : N-fold cross-validation
-T <file> : test set
-p <S> : print predictions + attribute selection S
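For instance, one might train Naive Bayes with 10-fold cross-validation and save the model, then reload it later to print predictions on a test set. A sketch; the file names are placeholders:

> java -cp ~galley/weka/weka.jar weka.classifiers.bayes.NaiveBayes -t adult.train.arff -x 10 -d nb.model
> java -cp ~galley/weka/weka.jar weka.classifiers.bayes.NaiveBayes -l nb.model -T adult.test.arff -p 0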