
WEKA Tutorial
Sugato Basu and Prem Melville
Machine Learning Group, Department of Computer Sciences, University of Texas at Austin

What is WEKA?
Collection of ML algorithms – open-source Java package
Schemes for classification include:
–decision trees, rule learners, naive Bayes, decision tables, locally weighted regression, SVMs, instance-based learners, logistic regression, voted perceptrons, multi-layer perceptron
Schemes for numeric prediction include:
–linear regression, model tree generators, locally weighted regression, instance-based learners, decision tables, multi-layer perceptron
Meta-schemes include:
–bagging, boosting, stacking, regression via classification, classification via regression, cost-sensitive classification
Schemes for clustering:
–EM and Cobweb

Getting Started
Set environment variable WEKAHOME
–setenv WEKAHOME /u/ml/software/weka
Add $WEKAHOME/weka.jar to your CLASSPATH
–setenv CLASSPATH /u/ml/software/weka/weka.jar
Test
–java weka.classifiers.j48.J48 -t $WEKAHOME/data/iris.arff
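The setenv lines above are for csh-style shells; a minimal equivalent for bash-style shells, using the same site-specific paths from the slide, would be:

  export WEKAHOME=/u/ml/software/weka
  export CLASSPATH=$WEKAHOME/weka.jar
  java weka.classifiers.j48.J48 -t $WEKAHOME/data/iris.arff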

ARFF File Format
Requires a @RELATION declaration, which associates a name with the dataset
Each @ATTRIBUTE declaration specifies the name and type of an attribute
–Datatype can be numeric, nominal, or string (e.g. sepallength, petalwidth, class in the iris data)
The @DATA declaration is a single line denoting the start of the data segment
–Missing values are represented by ?
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, ?, 1.4, ?, Iris-versicolor
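Assembled from the fragments above, a complete ARFF file for the iris data would look like the following (a reconstruction; the full iris attribute list is assumed):

  @RELATION iris
  @ATTRIBUTE sepallength NUMERIC
  @ATTRIBUTE sepalwidth NUMERIC
  @ATTRIBUTE petallength NUMERIC
  @ATTRIBUTE petalwidth NUMERIC
  @ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}
  @DATA
  5.1, 3.5, 1.4, 0.2, Iris-setosa
  4.9, ?, 1.4, ?, Iris-versicolor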

Sparse ARFF Files
Similar to ARFF files, except that data values of 0 are not represented
Non-zero attributes are specified by attribute number and value
For examples of ARFF files, see the datasets shipped in the WEKA data directory
Standard:
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
Sparse:
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}

Running Learning Schemes
java <learner class> [options]
Example learner classes:
–C4.5: weka.classifiers.j48.J48
–Naive Bayes: weka.classifiers.NaiveBayes
–KNN: weka.classifiers.IBk
Important generic options:
-t <training file>    Specify training file
-T <test file>        If none, CV is performed on training data
-x <number of folds>  Number of folds for cross-validation
-s <random seed>      For CV
-l <input file>       Use saved model
-d <output file>      Output model to file
Invoking a learner without any options will list all the scheme-specific options
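For example, to train J48 on one file, evaluate it on a held-out test set, and save the model (file names here are illustrative):

  java weka.classifiers.j48.J48 -t train.arff -T test.arff -d j48.model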

Output
Summary of model – if possible
Statistics on training data
Cross-validation statistics
Output for numeric prediction is different:
–Correlation coefficient instead of accuracy
–No confusion matrices

Using Meta-Learners
java <meta-learner class> [meta-options] -W <base-learner class> -- [base-options]
–The double minus sign (--) separates the two lists of options, e.g.
java weka.classifiers.Bagging -I 8 -W weka.classifiers.j48.J48 -t iris.arff -- -U
MultiClassClassifier allows you to use a binary classifier for multiclass data:
java weka.classifiers.MultiClassClassifier -W weka.classifiers.SMO -t weather.arff
CVParameterSelection finds the best value for a specified parameter using CV
–Use the -P option to specify the parameter and the space to search: -P "<param> <min> <max> <steps>", e.g.
java …CVParameterSelection -W …OneR -P "B <min> <max> <steps>" -t iris.arff

Using Filters
Filters can be used to change data files, e.g. delete the first and second attributes:
java weka.filters.AttributeFilter -R 1,2 -i iris.arff -o iris.new.arff
AttributeSelectionFilter lets you select a set of attributes using classes in the weka.attributeSelection package:
java weka.filters.AttributeSelectionFilter -E weka.attributeSelection.InfoGainAttributeEval -i weather.arff
Other filters:
–DiscretizeFilter: Discretizes a range of numeric attributes in the dataset into nominal attributes
–NominalToBinaryFilter: Converts nominal attributes into binary ones, replacing each attribute that has k values with k-1 new binary attributes
–NumericTransformFilter: Transforms numeric attributes using a given method (java weka.filters.NumericTransformFilter -C java.lang.Math -M sqrt …)

The Instance Class
All attribute values are stored as doubles
–Value of a nominal attribute is the index of the nominal value in the attribute definition
Some important methods:
–classAttribute(): Returns the class attribute
–classValue(): Returns an instance's class value
–value(int): Returns a specified attribute value in internal format
–enumerateAttributes(): Returns an enumeration of all the attributes
–weight(): Returns the instance's weight
Instances is a collection of Instance objects:
–numInstances(): Returns the number of instances in the dataset
–instance(int): Returns the instance at the given position
–enumerateInstances(): Returns an enumeration of all instances in the dataset
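The methods above in action, as a small stand-alone sketch (assumes an iris.arff in the current directory and the 3.2-era API used throughout these slides):

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.Enumeration;
  import weka.core.Instance;
  import weka.core.Instances;

  public class InstanceDemo {
    public static void main(String[] args) throws Exception {
      // Load a dataset and declare the last attribute to be the class
      Instances data = new Instances(new BufferedReader(new FileReader("iris.arff")));
      data.setClassIndex(data.numAttributes() - 1);
      System.out.println("Instances: " + data.numInstances());
      Enumeration instEnum = data.enumerateInstances();
      while (instEnum.hasMoreElements()) {
        Instance inst = (Instance) instEnum.nextElement();
        // classValue() returns the class in internal (double) format
        System.out.println(inst.classValue() + " weight = " + inst.weight());
      }
    }
  }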

Writing Classifiers
Import the following packages:
import weka.classifiers.*;
import weka.core.*;
import java.util.*;
Extend Classifier
–If predicting class probabilities, then extend DistributionClassifier
Essential methods:
–buildClassifier(Instances): Generates a classifier
–classifyInstance(Instance): Classifies a given instance
–distributionForInstance(Instance): Predicts the class memberships (for DistributionClassifier)
Interfaces that can be implemented:
–UpdateableClassifier: For incremental classifiers
–WeightedInstancesHandler: If classifier can make use of instance weights

Example: ZeroR (Majority Class)

public class ZeroR extends DistributionClassifier
                   implements WeightedInstancesHandler {

  private double m_ClassValue; // The class value 0R predicts
  private double[] m_Counts;   // The number of instances in each class

  public void buildClassifier(Instances instances) throws Exception {
    m_Counts = new double[instances.numClasses()];
    for (int i = 0; i < m_Counts.length; i++) { // Initialize counts
      m_Counts[i] = 1;
    }
    Enumeration instEnum = instances.enumerateInstances();
    while (instEnum.hasMoreElements()) {        // Add up class counts
      Instance instance = (Instance) instEnum.nextElement();
      m_Counts[(int) instance.classValue()] += instance.weight();
    }
    m_ClassValue = Utils.maxIndex(m_Counts);    // Find majority class
    Utils.normalize(m_Counts);                  // Normalize counts
  }

Example: ZeroR - II

  // Return index of the predicted class
  public double classifyInstance(Instance instance) {
    return m_ClassValue;
  }

  // Return predicted class probability distribution
  public double[] distributionForInstance(Instance instance) throws Exception {
    return (double[]) m_Counts.clone();
  }
}
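To try the classifier from the command line, a main method can hand it to Weka's standard evaluation harness (a sketch, not part of the original slides; add it inside the ZeroR class):

  // Hypothetical driver: run Weka's evaluation harness on ZeroR,
  // e.g. java ZeroR -t iris.arff
  public static void main(String[] args) throws Exception {
    System.out.println(Evaluation.evaluateModel(new ZeroR(), args));
  }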

WekaUT: Extensions to WEKA
Clusterers package:
–SemiSupClusterer: Interface for semi-supervised clustering
–SeededEM, SeededKMeans: Implement SemiSupClusterer, with seeding
–HAC, MatrixHAC: Implement hierarchical agglomerative clustering
–ConsensusClusterer: Abstract class for consensus clustering
–ConsensusPairwiseClusterer: Takes the output of many clusterings, uses cluster collocation statistics as similarity values, applies a clustering algorithm
–CoTrainableClusterer: Performs co-trainable clustering, similar to Nigam's Co-EM
–CVEvaluation: 10-fold cross-validation with learning curves, in a transductive framework

WekaUT (contd.)
Metrics:
–Metric: Abstract class for metrics
–LearnableMetric: Abstract class for learnable distance metrics
–WeightedDotP: Learnable
–WeightedL1Norm: Learnable
–WeightedEuclid: Learnable
–Mahalanobis metric: Uses JAMA for matrix operations

Making Weka Text-friendly
Preprocess text by making wrapper calls to:
–Mooney's IR package: tokenization, Porter stemming, TFIDF
–McCallum's BOW package: tokenization, stemming, TFIDF, information-theoretic pruning, N-gram tokens, different smoothing algorithms
–Fan's MC toolkit: tokenization, TFIDF, pruning, CCS format
No inverted index in Weka: OK if not doing IR, but KNN is inefficient
–May want to integrate the VSR package of IR with Weka
Probability underflow is a current problem: calculations have to be done with logs
–NaiveBayes, KNN, etc.: can have 2 versions of each (sparse, dense)
Sparse vector formats:
–Weka's SparseInstance
–IR's hashMapVector
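On the underflow point: multiplying many small per-token probabilities underflows a double, so products are kept as sums of logs. A minimal illustration of the standard log-sum-exp trick (generic Java, not a Weka API):

  // Combine log-probabilities without leaving log space:
  // log(sum_i exp(l_i)), computed stably by factoring out the max.
  static double logSumExp(double[] logProbs) {
    double max = Double.NEGATIVE_INFINITY;
    for (double lp : logProbs) max = Math.max(max, lp);
    double sum = 0.0;
    for (double lp : logProbs) sum += Math.exp(lp - max);
    return max + Math.log(sum);
  }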

Weka's SparseInstance format
Non-zero attributes explicitly stated, 0 values not:
{1:"the", 3:"small", 6:"boy", 9:"ate", 13:"the", 17:"small", 21:"pie"}
Strings mapped to integer indices using a hashtable:
the:0, small:1, boy:2, ate:3, the:4, small:5, pie:6
Use StringToWordVectorFilter to convert a text SparseInstance to a word vector (in Weka 3-2-2)
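A minimal sketch of building a SparseInstance with the same 3.2-era API (the values are made up for illustration):

  // Dense instance with mostly-zero values, weight 1.0
  double[] vals = {0.0, 2.0, 0.0, 0.0, 5.0};
  Instance dense = new Instance(1.0, vals);
  // SparseInstance stores only the non-zero values and their indices
  SparseInstance sparse = new SparseInstance(dense);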

Comparison of sparse vector formats
hashMapVector
+ Compact hashMap representation
+ Amortized constant-time access
– Does not store position information, which may be necessary for future apps
– Will need a lot of modification to Weka
SparseInstance
+ Efficient storage, in terms of indices of string values and positions
+ Contains position information of tokens
+ Will not require any modification to Weka
– Uses binary search to insert a new element into the vector
– Would need filters for TF, IDF, token counts, etc.
– Will require a hack to bypass a soft-bug during multiple read-writes

Future Work
Write wrappers for existing C/C++ packages:
–mc, spkmeans, rainbow, svmlight, cluto
Data format converters, e.g. CCStoARFF
10-fold CV evaluation with learning curves:
–inductive (modify Weka's)
–transductive (use clusterer CV code)
Statistical tests, e.g. t-tests for classification
Cluster evaluation metrics:
–we have KL, MI, Pairwise
Making changes to handle text documents

Weka Problems
Internal variables are private
–Should have protected or package-level access
SparseInstance with Strings requires a dummy value at index 0
–Problem:
Strings are mapped to internal indices into an array
A String at position 0 is mapped to value "0"
When written out as a SparseInstance, it will not be written (0 value)
If read back in, the first String is missing from the Instances
–Solution:
Put a dummy string in position 0 when writing a SparseInstance with strings
The dummy will be ignored while writing, and the actual instance will be written properly