An Introduction to WEKA

1 An Introduction to WEKA
As presented by PACE, 8/9/2012. DISCLAIMER: These slides are drawn from a common slide stack and may have been modified slightly.

2 Content
What is WEKA?
The Explorer Application: Preprocess, Classify, Cluster, Associate, Select Attributes, Visualize
Weka on Trestles
References and Resources

3 What is WEKA? Weka is a bird found only in New Zealand.
Waikato Environment for Knowledge Analysis. Weka is a data mining/machine learning tool developed by the Department of Computer Science, University of Waikato, New Zealand. Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes. Weka is open-source Java software issued under the GNU General Public License.
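Since the slides mention calling Weka from Java code, here is a minimal sketch of that route, assuming the standard WEKA 3 API; the file name iris.arff is a placeholder for any dataset on disk:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.trees.J48;

public class WekaFromJava {
    public static void main(String[] args) throws Exception {
        // Load a dataset and mark the last attribute as the class
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Build a J48 decision tree (WEKA's C4.5 implementation)
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);  // prints the learned tree
    }
}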

4 Download and Install WEKA
Website:
Supports multiple platforms (written in Java): Windows, Mac OS X, and Linux
Datasets (iris.arff, weather.arff):
Available on Trestles at: /home/diag/opt/weka/data
Available with the download: …../weka/data/

5 Main Features
49 data preprocessing tools
76 classification/regression algorithms
8 clustering algorithms
3 algorithms for finding association rules
15 attribute/subset evaluators + 10 search algorithms for feature selection

6 Main GUI
Three graphical user interfaces, plus a command line:
“The Explorer” (exploratory data analysis): pre-process data, build “classifiers”, cluster data, find associations, attribute selection, data visualization
“The Experimenter” (experimental environment): used to compare the performance of different learning schemes and to conduct statistical tests between them
“The KnowledgeFlow” (new process-model-inspired interface): a Java-Beans-based drag-and-drop interface for setting up and running machine learning experiments; it supports essentially the same functions as the Explorer, with the added advantage of incremental learning
Command line interface (“Simple CLI”): provides a simple command-line interface that allows direct execution of WEKA commands, for operating systems that do not provide their own command line
More at:

7 Content
What is WEKA?
The Explorer: Preprocess, Classify, Cluster, Associate, Select Attributes, Visualize
Weka on Trestles
References and Resources
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.

8 [screenshot]

9 WEKA:: Explorer: Preprocess
Data format: WEKA uses flat text files to describe the data. Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary. Data can also be read from a URL or from an SQL database (using JDBC).
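A minimal sketch of programmatic loading, assuming the standard WEKA 3 API; the file names and URL are placeholders. DataSource picks an appropriate loader from the file extension and, per the WEKA converter documentation, also accepts URL locations:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadData {
    public static void main(String[] args) throws Exception {
        Instances fromArff = DataSource.read("weather.arff");  // ARFF file
        Instances fromCsv  = DataSource.read("weather.csv");   // CSV file
        // Data can also come from a URL:
        Instances fromUrl  = DataSource.read("http://example.org/weather.arff");
        System.out.println(fromArff.numInstances() + " instances loaded");
    }
}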

10 WEKA:: ARFF file format
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
The four attribute types are: numeric, nominal ({nominal-specification}), string, and date [<date-format>].
Numeric attributes. Numeric attributes can be real or integer numbers.
Nominal attributes. Nominal values are defined by providing a <nominal-specification> listing the possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}. For example, the class value of the Iris dataset can be defined as follows: @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}. Values that contain spaces must be quoted.
String attributes. String attributes allow us to create attributes containing arbitrary textual values. This is very useful in text-mining applications, as we can create datasets with string attributes, then write Weka filters to manipulate strings (like StringToWordVectorFilter). String attributes are declared as follows: @ATTRIBUTE LCC string
Date attributes. Date attribute declarations take the form @attribute <name> date [<date-format>], where <name> is the name of the attribute and <date-format> is an optional string specifying how date values should be parsed and printed (the same format used by SimpleDateFormat). The default format string accepts the ISO-8601 combined date and time format: "yyyy-MM-dd'T'HH:mm:ss". Dates must be specified in the data section as the corresponding string representations of the date/time (see the example below). A more thorough description is available in the WEKA ARFF documentation.
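The date example the text refers to did not survive the transcript; a minimal ARFF sketch of a date attribute with an explicit format string, following the format description above, would be:

@RELATION Timestamps
@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"
@DATA
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"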

11 [screenshot]

12 [screenshot]
1. Relation. The name of the relation, as given in the file it was loaded from. Filters (described below) modify the name of a relation.
2. Instances. The number of instances (data points/records) in the data.
3. Attributes. The number of attributes (features) in the data.

13 [screenshot]
1. Name. The name of the attribute, the same as that given in the attribute list.
2. Type. The type of attribute, most commonly Nominal or Numeric.
3. Missing. The number (and percentage) of instances in the data for which this attribute is missing (unspecified).
4. Distinct. The number of different values that the data contains for this attribute.
5. Unique. The number (and percentage) of instances in the data having a value for this attribute that no other instances have.

14 [screenshot]
If the attribute is nominal, the list consists of each possible value for the attribute along with the number of instances that have that value. If the attribute is numeric, the list gives four statistics describing the distribution of values in the data: the minimum, maximum, mean, and standard deviation.

15 [screenshot]
Below these statistics is a colored histogram, color-coded according to the attribute chosen as the Class using the box above the histogram. Note that only nominal Class attributes result in a color-coding.

16-17 [screenshots]

18 WEKA:: Explorer: Preprocess
Used to define filters that transform the data. WEKA contains filters for discretization, normalization, resampling, attribute selection, transforming and combining attributes, etc.
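The same filters can be applied programmatically; a minimal sketch, assuming the standard WEKA 3 API (the dataset path is a placeholder), that normalizes all numeric attributes:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");   // placeholder path
        Normalize norm = new Normalize();                // scales numeric attributes to [0,1]
        norm.setInputFormat(data);                       // must be called before filtering
        Instances normalized = Filter.useFilter(data, norm);
        System.out.println(normalized.toSummaryString());
    }
}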

19-24 [screenshots]

25 [screenshot]
The GenericObjectEditor dialog box lets you configure a filter.

26 [screenshot]
An instance filter that discretizes a range of numeric attributes in the dataset into nominal attributes. Discretization is by simple binning. Skips the class attribute if set.
OPTIONS
attributeIndices -- Specify the range of attributes to act on. This is a comma-separated list of attribute indices, with "first" and "last" valid values. Specify an inclusive range with "-", e.g. "first-3,5,6-10,last".
bins -- Number of bins.
desiredWeightOfInstancesPerInterval -- Sets the desired weight of instances per interval for equal-frequency binning.
findNumBins -- Optimize the number of equal-width bins using leave-one-out. Doesn't work for equal-frequency binning.
ignoreClass -- The class index will be unset temporarily before the filter is applied.
invertSelection -- Set attribute selection mode. If false, only selected (numeric) attributes in the range will be discretized; if true, only non-selected attributes will be discretized.
makeBinary -- Make resulting attributes binary.
useEqualFrequency -- If set to true, equal-frequency binning will be used instead of equal-width binning.
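These GUI options correspond to setters on weka.filters.unsupervised.attribute.Discretize; a small sketch, with method names taken from the WEKA javadocs:

import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeConfig {
    public static void main(String[] args) {
        Discretize d = new Discretize();
        d.setAttributeIndices("first-last"); // attributeIndices
        d.setBins(5);                        // bins
        d.setUseEqualFrequency(true);        // useEqualFrequency
        d.setMakeBinary(false);              // makeBinary
        // Print the equivalent command-line option string
        System.out.println(java.util.Arrays.toString(d.getOptions()));
    }
}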

27-31 [screenshots]

32 WEKA:: Explorer: building “classifiers”
Classifiers in WEKA are models for predicting nominal or numeric quantities Implemented learning schemes include: Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, … “Meta”-classifiers include: Bagging, boosting, stacking, error-correcting output codes, locally weighted learning, …
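As an illustration of the meta-classifier idea, a minimal sketch (standard WEKA 3 API; the dataset path is a placeholder) that wraps J48 in Bagging — any base learner could be substituted, since classifiers and meta-classifiers share one interface:

import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MetaExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");   // placeholder path
        data.setClassIndex(data.numAttributes() - 1);
        Bagging bagger = new Bagging();                  // meta-classifier
        bagger.setClassifier(new J48());                 // wraps any base learner
        bagger.setNumIterations(10);                     // number of bagged models
        bagger.buildClassifier(data);
        System.out.println(bagger);
    }
}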

33 Decision Tree Induction: Training Dataset
This follows an example from Quinlan's ID3.

34 Output: A Decision Tree for “buys_computer”
[Tree diagram: root age? with branches <=30 → student? (no → no, yes → yes); 31..40 → yes; >40 → credit rating? (excellent → no, fair → yes)]

35 Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm): The tree is constructed in a top-down, recursive, divide-and-conquer manner. At the start, all the training examples are at the root. Attributes are categorical (continuous-valued attributes are discretized in advance). Examples are partitioned recursively based on selected attributes. Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
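For concreteness, the information-gain heuristic that ID3 uses (standard textbook definitions, not shown on the slide) scores a candidate split attribute A as:

\mathrm{Entropy}(S) = -\sum_{c} p_c \log_2 p_c
\qquad
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)

where p_c is the proportion of examples in S with class c and S_v is the subset of S for which A has value v; the attribute with the highest gain is chosen for the split.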

36-38 [screenshots]

39 [screenshot]
Right-click to show the scheme's properties, then click More to get a detailed explanation of the parameters.

40 [screenshot]
binarySplits -- Whether to use binary splits on nominal attributes when building the trees.
confidenceFactor -- The confidence factor used for pruning (smaller values incur more pruning).
debug -- If set to true, the classifier may output additional info to the console.
minNumObj -- The minimum number of instances per leaf.
numFolds -- Determines the amount of data used for reduced-error pruning. One fold is used for pruning, the rest for growing the tree.
reducedErrorPruning -- Whether reduced-error pruning is used instead of C4.5 pruning.
saveInstanceData -- Whether to save the training data for visualization.
seed -- The seed used for randomizing the data when reduced-error pruning is used.
subtreeRaising -- Whether to consider the subtree-raising operation when pruning.
unpruned -- Whether pruning is performed.
useLaplace -- Whether counts at leaves are smoothed based on Laplace.
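These options map to setters on weka.classifiers.trees.J48; a small sketch (method names per the WEKA javadocs; the values shown are J48's defaults):

import weka.classifiers.trees.J48;

public class J48Config {
    public static void main(String[] args) {
        J48 tree = new J48();
        tree.setConfidenceFactor(0.25f); // confidenceFactor
        tree.setMinNumObj(2);            // minNumObj
        tree.setBinarySplits(false);     // binarySplits
        tree.setUnpruned(false);         // unpruned
        // Print the equivalent command-line option string
        System.out.println(java.util.Arrays.toString(tree.getOptions()));
    }
}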

41-42 [screenshots]

43 [screenshot]
Four test modes:
1. Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.
2. Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
3. Cross-validation. The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.
4. Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.
Note: no matter which evaluation method is used, the model that is output is always the one built from all the training data. Further testing options can be set by clicking on the More options... button.
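Programmatically, cross-validation corresponds to WEKA's Evaluation class; a minimal sketch (standard WEKA 3 API; the dataset path is a placeholder) for 10-fold cross-validation of J48:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");   // placeholder path
        data.setClassIndex(data.numAttributes() - 1);
        Evaluation eval = new Evaluation(data);
        // 10-fold cross-validation with a fixed random seed
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());       // confusion matrix
    }
}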

44-45 [screenshots]

46 [screenshot]
1. Output model. The classification model on the full training set is output so that it can be viewed, visualized, etc. This option is selected by default.
2. Output per-class stats. The precision/recall and true/false statistics for each class are output. This option is also selected by default.
3. Output entropy evaluation measures. Entropy evaluation measures are included in the output. This option is not selected by default.
4. Output confusion matrix. The confusion matrix of the classifier's predictions is included in the output. This option is selected by default.
5. Store predictions for visualization. The classifier's predictions are remembered so that they can be visualized. This option is selected by default.
6. Output predictions. The predictions on the evaluation data are output. Note that in the case of a cross-validation the instance numbers do not correspond to the location in the data!
7. Output additional attributes. If additional attributes need to be output alongside the predictions, e.g., an ID attribute for tracking misclassifications, then the index of this attribute can be specified here. The usual Weka ranges are supported; "first" and "last" are therefore valid indices as well (example: "first-3,6,8,12-last").
8. Cost-sensitive evaluation. The errors are evaluated with respect to a cost matrix. The Set... button allows you to specify the cost matrix used.
9. Random seed for xval / % Split. This specifies the random seed used when randomizing the data before it is divided up for evaluation purposes.
10. Preserve order for % Split. This suppresses the randomization of the data before splitting into train and test set.
11. Output source code. If the classifier can output the built model as Java source code, you can specify the class name here. The code will be printed in the "Classifier output" area.

47-48 [screenshots]

49 [screenshot]
Run information. A list of information giving the learning scheme options, relation name, instances, attributes, and test mode that were involved in the process.
Classifier model (full training set). A textual representation of the classification model that was produced on the full training data.
The results of the chosen test mode are broken down thus:
1. Summary. A list of statistics summarizing how accurately the classifier was able to predict the true class of the instances under the chosen test mode.
2. Detailed Accuracy By Class. A more detailed per-class breakdown of the classifier's prediction accuracy.
3. Confusion Matrix. Shows how many instances have been assigned to each class. Elements show the number of test examples whose actual class is the row and whose predicted class is the column.
4. Source code (optional). This section lists the Java source code if "Output source code" was chosen in the "More options" dialog.

50 [screenshot]
The first part is a human-readable form of the training set model; in this case, it is a decision tree. petalwidth is at the root of the tree and determines the first decision: if petalwidth <= 0.6, the iris is classified as Iris-setosa. The numbers in (parentheses) at the end of each leaf tell us the number of examples in that leaf. If one or more leaves were not pure (i.e., not all of the same class), the number of misclassified examples would also be given, after a /slash/.

51 [screenshot]

52 [screenshot]
Accuracy is ~96%.
The kappa statistic measures the agreement of prediction with the true class; 1.0 signifies complete agreement. The following error values are not very meaningful for classification tasks; for regression tasks, however, e.g. the root of the mean squared error per example would be a reasonable criterion.
The confusion matrix is more commonly named a contingency table. We have 3 classes, and therefore a 3x3 confusion matrix; the matrix could be arbitrarily large. The number of correctly classified instances is the sum of the diagonal elements of the matrix; all others are incorrectly classified (class "c" gets misclassified as "b" exactly twice).
The True Positive (TP) rate is the proportion of examples which were classified as class x, among all examples which truly have class x, i.e. how much of the class was captured. It is equivalent to Recall. In the confusion matrix, this is the diagonal element divided by the sum over the relevant row, i.e. 7/(7+2)=0.778 for class yes and 2/(3+2)=0.4 for class no in the two-class weather example used here.
The False Positive (FP) rate is the proportion of examples which were classified as class x but belong to a different class, among all examples which are not of class x. In the matrix, this is the column sum of class x minus the diagonal element, divided by the row sums of all other classes; i.e. 3/5=0.6 for class yes and 2/9=0.222 for class no.
The Precision is the proportion of the examples which truly have class x among all those which were classified as class x. In the matrix, this is the diagonal element divided by the sum over the relevant column, i.e. 7/(7+3)=0.7 for class yes and 2/(2+2)=0.5 for class no.
The F-Measure is simply 2*Precision*Recall/(Precision+Recall), a combined measure for precision and recall.
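In standard notation (TP, FP, TN, FN: true/false positives/negatives for a given class x), the measures above are:

\mathrm{TP\ rate} = \mathrm{Recall} = \frac{TP}{TP+FN}
\qquad
\mathrm{FP\ rate} = \frac{FP}{FP+TN}
\qquad
\mathrm{Precision} = \frac{TP}{TP+FP}
\qquad
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}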

53-56 [screenshots]

57 Explorer: Select Attributes
Panel that can be used to investigate which (subsets of) attributes are the most predictive ones. Attribute selection methods contain two parts: a search method (best-first, forward selection, random, exhaustive, genetic algorithm, ranking) and an evaluation method (correlation-based, wrapper, information gain, chi-squared, …). Very flexible: WEKA allows (almost) arbitrary combinations of these two.
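A minimal programmatic sketch of one such combination (standard WEKA 3 API; the dataset path is a placeholder), pairing a correlation-based evaluator with best-first search:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectAttributesExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);
        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(new CfsSubsetEval());          // correlation-based evaluator
        sel.setSearch(new BestFirst());                 // best-first search
        sel.SelectAttributes(data);                     // note the capital S in WEKA's API
        System.out.println(sel.toResultsString());
    }
}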

58 [screenshot]
1. Use full training set. The worth of the attribute subset is determined using the full set of training data.
2. Cross-validation. The worth of the attribute subset is determined by a process of cross-validation. The Fold and Seed fields set the number of folds to use and the random seed used when shuffling the data.

59-65 [screenshots]

66 Explorer: Visualize
Visualization is very useful in practice: e.g., it helps to determine the difficulty of the learning problem. WEKA can visualize single attributes (1-d) and pairs of attributes (2-d); to do: rotating 3-d visualizations (Xgobi-style). Color-coded class values. A "Jitter" option to deal with nominal attributes (and to detect "hidden" data points). A "Zoom-in" function.

67 [screenshot]

68 [screenshot: scatter plot matrix]

69-76 [screenshots]

77 Using Weka On Trestles

78 Using Weka on Trestles
Shared resources; batch and interactive use.
Use the GUI and the command line: use the GUI on the login nodes to construct a command line, then use that command line to run interactive or batch jobs on the production nodes (see the sketch under slide 80).

79 Weka GUI
To launch the Weka GUI:
On a Windows machine: running GUI software on a remote machine requires a secure shell with X forwarding enabled to establish the remote connection, and an X server to handle the local display. Suggested software: PuTTY and Xming.
Linux and Mac OS X support X forwarding; Mac users need to run Applications > Utilities > XTerm, then ssh –Y.
Load the weka module. The Weka installation is available at: /home/diag/opt/weka
At the command prompt: > weka

80 PBS Script
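The slide's script did not survive the transcript; a minimal hedged sketch of what such a PBS script might look like. The queue name, walltime, module name, and file paths are placeholders, and the J48 command line is the kind of scheme string the Explorer GUI displays; consult the Trestles user guide for actual values:

#!/bin/bash
# Hedged sketch of a PBS batch script for running Weka on a compute node.
# Queue, walltime, module name, and paths below are placeholders.
#PBS -N weka-j48
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:30:00
#PBS -q normal

cd $PBS_O_WORKDIR
module load weka
# Run the same scheme the Explorer shows in its command line,
# training (-t) on a local ARFF file:
java -Xmx1g weka.classifiers.trees.J48 -C 0.25 -M 2 -t weather.arff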

81 Output file

82 Hands On with Weka

83 The Weather Data Set (.arff file)
Weather.arff file available on Trestles at: /home/diag/opt/weka/data
Online: included with the Weka download
Data set:
@relation PlayTennis
@attribute day numeric
@attribute outlook {Sunny, Overcast, Rain}
@attribute temperature {Hot, Mild, Cool}
@attribute humidity {High, Normal}
@attribute wind {Weak, Strong}
@attribute playTennis {Yes, No}
@data
1,Sunny,Hot,High,Weak,No
2,Sunny,Hot,High,Strong,No
3,Overcast,Hot,High,Weak,Yes
4,Rain,Mild,High,Weak,Yes
5,Rain,Cool,Normal,Weak,Yes
6,Rain,Cool,Normal,Strong,No
7,Overcast,Cool,Normal,Strong,Yes
8,Sunny,Mild,High,Weak,No
...

84 The Problem
Each instance describes the facts of the day and the action of the observed person (played or did not play).
The Data Set: 14 instances; 6 attributes (day, outlook, temperature, humidity, wind, playTennis).
Based on the given records we can assess which factors affected the person's decision about playing tennis.

85 The Question
Use the J48 decision tree learner to model the class attribute playTennis.
Make a prediction for "play". Make predictions for the "temperature" attribute. Do you need to do any additional data preparation? (One possible approach is sketched below.)
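A hedged sketch of one possible approach, assuming the standard WEKA 3 API and the weather.arff path above: the numeric "day" attribute is a record ID, so removing it before training is a reasonable preparation step.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
import weka.classifiers.trees.J48;

public class PlayTennisJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff"); // placeholder path
        Remove remove = new Remove();
        remove.setAttributeIndices("1");        // drop the 'day' ID attribute
        remove.setInputFormat(data);
        Instances prepared = Filter.useFilter(data, remove);
        prepared.setClassIndex(prepared.numAttributes() - 1); // playTennis
        J48 tree = new J48();
        tree.buildClassifier(prepared);
        System.out.println(tree);
    }
}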

86 Result

87 References and Resources
WEKA website:
WEKA Tutorial:
Machine Learning with WEKA: a presentation demonstrating all graphical user interfaces (GUIs) in Weka; a presentation explaining how to use Weka for exploratory data mining.
WEKA Data Mining Book: Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition).
WEKA Wiki:

