
1 CSCI N317 Computation for Scientific Applications
Unit 3-2: Weka Classification

2 Classification
Decision tree classification:
- Predicts categorical class labels (discrete or nominal)
- Classifies data by constructing a tree model from the training set and the class labels in a classifying attribute, then uses that model to classify new data
Typical applications:
- Credit approval
- Target marketing
- Medical diagnosis
- Fraud detection

3 Classification
Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as a decision tree

4 Classification
Model usage: classifying future or unknown objects
- Estimate the accuracy of the model using a test set
- The known label of each test sample is compared with the classified result from the model
- The accuracy rate is the percentage of test set samples that are correctly classified by the model
- The test set is independent of the training set
- If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
[Figure: example decision tree. Root node age? with branches <=30, 31..40, and >40. The <=30 branch tests student? (no -> no, yes -> yes); the 31..40 branch is a yes leaf; the >40 branch tests credit rating? (excellent -> no, fair -> yes).]

5 Presentation of Classification Results

6 Visualization of a Decision Tree in SGI/MineSet 3.0

7 Classifier Accuracy Measures
Confusion matrix: a matrix (also known as an error matrix) that allows visualization of the performance of an algorithm, typically a supervised learning one. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. The diagonal values are the correct classifications; all other cells are errors.

Example (predicted classes in columns, actual classes in rows):

classes              buy_computer = yes   buy_computer = no   total   recognition (%)
buy_computer = yes   6954                 46                  7000    99.34
buy_computer = no    412                  2588                3000    86.27
total                7366                 2634                10000   95.42

Accuracy of a classifier M, acc(M): the percentage of test set tuples that are correctly classified by the model M. Here acc(M) = (6954 + 2588) / 10000 = 95.42%.
Error rate (misclassification rate) of M = 1 - acc(M)
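To make the arithmetic concrete, here is a minimal plain-Java sketch that computes accuracy and error rate from a confusion matrix; the hard-coded counts are the ones from the example table above.

```java
public class ConfusionMatrixAccuracy {
    public static void main(String[] args) {
        // Rows = actual class, columns = predicted class,
        // using the buy_computer counts from the table above.
        int[][] m = { { 6954, 46 },     // actual yes: predicted yes, predicted no
                      { 412, 2588 } };  // actual no:  predicted yes, predicted no

        int correct = 0, total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) correct += m[i][j];  // diagonal = correct classifications
            }
        }
        double acc = 100.0 * correct / total;
        System.out.printf("acc(M) = %.2f%%, error rate = %.2f%%%n", acc, 100.0 - acc);
        // Prints: acc(M) = 95.42%, error rate = 4.58%
    }
}
```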

8 Data Preparation
- Data cleaning: preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data, i.e. adjust values measured on different scales to a notionally common scale (a minimal Weka sketch follows below)
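As one concrete data-transformation step, the sketch below rescales all numeric attributes to a common [0, 1] range with Weka's Normalize filter. This is a minimal sketch, assuming glass.arff (the example dataset used later in this unit) sits in the working directory.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeExample {
    public static void main(String[] args) throws Exception {
        // Load the raw dataset (glass.arff ships in Weka's data folder).
        Instances data = new DataSource("glass.arff").getDataSet();

        // Normalize rescales every numeric attribute to [0, 1] by default.
        Normalize norm = new Normalize();
        norm.setInputFormat(data);               // learn min/max from the data
        Instances scaled = Filter.useFilter(data, norm);

        System.out.println(scaled.toSummaryString());
    }
}
```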

9 Overfitting and Tree Pruning
Overfitting: an induced tree may overfit the training data
- Too many branches, some of which may reflect anomalies due to noise or outliers
- Poor accuracy on unseen samples
Prune trees to avoid overfitting:
- Remove branches from a "fully grown" tree
- Use a set of data different from the training data to decide which is the "best pruned tree"

10 Weka Introduction
Data Mining and Weka:
Weka Datasets:

11 Build a Classifier
https://www.youtube.com/watch?v=da-6IBnqzsg
Example:
- Dataset: glass.arff
- Files used by Weka have the .arff extension
- You can examine the file in a text editor such as Notepad++
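For reference, an ARFF file is plain text: a @relation name, a list of @attribute declarations, then the @data rows. Below is an abridged sketch in the shape of glass.arff; the attribute names follow the UCI glass data, but the class-label spellings and the sample row are illustrative, so check your copy of the file for the exact values.

```
@relation Glass

@attribute RI numeric
@attribute Na numeric
@attribute Mg numeric
@attribute Al numeric
@attribute Si numeric
@attribute K numeric
@attribute Ca numeric
@attribute Ba numeric
@attribute Fe numeric
@attribute Type {'build wind float','build wind non-float','vehic wind float','vehic wind non-float',containers,tableware,headlamps}

@data
% one comma-separated row per instance (values below are illustrative)
1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.00,'build wind float'
```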

12 Build a Classifier
Steps: Weka -> Explorer
Open the file and check the data
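The same open-and-inspect step can also be scripted. A minimal sketch with the Weka Java API, assuming glass.arff is in the working directory and that the class attribute is the last column (as it is in glass.arff):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Open file: the equivalent of Explorer's "Open file..." button.
        Instances data = new DataSource("glass.arff").getDataSet();

        // ARFF files do not mark the class attribute; by convention it is the last one.
        data.setClassIndex(data.numAttributes() - 1);

        // Check data: counts and per-attribute summary, like the Preprocess panel.
        System.out.println(data.numInstances() + " instances, "
                         + data.numAttributes() + " attributes");
        System.out.println(data.toSummaryString());
    }
}
```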

13 Build a Classifier
Choose a classifier
In this class we will use the J48 tree classifier (to build a decision tree)

14 Build a Classifier
Run the classifier

15 Build a Classifier
Understand the result:
- Information about the dataset
- Information about the result tree
- Information about accuracy
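A sketch of the same run through the Weka Java API, printing the three pieces of information listed above. This assumes default J48 settings and 10-fold cross-validation, which mirror the Explorer's defaults:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();                 // default settings: a pruned tree
        tree.buildClassifier(data);           // the "Start" button equivalent

        // Information about the dataset
        System.out.println(data.numInstances() + " instances of " + data.relationName());
        // Information about the result tree
        System.out.println(tree);
        // Information about accuracy (10-fold cross-validation, seed 1)
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString()); // confusion matrix
    }
}
```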

16 Building a Classifier
Tree pruning
https://youtu.be/ncR_6UsuggY
- Default: pruned
- Compare accuracy with an unpruned tree

17 Building a Classifier
Tree pruning
Change the configuration of the classifier and run again

18 Building a Classifier
Tree pruning
Compare the accuracy of the two runs (a scripted version of this comparison is sketched below)
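The pruned-versus-unpruned comparison from the last three slides can be scripted in one loop; a minimal sketch, where setUnpruned(true) is the API counterpart of ticking "unpruned" in the J48 configuration dialog:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PrunedVsUnpruned {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        for (boolean unpruned : new boolean[] { false, true }) {
            J48 tree = new J48();
            tree.setUnpruned(unpruned);   // false = J48's default, a pruned tree

            // Cross-validated accuracy for the comparison.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));

            // Build once on the full data to report the tree's size.
            tree.buildClassifier(data);
            System.out.printf("%s: %.2f%% correct, %.0f leaves%n",
                    unpruned ? "unpruned" : "pruned", eval.pctCorrect(),
                    tree.measureNumLeaves());
        }
    }
}
```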

19 Building a Classifier
Tree pruning
You can manually set the minimum number of instances per leaf: configure the classifier again and change the minNumObj setting. Heavier leaves result in smaller trees (see the sketch below).
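A sketch of that experiment: raise the minimum number of instances per leaf and watch the tree shrink. setMinNumObj is the API counterpart of the Explorer's minNumObj field (default 2); the particular values 2, 5, 15, 50 are just example settings.

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MinNumObjExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Default minNumObj is 2; heavier leaves give smaller trees.
        for (int m : new int[] { 2, 5, 15, 50 }) {
            J48 tree = new J48();
            tree.setMinNumObj(m);              // minimum instances per leaf
            tree.buildClassifier(data);
            System.out.printf("minNumObj=%2d -> %3.0f leaves, tree size %3.0f%n",
                    m, tree.measureNumLeaves(), tree.measureTreeSize());
        }
    }
}
```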

20 Building a Classifier
Visualize the tree
Right-click on a result in the result list

21 Building a Classifier
Visualize the tree
Right-click on the tree window and choose "Fit to screen"

22 Building a Classifier
Remove attributes and instances
- Run the classifier with fewer attributes and instances
- Compare the results
Remove attributes (a filter-based sketch follows below):
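Attributes can be removed in the Preprocess panel, or with Weka's Remove filter from code; a minimal sketch, where the indices "1,3" are only an example (which attributes you drop depends on your experiment):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();

        Remove remove = new Remove();
        remove.setAttributeIndices("1,3");     // 1-based indices of attributes to drop
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);

        System.out.println("before: " + data.numAttributes()
                         + " attributes, after: " + reduced.numAttributes());
    }
}
```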

23 Building a Classifier
Remove instances (sketch below):
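Instances can be removed the same way; one option is the RemovePercentage filter, sketched below (dropping 30% is an arbitrary example figure):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.RemovePercentage;

public class RemoveInstances {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();

        RemovePercentage rp = new RemovePercentage();
        rp.setPercentage(30.0);                // drop the first 30% of the instances
        rp.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, rp);

        System.out.println("before: " + data.numInstances()
                         + " instances, after: " + reduced.numInstances());
    }
}
```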

24 Evaluation
Training and testing

25 Evaluation
Training and testing
- You need a training data set and a testing data set
- It is important that the training set is different from the testing set
- If you have only one data set, split it into two sets

26 Evaluation
Training and testing example
Supply a test set (see the sketch below):
- segment-challenge.arff (training set)
- segment-test.arff (testing set)
If you have only one data set, specify a percentage split to create the two sets
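A minimal sketch of the supplied-test-set evaluation with the two segment files named above, assuming both are in the working directory:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SuppliedTestSet {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("segment-challenge.arff").getDataSet();
        Instances test  = new DataSource("segment-test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(train);           // learn only from the training set

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);        // score on the independent test set
        System.out.printf("%.2f%% correct on the test set%n", eval.pctCorrect());
    }
}
```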

27 Evaluation
Training and testing example
- Repeat the process with different percentage splits
- Repeat the process with different random number seeds (a scripted version is sketched below)
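Repeating the run over several seeds is easy to script: randomize with a given seed, hold out a fixed fraction for training, and record the accuracy each time. A minimal sketch; 66% is the Explorer's default split, and both the split fraction and the seed range are knobs to vary:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PercentageSplitSeeds {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("segment-challenge.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        for (int seed = 1; seed <= 5; seed++) {
            Instances copy = new Instances(data);   // keep the original order intact
            copy.randomize(new Random(seed));       // different seed = different split
            int trainSize = (int) Math.round(copy.numInstances() * 0.66);
            Instances train = new Instances(copy, 0, trainSize);
            Instances test  = new Instances(copy, trainSize,
                                            copy.numInstances() - trainSize);

            J48 tree = new J48();
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.printf("seed %d: %.2f%% correct%n", seed, eval.pctCorrect());
        }
    }
}
```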

28 Evaluation
Cross-validation
https://youtu.be/V0eL6MWxY-w
- Divide the dataset into 10 parts (folds)
- For each run, use one fold for testing and the remaining nine for training
- Each data point is used once for testing and 9 times for training
- Average the results
Stratified cross-validation:
- Ensures that each fold has the right proportion of each class value
Rule of thumb:
- If you have lots of data, use a percentage split
- Otherwise, use stratified 10-fold cross-validation
- 10-fold is the standard; increase the number of folds only if the data set is large enough that each test fold still has sufficient data
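The whole stratified 10-fold procedure is a single call in the Weka Java API; a minimal sketch (for a nominal class, crossValidateModel stratifies the folds):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TenFoldCV {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // 10 folds: each instance is tested once and trained on 9 times,
        // and the per-fold results are averaged into one summary.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```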

