CSCI N317 Computation for Scientific Applications
Unit 3-2: Weka Classification
Classification
Decision tree classification:
Predicts categorical class labels (discrete or nominal)
Classifies data by constructing a tree model from the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
Typical applications: credit approval, target marketing, medical diagnosis, fraud detection
Classification
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as a decision tree
Classification
[Figure: example decision tree with root node age? (<=30 / 31..40 / >40); the age <=30 branch tests student? (no / yes) and the age >40 branch tests credit_rating? (fair / excellent)]
Model usage: classifying future or unknown objects
Estimate the accuracy of the model using a test set:
The known label of each test sample is compared with the model's classification
The accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set is independent of the training set
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
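To make the figure concrete, a decision tree is just a set of nested if/else rules. A minimal Java sketch of the tree above; the leaf labels follow the classic buys_computer example from the textbook slides, so treat their exact placement as an assumption:

    public class BuysComputerTree {
        // The figure's decision tree, read as nested if/else rules.
        static String classify(String age, boolean student, String creditRating) {
            if (age.equals("<=30")) {
                return student ? "yes" : "no";      // youth branch tests student?
            } else if (age.equals("31..40")) {
                return "yes";                       // middle-aged branch is a pure "yes" leaf
            } else {                                // age > 40 branch tests credit_rating?
                return creditRating.equals("fair") ? "yes" : "no";
            }
        }

        public static void main(String[] args) {
            System.out.println(classify("<=30", true, "fair"));  // prints "yes"
        }
    }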
Presentation of Classification Results
Visualization of a Decision Tree in SGI/MineSet 3.0
Classifier Accuracy Measures
Confusion matrix: a matrix (also known as an error matrix) that allows visualization of the performance of an algorithm, typically a supervised learning one. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. The diagonal values are the correct classifications; all others are errors. E.g.:

    actual \ predicted     buy_computer = yes   buy_computer = no   total   recognition (%)
    buy_computer = yes                   6954                  46    7000             99.34
    buy_computer = no                     412                2588    3000             86.27
    total                                7366                2634   10000             95.42

Accuracy of a classifier M, acc(M): the percentage of test set tuples that are correctly classified by the model M
Error rate (misclassification rate) of M = 1 - acc(M)
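Accuracy and error rate follow directly from the matrix counts. A small Java sketch using the numbers from the table above:

    public class AccuracyFromMatrix {
        public static void main(String[] args) {
            // Confusion-matrix counts from the table above
            int yesYes = 6954;   // actual yes, predicted yes (correct)
            int yesNo  = 46;     // actual yes, predicted no  (error)
            int noYes  = 412;    // actual no,  predicted yes (error)
            int noNo   = 2588;   // actual no,  predicted no  (correct)

            int total = yesYes + yesNo + noYes + noNo;        // 10000
            double acc = (yesYes + noNo) / (double) total;    // 0.9542
            double errorRate = 1.0 - acc;                     // 0.0458
            System.out.printf("acc(M) = %.2f%%, error rate = %.2f%%%n",
                    100 * acc, 100 * errorRate);
        }
    }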
Data Preparation
Data cleaning: preprocess data to reduce noise and handle missing values
Relevance analysis (feature selection): remove irrelevant or redundant attributes
Data transformation: generalize and/or normalize data (adjust values measured on different scales to a notionally common scale)
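In Weka these preparation steps are available as filters (the same filters appear in the Explorer's Preprocess tab). A minimal normalization sketch via the Java API, assuming Weka 3.x on the classpath; the file name is a placeholder:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Normalize;

    public class NormalizeExample {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("mydata.arff");  // placeholder file name
            Normalize norm = new Normalize();                 // rescales numeric attributes to [0,1]
            norm.setInputFormat(data);
            Instances normalized = Filter.useFilter(data, norm);
            System.out.println(normalized.toSummaryString());
        }
    }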
Overfitting and Tree Pruning
Overfitting: an induced tree may overfit the training data
Too many branches, some of which may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples
Prune trees to avoid overfitting:
Remove branches from a "fully grown" tree
Use a set of data different from the training data to decide which is the "best pruned tree"
Weka Introduction
Data Mining and Weka:
https://www.youtube.com/watch?v=Exe4Dc8FmiM
https://www.youtube.com/watch?v=nHm8otvMVTs
Weka Datasets:
https://www.youtube.com/watch?v=BO6XJSaFYzk
Build a Classifier
https://www.youtube.com/watch?v=da-6IBnqzsg
Example dataset: glass.arff
Files used by Weka have the .arff extension
You can examine the file using Notepad++
Build a Classifier
Steps: Weka -> Explorer, open the file, and check the data (a programmatic equivalent is sketched below)
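The same load-and-inspect step can be done with Weka's Java API. A minimal sketch, assuming Weka 3.x on the classpath and glass.arff in the working directory:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadArff {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("glass.arff");
            data.setClassIndex(data.numAttributes() - 1);   // last attribute is the class
            System.out.println("Instances: " + data.numInstances()
                    + ", attributes: " + data.numAttributes());
        }
    }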
Build a Classifier
Choose a classifier
In our class, we will use a J48 tree classifier (to build a decision tree)
Build a Classifier Run the classifier
Build a Classifier
Understand the result:
Information about the dataset
Information about the result tree
Information about accuracy
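The Explorer steps above map onto a few Java calls. A minimal sketch that builds the tree and prints the same three pieces of output, assuming glass.arff in the working directory (the Explorer's default 10-fold cross-validation is mirrored here):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BuildJ48 {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("glass.arff");
            data.setClassIndex(data.numAttributes() - 1);   // class = last attribute

            J48 tree = new J48();                           // pruned C4.5 tree, default settings
            tree.buildClassifier(data);
            System.out.println(tree);                       // the result tree, as in the Explorer

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());     // accuracy information
        }
    }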
Building a Classifier
Tree pruning: https://youtu.be/ncR_6UsuggY
By default, J48 builds a pruned tree; compare its accuracy with an unpruned tree
Building a Classifier Tree pruning Change the configuration of the classifier and run again
Building a Classifier Tree pruning Compare the accuracy
Building a Classifier
Tree pruning
You can manually set the minimum number of instances per leaf: configure the classifier again and change the minNumObj setting. Requiring more instances per leaf results in smaller trees (see the sketch below):
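Both pruning-related settings are exposed on J48 in the Java API. A minimal comparison sketch, assuming glass.arff in the working directory; minNumObj = 15 is an arbitrary example value:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PruningComparison {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("glass.arff");
            data.setClassIndex(data.numAttributes() - 1);

            J48 unpruned = new J48();
            unpruned.setUnpruned(true);   // turn pruning off, as in the Explorer's "unpruned" option
            J48 heavyLeaves = new J48();
            heavyLeaves.setMinNumObj(15); // heavier leaves (15 is an example value)

            for (J48 tree : new J48[] { new J48(), unpruned, heavyLeaves }) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(tree, data, 10, new Random(1));
                System.out.printf("%s -> %.2f%% correct%n",
                        String.join(" ", tree.getOptions()), eval.pctCorrect());
            }
        }
    }

Raising minNumObj forces J48 to stop splitting earlier, which is why the resulting trees are smaller.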
Building a Classifier
Visualize the tree
Right-click on a result in the result list
Building a Classifier
Visualize the tree
Right-click in the tree window and choose "Fit to screen"
Building a Classifier
Remove attributes and instances: https://youtu.be/XySEe4uNsCY
Run the classifier with fewer attributes and instances, then compare the results
Remove attributes:
Building a Classifier Remove Instances:
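Both removals are also available programmatically as unsupervised filters. A minimal sketch, assuming glass.arff in the working directory; the attribute index and the 10% figure are arbitrary example values:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;
    import weka.filters.unsupervised.instance.RemovePercentage;

    public class RemoveExample {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("glass.arff");

            Remove removeAttr = new Remove();
            removeAttr.setAttributeIndices("1");   // drop the first attribute (example choice)
            removeAttr.setInputFormat(data);
            Instances fewerAttrs = Filter.useFilter(data, removeAttr);

            RemovePercentage removeInst = new RemovePercentage();
            removeInst.setPercentage(10.0);        // drop 10% of the instances (example choice)
            removeInst.setInputFormat(fewerAttrs);
            Instances fewerBoth = Filter.useFilter(fewerAttrs, removeInst);

            System.out.println(fewerBoth.numAttributes() + " attributes, "
                    + fewerBoth.numInstances() + " instances remain");
        }
    }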
Evaluation
Training and testing:
https://www.youtube.com/watch?v=FMiCOx95lAc
https://www.youtube.com/watch?v=7lFie7V__Gs
https://www.youtube.com/watch?v=V0eL6MWxY-w
https://www.youtube.com/watch?v=dGF475wS5eY
Evaluation
Training and testing
You need a training data set and a testing data set
It is important that the training set is different from the testing set
If you have only one data set, separate it into two sets
Evaluation
Training and testing example
Supply a test set: segment-challenge.arff (training set) and segment-test.arff (testing set)
If you have only one data set, specify a percentage split to create two sets
Evaluation
Training and testing example
Repeat the process with a different percentage split
Repeat the process with a different random number seed (see the sketch below)
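Both evaluation styles from this example look like this in the Java API. A minimal sketch, assuming segment-challenge.arff and segment-test.arff in the working directory; the 66% split and seed 1 are example values:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainTestExample {
        public static void main(String[] args) throws Exception {
            // Variant 1: separate training and testing files
            Instances train = DataSource.read("segment-challenge.arff");
            Instances test  = DataSource.read("segment-test.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            J48 tree = new J48();
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.println(eval.toSummaryString());

            // Variant 2: one data set, percentage split (66% train / 34% test)
            Instances data = DataSource.read("segment-challenge.arff");
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(1));   // seed 1; change it to repeat the experiment
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train2 = new Instances(data, 0, trainSize);
            Instances test2  = new Instances(data, trainSize, data.numInstances() - trainSize);
            // ... then build and evaluate exactly as in variant 1
        }
    }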
Evaluation
Cross-validation
https://youtu.be/V0eL6MWxY-w
https://youtu.be/dGF475wS5eY
Divide the dataset into 10 parts (folds)
For each run, use one fold for testing and the others for training
Each data point is used once for testing and 9 times for training
Average the results
Stratified cross-validation: ensure that each fold has the right proportion of each class value
Rule of thumb: if you have lots of data, use a percentage split; otherwise, use stratified 10-fold cross-validation
10-fold is the standard; you can increase the number of folds if the data set is large enough to leave sufficient data in each test fold
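Stratified 10-fold cross-validation is one call in the Java API. A minimal sketch, assuming segment-challenge.arff in the working directory; crossValidateModel stratifies the folds automatically when the class is nominal:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidationExample {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("segment-challenge.arff");
            data.setClassIndex(data.numAttributes() - 1);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));  // 10 folds, seed 1
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());  // confusion matrix over all 10 test folds
        }
    }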