
1 CSCI N317 Computation for Scientific Applications
Unit 3-2: Weka Classification

2 Classification
Decision tree classification:
- Predicts categorical class labels (discrete or nominal)
- Classifies data by constructing a tree model from the training set and the class labels in a classifying attribute, then uses that model to classify new data
Typical applications:
- Credit approval
- Target marketing
- Medical diagnosis
- Fraud detection

3 Classification
Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as a decision tree

4 Classification
Model usage: classifying future or unknown objects
- Estimate the accuracy of the model using a test set
- The known label of each test sample is compared with the classified result from the model
- The accuracy rate is the percentage of test set samples that are correctly classified by the model
- The test set is independent of the training set
- If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
[Figure: example decision tree. Root node age? with branches <=30, 31..40, and >40. The <=30 branch tests student? (no -> no, yes -> yes); the 31..40 branch is a yes leaf; the >40 branch tests credit rating? (excellent -> no, fair -> yes).]

5 Presentation of Classification Results

6 Visualization of a Decision Tree in SGI/MineSet 3.0

7 Classifier Accuracy Measures
Confusion matrix: a matrix (also known as an error matrix) that allows visualization of the performance of an algorithm, typically a supervised learning one. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. The diagonal values are the correct classifications; all other cells are errors.

Example (predicted classes in columns, actual classes in rows):

classes              buy_computer = yes   buy_computer = no   total   recognition (%)
buy_computer = yes   6954                 46                  7000    99.34
buy_computer = no    412                  2588                3000    86.27
total                7366                 2634                10000   95.42

Accuracy of a classifier M, acc(M): the percentage of test set tuples that are correctly classified by the model M. Here acc(M) = (6954 + 2588) / 10000 = 95.42%.
Error rate (misclassification rate) of M = 1 - acc(M)
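To make the arithmetic concrete, here is a minimal plain-Java sketch that computes accuracy and error rate from a confusion matrix; the hard-coded counts are the ones from the example table above.

```java
public class ConfusionMatrixAccuracy {
    public static void main(String[] args) {
        // Rows = actual class, columns = predicted class,
        // using the buy_computer counts from the table above.
        int[][] m = { { 6954, 46 },     // actual yes: predicted yes, predicted no
                      { 412, 2588 } };  // actual no:  predicted yes, predicted no

        int correct = 0, total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) correct += m[i][j];  // diagonal = correct classifications
            }
        }
        double acc = 100.0 * correct / total;
        System.out.printf("acc(M) = %.2f%%, error rate = %.2f%%%n", acc, 100.0 - acc);
        // Prints: acc(M) = 95.42%, error rate = 4.58%
    }
}
```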

8 Data Preparation
- Data cleaning: preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data, i.e. adjust values measured on different scales to a notionally common scale (a minimal Weka sketch follows below)
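As one concrete data-transformation step, the sketch below rescales all numeric attributes to a common [0, 1] range with Weka's Normalize filter. This is a minimal sketch, assuming glass.arff (the example dataset used later in this unit) sits in the working directory.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeExample {
    public static void main(String[] args) throws Exception {
        // Load the raw dataset (glass.arff ships in Weka's data folder).
        Instances data = new DataSource("glass.arff").getDataSet();

        // Normalize rescales every numeric attribute to [0, 1] by default.
        Normalize norm = new Normalize();
        norm.setInputFormat(data);               // learn min/max from the data
        Instances scaled = Filter.useFilter(data, norm);

        System.out.println(scaled.toSummaryString());
    }
}
```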

9 Overfitting and Tree Pruning
Overfitting: an induced tree may overfit the training data
- Too many branches, some of which may reflect anomalies due to noise or outliers
- Poor accuracy on unseen samples
Prune trees to avoid overfitting:
- Remove branches from a "fully grown" tree
- Use a set of data different from the training data to decide which is the "best pruned tree"

10 Weka Introduction
Data Mining and Weka:
Weka Datasets:

11 Build a Classifier
https://www.youtube.com/watch?v=da-6IBnqzsg
Example:
- Dataset: glass.arff
- Files used by Weka have the .arff extension
- You can examine the file in a text editor such as Notepad++
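For reference, an ARFF file is plain text: a @relation name, a list of @attribute declarations, then the @data rows. Below is an abridged sketch in the shape of glass.arff; the attribute names follow the UCI glass data, but the class-label spellings and the sample row are illustrative, so check your copy of the file for the exact values.

```
@relation Glass

@attribute RI numeric
@attribute Na numeric
@attribute Mg numeric
@attribute Al numeric
@attribute Si numeric
@attribute K numeric
@attribute Ca numeric
@attribute Ba numeric
@attribute Fe numeric
@attribute Type {'build wind float','build wind non-float','vehic wind float','vehic wind non-float',containers,tableware,headlamps}

@data
% one comma-separated row per instance (values below are illustrative)
1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.00,'build wind float'
```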

12 Build a Classifier
Steps: Weka -> Explorer
Open the file and check the data
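The same open-and-inspect step can also be scripted. A minimal sketch with the Weka Java API, assuming glass.arff is in the working directory and that the class attribute is the last column (as it is in glass.arff):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Open file: the equivalent of Explorer's "Open file..." button.
        Instances data = new DataSource("glass.arff").getDataSet();

        // ARFF files do not mark the class attribute; by convention it is the last one.
        data.setClassIndex(data.numAttributes() - 1);

        // Check data: counts and per-attribute summary, like the Preprocess panel.
        System.out.println(data.numInstances() + " instances, "
                         + data.numAttributes() + " attributes");
        System.out.println(data.toSummaryString());
    }
}
```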

13 Build a Classifier
Choose a classifier
In this class we will use the J48 tree classifier (to build a decision tree)

14 Build a Classifier
Run the classifier

15 Build a Classifier
Understand the result:
- Information about the dataset
- Information about the result tree
- Information about accuracy
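A sketch of the same run through the Weka Java API, printing the three pieces of information listed above. This assumes default J48 settings and 10-fold cross-validation, which mirror the Explorer's defaults:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();                 // default settings: a pruned tree
        tree.buildClassifier(data);           // the "Start" button equivalent

        // Information about the dataset
        System.out.println(data.numInstances() + " instances of " + data.relationName());
        // Information about the result tree
        System.out.println(tree);
        // Information about accuracy (10-fold cross-validation, seed 1)
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString()); // confusion matrix
    }
}
```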

16 Building a Classifier
Tree pruning
https://youtu.be/ncR_6UsuggY
- Default: pruned
- Compare accuracy with an unpruned tree

17 Building a Classifier
Tree pruning
Change the configuration of the classifier and run again

18 Building a Classifier
Tree pruning
Compare the accuracy of the two runs (a scripted version of this comparison is sketched below)
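The pruned-versus-unpruned comparison from the last three slides can be scripted in one loop; a minimal sketch, where setUnpruned(true) is the API counterpart of ticking "unpruned" in the J48 configuration dialog:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PrunedVsUnpruned {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        for (boolean unpruned : new boolean[] { false, true }) {
            J48 tree = new J48();
            tree.setUnpruned(unpruned);   // false = J48's default, a pruned tree

            // Cross-validated accuracy for the comparison.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));

            // Build once on the full data to report the tree's size.
            tree.buildClassifier(data);
            System.out.printf("%s: %.2f%% correct, %.0f leaves%n",
                    unpruned ? "unpruned" : "pruned", eval.pctCorrect(),
                    tree.measureNumLeaves());
        }
    }
}
```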

19 Building a Classifier
Tree pruning
You can manually set the minimum number of instances per leaf: configure the classifier again and change the minNumObj setting. Heavier leaves result in smaller trees (see the sketch below).
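A sketch of that experiment: raise the minimum number of instances per leaf and watch the tree shrink. setMinNumObj is the API counterpart of the Explorer's minNumObj field (default 2); the particular values 2, 5, 15, 50 are just example settings.

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MinNumObjExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Default minNumObj is 2; heavier leaves give smaller trees.
        for (int m : new int[] { 2, 5, 15, 50 }) {
            J48 tree = new J48();
            tree.setMinNumObj(m);              // minimum instances per leaf
            tree.buildClassifier(data);
            System.out.printf("minNumObj=%2d -> %3.0f leaves, tree size %3.0f%n",
                    m, tree.measureNumLeaves(), tree.measureTreeSize());
        }
    }
}
```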

20 Building a Classifier
Visualize the tree
Right-click on a result in the result list

21 Building a Classifier
Visualize the tree
Right-click on the tree window and choose "Fit to screen"

22 Building a Classifier
Remove attributes and instances
- Run the classifier with fewer attributes and instances
- Compare the results
Remove attributes (a filter-based sketch follows below):
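Attributes can be removed in the Preprocess panel, or with Weka's Remove filter from code; a minimal sketch, where the indices "1,3" are only an example (which attributes you drop depends on your experiment):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();

        Remove remove = new Remove();
        remove.setAttributeIndices("1,3");     // 1-based indices of attributes to drop
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);

        System.out.println("before: " + data.numAttributes()
                         + " attributes, after: " + reduced.numAttributes());
    }
}
```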

23 Building a Classifier
Remove instances (sketch below):
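Instances can be removed the same way; one option is the RemovePercentage filter, sketched below (dropping 30% is an arbitrary example figure):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.RemovePercentage;

public class RemoveInstances {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();

        RemovePercentage rp = new RemovePercentage();
        rp.setPercentage(30.0);                // drop the first 30% of the instances
        rp.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, rp);

        System.out.println("before: " + data.numInstances()
                         + " instances, after: " + reduced.numInstances());
    }
}
```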

24 Evaluation
Training and testing

25 Evaluation
Training and testing
- You need a training data set and a testing data set
- It is important that the training set is different from the testing set
- If you have only one data set, split it into two sets

26 Evaluation
Training and testing example
Supply a test set (see the sketch below):
- segment-challenge.arff (training set)
- segment-test.arff (testing set)
If you have only one data set, specify a percentage split to create the two sets
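A minimal sketch of the supplied-test-set evaluation with the two segment files named above, assuming both are in the working directory:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SuppliedTestSet {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("segment-challenge.arff").getDataSet();
        Instances test  = new DataSource("segment-test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(train);           // learn only from the training set

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);        // score on the independent test set
        System.out.printf("%.2f%% correct on the test set%n", eval.pctCorrect());
    }
}
```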

27 Evaluation
Training and testing example
- Repeat the process with different percentage splits
- Repeat the process with different random number seeds (a scripted version is sketched below)
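Repeating the run over several seeds is easy to script: randomize with a given seed, hold out a fixed fraction for training, and record the accuracy each time. A minimal sketch; 66% is the Explorer's default split, and both the split fraction and the seed range are knobs to vary:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PercentageSplitSeeds {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("segment-challenge.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        for (int seed = 1; seed <= 5; seed++) {
            Instances copy = new Instances(data);   // keep the original order intact
            copy.randomize(new Random(seed));       // different seed = different split
            int trainSize = (int) Math.round(copy.numInstances() * 0.66);
            Instances train = new Instances(copy, 0, trainSize);
            Instances test  = new Instances(copy, trainSize,
                                            copy.numInstances() - trainSize);

            J48 tree = new J48();
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.printf("seed %d: %.2f%% correct%n", seed, eval.pctCorrect());
        }
    }
}
```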

28 Evaluation
Cross-validation
https://youtu.be/V0eL6MWxY-w
- Divide the dataset into 10 parts (folds)
- For each run, use one fold for testing and the remaining nine for training
- Each data point is used once for testing and 9 times for training
- Average the results
Stratified cross-validation:
- Ensures that each fold has the right proportion of each class value
Rule of thumb:
- If you have lots of data, use a percentage split
- Otherwise, use stratified 10-fold cross-validation
- 10-fold is the standard; increase the number of folds only if the data set is large enough that each test fold still has sufficient data
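The whole stratified 10-fold procedure is a single call in the Weka Java API; a minimal sketch (for a nominal class, crossValidateModel stratifies the folds):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TenFoldCV {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // 10 folds: each instance is tested once and trained on 9 times,
        // and the per-fold results are averaged into one summary.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```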

