CSCI N317 Computation for Scientific Applications Unit Weka

CSCI N317 Computation for Scientific Applications Unit 3 - 2 Weka Classification

Classification
Decision tree classification:
- Predicts categorical class labels (discrete or nominal)
- Classifies data (constructs a tree model) based on the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
Typical applications:
- Credit approval
- Target marketing
- Medical diagnosis
- Fraud detection

Classification
Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as a decision tree

Classification
Model usage: classifying future or unknown objects
- Estimate the accuracy of the model using a test set
- The known labels of the test set are compared with the model's predictions
- The accuracy rate is the percentage of test-set samples that are correctly classified by the model
- The test set is independent of the training set
- If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Example tree (the classic buys_computer example): the root tests age; the branch age <=30 tests student (no -> no, yes -> yes); the branch age 31..40 predicts yes; the branch age >40 tests credit rating (fair -> yes, excellent -> no)
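The decision logic of the small example tree above can be written as nested conditionals. This is an illustrative sketch only; the attribute names and leaf labels are taken from the textbook buys_computer example, not from any Weka output:

```python
def buys_computer(age, student, credit_rating):
    """Classify one tuple with the example decision tree.

    Attribute values follow the textbook example:
    age in {"<=30", "31..40", ">40"}, student in {"yes", "no"},
    credit_rating in {"fair", "excellent"}.
    """
    if age == "<=30":
        # Young customers: the decision depends on student status
        return "yes" if student == "yes" else "no"
    elif age == "31..40":
        # Middle-aged customers always buy in this example
        return "yes"
    else:
        # age > 40: the decision depends on credit rating
        return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<=30", "yes", "fair"))   # prints "yes"
```

Each internal node becomes a test on one attribute, and each leaf becomes a returned class label, which is exactly how a learned tree classifies new tuples.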

Presentation of Classification Results
(screenshot omitted; source: Data Mining: Concepts and Techniques)

Visualization of a Decision Tree in SGI/MineSet 3.0
(screenshot omitted)

Classifier Accuracy Measures
- Confusion matrix (also known as an error matrix): a matrix that allows visualization of the performance of an algorithm, typically a supervised learning one
- Each column of the matrix represents the instances in a predicted class; each row represents the instances in an actual class
- The diagonal values are the correct classifications; the others are errors
- Accuracy of a classifier M, acc(M): percentage of test-set tuples that are correctly classified by the model M
- Error rate (misclassification rate) of M = 1 - acc(M)
Example:

classes            | buy_computer = yes | buy_computer = no | total | recognition (%)
buy_computer = yes | 6954               | 46                | 7000  | 99.34
buy_computer = no  | 412                | 2588              | 3000  | 86.27
total              | 7366               | 2634              | 10000 | 95.42
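Accuracy and error rate fall directly out of the confusion matrix: sum the diagonal and divide by the total. A short sketch using the numbers from the example above:

```python
# Confusion matrix from the slide: rows = actual class, columns = predicted.
# Row/column order: buy_computer = yes, buy_computer = no.
confusion = [
    [6954, 46],    # actual yes: 6954 predicted yes, 46 predicted no
    [412, 2588],   # actual no:  412 predicted yes, 2588 predicted no
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))  # diagonal

accuracy = correct / total        # acc(M)
error_rate = 1 - accuracy         # misclassification rate
print(f"acc(M) = {accuracy:.2%}, error rate = {error_rate:.2%}")
# acc(M) = 95.42%, error rate = 4.58%
```

The per-class recognition rates in the table come from dividing each diagonal entry by its row total (6954/7000 and 2588/3000).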

Data Preparation
- Data cleaning: preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection): remove the irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data (adjusting values measured on different scales to a notionally common scale)
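The normalization step can be illustrated with simple min-max scaling, which maps values measured on different scales onto a common [0, 1] range. This is a generic sketch of the idea, not a Weka filter:

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 15, 20]))  # [0.0, 0.5, 1.0]
```

After this transformation, attributes such as age (tens) and income (tens of thousands) contribute on the same scale.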

Overfitting and Tree Pruning
- Overfitting: an induced tree may overfit the training data
- Too many branches, some of which may reflect anomalies due to noise or outliers
- Poor accuracy on unseen samples
Prune trees to avoid overfitting:
- Remove branches from a "fully grown" tree
- Use a data set different from the training data to decide which is the "best pruned tree"
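Choosing the "best pruned tree" on held-out data amounts to comparing candidate trees by their validation accuracy. A minimal sketch of that selection step; the two candidate models here are hypothetical stand-ins for a fully grown and a pruned tree:

```python
def validation_accuracy(model, validation_set):
    """Fraction of validation tuples the model labels correctly."""
    correct = sum(1 for x, label in validation_set if model(x) == label)
    return correct / len(validation_set)

def pick_best(candidates, validation_set):
    """Return the candidate model with the highest validation accuracy."""
    return max(candidates, key=lambda m: validation_accuracy(m, validation_set))

# Hypothetical candidates: an overfit rule vs. a simpler pruned rule.
full_tree = lambda x: "yes" if x == 7 else "no"   # memorized one noisy case
pruned_tree = lambda x: "yes" if x > 5 else "no"  # simpler, more general rule

validation = [(2, "no"), (6, "yes"), (7, "yes"), (9, "yes")]
best = pick_best([full_tree, pruned_tree], validation)
print(validation_accuracy(best, validation))      # prints 1.0
```

On this validation set the overfit rule scores 0.5 while the pruned rule scores 1.0, so the simpler tree is kept, which is the intuition behind validation-based pruning.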

Weka Introduction
Data Mining and Weka:
- https://www.youtube.com/watch?v=Exe4Dc8FmiM
- https://www.youtube.com/watch?v=nHm8otvMVTs
Weka Datasets:
- https://www.youtube.com/watch?v=BO6XJSaFYzk

Build a Classifier
https://www.youtube.com/watch?v=da-6IBnqzsg
- Example dataset: glass.arff
- Files used by Weka have the .arff extension
- You can examine the file with a text editor such as Notepad++
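An ARFF file is plain text: a relation name, a list of attribute declarations, then comma-separated data rows. An abridged sketch in the style of glass.arff (the attribute list and class values are shortened for illustration; open the real file to see the full set):

```
% Abridged illustration of the ARFF format
@relation glass

@attribute RI numeric
@attribute Na numeric
@attribute Type {type_a, type_b, type_c}

@data
1.52101,13.64,type_a
1.51761,13.89,type_b
```

Lines beginning with % are comments, numeric attributes hold real values, and a nominal attribute such as the class lists its allowed values in braces.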

Build a Classifier
Steps:
- Start Weka -> Explorer
- Open the file and check the data

Build a Classifier
- Choose a classifier
- In our class, we will use the J48 tree classifier (to build a decision tree)

Build a Classifier Run the classifier

Build a Classifier
Understand the result:
- Information about the dataset
- Information about the result tree
- Information about accuracy

Building a Classifier
Tree pruning: https://youtu.be/ncR_6UsuggY
- Default: pruned
- Compare accuracy with an unpruned tree

Building a Classifier
Tree pruning:
- Change the configuration of the classifier and run again

Building a Classifier
Tree pruning:
- Compare the accuracy

Building a Classifier
Tree pruning:
- You can manually set the minimum number of instances per leaf
- Configure the classifier again and change the minNumObj setting
- Requiring more instances per leaf results in smaller trees

Building a Classifier
Visualize the tree:
- Right-click on a finished run in the result list

Building a Classifier
Visualize the tree:
- Right-click on the tree window and choose "fit to screen"

Building a Classifier
Remove attributes and instances: https://youtu.be/XySEe4uNsCY
- Run the classifier with fewer attributes and instances
- Compare results
- Remove attributes

Building a Classifier
- Remove instances

Evaluation
Training and testing:
- https://www.youtube.com/watch?v=FMiCOx95lAc
- https://www.youtube.com/watch?v=7lFie7V__Gs
- https://www.youtube.com/watch?v=V0eL6MWxY-w
- https://www.youtube.com/watch?v=dGF475wS5eY

Evaluation
Training and testing:
- You need a training data set and a testing data set
- It is important that the training set is different from the testing set
- If you have only one data set, separate it into two sets

Evaluation
Training and testing example:
- Supply a test set: segment-challenge.arff (training set) and segment-test.arff (testing set)
- If there is only one data set, specify a percentage split to create two sets
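Conceptually, a percentage split shuffles the data once with a fixed seed and cuts it at the chosen fraction. A stdlib sketch of that idea (the 66% fraction and the seed value are illustrative defaults, not Weka internals):

```python
import random

def percentage_split(data, train_fraction=0.66, seed=1):
    """Shuffle with a fixed seed, then cut into train and test sets."""
    shuffled = data[:]                    # copy so the original stays intact
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

instances = list(range(100))              # stand-in for 100 dataset rows
train, test = percentage_split(instances)
print(len(train), len(test))              # prints 66 34
```

Because the seed is fixed, the same split is reproduced on every run; changing the seed produces a different partition of the same data.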

Evaluation
Training and testing example:
- Repeat the process with different percentage splits
- Repeat the process with different random number seeds

Evaluation
Cross-validation:
- https://youtu.be/V0eL6MWxY-w
- https://youtu.be/dGF475wS5eY
- Divide the dataset into 10 parts (folds)
- For each run, use one fold for testing and the others for training
- Each data point is used once for testing and 9 times for training
- Average the results
Stratified cross-validation:
- Ensures that each fold has the right proportion of each class value
Rule of thumb:
- If you have lots of data, use a percentage split
- Otherwise, use stratified 10-fold cross-validation
- 10 folds is the standard; increase the number only if the data set is large enough to leave sufficient data in each test fold
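The fold bookkeeping behind 10-fold cross-validation can be sketched in a few lines. This plain (non-stratified) version shows how every instance lands in exactly one test fold and in the training set of all other runs:

```python
def k_fold_indices(n, k=10):
    """Yield (test_indices, train_indices) pairs for k-fold CV over n items."""
    # Round-robin assignment: item i goes to fold i % k
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield test, train

# With 20 instances and 10 folds, each run tests 2 instances
# and trains on the remaining 18.
for test_idx, train_idx in k_fold_indices(20, 10):
    print(len(test_idx), len(train_idx))   # prints "2 18" ten times
```

A stratified version would additionally assign instances fold by fold within each class so that every fold mirrors the class proportions of the whole dataset.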