Assignment 1: Classification by K Nearest Neighbors (KNN) technique


Due 8/31/17
Dataset on class web page, from Golub et al., Science 286 (1999) 531-537.

Download and familiarize yourself with Weka. Weka is a useful tool that implements most of the major machine learning algorithms. We will be using the Weka GUI in this assignment. Download and start the Weka GUI, following the instructions on the Weka site: http://www.cs.waikato.ac.nz/ml/weka/
You will see four buttons; we will only use the Explorer functionality. Plenty of Weka documentation is available. You may familiarize yourself with Explorer as much as you like by reading the Explorer guide and using the provided sample datasets.

Open the leukemia gene expression file in Weka. This file has data from 72 leukemia patients (rows). The expression values are for 150 genes (columns). The last column is the type of leukemia (ALL or AML) for each patient.
Q1. What is the mean expression value of the gene labeled “CD33 CD33 antigen (differentiation antigen)”?
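
Q1 asks for a per-column mean. As a rough cross-check outside Weka, the calculation can be sketched in Python as below. The three-row table, the column names `gene_a` and `gene_b`, and the helper `column_mean` are made up for illustration; they are not the real leukemia data, and the sketch assumes the data has been exported as CSV.

```python
import csv
import io
from statistics import mean

# Hypothetical miniature of the data file: one row per patient, one column per
# gene, class label in the last column. The values are invented.
SAMPLE = """gene_a,gene_b,leukemia_type
10,200,ALL
30,100,ALL
20,300,AML
"""

def column_mean(csv_text, column):
    """Return the mean of one numeric column of CSV text."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return mean(float(row[column]) for row in rows)

gene_a_mean = column_mean(SAMPLE, "gene_a")
print(gene_a_mean)
```

The mean is taken over all patients, regardless of leukemia type.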

Go to the “Classify” tab. Under “Classifier”, click the “Choose” button. Expand the “lazy” menu and choose “IBk”. This is KNN; IBk stands for Instance-Based k. Click on the text in the parameter box for IBk and a menu will pop up. For “KNN”, enter 5. Recall that this means the algorithm will use the five nearest neighbors to classify each data point. Leave the rest of the values at their defaults.
Under “Test options”, choose “Cross-validation” and under “Folds” enter 5. The dropdown menu below Test options should say “(Nom) leukemia_type”. This means that the algorithm will classify “leukemia_type” (AML or ALL), using the gene expression values as attributes. Click the “Start” button. The main window will show a variety of results, such as accuracy, true positive rates, false positive rates, and a confusion matrix when ALL is treated as the positive class.
Q2a. What is the % of correctly classified instances?
Q2b. Calculate the TP and FP rates for ALL from the confusion matrix.
Q2c. What is the confusion matrix when AML is treated as the positive class?
Q2d. Calculate the TP and FP rates for AML from the new confusion matrix.
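
To make the steps above concrete, here is a minimal Python sketch of 5-nearest-neighbor classification with 5-fold cross-validation, followed by TP and FP rates for either choice of positive class. It is not Weka's IBk implementation: the function names and the two-dimensional toy data are hypothetical, Euclidean distance is assumed, and folds are assigned by simple striding, whereas Weka's own fold assignment is randomized and will differ.

```python
import math
from collections import Counter

def knn_predict(train, query, k=5):
    """Majority vote among the k training points closest to the query.
    train is a list of (features, label) pairs."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def cross_validate(data, folds=5, k=5):
    """Return (actual, predicted) pairs; each point is predicted by a model
    trained on the other folds only."""
    pairs = []
    for f in range(folds):
        test = data[f::folds]
        train = [item for i, item in enumerate(data) if i % folds != f]
        for features, label in test:
            pairs.append((label, knn_predict(train, features, k)))
    return pairs

def tp_fp_rates(pairs, positive):
    """TP rate = TP / (TP + FN); FP rate = FP / (FP + TN)."""
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp / (tp + fn), fp / (fp + tn)

# Invented toy data: two well-separated clusters plus one AML point that
# sits inside the ALL cluster and will be misclassified.
data = [((x, y), "ALL") for x in range(3) for y in range(4)]
data += [((x + 10, y + 10), "AML") for x in range(2) for y in range(4)]
data.append(((1, 1.5), "AML"))

pairs = cross_validate(data, folds=5, k=5)
accuracy = sum(1 for a, p in pairs if a == p) / len(pairs)
print("accuracy:", accuracy)
print("ALL as positive (TP rate, FP rate):", tp_fp_rates(pairs, "ALL"))
print("AML as positive (TP rate, FP rate):", tp_fp_rates(pairs, "AML"))
```

Switching the positive class does not change the predictions, only which cells of the confusion matrix count as TP, FP, FN, and TN. That is why Q2b and Q2d can give different rates from the same run.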

Right click on your result in the “Result list” on the left side of the screen. Choose “Visualize threshold curve” and “ALL”. An ROC curve plots true positive (TP) rate against false positive (FP) rate; these are the default axes. You can also view other types of curves by changing the dropdown menus. For example, precision-recall curves are an alternative to ROC curves; precision and recall are options in the dropdown menus.
Q3a. Capture the ROC curve when ALL is the positive class.
Q3b. Capture the ROC curve when AML is the positive class.
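
How the points on an ROC curve arise can be sketched as a threshold sweep: rank the instances by the classifier's score for the positive class, then lower the threshold one instance at a time and record the FP and TP rates at each step. The scores below are invented, and `roc_points` and `auc` are hypothetical helper names; for KNN, one plausible score (an assumption here) is the fraction of the k neighbors that vote for the positive class.

```python
def roc_points(scored, positive):
    """scored is a list of (score_for_positive_class, actual_label) pairs.
    Returns (fp_rate, tp_rate) points from the strictest threshold to the loosest."""
    n_pos = sum(1 for _, label in scored if label == positive)
    n_neg = len(scored) - n_pos
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == positive:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance that shares this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Invented scores: the classifier's confidence that each patient is ALL.
scored = [(0.9, "ALL"), (0.8, "ALL"), (0.7, "AML"), (0.6, "ALL"), (0.2, "AML")]
curve_all = roc_points(scored, "ALL")
print(curve_all)
print("area under curve:", auc(curve_all))
```

To draw the curve with AML as the positive class, the same function can be given the scores for AML instead, for example one minus each ALL score in a two-class problem.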

ZeroR is a baseline classifier that identifies the class with the most examples and predicts all examples to be in that class. Click the “Choose” button under Classifier, and expand the “rules” folder. Choose “ZeroR”. Again use cross-validation with Folds=5. Run it.
Q4a. What is the % of correctly classified instances?
Q4b. Calculate the TP and FP rates for ALL from the confusion matrix.
Q4c. What is the confusion matrix when AML is treated as the positive class?
Q4d. Calculate the TP and FP rates for AML from the new confusion matrix.
Any successful classifier should yield more accurate results than ZeroR; however, if the numbers of examples of each class in the data set are greatly imbalanced, results with ZeroR will look good because most of the examples are correctly classified. This is an indication that you need to deal with uneven class sizes by weighting, a topic not covered in this class. Check out the “Cost Sensitive Classifier” and “Cost Sensitive evaluation” if you’re interested.
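
The imbalance effect described above can be seen in a few lines of Python. This is a sketch under stated assumptions: the 90/10 class split is invented (it is not the split in the leukemia data), `zero_r` is a hypothetical helper, and for brevity the baseline is scored on the same labels it was fitted on rather than by cross-validation.

```python
from collections import Counter

def zero_r(train_labels):
    """Return the majority class; ZeroR predicts it for every instance."""
    return Counter(train_labels).most_common(1)[0][0]

# Invented, heavily imbalanced label set.
labels = ["ALL"] * 90 + ["AML"] * 10

majority = zero_r(labels)
predictions = [majority] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Rates with ALL as the positive class.
tp_rate_all = sum(p == "ALL" for p, y in zip(predictions, labels) if y == "ALL") / labels.count("ALL")
fp_rate_all = sum(p == "ALL" for p, y in zip(predictions, labels) if y == "AML") / labels.count("AML")

# TP rate with AML as the positive class.
tp_rate_aml = sum(p == "AML" for p, y in zip(predictions, labels) if y == "AML") / labels.count("AML")

print(majority, accuracy, tp_rate_all, fp_rate_all, tp_rate_aml)
```

The accuracy looks high, yet the FP rate for ALL equals its TP rate and no AML patient is ever identified. Accuracy alone hides this, which is why the questions also ask for TP and FP rates.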

Extended HW1: Try naïve Bayes in Weka for the leukemia data set.
Under the “bayes” Classifier folder, choose “NaiveBayes” and run. What is the % of correctly classified instances? What are the TP and FP rates for ALL and AML?
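
For intuition about what a naïve Bayes classifier computes on numeric attributes, here is a minimal sketch that assumes each gene follows a normal distribution within each class and treats genes as independent given the class. It is not Weka's NaiveBayes code; the function names and the two-gene toy data are hypothetical, and the small variance floor is an added safeguard against division by zero.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance

def fit_gaussian_nb(data):
    """data is a list of (features, label) pairs.
    Returns {label: (prior, [(mean, variance) for each feature])}."""
    by_class = defaultdict(list)
    for features, label in data:
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))
        # The small floor keeps the variance away from zero.
        stats = [(mean(col), pvariance(col) + 1e-9) for col in columns]
        model[label] = (len(rows) / len(data), stats)
    return model

def predict_nb(model, features):
    """Pick the class with the largest log prior plus summed log likelihoods."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for x, (mu, var) in zip(features, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

# Invented toy data: two "genes" per patient.
train = [((1.0, 5.0), "ALL"), ((2.0, 6.0), "ALL"), ((1.5, 5.5), "ALL"),
         ((8.0, 1.0), "AML"), ((9.0, 2.0), "AML"), ((8.5, 1.5), "AML")]

model = fit_gaussian_nb(train)
print(predict_nb(model, (1.2, 5.8)))
print(predict_nb(model, (8.8, 1.1)))
```

Unlike KNN, this classifier summarizes each class by per-gene means and variances and does not keep the training points around at prediction time.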