Published by Filip Havlíček. Modified over 5 years ago.
1
Assignment 1: Classification by K Nearest Neighbors (KNN) technique
Due 8/31/17
Dataset on class web page, from Golub et al., Science, 286 (1999).
Download and familiarize yourself with Weka. Weka is a useful tool that implements most of the major machine learning algorithms. We will be using the Weka GUI in this assignment.
Download and start the Weka GUI. Follow the instructions on the Weka site:
You will see four buttons; we will only use the Explorer functionality.
Extensive Weka documentation is available. You may familiarize yourself with Explorer as much as you like by reading the user guide and using the provided sample datasets: Explorer guide
2
Open the leukemia gene expression file in Weka. This file has data from 72 leukemia patients (rows). The expression values are for 150 genes (columns). The last column is the type of leukemia (ALL or AML) for each patient. Q1. What is the mean value of expression of the gene labeled “CD33 CD33 antigen (differentiation antigen)”?
3
Go to the “Classify” tab. Under “Classifier”, click the “Choose” button. Expand the “lazy” menu and choose “IBk”. This is KNN; IBk stands for Instance-Based k.
Click on the text in the parameter box for IBk. A menu will pop up. For “KNN”, enter 5. Recall that this means the algorithm will use the five nearest neighbors to classify each data point. Leave the rest of the values at their defaults.
Under “Test options”, choose “Cross-validation” and under “Folds” enter 5. The dropdown menu below Test options should say “(Nom) leukemia_type”. This means that the algorithm will classify “leukemia_type” (AML or ALL), using the gene expression values as attributes.
Click the “Start” button. The main window will show a variety of results, such as accuracy, true positive rates, false positive rates, and a confusion matrix when ALL is treated as the positive class.
Q2a. What is the % of correctly classified instances?
Q2b. Calculate the TP and FP rates for ALL from the confusion matrix.
Q2c. What is the confusion matrix when AML is treated as the positive class?
Q2d. Calculate the TP and FP rates for AML from the new confusion matrix.
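For readers who want to see the idea outside the Weka GUI, here is a minimal Python sketch of KNN voting with k = 5 and of TP/FP rates computed for a chosen positive class. It is not Weka's IBk implementation. The function names, the two-gene toy values, and the example predictions are all made up for illustration and are not taken from the leukemia data set.

```python
import math
from collections import Counter

def knn_predict(train, query, k=5):
    """Return the majority label among the k training points nearest to query.

    train is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def tp_fp_rates(actual, predicted, positive):
    """Return (TP rate, FP rate) when `positive` is treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical expression values for two genes (not real data).
train = [
    ((1.0, 1.0), "ALL"), ((1.2, 0.8), "ALL"), ((0.9, 1.1), "ALL"),
    ((1.1, 1.3), "ALL"), ((3.0, 3.0), "AML"), ((3.2, 2.9), "AML"),
    ((2.8, 3.1), "AML"),
]
print(knn_predict(train, (1.0, 1.2)))  # 4 of the 5 nearest neighbors are ALL
print(knn_predict(train, (3.0, 3.1)))  # 3 of the 5 nearest neighbors are AML

# Hypothetical true labels and predictions (not real results).
actual = ["ALL", "ALL", "ALL", "AML", "AML"]
predicted = ["ALL", "ALL", "AML", "ALL", "AML"]
print(tp_fp_rates(actual, predicted, "ALL"))
print(tp_fp_rates(actual, predicted, "AML"))
```

Switching the positive class does not change the predictions. It only changes which cells of the confusion matrix count as TP, FN, FP, and TN.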
4
Right-click on your result in the “Result list” on the left side of the screen. Choose “Visualize threshold curve” and “ALL”. An ROC curve plots the true positive (TP) rate against the false positive (FP) rate; these are the default axes. You can also view other types of curves by changing the dropdown menus. For example, precision-recall curves are an alternative to ROC curves; precision and recall are options in the dropdown menus.
Q3a. Capture the ROC curve when ALL is the positive class.
Q3b. Capture the ROC curve when AML is the positive class.
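As a rough illustration of how an ROC curve is built, the sketch below lowers a decision threshold one example at a time and records the (FP rate, TP rate) pair after each step. It assumes each example has a score for the positive class, such as a predicted probability. The function name, scores, and labels are hypothetical, and tied scores are not given any special treatment.

```python
def roc_points(scores, labels, positive):
    """Return (FP rate, TP rate) points from sweeping a threshold over scores.

    scores are assumed to be higher for examples more likely to be `positive`.
    """
    n_pos = sum(label == positive for label in labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    # Visit examples from highest to lowest score; each step lowers the threshold.
    for _, label in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if label == positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical scores for the ALL class (not real results).
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = ["ALL", "ALL", "AML", "ALL", "AML"]
for fp_rate, tp_rate in roc_points(scores, labels, "ALL"):
    print(f"FP rate {fp_rate:.2f}, TP rate {tp_rate:.2f}")
```

The curve always starts at (0, 0) and ends at (1, 1). A classifier that ranks every positive example above every negative one passes through the top-left corner.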
5
ZeroR is a baseline classifier that identifies the class with the most examples and predicts all examples to be in that class. Click the “Choose” button under “Classifier” and expand the “rules” folder. Choose “ZeroR”. Again use cross-validation with Folds=5. Run it.
Q4a. What is the % of correctly classified instances?
Q4b. Calculate the TP and FP rates for ALL from the confusion matrix.
Q4c. What is the confusion matrix when AML is treated as the positive class?
Q4d. Calculate the TP and FP rates for AML from the new confusion matrix.
Any successful classifier should yield more accurate results than ZeroR. However, if the numbers of examples in each class are greatly imbalanced, results with ZeroR will look good because most of the examples are correctly classified. This is an indication that you need to deal with uneven class sizes by weighting, a topic not covered in this class. Check out the “Cost Sensitive Classifier” and “Cost Sensitive evaluation” if you’re interested.
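The effect of class imbalance on a majority-class baseline can be sketched in a few lines of Python. This is an illustration of the idea, not Weka's ZeroR code. The 90/10 class split is invented for the example, and for simplicity the baseline is scored on the same labels it was fit on rather than by cross-validation.

```python
from collections import Counter

def zero_r(train_labels):
    """Return the most frequent class label; the features are ignored entirely."""
    return Counter(train_labels).most_common(1)[0][0]

# Hypothetical, heavily imbalanced labels (not the leukemia class counts).
labels = ["ALL"] * 90 + ["AML"] * 10

majority = zero_r(labels)
predictions = [majority] * len(labels)

accuracy = sum(a == p for a, p in zip(labels, predictions)) / len(labels)

# With the majority class as positive, every minority example is a false positive.
n_negative = sum(a != majority for a in labels)
false_positives = sum(a != majority and p == majority
                      for a, p in zip(labels, predictions))
fp_rate = false_positives / n_negative

print(majority)   # ALL
print(accuracy)   # 0.9
print(fp_rate)    # 1.0
```

The accuracy looks high even though no minority-class example is ever identified, which is why accuracy alone is a poor summary on imbalanced data.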
6
Extended HW1: Try naïve Bayes in Weka for the leukemia data set
Under the “bayes” Classifier folder, choose “NaiveBayes” and run. What is the % of correctly classified instances? What are the TP and FP rates for ALL and AML?
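To give a feel for what a naive Bayes classifier computes on numeric attributes, here is a minimal Gaussian naive Bayes sketch in Python. It assumes each gene's expression is normally distributed within each class and that genes are independent given the class. It is not Weka's NaiveBayes implementation, and the function names and toy values are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a class prior and per-attribute (mean, variance) for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, var + 1e-9))  # small floor avoids division by zero
        model[label] = (prior, stats)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior (up to a shared constant)."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

# Hypothetical expression values for two genes (not real data).
rows = [(1.0, 5.0), (1.2, 4.8), (0.8, 5.1), (3.0, 2.0), (3.1, 2.2), (2.9, 1.9)]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, (1.1, 5.0)))
print(predict_gaussian_nb(model, (3.0, 2.1)))
```

Working in log space avoids numerical underflow when many attributes are multiplied together, which matters with 150 genes.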