Project 2, COSC 6335 (2013), Christoph F. Eick

Arko's Agreement Code

agreement <- function(x, y) {
  max <- NROW(x$cluster)
  count <- 0
  total <- max * (max + 1) / 2
  for (i in 1:max) {
    for (j in i:max) {
      if (j != i) {
        # the pair (i, j) agrees if both clusterings put the two points in
        # the same cluster, or both put them in different clusters
        if ((x$cluster[j] == x$cluster[i] & y$cluster[j] == y$cluster[i]) |
            (x$cluster[j] != x$cluster[i] & y$cluster[j] != y$cluster[i]))
          count <- count + 1
      } else {
        # for i == j, the clusterings agree if both treat point i as an
        # outlier (cluster 0) or both assign it to a proper cluster
        if ((x$cluster[i] == 0 & y$cluster[i] == 0) |
            (x$cluster[i] > 0 & y$cluster[i] > 0))
          count <- count + 1
      }
    }
  }
  returnValue <- count / total
  return(returnValue)
}
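The same pairwise-agreement measure can be sketched in Python (an illustrative re-implementation for readers who do not use R, not part of the submission): two clusterings agree on a pair (i, j) if they either put both points in the same cluster or both in different clusters; for the diagonal "pairs" (i, i) they agree if both treat point i the same way with respect to being an outlier (cluster 0).

```python
def agreement(x, y):
    """Fraction of agreeing pairs (i <= j) between cluster labelings x and y."""
    n = len(x)
    count = 0
    total = n * (n + 1) // 2              # all pairs i <= j, including i == j
    for i in range(n):
        for j in range(i, n):
            if i != j:
                same_x = x[i] == x[j]
                same_y = y[i] == y[j]
                if same_x == same_y:      # both same, or both different
                    count += 1
            else:
                # both clusterings agree on the outlier status of point i
                if (x[i] == 0) == (y[i] == 0):
                    count += 1
    return count / total

# cluster labels differ but the grouping structure is identical -> score 1.0
print(agreement([1, 1, 2, 0], [2, 2, 1, 0]))  # 1.0
```

Note that the measure is invariant to a renaming of the cluster labels, which is exactly what is needed when comparing, say, a k-means result against a DBSCAN result.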

K-means for Complex8

In general, the turquoise and the pink clusters are bad, whereas the brown and green clusters are okay.

Arko's Code for the Purity Function (except what is in red)

purity <- function(a, b, outliers = FALSE) {
  require('matrixStats')          # not strictly needed; rowSums is base R
  t <- table(a, b)
  rowTotals <- rowSums(t)         # the same can be done with apply(t, 1, sum)
  rowMax <- apply(t, 1, max)
  if (!outliers) {
    purity <- sum(rowMax) / sum(rowTotals)
    return(purity)
  } else {
    # row 1 of the table holds the outlier cluster (cluster 0);
    # exclude it from the purity computation
    if (NROW(rowTotals) > 1) {
      purity <- (sum(rowMax) - rowMax[1]) / (sum(rowTotals) - rowTotals[1])
    } else {
      purity <- NA
    }
    pcOutliers <- rowTotals[1] / sum(rowTotals)
    returnVector <- vector(mode = 'double', length = 2)
    returnVector[1] <- purity
    returnVector[2] <- pcOutliers
    return(returnVector)
  }
}
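For comparison, here is an illustrative Python sketch of the same purity computation (again a re-implementation, not the submitted code): `a` holds cluster labels, `b` class labels; with `outliers=True`, cluster 0 is treated as the outlier cluster, excluded from the purity computation, and the fraction of outliers is returned alongside the purity.

```python
from collections import Counter

def purity(a, b, outliers=False):
    """Purity of clustering a w.r.t. class labels b; optionally skip cluster 0."""
    clusters = {}
    for cl, label in zip(a, b):
        clusters.setdefault(cl, Counter())[label] += 1
    if not outliers:
        # classic purity: majority class count per cluster, over all points
        return sum(max(c.values()) for c in clusters.values()) / len(a)
    non_outlier = {cl: c for cl, c in clusters.items() if cl != 0}
    n = sum(sum(c.values()) for c in non_outlier.values())
    pur = (sum(max(c.values()) for c in non_outlier.values()) / n
           if n > 0 else float('nan'))
    pc_outliers = sum(clusters.get(0, Counter()).values()) / len(a)
    return pur, pc_outliers

print(purity([1, 1, 2, 2, 0], ['x', 'y', 'y', 'y', 'x']))                 # 0.8
print(purity([1, 1, 2, 2, 0], ['x', 'y', 'y', 'y', 'x'], outliers=True))  # (0.75, 0.2)
```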

Task 4: Characterizing the 5 Clusters

Cluster   Characteristic
1         a > 0.65 AND b > 0.6
2         d >
3         f >
4         no interesting observation
5         a < 0.44 AND b < 0.45 (lower accuracy)

Remark: Since k-means is randomized, almost every group should have obtained different clusters and summaries.
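How well a characterization rule such as "a > 0.65 AND b > 0.6" separates a cluster can be quantified by its precision and recall with respect to the cluster's members. The following is a hypothetical sketch (the helper name, attribute names, and toy data are assumptions made for illustration, following the rule for cluster 1 above):

```python
def rule_quality(points, labels, rule, cluster):
    """Precision and recall of a boolean rule as a description of one cluster."""
    covered    = [rule(p) for p in points]
    in_cluster = [lab == cluster for lab in labels]
    tp = sum(c and m for c, m in zip(covered, in_cluster))
    precision = tp / sum(covered)      # fraction of covered points in the cluster
    recall    = tp / sum(in_cluster)   # fraction of the cluster that is covered
    return precision, recall

# toy data: two points in cluster 1, one point in cluster 5
pts  = [{'a': 0.7, 'b': 0.7}, {'a': 0.8, 'b': 0.9}, {'a': 0.3, 'b': 0.2}]
labs = [1, 1, 5]
rule = lambda p: p['a'] > 0.65 and p['b'] > 0.6
print(rule_quality(pts, labs, rule, 1))  # (1.0, 1.0) on this toy data
```

A rule with high precision but low recall only describes part of the cluster; the "lower accuracy" remark for cluster 5 above corresponds to lower precision in this terminology.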

Project 2 Observations

- Assuming purity is used as the evaluation measure, DBSCAN outperformed k-means quite significantly, as k-means was not able to detect the natural clusters; on the other hand, for the Yeast dataset k-means obtained better results than DBSCAN. In general, DBSCAN seems to either create one very big cluster or obtain a clustering with a lot of outliers, and it seemed to be very difficult (or even impossible) to obtain solutions that lie between these two extremes.
- A lot of students failed to observe that k-means fails to identify the natural clusters in the Complex8 dataset.
- For the purity function, some code ignored the assumption that outliers are in cluster zero and obtained incorrect results; e.g., considering the objects in cluster 0 in purity computations of DBSCAN results, or excluding cluster 1 when computing purity for k-means clusterings.
- For Task 4 the main goal was to characterize the objects in clusters 1-5; a lot of students did not put enough focus on this task; e.g., they provided a general analysis of boxplots rather than analyzing the boxplots with respect to separating the 5 clusters and with respect to differences between the distribution in a particular cluster and the distribution in the dataset as a whole.
- About 35% of the students provided quite sophisticated search procedures to find good DBSCAN parameter settings; unfortunately, I had a very hard time understanding most of the chosen approaches due to a lack of explanation and of examples illustrating the approach.
- There were quite dramatic differences with respect to the amount of work and the quality of the approaches/solutions obtained for Tasks 4 and 6. Overall, some really good work was done by some students for Tasks 4 and/or 6 (score = 9 or higher).
Challenges for Task 6 include:
- finding an acceptable range of parameter values so that DBSCAN creates at least "okay" results;
- deciding how to search for good solutions within that range.
Another observation: if we maximize purity, using a large number of clusters might be beneficial for obtaining better results; however, how to embed this knowledge into the search procedure is a challenge.
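A minimal sketch of the kind of search procedure Task 6 asks for is an exhaustive scan over a small grid of (Eps, MinPts) combinations, keeping the best-scoring one. Here `run_dbscan_and_score` is a hypothetical stand-in (an assumption, not part of the project code) for running DBSCAN with the given parameters and returning the purity of the resulting clustering:

```python
import itertools

def grid_search(eps_values, minpts_values, run_dbscan_and_score):
    """Return (best_score, best_eps, best_minpts) over the parameter grid."""
    best = None
    for eps, minpts in itertools.product(eps_values, minpts_values):
        score = run_dbscan_and_score(eps, minpts)
        if best is None or score > best[0]:
            best = (score, eps, minpts)
    return best

# toy scoring stand-in: pretend purity peaks at Eps = 12.8, MinPts = 3
toy = lambda eps, minpts: 1.0 - abs(eps - 12.8) / 100 - abs(minpts - 3) / 100
print(grid_search([6.4, 12.8, 25.6], [3, 5, 10], toy))
```

A refinement that addresses the purity/cluster-count issue noted above would be to penalize the score by the number of clusters, so the search does not simply favor clusterings with very many tiny clusters.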

Optimal DBSCAN Clustering for Complex8

For the complex8 dataset, the best results are as follows:
- Purity = 1
- Outliers = %
- Number of Clusters = 19 (20, if we include cluster 0 as outliers)
- Eps = 12.8
- MinPts = 3

Remark: 3 students found clusterings with 100% purity (one extra point for that; the results still need to be verified).

"Optimal" Complex8 DBSCAN Clustering