Ch. Eick Project 2 COSC 6335 2013 Christoph F. Eick.


1 Ch. Eick Project 2 COSC 6335 2013 Christoph F. Eick

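2 Arko’s Agreement Code

    # x and y are clustering results with a $cluster vector of labels
    agreement = function(x,y) {
      n<-NROW(x$cluster);
      count<-0;
      total<-n*(n+1)/2;   # all pairs i<=j, including each point paired with itself
      for(i in 1:n) {
        for(j in i:n) {
          if(j!=i) {
            # the pair agrees if both clusterings put i and j together, or both separate them
            if((x$cluster[j]==x$cluster[i] && y$cluster[j]==y$cluster[i]) ||
               (x$cluster[j]!=x$cluster[i] && y$cluster[j]!=y$cluster[i]))
              count<-count+1;
          } else {
            # i == j: as written, this tests x against itself, so it counts every point with a label >= 0
            if((x$cluster[i]==0 && x$cluster[j]==0) || (x$cluster[i]>0 && x$cluster[j]>0))
              count<-count+1;
          }
        }
      }
      returnValue<-count/total;
      return(returnValue);
    }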
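The pair counting in the agreement function can also be written without explicit loops. The following is a minimal sketch, not the code from the slide: `agreement_vec` is a hypothetical name, and it takes two plain label vectors instead of objects with a `$cluster` field. It is meant to compute the same quantity, including the self-pairs that the loop version counts for every label >= 0.

```r
# Vectorized sketch of the pairwise agreement measure (hypothetical helper).
# cx, cy: integer cluster labels for the same objects, 0 = outlier.
agreement_vec <- function(cx, cy) {
  n <- length(cx)
  same_x <- outer(cx, cx, "==")   # TRUE where clustering x puts i and j together
  same_y <- outer(cy, cy, "==")   # TRUE where clustering y puts i and j together
  agree  <- same_x == same_y      # both together, or both apart
  pairs  <- sum(agree[upper.tri(agree)])  # pairs with i < j only
  self   <- sum(cx >= 0)          # the loop version counts each self-pair with label >= 0
  (pairs + self) / (n * (n + 1) / 2)
}

print(agreement_vec(c(1, 1, 2, 2), c(1, 1, 1, 2)))
```

For a k-means or DBSCAN result object, the label vectors would be passed as `agreement_vec(x$cluster, y$cluster)`. The `outer` calls build n-by-n matrices, so this sketch trades memory for speed and is only suitable for datasets of moderate size.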

3 K-means for Complex8

In general, the turquoise and the pink clusters are bad, whereas the brown and green clusters are okay.
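The effect can be reproduced on a small synthetic example. This is a sketch under stated assumptions: the two concentric rings below are made-up data, not Complex8, and they only illustrate that k-means separates clusters with straight boundaries, so it tends to cut through natural clusters that are not compact blobs.

```r
# Two concentric rings as a stand-in for non-convex natural clusters
# (synthetic data, NOT the Complex8 dataset).
angles <- seq(0, 2 * pi, length.out = 101)[-101]   # 100 evenly spaced angles
ring   <- function(r) cbind(x = r * cos(angles), y = r * sin(angles))
pts    <- rbind(ring(1), ring(5))
truth  <- rep(c("inner", "outer"), each = 100)

set.seed(1)                                  # k-means depends on random starts
km <- kmeans(pts, centers = 2, nstart = 5)

# Contingency table of k-means clusters against the true rings, and its purity
tab       <- table(km$cluster, truth)
km_purity <- sum(apply(tab, 1, max)) / sum(tab)
print(tab)
print(km_purity)
```

With two centers, the boundary between the clusters is a straight line, so no k-means solution can separate the inner ring from the outer one in this example.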

4 Arko’s Code for the Purity Function (except what is in red)

    # a: cluster labels (0 = outlier), b: class labels
    purity<-function(a,b,outliers=FALSE) {
      require('matrixStats');  # not needed by the code shown here; rowSums is base R
      t<-table(a,b);
      rowTotals<-rowSums(t); #the same can be with apply(t,1,sum)
      rowMax<-apply(t,1,max);
      if(!outliers) {
        purity<-sum(rowMax)/sum(rowTotals);
        return (purity)
      } else {
        # row 1 of the table is assumed to be cluster 0, i.e. the outliers
        if(NROW(rowTotals)>1) {
          purity<-(sum(rowMax)-rowMax[1])/(sum(rowTotals)-rowTotals[1]);
        } else {
          purity<-NA;
        }
        pcOutliers=rowTotals[1]/(sum(rowTotals));
        returnVector=vector(mode='double',length=2);
        returnVector[1]=purity;
        returnVector[2]=pcOutliers;
        return(returnVector);
      }
    }
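The purity function with `outliers=TRUE` treats the first row of the table as cluster 0. If a clustering has no cluster 0 (for example a k-means result), that first row is cluster 1 and would be dropped by mistake. The following sketch is one possible way to guard against this, not the code from the slide: `purity_by_name` is a hypothetical helper that looks up the outlier row by its label "0".

```r
# Purity that excludes outliers only if a cluster labelled 0 actually exists.
# clusters: cluster labels (0 = outlier); classes: class labels.
purity_by_name <- function(clusters, classes) {
  tab        <- table(clusters, classes)
  is_outlier <- rownames(tab) == "0"
  kept       <- tab[!is_outlier, , drop = FALSE]   # real clusters only
  purity <- if (nrow(kept) > 0) sum(apply(kept, 1, max)) / sum(kept) else NA_real_
  c(purity = purity, outliers = sum(tab[is_outlier, ]) / sum(tab))
}

# DBSCAN-style labels: one outlier, two clusters
dbscan_like <- purity_by_name(c(0, 1, 1, 1, 2, 2), c("a", "a", "a", "b", "b", "b"))
# k-means-style labels: no cluster 0, so nothing is excluded
kmeans_like <- purity_by_name(c(1, 1, 2, 2), c("a", "a", "b", "a"))
print(dbscan_like)
print(kmeans_like)
```

In the first call the outlier is left out and 4 of the remaining 5 objects belong to the majority class of their cluster. In the second call all 4 objects are used and 3 of them are in the majority class.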

5 Task 4: Characterizing the 5 Clusters

Cluster number | Characteristic
1              | a>0.65 AND b>0.6
2              | d>0.35
3              | f>0.38
4              | no interesting observation
5              | a<0.44 AND b<0.45 (lower accuracy)

Cluster number | Properties (rows 1 to 5, blank in the transcript)

Remark: As we use k-means, almost everybody should have different clusters and summaries
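One way to judge a characterization such as "a>0.65 AND b>0.6" is to measure how well the rule separates the cluster from the rest of the dataset. The following is a minimal sketch with made-up data: `rule_quality` is a hypothetical helper, and the six rows below are illustrative values, not results from the project.

```r
# How well does a rule describe a cluster?
# rule: logical vector, TRUE where the rule holds
# in_cluster: logical vector, TRUE where the object belongs to the cluster
rule_quality <- function(rule, in_cluster) {
  hits <- sum(rule & in_cluster)
  c(precision = hits / sum(rule),        # share of rule matches inside the cluster
    recall    = hits / sum(in_cluster))  # share of the cluster covered by the rule
}

# Illustrative data only: two attributes and a cluster label
df <- data.frame(a = c(0.70, 0.80, 0.66, 0.30, 0.90, 0.20),
                 b = c(0.70, 0.65, 0.50, 0.40, 0.61, 0.10),
                 cluster = c(1, 1, 1, 2, 2, 2))

q <- rule_quality(df$a > 0.65 & df$b > 0.6, df$cluster == 1)
print(q)
```

A rule with high precision and high recall separates the cluster from the other clusters. A remark such as "lower accuracy" would correspond to lower values of these two numbers.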

6 Project2 Observations

- Assuming purity is used as the evaluation measure, DBSCAN outperformed K-means quite significantly on the Complex8 dataset, as K-means was not able to detect the natural clusters; on the other hand, for the Yeast dataset K-means obtained better results than DBSCAN. In general, DBSCAN seems to create either one very big cluster or a clustering with a lot of outliers, and it seemed to be very difficult (or even impossible) to obtain solutions that lie between these extremes.
- A lot of students failed to observe that K-means fails to identify the natural clusters in the Complex8 dataset.
- For the purity function, some code ignored the assumption that outliers are in cluster 0 and obtained incorrect results, e.g. by including the objects in cluster 0 in purity computations for DBSCAN results, or by excluding cluster 1 when computing purity for K-means clusterings.
- For Task 4 the main goal was to characterize the objects in clusters 1-5; a lot of students did not put enough focus on this task, e.g. they provided a general analysis of box plots rather than analyzing the box plots with respect to separating the 5 clusters and with respect to differences between the distribution in a particular cluster and the distribution in the dataset.
- About 35% of the students provided quite sophisticated search procedures to find good DBSCAN parameter settings; unfortunately, I had a very hard time understanding most of the chosen approaches, due to a lack of explanation and of examples that illustrate the approach.
- There were quite dramatic differences with respect to the amount of work and the quality of the approaches/solutions obtained for Tasks 4 and 6.
- Overall, some really good work was done by some students for Tasks 4 and/or 6 (score=9 or higher).

Challenges for Task 6 include:
- Finding an acceptable range of parameter values so that DBSCAN creates at least “okay” results
- How to search for good solutions in that range

Another observation: if we maximize purity, using a large number of clusters might be beneficial for obtaining better results; however, how to embed this knowledge into the search procedure is a challenge…
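The search for DBSCAN parameter settings in Task 6 can be sketched as a plain grid search that scores every (eps, MinPts) pair by purity and outlier percentage. This is only an illustration under stated assumptions, not any student's procedure: `grid_search`, `score_clustering` and `single_link_stub` are hypothetical names, and the stub (single-linkage clusters cut at height eps, with small clusters relabelled as outliers) merely stands in for a real DBSCAN call so that the example runs with base R. With the fpc package installed, a function returning `fpc::dbscan(data, eps = eps, MinPts = minpts)$cluster` could be passed instead.

```r
# Purity, outlier share and cluster count; cluster 0 is treated as outliers.
score_clustering <- function(clusters, classes) {
  tab  <- table(clusters, classes)
  out  <- rownames(tab) == "0"
  kept <- tab[!out, , drop = FALSE]
  c(purity   = if (nrow(kept) > 0) sum(apply(kept, 1, max)) / sum(kept) else NA_real_,
    outliers = sum(tab[out, ]) / sum(tab),
    k        = nrow(kept))
}

# Try every (eps, minpts) combination; best purity first, fewer outliers break ties.
grid_search <- function(data, classes, eps_values, minpts_values, cluster_fn) {
  grid   <- expand.grid(eps = eps_values, minpts = minpts_values)
  scores <- t(mapply(function(eps, minpts) {
    score_clustering(cluster_fn(data, eps, minpts), classes)
  }, grid$eps, grid$minpts))
  res <- cbind(grid, scores)
  res[order(-res$purity, res$outliers), ]
}

# Stand-in for DBSCAN (an assumption for this demo, NOT the DBSCAN algorithm).
single_link_stub <- function(data, eps, minpts) {
  labels <- cutree(hclust(dist(data), method = "single"), h = eps)
  sizes  <- table(labels)
  small  <- as.vector(sizes[as.character(labels)]) < minpts
  labels[small] <- 0          # clusters below minpts become outliers
  labels
}

# Toy 1-D data: two tight groups and one isolated point
x       <- c(0, 0.1, 0.2, 5, 5.1, 5.2, 20)
classes <- c("a", "a", "a", "b", "b", "b", "a")
best    <- grid_search(x, classes, eps_values = c(0.5, 10),
                       minpts_values = 2, cluster_fn = single_link_stub)
print(best)
```

Sorting by purity alone rewards solutions with many small clusters, as noted above, so a real search would probably also constrain the number of clusters or the outlier percentage. The `k` and `outliers` columns are kept in the result for that purpose.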

7 Optimal DBSCAN Clustering for Complex8

For the Complex8 dataset, the best results are as follows:
Purity = 1
Outliers = 0.4704038%
Number of Clusters = 19 (20, if we include cluster 0 as outliers)
Eps = 12.8
MinPts = 3

Remark: 3 students found clusterings with 100% purity (one extra point for that; results still need to be verified).

Rows: DBSCAN cluster (0 = outliers); columns: class.

Cluster     0     1     2     3     4     5     6     7
0           0     1     1     1     5     1     3     0
1          60     0     0     0     0     0     0     0
2           0    57     0     0     0     0     0     0
3           0     0   518     0     0     0     0     0
4           0     0     0   482     0     0     0     0
5           0     0     0     0    57     0     0     0
6           0     0     0     0     5     0     0     0
7           0     0     0     0     3     0     0     0
8           0     0     0     0    12     0     0     0
9           0     0     0     0    14     0     0     0
10          0     0     0     0     7     0     0     0
11          0     0     0     0    10     0     0     0
12          0     0     0     0     8     0     0     0
13          0     0     0     0     0    10     0     0
14          0     0     0     0     0   113     0     0
15          0     0     0     0     0   245     0     0
16          0     0     0     0     0     0    66     0
17          0     0     0     0     0     0   407     0
18          0     0     0     0     0     0     8     0
19          0     0     0     0     0     0     0   403
20          0     0     0     0     0     0     0    54
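The reported purity and outlier percentage can be checked against the contingency table. The sketch below re-enters the table by hand and relies on one assumption: the outlier row (cluster 0) is read as the eight entries 0 1 1 1 5 1 3 0, which is the reading under which the row has one value per class and the outlier share matches the reported 0.4704038%.

```r
# Contingency table: rows = DBSCAN clusters 0..20 (0 = outliers), columns = classes 0..7
m <- matrix(0, nrow = 21, ncol = 8,
            dimnames = list(cluster = as.character(0:20), class = as.character(0:7)))

# Outlier row, read as eight entries (assumption, see the text above)
m["0", ] <- c(0, 1, 1, 1, 5, 1, 3, 0)

# Clusters 1..20 each contain objects of exactly one class
class_of <- c(0, 1, 2, 3, rep(4, 8), rep(5, 3), rep(6, 3), rep(7, 2))
counts   <- c(60, 57, 518, 482,
              57, 5, 3, 12, 14, 7, 10, 8,
              10, 113, 245,
              66, 407, 8,
              403, 54)
m[cbind(2:21, class_of + 1)] <- counts

clusters_only <- m[-1, ]                                   # drop the outlier row
table_purity  <- sum(apply(clusters_only, 1, max)) / sum(clusters_only)
outlier_pct   <- 100 * sum(m["0", ]) / sum(m)

print(table_purity)
print(outlier_pct)
```

Every non-outlier row has a single non-zero entry, which is what a purity of 1 means: no cluster mixes objects from two classes.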

8 “Optimal” Complex8 DBSCAN Clustering

