Evaluating Software Clustering Algorithms

The problem
How do we know that a particular decomposition of a software system is good? What does 'good' mean anyway?
We can compare against a mental model or a benchmark standard.
The comparison can be done either manually or automatically.

Manual evaluation
Have experts evaluate automatic decompositions.
Very time-consuming and impractical; also quite subjective.
We need an automatic, objective way of doing it.

Automatic evaluation
Usually measures the similarity of an automatic decomposition A to an authoritative decomposition M (prepared manually).
Major drawback: assumes there exists one 'correct' decomposition.
Other evaluation approaches are possible, such as measuring the stability of a software clustering algorithm (SCA).

Precision/Recall
Standard Information Retrieval measures, applied in a software clustering context by Anquetil.
Definitions:
Intra pair: a pair of software entities in the same cluster
Inter pair: a pair of entities in different clusters

Precision/Recall
Precision: percentage of intra pairs in A that are also intra pairs in M
Recall: percentage of intra pairs in M that are also found in A
A good algorithm should exhibit high values for both measures.

Precision/Recall example
Two decompositions A and B of the same ten entities (figure). For instance, the pair 1-5 is an intra pair in A but not in B.
Precision: 16/20 = 80%; Recall: 16/21 = 76.2%
A sketch of the computation follows.
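
A minimal sketch of pairwise precision/recall, assuming decompositions are given as lists of disjoint sets of entity IDs. The clusters in the slide's figure are not recoverable, so the decompositions below are hypothetical:

```python
from itertools import combinations

def intra_pairs(decomposition):
    """All unordered pairs of entities placed in the same cluster."""
    pairs = set()
    for cluster in decomposition:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs

def precision_recall(A, M):
    """Pairwise precision and recall of decomposition A against reference M."""
    pa, pm = intra_pairs(A), intra_pairs(M)
    common = pa & pm
    return len(common) / len(pa), len(common) / len(pm)

# Hypothetical decompositions (not the ones in the slide's figure):
A = [{1, 2, 3}, {4, 5}, {6, 7, 8}]
M = [{1, 2}, {3, 4, 5}, {6, 7, 8}]
print(precision_recall(A, M))  # (0.714..., 0.714...): 5 of 7 intra pairs shared
```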

Problems with P/R
Sensitive to the size and number of clusters: differences are exaggerated when clusters are small or numerous.
Having two values makes comparison harder.
The two values are interchangeable if there is no 'reference' decomposition.

Koschke–Eisenbarth measure
Loosely based on Precision/Recall; attempts to be less strict.
Definitions:
GOOD match: two clusters (one in A, one in M) with both precision and recall larger than a threshold p (typical value 70%)
OK match: two clusters with only one of the two measures larger than p
A per-pair sketch follows the example below.

Koschke–Eisenbarth metric example (figure): matches between clusters A2-A5 and B1-B5, labelled Good and OK.
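
A minimal sketch of classifying one cluster pair under the definitions above, assuming clusters are sets of entity IDs and taking per-pair precision and recall as the overlap divided by each cluster's size. The example clusters are hypothetical:

```python
def match_quality(c1, c2, p=0.7):
    """Classify the match between cluster c1 (from A) and c2 (from M):
    per-pair precision = |overlap| / |c1|, recall = |overlap| / |c2|."""
    overlap = len(c1 & c2)
    precision, recall = overlap / len(c1), overlap / len(c2)
    if precision >= p and recall >= p:
        return 'GOOD'
    if precision >= p or recall >= p:
        return 'OK'
    return None

# Hypothetical clusters:
print(match_quality({1, 2, 3, 4}, {1, 2, 3, 5}))  # 'GOOD' (both measures 0.75)
```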

Koschke–Eisenbarth metric
Does not take edges into account.
In extreme situations (e.g. when each cluster contains only one element) it may produce strange results, such as a similarity of 100%.
No penalty for joining clusters.

MoJo distance
The distance between two partitions of the same software system is defined as the minimum number of Move and Join operations needed to transform one into the other.
Move: remove an object from a cluster and put it in a different cluster
Join: merge two clusters into one
Split: has to be simulated by Move operations

MoJo example (figure): two decompositions A and B of the same eight entities.

Why only Move and Join?
Two clusters can be joined in only one way, while one cluster can be split in two in an exponential number of ways.
Joining two clusters only means that we performed a more detailed clustering than required.
Splitting is effectively assigned a weight equal to the cardinality of the smaller of the two resulting clusters.

Computing MoJo distance
Give each object in partition A the tag j if it belongs to cluster Mj in partition M.
Assign each cluster in A to group Gk if it contains more objects with tag k than with any other tag. To resolve ties, use maximum bipartite matching to maximize the number of non-empty groups.
Join the clusters within each group; then move every object with tag k to the cluster of group Gk.

MoJo Distance Calculation
MoJo distance is M + J, where M is the number of Moves and J is the number of Joins.
The number of moves for cluster Ai is |Ai| - max_k |Ai ∩ Mk| (its size minus the count of its most frequent tag), so the total moves are M = Σ_i (|Ai| - max_k |Ai ∩ Mk|).
Assume the number of non-empty groups is g, from G1 to Gg. The joins for each group Gk number |Gk| - 1, so the number of Joins is J = Σ_{k=1..g} (|Gk| - 1).
A sketch of this computation follows.
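
A minimal sketch of the procedure above, assuming partitions are lists of sets. For simplicity it breaks ties greedily rather than with the maximum bipartite matching the exact algorithm requires, so on tie-heavy inputs it may overestimate the distance:

```python
from collections import Counter

def mojo_distance(A, M):
    """MoJo(A, M) following the slide's tag/group procedure."""
    # Tag each object with the index of its cluster in M.
    tag = {obj: j for j, cluster in enumerate(M) for obj in cluster}
    moves = 0
    groups = {}
    for cluster in A:
        counts = Counter(tag[obj] for obj in cluster)
        k, majority = counts.most_common(1)[0]  # ties broken greedily, not
                                                # by maximum bipartite matching
        moves += len(cluster) - majority        # objects that must be moved out
        groups.setdefault(k, []).append(cluster)
    # Joining the clusters inside each non-empty group Gk costs |Gk| - 1 joins.
    joins = sum(len(members) - 1 for members in groups.values())
    return moves + joins

# Hypothetical partitions:
A = [{1, 2, 3, 4}, {5, 6}, {7, 8}]
M = [{1, 2, 3}, {4, 5, 6}, {7, 8}]
print(mojo_distance(A, M))  # 1: moving object 4 turns A into M
```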

Outline of Proof
Move and Join operations are commutative.
The algorithm uses the minimum number of Moves.
Any algorithm that also uses the minimum number of Moves cannot beat it.
For any algorithm that does not use the minimum number of Moves, there exists another that is at least as good and does use the minimum number of Moves.

MoJoFM Effectiveness Metric
MoJo distance can measure the distance between two arbitrary partitions; MoJoFM instead assumes the existence of a 'correct' authoritative partition M and normalizes against it:
MoJoFM(A, M) = (1 - MoJo(A, M) / max over all partitions B of MoJo(B, M)) × 100%

MeCl and EdgeSim
The measures above do not consider edges, yet edges might convey important information.
EdgeSim: rewards clustering algorithms for preserving edge types and penalizes them for changing edge types.
MeCl: rewards the clustering algorithm for creating cohesive "subclusters".

EdgeSim
Inter-edge: an edge between clusters; Intra-edge: an edge within a cluster.
Y: the set of edges that are of the same type (intra or inter) in both A and B.
EdgeSim(A, B) is then the proportion of all edges that fall in Y; a sketch follows the example below.

EdgeSim example (figure): here EdgeSim(A, B) = 52.6%.
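
A minimal sketch of the unweighted version, assuming decompositions are lists of sets and the dependency graph is a list of edges. The decompositions and edges below are hypothetical:

```python
def cluster_map(decomposition):
    """Map each object to the index of its cluster."""
    return {obj: i for i, cluster in enumerate(decomposition) for obj in cluster}

def edgesim(A, B, edges):
    """Unweighted EdgeSim: the fraction of edges whose type
    (intra-edge vs. inter-edge) is the same in decompositions A and B."""
    ca, cb = cluster_map(A), cluster_map(B)
    same = sum(1 for u, v in edges
               if (ca[u] == ca[v]) == (cb[u] == cb[v]))
    return same / len(edges)

# Hypothetical decompositions and dependency edges:
A = [{1, 2}, {3, 4}]
B = [{1, 2, 3}, {4}]
print(edgesim(A, B, [(1, 2), (2, 3), (3, 4)]))  # 1/3: only (1,2) keeps its type
```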

EdgeSim counterexample (figure).

EdgeMoJo philosophy
A similarity measure cannot make assumptions as to which cluster a particular object should belong to.
Dissimilarity between decompositions should increase if the misplacement of an object results in the misplacement of a large number of edges.

EdgeMoJo calculation
Apply MoJo and obtain a series of Move and Join operations.
Perform all Join operations first.
The cost of each Move operation then increases from 1 to 1 + |e1 - e2| / (e1 + e2), where e1 and e2 are the two edge counts affected by the move; a sketch follows the example below.

EdgeMoJo example (figure): moving object 5 costs m(5) = 1 + |4 - 1| / (4 + 1) = 1.6.
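
A minimal sketch of the Move-cost formula, reconstructed from the slide's example; the exact meaning of the two edge counts (the original figure is lost) and the zero-edge case are assumptions:

```python
def move_cost(e1, e2):
    """EdgeMoJo-style cost of one Move: a plain Move costs 1, plus a penalty
    that grows with the imbalance between the two edge counts e1 and e2
    affected by the move. Reconstructed from the slide's example."""
    if e1 + e2 == 0:
        return 1.0  # assumption: an object with no incident edges costs a plain move
    return 1 + abs(e1 - e2) / (e1 + e2)

print(move_cost(4, 1))  # 1.6, matching m(5) in the slide's example
```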

Real data experiments (results figure).

Synthetic data experiments (results figure).