
1. Software Clustering Based on Information Loss Minimization
Periklis Andritsos, University of Toronto
Vassilios Tzerpos, York University
The 10th Working Conference on Reverse Engineering (WCRE 2003)

2. The Software Clustering Problem
o Input:
  - A set of software artifacts (files, classes)
  - Structural information, i.e. interdependencies between the artifacts (invocations, inheritance)
  - Non-structural information (timestamps, ownership)
o Goal: Partition the artifacts into "meaningful" groups in order to help understand the software system at hand
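A minimal sketch of what this input might look like in practice (Python; the artifact and feature names are hypothetical, not from the slides): each artifact is described by the set of features it exhibits, combining dependencies with non-structural attributes.

```python
# Hypothetical clustering input: artifact -> set of features it exhibits.
# Structural features encode interdependencies; non-structural ones encode
# ownership, location, etc.
artifacts = {
    "f1": {"calls:u1", "calls:u2", "dev:Alice", "dir:parser"},
    "f2": {"calls:u1", "calls:u2", "dev:Alice", "dir:parser"},
    "f3": {"calls:u1", "calls:u2", "dev:Bob", "dir:codegen"},
    "u1": {"dev:Carol", "dir:lib"},
    "u2": {"dev:Carol", "dir:lib"},
}
# Goal: partition {f1, f2, f3, u1, u2} into "meaningful" subsystems.
```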

3. Example (diagram)
Program files and utility files: the utility files are used by the same program files and have almost the same dependencies.

4. Open Questions
o Validity of clusters discovered based on high-cohesion and low-coupling
  - No guarantee that legacy software was developed in such a way
o Discovering utility subsystems
  - Utility subsystems are low-cohesion / high-coupling
  - They commonly occur in manual decompositions
o Utilizing non-structural information
  - What types of information have value? LOC, timestamps, ownership, directory structure

5. Our Goals
o Create decompositions that convey as much information as possible about the artifacts they contain
o Discover utility subsystems as well as subsystems based on high-cohesion and low-coupling
o Evaluate the usefulness of any combination of structural and non-structural information

6. Information Theory Basics
o Entropy H(A): measures the uncertainty in a random variable A
o Conditional entropy H(B|A): measures the uncertainty of a variable B, given a value for variable A
o Mutual information I(A;B): measures the dependence of two random variables A and B
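These three quantities can be computed directly from a joint distribution p(A,B). A minimal Python sketch (not from the slides, with a made-up joint table) of how they relate:

```python
import math

def entropy(probs):
    """H(X) for a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(B|A) = H(A,B) - H(A), where joint[a][b] = p(a, b)."""
    p_a = [sum(row) for row in joint]
    p_ab = [p for row in joint for p in row]
    return entropy(p_ab) - entropy(p_a)

def mutual_information(joint):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    p_ab = [p for row in joint for p in row]
    return entropy(p_a) + entropy(p_b) - entropy(p_ab)

# Hypothetical joint distribution over 2 artifacts x 2 features
joint = [[0.25, 0.25],
         [0.00, 0.50]]
print(conditional_entropy(joint), mutual_information(joint))
```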

7. Information Bottleneck (IB) Method
o A: a random variable that ranges over the artifacts to be clustered
o B: a random variable that ranges over the artifacts' features
o I(A;B): mutual information of A and B
o Information Bottleneck Method [TPB '99]
  - Compress A into a clustering Ck so that the information preserved about B is maximal (k = number of clusters)
o Optimization criterion:
  - minimize I(A;B) - I(Ck;B)
  - equivalently, minimize H(B|Ck) - H(B|A)
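The equivalence of the two forms of the criterion follows from the identity I(X;Y) = H(Y) - H(Y|X); a short derivation (not on the slide, using the definitions above):

```latex
\begin{align*}
I(A;B) - I(C_k;B)
  &= \bigl(H(B) - H(B \mid A)\bigr) - \bigl(H(B) - H(B \mid C_k)\bigr) \\
  &= H(B \mid C_k) - H(B \mid A).
\end{align*}
```

Since H(B|A) is fixed by the data, minimizing the information loss amounts to minimizing H(B|Ck), i.e. maximizing the information I(Ck;B) that the clustering retains about the features.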

8. Information Bottleneck Method (diagram)
A: Artifacts (a1, a2, a3, ..., an); B: Features (b1, b2, b3, ..., bm); C: Clusters (c1, c2, c3, ..., ck)
Diagram labels: "Minimize loss of I(A;C)", "Maximize I(C;B)"

9. Agglomerative IB
o Conceptualize the dependency graph as an n x m matrix (artifacts by features), normalized into a joint distribution p(A,B). (The slide shows a small example over f1, f2, f3, u1, u2, where each artifact carries weight p = 1/5.)
o Compute an n x n matrix indicating the information loss we would incur if we joined any two artifacts into a cluster. (The slide shows the pairwise losses for the example, e.g. 0.10 and 0.17.)
o Merge the tuples with the minimum information loss.
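A rough Python sketch of this agglomerative loop (my own illustration with hypothetical numbers, following the standard AIB formulation in which the loss of merging two clusters is their combined weight times the Jensen-Shannon divergence of their feature distributions):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_loss(p_i, p_j, cond_i, cond_j):
    """Information loss of merging clusters i and j:
    (p_i + p_j) * JS divergence of their conditional feature distributions."""
    w_i, w_j = p_i / (p_i + p_j), p_j / (p_i + p_j)
    mix = [w_i * a + w_j * b for a, b in zip(cond_i, cond_j)]
    return (p_i + p_j) * (w_i * kl(cond_i, mix) + w_j * kl(cond_j, mix))

# Each cluster starts as one artifact: (weight p(c), feature distribution p(B|c)).
clusters = [
    (0.2, [0.0, 0.5, 0.5]),    # hypothetical artifact f1
    (0.2, [0.1, 0.45, 0.45]),  # hypothetical artifact f2
    (0.2, [0.9, 0.05, 0.05]),  # hypothetical artifact u1
]

k = 2  # desired number of clusters
while len(clusters) > k:
    # Scan all pairs (the n x n matrix) for the cheapest merge.
    i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
               key=lambda ij: merge_loss(clusters[ij[0]][0], clusters[ij[1]][0],
                                         clusters[ij[0]][1], clusters[ij[1]][1]))
    (p_i, c_i), (p_j, c_j) = clusters[i], clusters[j]
    merged = (p_i + p_j, [(p_i * a + p_j * b) / (p_i + p_j) for a, b in zip(c_i, c_j)])
    clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
```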

10. Adding Non-Structural Data
o If we have information about the developer and location of files, we express the artifacts to be clustered using a new matrix.
o Instead of B we use B', which extends the feature set with the non-structural data. (In the slide's example, columns Alice, Bob, p1, p2, p3 are added next to the structural columns f1 ... u2.)
o We can compute I(A;B') and proceed as before.
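A hedged sketch of how the extended feature space B' might be assembled (Python; feature names hypothetical): non-structural attributes are simply appended as extra columns before normalizing each row into p(B' | artifact).

```python
# Structural features: the dependency targets of each artifact.
structural = {
    "f1": {"u1", "u2"},
    "f2": {"u1", "u2"},
    "u1": set(),
}
# Non-structural features: developer and directory, encoded as extra columns.
non_structural = {
    "f1": {"dev:Alice", "dir:/parser"},
    "f2": {"dev:Alice", "dir:/parser"},
    "u1": {"dev:Bob", "dir:/lib"},
}

b_prime = sorted({f for feats in structural.values() for f in feats} |
                 {f for feats in non_structural.values() for f in feats})

def row(artifact):
    """One normalized row of the A x B' matrix, i.e. p(B' | artifact)."""
    feats = structural[artifact] | non_structural[artifact]
    return [1 / len(feats) if f in feats else 0.0 for f in b_prime]

matrix = {a: row(a) for a in structural}  # feed this into the AIB/LIMBO step
```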

11. ScaLable InforMation BOttleneck (LIMBO)
o AIB has quadratic complexity since we need to compute an n x n distance matrix.
o LIMBO algorithm:
  - Produce summaries of the artifacts
  - Apply agglomerative clustering on the summaries
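A deliberately simplified illustration of the two phases (the actual LIMBO algorithm maintains a bounded-size tree of cluster summaries; the single greedy pass below is only a stand-in for that summarization step):

```python
import math

def merge_loss(p_i, p_j, cond_i, cond_j):
    """Same weighted Jensen-Shannon loss as in the AIB sketch above."""
    w_i, w_j = p_i / (p_i + p_j), p_j / (p_i + p_j)
    mix = [w_i * a + w_j * b for a, b in zip(cond_i, cond_j)]
    kl = lambda p, q: sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)
    return (p_i + p_j) * (w_i * kl(cond_i, mix) + w_j * kl(cond_j, mix))

def summarize(artifacts, threshold):
    """Phase 1 (simplified): absorb each artifact into the first summary it is
    close enough to; otherwise start a new summary.
    artifacts: list of (weight, p(B|a)) tuples."""
    summaries = []
    for p_a, cond_a in artifacts:
        best, best_loss = None, threshold
        for idx, (p_s, cond_s) in enumerate(summaries):
            loss = merge_loss(p_s, p_a, cond_s, cond_a)
            if loss < best_loss:
                best, best_loss = idx, loss
        if best is None:
            summaries.append((p_a, cond_a))
        else:
            p_s, cond_s = summaries[best]
            total = p_s + p_a
            summaries[best] = (total, [(p_s * x + p_a * y) / total
                                       for x, y in zip(cond_s, cond_a)])
    return summaries

# Example: a larger threshold means coarser summaries and a cheaper phase 2,
# which then runs the quadratic agglomerative loop on the summaries only.
artifacts = [(0.25, [0.5, 0.5, 0.0]), (0.25, [0.45, 0.55, 0.0]),
             (0.25, [0.0, 0.1, 0.9]), (0.25, [0.05, 0.05, 0.9])]
print(len(summarize(artifacts, threshold=0.05)))  # -> 2 summaries
```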

12. Experimental Evaluation
o Data sets:
  - TOBEY: 939 files / 250,000 LOC
  - LINUX: 955 files / 750,000 LOC
o Clustering algorithms:
  - ACDC: pattern-based
  - BUNCH: adheres to high-cohesion and low-coupling (NAHC, SAHC)
  - Cluster analysis algorithms: single linkage (SL), complete linkage (CL), weighted average linkage (WA), unweighted average linkage (UA)

13. Experimental Evaluation (continued)
o Compared the output of the different algorithms using MoJo
  - MoJo measures the number of Move/Join operations needed to transform one clustering into another
  - The smaller the MoJo value of a particular clustering, the more effective the algorithm that produced it
o We compute MoJo with respect to an authoritative decomposition
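For intuition only, here is a brute-force Python sketch of the one-way Move/Join count (my own illustration; the published MoJo algorithms compute this far more efficiently, whereas this version tries every assignment of clusters to target groups and is feasible only for tiny examples):

```python
from itertools import product

def mojo_one_way(A, B):
    """Minimum number of Move + Join operations to turn partition A into
    partition B (both given as lists of sets over the same objects).
    Brute force: try every assignment of each A-cluster to a B-group."""
    best = None
    for assign in product(range(len(B)), repeat=len(A)):
        # Objects that are not in their cluster's target group must be moved.
        moves = sum(len(a - B[g]) for a, g in zip(A, assign))
        # Clusters assigned to the same group must be joined together.
        joins = sum(assign.count(g) - 1 for g in set(assign))
        cost = moves + joins
        best = cost if best is None else min(best, cost)
    return best

# Tiny hypothetical example: two moves suffice.
A = [{"f1", "f2", "u1"}, {"f3", "u2"}]
B = [{"f1", "f2", "f3"}, {"u1", "u2"}]
print(mojo_one_way(A, B))  # -> 2
```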

14. Structural Feature Results (MoJo vs. the authoritative decomposition)

            TOBEY    Linux
  LIMBO      311      237
  ACDC       320      342
  NAHC       382      249
  SAHC       482      353
  SL         688      402
  CL         361      304
  WA         351      309
  UA         354      316

LIMBO found the utility clusters.

15. Non-Structural Feature Results
o We considered all possible combinations of structural and non-structural features.
o Non-structural features were available only for Linux:
  - Developers (dev)
  - Directory (dir)
  - Lines of Code (loc)
  - Time of Last Update (time)
o For each combination we report the number of clusters k at which the MoJo values for k and k+1 clusters differ by one.

16. Non-Structural Feature Results (Linux)

  Features         Clusters   MoJo
  dev+dir             69       178
  dev+dir+time        37       189
  dir                 25       195
  dir+loc+time        78       201
  dir+time            18       208
  dir+loc             74       210
  dev+dir+loc         49       212
  dev                 71       229
  structural          56       237

o 8 combinations outperform the structural results.
o "dir" information produced better decompositions.
o "dev" information has a positive effect.
o "time" leads to worse clusterings.

