
1 Measuring Similarity of Large Software System Based on Source Code Correspondence
Tetsuo Yamamoto*, Makoto Matsushita**, Toshihiro Kamiya***, Katsuro Inoue**
*Ritsumeikan University, Japan
**Osaka University, Japan
***Japan Science and Technology Agency, Japan

2 Motivation
- Long-lived software systems evolve through multiple modifications, and many different versions are created and delivered.
- The evolution is not simple and straightforward.
- It is common for one original system to give rise to several distinct successor branches during evolution.
- Several distinct versions may later be unified and merged into another version.
- To manage the many versions correctly and efficiently, it is very important to know their relationships objectively.

3 Motivation (Cont.)
- We have been interested in measuring the similarity between two large software systems.
- This was motivated by scientific curiosity: what is the quantitative similarity of two software systems?
- We would like to quantify the similarity with a solid and objective measure.
- We have been interested in comparing all the files.
- It is important that the software similarity metric is not based on sampled information such as an attribute value (or fingerprint), but rather reflects the overall system characteristics.

4 Research Aim
- We measure the similarity between two large software systems.
- Propose a similarity metric, Sline.
  - Sline is defined as the ratio of shared source code lines to the total source code lines.
  - Sline requires computing matches between source code lines in the two systems, beyond the boundaries of files and directories.
- Develop a similarity metric evaluation tool, SMAT (Software similarity MeAsurement Tool).
  - We have evaluated the similarity between various versions of BSD UNIX.
  - We have performed cluster analysis of the similarity values to create a dendrogram that correctly shows the evolution history of BSD UNIX.

5 Definitions
- A software system P is composed of elements p1, p2, ..., pm, and P is represented as the set {p1, p2, ..., pm}.
- Another software system Q is denoted by {q1, q2, ..., qn}.
- We choose the type of elements, such as files or lines, based on the definition of each similarity metric.

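6 Definitions (Cont.)
- Suppose that we are able to determine a matching between pi and qj (1 <= i <= m, 1 <= j <= n). We call the set of matched pairs (pi, qj) the Correspondence Rs, where Rs ⊆ P × Q.
- Similarity S of P and Q with respect to Rs is defined as follows:

$$S(P, Q) = \frac{|\{p_i \mid (p_i, q_j) \in R_s\}| + |\{q_j \mid (p_i, q_j) \in R_s\}|}{|P| + |Q|}$$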
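The definition can be illustrated with a minimal Python sketch. It assumes that S counts the elements of P and of Q that take part in at least one matched pair, divided by the total number of elements in both systems, which is the "shared over total" reading used for Sline. The function name `similarity` and the toy element names are hypothetical, not part of SMAT.

```python
def similarity(p, q, rs):
    """Similarity of element sets p and q with respect to correspondence rs.

    rs is a set of matched pairs (pi, qj). An element is counted once,
    no matter how many pairs it appears in (assumed reading of S).
    """
    matched_p = {pi for pi, _ in rs}
    matched_q = {qj for _, qj in rs}
    total = len(p) + len(q)
    if total == 0:
        return 0.0  # two empty systems: treated as 0 here by convention
    return (len(matched_p) + len(matched_q)) / total


# Toy data (hypothetical): q1 matches both p1 and p2, p4 and q3 match nothing.
P = {"p1", "p2", "p3", "p4"}
Q = {"q1", "q2", "q3"}
Rs = {("p1", "q1"), ("p2", "q1"), ("p3", "q2")}
s = similarity(P, Q, Rs)  # 3 matched in P, 2 matched in Q, 7 elements in total
```

Under this reading S is symmetric in P and Q, is 0 when nothing matches, and is 1 when every element of both systems is matched.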

7 Similarity Metric
- We show a concrete operational similarity metric, Sline, using equivalent line matching.
- Each element of a software system is a single line of each source file composing the system.
- Two lines with minor differences, such as space/comment modifications and identifier renaming, are recognized as equivalent.
- Sline is not affected by file renaming or path changes.
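As a rough illustration of "equivalent" lines, the sketch below strips C-style comments and all whitespace so that lines differing only in spacing or comments compare equal. It is an assumption-laden stand-in, not SMAT's preprocessing: the name `normalize_lines` is hypothetical, comment markers inside string literals are not handled, and identifier renaming is not covered at all (in the approach described here that tolerance comes from the clone detector).

```python
import re

# Block comments (possibly multi-line) and line comments. Comment markers
# inside string literals are NOT recognised; this is a deliberate simplification.
_COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def normalize_lines(source):
    """Return the non-empty lines of C-like source, without comments or whitespace."""
    # Keep the newlines of a removed block comment so code before and after
    # a multi-line comment is not joined into one line.
    text = _COMMENT.sub(lambda m: "\n" * m.group(0).count("\n"), source)
    lines = []
    for raw in text.splitlines():
        squeezed = "".join(raw.split())  # drop every whitespace character
        if squeezed:                     # skip lines that became empty
            lines.append(squeezed)
    return lines


a = normalize_lines("int  x = 0; /* counter */\n\n")
b = normalize_lines("int x=0;\n// note\n")
```

Here `a` and `b` are the same one-element list, so the two lines would be treated as equivalent.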

8 Measuring Sline
- A key problem of Sline is the computation of the correspondence Rs.
- We propose an approach that effectively uses both diff and a clone detection tool named CCFinder [1].
  - CCFinder is a tool used to detect duplicated code blocks (called clones).
  - diff is a tool used to detect the longest common subsequence (LCS) between two files.
- diff is applied to every pair of files xi and yj for which CCFinder detects a clone pair (bx, by) such that bx is in xi and by is in yj.

[1] T. Kamiya, S. Kusumoto, and K. Inoue, "CCFinder: A multi-linguistic token-based code clone detection system for large scale source code", IEEE Transactions on Software Engineering, 28(7):654-670, 2002.

9 Similarity Measuring Process
1. All comments, white spaces, and empty lines are removed.
2. CCFinder detects clones between the two systems. CCFinder has an option for the minimum number of tokens of clones to be detected, whose default is 20.
3. SMAT executes diff on every file pair xi and yj, in X and Y respectively, for which at least one clone is detected between xi and yj.
4. The lines appearing in the clones detected in Step 2 and in the common subsequences found in Step 3 are merged.
5. Sline is calculated as the ratio of lines in the correspondence to those in the whole systems.
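The merging and ratio steps of this process can be sketched in Python as follows. This is only an illustration under stated assumptions: the clone detector's output is passed in as a hypothetical `clone_hits` dictionary (CCFinder itself is not invoked), and Python's `difflib.SequenceMatcher` stands in for diff; it finds common blocks but is not guaranteed to return a true longest common subsequence. The names `matched_lines` and `s_line` are invented for this sketch.

```python
from difflib import SequenceMatcher


def matched_lines(x_lines, y_lines):
    """Line indices of x and y that lie in common blocks (difflib as a diff stand-in)."""
    sm = SequenceMatcher(None, x_lines, y_lines, autojunk=False)
    xs, ys = set(), set()
    for block in sm.get_matching_blocks():
        xs.update(range(block.a, block.a + block.size))
        ys.update(range(block.b, block.b + block.size))
    return xs, ys


def s_line(system_x, system_y, clone_hits):
    """system_x, system_y: {file name: list of preprocessed lines}.

    clone_hits: {(x_file, y_file): (x line indices, y line indices)} for the
    file pairs in which the clone detector reported at least one clone.
    """
    covered_x = {name: set() for name in system_x}
    covered_y = {name: set() for name in system_y}
    for (xf, yf), (clone_x, clone_y) in clone_hits.items():
        # diff is run only on file pairs that share a clone.
        diff_x, diff_y = matched_lines(system_x[xf], system_y[yf])
        # Merge clone lines with the lines found by diff.
        covered_x[xf] |= clone_x | diff_x
        covered_y[yf] |= clone_y | diff_y
    matched = sum(len(s) for s in covered_x.values()) + sum(len(s) for s in covered_y.values())
    total = sum(len(v) for v in system_x.values()) + sum(len(v) for v in system_y.values())
    return matched / total if total else 0.0


# Toy systems (hypothetical): a.c was moved to dir/a2.c and one line changed.
X = {"a.c": ["l1", "l2", "l3"], "b.c": ["m1", "m2"]}
Y = {"dir/a2.c": ["l1", "lX", "l3"], "c.c": ["n1"]}
hits = {("a.c", "dir/a2.c"): ({0}, {0})}
value = s_line(X, Y, hits)  # 4 covered lines out of 9
```

Because file pairs are selected by shared clones rather than by name, the renamed path `dir/a2.c` is still compared with `a.c`, while `b.c` and `c.c` are never passed to the diff step.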

10 Diff and CCFinder
- A straightforward approach would be to first construct appended files x1; x2; ... and y1; y2; ..., the concatenations of all source files x1, x2, ... and y1, y2, ... of systems X and Y, respectively.
  - This method is fragile, because internal reshuffling of files changes the concatenation order.
- Another approach is to greedily apply diff to all combinations of files between the two systems.
  - This approach might work, but scalability would be an issue.
- When a piece of code is shorter than the CCFinder threshold (usually 20 tokens), CCFinder reports no clones at all.
- We therefore propose an approach that effectively uses both diff and CCFinder.

11 Applications of SMAT
- To explore the applicability of Sline and SMAT, we have used many versions of the open-source BSD UNIX operating systems:
  - 4.4-BSD Lite, 4.4-BSD Lite2
  - FreeBSD 2.0, 2.0.5, 2.1, 2.2, 3.0, 4.0
  - NetBSD 1.0, 1.1, 1.2, 1.3, 1.4, 1.5
  - OpenBSD 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8
- 23 major-release versions were chosen, and Sline was computed for all pair combinations.
- The evaluation was performed only on the source code files of the OS kernels written in C.

13 Results (1/2)
- Sline evolution between FreeBSD 2.2 and other FreeBSD versions

14 Results (2/2)
- Sline between each version of FreeBSD and some of NetBSD

15 Cluster Analysis
- The dendrogram from a cluster analysis is shown.
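One way such a dendrogram could be derived from pairwise similarity values is agglomerative clustering on the distance 1 - S. The slides do not say which clustering method was used, so the average-linkage choice below is an assumption, and the system names and similarity values are made up purely for illustration (they are not measured Sline values).

```python
def average_linkage(names, sim):
    """Agglomerative clustering with average linkage (assumed method).

    sim maps frozenset({a, b}) to a similarity in [0, 1]; distance is 1 - sim.
    Returns the merge steps as (cluster_a, cluster_b, distance), in order.
    """
    clusters = [(n,) for n in names]

    def dist(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1.0 - sim[frozenset((a, b))] for a, b in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical systems and similarity values, for illustration only.
names = ["A1", "A2", "B1"]
sim = {
    frozenset(("A1", "A2")): 0.9,
    frozenset(("A1", "B1")): 0.3,
    frozenset(("A2", "B1")): 0.4,
}
merges = average_linkage(names, sim)
```

The sequence of merges, read from first to last, gives the dendrogram from the leaves to the root; here the two most similar systems are joined first and the third is attached at a larger distance.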

16 Conclusion
- We have
  - proposed a similarity metric called Sline, defined as the ratio of shared source code lines to the total source code lines
  - developed an Sline-based evaluation tool, SMAT
  - applied SMAT to various software systems
- Sline and SMAT are very useful for identifying the origin of systems and for characterizing their evolution.

17 Future work
- Further applications of SMAT to various software systems and product lines will be made to investigate their evolution.

18 End

19 Sline and Release Duration
- The release durations are calculated from the differences between OS release dates.
- The Pearson's correlation coefficient between Sline values and release durations of FreeBSD versions is -0.973.
- The Pearson's correlation coefficient between the size increases and the release durations is 0.528.
- We think that Sline is a reasonable measure of release duration in this case.
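For reference, a Pearson correlation coefficient of this kind can be computed as in the short sketch below. The data points are hypothetical placeholders chosen to be perfectly linear; they are not the FreeBSD measurements, which are not listed on this slide.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / sqrt(var_x * var_y)


# Hypothetical values: similarity drops as the gap between releases grows.
similarities = [0.9, 0.7, 0.5, 0.3]
durations_in_days = [100.0, 300.0, 500.0, 700.0]
r = pearson(similarities, durations_in_days)
```

A value of `r` close to -1, like the -0.973 reported above, means that similarity falls almost linearly as the time between releases increases.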

20 The number of files and LOC of BSD UNIX

21 Part of Sline values between BSD UNIX kernel files

22 Outline of CCFinder
- CCFinder directly compares source code token by token and detects code clones.
  - Normalization of name space
  - Replacement of names defined by user
  - Removal of table initialization
  - Consideration of module delimiters
- CCFinder can analyze systems on the scale of millions of lines in practical time.

23 CCFinder: Clone Detection Process

Source files -> Lexical analysis -> Token sequence -> Transformation -> Transformed token sequence -> Match detection -> Clones on transformed sequence -> Formatting -> Clone pairs

Example source:

    static void foo() throws RESyntaxException {
        String a[] = new String [] { "123,400", "abc", "orange 100" };
        org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (pat.match(a[i]))
                sum += Sample.parseNumber(pat.getParen(0));
        System.out.println("sum = " + sum);
    }
    static void goo(String [] a) throws RESyntaxException {
        RE exp = new RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (exp.match(a[i]))
                sum += parseNumber(exp.getParen(0));
        System.out.println("sum = " + sum);
    }
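The idea behind the transformation and match detection stages can be sketched in Python. This is not CCFinder's implementation: the tokenizer is a small regular expression, the keyword list covers only the words in the example above, only the "replacement of names defined by user" rule is imitated (every identifier becomes the placeholder `$`), and `difflib` is used in place of CCFinder's own matching algorithm. The 20-token minimum is the default mentioned earlier.

```python
import re
from difflib import SequenceMatcher

# Assumed keyword subset: just the Java keywords that occur in the example.
KEYWORDS = {"for", "if", "int", "static", "void", "new", "throws"}
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|\d+|\+\+|\+=|[^\s\w]')
MIN_TOKENS = 20  # default minimum clone length


def normalized_tokens(code):
    """Tokenize code and replace every user-defined name by the placeholder '$'."""
    out = []
    for tok in TOKEN.findall(code):
        is_name = tok[0].isalpha() or tok[0] == "_"
        out.append("$" if is_name and tok not in KEYWORDS else tok)
    return out


def longest_common_run(a, b):
    """Longest contiguous run of tokens shared by sequences a and b."""
    sm = SequenceMatcher(None, a, b, autojunk=False)
    return sm.find_longest_match(0, len(a), 0, len(b))


# The loop from foo() and the loop from goo(), which differ only in pat/exp.
foo_part = "int sum = 0; for (int i = 0; i < a.length; ++i) if (pat.match(a[i]))"
goo_part = "int sum = 0; for (int i = 0; i < a.length; ++i) if (exp.match(a[i]))"
ta, tb = normalized_tokens(foo_part), normalized_tokens(goo_part)
run = longest_common_run(ta, tb)
is_clone = run.size >= MIN_TOKENS
```

After name replacement the two token sequences are identical, so the whole fragment forms one common run that is long enough to be reported as a clone. The following statements, `Sample.parseNumber(...)` and `parseNumber(...)`, still differ in this sketch because the name space normalization rule is not imitated.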
