Finding File Clones in FreeBSD Ports Collection
Yusuke Sasaki, Tetsuo Yamamoto, Yasuhiro Hayase, Katsuro Inoue
File Clones
* Two or more files with the same content
* Comments and code indentation ignored
* Inside a project or between different projects
* Research about file-clones is scarce
* Get new knowledge about file-clones
Example, the same file in Project A and Project B:
int main() { printf("Hello msr!"); return 0; }
FCFinder
* Input: .c and .h files
* Output: File-clone sets
* Detection:
  1. Tokenization
  2. MD5 Hash Calculation
  3. Exact Matching
* Faster than other tools

Tool        Speed                       Speedup  Machines
CCFinder    1.4M files / 960 hours      x1       1 PC
D-CCFinder  1.4M files / 51 hours       x19      80 PCs
FCFinder    1.4M files / 17.16 hours    x55
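
The three detection steps named on the FCFinder slide (tokenization, MD5 hash calculation, exact matching) can be sketched roughly as below. This is only an illustration under stated assumptions, not FCFinder's actual implementation: the regex-based tokenizer, the helper names `tokenize`, `fingerprint`, and `find_file_clones`, and the in-memory `files` mapping are all hypothetical.

```python
import hashlib
import re
from collections import defaultdict

# Assumed, simplified C lexer. Strings are matched in the same scan as
# comments, so a "//" inside a string literal is not mistaken for a comment.
_LEXEME = re.compile(
    r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)                   # block or line comment
  | (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')   # string or char literal
  | (?P<word>[A-Za-z_]\w*|\d[\w.]*)                   # identifier, keyword, number
  | (?P<punct>\S)                                     # any other non-space character
    """,
    re.DOTALL | re.VERBOSE,
)


def tokenize(source):
    """Return the token list of a C source text, dropping comments and whitespace."""
    return [m.group() for m in _LEXEME.finditer(source) if m.lastgroup != "comment"]


def fingerprint(source):
    """MD5 digest of the token sequence (tokens joined by a NUL separator)."""
    joined = "\0".join(tokenize(source))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def find_file_clones(files):
    """Group paths whose token sequences hash to the same digest.

    `files` maps a path to its source text (hypothetical input format).
    Only groups with two or more files are returned.
    """
    groups = defaultdict(list)
    for path, source in files.items():
        groups[fingerprint(source)].append(path)
    return sorted(sorted(paths) for paths in groups.values() if len(paths) > 1)


if __name__ == "__main__":
    a = 'int main() { printf("Hello msr!"); return 0; }'
    b = '/* copy */\nint main()\n{\n\tprintf("Hello msr!"); // greet\n\treturn 0;\n}\n'
    print(find_file_clones({"A/hello.c": a, "B/hello.c": b, "B/other.c": "int x;"}))
```

Because each file is reduced to one digest and files are grouped by digest, the work grows with the number of files rather than with the number of file pairs, which is one plausible reason a hash-based detector can be fast.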
Experiment
* Target: only .c and .h files in the FreeBSD Ports Collection
  * ~1.4M files
  * ~12 GB
  * 17.16 hours
* We measured:
  * File size
  * Number of files in each project
  * Size of each file-clone set
  * Number of file-clones in a project
* These values follow the power law
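
One common way to check a power-law claim such as the one above is a straight-line fit on log-log axes, with R^2 as the goodness of fit. The slides do not say how the fit was computed, so the following is a sketch under that assumption; the function name `fit_power_law` and the toy data are hypothetical.

```python
import math
from collections import Counter


def fit_power_law(values):
    """Fit count(x) ~ c * x**(-alpha) by least squares on log-log axes.

    `values` is a list of positive measurements (for example, one size per
    file-clone set). Returns (alpha, c, r_squared).
    """
    freq = Counter(values)                      # how often each value occurs
    points = sorted(freq.items())
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(n) for _, n in points]

    n = len(xs)
    if n < 2:
        raise ValueError("need at least two distinct values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("all frequencies are equal; R^2 is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_res / ss_tot
    return -slope, 10 ** intercept, r_squared


if __name__ == "__main__":
    # Toy data constructed so that count(x) = 1000 / x (alpha = 1).
    toy = [1] * 1000 + [10] * 100 + [100] * 10 + [1000] * 1
    print(fit_power_law(toy))
```

Real measurements are noisy, so R^2 falls below 1; a log-log least-squares fit is also known to be a rough test, and a maximum-likelihood fit would be a stricter alternative.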
File-clone Set Size
[Figure: distribution of file-clone set sizes; axis: file clone set size (log scale, ticks 5 to 100); R^2 = 0.8508]
Annotations in the figure:
* Left: used in PHP5; Right: used in PHP4; used in both of PHP4 and 5
* D, E
* L: 650 sets; R: 500 sets; 419 sets; 120 file clones
* L: 61 file clones; R: 59 file clones
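File-clones per Project
[Figure: distribution of file clones per project; axis: number of file clone sets (log scale, ticks 5 to 10K); R^2 = 0.8263]
Annotations in the figure:
* Left: PHP5 modules
* Center: projects related to bin-utils
* Right: PHP4 modules
* G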
File-clones Between Projects (1/3)
* Nodes show the projects
* Edges between projects show the number of file clones between two projects
Ex) gcc41 and gfortran share 7691 file clones
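
The project graph described here can be derived from the file-clone sets. The sketch below assumes one possible edge weight, the number of clone sets that contain files from both projects; the slides do not define the count precisely, and the function name `project_edges`, the input format, and the toy file names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def project_edges(clone_sets):
    """Count, for each pair of projects, the clone sets they have in common.

    `clone_sets` is a list of clone sets; each clone set is a list of
    (project, path) tuples. Returns a Counter keyed by sorted project pairs.
    """
    edges = Counter()
    for clone_set in clone_sets:
        # Each project counts once per clone set, however many copies it holds.
        projects = sorted({project for project, _path in clone_set})
        for a, b in combinations(projects, 2):
            edges[(a, b)] += 1
    return edges


if __name__ == "__main__":
    # Toy data; file names are made up for illustration.
    toy_sets = [
        [("gcc41", "a.c"), ("gfortran", "a.c")],
        [("gcc41", "b.h"), ("gfortran", "b.h"), ("php5", "b.h")],
        [("php5", "x.c"), ("php5", "y.c")],  # clones inside one project: no edge
    ]
    for (a, b), weight in sorted(project_edges(toy_sets).items()):
        print(a, b, weight)
```

Clone sets confined to a single project add no edge, so the graph only reflects sharing between projects.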
File-clones Between Projects (2/3)
* Nodes show the projects
* Edges between projects show the number of file clones between two projects
File-clones Between Projects (3/3)
* Nodes show the projects
* Edges between projects show the number of file clones between two projects
Conclusions & Future Work
* Measured several features of the FreeBSD Ports Collection
* Found that the measured features follow the power law
* Future work: investigate logical coupling between projects