
Mining Software Repositories for Accurate Authorship
Xiaozhu Meng, Barton P. Miller, William R. Williams, and Andrew R. Bernat
Computer Sciences Department, University of Wisconsin - Madison
ICSM 2013, Eindhoven, Netherlands, September 24, 2013

Line-level authorship information is useful for:
o Analyzing software quality
o Performing software forensics
o Improving software maintenance

Limitation of the current methods
o Current tools: git-blame, svn-annotate, and cvs-annotate
o They only report the last change
o They miss earlier changes
Example (the slide shows the same statement twice, annotated first with Alice and Bob, then with Alice, Bob, and Jim):
    printk("%s%s[%d]: segfault at %lx ip %p sp %p error %lx", task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG, tsk->comm, task_pid_nr(tsk), address, (void *)regs->ip, (void *)regs->sp, error_code);

Accurate line-level authorship
o Repository graph: a graph abstraction of a code repository
o Structural authorship: a sub-graph recording the development history of a line of code
o Weighted authorship: contribution weights for each author

Steps to extract accurate line-level authorship
Code repository -> Repository graph -> Structural authorship (for a line of code) -> Weighted authorship: (Alice: 50%, Bob: 30%, Jim: 20%)

Repository graph
o Nodes are revisions: snapshots of different stages of the project
o Edges represent development dependencies: branching and merging create multiple paths
o Edges are annotated with code changes:
  o Added, deleted, and changed lines
  o Code changes can be composed along a path
Figure: revisions s0 to s10, authored by Alice, Bob, and Jim, connected by the edges δ0,1; δ1,2; δ2,3; δ3,4; δ2,5; δ5,6; δ4,7; δ6,7; δ7,10; δ5,8; δ8,9; δ9,10
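The composition of code changes along a path can be illustrated with a small Python sketch. It is not taken from git-author: the `Delta` and `RepositoryGraph` classes, the stable line identifiers such as "L10", the author assignment, and the composition rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Delta:
    """Line-level change on one edge (hypothetical representation).

    Lines are named by stable identifiers, which is an assumption
    that keeps the composition rule simple.
    """
    added: frozenset = frozenset()
    deleted: frozenset = frozenset()
    changed: frozenset = frozenset()


def compose(first: Delta, second: Delta) -> Delta:
    """Net change of applying `first` and then `second`."""
    # A line added and later deleted leaves no trace in the net change.
    added = (first.added - second.deleted) | second.added
    deleted = first.deleted | (second.deleted - first.added)
    # A line that is net-added or net-deleted is not reported as changed.
    changed = (first.changed | second.changed) - added - deleted
    return Delta(added, deleted, changed)


class RepositoryGraph:
    """Revisions as nodes, parent-to-child edges annotated with a Delta."""

    def __init__(self):
        self.author = {}  # revision -> author
        self.delta = {}   # (parent, child) -> Delta

    def add_revision(self, rev, author):
        self.author[rev] = author

    def add_edge(self, parent, child, delta):
        self.delta[(parent, child)] = delta

    def compose_path(self, path):
        """Compose the deltas of consecutive edges along `path`."""
        steps = [self.delta[(a, b)] for a, b in zip(path, path[1:])]
        return reduce(compose, steps, Delta())


# Hypothetical data for the path s2 -> s5 -> s6 -> s7 from the slide.
graph = RepositoryGraph()
for rev, author in [("s2", "Alice"), ("s5", "Bob"), ("s6", "Bob"), ("s7", "Jim")]:
    graph.add_revision(rev, author)
graph.add_edge("s2", "s5", Delta(added=frozenset({"L10"})))
graph.add_edge("s5", "s6", Delta(changed=frozenset({"L3", "L10"})))
graph.add_edge("s6", "s7", Delta(deleted=frozenset({"L4"})))

delta_2_7 = graph.compose_path(["s2", "s5", "s6", "s7"])
print(delta_2_7)
```

In this sketch line "L10" is added on the first edge and modified on the second, so the composed change reports it only as added.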

Structural authorship
A sub-graph records the development history of a line of code
o δ2,7 = δ6,7 ○ δ5,6 ○ δ2,5
o δ2,9 = δ8,9 ○ δ5,8 ○ δ2,5
Figure: the repository graph (revisions s0 to s10, authors Alice, Bob, and Jim) with the composed changes δ2,7 and δ2,9 shown next to the original edges
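A much simplified way to picture such a sub-graph is to walk backward from the newest revision and keep every edge on which the tracked line differs between parent and child. The Python sketch below does only that. It is not the algorithm used by git-author: the `parents` map mirrors the edges on the slide, while the `line_at` contents and the function name are hypothetical.

```python
from collections import deque


def structural_authorship(parents, line_at, head):
    """Collect edges (parent, child) where the tracked line differs.

    parents: revision -> list of parent revisions
    line_at: revision -> text of the tracked line, or None if absent
    """
    edges = set()
    seen = {head}
    queue = deque([head])
    while queue:
        child = queue.popleft()
        if line_at.get(child) is None:
            continue  # the line does not exist here, nothing to trace
        for parent in parents.get(child, ()):
            if line_at.get(parent) != line_at[child]:
                edges.add((parent, child))
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return edges


# Edges of the slide's repository graph, stored child -> parents.
parents = {
    "s1": ["s0"], "s2": ["s1"], "s3": ["s2"], "s4": ["s3"],
    "s5": ["s2"], "s6": ["s5"], "s7": ["s4", "s6"],
    "s8": ["s5"], "s9": ["s8"], "s10": ["s7", "s9"],
}

# Hypothetical history of one line: introduced in s1, edited in s5 and s8.
line_at = {
    "s0": None, "s1": "a", "s2": "a", "s3": "a", "s4": "a",
    "s5": "ab", "s6": "ab", "s7": "ab",
    "s8": "abc", "s9": "abc", "s10": "abc",
}

history = structural_authorship(parents, line_at, "s10")
print(sorted(history))
```

Note that this sketch also reports the merge edges (s4, s7) and (s7, s10), because a merge brings in the other branch's version of the line. A real implementation would need to decide how to attribute such edges.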

Weighted authorship
Contribution weights for each author
Figure: versions of one statement as they appear on the slide, annotated with the authors Alice, Bob, and Jim:
    force_sig_info_fault(si_code, address, tsk, 0);
    force_sig_info_fault(si_code, address | 0xff, tsk);
    force_sig_info_fault(si_code);
    force_sig_info_fault(si_code, address, tsk, 0);
Weights for the line force_sig_info_fault(si_code, address, tsk, 0); are (Alice: 4.5%, Bob: 25%, Jim: 70.5%)
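The slide does not say how the weights are derived. One plausible reading, sketched below in Python, is to credit each author with the size of their edit to the line and then normalize. Using character-level Levenshtein distance as the size measure is an assumption of this sketch, as are the function names and the toy history, and the sketch is not meant to reproduce the percentages on the slide.

```python
def levenshtein(a, b):
    """Character-level edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def weighted_authorship(versions):
    """Turn a line's history into per-author contribution weights.

    versions: list of (author, text) in chronological order.
    The first version is measured against the empty string.
    """
    effort = {}
    previous = ""
    for author, text in versions:
        effort[author] = effort.get(author, 0) + levenshtein(previous, text)
        previous = text
    total = sum(effort.values())
    if total == 0:
        return {author: 0.0 for author in effort}
    return {author: amount / total for author, amount in effort.items()}


# Hypothetical history of one line.
weights = weighted_authorship([
    ("Alice", "abcd"),    # 4 characters written
    ("Bob", "abxd"),      # 1 substitution
    ("Jim", "abxdyz"),    # 2 insertions
])
print(weights)
```

With this measure an author who rewrites most of a line receives most of the weight, even if a later author makes the last small edit.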

Our new git-author
o Implements repository graph, structural authorship, and weighted authorship
o Uses a syntax similar to that of git-blame

Evaluation
o Multi-author study
o Source code bug prediction study

Multi-author study
Repository   Multiple Authors   Number of lines
Dyninst      40K (9.12%)        434K
GCC          217K (6.27%)       3454K
Gimp         78K (8.12%)        955K
Httpd        20K (8.15%)        247K
Linux        1072K (7.22%)      14857K
o Investigate the percentage of multi-author lines
o git-blame loses information on these lines
o git-author identifies 6% to 9% of total lines as multi-author lines

Source code bug prediction
o A machine learning based technique to
  o Learn the characteristics of previous bugs
  o Predict where current bugs are
o Improve software testing
  o Prioritize testing
  o Reduce testing effort

Bug prediction study
o Granularity, from coarser to finer: module-level, file-level, line-level
o A module or a file still contains a lot of code!
o Locate suspicious lines
o Investigate whether accurate line-level authorship improves bug prediction

Approach comparison
A bug prediction model uses a machine learning technique to learn bug predictors and predict where the bugs are
Model components             File-level model [1]            Line-level model
Input                        a source file                   a line of code
Output                       the bug density* of the file    the probability that the line is buggy
Bug predictors               code churn, age, bug fixes      weighted authorship, number of authors, number of commits
Machine learning technique   linear regression               linear SVM
* Bug density of a source file is the average number of bugs per line
[1] Y. Kamei, et al. Revisiting common bug prediction findings using effort-aware models. 2010.

Experiment setup
o Bug report database (Bug #1, Bug #2, Bug #3) and code repository (Release 1, Release 2, Release 3): match if the bug is present in the release
o Apache HTTP Server Project
  o We selected seven releases that had a large number of reported bugs
  o For each release, we trained on that release and predicted on the next release

Performance comparison
o Point (x,y) means that by testing x% of total lines of code, we can find y% of total bugs
o The closer a model gets to the top-left corner, the better the model is
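As a rough illustration of how such a curve can be built, the Python sketch below ranks lines by a model's predicted risk and records, after each tested line, the percentage of lines tested and the percentage of bugs found. The function name and the scores are hypothetical, and each buggy line is counted as one bug, which is a simplifying assumption.

```python
def effort_curve(scores, is_buggy):
    """Return (x, y) points: testing x% of lines finds y% of bugs.

    scores: predicted risk per line, higher means more suspicious
    is_buggy: True for lines that are actually buggy
    """
    total_lines = len(scores)
    total_bugs = sum(is_buggy)
    # Test the most suspicious lines first.
    order = sorted(range(total_lines), key=lambda i: scores[i], reverse=True)
    points = [(0.0, 0.0)]
    found = 0
    for tested, index in enumerate(order, 1):
        found += is_buggy[index]
        x = 100.0 * tested / total_lines
        y = 100.0 * found / total_bugs if total_bugs else 0.0
        points.append((x, y))
    return points


# Hypothetical predictions for four lines, two of which are buggy.
curve = effort_curve([0.9, 0.1, 0.8, 0.3], [True, False, False, True])
print(curve)
```

A model that ranks the buggy lines first makes y reach 100% at a small x, which is what pushes its curve toward the top-left corner.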

Results: Apache 2.2.10 predicting 2.3.0

Future work: binary code authorship
o Software forensics
o Use git-author for ground truth
o Malware binaries
o Learning-based coding style attribution

Conclusions
o Structural authorship and weighted authorship overcome a weakness of the current methods
o Git-author extracts more information than git-blame on 6% to 9% of total lines
o This information improves source code bug prediction

Questions?
Git-author is available at: https://github.com/mxz297/Git-author

Numerical metrics
o Area under the curve (AUC) is a numerical summary of the performance of a model
o The difference of AUC between two models represents the testing effort saved by the better model
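One common way to compute such an area is the trapezoidal rule over the curve's points. The Python sketch below assumes the points are (percent of lines tested, percent of bugs found) pairs sorted by x. Both curves are made-up examples, and the slides do not state which integration method the study used.

```python
def area_under_curve(points):
    """Trapezoidal area under a curve of (x%, y%) points, scaled to [0, 1]."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2.0
    return area / (100.0 * 100.0)


# Hypothetical curves for two models.
model_a = [(0.0, 0.0), (25.0, 50.0), (50.0, 50.0), (75.0, 100.0), (100.0, 100.0)]
model_b = [(0.0, 0.0), (100.0, 100.0)]  # bugs found in proportion to lines tested

auc_a = area_under_curve(model_a)
auc_b = area_under_curve(model_b)
print(auc_a, auc_b, auc_a - auc_b)
```

Here the first model's area is larger by 0.125, which under the reading on this slide is the testing effort it saves relative to the second.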

Bug Results
            Popt                                  CE
            lm opti  lm avg   lm pes   fm         lm opti  lm avg   lm pes   fm
            0.9695   0.9392   0.9023   0.8321     0.9132   0.8243   0.7220   0.5221
            0.9884   0.9632   0.9297   0.8166     0.9664   0.8935   0.7965   0.4693
            0.9997   0.9706   0.9339   0.8453     0.9990   0.9148   0.8082   0.5509
            0.9647   0.9325   0.8965   0.8716     0.8956   0.8007   0.6943   0.6208
            0.9664   0.9275   0.8848   0.8870     0.8961   0.7756   0.6433   0.6504
            1.0013   0.9665   0.9245   0.9267     1.0040   0.8979   0.7700   0.7769
Mean        0.9817   0.9499   0.9120   0.8632     0.9457   0.8511   0.7391   0.5984
Std. Dev.   0.0154   0.0173   0.0184   0.0368     0.0460   0.0532   0.0585   0.0998

Line count results

Line Count Results
            Popt                CE
            lm       fm         lm       fm
            0.9148   0.8113     0.7925   0.5404
            0.9425   0.7704     0.8578   0.4321
            0.9470   0.7860     0.8658   0.4579
            0.9153   0.8288     0.7834   0.5624
            0.8660   0.7711     0.6590   0.4173
            0.9343   0.8860     0.8299   0.7050
Mean        0.9200   0.8089     0.7981   0.5192
Std. Dev.   0.0271   0.0404     0.0692   0.0988
