Extraction of Evolution History from Software Source Code Using Linear Counting
  Speaker: Liu Shuchang, Osaka University
Background
  Daily software development: copy existing code, then edit it
  Each copy-and-edit step yields a product variant
  Repeated over time, this is software evolution
Evolution History Example
  only source code ❌
Introduction
  Evolution History Recovery
    product variants, using only source code
  Evolution Tree
    vertex: variant
    edge: derived relation (most similar pair)
    key: product similarity
  Previous Study
    diff based (file-to-file similarity)
    time needed (worst case: 2 days)
  Linear Counting Algorithm
    estimating instead of calculating
Linear Counting Algorithm
  An example of the Linear Counting Algorithm
    Bitmap Size: 8
    Zero: 2
    Estimate: -8 × ln(2/8) = 11.0903
    Cardinality: 11
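
The arithmetic on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the speaker's implementation: the function name `linear_count` and the list-of-bits representation are assumptions made for illustration. It applies the standard Linear Counting estimate, -m × ln(z/m), where m is the bitmap size and z is the number of zero bits.

```python
import math


def linear_count(bitmap):
    """Estimate cardinality from a bitmap given as a list of 0/1 values.

    Hypothetical helper: uses the Linear Counting estimate -m * ln(z / m),
    where m is the bitmap size and z is the number of zero bits.
    """
    m = len(bitmap)
    zeros = bitmap.count(0)
    if zeros == 0:
        # A saturated bitmap carries no information; a larger one is needed.
        raise ValueError("bitmap is full; use a larger bitmap size")
    return -m * math.log(zeros / m)


# The slide's example: bitmap size 8 with 2 zero bits.
example_bitmap = [1, 1, 0, 1, 1, 1, 0, 1]
estimate = linear_count(example_bitmap)
print(estimate)  # close to the slide's 11.0903
```

The estimate exceeds the six set bits because several items are assumed to have collided on the same positions, which is what the logarithm corrects for.
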
Estimate Product Similarity
  Initialization
    Multiset A -> hash function -> Bit Map A
    Multiset B -> hash function -> Bit Map B
  Bitwise operator
    Bit Map A, Bit Map B -> Bit Map A∩B, Bit Map A∪B
  Similarity: Jaccard Index
    |A∩B| / |A∪B|, estimated by division: LC(A∩B) / LC(A∪B)
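
The flow on this slide can be sketched as follows. Several details are assumptions, not taken from the talk: the bitmaps are Python integers, the bitwise operators are taken to be AND for A∩B and OR for A∪B, and `hashlib.blake2b` stands in for the hash function so the sketch needs only the standard library. The names `to_bitmap`, `linear_count` and `estimate_jaccard` are hypothetical. A bitmap records only whether a position was hit, so repeated elements of a multiset collapse to one bit.

```python
import hashlib
import math


def to_bitmap(items, m):
    """Hash each item to one of m bit positions; the bitmap is a Python int."""
    bitmap = 0
    for item in items:
        # Stand-in hash (assumption): blake2b from the standard library.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        bitmap |= 1 << (int.from_bytes(digest, "big") % m)
    return bitmap


def linear_count(bitmap, m):
    """Linear Counting estimate: -m * ln(zero_bits / m)."""
    zeros = m - bin(bitmap).count("1")
    if zeros == 0:
        raise ValueError("bitmap is full; use a larger bitmap size")
    return -m * math.log(zeros / m)


def estimate_jaccard(items_a, items_b, m=1 << 20):
    """Estimate |A∩B| / |A∪B| as LC(A AND B) / LC(A OR B)."""
    bitmap_a = to_bitmap(items_a, m)
    bitmap_b = to_bitmap(items_b, m)
    union = linear_count(bitmap_a | bitmap_b, m)  # assumed: OR gives A∪B
    if union == 0:
        return 1.0  # both inputs empty; treated as identical
    intersection = linear_count(bitmap_a & bitmap_b, m)  # assumed: AND gives A∩B
    return intersection / union


# Two tiny "variants" that share two of their three lines.
variant_a = ["int a;", "int b;", "return a + b;"]
variant_b = ["int a;", "int b;", "return a - b;"]
similarity = estimate_jaccard(variant_a, variant_b)
same = estimate_jaccard(variant_a, variant_a)
print(similarity, same)
```

For the two toy variants the exact Jaccard Index is 2/4, and the estimate lands close to that as long as the bitmap is large compared with the number of distinct items.
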
Process Flow
  Variant A (Source Code) -> Initialization -> Initial Multiset A
  Variant B (Source Code) -> Initialization -> Initial Multiset B
  Initialization options
    1. n-gram modeling
    2. each line of the code
  Linear Counting Algorithm -> Jaccard Index |A∩B| / |A∪B|
    computed for every pair: (A, B), (A, C), (A, D), …
  Prim's Algorithm (the most similar pair) -> Evolution Tree
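
The last step of the flow, turning pairwise similarities into a tree with Prim's Algorithm, might look like the sketch below. It is an illustration under assumptions: the function name `evolution_tree`, the dictionary of similarities and the toy scores are all hypothetical, and Prim's Algorithm is run to maximise similarity, so each step attaches the variant that is most similar to one already in the tree. The resulting edges are undirected; similarity alone does not say which variant was derived from which.

```python
def evolution_tree(variants, similarity):
    """Build a maximum-similarity spanning tree with Prim's algorithm.

    variants:   list of variant names
    similarity: dict mapping frozenset({u, v}) to a similarity score
    Returns a list of (in_tree_variant, newly_attached_variant) edges.
    """
    in_tree = [variants[0]]  # start from an arbitrary variant
    edges = []
    while len(in_tree) < len(variants):
        best = None
        for u in in_tree:
            for v in variants:
                if v in in_tree:
                    continue
                score = similarity[frozenset((u, v))]
                # Keep the most similar pair that crosses the tree boundary.
                if best is None or score > best[0]:
                    best = (score, u, v)
        _, u, v = best
        edges.append((u, v))
        in_tree.append(v)
    return edges


# Toy similarities (made up for illustration only).
scores = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.4,
    frozenset(("A", "D")): 0.3,
    frozenset(("B", "C")): 0.8,
    frozenset(("B", "D")): 0.2,
    frozenset(("C", "D")): 0.7,
}
tree = evolution_tree(["A", "B", "C", "D"], scores)
print(tree)  # [('A', 'B'), ('B', 'C'), ('C', 'D')]
```

In a full pipeline the scores would come from the Linear Counting estimate of the Jaccard Index for every pair of variants.
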
Research Data
  A description of the datasets we dealt with
Final Result of dataset5
  The Evolution Tree we extracted (the Best Configuration)
  Existing actual evolution history
Analysis on Bitmap Size
  Part of the experiment results of dataset5
Best Configuration
  Main Factors
    N-gram Modeling: no (each line of code)
    Bitmap Size: 128,000,000 bits
    Hashing Function: MurmurHash3
  Results
    Proper Edges: 86.5% (on average)
    Time: 10 s to 5 min
Contributions and Future Work
  Contributions
    extract an ideal Evolution Tree efficiently
    influence of various factors
    best configuration
    faster, with better accuracy
  Future Work
    larger datasets
    other programming languages
    solve the remaining problems