Kazuki Yokoi (1), Eunjong Choi (2), Norihiro Yoshida (3), Katsuro Inoue (1)

Investigating Vector-based Detection of Code Clones Using BigCloneBench
(1) Osaka University
(2) Nara Institute of Science and Technology
(3) Nagoya University

1. Motivation

FLCCFinder: we have developed a vector-based approach to code clone detection.
- It uses TF-IDF and cosine similarity.
- It has room for improvement:
  - The vectors are high-dimensional.
  - Polysemy and synonymy are missed.
- Candidate remedies: dimensionality reduction and machine learning.

To improve FLCCFinder, we investigated which vectorization algorithms and similarity measurements are effective.

[Figure labels: Source Code, Code Fragment A, Code Fragment B, Feature Vector, Vector Space, Code Clone]

2. Investigation Target

Vectorization algorithms
  BoW       Simple representation
  TF-IDF    Reflects each word's importance
  LSI       Dimensionality reduction by SVD
  LDA       Generative probabilistic model
  Doc2Vec   Extends Word2Vec to documents
  WV-avg    Average of the Word2Vec vectors
  FT-avg    Average of the FastText vectors

Similarity measurements
  Cosine similarity             Similarity between two vectors, based on their inner product
  WMD (Word Mover's Distance)   Distance function between text documents

3. Investigation and Results

RQ1: Does the recall of code clone detection vary with the vectorization algorithm?
RQ2: Does the choice of vectorization algorithm and distance measure affect detection speed?
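The pipeline described above (turn each code fragment into a vector, then compare the vectors) can be sketched in plain Python. This is a minimal illustration and not FLCCFinder's implementation: the regex tokenizer, the smoothed IDF formula, and the three sample fragments are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(code):
    """Split a fragment into identifiers/keywords, numbers, and symbols (assumed tokenizer)."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def bow_vectors(token_lists):
    """Bag of words: each fragment becomes a sparse vector of raw token counts."""
    return [dict(Counter(tokens)) for tokens in token_lists]


def tfidf_vectors(token_lists):
    """TF-IDF: term frequency scaled by a smoothed inverse document frequency (assumed variant)."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each token once per fragment
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            tok: (cnt / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, cnt in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical fragments: A and B differ only in identifier names, C is unrelated.
frag_a = "int sum = 0; for (int i = 0; i < n; i++) { sum += a[i]; }"
frag_b = "int total = 0; for (int j = 0; j < n; j++) { total += a[j]; }"
frag_c = "String s = reader.readLine();"

token_lists = [tokenize(f) for f in (frag_a, frag_b, frag_c)]
bow = bow_vectors(token_lists)
tfidf = tfidf_vectors(token_lists)

bow_ab = cosine_similarity(bow[0], bow[1])
bow_ac = cosine_similarity(bow[0], bow[2])
tfidf_ab = cosine_similarity(tfidf[0], tfidf[1])
tfidf_ac = cosine_similarity(tfidf[0], tfidf[2])

print("BoW    A-B:", round(bow_ab, 3), " A-C:", round(bow_ac, 3))
print("TF-IDF A-B:", round(tfidf_ab, 3), " A-C:", round(tfidf_ac, 3))
```

On these toy fragments the renamed pair (A, B) scores higher than the unrelated pair (A, C) under both weightings. The TF-IDF similarity of the renamed pair is lower than its BoW similarity, because the renamed identifiers (`sum`/`total`, `i`/`j`) are the rarest tokens and therefore receive the largest weights.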
- Applied each vectorization algorithm to BigCloneBench, a big-data clone benchmark.
- Built a 1 MLOC dataset by randomly selecting from BigCloneBench.

Recall for each vectorization algorithm

  Clone type   BoW   TF-IDF   LSI   LDA   Doc2Vec   WV-avg   FT-avg
  T2           .84   .82      .92   .85   .91       .95      .94
  MT3          .06   .03      .09   .23   .04       .55      .43

  Rows with missing cells (values in original order):
  T1       .99
  VST3     .90 .83 .97 .93
  ST3      .45 .37 .61 .46 .79
  WT3/T4   .00 .02 .08 .05

  Clone types: T1 = Type-1, T2 = Type-2, VST3 = Very Strongly Type-3, ST3 = Strongly Type-3, MT3 = Moderately Type-3, WT3/T4 = Weakly Type-3/Type-4.

Calculation time for each vectorization algorithm

  Similarity measurement   Vectorization algorithm   Generation time [sec]
  Cosine similarity        BoW                       5.1
  Cosine similarity        TF-IDF                    10.0
  Cosine similarity        LSI                       9.7
  Cosine similarity        LDA                       60.3
  Cosine similarity        Doc2Vec                   44.7
  Cosine similarity        WV-avg                    42.7
  Cosine similarity        FT-avg                    196.1
  WMD                      Word2Vec                  29.5
  WMD                      FastText                  187.7

  Similarity time [sec] (values in original order): 5.5 1.1 1.6 497.5 538.1

Findings
- Recall was improved by using dimensionality reduction and machine learning.
- WV-avg had the highest recall.
- WV-avg (Word2Vec) was fast among the algorithms that use machine learning.
- WMD was much slower than cosine similarity.

4. Discussion

RQ1
- The recall of TF-IDF was lower than that of BoW. This is because the weights of identifiers tend to be high, while code clones often contain different identifier names.
- Doc2Vec was not able to detect many Type-3 clones.
- WV-avg had the highest recall.

RQ2
- WV-avg (Word2Vec) achieved the highest speed among the algorithms that use machine learning.
- It is not practical to use WMD for detecting code clones.

Overall, WV-avg was an effective algorithm.

5. Future works

- Measure precision and F-measure.
- Investigate algorithms other than those used in this study.
- Use machine learning to classify code fragments as code clones or non-clones.

[Figure labels: Input, Source Code, Code Fragment A, Code Fragment B, Vectorization, Feature Vector, Machine Learning, Output, Code Clones, NON-Clones]

This work was supported by JSPS KAKENHI Grant Numbers JP , JP18H04094 and JP16K16034.
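The WV-avg representation singled out in the results averages the word vectors of a fragment's tokens into one fragment vector. A minimal sketch of that idea follows. The 3-dimensional embedding table is invented for illustration; a real setup would use vectors learned by Word2Vec (or FastText for FT-avg) from a code corpus, and the authors' actual preprocessing is not shown in the source.

```python
import math

# Hypothetical token embeddings, invented for this example only.
TOY_EMBEDDINGS = {
    "int":   [0.9, 0.1, 0.0],
    "sum":   [0.1, 0.8, 0.2],
    "total": [0.2, 0.7, 0.2],
    "=":     [0.0, 0.1, 0.9],
    "0":     [0.3, 0.3, 0.3],
}


def average_vector(tokens, embeddings):
    """Average the embeddings of the known tokens; return None if none are known."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two token sequences that differ only in one identifier name.
vec_a = average_vector(["int", "sum", "=", "0"], TOY_EMBEDDINGS)
vec_b = average_vector(["int", "total", "=", "0"], TOY_EMBEDDINGS)
print("WV-avg cosine similarity:", cosine_similarity(vec_a, vec_b))
```

Because the fragment vector has the same dimensionality as a single word vector, comparing two fragments costs one cosine computation, whereas WMD has to match the individual word vectors of both fragments against each other.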
