Investigating Vector-based Detection of Code Clones Using BigCloneBench

Kazuki Yokoi¹, Eunjong Choi², Norihiro Yoshida³, Katsuro Inoue¹
¹ Osaka University  ² Nara Institute of Science and Technology  ³ Nagoya University

Presentation transcript:

1. Motivation

We have developed FLCCFinder, a vector-based code clone detection approach. It represents code fragments as TF-IDF vectors and compares them with cosine similarity. This approach has two potential weaknesses: the vectors are high-dimensional, and polysemy and synonymy are missed. To improve FLCCFinder, we investigated which vectorization algorithms and similarity measurements are effective.

2. Investigation Target

Pipeline: source code is divided into code fragments, each fragment is turned into a feature vector, and fragment pairs that are similar in the vector space are reported as code clones. (Minimal code sketches of the vectorization and similarity steps appear at the end of this transcript.)

Vectorization algorithms:

  Algorithm   Description                            Category
  BoW         Simple bag-of-words representation     -
  TF-IDF      Reflects each word's importance        -
  LSI         Dimensionality reduction by SVD        Dimensionality reduction
  LDA         Generative probabilistic topic model   Dimensionality reduction
  Doc2Vec     Extends Word2Vec to documents          Machine learning
  WV-avg      Average of the Word2Vec word vectors   Machine learning
  FT-avg      Average of the FastText word vectors   Machine learning

Similarity measurements:
- Cosine similarity: the similarity of two vectors, computed from their inner product.
- WMD (Word Mover's Distance): a distance function between text documents.

3. Investigation and Results

- RQ1: Does the recall of code clone detection vary with the vectorization algorithm?
- RQ2: Do the choices of vectorization algorithm and similarity measurement affect detection speed?

We applied each vectorization algorithm to BigCloneBench, a large benchmark of code clones, and built a 1 MLOC dataset by randomly selecting code from it. (A sketch of the per-type recall computation also appears at the end of this transcript.)

Recall for each vectorization algorithm by clone type ("-" = not available):

  Clone type   BoW   TF-IDF   LSI   LDA   Doc2Vec   WV-avg   FT-avg
  T1           .99   .99      .99   .99   .99       .99      .99
  T2           .84   .82      .92   .85   .91       .95      .94
  VST3         .90   .83      .97   .93   -         -        -
  ST3          .45   .37      .61   .46   -         .79      -
  MT3          .06   .03      .09   .23   .04       .55      .43
  WT3/T4       .00   .02      .08   .05   -         -        -

Calculation time for each vectorization algorithm:
- Generation time [sec]: BoW 5.1, TF-IDF 10.0, LSI 9.7, LDA 60.3, Doc2Vec 44.7, WV-avg 42.7, FT-avg 196.1; Word2Vec (for WMD) 29.5, FastText (for WMD) 187.7.
- Similarity time [sec]: cosine similarity 5.5, 1.1, and 1.6 (depending on the vector representation); WMD 497.5 (Word2Vec) and 538.1 (FastText).

Findings:
- Recall was improved by using dimensionality reduction and machine learning.
- WV-avg had the highest recall.
- WV-avg (Word2Vec) was fast among the algorithms using machine learning.
- WMD was much slower than cosine similarity.

4. Discussion

RQ1:
- The recall of TF-IDF was lower than that of BoW. TF-IDF tends to assign high weights to identifiers, but code clones often use different identifier names, so these weights hinder matching.
- Doc2Vec was not able to detect many Type-3 clones.
- WV-avg had the highest recall.

RQ2:
- WV-avg (Word2Vec) achieved the highest speed among the algorithms using machine learning.
- WMD is not practical for detecting code clones.

Overall, WV-avg was an effective algorithm.

5. Future works

- Measure precision and F-measure in addition to recall.
- Investigate algorithms other than those used in this study.
- Use machine learning to classify fragment pairs into code clones and non-clones: the input is the feature vectors of two code fragments, and the output is a clone / non-clone label (see the sketch at the end of this transcript).

This work was supported by JSPS KAKENHI Grant Numbers JP25220003, JP18H04094 and JP16K16034.
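The sketches below make the main components of this investigation concrete. First, the TF-IDF-plus-cosine-similarity baseline from Section 1, together with LSI-style dimensionality reduction via truncated SVD, written with scikit-learn. The toy fragments, the regex tokenizer, and every parameter value are illustrative assumptions, not FLCCFinder's actual configuration.

```python
import re

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tokenize(code):
    # Split a code fragment into identifier and keyword tokens.
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)

fragment_a = "int sum = 0; for (int i = 0; i < n; i++) { sum += a[i]; }"
fragment_b = "int total = 0; for (int j = 0; j < len; j++) { total += arr[j]; }"

# TF-IDF vectors for the two fragments, compared by cosine similarity.
vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=False, token_pattern=None)
tfidf = vectorizer.fit_transform([fragment_a, fragment_b])
print("TF-IDF cosine:", cosine_similarity(tfidf[0], tfidf[1])[0, 0])

# LSI: project the sparse, high-dimensional TF-IDF matrix onto a few
# latent dimensions with a truncated SVD, then compare in the reduced space.
svd = TruncatedSVD(n_components=2, random_state=0)
lsi = svd.fit_transform(tfidf)
print("LSI cosine:", cosine_similarity(lsi[:1], lsi[1:])[0, 0])
```

Because these two fragments only rename identifiers (a Type-2 clone), most of their TF-IDF weight sits on tokens that differ between them, which is exactly the weakness discussed under RQ1.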
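Next, a rough sketch of WV-avg (a fragment vector is the average of its Word2Vec token vectors) and of Word Mover's Distance, assuming gensim 4.x; `wmdistance` additionally requires the POT package. The corpus and training parameters are toy values.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy "fragments" as token lists; real input would be tokenized source code.
corpus = [
    ["int", "sum", "for", "i", "n", "a"],
    ["int", "total", "for", "j", "len", "arr"],
    ["return", "sum", "a", "n"],
]

# Train a small Word2Vec model on the fragment corpus.
model = Word2Vec(sentences=corpus, vector_size=50, min_count=1, epochs=50, seed=1)

def wv_avg(tokens):
    # WV-avg: the fragment vector is the mean of its token embeddings.
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    return np.mean(vectors, axis=0)

vec_a, vec_b = wv_avg(corpus[0]), wv_avg(corpus[1])
cosine = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print("WV-avg cosine:", cosine)

# WMD solves an optimal-transport problem over all token pairs, which is
# why it is orders of magnitude slower than one cosine per fragment pair.
print("WMD:", model.wv.wmdistance(corpus[0], corpus[1]))
```

Averaging is also what keeps WV-avg fast: once the model is trained, each comparison is a single cosine over fixed-length vectors, consistent with the similarity times reported above.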
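The recall figures in Section 3 follow the usual benchmark definition: for each clone type, the fraction of reference pairs that the detector reports. A minimal sketch with placeholder pairs (not actual BigCloneBench data):

```python
from collections import defaultdict

# Benchmark reference pairs: (fragment_a, fragment_b, clone_type).
reference = [
    ("f1", "f2", "T1"), ("f3", "f4", "T2"),
    ("f5", "f6", "ST3"), ("f7", "f8", "ST3"),
]
# Pairs reported by the detector under evaluation.
detected = {("f1", "f2"), ("f5", "f6")}

hits = defaultdict(int)
totals = defaultdict(int)
for a, b, clone_type in reference:
    totals[clone_type] += 1
    if (a, b) in detected or (b, a) in detected:
        hits[clone_type] += 1

for clone_type, total in totals.items():
    print(clone_type, hits[clone_type] / total)  # per-type recall
```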
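Finally, a hypothetical sketch of the future-work idea from Section 5: a classifier that labels fragment pairs as clone or non-clone from their feature vectors. The pair encoding (|a - b| concatenated with the element-wise product) and the choice of logistic regression are our assumptions, not a method proposed in the poster.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim, n_pairs = 50, 200

# Placeholder fragment vectors and labels; real data would pair vectors
# produced by one of the algorithms above with benchmark clone labels.
vec_a = rng.normal(size=(n_pairs, dim))
vec_b = rng.normal(size=(n_pairs, dim))
labels = rng.integers(0, 2, size=n_pairs)  # 1 = clone, 0 = non-clone

# Encode each pair, then fit the classifier.
features = np.hstack([np.abs(vec_a - vec_b), vec_a * vec_b])
classifier = LogisticRegression(max_iter=1000).fit(features, labels)
print("training accuracy:", classifier.score(features, labels))
```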