Bug Localization with Combination of Deep Learning and Information Retrieval A. N. Lam et al. International Conference on Program Comprehension 2017.


Bug Localization with Combination of Deep Learning and Information Retrieval A. N. Lam et al. International Conference on Program Comprehension 2017 Presenter: Jee-weon Jung, 18. 05. 11.

Index
- Overview
- Introduction
- Background
- Model architecture
- Feature extraction
- Feature reduction with autoencoder
- DNN-based relevancy estimation
- Empirical evaluation
- Conclusion

Overview
- Objective: bug localization, i.e., locating potential buggy files for a given bug report
- Lexical mismatch is a key challenge in bug localization: terms in bug reports do not match the terms in source code
- Proposes a DNN + rVSM (information retrieval) approach
  - DNN: relates terms in bug reports to terms in source files
  - rVSM: extracts features for the textual similarity between bug reports and source files
- rVSM and DNN complement each other in real-world evaluation
- 50% precise bug localization with a single suggested file; 66% within the 3-best suggestions

Introduction (1)
- Software maintenance (bug fixing) is time-consuming
- Developers investigate and locate potential buggy files using bug reports
- Bug report: a description of the scenario in which the software malfunctions; it may provide other relevant information as well
- Quickly and precisely locating buggy files is crucial to efficiency

Introduction (2)
- Two main categories of work exist to assist bug localization
  - Fault localization: statistics from program-analysis information; focuses on the program's semantics and execution
  - Analyzing the contents of the bug report: uses information retrieval methods such as Latent Dirichlet Allocation, Latent Semantic Indexing, probabilistic ranking, the vector space model, etc.
- Machine learning has recently been used for bug localization
  - Uses previously fixed files as classification labels and assigns each report to one
  - Cannot suggest files that have never been fixed before

Introduction (3)
- This paper develops DNNLOC (DNN + rVSM)
  - rVSM: extracts features to measure textual similarity
  - DNN: measures the relevancy (similarity) between a bug report and source code
  - A further DNN combines the DNN, rVSM, and metadata outputs
  - >>> Bridges the lexical gap with the DNN and uses rVSM as a feature extractor
- An autoencoder is used to make DNNLOC scalable
  - Reduces the dimensionality of the data
- Meta features are additionally extracted and used
  - Injects human knowledge into the DNN, e.g., recently fixed files are more likely to be buggy

Introduction (4)

Background (1)
- A DNN models complex nonlinear relationships
- Its power comes from nonlinear activation functions and a deep architecture
- This paper uses the most basic type of DNN, a multi-layer perceptron (MLP)
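As a minimal sketch of the idea (not the paper's actual network; the layer sizes, ReLU hidden activation, and sigmoid output here are illustrative assumptions), an MLP is just stacked affine maps with nonlinear activations:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, layers):
    """Forward pass of a multi-layer perceptron.
    `layers` is a list of (W, b) pairs; hidden layers use ReLU
    (the nonlinearity that gives the model its power), and the
    output layer uses a sigmoid so the score lies in (0, 1)."""
    h = x
    for W, b in layers[:-1]:
        h = relu(h @ W + b)
    W, b = layers[-1]
    return sigmoid(h @ W + b)

# Illustrative 4-3-1 network with random fixed weights
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(3)),
          (rng.normal(size=(3, 1)), np.zeros(1))]
score = mlp_forward(rng.random(4), layers)
```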

Model architecture (1)
- Generates positive / negative pairs between a bug report and a buggy file
- Model architecture flow: feature extraction → dimensionality reduction with a DNN (autoencoder) → features processed in parallel are all fed into a DNN (MLP, the combiner) → the output score is used as a confidence score
- To compute the relevancy between a bug report and a file, text tokens are extracted
  - Identifiers, API methods and classes, comments, textual descriptions of APIs

Model architecture (2)
- The combiner DNN combines the DNN's relevancy score (computed from the autoencoder's output), the rVSM textual similarity, and the metadata
- Input: one bug report and one buggy / non-buggy file → output: a confidence score

Feature extraction (1)
- Parse the bug report into words using whitespace
  - 'a', 'the', and punctuation removed
  - Compound words decomposed at capital letters: "childAdded" → child, added
- Term frequency-inverse document frequency (tf-idf) is used to measure the significance weight of each word
- For a source file, 4 types of features are extracted: 2 code-based, 2 textual
  - Names and identifiers / names of API classes and interfaces (code-based)
  - The same mechanism applied to the comments and string literals in the source code (textual)
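The tokenization and weighting steps above can be sketched as follows (a simplified illustration; the paper's exact tf-idf variant and stop-word list are not reproduced here):

```python
import math
import re
from collections import Counter

def split_camel_case(token):
    """Decompose a compound identifier at capital letters,
    e.g. 'childAdded' -> ['child', 'added']."""
    parts = re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])', token)
    return [p.lower() for p in parts]

def tf_idf(docs):
    """Compute tf-idf weights for each term of each document,
    where `docs` is a list of token lists."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights
```

Terms appearing in every document get weight 0, so only discriminative terms carry significance.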

Feature extraction (2)
- 4 types of metadata from the bug-fixing history of a project
  - How recently a file has been fixed (used by several prior studies), measured in months
  - Bug-fixing frequency
  - Collaborative filtering score: the similarity of a bug report to previously fixed bug reports resolved by the same file (proposed in another paper)
  - Textual similarity between class names in the bug report and the source file

Feature reduction with autoencoder (1)
- For processing real-world data, reducing the dimensionality is crucial
  - Reduces the computational cost of deep learning
- Uses an autoencoder with three layers
  - Target output = input
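A minimal three-layer autoencoder of this kind can be sketched in plain NumPy (the toy data, dimensions, learning rate, and activation are assumptions for illustration; the paper's layer sizes and training details differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for high-dimensional tf-idf feature vectors
X = rng.random((100, 20))

n_in, n_hidden = X.shape[1], 5          # compress 20-d inputs to 5-d codes
W1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_in)); b2 = np.zeros(n_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(500):
    H = sigmoid(X @ W1 + b1)            # encoder: nonlinear compression
    X_hat = H @ W2 + b2                 # decoder: reconstruction
    err = X_hat - X                     # target output = input
    losses.append(float(np.mean(err ** 2)))
    # Backpropagation of the squared reconstruction error
    gW2 = H.T @ err / len(X); gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * H * (1.0 - H)
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    lr = 0.2
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

codes = sigmoid(X @ W1 + b1)            # reduced features for the relevancy DNN
```

After training, only the encoder half is needed: `codes` is the low-dimensional representation fed downstream.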

Feature reduction with autoencoder (2)
- Adopting an autoencoder differentiates this approach from LSI
  - LSI cannot scale to a large matrix
  - LSI performs a linear transformation
  - The autoencoder's nonlinear transformation appears to be more resilient to noise and varied input data types

DNN based relevancy estimation (1)

DNN-based relevancy estimation (2)
- The relevancy estimator DNN, the autoencoder DNN, and the combiner DNN all use fully connected (dense) layers
- Feature selection is conducted by the DNN
- During training, the target relevancy score is either 0 or 1
- At prediction time, a score between 0 and 1 is obtained
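As a hedged illustration of this 0/1 training scheme, the sketch below trains a single-layer stand-in for the combiner on synthetic [DNN relevancy, rVSM similarity, meta score] triples; the feature values, weights, and labeling rule are invented for the example, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Each row holds three hypothetical combiner inputs:
# [DNN relevancy score, rVSM textual similarity, meta score].
# Labels: 1 for (bug report, buggy file) pairs, 0 for non-buggy pairs.
X = rng.random((200, 3))
y = (X @ np.array([2.0, 1.5, 0.5]) + rng.normal(0.0, 0.1, 200) > 2.0).astype(float)

w = np.zeros(3)
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train against the 0/1 targets (cross-entropy gradient descent)
for _ in range(1000):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

# At prediction time the output lies between 0 and 1 and is used
# as a confidence score to rank candidate source files.
scores = sigmoid(X @ w + b)
```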

Empirical evaluation (1)
- Evaluated the proposed system's accuracy; compared with existing information retrieval and machine learning approaches; analyzed time efficiency and the impact of parameters, components, and features
- Dataset: the same as Ye et al. (cited)

Empirical evaluation (2) Comparison among the proposed DNNLOC's components

Empirical evaluation (3) Accuracy comparison by the number of nodes in the hidden layer and by training data size

Empirical evaluation (4)
- Accuracy and comparison with existing approaches
- B: the set of all bug reports
- K: the set of positions of buggy files in the ranked list
- Index_i: the index of the highest-ranked actual buggy file
- |Testset|: the number of cases
- MAP and MRR are between 0 and 1; higher is better
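These ranking metrics can be made concrete with a small sketch (assuming the standard MAP/MRR definitions over 1-based positions of the actual buggy files in each ranked list):

```python
def reciprocal_rank(positions):
    """1 / Index_i, where Index_i is the 1-based rank of the
    highest-ranked actual buggy file for one bug report."""
    return 1.0 / min(positions)

def average_precision(positions):
    """Average precision for one bug report, given the 1-based
    ranks of all its actual buggy files."""
    pos = sorted(positions)
    return sum((i + 1) / p for i, p in enumerate(pos)) / len(pos)

def map_mrr(all_positions):
    """MAP and MRR over all bug reports; both lie in (0, 1],
    and higher is better."""
    n = len(all_positions)
    mean_ap = sum(average_precision(k) for k in all_positions) / n
    mrr = sum(reciprocal_rank(k) for k in all_positions) / n
    return mean_ap, mrr
```

For example, if one report's buggy file is ranked 1st and another's is ranked 2nd, both MAP and MRR are 0.75.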

Empirical evaluation (5)

Empirical evaluation (6)
- Time efficiency analysis
- CPU: Intel Xeon E5-2650 2.00GHz (32 cores), 126GB RAM

Conclusion
- Combined rVSM and DNN
- Used another DNN to ensemble the two models' results together with the metadata
- rVSM and DNN complement each other well
- Higher accuracy than state-of-the-art approaches using rVSM or DNN alone