Search in Source Code Based on Identifying Popular Fragments Eduard Kuric and Mária Bieliková Faculty of Informatics and Information.


Search in Source Code Based on Identifying Popular Fragments
Eduard Kuric and Mária Bieliková
Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava
SOFSEM 2013

Motivation
every day…
  – programmers need to answer several questions in order to find solutions and make decisions
the structure of source code is not conducive to sequential reading
-> programmers…
  – read code selectively
  – identify the parts of code relevant to the target task
  – use the Web as a giant repository of source code

Concept / feature location
the process of identifying the parts of source code that correspond to a specific functionality
input:
  – a description of a problem (change task) expressed by concepts: terms, keywords
output:
  – a set of software components (artifacts) that implement or address the input concepts

Why popularity?
to support search-driven development, a "mere" full-text search over a base of source code is not sufficient
when a programmer reuses source code from external sources, she has to trust the work of external programmers who are unknown to her

Why popularity? /2
trustability (reputability) is a big issue when reusing source code (components)
reputation ranking
  – can be a plausible way to rank source code results
-> programmer's "karma" value

What do we do?
modeling the programmer's knowledge
  – an activity-based model of the programmer's knowledge
  – methods for acquiring it automatically

How?
monitoring the programmer's interaction with source code
  – authorship
  – authorship duration
  – code stability
  – code patterns
  – work experience
  – used technologies
  – activities (selections, edits, search)
  – notability (popularity) of created components

Notability of created components
the more important a component is, the better the programmer's knowledge about it
-> processing the source code repository
-> searching for relevant functions for a given programmer's query

Processing source code repository

Index creator
creates (document and term) indexes from all source code files of the projects in the repository
uses the Vector Space Model
terms are extracted from comments, function names and identifiers
Apache Cassandra
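As an illustration of the index creator slide, here is a minimal Python sketch of extracting terms from identifiers and comments and building a document index and a term index. It is not the authors' implementation: the function names `extract_terms` and `build_index`, the camelCase splitting rule, the smoothed weight `tf * log(1 + N / df)`, and the in-memory dictionaries (standing in for whatever storage the authors used) are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def extract_terms(text):
    """Split identifiers and comment words into lowercase terms."""
    terms = []
    for token in re.findall(r"[A-Za-z][A-Za-z0-9]*", text):
        # Break camelCase / PascalCase identifiers into their parts.
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", token)
        terms.extend(part.lower() for part in parts)
    return terms


def build_index(files):
    """Return (document index, term index) for a {file name: source} mapping.

    The document index maps each file to its weighted term vector;
    the term index maps each term to the files that contain it.
    """
    tf = {name: Counter(extract_terms(src)) for name, src in files.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    n = len(files)
    doc_index = {}
    term_index = defaultdict(set)
    for name, counts in tf.items():
        # Assumed weighting: term frequency times a smoothed inverse
        # document frequency; the slides do not state the weighting used.
        doc_index[name] = {t: c * math.log(1 + n / df[t]) for t, c in counts.items()}
        for term in counts:
            term_index[term].add(name)
    return doc_index, dict(term_index)
```

Splitting identifiers such as `computeRank` into `compute` and `rank` lets a plain keyword query match names written in code style.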


Function graph creator
creates a directed graph of functional dependencies
nodes represent function names (full signatures of functions)
a directed edge from function F to function G is created if G is invoked in F
we use an intermediate representation of source code, i.e., an abstract syntax tree (AST) and a control flow graph (CFG) obtained from the AST
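The edge rule on this slide (an edge from F to G when G is invoked in F) can be sketched with Python's standard `ast` module. This is only an illustration under simplifying assumptions: it analyses Python source rather than the languages handled by the authors' plugin, it uses bare function names instead of full signatures, it only records direct calls by name, and the helper name `build_call_graph` is hypothetical.

```python
import ast


def build_call_graph(source):
    """Map each function name to the set of function names it invokes."""
    tree = ast.parse(source)
    graph = {}
    for func in ast.walk(tree):
        if isinstance(func, ast.FunctionDef):
            # Every function becomes a node, even if it calls nothing.
            callees = graph.setdefault(func.name, set())
            for node in ast.walk(func):
                # Only direct calls by name, e.g. g(); method calls such
                # as obj.g() are ignored in this sketch.
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    callees.add(node.func.id)
    return graph
```

For a source containing `def f(): return g()` and `def g(): return 1`, the result holds one edge from `f` to `g`, and `g` appears as a node without outgoing edges.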


Ranking functional dependencies
PageRank algorithm
the rank value indicates the importance of a particular function
functions called from many other functions have a significantly higher PageRank value than those that are used infrequently
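A minimal power-iteration sketch of PageRank over a function call graph shows why frequently called functions end up with higher rank values. The damping factor 0.85, the fixed iteration count, and the uniform redistribution of rank from functions that call nothing are conventional choices assumed here, not details given in the slides.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Rank functions in a call graph.

    graph maps each function to the set of functions it calls.
    """
    nodes = set(graph) | {g for callees in graph.values() for g in callees}
    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by functions that call nothing is spread over all nodes.
        dangling = sum(rank[n] for n in nodes if not graph.get(n))
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        for caller, callees in graph.items():
            for callee in callees:
                # A caller passes an equal share of its rank to each callee.
                new_rank[callee] += damping * rank[caller] / len(callees)
        rank = new_rank
    return rank
```

With three functions that all call one shared `util` function, `util` receives rank from every caller and ends up ranked above each of them.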


Searching for relevant functions

Phase#1
the top relevant documents are retrieved based on the similarity sim(d_j, q) between documents (source code files) and the programmer's query q
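The slide does not define sim(d_j, q). A common choice in the Vector Space Model is the cosine of the angle between the document vector and the query vector, and the sketch below assumes that choice. The names `cosine` and `top_documents`, the sparse dictionary vectors, and the cut-off `k` are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_documents(doc_vectors, query_vector, k=5):
    """Return up to k (file name, similarity) pairs, best match first."""
    scored = [(cosine(vec, query_vector), name) for name, vec in doc_vectors.items()]
    # Sort by descending similarity, breaking ties by file name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:k] if score > 0]
```

Documents that share no term with the query get a similarity of zero and are left out of the result.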


Phase#2


Phase#3


Evaluation
a plugin for Microsoft Visual Studio
an experiment with 6 participants and 2 software projects (P1, P2)
how well could the participants find code fragments (functions) that matched the given tasks?
-> 2 groups (G1, G2) of 3 members
-> 2 experiments

Experiment#1
a set of 6 change tasks, 3 tasks for each project
e.g. "Find the method for calculating a co-occurrence rank of obtained annotation and change the damping factor to the value 0.3."
participants reformulated the target tasks into sequences of words describing the concepts they needed to find

Experiment#1
T1: tasks 1-3 were specified for P1
T2: tasks 4-6 were specified for P2
the goal of G1 was to perform T1 using our search engine
the goal of G2 was to perform T1 using the built-in search engine
G1 – T2 – built-in search engine
G2 – T2 – our search engine

Experiment#1
we compared the number of queries the participants created using our search engine and using the built-in search engine
-> with our search engine, the participants needed less effort to locate the target code fragments (methods)

Experiment#2
we specified 2 implementation tasks, T1 and T2
participants of G1 were tasked to implement T1 using the built-in search engine
participants of G2 were tasked to implement T1 using our search engine
G1 – T2 – our search engine
G2 – T2 – built-in search engine

Experiment#2
for each query, each participant evaluated the relevance of the results
  – for each obtained result, each participant assigned a level of confidence: completely irrelevant, mostly irrelevant, mostly relevant or highly relevant
fragments were evaluated as relevant only if they were ranked with the confidence levels mostly relevant or highly relevant

Experiment#2
we used the precision metric: the fraction of the top 5 ranked code fragments that are relevant to the query
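The precision metric described above can be written as a short Python function. The sketch assumes the four confidence levels are recorded as plain strings per fragment and that the denominator stays at 5 even when fewer results are returned; the names `RELEVANT_LEVELS` and `precision_at_k` are illustrative, not taken from the authors' tooling.

```python
# Confidence levels that count as relevant in the experiment.
RELEVANT_LEVELS = {"mostly relevant", "highly relevant"}


def precision_at_k(ranked_fragments, judgments, k=5):
    """Fraction of the top k ranked fragments judged relevant.

    ranked_fragments: fragment ids in the order returned by the search engine.
    judgments: {fragment id: confidence level assigned by the participant}.
    """
    top = ranked_fragments[:k]
    hits = sum(1 for fragment in top if judgments.get(fragment) in RELEVANT_LEVELS)
    # Assumption: divide by k, so missing results count against precision.
    return hits / k
```

For example, if two of the first five returned fragments were judged mostly or highly relevant, the precision for that query is 2/5.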

Experiment#2
the search results obtained using our search engine were more relevant to the tasks being performed than those obtained using the built-in search engine

Why do we do that?
PerConIK
  – focused on supporting the development of enterprise applications by viewing a software system as a web of information artifacts
several agents collect and process documentation, source code, developer blogs, developer activities, …

Why do we do that?
we create a dataset based on monitoring the complete behaviour of the programmer
  – searching for relevant information on the Web
  – searching in source code
  – writing and correcting source code in the development environment
  – giving explicit feedback for the purpose of evaluation