A Framework for Examining Topical Locality in Object-Oriented Software 2012 IEEE International Conference on Computer Software and Applications p76004546.


A Framework for Examining Topical Locality in Object-Oriented Software 2012 IEEE International Conference on Computer Software and Applications p 江怡岑 P 王于庭

OUTLINE  Introduction  Background & Related work  Framework  Dataset and Experimental Procedure  Static analysis results  Conclusions

INTRODUCTION  Program comprehension is a key developer activity during software maintenance.  Topic models rely on lexical information to identify topics that are semantically related to high-level domain concepts.  LSI (latent semantic indexing)  LDA (latent Dirichlet allocation)

INTRODUCTION  While topics reflect semantic relatedness, it is believed that humans draw on evolved spatial cognition strategies to navigate the code base.  For object-oriented (OO) systems built on the principle of encapsulation, the entities should be spatially organized in a way that reflects the topics of the software.

INTRODUCTION  The tenet of “topical locality”  Spatial relatedness entails semantic relatedness  So basic that in many cases it is not mentioned  When the tenet is mentioned, its validity is not measured explicitly.  Our goal is to measure the extent to which this key tenet holds for OO systems.  We propose a framework to examine to what extent three relationships of topical locality hold in large-scale open-source projects.

BACKGROUND and Related Work  A. Way-finding in Code Base  B. Relating Spatial and Semantic Cues  C. Topical Locality Applied in Software Engineering Tools

BACKGROUND and Related Work  A. Way-finding in Code Base  A developer comprehending a code base can therefore be thought of as continually trying to answer way-finding questions.  Moonen has examined way-finding in software and extended the concept of legibility to software.

BACKGROUND and Related Work  B. Relating Spatial and Semantic Cues  We are interested in the interplay of different cues so that they can be effectively synthesized.  We focus on the relationship between two types of cues.  Spatial.  Semantic.  Spatial + Semantic = “topical locality”  the software entities should be neither randomly named nor randomly placed.  Source code entities should be spatially organized to reflect the semantics of software.

BACKGROUND and Related Work  C. Topical Locality Applied in Software Engineering Tools  The idea of topical locality plays an important role in building a number of software engineering tools.  Survey three tools  Code Indexers  Code Visualizers  Code Summarizers

BACKGROUND and Related Work  Code Indexers  An indexer takes source code and generates profiles of the code for later searching.  Should header comments be indexed?  We want to address how well the name and header comments represent the target code entity’s topic.

BACKGROUND and Related Work  Code Visualizers  Once a relevant code line is located, its surroundings provide valuable contextual information for the developer.  Examining the topical locality of a contiguous fragment allows us to assess to what extent the code line indicates the topic of its surroundings.

BACKGROUND and Related Work  Code Summarizers  A summarizer generates a snapshot of the source code in order to reduce the cost for developers of reading and understanding the staggering amount of software repository information.  Our contribution is to measure the degree of topical locality of the snapshot.


FRAMEWORK overview  Framework Overview

FRAMEWORK research questions  Research questions  RQ1: Which better conveys the class body’s topic: the class name, the header comments, or a combination of both?  RQ2: Can a code line indicate the topic of its surroundings?  RQ3: Can a contiguous code fragment serve as a snapshot of the entire class?

FRAMEWORK method  independent variables are concerned with identifying spatial relationships  dependent variable is about the semantic relatedness  Three measures:  TFIDF cosine similarity  query term probability  document overlap  We treat source code as document  output score in the range [0, 1]

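FRAMEWORK three measures (1/3)  TFIDF scheme – text mining model  $w_i = \mathit{tf}_i(d) \times \mathit{idf}_i$  $\mathit{tf}_i(d)$ refers to the term frequency of term $t_i$ in document $d$  $\mathit{idf}_i$ is the inverse document frequency, $\mathit{idf}_i = \log_2\left(\frac{N + 1}{\mathit{df}_i}\right)$, where $N$ is the total number of documents in the corpus and $\mathit{df}_i$ is the number of documents in which $t_i$ occurs.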
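A minimal sketch of how a TFIDF cosine similarity score could be computed between two documents, treating each document as a list of terms. The function names are hypothetical, and the idf form used here, log2((N + 1) / df), is an assumption; the slides do not fix the exact smoothing. Because all weights are non-negative, the score stays in [0, 1].

```python
import math
from collections import Counter


def idf(term, corpus):
    """Inverse document frequency of `term` over a corpus of term lists.

    Assumption: idf = log2((N + 1) / df); other smoothing variants exist.
    """
    df = sum(1 for doc in corpus if term in doc)
    return math.log2((len(corpus) + 1) / df) if df else 0.0


def tfidf_vector(doc, corpus):
    """Map each term of `doc` to its weight tf * idf."""
    return {term: tf * idf(term, corpus) for term, tf in Counter(doc).items()}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse weight vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is treated as unrelated to everything
    return dot / (norm_u * norm_v)
```

Two documents with identical term distributions score 1, and documents with no shared terms score 0.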

FRAMEWORK three measures (2/3)  Query term probability  measures the likelihood of a term in the query/source being present in the target document.
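One simple reading of this measure, sketched under assumptions: estimate the probability as the fraction of distinct query terms that also occur in the target document. The slides do not give the exact formula, so the function below is only an illustration with a hypothetical name.

```python
def query_term_probability(query_terms, target_terms):
    """Fraction of distinct query terms found in the target document.

    Assumption: every query term is equally likely to be picked; the
    framework's actual estimator may weight terms differently.
    """
    query = set(query_terms)
    if not query:
        return 0.0
    target = set(target_terms)
    return len(query & target) / len(query)
```

Unlike cosine similarity, this score is asymmetric: swapping query and target generally changes the result.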

FRAMEWORK three measures (3/3)  Document overlap  a set-based measure that quantifies the amount of overlap between two documents Q and W
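As an illustration, a set-based overlap between two documents Q and W can be computed with the Jaccard coefficient. Choosing Jaccard is an assumption; the slides only state that the measure is set-based, and other normalizations (for example dividing by the smaller set) are equally plausible.

```python
def document_overlap(q_terms, w_terms):
    """Jaccard overlap |Q ∩ W| / |Q ∪ W| between two term collections.

    Assumption: Jaccard normalization; returns 0.0 when both are empty.
    """
    q, w = set(q_terms), set(w_terms)
    union = q | w
    if not union:
        return 0.0
    return len(q & w) / len(union)
```

The result lies in [0, 1], matching the output range required of all three measures.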

Dataset and Experimental Procedure  LOC : the lines of code  COM : the lines of comments  CCs : the number of classes

Dataset and Experimental Procedure  Use a source code indexer to process the code base of the selected projects.  The indexing process produces profiles that store selected, important information from the source code.  We calculate the three semantic relatedness measures (TFIDF-Cos, Prob and Overlap) based on the profiles.
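The indexer used in the study is not described in detail on the slides. A rough sketch of one step such a tool typically performs, turning identifiers and comment text into lowercase terms for a profile, might look as follows. The function name and the splitting rules (camelCase, underscores, digits) are assumptions.

```python
import re

# One term is: an acronym run (XML), a capitalized or lowercase word, or digits.
_TERM = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def to_terms(text):
    """Split identifiers or comment text into lowercase terms."""
    terms = []
    # First split on underscores and any non-word characters.
    for chunk in re.split(r"[_\W]+", text):
        terms.extend(match.lower() for match in _TERM.findall(chunk))
    return terms
```

For example, `to_terms("readFile")` yields `["read", "file"]`, so a class name and its header comment can be compared with the class body on the same vocabulary.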

RQ1  Can the class name (N) and/or header comment (H) convey the topic of the class body (B)?  Calculate the lexical similarity for (N,B), (H,B), (NH,B)

RQ2  Can a code line indicate the topic of its surroundings?  For a randomly selected code line (L), we take a contiguous code fragment of 30 lines as its surroundings (S) and select from the same file another 30-line contiguous code fragment (R).  Compare the lexical similarity of (L,S) with that of (L,R).  Only classes with at least 70 LOC are considered.
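The sampling step for RQ2 can be sketched as below. Several details are assumptions not stated on the slides: S is taken as a 30-line window roughly centered on L, R is any other 30-line window that does not contain L, and the input is a list of lines already filtered to non-empty ones. All names are hypothetical.

```python
import random

FRAGMENT_LINES = 30  # fragment size used in the study
MIN_CLASS_LOC = 70   # classes shorter than this are skipped


def sample_line_and_fragments(lines, rng):
    """Pick a line index L and two (start, end) windows: S around L, R elsewhere."""
    n = len(lines)
    if n < MIN_CLASS_LOC:
        raise ValueError("class too short for the RQ2 setup")
    line_idx = rng.randrange(n)
    # S: window centered on L, clamped to the file boundaries (assumption).
    s_start = min(max(line_idx - FRAGMENT_LINES // 2, 0), n - FRAGMENT_LINES)
    # R: any window of the same size that does not contain L (assumption).
    candidates = [
        start
        for start in range(n - FRAGMENT_LINES + 1)
        if not (start <= line_idx < start + FRAGMENT_LINES)
    ]
    r_start = rng.choice(candidates)
    return (
        line_idx,
        (s_start, s_start + FRAGMENT_LINES),
        (r_start, r_start + FRAGMENT_LINES),
    )
```

With at least 70 lines there are 41 possible window starts and at most 30 of them contain L, so a valid R always exists. The similarity of (L,S) and (L,R) would then be computed with each of the three measures.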

RQ3  Can a contiguous code fragment serve as a snapshot of the entire class?  From a code search perspective, the lexical similarity of the snapshots should indicate the topical closeness of the classes.  Randomly select a term w (‘data’ in Fig. 4) to act as the query keyword. The snapshot is extracted as a 30-line contiguous code fragment.  Only classes with at least 60 LOC are considered.

Static Analysis Results  RQ1: Name vs. Header  RQ2: Code Line and Surroundings  RQ3: Contiguous Fragment as a Snapshot  Threats to Validity

RQ1: Name vs. Header  NH is the closest to B in most cases, except for MegaMek when measured by TFIDF, where N-B is larger than H-B and NH-B. => MegaMek classes do not have useful header comments.

RQ1: Name vs. Header  Least Significant Difference (LSD) multiple comparison test: a test that places combinations that differ significantly from one another in separate groups and allocates the best combination to ‘group A’.  The result classifies NH-B into ‘group A’, indicating that the similarity score of NH-B is significantly higher than those of N-B and H-B.  We conclude that if the class contains useful header comments, then it is important to combine the header comments with the class name in order to convey the topic of the class body.

RQ2: Code Line and Surroundings  A code line indicates the topic of its surroundings more than it indicates the topic of a random code fragment.

RQ3: Contiguous Fragment as a Snapshot  We calculate the Pearson correlation coefficient, which is a parametric statistic that shows the correlation between two variables.  From the viewpoint of distinguishing the topics of different classes, a contiguous code fragment can serve as a snapshot of the entire class.
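For reference, the Pearson correlation coefficient between two equally long lists of scores can be computed as in the sketch below. How the study paired its scores is not spelled out on the slides; pairing snapshot-based similarity scores with whole-class similarity scores is an assumption, and the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient r of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value of r close to 1 would mean that classes whose snapshots are lexically similar also tend to be similar as whole classes.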


Threats to Validity  Construct validity: the selection of a 30-line contiguous, non-empty, comments-inclusive code fragment for addressing RQ2 and RQ3.  Empty lines contribute little spatial or semantic information. Including all comments is a choice influenced by RQ1.  Internal validity: using three measures derived from different mathematical models diminishes measurement bias.  External validity: this analysis may not generalize to other software projects.

Conclusions  In this paper, we contributed a novel experimental framework for testing the tenet of “topical locality” and applied the framework to provide empirical evidence of topical locality in large-scale OO systems.  Our future work includes carrying out more empirical studies to examine other instances of topical locality.  It is important to integrate theoretical understanding and empirical findings to enhance practical tool support for software developers.