API Hyperlinking via Structural Overlap Fan Long, Tsinghua University Xi Wang, MIT CSAIL Yang Cai, MIT CSAIL.

Slides:



Advertisements
Similar presentations
Date: 2014/05/06 Author: Michael Schuhmacher, Simon Paolo Ponzetto Source: WSDM’14 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Knowledge-based Graph Document.
Advertisements

Greedy Algorithms Greed is good. (Some of the time)
1 EE5900 Advanced Embedded System For Smart Infrastructure Static Scheduling.
22C:19 Discrete Structures Trees Spring 2014 Sukumar Ghosh.
Key-word Driven Automation Framework Shiva Kumar Soumya Dalvi May 25, 2007.
Label Placement and graph drawing Imo Lieberwerth.
22C:19 Discrete Math Trees Fall 2011 Sukumar Ghosh.
Online Social Networks and Media. Graph partitioning The general problem – Input: a graph G=(V,E) edge (u,v) denotes similarity between u and v weighted.
Benjamin J. Deaver Advisor – Dr. LiGuo Huang Department of Computer Science and Engineering Southern Methodist University.
1 Modularity and Community Structure in Networks* Final project *Based on a paper by M.E.J Newman in PNAS 2006.
Author: Jie chen and Yousef Saad IEEE transactions of knowledge and data engineering.
Communities in Heterogeneous Networks Chapter 4 1 Chapter 4, Community Detection and Mining in Social Media. Lei Tang and Huan Liu, Morgan & Claypool,
Xyleme A Dynamic Warehouse for XML Data of the Web.
Recent Development on Elimination Ordering Group 1.
DISC-Finder: A distributed algorithm for identifying galaxy clusters.
Towards Semantic Web Mining Bettina Berndt Andreas Hotho Gerd Stumme.
U NIVERSITY OF M ASSACHUSETTS, A MHERST Department of Computer Science Emery Berger University of Massachusetts, Amherst Advanced Compilers CMPSCI 710.
Michael Ernst, page 1 Improving Test Suites via Operational Abstraction Michael Ernst MIT Lab for Computer Science Joint.
Register Allocation (via graph coloring). Lecture Outline Memory Hierarchy Management Register Allocation –Register interference graph –Graph coloring.
1 Joint work with Shmuel Safra. 2 Motivation 3 Motivation.
Department of Computer Science, University of California, Irvine Site Visit for UC Irvine KD-D Project, April 21 st 2004 The Java Universal Network/Graph.
Graph-based consensus clustering for class discovery from gene expression data Zhiwen Yum, Hau-San Wong and Hongqiang Wang Bioinformatics, 2007.
OOSE 01/17 Institute of Computer Science and Information Engineering, National Cheng Kung University Member:Q 薛弘志 P 蔡文豪 F 周詩御.
PROOF: the Parallel ROOT Facility Scheduling and Load-balancing ACAT 2007 Jan Iwaszkiewicz ¹ ² Gerardo Ganis ¹ Fons Rademakers ¹ ¹ CERN PH/SFT ² University.
Authors: Bhavana Bharat Dalvi, Meghana Kshirsagar, S. Sudarshan Presented By: Aruna Keyword Search on External Memory Data Graphs.
Mark Marron IMDEA-Software (Madrid, Spain) 1.
Surface Simplification Using Quadric Error Metrics Michael Garland Paul S. Heckbert.
Microsoft Research Asia Ming Wu, Haoxiang Lin, Xuezheng Liu, Zhenyu Guo, Huayang Guo, Lidong Zhou, Zheng Zhang MIT Fan Long, Xi Wang, Zhilei Xu.
Dependency Tracking in software systems Presented by: Ashgan Fararooy.
Network Aware Resource Allocation in Distributed Clouds.
Graph Data Management Lab, School of Computer Science gdm.fudan.edu.cn XMLSnippet: A Coding Assistant for XML Configuration Snippet.
GLOSSARY COMPILATION Alex Kotov (akotov2) Hanna Zhong (hzhong) Hoa Nguyen (hnguyen4) Zhenyu Yang (zyang2)
1 PARSEWeb: A Programmer Assistant for Reusing Open Source Code on the Web Suresh Thummalapenta and Tao Xie Department of Computer Science North Carolina.
Usability Issues Documentation J. Apostolakis for Geant4 16 January 2009.
Protecting Sensitive Labels in Social Network Data Anonymization.
CMCD: Count Matrix based Code Clone Detection Yang Yuan and Yao Guo Key Laboratory of High-Confidence Software Technologies (Ministry of Education) Peking.
Automatically Repairing Broken Workflows for Evolving GUI Applications Sai Zhang University of Washington Joint work with: Hao Lü, Michael D. Ernst.
ANALYSIS AND IMPLEMENTATION OF GRAPH COLORING ALGORITHMS FOR REGISTER ALLOCATION By, Sumeeth K. C Vasanth K.
Mark Marron 1, Deepak Kapur 2, Manuel Hermenegildo 1 1 Imdea-Software (Spain) 2 University of New Mexico 1.
Intel Confidential – Internal Only Co-clustering of biological networks and gene expression data Hanisch et al. This paper appears in: bioinformatics 2002.
Computer Science Automated Software Engineering Research ( Mining Exception-Handling Rules as Conditional Association.
Mark Marron IMDEA-Software (Madrid, Spain) 1.
Union-find Algorithm Presented by Michael Cassarino.
The Software Development Process
Exploiting Code Search Engines to Improve Programmer Productivity and Quality Suresh Thummalapenta Advisor: Dr. Tao Xie Department of Computer Science.
© SERG Reverse Engineering (Interconnection Styles) Interconnection Styles.
Speeding Up Enumeration Algorithms with Amortized Analysis Takeaki Uno (National Institute of Informatics, JAPAN)
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Tagging Systems and Their Effect on Resource Popularity Austin Wester.
Map-Reduce examples 1. So, what is it? A two phase process geared toward optimizing broad, widely distributed parallel computing platforms Apache Hadoop.
Review of Parnas’ Criteria for Decomposing Systems into Modules Zheng Wang, Yuan Zhang Michigan State University 04/19/2002.
1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.
Recommending Adaptive Changes for Framework Evolution Barthélémy Dagenais and Martin P. Robillard ICSE08 Dec 4 th, 2008 Presented by EJ Park.
Speaker : Yu-Hui Chen Authors : Dinuka A. Soysa, Denis Guangyin Chen, Oscar C. Au, and Amine Bermak From : 2013 IEEE Symposium on Computational Intelligence.
 In the previews parts we have seen some kind of segmentation method.  In this lecture we will see graph cut, which is a another segmentation method.
Refined Online Citation Matching and Adaptive Canonical Metadata Construction CSE 598B Course Project Report Huajing Li.
Graph Indexing From managing and mining graph data.
WHO WILL BENEFIT FROM THIS TALK TOPICS WHAT YOU’LL LEAVE WITH Developers looking to build applications that analyze big data. Developers building applications.
1 API Recommendation Wujie Zheng
Implementation of Classifier Tool in Twister Magesh khanna Vadivelu Shivaraman Janakiraman.
GUILLOU Frederic. Outline Introduction Motivations The basic recommendation system First phase : semantic similarities Second phase : communities Application.
Prof. I. J. Chung Data Structure #1 Professor I. J. Chung.
Threads vs. Events SEDA – An Event Model 5204 – Operating Systems.
基于多核加速计算平台的深度神经网络 分割与重训练技术
Cross-library API Recommendation Using Web Search Engines
Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights Feng Zhang †⋄, Jidong Zhai ⋄, Xipeng Shen #, Onur Mutlu ⋆, Wenguang.
Data Mining Chapter 6 Search Engines
The JSF Tools Project – WTP (internal) release review
Precise Condition Synthesis for Program Repair
MAPO: Mining and Recommending API Usage Patterns
Presentation transcript:

API Hyperlinking via Structural Overlap Fan Long, Tsinghua University Xi Wang, MIT CSAIL Yang Cai, MIT CSAIL

Example: MSDN …… Help information for EnterCriticalSection API See Also sections that lists related functions

Motivation Cross-references are useful to organize API knowledge – Hyperlinks to related functions – “See Also” in MSDN It is difficult to manually maintain cross-references – Huge libraries: more than 1400 functions in Apache – Tedious and error-prone Goal – Auto-generate cross-references for documentation

Cross-references Different users may need different kinds of cross-references in the document of a library – end-users, testers, developers, … For end-users of the library, it needs to contain the functions that perform the same or a relevant task In this paper, we focus on the documentation for end-users

Existing solutions Documentation tools and tags with doxygen, javadoc… – only 15 out of 1461 APIs in httpd are annotated – Developers cannot track all related functions, when the library is evolving Usage pattern mining – Based on the call graph – Find functions f and g that is often called together – Sensitive to specific client code – May have missing or unreliable results

Altair Output

See (original): extracted from comment by doxygen See also: auto-generated by Altair Five related functions for compression and decompression Results are organized in two modules

Basic idea Hyperlink – Functions are related, if they access same data: The more data they share, the more likely that they are related. Module – Tightly related functions  module. – Tense connection inside a module – Loose connection between two modules Altair analyzes library implementation.

Altair Stages Program analysis – Extract data access relations from the library code and summarize them in a data access graph Ranking – Compute overlap rank to measure the relevance between two functions Clustering – Group the functions that are tightly related into modules RankingClustering Program analysis

Data access graph f() { return new A; } g(A *a) { g 0 (a); z = 42; } h() { z++; } static g 0 (A *a) { a->x++; a->y--; } f g h A.xA.yz Data nodes are fields and global variables g calls g 0, and g 0 ’s access effect is merged to g f allocates objects of type A, and effects all of its fields

Overlap rank N(f) denote the set of data that f may access Given a function f, we define its overlap with function g as: π(g|f) is the proportion of f’s data that is also accessed by g.

Overlap rank π(h|f)=0, π(g|f)=1, π(f|g)=2/3 High π(g|f) value  g is related to f Overlap rank is asymmetric; cross-references are also not bi-directional f g h A.xA.yz

Clustering Overlap coefficient (symmetric measure): Function set F is partitioned into two modules, S and its complement. We define the conductance as: min( ) Inter-connection between two modules The sum of vertex degrees in the module

Clustering To find min( ) is NP-hard Altair uses spectral clustering algorithm to get approximate result – Directly cluster functions into k modules, if k is known – Recursively bi-partition the function set until they have desired granularity, if k is unknown

Related work API recommendation – Suade(FSE’05), FRAN, and FRIAR(FSE’07) Importance: Suade, FRAN Association: FRIAR – Change history mining(ROSE, ICSE’04) – Extract code examples: Strathcona(ICSE’05), XSnipppet(OOPSLA’06) Module clustering – Arie, Tobias, Identifying objects using Cluster and Concept Analysis(ICSE’99) – Michael, Thomas, Identifying Modules via Concept Analysis(ICSM’97)

Ranking comparison Altair returns – APIs that perform related tasks – Functions that in the same module SuadeFRANFRIARAltair apr_file_eof( apr_file_t *file) do_emit_plainapr_file_read ap_rputs do_emit_plain N/Aapr_file_seek apr_file_read apr_file_dup apr_file_dup2 (… 5 more) apr_hash_get( apr_hash_t *ht, const void *key, apr_ssize_t klen) find_entry find_entry_def dav_xmlns… dav_get… (… 25 more) apr_palloc apr_hash_set memcpy strlen apr_pstrdup (… 95 more) apr_hash_set apr_palloc apr_hash_make strlen apr_pstrdup (… 18 more) apr_hash_copy apr_hash_merge apr_hash_set apr_hash_make apr_hash_this (… 3 more)

Case study of module clustering ModuleFunctions UtilityBZ2_bzBuffToBuffCompress BZ2_bzBuffToBuffDecompress CompressBZ2_bzCompressInit BZ2_bzCompress BZ2_bzCompressEnd DecompressBZ2_bzDecompressInit BZ2_bzDecompress BZ2_bzDecompressEnd File operationsBZ2_bzReadOpen BZ2_bzRead BZ2_bzReadClose (… 8 in total) 16 API functions in bzip2 1. File I/O and compression APIs 2. Decompress APIs from others. 3. Compress APIs and two utility functions

Analysis cost Applied to several popular libraries Analysis finished in seconds for fairly large libraries(>500K LOC) Library packageKLOC(llvm bitcode)Analysis time (sec)Memory used (MB) bzip <14.6 sqlite httpd subversion openssl-0.9.8i

Limitations & Extensions Limitations – Source code of the library is required – Low-level system calls, whose code is missing – Semantic relevance (SHA-1 and MD5 functions) Extensions – Combination with client code mining – Heuristics like naming convention

Conclusion Altair can auto-generate cross-references and cluster API into meaningful modules Altair exploits data overlaps between functions Data access graph Overlap rank Such structural information is reliable for API recommendation and module clustering

Download Altair Altair is open source and available at: – Including source code along with demos Feel free to try it!

Thanks! Questions?

Challenges Open program – Parameters of two functions may point to same data. – Use fields to distinguish different data Calls – Function may call other API in its implementation. – Merge their effect, if the callee is static. Allocations – Functions like malloc and free create or destroy an object – These functions affect all fields of the object.

Example: Data access graph f g h xyzw g0g0 e A f(A *a) { a->x = 0xdead; a->y = 0xbeaf; } e() { return new A; } g(A *a, B *b) { g 0 (a); b->z = 42; } h() { w++; } static g 0 (A *a) { a->x++; a->y--; }

Graph construction Function f access data d – An edge from f to d Data d is a field of type t – An edge from t to d Function f calls a static function g – An edge from f to g Function f creates or destroys objects of type t – An edge from f to t

Bipartite graph Computes the transitive closure of the graph Removes type and static function nodes and leaves only edges from public function nodes to data nodes f g h xyzw g0g0 e A f g h e A.xA.yzw

Conductance Overlap coefficient, symmetric measure: Function set F is partitioned into two modules, S and its complement The total overlap of all vertices in S defined as: The overlap between vertices sets S and defined as:

Conductance The intra-connection inside a module should be tense. The inter-connection between modules should be loose. Conductance for a partition is: We need to minimize it

Modularity Define modularity of function set F as minimized conductance: NP-hard Altair uses spectral clustering algorithm Recursively bi-partition functions until they have desired granularity.

Goal Precision – No unrelated functions – Few missing functions Not sensitive to client code – Our tool does not need client code at all Clustering the functions into modules – Better organization of the knowledge – Can further help program analysis tools such as specification mining

Overview Motivation Stages Design – Data access graph – Overlap rank – Clustering Experiment Discussion Conclusion

Framework Source code llvm frontend llvm bitcode Data access graph Program analysis Overlap rank Modules Output Ranking Spectral Clustering

Framework Source code llvm frontend llvm bitcode Data access graph Program analysis Overlap rank Modules Output Ranking Spectral Clustering

Framework Source code llvm frontend llvm bitcode Data access graph Program analysis Overlap rank Modules Output Ranking Spectral Clustering

Graph construction Function f access a field x of type t – An edge from node f to node t.x Function f access global variable v – An edge from node f to node v Function f creates or destroys an object of type t – Edges from node f to data nodes of all fields of t Function f calls function g, and g is static – For every data node d, if g accesses d, we add an edge from node f to node d