Alattin: Mining Alternative Patterns for Detecting Neglected Conditions Suresh Thummalapenta and Tao Xie Department of Computer Science North Carolina.

Presentation transcript:

Alattin: Mining Alternative Patterns for Detecting Neglected Conditions
Suresh Thummalapenta and Tao Xie
Department of Computer Science, North Carolina State University, Raleigh, USA
ASE 2009
This work is supported in part by NSF grant CCF and ARO grant W911NF and ARO grant W911NF managed by NCSU Secure Open Source Systems Initiative (SOSI)

2 Alattin: Motivation
- Problem: Programming rules are often not well documented
- General solution:
  - Mine common patterns across a large number of data points (e.g., code samples)
  - Use common patterns as programming rules to detect defects

3 Challenges addressed by Alattin
- Limited data points
  - Existing approaches mine specifications from a few code bases
  - They miss specifications due to lack of sufficient data points
- Existing approaches produce a large number of false positives

4 Limited Data Points
[Figure: Existing approaches mine patterns directly from a few code repositories (e.g., Eclipse, Linux, …), which often lack sufficient relevant data points (e.g., API call sites). The Alattin approach first searches code repositories 1, 2, …, N through a code search engine (e.g., open source code on the web) and then mines patterns.]

5 Large Number of False Positives
- Existing approaches produce a large number of false positives
- One major observation:
  - Programmers often write code in different ways for achieving the same task
  - Some ways are more frequent than others
[Figure: frequent ways -> mine patterns -> mined patterns -> detect violations -> infrequent ways]

6 Example: java.util.Iterator.next()

Code Sample 1:

    PrintEntries1(ArrayList entries) {
        …
        Iterator it = entries.iterator();
        if (it.hasNext()) {
            String last = (String) it.next();
        }
        …
    }

Code Sample 2:

    PrintEntries2(ArrayList entries) {
        …
        if (entries.size() > 0) {
            Iterator it = entries.iterator();
            String last = (String) it.next();
        }
        …
    }

java.util.Iterator.next() throws NoSuchElementException when invoked on an iterator over a list without any elements
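The two guard styles on this slide can be tried out with a small runnable sketch. The class and method names below (`IteratorGuardDemo`, `firstWithHasNext`, `firstWithSize`, `unguardedThrows`) are made up for illustration and do not come from the slides; only the `java.util` calls are real API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class IteratorGuardDemo {

    // Guard style of Code Sample 1: check hasNext() before calling next().
    static String firstWithHasNext(List<String> entries) {
        Iterator<String> it = entries.iterator();
        if (it.hasNext()) {
            return it.next();
        }
        return null; // nothing to return for an empty list
    }

    // Guard style of Code Sample 2: check size() > 0 before calling next().
    static String firstWithSize(List<String> entries) {
        if (entries.size() > 0) {
            Iterator<String> it = entries.iterator();
            return it.next();
        }
        return null; // nothing to return for an empty list
    }

    // No guard at all (the neglected condition): reports whether next() threw.
    static boolean unguardedThrows(List<String> entries) {
        try {
            entries.iterator().next();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> empty = new ArrayList<>();
        List<String> one = new ArrayList<>();
        one.add("a");

        System.out.println("hasNext guard, empty list: " + firstWithHasNext(empty));
        System.out.println("size guard, empty list:    " + firstWithSize(empty));
        System.out.println("size guard, one element:   " + firstWithSize(one));
        System.out.println("unguarded call throws:     " + unguardedThrows(empty));
    }
}
```

Both guarded variants are safe on an empty list, which is why a rule that accepts only the `hasNext()` style would flag the `size()` style as a false positive.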

7 Example: java.util.Iterator.next()
(Code Sample 1 and Code Sample 2 are the same as on slide 6.)
- 1243 code examples: Sample 1 (1218 / 1243), Sample 2 (6 / 1243)
- Mined pattern from existing approaches: "boolean check on return of Iterator.hasNext before Iterator.next"

8 Example: java.util.Iterator.next()
(Code Sample 1 and Code Sample 2 are the same as on slide 6.)
- Require more general patterns (alternative patterns): P1 or P2
  - P1: boolean check on return of Iterator.hasNext before Iterator.next
  - P2: boolean check on return of ArrayList.size before Iterator.next
- Existing approaches cannot mine such patterns, since alternative P2 is infrequent

9 Our Solution: ImMiner Algorithm
- Mines alternative patterns of the form P1 or P2
- Based on the observation that infrequent alternatives such as P2 are frequent among code examples that do not support P1
- 1243 code examples: Sample 1 (1218 / 1243), Sample 2 (6 / 1243)
  - P2 is frequent among code examples not supporting P1
  - P2 is infrequent among the entire 1243 code examples
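The observation on this slide can be checked with a little arithmetic. The sketch below rebuilds the counts from the slide (1218 examples with the `hasNext` check, 6 with the `size` check) and assumes, purely for illustration, that the remaining 19 examples contain neither check and that no example contains both. The item names and the class `ConditionalSupportDemo` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class ConditionalSupportDemo {

    // Fraction of examples that contain the given item (0.0 for an empty list).
    static double support(List<Set<String>> examples, String item) {
        if (examples.isEmpty()) {
            return 0.0;
        }
        long hits = examples.stream().filter(e -> e.contains(item)).count();
        return (double) hits / examples.size();
    }

    // Examples that do NOT contain the given item.
    static List<Set<String>> notSupporting(List<Set<String>> examples, String item) {
        List<Set<String>> out = new ArrayList<>();
        for (Set<String> e : examples) {
            if (!e.contains(item)) {
                out.add(e);
            }
        }
        return out;
    }

    // Counts 1218 and 6 come from the slide; the 19 empty examples are an assumption.
    static List<Set<String>> buildExamples() {
        List<Set<String>> examples = new ArrayList<>();
        for (int i = 0; i < 1218; i++) {
            examples.add(Collections.singleton("hasNext"));
        }
        for (int i = 0; i < 6; i++) {
            examples.add(Collections.singleton("size"));
        }
        for (int i = 0; i < 19; i++) {
            examples.add(Collections.<String>emptySet());
        }
        return examples;
    }

    public static void main(String[] args) {
        List<Set<String>> all = buildExamples();
        List<Set<String>> withoutP1 = notSupporting(all, "hasNext");

        System.out.println("support of P2 overall:          " + support(all, "size"));
        System.out.println("examples not supporting P1:     " + withoutP1.size());
        System.out.println("support of P2 among those only: " + support(withoutP1, "size"));
    }
}
```

Under these assumptions the support of P2 rises from 6/1243 over all examples to 6/25 among the examples that do not support P1. The slide does not state the threshold used to call P2 frequent in that subset.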

10 Alternative Patterns
- ImMiner mines three kinds of alternative patterns of the general form "P1 or P2"
  - Balanced: all alternatives (both P1 and P2) are frequent
  - Imbalanced: some alternatives (P1) are frequent and others are infrequent (P2). Represented as "P1 or P̂2"
  - Single: only one alternative

11 ImMiner Algorithm
- Uses frequent-itemset mining [Burdick et al. ICDE 01] iteratively
- An input database with the following APIs for Iterator.next()
[Tables: input database; mapping of IDs to APIs]

12 ImMiner Algorithm: Frequent Alternatives
- Input database -> frequent itemset mining (min_sup 0.5) -> frequent item: 1
- P1: boolean-check on the return of Iterator.hasNext() before Iterator.next()
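A minimal sketch of this step follows. The slide's input database is not preserved in the transcript, so the rows below are invented, and the ID mapping (1 = `hasNext` check, 2 = `size` check, 3 = some other check) is an assumption. The sketch counts single items only, which is a simplification and not the frequent-itemset algorithm cited on slide 11.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class FrequentItemsDemo {

    // Builds one database row (the item IDs found around one call site).
    static Set<Integer> row(Integer... ids) {
        return new HashSet<>(Arrays.asList(ids));
    }

    // Hypothetical input database; IDs: 1 = hasNext check, 2 = size check, 3 = other check.
    static List<Set<Integer>> toyDatabase() {
        List<Set<Integer>> db = new ArrayList<>();
        db.add(row(1, 3));
        db.add(row(1));
        db.add(row(1));
        db.add(row(1));
        db.add(row(2));
        db.add(row(2, 3));
        return db;
    }

    // Returns, in ascending ID order, the items whose support is at least minSup.
    static List<Integer> frequentItems(List<Set<Integer>> db, double minSup) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Set<Integer> example : db) {
            for (Integer item : example) {
                counts.merge(item, 1, Integer::sum);
            }
        }
        List<Integer> frequent = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if ((double) e.getValue() / db.size() >= minSup) {
                frequent.add(e.getKey());
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        // With min_sup 0.5, only item 1 (4 of 6 rows) passes in this toy database.
        System.out.println(frequentItems(toyDatabase(), 0.5));
    }
}
```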

13 ImMiner: Infrequent Alternatives of P1
- Split input database into two databases: Positive (PSD) and Negative (NSD)
- Mine patterns that are frequent in NSD and are infrequent in PSD
  - Reason: Only such patterns serve as alternatives for P1
- Alternative pattern P2: "const check on the return of ArrayList.size() before Iterator.next()"
- Alattin applies ImMiner algorithm to detect neglected conditions
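The split-and-mine step can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: it works on single items, uses one threshold `minSup` for both "frequent in NSD" and "infrequent in PSD" (the slide does not say whether the thresholds differ), and runs on an invented database with the assumed IDs 1 = `hasNext` check, 2 = `size` check, 3 = other check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class ImbalancedAlternativesSketch {

    static Set<Integer> row(Integer... ids) {
        return new HashSet<>(Arrays.asList(ids));
    }

    // Hypothetical input database; item 3 co-occurs with item 1 in half of the positive rows.
    static List<Set<Integer>> toyDatabase() {
        List<Set<Integer>> db = new ArrayList<>();
        db.add(row(1, 3));
        db.add(row(1, 3));
        db.add(row(1));
        db.add(row(1));
        db.add(row(2));
        db.add(row(2, 3));
        return db;
    }

    // Fraction of rows containing the item (0.0 for an empty database).
    static double support(List<Set<Integer>> db, int item) {
        if (db.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (Set<Integer> r : db) {
            if (r.contains(item)) {
                hits++;
            }
        }
        return (double) hits / db.size();
    }

    // Items that are frequent in NSD and infrequent in PSD, relative to frequentItem.
    static List<Integer> alternativesFor(List<Set<Integer>> db, int frequentItem, double minSup) {
        List<Set<Integer>> psd = new ArrayList<>(); // rows that support frequentItem
        List<Set<Integer>> nsd = new ArrayList<>(); // rows that do not
        for (Set<Integer> r : db) {
            if (r.contains(frequentItem)) {
                psd.add(r);
            } else {
                nsd.add(r);
            }
        }
        Set<Integer> candidates = new TreeSet<>();
        for (Set<Integer> r : nsd) {
            candidates.addAll(r);
        }
        List<Integer> alternatives = new ArrayList<>();
        for (int item : candidates) {
            if (support(nsd, item) >= minSup && support(psd, item) < minSup) {
                alternatives.add(item);
            }
        }
        return alternatives;
    }

    public static void main(String[] args) {
        System.out.println(alternativesFor(toyDatabase(), 1, 0.5));
    }
}
```

In this toy database item 3 is frequent in NSD but also frequent in PSD, so it is not an alternative to item 1; only item 2 passes both tests.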

14 Neglected Conditions
- Neglected conditions refer to
  - Missing conditions that check the arguments or receiver of the API call before the API call
  - Missing conditions that check the return or receiver of the API call after the API call
- One of the primary reasons for many fatal issues
  - security or buffer-overflow vulnerabilities [Chang et al. ISSTA 07]

15 Alattin Approach
- Phase 1: Issue queries and collect relevant code samples, e.g., "lang:java java.util.Iterator next"
- Phase 2: Generate pattern candidates
- Phase 3: Mine alternative patterns
- Phase 4: Detect neglected conditions statically
[Figure: application under analysis -> extract classes and methods reused -> classes and methods -> open source projects on web (1, 2, …, N) -> pattern candidates -> alternative patterns -> detect neglected conditions -> violations]

16 Evaluation
- Research Questions:
  - Do alternative patterns exist in real applications?
  - What percentage of false positives is eliminated (with low or no increase in false negatives) in detected violations?

17 Subjects
- Two categories of subjects:
  - 3 Java default API libraries
  - 3 popular open source libraries
- Column "Samples": number of code examples collected from Google code search

18 RQ1: Balanced and Imbalanced Patterns
- What percentage of balanced and imbalanced patterns exist in real applications?
- Balanced patterns: 0% to 30% (average: 9.69%)
- Imbalanced patterns:
  - 30% to 100% (average: 65%) for Java default API libraries
  - 0% to 9.5% (average: 5%) for open source libraries
- Inference: Java default API libraries provide more different ways of writing code compared to open source libraries

19 RQ2: False Positives and False Negatives
- What percentage of false positives is eliminated (with low or no increase in false negatives)?
- Applied mined patterns ("P1 or P2 or ... or Pi or Â1 or Â2 or ... or Âj") in three modes:
  - Existing mode: "P1 or P2 or ... or Pi or Â1 or Â2 or ... or Âj" -> P1, P2, ..., Pi
  - Balanced mode: "P1 or P2 or ... or Pi or Â1 or Â2 or ... or Âj" -> "P1 or P2 or ... or Pi"
  - Imbalanced mode: "P1 or P2 or ... or Pi or Â1 or Â2 or ... or Âj" -> "P1 or P2 or ... or Pi or Â1 or Â2 or ... or Âj"
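One way to read the three modes is sketched below. It assumes that existing mode treats every frequent alternative as a separate mandatory rule, that balanced mode accepts any one frequent alternative, and that imbalanced mode also accepts the infrequent alternatives. That interpretation, the class `ModeDemo`, and the check names (`hasNext`, `isEmpty`, `size`) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ModeDemo {

    enum Mode { EXISTING, BALANCED, IMBALANCED }

    // True if any of the given alternatives is present at the call site.
    static boolean satisfiesAny(Set<String> checks, List<String> alternatives) {
        for (String alt : alternatives) {
            if (checks.contains(alt)) {
                return true;
            }
        }
        return false;
    }

    // Decides whether a call site with the given checks is reported as a violation.
    static boolean isViolation(Set<String> checks, List<String> frequent,
                               List<String> infrequent, Mode mode) {
        switch (mode) {
            case EXISTING:
                // Each frequent pattern is its own rule: missing any one is a violation.
                return !checks.containsAll(frequent);
            case BALANCED:
                // "P1 or ... or Pi": one frequent alternative is enough.
                return !satisfiesAny(checks, frequent);
            case IMBALANCED:
                // "P1 or ... or Pi or Â1 or ... or Âj": infrequent alternatives also count.
                return !satisfiesAny(checks, frequent) && !satisfiesAny(checks, infrequent);
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        List<String> frequent = Arrays.asList("hasNext", "isEmpty"); // hypothetical P1, P2
        List<String> infrequent = Collections.singletonList("size"); // hypothetical Â1
        Set<String> sizeGuarded = new HashSet<>(Collections.singletonList("size"));

        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": "
                    + isViolation(sizeGuarded, frequent, infrequent, mode));
        }
    }
}
```

Under this reading a call site guarded only by the infrequent alternative is reported in existing and balanced modes but accepted in imbalanced mode, which matches the direction of the false-positive reduction reported on the following slides.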

20 RQ2: False Positives and False Negatives
- Existing Mode vs Balanced Mode
[Table: results per application (Java Util, Java Transaction, Java SQL, BCEL, HSqlDB, Hibernate, and average/total). Columns: defects and false positives in existing mode; defects, false positives, % of reduction, and false negatives in balanced mode.]
- Balanced mode reduced false positives by 15.17% without any increase in false negatives

21 RQ2: False Positives and False Negatives
- Existing Mode vs Imbalanced Mode
[Table: results per application (Java Util, Java Transaction, Java SQL, BCEL, HSqlDB, Hibernate, and average/total). Columns: defects and false positives in existing mode; defects, false positives, % of reduction, and false negatives in imbalanced mode.]
- Imbalanced mode reduced false positives by 28% with a quite small increase in false negatives

22 Conclusion
- Problem-driven methodology for advancing mining software engineering data by identifying
  - new problems, patterns
  - mining algorithms, defects
- Alattin mines alternative patterns classified into three categories: balanced, imbalanced, and single
- Alattin can be used to enhance various existing mining approaches to reduce false positives
- Future work: Exploit synergy between static and dynamic analysis to further reduce false positives

23 Thank You