Software Analytics: Towards Software Mining that Matters Tao Xie University of Illinois at Urbana-Champaign



Should I test\review my? ©A. Hassan

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services. [MALETS’11 Zhang et al.]

Software Intelligence & Analytics for Software Development

use Data Exploration and Analysis → Mining Software Repositories (MSR)
for Software Practitioners → Beyond Software Developers
obtain Insightful and Actionable info → Need to get real as well
Analytic Techniques
Producing Impact on Practice

Look through your software data ©A. Hassan

Mine through the data! An international effort to make software repositories actionable Promise Data Repository ©A. Hassan

Mining Software Repositories (MSR)
- Transforms static record-keeping repositories to active repositories
- Makes repository data actionable by uncovering hidden patterns and trends
Repositories: Mailing list, Bugzilla, Crashes, Field logs, CVS/SVN ©A. Hassan

Historical Repositories: Source Control (CVS/SVN), Bugzilla, Mailing lists
Runtime Repos: Field Logs, Crash Repos
Code Repos: Sourceforge, GoogleCode ©A. Hassan

MSR researchers analyze and cross-link repositories (Bugzilla, CVS/SVN, Mailing list, Crashes)
- Linked data: fixed bugs, discussions, buggy change & fixing change, field crashes
- For a new bug report: estimate fix effort, mark duplicates, suggest experts and fix!
©A. Hassan


We continue to help practitioners (esp. developers) ©A. Hassan

Detection and Management of Code Clones ©A. Hassan

Support Logs Source Code ©A. Hassan

use Data Exploration and Analysis → Mining Software Repositories (MSR)
for Software Practitioners → Beyond Software Developers
obtain Insightful and Actionable info → Need to get real as well
Analytic Techniques
Case Studies

Predicting Bugs
Studies have shown that most complexity metrics correlate well with LOC!
- Graves et al. on commercial systems
- Herraiz et al. on open source systems
Noteworthy findings:
- Previous bugs are good predictors of future bugs
- The more a file changes, the more likely it is to have bugs in it
- Recent changes affect a file's bug potential more than older changes (weighted time damp models)
- Number of developers is of little help in predicting bugs
- Bug predictors are hard to generalize across projects unless the projects are in similar domains [Nagappan, Ball et al. 2006]
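The "weighted time damp" finding can be illustrated with a small Java sketch. The slide does not give the weighting used in the cited studies, so the exponential half-life decay here is an assumption, and the class name, method name, and sample ages are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class TimeDampedChangeScore {
    // Assumed decay form: each change contributes a weight that halves
    // every halfLifeDays, so recent changes count more than older ones.
    static double score(List<Double> changeAgesInDays, double halfLifeDays) {
        double total = 0.0;
        for (double ageInDays : changeAgesInDays) {
            total += Math.pow(0.5, ageInDays / halfLifeDays);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical change histories: ages of past changes, in days.
        List<Double> recentlyChanged = Arrays.asList(1.0, 3.0, 7.0);
        List<Double> changedLongAgo = Arrays.asList(300.0, 400.0, 500.0);
        System.out.println("recent: " + score(recentlyChanged, 30.0));
        System.out.println("old:    " + score(changedLongAgo, 30.0));
    }
}
```

Both files have three changes, yet the recently changed file gets the higher score, which is the ranking behavior the finding describes.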

Using Imports in Eclipse to Predict Bugs
import org.eclipse.jdt.internal.compiler.lookup.*;
import org.eclipse.jdt.internal.compiler.*;
import org.eclipse.jdt.internal.compiler.ast.*;
import org.eclipse.jdt.internal.compiler.util.*;
...
import org.eclipse.pde.core.*;
import org.eclipse.jface.wizard.*;
import org.eclipse.ui.*;
14% of all files that import ui packages had to be fixed later on.
71% of files that import compiler packages had to be fixed later on. [Schröter et al. 06]
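Percentages like the ones above come from counting, per imported package, how many importing files were fixed later. The following Java sketch shows only that counting step. The `SourceFile` type, the `fixRate` method, and the toy data are hypothetical and are not the tooling used by Schröter et al.

```java
import java.util.Arrays;
import java.util.List;

final class ImportFixRate {
    // Hypothetical record of one file: its imports and whether it was fixed later.
    static final class SourceFile {
        final String name;
        final List<String> imports;
        final boolean fixedLater;

        SourceFile(String name, List<String> imports, boolean fixedLater) {
            this.name = name;
            this.imports = imports;
            this.fixedLater = fixedLater;
        }
    }

    // Fraction of files importing anything under packagePrefix that were fixed later.
    static double fixRate(List<SourceFile> files, String packagePrefix) {
        int importing = 0;
        int fixed = 0;
        for (SourceFile file : files) {
            boolean usesPackage = false;
            for (String imported : file.imports) {
                if (imported.startsWith(packagePrefix)) {
                    usesPackage = true;
                    break;
                }
            }
            if (usesPackage) {
                importing++;
                if (file.fixedLater) {
                    fixed++;
                }
            }
        }
        return importing == 0 ? 0.0 : (double) fixed / importing;
    }

    public static void main(String[] args) {
        // Toy data, invented for illustration only.
        List<SourceFile> files = Arrays.asList(
            new SourceFile("A.java", Arrays.asList("org.eclipse.jdt.internal.compiler.ast"), true),
            new SourceFile("B.java", Arrays.asList("org.eclipse.jdt.internal.compiler.util"), false),
            new SourceFile("C.java", Arrays.asList("org.eclipse.ui"), false));
        System.out.println(fixRate(files, "org.eclipse.jdt.internal.compiler"));
        System.out.println(fixRate(files, "org.eclipse.ui"));
    }
}
```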

Percentage of bug-introducing changes for Eclipse
Don’t program on Fridays ;-) [Zimmermann et al. 05]

Failure is a 4-letter Word [PROMISE’11 Zeller et al.]

Actionable Alone is not Enough! [PROMISE’11 Zeller et al.]

Who produces more buggy code? ©A. Hassan


Analytic Techniques in SE
- Association rules and frequent patterns
- Classification
- Clustering
- Text mining/Natural language processing
- Visualization
More details are at

Solution-Driven → Problem-Driven
- Basic mining algorithms (e.g., association rule, frequent itemset mining…): Where can I apply X miner?
- Advanced mining algorithms (e.g., frequent partial order mining [ESEC/FSE 07])
- New/adapted mining algorithms (e.g., [ICSE 09], [ASE 09]): What patterns do we really need?

Mining → Searching + Mining
- Traditional approaches: mine patterns directly from a few code repositories (e.g., Eclipse, Linux, …); these often lack sufficient relevant data points (e.g., API call sites)
- Our new approaches: search many code repositories (1, 2, …, N) through a code search engine (e.g., open source code on the web), then mine patterns from the search results

- Existing approaches produce a high % of false positives
- One major observation:
  - Programmers often write code in different ways for achieving the same task
  - Some ways are more frequent than others
[Figure: patterns are mined from the frequent ways; the infrequent ways are then detected as violations of the mined patterns]
S. Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.

Example: java.util.Iterator.next()
Code Sample 1:
PrintEntries1(ArrayList entries) {
  …
  Iterator it = entries.iterator();
  if (it.hasNext()) {
    String last = (String) it.next();
  }
  …
}
Code Sample 2:
PrintEntries2(ArrayList entries) {
  …
  if (entries.size() > 0) {
    Iterator it = entries.iterator();
    String last = (String) it.next();
  }
  …
}
java.util.Iterator.next() throws NoSuchElementException when invoked on a list without any elements.
S. Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.
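As a runnable illustration of the two code samples, the Java sketch below guards `Iterator.next()` in both ways and also shows the unguarded call failing on an empty list. The class and method names are hypothetical. The slide code elides its surrounding logic, so returning the first element (or `null`) is an assumption made to keep the example complete.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class IteratorGuards {
    // Guard in the style of Code Sample 1: check hasNext() before next().
    static String firstViaHasNext(List<String> entries) {
        Iterator<String> it = entries.iterator();
        if (it.hasNext()) {
            return it.next();
        }
        return null;
    }

    // Guard in the style of Code Sample 2: check size() before next().
    static String firstViaSize(List<String> entries) {
        if (entries.size() > 0) {
            Iterator<String> it = entries.iterator();
            return it.next();
        }
        return null;
    }

    // Neglected condition: no check at all before next().
    static String firstUnguarded(List<String> entries) {
        return entries.iterator().next();
    }

    public static void main(String[] args) {
        List<String> empty = new ArrayList<>();
        System.out.println(firstViaHasNext(empty));
        System.out.println(firstViaSize(empty));
        try {
            firstUnguarded(empty);
        } catch (NoSuchElementException e) {
            System.out.println("unguarded call threw NoSuchElementException");
        }
    }
}
```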

Example: java.util.Iterator.next()
Among 1243 code examples:
- Code Sample 1 (check on Iterator.hasNext): 1218 / 1243
- Code Sample 2 (check on ArrayList.size): 6 / 1243
Mined pattern from existing approaches: “boolean check on return of Iterator.hasNext before Iterator.next”
S. Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.

Example: java.util.Iterator.next()
- Require more general patterns (alternative patterns): P1 or P2
  - P1: boolean check on return of Iterator.hasNext before Iterator.next
  - P2: boolean check on return of ArrayList.size before Iterator.next
- Cannot be mined by existing approaches, since alternative P2 is infrequent
(Code Sample 1 follows P1; Code Sample 2 follows P2.)
S. Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.

Our Solution: ImMiner Algorithm
- Mines alternative patterns of the form P1 or P2
- Based on the observation that infrequent alternatives such as P2 are frequent among code examples that do not support P1
Among 1243 code examples: Sample 1 (1218 / 1243), Sample 2 (6 / 1243)
- P2 is frequent among code examples not supporting P1
- P2 is infrequent among the entire 1243 code examples
[ASE 09] S. Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.

Alternative Patterns
- ImMiner mines three kinds of alternative patterns of the general form “P1 or P2”
  - Balanced: all alternatives (both P1 and P2) are frequent
  - Imbalanced: some alternatives (P1) are frequent and others are infrequent (P2). Represented as “P1 or P̂2”
  - Single: only one alternative
S. Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.

ImMiner Algorithm
- Uses frequent-itemset mining [Burdick et al. ICDE 01] iteratively
- An input database with the following APIs for Iterator.next()
[Tables: input database; mapping of IDs to APIs]
S. Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.

ImMiner Algorithm: Frequent Alternatives
Input database → frequent itemset mining (min_sup 0.5) → frequent item: 1
P1: boolean check on the return of Iterator.hasNext() before Iterator.next()
S. Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.
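The frequent-item step can be sketched in Java as plain support counting against a minimum support threshold. This is a simplification: it counts single items only, whereas the slides cite a full frequent-itemset miner [Burdick et al. ICDE 01]. The item IDs and the toy database are hypothetical, since the slide's input database table did not survive in this transcript.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class FrequentItems {
    // Helper to build one transaction (the set of item IDs seen at one call site).
    static Set<Integer> tx(Integer... items) {
        return new TreeSet<>(Arrays.asList(items));
    }

    // Returns the items whose support (fraction of transactions containing them) is >= minSup.
    static Set<Integer> frequentItems(List<Set<Integer>> database, double minSup) {
        Set<Integer> frequent = new TreeSet<>();
        if (database.isEmpty()) {
            return frequent;
        }
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Set<Integer> transaction : database) {
            for (int item : transaction) {
                counts.merge(item, 1, Integer::sum);
            }
        }
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            double support = (double) entry.getValue() / database.size();
            if (support >= minSup) {
                frequent.add(entry.getKey());
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        // Hypothetical IDs: 1 = hasNext check, 2 = size check.
        List<Set<Integer>> database = Arrays.asList(tx(1), tx(1), tx(1), tx(2));
        System.out.println(frequentItems(database, 0.5));
    }
}
```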

ImMiner: Infrequent Alternatives of P1
- Split input database into two databases: Positive (PSD) and Negative (NSD)
- Mine patterns that are frequent in NSD and are infrequent in PSD
  - Reason: only such patterns serve as alternatives for P1
- Alternative pattern P2: “const check on the return of ArrayList.size() before Iterator.next()”
- Alattin applies the ImMiner algorithm to detect neglected conditions
S. Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.
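The split-and-mine step can be sketched in Java as follows. This is a simplified reading, not the published ImMiner implementation: it assumes PSD holds the transactions that contain the frequent item for P1 and NSD holds the rest, and it tests single items rather than itemsets. All names, item IDs, and data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class ImMinerSketch {
    // Helper to build one transaction (item IDs observed at one call site).
    static Set<Integer> tx(Integer... items) {
        return new TreeSet<>(Arrays.asList(items));
    }

    // Fraction of transactions in the database that contain the item.
    static double support(List<Set<Integer>> database, int item) {
        if (database.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (Set<Integer> transaction : database) {
            if (transaction.contains(item)) {
                hits++;
            }
        }
        return (double) hits / database.size();
    }

    // Items that are frequent in NSD but infrequent in PSD, relative to frequentItem.
    static Set<Integer> infrequentAlternatives(
            List<Set<Integer>> database, int frequentItem, double minSup) {
        List<Set<Integer>> positive = new ArrayList<>(); // assumed PSD: supports P1
        List<Set<Integer>> negative = new ArrayList<>(); // assumed NSD: does not support P1
        for (Set<Integer> transaction : database) {
            if (transaction.contains(frequentItem)) {
                positive.add(transaction);
            } else {
                negative.add(transaction);
            }
        }
        Set<Integer> candidates = new TreeSet<>();
        for (Set<Integer> transaction : negative) {
            candidates.addAll(transaction);
        }
        Set<Integer> alternatives = new TreeSet<>();
        for (int item : candidates) {
            if (support(negative, item) >= minSup && support(positive, item) < minSup) {
                alternatives.add(item);
            }
        }
        return alternatives;
    }

    public static void main(String[] args) {
        // Hypothetical IDs: 1 = hasNext check, 2 = size check, 3 = an item present everywhere.
        List<Set<Integer>> database = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            database.add(tx(1, 3));
        }
        database.add(tx(2, 3));
        database.add(tx(2, 3));
        System.out.println(infrequentAlternatives(database, 1, 0.5));
    }
}
```

Item 3 is frequent in both halves, so it is not reported. Only an item concentrated in NSD qualifies as an alternative to P1.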

Neglected Conditions
- Neglected conditions refer to
  - Missing conditions that check the arguments or receiver of the API call before the API call
  - Missing conditions that check the return or receiver of the API call after the API call
- One primary reason for many fatal issues
  - security or buffer-overflow vulnerabilities [Chang et al. ISSTA 07]
S. Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.
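Once an alternative pattern such as "P1 or P2" is available, detecting a neglected condition amounts to flagging call sites that satisfy none of the alternatives. The Java sketch below illustrates only that check. The string labels for conditions and the representation of a call site as a set of labels are hypothetical simplifications, not Alattin's actual program analysis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class NeglectedConditionDetector {
    // Helper to build the set of condition labels found at one call site.
    static Set<String> conditions(String... labels) {
        return new TreeSet<>(Arrays.asList(labels));
    }

    // Returns the indices of call sites that satisfy none of the alternatives.
    static List<Integer> violations(List<Set<String>> callSites, Set<String> alternatives) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i < callSites.size(); i++) {
            if (Collections.disjoint(callSites.get(i), alternatives)) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Hypothetical labels standing in for P1 and P2.
        Set<String> alternatives = conditions("hasNext-check", "size-check");
        List<Set<String>> callSites = Arrays.asList(
            conditions("hasNext-check"),
            conditions("size-check"),
            conditions());
        System.out.println(violations(callSites, alternatives));
    }
}
```

With only P1 as the pattern, the second call site would also be flagged. Accepting P2 as an alternative removes that false positive.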


Machine Learning that Matters [ICML’12 Wagstaff]

- Hyper-Focus on Benchmark Data Sets
- Hyper-Focus on Abstract Metrics
- Lack of Follow-Through
[ICML’12 Wagstaff]

- Meaningful Evaluation Methods
- Involvement of the World Outside ML
- Eyes on the Prize
[ICML’12 Wagstaff]

MSRA Software Analytics Group
Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services.
Research Topics (vertical): Software Development Process, Software Systems, Software Users
Technology Pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing
Contact: Dongmei Zhang

Software Analytics in Practice

Adoption Challenges for Software Analytics
- Must show value before data quality improves
- Correlation vs. Causation

ICSE Papers: Industry vs. Academia (Source ©Carlo Ghezzi)
OSDI % vs. xSE ?%: Developers, Programmers, Architects Among All Attendees
Keynotes: ICSM 11, ICSE 09, MSR 12, MSR 11, SCAM 12

"Are Automated Debugging [Research] Techniques Actually Helping Programmers?"
- 50 years of automated debugging research
  - N papers → only 5 evaluated with actual programmers
[ISSTA11 Parnin&Orso]

Are Regression Testing [Research] Techniques Actually Helping Industry?
- Likely most studied testing problems
  - N papers
[STVR11 Yoo&Harman]

Are [Some] Failure-Proneness Prediction [Research] Techniques Actually Helping?
- Empirical software engineering (on prediction)
  - N papers
[PROMISE11 Zeller et al.]

A Researcher's Observation in HCI Research Community “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight, controlled studies on these new techniques with small, artificial tasks” “I give up on CHI/UIST” by James Landay Source©J. Landay

A Researcher's Observation in HCI Research Community “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.” “I give up on CHI/UIST” by James Landay Source©J. Landay

A Researcher's Observation in HCI Research Community “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?” “We are our own worst enemies. I think we have been blinded by the perception that "true scientific" research is only found in controlled experiments and nice statistics.” Does our research community have similar issues?? “I give up on CHI/UIST” by James Landay Source©J. Landay

MS Academic Search: “Pointer Analysis”

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic one may wonder, “Haven't we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”
Michael Hind. Pointer analysis: haven't we solved this problem yet?. In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source ©M. Hind

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]
Section 4.3 Designing an Analysis for a Client’s Needs
Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.”
Michael Hind. Pointer analysis: haven't we solved this problem yet?. In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source ©M. Hind & B. Ryder

MS Academic Search: “Clone Detection”
Research typically focuses on and evaluates intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already matured after n years of effort on the intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
- Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI
- MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012

Suggested Actions → Tech Adoption
- Get research problems from real practice
- Get feedback from real practice
- Collaborate across disciplines
- Collaborate with industry

Software Analytics
- Data Exploration and Analysis
- For Software Practitioners
- Obtain Insightful and Actionable info
- With Analytic Techniques
Producing Impact on Practice

Acknowledgments
- Microsoft Research Asia Software Analytics Group
- Ahmed Hassan, Lin Tan, Jian Pei
- Many other colleagues

Q&A
