Interactive Code Review for Systematic Changes

Presentation transcript:

Interactive Code Review for Systematic Changes Tianyi Zhang,1 Myoungkyu Song,2 Joseph Pinedo,2 Miryung Kim1 1 University of California, Los Angeles 2 University of Texas at Austin Hi everyone, my name is Tianyi. Today I am going to present our recent work, interactive code review for systematic changes.

Code Review
What is code review? Inspect changes; find mistakes overlooked by developers.
State of the art: Eclipse Compare, Gerrit, Phabricator, Code Flow. Line-level differences; manual process.

So, what is code review? Code review is the process of inspecting program changes before developers check their code into the repository. It is used to catch mistakes the author overlooked, and it is a common industry practice for improving software quality. There are many popular code review frameworks, such as Eclipse Compare, Gerrit at Google (Click), Phabricator at Facebook (Click), and Code Flow at Microsoft (Click). These frameworks generally compute line-level differences between the old and new program revisions and display them side by side. Code reviewers then have to go through each difference manually to check for mistakes.

Motivation
Reviewers have a hard time inspecting systematic edits: similar changes scattered across the program.

(a) change example
  int keyDownEvent (int w) {
- ExpandItem item = items [index];
  switch (w) {
  case OS.SPACE:
  Event event = new Event ();
- event.item = item;
- sendEvent(true, event);
+ event.item = focusItem;
+ sendEvent(event);
+ refreshItem(focusItem);

(b) a similar but not identical change
  int keyReleaseEvent (int wParam) {
- ExpandItem item = items [index];
  switch (wParam) {
  case OS.SPACE:
  Event ev = new Event ();
- ev.item = item;
- sendEvent(true, ev);
+ ev.item = focusItem;
+ sendEvent(ev);
+ refreshItem(focusItem);

(c) an inconsistent change
  int ButtonUpEvent (int wParam) {
- ExpandItem item = items [index];
  if (wParam == HOVER){
  Event bEvent = new Event ();
- bEvent.item = item;
- sendEvent(true, bEvent);
+ sendEvent(bEvent);
+ refreshItem(focusItem);

However, reviewers may find it difficult to review systematic changes using current code review frameworks. By systematic changes, I mean similar changes that are scattered across the program, and you can see an example here. Please notice that I said similar changes rather than identical or duplicated changes, because they may involve minor differences at different locations. For example (Click), the same variable is named differently in these three methods: event, ev, and bEvent. The context of each change is also different (Click). The first two changes are located in a switch statement, while the last one is in an if statement.

Motivation
Code reviewers cannot easily answer questions like:
What other locations are changed similarly to this change?
Are there inconsistencies among similar edits?
Are there any other locations that are similar to this code but are not updated?
[Figure: a diff patch with locations labeled Similar Change, Potential Mistake, Unchanged Location, and Missing Update]

When a batch of changes is scattered across multiple files, it is usually hard to pinpoint oversight errors among many similar program changes. Reviewers also need to find locations that should have been changed but that the developer forgot to change. Therefore, code reviewers cannot easily answer questions like these.

Outline
Related Work
Interactive Code Review Approach
  Phase I: Context-Aware Change Template Generation
  Phase II: Template Customization
  Phase III: Change Summarization and Anomaly Detection
Evaluation
  Semi-Structured Interviews with Salesforce Engineers
  A User Study with 12 ECE students at UT Austin
Conclusion

Here is the outline for today’s presentation. I will first talk about related work. Then I will walk you through the three-phase code review approach and discuss how we evaluated it. Finally, I will wrap up this talk with a short conclusion.

Related Work
Modern Code Review and Change Comprehension: Decompose large, composite changes into small ones [Rigby et al., Xie et al.]. Our work is inspired by these findings.
Code Clone Analysis: Detect duplicated code and find cloning-related bugs [CCFinder, Deckard, CP-Miner, SecureSync]. But they are not designed to investigate diff patches.
Systematic Change Automation: Replicate similar changes to multiple locations [LASE, Sydit]. LASE uses fixed template generation, while our approach allows users to interactively customize the template. LASE was not evaluated with a user study.

Our work is inspired by the research findings of Peter Rigby, Tao Xie, and Mike Barnett. They studied code review practices and challenges and suggested decomposing large, composite changes into small, relevant ones. Researchers have also designed code clone detectors to identify duplicated code, such as CCFinder by Kamiya and Deckard by Jiang. Other tools, such as CP-Miner and SecureSync, are used to find copy-paste-related bugs. However, these techniques analyze code clones within a single program revision, so they cannot be used to investigate the program changes between two revisions. Na Meng created Sydit and LASE to help developers replicate similar changes to multiple locations. Our tool differs from Na’s work in two ways. First, LASE learns from multiple examples and generates a fixed change template, whereas our work learns from one example and allows users to interactively customize the default template. Second, the LASE work did not include a user study to investigate the usability of the approach.

Critics: Interactive Code Review Approach for Systematic Changes
[Workflow figure: the peer reviewer selects a change in the diff patch (old revision vs. new revision); Context Extraction produces AST edits; Template Generation produces an Abstract Diff Template; the reviewer edits the template and examines the reported Systematic Changes and Anomalous Changes]

Critics is an interactive code review approach that helps developers understand and review systematic changes. Let’s first look at the workflow. Given a diff patch, that is, the program changes between the old and new program revisions, the peer reviewer first selects a program change of interest. (Double Click) Critics then extracts the context of the selected change and formalizes the textual changes as edits on the Abstract Syntax Tree. (Click) Critics further generates an abstract diff template and visualizes the template as a tree graph. (Click) Reviewers can edit and customize the template. (Click) Finally, Critics searches for similar changes and potential mistakes based on the customized template and reports them to reviewers. (Click) Reviewers can examine the search results and, as you can see, iteratively refine the template based on previous search results.

Phase I: Context-Aware Change Template Generation

(a) selected change
  int keyDownEvent (int w) {
- ExpandItem item = items [index];
  switch (w) {
  case OS.SPACE:
  Event ev = new Event ();
- ev.item = item;
- sendEvent(true, ev);
+ ev.item = focusItem;
+ sendEvent(ev);
+ refreshItem(focusItem);

(b) abstract change template
[Tree figure with nodes: method_decl, item_decl, switch, other_stmt, Case 1, Case 2, ev_decl, and the deleted and inserted statements]

Now I will walk you through the approach step by step. Let’s first look at Phase I. In this phase, Critics generates a context-aware template from the user-selected change. It first parses the textual program changes into structured program edits on the abstract syntax tree. (Click) (Each deleted statement is displayed as a red node with dotted lines, and each inserted statement is displayed as a blue AST node with solid lines.) As you can see, this tree consists of the AST nodes appearing in the old revision (Click) and the AST nodes in the new revision (Click). (Click) Given such an abstract syntax tree, Critics prunes it by filtering out the AST nodes unrelated to the selected change, such as the other switch cases in the program. (Click) At the same time, it identifies the context of the selected change by analyzing the control and data dependencies between AST nodes. As you can see, the selected change depends on the declaration of the variable ev (Click), even though this statement is not changed at all. The selected change also depends on the switch statement (Click), because the changed location is executed only under a certain switch condition. And now we have the context-aware change template of the selected program change. (Click)
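The pruning step in Phase I can be pictured as keeping the changed statements plus whatever they depend on. The sketch below is only an illustration under stated assumptions: statements are plain strings, the dependency map is supplied by hand rather than computed from an AST, and dependencies are followed transitively. The class and method names are hypothetical and are not taken from Critics.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: keep changed statements and the context they depend on.
final class ContextPruningSketch {

    // dependsOn maps a statement to the statements it has a control or data
    // dependency on (assumed to be given; Critics derives this from the AST).
    static Set<String> keepChangedAndContext(Set<String> changed,
                                             Map<String, List<String>> dependsOn) {
        Set<String> kept = new LinkedHashSet<>(changed);
        Deque<String> work = new ArrayDeque<>(changed);
        while (!work.isEmpty()) {
            String node = work.pop();
            for (String dep : dependsOn.getOrDefault(node, List.of())) {
                if (kept.add(dep)) {      // first time we see this context node
                    work.push(dep);       // follow its own dependencies too
                }
            }
        }
        return kept;                      // everything else is pruned
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = Map.of(
                "ev.item = focusItem", List.of("Event ev = new Event()", "switch (w)"),
                "other case body", List.of("switch (w)"));
        System.out.println(keepChangedAndContext(Set.of("ev.item = focusItem"), deps));
    }
}
```

In this toy input, the unchanged declaration of ev and the switch statement survive as context, while the other switch case is dropped because no changed statement depends on it.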

Phase II: Template Customization

(c) customized change template
[Tree figure: the template before and after customization. Excluded context nodes are shown as $exclude, and ev becomes $v1 in every node: Event $v1 = new .., $v1.item = focusItem, sendEvent($v1), refreshItem(focusItem), $v1.item = item, sendEvent(true, $v1)]

In Phase II, Critics allows reviewers to customize the context-aware change template by parameterizing tokens and excluding AST nodes. For example, the reviewer may realize that the variable ev is likely to be named differently in other locations. She can then parameterize ev (Click) so that it matches any other variable. For ease of use, Critics also propagates this parameterization to all other nodes. (Click) As another example, the reviewer may wonder whether other locations are changed similarly but in a different context. She can exclude the context statements (Click), which in this example means the switch statement. At this point, we have a customized change template. (Click)
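Propagating a parameterization to every node can be sketched as a whole-word substitution over the node labels. This is a minimal illustration, assuming template nodes are represented by their statement text; the class name and the string-based representation are hypothetical, and Critics itself works on AST nodes rather than strings.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch: replace one identifier with a placeholder in all template nodes.
final class TemplateParameterizationSketch {

    static List<String> parameterize(List<String> nodeLabels,
                                     String identifier,
                                     String placeholder) {
        // Word boundaries keep "ev" from matching inside "sendEvent" or "Event".
        Pattern wholeWord = Pattern.compile("\\b" + Pattern.quote(identifier) + "\\b");
        // quoteReplacement is needed because "$" is special in replacement strings.
        String replacement = Matcher.quoteReplacement(placeholder);
        return nodeLabels.stream()
                .map(label -> wholeWord.matcher(label).replaceAll(replacement))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> template = List.of(
                "Event ev = new Event()",
                "ev.item = focusItem",
                "sendEvent(ev)",
                "refreshItem(focusItem)");
        parameterize(template, "ev", "$v1").forEach(System.out::println);
    }
}
```

Parameterizing once and rewriting every node mirrors the ease-of-use point in the talk: the reviewer does not have to touch each occurrence of ev by hand.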

Phase III: Change Summarization and Anomaly Detection
Hypothesis: If two locations look similar in both the old and new revision, the program changes in these two locations must be similar.
Critics searches for similar locations in the old revision and the new revision respectively.

                 Matched_old                Different_old
Matched_new      Correct similar change     Similar change to different contexts
Different_new    Missing similar change     Irrelevant

Generally speaking, the problem of summarizing similar changes and finding mistakes can be reduced to a search problem. So in this phase, Critics uses the customized template to search for program changes that match or violate the template. Intuitively, it is hard to search directly for program changes in a diff patch, especially since we also need to find locations that should have been changed but were not changed by the developer. We sidestepped this hard problem with the following hypothesis: if two locations look similar to each other in both the old and the new program revision, the program changes in these two locations must be similar. (Pause here and ask if everybody is following) First, we search for locations that match the template in the old revision. Second, we search for locations that match the template in the new revision. If two locations look similar in the old revision but different in the new revision, one of two mistakes may have occurred: either one of them was not changed at all, or one of them was changed incorrectly. If two locations look different in the old revision but similar in the new revision, the developer may have mistakenly changed the wrong location. If two locations are different in both the old and new revisions, we consider them irrelevant.
This is the high-level idea of how we group similar changes and detect potential mistakes. On the next slide, I will go into more detail about how we implement this algorithm.
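The four cells of the hypothesis table map directly onto two yes/no questions: does the location match the template in the old revision, and does it match in the new revision? A minimal sketch of that decision follows; the enum and method names are hypothetical, and the two boolean inputs are assumed to come from whatever matching procedure is used.

```java
// Hypothetical sketch of the four-way classification from the hypothesis table.
final class ChangeClassificationSketch {

    enum Verdict {
        CORRECT_SIMILAR_CHANGE,               // matched old, matched new
        SIMILAR_CHANGE_IN_DIFFERENT_CONTEXT,  // different old, matched new
        MISSING_OR_INCONSISTENT_CHANGE,       // matched old, different new
        IRRELEVANT                            // different old, different new
    }

    static Verdict classify(boolean matchesOldTemplate, boolean matchesNewTemplate) {
        if (matchesOldTemplate && matchesNewTemplate) {
            return Verdict.CORRECT_SIMILAR_CHANGE;
        }
        if (matchesOldTemplate) {
            // Looked the same before, looks different now: not updated, or updated wrongly.
            return Verdict.MISSING_OR_INCONSISTENT_CHANGE;
        }
        if (matchesNewTemplate) {
            return Verdict.SIMILAR_CHANGE_IN_DIFFERENT_CONTEXT;
        }
        return Verdict.IRRELEVANT;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, false));
    }
}
```

Only the first verdict is a clean result; the second and third are the ones a reviewer would want surfaced as potential mistakes.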

Phase III: Change Summarization and Anomaly Detection
[Figure: four changes matched against change_template_old (Match, Match, Match, Different) and change_template_new (Match, Match, Different, Match); the first two are labeled Similar Change and the last two Potential Mistake]

As I mentioned before, a change template consists of AST nodes appearing in the old revision and AST nodes appearing in the new revision. So the first step is to split the template into two templates. One contains only the AST nodes in the old revision, and the other contains only the AST nodes in the new revision. (Click) Let’s call them the “old template” and the “new template” for simplicity. We then match the old template against locations in the old program revision and the new template against locations in the new program revision. (Double Click) The first two changes are grouped as similar changes because their locations match both the old and the new template. The other two changes, however, are reported as possible mistakes.
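Splitting the template can be sketched as two filters over tagged nodes. The sketch assumes each template node is tagged as deleted, inserted, or unchanged context, and that unchanged context belongs to both the old and the new template; that tagging scheme and all names here are assumptions for illustration, not details confirmed by the talk.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: derive the "old template" and "new template" from one change template.
final class TemplateSplitSketch {

    enum Kind { DELETED, INSERTED, UNCHANGED }

    static final class Node {
        final String label;
        final Kind kind;
        Node(String label, Kind kind) {
            this.label = label;
            this.kind = kind;
        }
    }

    // Old revision: everything except statements that were inserted.
    static List<String> oldTemplate(List<Node> template) {
        return template.stream()
                .filter(n -> n.kind != Kind.INSERTED)
                .map(n -> n.label)
                .collect(Collectors.toList());
    }

    // New revision: everything except statements that were deleted.
    static List<String> newTemplate(List<Node> template) {
        return template.stream()
                .filter(n -> n.kind != Kind.DELETED)
                .map(n -> n.label)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Node> template = List.of(
                new Node("Event $v1 = new Event()", Kind.UNCHANGED),
                new Node("sendEvent(true, $v1)", Kind.DELETED),
                new Node("sendEvent($v1)", Kind.INSERTED));
        System.out.println(oldTemplate(template));
        System.out.println(newTemplate(template));
    }
}
```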

Phase III: Change Summarization and Anomaly Detection
The original RTED algorithm [1] only computes node-level alignment between two trees. Critics extends RTED in two ways:
match a parameterized token with any concrete token;
match an excluded node with any node.
[Figure: template nodes $exclude, $v1.item = focusItem, and sendEvent($v1) matched against concrete nodes such as Switch, If, ev.item = focusItem, sendEvent(ev), event.item = focusItem, and sendEvent(event)]
1. Pawlik, Mateusz, and Nikolaus Augsten. "RTED: a robust algorithm for the tree edit distance." Proceedings of the VLDB Endowment 5.4 (2011): 334-345.

Regarding the tree matching algorithm, Critics customizes the Robust Tree Edit Distance (RTED) algorithm to allow flexible matching of parameterized identifiers and excluded statements. The original RTED algorithm is limited for this problem because it only computes node-level alignment between two trees. Critics extends the RTED algorithm in two ways. If a variable is parameterized, it can be matched with any other variable. Similarly, if a node is excluded, it can be matched with any other node.
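The two relaxations can be illustrated at the level of a single node comparison. The sketch below is not RTED and does not compute a tree edit distance; it only shows, under simplifying assumptions, how a label comparison might treat a $-prefixed placeholder and an $exclude node. Labels are tokenized with a simple regular expression, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of relaxed node-label matching (a stand-in, not the RTED extension itself).
final class FlexibleLabelMatchSketch {

    // A token is an optional "$" followed by word characters, or any single non-space character.
    private static final Pattern TOKEN = Pattern.compile("\\$?\\w+|\\S");

    static List<String> tokenize(String label) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(label);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    static boolean labelMatches(String templateLabel, String concreteLabel) {
        if (templateLabel.equals("$exclude")) {
            return true;                          // an excluded node matches any node
        }
        List<String> t = tokenize(templateLabel);
        List<String> c = tokenize(concreteLabel);
        if (t.size() != c.size()) {
            return false;
        }
        for (int i = 0; i < t.size(); i++) {
            boolean placeholder = t.get(i).startsWith("$");
            boolean identifier = c.get(i).matches("\\w+");
            // A placeholder accepts any identifier; every other token must be identical.
            if (placeholder ? !identifier : !t.get(i).equals(c.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(labelMatches("sendEvent($v1)", "sendEvent(event)"));
        System.out.println(labelMatches("$exclude", "if (wParam == HOVER)"));
    }
}
```

With these rules, the template node sendEvent($v1) accepts both sendEvent(ev) and sendEvent(event), while $exclude lets a switch context stand in for an if context.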

Critics Plug-in
The Eclipse plug-in is available at https://sites.google.com/a/utexas.edu/critics/. (Zhang et al., FSE ’14 Demo)

Research Questions
RQ1: How usable is Critics in practice?
RQ2: How accurately does a reviewer locate similar edits and mistakes with Critics?
RQ3: How much time can a reviewer save by using Critics?

We evaluated Critics from three perspectives: usability, accuracy, and time efficiency.

Semi-Structured Interview at Salesforce
We interviewed six software engineers at Salesforce to study the usability of Critics in industry.
Subject Role Gender Age Java Experience Code Review Frequency
1 Developer Male 21-30 4 Weekly 2 QE Female 3 Manager 41-50 Seldom 31-40 5 10 6 14 Daily

To study the usability of Critics in practice, we conducted semi-structured interviews with six software engineers at Salesforce. Two of them are software developers, three are quality engineers (software testers), and one is a project manager. Five of them said they did code review at least once a week. Even though the project manager rarely did code review, we still interviewed him because we believed he could provide valuable feedback from a manager’s perspective.

Semi-Structured Interview at Salesforce
20-minute presentation about Critics
Explore Critics [1] with one of four diff patches authored by their own team.

No. | Patch Description | Changed LOC | Num of Changed Files
1 | Refactor test cases by moving bean maps to respective utils classes | 743 | 22
2 | Refactoring the API to get versioned field values | 943 | 34
3 | Refactor test cases to use try-with-resources statements | 484 | 10
4 | Update common search tests by getting versioned test data | 2224 | 12

1. Critics is implemented as an Eclipse plug-in, http://sites.google.com/utexas/edu/critics/

Before the interviews, we first presented our approach and taught the participants how to use Critics. Then we asked them to do code review with one of the four diff patches authored by their own team.

How could Critics help them?
“... REST APIs across different versions generally share similar code snippets ... It's hard and time-consuming to find mistakes on similar changes on those locations...”
“The feature in your tool can free us from piling code review tasks on our senior developers...”

One interviewee reported that, because their team is developing versioned APIs, they reuse a lot of code between versions. So if there is a bug in an old API, they have to fix not only that specific version but also all following versions. Another interviewee said it is hard to find missing updates unless the reviewer is very familiar with the codebase. They currently have no dedicated way to address this problem and rely only on regression testing. Our tool can relieve the code review burden on senior developers because it allows users to interactively explore the diff patch and thus does not require much knowledge of the codebase.

How do they like or dislike Critics?
“Currently COLLABORATOR only highlights the changed location in a very naive way. A feature like extracting and visualizing the change context can help us better understand the change itself as well as find some underlying change patterns between related changes.”
“It will be helpful if Critics can provide some hints about template customization.”

They liked Critics because it helps them better understand program changes. However, two participants suggested that we provide some hints about template customization.

User Study at UT Austin
We recruited 12 UT students.
4 of them are ECE undergrads; the others are graduate students in Software Engineering.
All of them have at least one year of experience with the Eclipse IDE.
All but one have code review experience using diff tools such as Eclipse Compare and SVN/Git diff.
We gave them a 20-minute tutorial on the Critics plug-in.

To study the accuracy and time efficiency of Critics, we ran another user study at UT Austin and quantitatively compared our tool with another code review plug-in in Eclipse. We recruited 12 UT students. Four of them are ECE undergrads, and the others are graduate students in Software Engineering. All of them have at least one year of experience with the Eclipse IDE. All but one have code review experience using diff tools such as Eclipse Compare, svn diff, and git diff. Before the user study, we gave them a 20-minute tutorial on how to use the Critics plug-in in Eclipse. After they became familiar with the plug-in, they were asked to review two different patches, one with the Critics plug-in and the other with Eclipse Compare.

Code Review Patches

Patch 1
  Version: JDT 9800 vs JDT 9801
  Change Description: Initialize a variable in a for loop instead of using a hashmap
  Similar Change: getTrailingComments(ASTNode), getLeadingComments(ASTNode), getExtendedEnd(ASTNode)
  Inconsistent Change: getExtendedStartPosition(ASTNode)
  Missing Update: getComments(ASTNode), getCommentsRange(ASTNode)
  Size (LOC): 190

Patch 2
  Version: JDT 16010 Vs JDT 10611
  Change Description: extract the logic of Unicode handling to a method
  Similar Change: getNextChar(), getNextCharasDigit(), getNextToken() … 9 locations in total
  Inconsistent Change: getNextCharAsJavaIdentifierPart(ASTNode)
  Missing Update: jumpOverMethodBody() … 11 locations in total
  Size (LOC): 680

These two patches are drawn from the version history of an open-source project, Eclipse Java Development Tools (JDT). Patch 1 contains 190 lines of changed code and has 3 similar changes and 2 missing updates. Patch 2 contains 680 lines of changed code and has 9 similar changes and 11 missing updates. We manually seeded one inconsistent change in each patch.

User Study Tasks
Each participant carried out code review tasks on two different patches, one with Critics and the other with Eclipse Compare.
Q1: Given the change in the method getTrailingComments, what other methods containing similar changes can you find? Count the number.
Q2: Which of the following methods contains inconsistent changes compared with the change in getTrailingComments?
Q3: How many methods share context similar to the change in getTrailingComments but have missed updates?
We measured task completion time and accuracy.

During the code review, each participant had to answer three questions. First, they had to find the methods that contain similar changes and count them. Second, they had to point out the inconsistent change. Finally, they had to find the locations that share a similar context but missed the similar update. We recorded how long they took to complete each task and computed the accuracy of their answers, and we used both as proxy measures of code review productivity.

Subjects | Critics (Q1 Q2 Q3 Time) | Eclipse Compare (Q1 Q2 Q3 Time)
1 √ 13:30 N/A 26:37
2 13:18 ✗ 47:21
3 18:29 24:54
4 15:02 25:07
5 19:00 25:46
6 23:00 21:11
7 11:48 17:06
8 20:00 18:14
9 29:00 15:00
10 16:11 37:57
11 14:27 25:45
12 35:17 22:46
Average 83% 100% 92% 19:26 42% 58% 33% 25:39

Evaluation results show that human subjects can answer questions about systematic changes 47.3% more correctly with a 31.9% saving in time using Critics.

This table compares the correctness of the participants’ answers and the task completion time when using Critics and when using Eclipse Compare. (Click) It shows that human subjects can answer questions about systematic changes 47.3% more correctly, with a 31.9% time saving, using Critics in comparison with Eclipse Compare.
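The headline figures can be checked against the Average row. The talk does not say how they were derived, so the sketch below is one reading, stated as an assumption: the accuracy figure is the gap in percentage points between the mean Q1 to Q3 accuracy of each tool, and the time figure is the extra time taken with Eclipse Compare relative to the Critics time. The class and method names are hypothetical; the input numbers are the reported averages.

```java
// Hypothetical sketch: recompute the summary figures from the reported averages.
final class StudyAveragesSketch {

    static double mean(double... values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Converts "mm:ss" to seconds.
    static int toSeconds(String minutesSeconds) {
        String[] parts = minutesSeconds.split(":");
        return Integer.parseInt(parts[0]) * 60 + Integer.parseInt(parts[1]);
    }

    public static void main(String[] args) {
        double criticsAccuracy = mean(83, 100, 92);   // reported Q1-Q3 averages, in percent
        double compareAccuracy = mean(42, 58, 33);
        int criticsTime = toSeconds("19:26");         // reported average times
        int compareTime = toSeconds("25:39");

        double gapInPoints = criticsAccuracy - compareAccuracy;
        double extraRelativeToCritics = 100.0 * (compareTime - criticsTime) / criticsTime;
        double savedRelativeToCompare = 100.0 * (compareTime - criticsTime) / compareTime;

        System.out.println("accuracy gap (points): " + gapInPoints);
        System.out.println("extra time vs Critics (%): " + extraRelativeToCritics);
        System.out.println("time saved vs Eclipse Compare (%): " + savedRelativeToCompare);
    }
}
```

Under this reading the accuracy gap comes out at about 47.3 percentage points and the time difference at roughly 32% of the Critics time, both close to the reported numbers. Measured against the Eclipse Compare time instead, the same 373-second difference is about 24%.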

Comparison with LASE
LASE automates systematic editing by searching for locations and applying edits to individual locations.
It is challenging to compare LASE and Critics directly: fixed vs. interactive template generation.
We simulate observed customization patterns, e.g., generalize long names first, then short names, and then exclude contexts.
We compare the locations found by the two techniques.

Our prior work, LASE, automates systematic editing by searching for locations and applying custom edits to individual locations.

Subjects and Metrics
Six patches drawn from Eclipse JDT and SWT [Meng et al.]
Patch size ranges from 190 to 680 lines of changed code.
Each patch consists of three to ten systematic edits.
Metrics: precision, recall, F1 score
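For reference, the three metrics can be computed from the set of locations a tool reports and the set of locations that were expected. The sketch uses the standard definitions; the class name and the method names in the example data are purely illustrative and are not results from the study.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: precision, recall, and F1 over reported vs. expected locations.
final class LocationMetricsSketch {

    static int truePositives(Set<String> reported, Set<String> expected) {
        Set<String> both = new HashSet<>(reported);
        both.retainAll(expected);                 // locations that are reported and correct
        return both.size();
    }

    static double precision(Set<String> reported, Set<String> expected) {
        return reported.isEmpty() ? 0.0
                : (double) truePositives(reported, expected) / reported.size();
    }

    static double recall(Set<String> reported, Set<String> expected) {
        return expected.isEmpty() ? 0.0
                : (double) truePositives(reported, expected) / expected.size();
    }

    static double f1(Set<String> reported, Set<String> expected) {
        double p = precision(reported, expected);
        double r = recall(reported, expected);
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);   // harmonic mean
    }

    public static void main(String[] args) {
        // Illustrative data only.
        Set<String> reported = Set.of("methodA", "methodB", "methodC", "methodD");
        Set<String> expected = Set.of("methodA", "methodB", "methodC", "methodE");
        System.out.println(precision(reported, expected));
        System.out.println(recall(reported, expected));
        System.out.println(f1(reported, expected));
    }
}
```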

Comparison with LASE
In five out of six cases, Critics achieves the same or higher accuracy than LASE within a few iterations.

Critics LASE Precision Recall Iterations Time(sec)
Patch 1 1 4 1.66
Patch 2 0.9 6 8.95 0.92 0.75
Patch 3 13.52
Patch 4 7 71.98 0.33
Patch 5 6.86
Patch 6 3 1.47
Average 0.87 17.41 0.99 0.84

Conclusion
We present Critics, a novel approach for searching for systematic changes and detecting potential mistakes during code reviews.
Interviews at Salesforce show that Critics scales to an industry-scale project and can be easily adopted by professional developers.
User study results show that human subjects using Critics can answer questions about systematic changes more correctly and in less time, compared with the baseline of Eclipse Compare.

Now I will wrap up this talk with a short conclusion.

Q&A

Accuracy variation in Critics’s Simulation