A Metric for Software Readability by Raymond P.L. Buse and Westley R. Weimer Presenters: John and Suman

Readability The human judgment of how easy a text is to understand A local, line-by-line feature Not related to the size of a program Not related to the essential complexity of software

Readability and Maintenance Reading code is the most time-consuming of all maintenance activities [J. Lionel E. Deimel 1985, D. R. Raymond 1991, S. Rugaber 2000] Maintenance accounts for 70% of software cost [B. Boehm and V. R. Basili 2001] Readability also correlates with software quality, code changes, and defect reports

Problem Statement How do we create a software readability metric that: –Correlates strongly with human annotators –Correlates strongly with software quality Currently no automated readability measure for software

Contributions A software readability metric that: –correlates strongly with human annotators –correlates strongly with software quality A survey of 120 human readability annotators A discussion of the features related to software readability

Readability Metrics for Natural Language Empirical and objective models of readability Flesch-Kincaid Grade Level has been used for over 50 years Based on simple features: –Word length –Sentence length Used by government agencies, MS Word, Google Docs
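
For reference, a minimal sketch of the Flesch-Kincaid Grade Level formula (0.39 * words-per-sentence + 11.8 * syllables-per-word - 15.59); the syllable counter below is a naive vowel-group heuristic used only for illustration:

# Sketch: Flesch-Kincaid Grade Level from simple surface features.
import re

def count_syllables(word):
    # Rough heuristic: count runs of vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)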

Experimental Approach 120 human annotators were shown 100 code snippets The resulting 12,000 readability judgments are available online (ok, not really)

Snippet Selection - Goals Length –Short enough to aid feature discrimination –Long enough to capture important readability considerations Logical Coherence –Shouldn't span methods –Include adjacent comments Avoid trivial snippets (e.g. a group of import statements)

Snippet Selection - Algorithm Snippet = 3 consecutive simple statements –Based on authors' experience –Simple statements are: declarations, assignments, function calls, breaks, continues, throws and returns Other nearby statements are included: –Comments, function headers, blank lines, if, else, try-catch, switch, etc. Snippets cannot cross scope boundaries
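
A rough sketch of this extraction idea, assuming each source line has already been tagged as a simple statement, an auxiliary line (comment, header, blank line, control-flow keyword), or a scope boundary; the tagging step itself is elided here:

# Sketch: group lines into snippets of 3 consecutive simple statements,
# carrying along nearby auxiliary lines, never crossing a scope boundary.
def extract_snippets(tagged_lines):
    # tagged_lines: list of (kind, text) with kind in {"simple", "aux", "boundary"}
    snippets, current, simple_count = [], [], 0
    for kind, text in tagged_lines:
        if kind == "boundary":
            current, simple_count = [], 0   # never cross scope boundaries
            continue
        current.append(text)
        if kind == "simple":
            simple_count += 1
        if simple_count == 3:
            snippets.append("\n".join(current))
            current, simple_count = [], 0
    return snippets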

Readability Scoring Readability was rated from 1 to 5  1 - “less readable”  3 - “neutral”  5 - “more readable”

Inter-annotator agreement Good correlation needed for a coherent model Pearson product-moment correlation coefficient –Correlation of 1 indicates perfect agreement –Correlation of 0 indicates no better than chance agreement Computed pairwise for all annotators Average correlation of 0.56 –Typically considered “moderate to strong”
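
A minimal sketch of the agreement computation, assuming each annotator's ratings are stored as one row of a matrix (the data below is a toy example, not the study's data):

# Sketch: average pairwise Pearson correlation among annotators.
from itertools import combinations
import numpy as np

def average_pairwise_pearson(scores):
    # scores: array of shape (num_annotators, num_snippets)
    pairs = combinations(range(scores.shape[0]), 2)
    rs = [np.corrcoef(scores[i], scores[j])[0, 1] for i, j in pairs]
    return float(np.mean(rs))

# Toy example: 3 annotators rating 5 snippets on the 1-5 scale.
scores = np.array([[1, 3, 5, 2, 4],
                   [2, 3, 4, 2, 5],
                   [1, 2, 5, 3, 4]])
print(average_pairwise_pearson(scores))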

Readability Model Objective: Mechanically predict human readability judgments Determine code features that are predictive of readability Usage: Use this model to analyze code (an automated software readability metric)

Model Generation Classifier: a machine learning algorithm Instance: the feature vector extracted from a snippet Experiment procedure - training phase - a set of instances labeled with the “correct answer” - labels derived by thresholding the bimodal distribution of human scores
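
A hedged sketch of assembling the labeled training instances; deriving the binary label by thresholding the average human score at the scale midpoint is an assumption made here for illustration, not the paper's exact cutoff:

# Sketch: build (feature vector, label) training instances.
import numpy as np

def build_instances(snippets, human_scores, extract_features, threshold=3.0):
    # threshold=3.0 is the midpoint of the 1-5 scale (an assumption)
    X, y = [], []
    for snippet, scores in zip(snippets, human_scores):
        X.append(extract_features(snippet))
        y.append(1 if np.mean(scores) > threshold else 0)  # 1 = "more readable"
    return np.array(X), np.array(y)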

Model Generation (contd …) Decide on a set of features that can be detected statically These features relate to the structure, density, logical complexity, and documentation of the analyzed code Each feature is independent of the size of the code block
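
A sketch of a few such size-independent surface features; the paper's actual feature set is larger, and the identifier and comment patterns below are simplifications for Java-like snippets:

# Sketch: per-snippet features computed as per-line averages or maxima,
# so they do not grow with snippet size. Assumes a non-empty snippet.
import re

def extract_features(snippet):
    lines = [l for l in snippet.splitlines() if l.strip()]
    idents = [re.findall(r"[A-Za-z_]\w*", l) for l in lines]
    return [
        sum(len(l) for l in lines) / len(lines),                           # avg line length
        max(len(l) for l in lines),                                        # max line length
        sum(len(ws) for ws in idents) / len(lines),                        # avg identifiers per line
        sum(1 for l in lines if l.strip().startswith("//")) / len(lines),  # comment-line density
    ]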

Model Generation (contd …) Build a classifier on a set of features Use 10-fold cross validation - random partitioning of data set into 10 subsets - train on 9 and test on 1 - repeat this process 10 times Mitigate any bias from partitioning by repeating the 10-fold validation 20 times Average the results across all of the runs
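
A minimal sketch of this protocol with scikit-learn; the choice of logistic regression as the classifier is an assumption here (the paper compares several learners):

# Sketch: 10-fold cross-validation repeated 20 times, averaging the f-measure.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def evaluate(X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             scoring="f1", cv=cv)
    return scores.mean()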

Results Two relevant success metrics – precision & recall Recall – the percentage of snippets judged “more readable” by annotators that the model also classifies as “more readable” Precision – the fraction of snippets the model classifies as “more readable” that annotators also judged “more readable” Performance is measured with the f-measure, the harmonic mean of the two metrics
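
Concretely, the f-measure is the harmonic mean of precision and recall; a minimal sketch:

# Sketch: precision, recall, and their harmonic mean (f-measure)
# for the "more readable" class.
def f_measure(true_labels, predicted_labels, positive=1):
    tp = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p == positive)
    fp = sum(1 for t, p in zip(true_labels, predicted_labels) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(true_labels, predicted_labels) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0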

Results (contd …) 0.61 – f-measure of a classifier trained on randomly generated score labels (baseline) 0.8 – f-measure of the classifier trained on the average human data

Results (contd …) Repeated the experiment separately for each annotator experience group (undergraduate course levels up through 400-level, and graduate CS students)

Interesting findings from the performance measures Average line length and average number of identifiers per line are important to readability Average identifier length, loops, if constructs and comparison operators are not very predictive features

Readability Correlations (Experiment 1) Correlate defects detected by FindBugs* with the readability metric Run FindBugs on benchmark programs Partition the functions into two sets (one containing at least one defect report and the other containing none) Run the trained readability classifier Record the f-measure for “contains a bug” with respect to the classifier judgment of “less readable” *FindBugs – a popular static bug finding tool
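
A rough sketch of this setup, assuming we already have a per-function FindBugs report count and a trained readability model (both names below are hypothetical placeholders); it reuses the f_measure helper from the earlier sketch:

# Sketch: agreement between "contains a FindBugs report" and the model's
# "less readable" judgment, summarized with the f-measure.
def findbugs_agreement(functions, readability_model, findbugs_reports):
    has_bug = [1 if findbugs_reports.get(fn, 0) > 0 else 0 for fn in functions]
    less_readable = [1 if readability_model.predict(fn) == "less readable" else 0
                     for fn in functions]
    return f_measure(has_bug, less_readable)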

Readability Correlations (Experiment 2) Correlates future code churn with readability Uses readability to predict which functions will be modified between two successive releases of a program A function is considered to have changed –where the text is not exactly the same –including changes in whitespace
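
A sketch of the change check between two releases, assuming each release is available as a mapping from function name to its source text (a hypothetical structure); exact text comparison is used, so whitespace changes count:

# Sketch: which functions changed between two successive releases?
def changed_functions(release_a, release_b):
    common = set(release_a) & set(release_b)
    return {name for name in common if release_a[name] != release_b[name]}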

Readability Correlations - Results Average f-measure: –For Experiment 1 -> 0.61 and for Experiment 2 -> 0.63

Relating Metric to Software Life Cycle Readability tends to change over a long period of time

Relating Metric to Software Life Cycle (contd …) Correlate project readability against project maturity (as reported by developers) “Projects that reach maturity tend to be more readable”

Discussion Identifier Length –No influence! –Long names can improve readability, but can also reduce it –Comments might be more appropriate –Authors' suggestions: improved IDEs and code inspections Code Comments –Only moderately correlated –Being used to “make up for” ugly code? Characters/identifiers per line –Strongly correlated –Just as long sentences are more difficult to understand, so are long lines of code –Authors' suggestion: keep lines short, even if that means splitting a statement over several lines

Related Work Natural Language Metrics [R. F. Flesch 1948, R. Gunning 1952, J. P. Kincaid and E. A. Smith 1970, G. H. McLaughlin 1969] Coding Standards [S. Ambler 1997, B. B. Bederson et al. 2002, H. Sutter and A. Alexandrescu 2004] Style Checkers [T. Copeland 2005] Defect Prediction [T. J. Cheatham et al. 1995, T. L. Graves et al. 2000, T. M. Khoshgoftaar et al. 1996, J. P. Kincaid and E. A. Smith 1970]

Future Work Examine personal preferences –Create personal models Models based on application domain Broader features –e.g. number of statements in an if block IDE integration Explore minimum set of predictive features

Conclusion Created a readability metric based on a specific set of human annotators This metric: –agrees with the annotators as much as they agree with each other –has significant correlation with conventional metrics of software quality Examining readability could improve language design and engineering practice