A Metric for Evaluating Static Analysis Tools
Katrina Tsipenyuk, Fortify Software
Brian Chess, Fortify Software

1. Four Perspectives on the Problem

General
• How good are software security tools today?

Tools vendor
• Is my static analysis product getting better over time? (What does "better" mean?)
• How much has it improved since the last release?
• What should I focus on to improve my tool in the future?
• If I make my tool detect a new kind of security bug, will an auditor or a developer thank me? Or both?

Tools user: auditor
• Is the tool finding all the important types of security bugs?

Tools user: developer
• Is the tool producing a lot of noise?

Auditors and developers have different criteria for security tools, so we need a way to answer these questions on two scales: "Auditor" and "Developer".

2. Proposed Solution

Define metrics that model tool characteristics and conjecture a formula for calculating a score for each tool version (see the sketch below):
• Counts of true positives (t), false positives (p), and false negatives (n)
• 100 * t / (t + p + n), augmented by the weights and penalties below (score out of 100)

Define weights and penalties for reported results:
• Results with different reported severities should be weighted differently: High (h), Medium (m), and Low (l)
• The false-negative penalty for a bug category should depend on whether the tool claims to detect that kind of bug

Define weights and penalties for the "Auditor" and "Developer" scales:
• Auditors tolerate false positives while developers tolerate false negatives, so the false-positive and false-negative weights differ between the two scales
• The importance and value of a vulnerability category (v_c) to auditors and developers should affect the weights of the results

Conduct an experiment and collect the data necessary to prove or disprove the conjecture.
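As a minimal sketch of the base formula above, assuming the weights and penalties are applied by scaling the raw counts before the ratio is taken (the function name and sample counts are illustrative, not Fortify's implementation):

```python
def base_score(t, p, n):
    """Unweighted form of the proposed metric: 100 * t / (t + p + n),
    where t = true positives, p = false positives, n = false negatives."""
    total = t + p + n
    return 100.0 * t / total if total else 0.0

# Illustrative counts: 40 true positives, 15 false positives, 5 false negatives.
print(base_score(40, 15, 5))  # -> 66.7 (out of 100)
```

A tool that reports no noise and misses nothing scores 100; every false positive or false negative pulls the score down.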

3. Experiment

• Analyzed three different projects: wuftpd (C), webgoat (Java), and securibench (Java)
• Ran four versions of the Fortify tool
• Did a full audit of the reported results for every product/version combination (time consuming)

Defined weights based on our experiences with auditors and developers (encoded in the sketch below):
• Table 1 presents the chosen weights and penalties for true positives (t), false positives (p), and false negatives (n) for high-value (high v_c) and low-value (low v_c) categories
• Table 2 presents the false-negative penalty per bug category, based on whether the tool claims to detect the category or not
• Table 3 presents the High (h), Medium (m), and Low (l) severity weights
• Table 4 presents the false-positive (p) and false-negative (n) penalties for the "Auditor" and "Developer" scales

Table 1. Penalties with respect to category importance: TP (t), FP (p), and FN (n) penalties for important vs. not-important categories
Table 2. False-negative penalty by detection claim: claims to detect = 1, does not claim to detect = 0.5
Table 3. Severity weights: High (h) = 4, Medium (m) = 2, Low (l) = 1
Table 4. "Auditor" vs. "Developer" scale penalties: Auditor FP (p) = 0.5, FN (n) = 2; Developer FP (p) = 2, FN (n) = 0.5
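The weights above might be encoded and combined as in the hedged Python sketch below. The dictionary names, the per-finding record format, and the multiplicative combination are assumptions made for illustration rather than the exact formula used in the experiment; Table 1's per-importance values are omitted because they are not fully recoverable here.

```python
SEVERITY_WEIGHT = {"High": 4, "Medium": 2, "Low": 1}   # Table 3
FN_CLAIM_PENALTY = {True: 1.0, False: 0.5}             # Table 2: claims to detect / does not
SCALE_PENALTY = {                                      # Table 4
    "Auditor":   {"fp": 0.5, "fn": 2.0},
    "Developer": {"fp": 2.0, "fn": 0.5},
}

def weighted_score(findings, scale):
    """findings: list of dicts such as
    {"kind": "tp" | "fp" | "fn", "severity": "High", "claimed": True}.
    Returns a score out of 100 on the chosen scale ("Auditor" or "Developer")."""
    t = p = n = 0.0
    penalties = SCALE_PENALTY[scale]
    for f in findings:
        weight = SEVERITY_WEIGHT[f["severity"]]
        if f["kind"] == "tp":
            t += weight
        elif f["kind"] == "fp":
            p += weight * penalties["fp"]
        else:  # false negative: extra penalty if the tool claims to detect this category
            n += weight * penalties["fn"] * FN_CLAIM_PENALTY[f["claimed"]]
    total = t + p + n
    return 100.0 * t / total if total else 0.0

# Hypothetical audit of one project/version combination.
audit = [
    {"kind": "tp", "severity": "High",   "claimed": True},
    {"kind": "fp", "severity": "Low",    "claimed": True},
    {"kind": "fp", "severity": "Medium", "claimed": True},
    {"kind": "fn", "severity": "Medium", "claimed": False},
]
print(weighted_score(audit, "Auditor"))    # false positives weigh lightly on this scale
print(weighted_score(audit, "Developer"))  # the same findings score lower for developers
```

With these weights the same audit yields a lower "Developer" score than "Auditor" score, matching the intent that developers are less tolerant of noise.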

4. Experimental Results & Analysis

• The collected data seem to indicate that we are headed in the right direction
• Both scores for wuftpd get higher until version 3.1: the number of false positives decreases, but in version 3.1 it increases
• The wuftpd "Developer" score is lower than the "Auditor" score for all four versions: the "Developer" false-positive penalty is higher, and the tool is tuned better for Java than for C (after all, Fortify is a security company)
• The webgoat "Developer" score drops between versions 3.1 and 3.5, with the addition of multiple auditor-oriented categories
• Both scores are best for the latest release examined (whew)

[Table: "Auditor" and "Developer" scores for webgoat, securibench, and wuftpd across versions 2.1, 3.0, 3.1, and 3.5; the complete set of data for one experiment is available as a handout]

5. Conclusions & Future Work

• The proposed approach is useful for our purposes, measuring improvements of the Fortify static analyzer; it is unclear whether the same approach would be useful for comparing two different tools
• Determining an "answer key" against which to grade the tool's results is still a hard problem
• On our to-do list:
  - Do more audits of various projects to collect more data for adjusting the weights and penalties, including projects written in other languages the tool supports
  - Experiment with additional weights and penalties, e.g., a penalty for incorrectly reporting the severity of results
  - Define a good visual representation of the collected data that makes it intuitive to determine the areas needing improvement