Harry Xu University of California, Irvine & Microsoft Research

Slides:



Advertisements
Similar presentations
Dataflow Analysis for Datarace-Free Programs (ESOP 11) Arnab De Joint work with Deepak DSouza and Rupesh Nasre Indian Institute of Science, Bangalore.
Advertisements

R O O T S Field-Sensitive Points-to-Analysis Eda GÜNGÖR
Context-Sensitive Interprocedural Points-to Analysis in the Presence of Function Pointers Presentation by Patrick Kaleem Justin.
SLA-Oriented Resource Provisioning for Cloud Computing
Demand-driven Alias Analysis Implementation Based on Open64 Xiaomi An
Bebop: A Symbolic Model Checker for Boolean Programs Thomas Ball Sriram K. Rajamani
Refinement-Based Context-Sensitive Points-To Analysis for JAVA Soonho Kong 17 January Work of Manu Sridharan and Rastislav Bodik, UC Berkeley.
Background for “KISS: Keep It Simple and Sequential” cs264 Ras Bodik spring 2005.
Parallel Inclusion-based Points-to Analysis Mario Méndez-Lojo Augustine Mathew Keshav Pingali The University of Texas at Austin (USA) 1.
Using Programmer-Written Compiler Extensions to Catch Security Holes Authors: Ken Ashcraft and Dawson Engler Presented by : Hong Chen CS590F 2/7/2007.
The Ant and The Grasshopper Fast and Accurate Pointer Analysis for Millions of Lines of Code Ben Hardekopf and Calvin Lin PLDI 2007 (Best Paper & Best.
Static Analysis of Embedded C Code John Regehr University of Utah Joint work with Nathan Cooprider.
1 Refinement-Based Context-Sensitive Points-To Analysis for Java Manu Sridharan, Rastislav Bodík UC Berkeley PLDI 2006.
Data Structures Introduction. What is data? (Latin) Plural of datum = something given.
Java Alias Analysis for Online Environments Manu Sridharan 2004 OSQ Retreat Joint work with Rastislav Bodik, Denis Gopan, Jong-Deok Choi.
Reps Horwitz and Sagiv 95 (RHS) Another approach to context-sensitive interprocedural analysis Express the problem as a graph reachability query Works.
Symbolic Path Simulation in Path-Sensitive Dataflow Analysis Hari Hampapuram Jason Yue Yang Manuvir Das Center for Software Excellence (CSE) Microsoft.
The sequence of graph transformation (P1)-(P2)-(P4) generating an initial mesh with two finite elements GENERATION OF THE TOPOLOGY OF INITIAL MESH Graph.
PRESTO: Program Analyses and Software Tools Research Group, Ohio State University STATIC ANALYSES FOR JAVA IN THE PRESENCE OF DISTRIBUTED COMPONENTS AND.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
A GPU Implementation of Inclusion-based Points-to Analysis Mario Méndez-Lojo (AMD) Martin Burtscher (Texas State University, USA) Keshav Pingali (U.T.
PRESTO: Program Analyses and Software Tools Research Group, Ohio State University Merging Equivalent Contexts for Scalable Heap-cloning-based Points-to.
Mark Marron 1, Deepak Kapur 2, Manuel Hermenegildo 1 1 Imdea-Software (Spain) 2 University of New Mexico 1.
Mark Marron IMDEA-Software (Madrid, Spain) 1.
Cross Language Clone Analysis Team 2 October 13, 2010.
PRESTO: Program Analyses and Software Tools Research Group, Ohio State University Merging Equivalent Contexts for Scalable Heap-cloning-based Points-to.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
Pointer Analysis Survey. Rupesh Nasre. Aug 24, 2007.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Detecting Inefficiently-Used Containers to Avoid Bloat Guoqing Xu and Atanas Rountev Department of Computer Science and Engineering Ohio State University.
Data Structures and Algorithms in Parallel Computing
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
Selenium server By, Kartikeya Rastogi Mayur Sapre Mosheca. R
Chapter 4 Static Analysis. Summary (1) Building a model of the program:  Lexical analysis  Parsing  Abstract syntax  Semantic Analysis  Tracking.
Leverage Big Data With Hadoop Analytics Presentation by Ravi Namboori Visit
Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.
Presented by: Omar Alqahtani Fall 2016
Advanced Computer Systems
Organizations Are Embracing New Opportunities
Security and Programming Language Work on SmartPhones
Harry Xu and Zhiqiang Zuo University of California, Irvine
Introduction to Compiler Construction
Done By: Ashlee Lizarraga Ricky Usher Jacinto Roches Eli Gomez
CSC 321: Data Structures Fall 2016
Chapter 14: System Protection
Curator: Self-Managing Storage for Enterprise Clusters
Applying Control Theory to Stream Processing Systems
Abstract Major Cloud computing companies have started to integrate frameworks for parallel data processing in their product portfolio, making it easy for.
Hadoop Clusters Tess Fulkerson.
Apache Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Aditya Waghaye October 3, 2016 CS848 – University.
Chapter 2: Operating-System Structures
Introduction to Spark.
On Spatial Joins in MapReduce
Human Complexity of Software
Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform
Pregelix: Think Like a Vertex, Scale Like Spandex
Demand-Driven Context-Sensitive Alias Analysis for Java
Lecture Topics: 11/1 General Operating System Concepts Processes
Efficient Document Analytics on Compressed Data:
Chapter 10: File-System Interface
Intermediate Representations Hal Perkins Autumn 2005
Introduction to Virtual Machines
Introduction to Virtual Machines
Assoc. Prof. Marc FRÎNCU, PhD. Habil.
MapReduce: Simplified Data Processing on Large Clusters
COMPANY PROFILE: REELWAY
From Use Cases to Implementation
Map Reduce, Types, Formats and Features
Gurbinder Gill Roshan Dathathri Loc Hoang Keshav Pingali
Presentation transcript:

Harry Xu University of California, Irvine & Microsoft Research Graph Systems for Statically Analyzing Hundreds of Projects at the Same Time Harry Xu University of California, Irvine & Microsoft Research

Learning from Lots of Java Projects Code completion and program synthesis Learn likely specifications Bug finding and security problem detection …

Learning from Programs Needs Program Analysis State of the art Program representation: abstract syntax tree Algorithms: AST traversal + machine learning (e.g., n-gram) Can we do better than that? Would some semantic representation of a program be a better abstraction? Example: points-to graphs for parameters of an API Example: dataflow information of objects flowing across APIs

Statically Analyzing Many Projects Together: How? Pointer Analysis Dataflow Analysis Abstract Interpretation Constraint-based Analysis Symbolic Execution … A big fear of memory blow up

Treating Static Analysis as A Big Data Problem Develop systems support for analyzing very large code base Present a simple interface to the user Push performance optimization down into the system An example problem: context-sensitive static analysis of very large codebases

Context-Free Language (CFL) Reachability A program graph P A context-free Grammar G with balanced parentheses properties K l1 l2 a b c c is K-reachable from a K  l1 l2 Reps, Program analysis via graph reachability, IST, 1998

A Wide Range of Applications Pointer/alias analysis Dataflow analysis, pushdown systems, set-constraint problems can all be converted to context-free-language reachability problems Alias b = a; c = b; a Assign b Assign c Alias  Assign+ Sridharan and Bodik, Refinement-based context-sensitive pointsto analysis for Java, PLDI, 2006 Zheng and Rugina, Demand-driven alias analysis for C, POPL, 2008

A Wide Range of Applications (Cont.) Pointer/alias analysis Address-of & / dereference* are the open/close parentheses Alias b = & a; // Address-of c = b; d = *c; // Dereference a & b Alias c * d Alias  Assign+ | & Alias * Sridharan and Bodik, Refinement-based context-sensitive pointsto analysis for Java, PLDI, 2006 Zheng and Rugina, Demand-driven alias analysis for C, POPL, 2008

A Dynamic Transitive Closure Problem Traditional Approach: a worklist-based algorithm The worklist contains reachable vertices No transitive edges are added physically Problem: embarrassingly sequential and unscalable

No Worry About Memory Blowup As long as one knows how to use disks and clusters Big Data thinking: Solution = (1) Large Dataset + (2) Simple Computation + System Design

Turning Big Code Analysis into Big Data Analytics Key insights: Adding transitive edges explicitly – satisfying (1) Core computation is adding edges – satisfying (2) Leveraging disk support for memory blowup Can existing graph systems be directly used? No, none of them support dynamic addition of a lot of edges

Graspan: A Graph System for Interprocedural Static Analysis of Large Programs [ASPLOS’17] Scalable Disk-based processing on the developer's work machine Parallel Edge-pair centric computation Easy to implement a static analysis Developer only needs to generate graphs in mechanical ways and provide a context-free grammar to implement the analysis Implemented in both Java and C++ https://github.com/Graspan/

How It Works? G GRAMMAR RULES

Granspan Design Partitions are of similar sizes Each partition contains an adjacency list of edges Edges in each partition are sorted Edge-Pair Centric Computation Preprocessing Post-Processing

Computation Occurs in Supersteps Edge-Pair Centric Computation Preprocessing Post-Processing

Each Superstep Loads Two Partitions 1 2 3 4 C A B 1 2 Edge-Pair Centric Computation Preprocessing Post-Processing

Post-Processing Repartition oversized partitions to maintain balanced load on memory Save partitions to disk Scheduler favors in-memory partitions and those with higher matching degrees Edge-Pair Centric Computation Preprocessing Post-Processing

What We Have Analyzed With Program #LOC #Inlines Linux 4.4.0-rc5 16M 31.7M PostgreSQL 8.3.9 700K 290K Apache httpd 2.2.18 300K 58K With A fully context-sensitive pointer/alias analysis A fully context-sensitive dataflow analysis On a Dell Desktop Computer with 8GB memory and 1TB SSD

Evaluation Can the interprocedural analyses improve D. Englers’ checkers? Found 85 new bugs in Linux 4.4.0-rc5 Is Graspan efficient and scalable? Computations took 11 mins – 12 hrs How easy is it to use Graspan? 1K LOC of C++ for writing each of points-to and dataflow graph generators Provide a grammar file Graspan v/s other engines? LogicBlox and SocialLite ran out of memory for all our programs Traditional analysis implementation ran 3 days for the smallest graph https://github.com/Graspan/

Current Work Graspan is already available Systems support for path-sensitive constraint-based analysis Systems support for flow-sensitive analysis Systems support for constraint solvers

Takeaways and Additional Thoughts Dealing with lots of projects is indeed a Big Data problem Develop a cloud-based project management platform The build file of each project implements a few interfaces, enabling them to be automatically built, analyzed, or even tested Submitting a new project requires implementing these interfaces Building and analyzing is done completely on the cloud Various analyses and learning tasks can be implemented as distributed cloud services