Recent and Upcoming Advances in the Dyninst Toolkits


Recent and Upcoming Advances in the Dyninst Toolkits
Bill Williams and Sasha Nicolas

A Brief Introduction to Dyninst
Dyninst: a tool for static and dynamic binary instrumentation and modification.
- Instrumentation: does not change original program behavior
- Modification: changes behavior in well-defined ways
- Binary: works on arbitrary executables and shared libraries, without recompilation
- Static and dynamic: rewrite a binary on disk, or modify a running process

Dyninst and the Components
Dyninst is built from a set of reusable components, each a Dyninst component in its own right: DynC, Codegen, SymtabAPI, ParseAPI, PatchAPI, InstructionAPI, DataflowAPI, ProcControlAPI, and StackwalkerAPI, operating on processes and binaries.

New Dyninst developments since 2016
- ARM64 analysis layers complete; new: ParseAPI and DataflowAPI
- New jump table analysis integrated and tested cross-platform: ARM64 and x86 flavors working, PPC coming soon
- Cross-architecture analysis released
- Internal multithreading in progress
- Early work on CUDA support

Jump table analysis
Initial principle: jump tables are
- computed control flow,
- based on an index,
- referring to a compact chunk of read-only data.
Naively, compute a backwards slice on the target of the computed indirect jump:
- Control dependence finds the bounds on the index
- Data dependence finds the computation of the target
The algorithm is architecture-independent.
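To make the pattern concrete, here is a small C function of the kind compilers typically lower to a jump table (whether a table is actually emitted depends on the compiler and optimization level): the generated code bounds-checks the index (the control dependence) and then loads a target address from a read-only table (the data dependence).

```c
/* A dense switch like this is a typical jump-table candidate: at -O2,
 * gcc and icc usually compile it to a bounds check on `op` followed by
 * an indirect jump through a read-only table of code addresses. */
int classify(int op) {
    switch (op) {
    case 0: return 10;
    case 1: return 20;
    case 2: return 30;
    case 3: return 40;
    case 4: return 50;
    case 5: return 60;
    default: return -1;   /* out-of-bounds indices bypass the table */
    }
}
```

Recovering the table from the binary means rediscovering exactly those two facts: where the table lives and how large the valid index range is.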

Real-world complexities
- Tables are not always single-level
- Indexes are not always zero-based or explicitly bounded above (or below)
- Index bounds and table contents may be computed far from each other
- A single slice can get very large and complex

Jump table refinements
Two independent values matter: the table index and the table location. Identify and slice for each one separately; this reduces the complexity of each slice. Slicing algorithms tend to be O(n²) in the size of the slice.
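A minimal sketch of how the two separately sliced values combine once both have been recovered (the names here are illustrative stand-ins, not the Dyninst API): one slice yields the table's base address, the other yields the bound on the index.

```cpp
#include <cstdint>

// From the table-location slice: where the read-only table lives.
// From the index slice: the (exclusive) bound on valid index values.
struct JumpTable {
    const int32_t *base;
    unsigned bound;
};

// Resolve one potential target: in-bounds indices yield the jump
// address plus a pc-relative table entry; anything else is no edge.
int64_t resolveTarget(const JumpTable &t, unsigned idx, int64_t jumpAddr) {
    if (idx >= t.bound)
        return -1;
    return jumpAddr + t.base[idx];
}
```

Because each slice only has to explain one value, each stays small, which matters when slicing cost grows quadratically with slice size.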

Results from new jump table analysis
We implemented the two-slice approach and saw improvement across the board.
x86-64 results (coreutils, findutils, binutils, and SPEC; built with gcc and icc at all optimization levels):
- 23% increase in tables resolved
- 6% reduction in parsing time
ARM results (coreutils, findutils, binutils, and system libraries; built with gcc -O2):
- 9% increase in tables resolved, corresponding to a 48% reduction in unparsed tables
- 23% reduction in parsing time
- 55% reduction in tables containing incorrect edges

Cross-architecture binary analysis
Goal: process binaries on unrelated hosts, as long as they share a common file format.
- Internal testing: easier to analyze a collection of different binaries on a single host
- Tool uses: preprocess code from a cluster on your desktop, irrespective of hardware
All static analysis components now support this: SymtabAPI, InstructionAPI, ParseAPI, and DataflowAPI.

Cross-architecture instrumentation
Analysis was easy, so what's missing? Code generation, binary rewriting, and remote process control. These are not currently an internal need, have not been requested by users, and there are no spare resources for them.

Parallelism for fun and profit
Binaries continue to get larger and more complex. Front-end and workstation hardware also continues improving, but mostly by adding cores. Bottlenecks in Dyninst and its components:
- CFG construction
- Line and type info parsing
Take advantage of extra cores in our analysis layers!

An ECP-motivated use case
Working with HPCToolkit on the Exascale Computing Project (ECP). Goal: replace the mixture of homegrown, Dyninst, binutils, and other components inside HPCStruct with the Dyninst toolkits, improving precision, internal consistency, and performance.

Breaking down HPCStruct

Task | Previous approach | Dyninst operation
Analyze function boundaries | Scan based on symbols | ParseAPI control flow parse, reporting function extents
Analyze loop locations | Linear sweep disassembly from binutils | ParseAPI loop analysis
Map address ranges to source lines | Based on libdwarf and binutils | SymtabAPI line information parsing
Map address ranges to the set of inlined functions present | Based on SymtabAPI, or absent | Generated as part of SymtabAPI type parsing

Parallelizing DWARF: existing infrastructure
Dyninst has used libdwarf for many years. It provides a useful abstract representation of DWARF information, but it cannot be used in a multithreaded manner, and turnaround on new features and bug fixes is long.

Parallelizing DWARF: a necessary replacement
Elfutils includes libdw, a libdwarf replacement. It is not fully thread-safe yet, but takes multithreading seriously as a goal. Issues with migration:
- Diverging data structures
- API differences
- Lack of .eh_frame support
- Incomplete DWARF5 support
Nevertheless, the migration is complete.

Parallelizing DWARF: next steps
The tree structure of DWARF is parsed recursively, leading to natural parallelism. Implement a spawn/join model and test it for performance and correctness.
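The spawn/join idea can be sketched as follows. Since DWARF debug information is a tree of debugging information entries (DIEs), each child subtree can be parsed as an independent task and the results joined afterwards. The types and function here are illustrative stand-ins, not the libdw or Dyninst API.

```cpp
#include <future>
#include <memory>
#include <vector>

// A toy stand-in for a DWARF DIE: one unit of "parsing work" per node,
// plus an arbitrary number of child subtrees.
struct Node {
    int dies = 1;
    std::vector<std::unique_ptr<Node>> children;
};

// Spawn one task per child subtree, parse this node's own entry,
// then join the child results.
int parseSubtree(const Node &n) {
    std::vector<std::future<int>> tasks;
    tasks.reserve(n.children.size());
    for (const auto &c : n.children)   // spawn
        tasks.push_back(std::async(std::launch::async,
                                   [&c] { return parseSubtree(*c); }));
    int total = n.dies;                // "parse" this node
    for (auto &t : tasks)              // join
        total += t.get();
    return total;
}
```

In a real parser the per-node work is substantial enough that a task pool (rather than one thread per subtree) would be the sensible scheduling choice; the recursion structure stays the same.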

Parallelizing ParseAPI
This is a simultaneous graph-construction and graph-traversal problem, originally optimized for the single-threaded case. It naturally decomposes into parsing functions independently. Synchronization is needed where two threads work on the same data:
- Interprocedural edges
- Shared code
- Shared indexes of CFG elements

Initial approach
Functions and basic blocks become lockable and apply internal locking appropriately; external locking is used sparingly, when multiple operations must be transactional. Results:
- Roughly 2.7x speedup on HPCStruct
- After multithreading ParseAPI, single-threaded DWARF becomes 50% of HPCStruct run time
- Lock overhead is ~20% of HPCStruct run time
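The internal-versus-external locking split can be sketched like this (illustrative types, not the actual ParseAPI classes): each CFG object owns a mutex and locks itself for single-object updates, while callers take multiple locks at once only when an operation must be atomic across objects.

```cpp
#include <mutex>
#include <set>

// Each CFG object is lockable and protects its own local updates.
struct Block {
    std::mutex m;
    std::set<Block *> targets;

    void addEdge(Block *dst) {          // internal locking
        std::lock_guard<std::mutex> g(m);
        targets.insert(dst);
    }
};

// Moving an edge between blocks touches two objects, so it must be
// transactional: lock both endpoints together so no other thread
// observes a half-moved edge.
void moveEdge(Block &from, Block &to, Block *dst) {
    std::scoped_lock g(from.m, to.m);   // deadlock-free multi-lock
    from.targets.erase(dst);
    to.targets.insert(dst);
}
```

The ~20% lock overhead reported above is the cost of exactly these per-object acquisitions on hot parsing paths, which motivates the lock-free direction on the next slide.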

Towards a lock-free parser
Two key problems require locking: interprocedural edges and shared code.
- Shared code can be cloned: this duplicates a bit of data, but allows unlocked access
- Interprocedural edges are hard to get rid of
Key insight: block splits contain all the synchronization problems that occur at a function boundary.

Supporting CUDA binaries
Challenges at many levels of the toolkits:
- SymtabAPI: fat binaries (nested ELF files) are a new challenge; mapping cubins to archives is not a perfect match
- InstructionAPI: proprietary instruction format
- ParseAPI, DataflowAPI: what can we do without instruction-level information?

Dyninst 10.0 plans
Three design goals:
- Expose component-level interfaces in a more natural manner, avoiding redundancy
- Standardize parameter formats across APIs
- Ensure interfaces support any extra information necessary for thread safety and accuracy

A motivating example for Dyninst 10.0
The BPatch_statement class is entirely analogous to Symtab::Statement, but:
- it uses void* rather than Address/Offset for its address range,
- it is not an AddressRange,
- it uses const char* rather than std::string.
Why isn't BPatch_statement a Symtab::Statement, then? Annoying, inconsistent backwards-compatibility wrappers. What if it were a Symtab::Statement, and the wrapper functions went away? That is a big change to the interface; thoughts?
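The proposed unification can be sketched with hypothetical declarations (these are not the actual Dyninst headers; field and type names are illustrative): if BPatch_statement simply derived from the Symtab-style statement, the typed Address and std::string accessors would be inherited rather than re-exposed through void* and const char* wrappers.

```cpp
#include <string>

using Address = unsigned long;

// A statement is, among other things, an address range.
struct AddressRange {
    Address startAddr = 0;
    Address endAddr = 0;
};

// Symtab-style source statement: typed address range plus file/line.
struct Statement : AddressRange {
    std::string file;
    unsigned line = 0;
};

// "is-a" instead of "wraps-a": no conversion layer, no duplicated
// accessors with weaker types.
struct BPatch_statement : Statement {};
```

Any code that accepts a Statement then accepts a BPatch_statement unchanged, which is exactly the redundancy the 10.0 interface goals aim to eliminate.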

Standardizing function formats
At last count, there are at least six ways to return a collection of values in the Dyninst toolkits, and at least four ways to indicate an error. Some combinations (returning a pointer to a newly allocated container, with NULL meaning empty, error, or both) are obviously awkward. The interface grew organically, but it's time for a principled design.
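One possible principled shape, shown purely for illustration (this is not a Dyninst interface): the query fills a caller-owned container and returns success or failure separately, so an empty result set stays distinct from an error, unlike the ambiguous NULL-returning style.

```cpp
#include <string>
#include <vector>

// Hypothetical query in the standardized style: the return value
// signals only validity/failure; the out-parameter carries results.
bool findAll(const std::string &pattern,
             const std::vector<std::string> &pool,
             std::vector<std::string> &out) {
    out.clear();
    if (pattern.empty())
        return false;  // invalid query: a genuine error
    for (const auto &s : pool)
        if (s.find(pattern) != std::string::npos)
            out.push_back(s);
    return true;       // success, even when out is empty (no matches)
}
```

Whatever convention is chosen, the point is that one convention is chosen: callers should not have to remember per-function whether NULL, an empty container, or a sentinel value means failure.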

A GitHub year in review
- Integrated Jenkins into our repository
- 13 committers in the past year
- 44 unique cloners in the past two weeks
- ~120 issues reported in the past year; more than two-thirds of these resolved

Software info
Main project page: https://github.com/dyninst/dyninst
- Issue tracker
- Releases and manuals
- Test results
LGPL; contributions welcome.

Our current CFG structure
(Diagram: block A ends in a call to foo, with edge AB pointing to block B, foo's entry.) When block A is cut into A and C, with C being the second half of A, the call edge must move its source from A to C. Because the edge changes, B must be locked.

Splitting the calling block
(Diagram: after the split, edge AC connects block A to block C, which now contains the call to foo; edge CB runs from C to block B, foo's entry.) The call edge has moved its source from A to C. Because the edge changes, B must be locked; and because B may change, foo may need to be locked as well.