Day 2: Building Process Virtualization Systems Kim Hazelwood ACACES Summer School July 2009.

Presentation transcript:

Day 2: Building Process Virtualization Systems Kim Hazelwood ACACES Summer School July 2009

ACACES 2009 – Process Virtualization
Course Outline
  Day 1 – What is Process Virtualization?
  Day 2 – Building Process Virtualization Systems
  Day 3 – Using Process Virtualization Systems
  Day 4 – Symbiotic Optimization

JIT-Based Process Virtualization
[Figure: block diagram with the stages Application, Transform, Code Cache, Execute, and Profile]

What are the Challenges?
Performance! Solutions:
  Code caches – only transform code once
  Trace selection – focus on hot paths
  Branch linking – only perform cache lookup once
  Indirect branch hash tables / chaining
Memory “management”
Correctness – self-modifying code, munmaps, multithreading
Transparency – context switching, eflags

What is the Overhead?
The latest Pin overhead numbers …

Sources of Overhead
Internal
  Compiling code & exit stubs (region detection, region formation, code generation)
  Managing code (eviction, linking)
  Managing directories and performing lookups
  Maintaining consistency (SMC, DLLs)
External
  User-inserted instrumentation

Improving Performance: Code Caches
[Flowchart; node labels: Start, Interpret, Branch Target Address, Hash Table Lookup (Hit / Miss), Counter++, Code is Hot? (Yes / No), Region Formation & Optimization, Room in Code Cache? (Yes / No), Evict Code, Delete, Insert, Update Hash Table, Exit Stub, Code Cache]
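The decision logic named in the flowchart can be sketched as a small dispatcher. This is a minimal illustration, not Pin's or Dynamo's actual code: the names `Dispatcher`, `Action`, and `onBranchTarget` are hypothetical, the "translation" is reduced to recording the address, and eviction is assumed to be a whole-cache flush, which is only the simplest of several possible policies.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

using Addr = std::uint64_t;

// What the VM should do with the code at a branch target (hypothetical names).
enum class Action { Interpret, TranslateAndRun, RunCached };

class Dispatcher {
public:
    Dispatcher(unsigned hotThreshold, std::size_t capacity)
        : hotThreshold_(hotThreshold), capacity_(capacity) {}

    Action onBranchTarget(Addr target) {
        // Hash table lookup: a hit means the region is already translated.
        if (cache_.count(target) != 0) return Action::RunCached;

        // Miss: bump the execution counter; keep interpreting until hot.
        if (++counters_[target] < hotThreshold_) return Action::Interpret;

        // Hot: make room if needed (assumption: flush everything), then insert.
        if (cache_.size() >= capacity_) cache_.clear();
        cache_.insert(target);
        counters_.erase(target);
        return Action::TranslateAndRun;
    }

    std::size_t cachedRegions() const { return cache_.size(); }

private:
    unsigned hotThreshold_;                        // executions before "hot"
    std::size_t capacity_;                         // max regions in the cache
    std::unordered_map<Addr, unsigned> counters_;  // per-target counters
    std::unordered_set<Addr> cache_;               // translated region starts
};
```

With a threshold of 3, the first two visits to an address are interpreted, the third triggers translation, and later visits run from the cache.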

Software-Managed Code Caches
Store transformed code at run time to amortize overhead of process VMs
Contain a (potentially altered) copy of application code
[Figure: block diagram with the stages Application, Transform, Code Cache, Execute, and Profile]

Code Cache Contents
Every application instruction executed is stored in the code cache (at least)
Code regions
  Altered copies of application code
  Basic blocks and/or traces
Exit stubs
  Swap application ↔ VM state
  Return control to the process VM

Code Regions
Basic blocks: BBL A = Inst1, Inst2, Inst3, Branch B
Traces: [Figure: CFG of blocks A, B, C, D; the same blocks as laid out in memory; and a trace built from them]

Exit Stubs
One exit stub exists for every exit from every trace or basic block
Functionality
  Prepare for context switch
  Return control to VM dispatch
Details
  Each exit stub ≈ 3 instructions
[Figure: trace A B D with exit stubs "Exit to C" and "Exit to E"]

Performance: Trace Selection
Interprocedural path
Single entry, multiple exit
[Figure: CFG of blocks A through I containing a call and a return; the blocks' layout in memory; and the layout in the code cache as a trace (superblock) A B D E G H I with exits to C and F]

Performance: Cache Linking
[Figure: Trace #1 with Exit #1a and Exit #1b, Trace #2, Trace #3, and the Dispatch routine]

Linking Traces
Proactive linking
Lazy linking
[Figure: CFG of blocks A through I with a call and a return, and three traces in the code cache: A B D E G H I (exits to C and F), C D E G H I (exit to F), and F H I (exit to A)]
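Lazy linking can be sketched as follows: an exit initially returns to the dispatcher, and the first time it is taken the dispatcher looks up the target trace and patches the exit so later executions bypass the lookup. This is a simplified model under stated assumptions, not Pin's implementation; `LinkingCache`, `ExitStub`, and `takeExit` are hypothetical names, and "patching" is modeled as storing a pointer rather than rewriting a branch.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Addr = std::uint64_t;

struct Trace;

struct ExitStub {
    Addr target;   // original-program address this exit leads to
    Trace* link;   // nullptr until linked; then points at the target trace
};

struct Trace {
    Addr start;
    std::vector<ExitStub> exits;
};

class LinkingCache {
public:
    // Insert a trace whose exits are all unlinked (they go via the dispatcher).
    Trace* add(Addr start, const std::vector<Addr>& exitTargets) {
        Trace& t = traces_[start];
        t.start = start;
        t.exits.clear();
        for (Addr target : exitTargets) t.exits.push_back(ExitStub{target, nullptr});
        return &t;
    }

    Trace* find(Addr start) {
        auto it = traces_.find(start);
        return it == traces_.end() ? nullptr : &it->second;
    }

    // Take exit `i` of `from`. Returns the next trace, or nullptr if the
    // target has not been translated yet.
    Trace* takeExit(Trace& from, std::size_t i) {
        ExitStub& stub = from.exits.at(i);
        if (stub.link != nullptr) return stub.link;  // linked: no lookup
        ++dispatcherEntries_;                        // unlinked: enter the VM
        Trace* next = find(stub.target);
        if (next != nullptr) stub.link = next;       // lazy link on first use
        return next;
    }

    std::size_t dispatcherEntries() const { return dispatcherEntries_; }

private:
    // Pointers to unordered_map elements stay valid across rehashing.
    std::unordered_map<Addr, Trace> traces_;
    std::size_t dispatcherEntries_ = 0;
};
```

A proactive variant would instead patch every matching exit at the moment a trace is inserted, trading extra work at insertion time for fewer dispatcher entries later.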

Are Links Highly Beneficial?

Benchmark   With Linking   Without Linking   Slow-down
gzip        230 sec        7951 sec          3357%
vpr         333 sec        2474 sec          643%
gcc         206 sec        3284 sec          1494%
mcf         368 sec        2014 sec          447%
crafty      215 sec        3547 sec          1550%
parser      350 sec        6795 sec          1841%
perlbmk     336 sec        6945 sec          1967%
gap         195 sec        4231 sec          2070%
vortex      382 sec        4655 sec          1119%
bzip2       287 sec        4294 sec          1396%
twolf       658 sec        6490 sec          886%
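The slow-down figures are consistent with measuring the extra run time relative to the linked run. The slide does not state the formula; it is inferred here from the numbers, and it reproduces every row after rounding. Worked for gzip:

```latex
\[
\text{slow-down} \;=\; \frac{T_{\text{without}} - T_{\text{with}}}{T_{\text{with}}} \times 100\%,
\qquad
\text{gzip: } \frac{7951 - 230}{230} \times 100\% \approx 3357\%
\]
```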

Code Cache Visualization

Challenge: Rewriting Instructions
We must regularly rewrite branches
No atomic branch write on x86
Pin uses a neat trick*:
  “old” 5-byte branch
  then: 2-byte self branch, followed by n-2 bytes of “new” branch
  then: “new” 5-byte branch
* Sundaresan et al. 2006
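The three stages of the trick can be simulated on a byte buffer. This is a sketch under assumptions, not Pin's code: `encodeJmp` and `patchBranch` are hypothetical helpers, the branch is assumed to be an x86 `jmp rel32` (opcode `E9`), the self branch is `jmp rel8` with displacement -2 (bytes `EB FE`), and the 2-byte stores that a real implementation would have to perform atomically are modeled here as plain byte writes.

```cpp
#include <array>
#include <cstdint>
#include <vector>

using Branch = std::array<std::uint8_t, 5>;  // E9 + 32-bit displacement

// Encode "jmp rel32" with a little-endian displacement.
Branch encodeJmp(std::int32_t rel) {
    const std::uint32_t u = static_cast<std::uint32_t>(rel);
    Branch b = {{0xE9,
                 static_cast<std::uint8_t>(u & 0xFF),
                 static_cast<std::uint8_t>((u >> 8) & 0xFF),
                 static_cast<std::uint8_t>((u >> 16) & 0xFF),
                 static_cast<std::uint8_t>((u >> 24) & 0xFF)}};
    return b;
}

// Rewrite `site` into `fresh`, returning a snapshot after each stage.
std::vector<Branch> patchBranch(Branch& site, const Branch& fresh) {
    std::vector<Branch> stages;

    // Stage 1: park any arriving thread on a 2-byte self branch (EB FE).
    site[0] = 0xEB;
    site[1] = 0xFE;
    stages.push_back(site);

    // Stage 2: the tail is unreachable now, so fill in the new bytes 2..4.
    for (std::size_t i = 2; i < site.size(); ++i) site[i] = fresh[i];
    stages.push_back(site);

    // Stage 3: replace the self branch with the head of the new branch.
    site[0] = fresh[0];
    site[1] = fresh[1];
    stages.push_back(site);

    return stages;
}
```

At every stage a thread fetching from the site sees either the complete old branch, the self branch, or the complete new branch, never a mix of old and new displacement bytes.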

Challenge: Achieving Transparency
Pretend as though the original program is executing
Original code (SPC):
  0x1000  call 0x4000
Code cache address mapping:
  0x1000 → 0x7000  “caller”
  0x4000 → 0x8000  “callee”
Translated code (TPC):
  0x7000  push 0x1006
  0x7006  jmp 0x8000
Push 0x1006 on stack, then jump to 0x4000
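The call translation on this slide can be sketched as a lookup from original addresses (SPC) to code cache addresses (TPC). The types and the function `translateCall` below are hypothetical, and the instruction length is passed in explicitly; the value 6 is chosen only so that the return address matches the slide's 0x1006.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using Addr = std::uint64_t;

struct TranslatedOp {
    std::string opcode;
    Addr operand;
};

// Maps original program addresses (SPC) to code cache addresses (TPC).
class AddressMap {
public:
    void map(Addr spc, Addr tpc) { spcToTpc_[spc] = tpc; }
    Addr tpc(Addr spc) const { return spcToTpc_.at(spc); }  // throws if unmapped

private:
    std::unordered_map<Addr, Addr> spcToTpc_;
};

// Translate "call calleeSpc" located at callSpc. The pushed return address is
// the ORIGINAL one, so the application never observes code cache addresses on
// its stack; only the jump target is redirected into the code cache.
std::vector<TranslatedOp> translateCall(const AddressMap& m, Addr callSpc,
                                        unsigned callLength, Addr calleeSpc) {
    std::vector<TranslatedOp> ops;
    ops.push_back(TranslatedOp{"push", callSpc + callLength});
    ops.push_back(TranslatedOp{"jmp", m.tpc(calleeSpc)});
    return ops;
}
```

Mapping 0x1000 to 0x7000 and 0x4000 to 0x8000, then translating the call at 0x1000, yields `push 0x1006` followed by `jmp 0x8000`, as on the slide.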

Challenge: Self-Modifying Code
The problem
  Code cache must detect SMC and invalidate corresponding cached traces
Solutions
  Many proposed … but without HW support, they are very expensive!
  Changing page protection
  Memory diff prior to execution
  On ARM, there is an explicit instruction for SMC!

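Self-Modifying Code Handler (written by Alex Skaletsky)

    #include <cstring>
    #include "pin.H"

    // Analysis routine: runs before each trace executes. If the original
    // code no longer matches the copy saved at instrumentation time, the
    // cached trace is stale: invalidate it and restart at the same point.
    VOID DoSmcCheck(VOID *traceAddr, VOID *traceCopyAddr,
                    USIZE traceSize, CONTEXT *ctxP)
    {
        if (memcmp(traceAddr, traceCopyAddr, traceSize) != 0) {
            CODECACHE_InvalidateTrace((ADDRINT)traceAddr);
            PIN_ExecuteAt(ctxP);
        }
    }

    // Instrumentation routine: saves a copy of the trace's original bytes
    // and inserts the check above before the trace.
    VOID InsertSmcCheck(TRACE trace, VOID *v)
    {
        ...
        memcpy(traceCopyAddr, traceAddr, traceSize);
        TRACE_InsertCall(trace, IPOINT_BEFORE, (AFUNPTR)DoSmcCheck,
                         IARG_PTR, traceAddr,
                         IARG_PTR, traceCopyAddr,
                         IARG_UINT32, traceSize,
                         IARG_CONTEXT,
                         IARG_END);
    }

    int main(int argc, char **argv)
    {
        PIN_Init(argc, argv);
        TRACE_AddInstrumentFunction(InsertSmcCheck, 0);
        PIN_StartProgram();  // Never returns
        return 0;
    }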

Challenge: Parallel Applications
[Figure: Pin architecture running two threads, T1 and T2, with components marked as serialized or parallel. Components shown: JIT Compiler, Syscall Emulator, Signal Emulator, Dispatcher, Code Cache, and the Pin Tool (Instrumentation Code, Call-Back Handlers, Analysis Code)]

Challenge: Code Cache Consistency
Cached code must be removed for a variety of reasons:
  Dynamically unloaded code
  Ephemeral/adaptive instrumentation
  Self-modifying code
  Bounded code caches
[Figure: block diagram with the stages EXE, Transform, Code Cache, Execute, and Profile]

Motivating a Bounded Code Cache
The Perl Benchmark

Flushing the Code Cache
Option 1: All threads have a private code cache (oops, doesn’t scale)
Option 2: Shared code cache across threads
  If one thread flushes the code cache, other threads may resume in stale memory

Naïve Flush
Wait for all threads to return to the code cache
Could wait indefinitely!
[Figure: timeline for Thread1, Thread2, and Thread3 alternating between the VM and code cache CC1, stalling in the VM during the flush delay, then continuing in CC2]

Generational Flush
Allow threads to continue to make progress in a separate area of the code cache
Requires a high water mark
[Figure: timeline for Thread1, Thread2, and Thread3 alternating between the VM and code cache CC1, then moving to CC2 without stalling]
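The bookkeeping behind a generational flush can be sketched with a per-generation count of threads still executing inside it. This is a single-threaded model under stated assumptions, not Pin's implementation: `GenerationalCache` and its methods are hypothetical, real code would need synchronization, and the flush is assumed to be triggered at a high water mark so that space remains for the new generation while the old one drains.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

class GenerationalCache {
public:
    // A thread enters the code cache; it runs in the current generation.
    int enter() {
        ++active_[current_];
        return current_;
    }

    // A thread returns to the VM from generation `gen`.
    void leave(int gen) {
        auto it = active_.find(gen);
        assert(it != active_.end() && it->second > 0);
        if (--it->second == 0 && gen != current_) {
            active_.erase(it);  // last thread left a retired generation
            ++freed_;           // its memory can now be reclaimed
        }
    }

    // Flush without stalling: retire the current generation and open a new
    // one. Threads already inside the old generation keep running there.
    void flush() {
        auto it = active_.find(current_);
        if (it == active_.end() || it->second == 0) {
            if (it != active_.end()) active_.erase(it);
            ++freed_;           // nobody inside: reclaim immediately
        }
        ++current_;
    }

    int currentGeneration() const { return current_; }
    std::size_t freedGenerations() const { return freed_; }

    // Retired generations that still have threads executing in them.
    std::size_t retiredPending() const {
        return active_.size() - active_.count(current_);
    }

private:
    int current_ = 0;
    std::size_t freed_ = 0;
    std::map<int, unsigned> active_;  // generation -> threads inside it
};
```

Unlike the naïve flush, no thread waits: the old generation is reclaimed only when its last occupant returns to the VM.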

Build-Your-Own Cache Replacement

    #include <iostream>
    #include "pin.H"

    // Callback invoked when the code cache fills up: flush everything.
    void FlushOnFull()
    {
        CODECACHE_FlushCache();
        std::cout << "SWOOSH!" << std::endl;
    }

    int main(int argc, char **argv)
    {
        PIN_Init(argc, argv);
        CODECACHE_CacheIsFull(FlushOnFull);  // register the callback
        PIN_StartProgram();  // Never returns
        return 0;
    }

    % pin -cache_size -t flusher -- /bin/ls
    SWOOSH!

Eviction Granularities
  Entire Cache
  One Cache Block
  One Trace
  Address Range

A Graphical Front-End

Memory Scalability of the Code Cache
Ensuring scalability also requires carefully configuring the code stored in the cache
Trace Lengths
  First basic block is non-speculative, others are speculative
  Longer traces = fewer entries in the lookup table, but more unexecuted code
  Shorter traces = two off-trace paths at ends of basic blocks with conditional branches = more exit stub code

Effect of Trace Length on Trace Count

Effect of Trace Length on Memory

Sources of Overhead
Internal
  Compiling code & exit stubs (region detection, region formation, code generation)
  Managing code (eviction, linking)
  Managing directories and performing lookups
  Maintaining consistency (SMC, DLLs)
External
  User-inserted instrumentation

Adding Instrumentation

“Normal Pin” Execution Flow
Instrumentation is interleaved with application
[Figure: time bars comparing the uninstrumented application with the “Pinned” application, whose run time is split into instrumented application, Pin overhead, and instrumentation overhead]

“SuperPin” Execution Flow
SuperPin creates instrumented slices
[Figure: time bars comparing the uninstrumented application with the SuperPinned application and its instrumented slices]

Issues and Design Decisions
Creating slices
  How/when to start a slice
  How/when to end a slice
System calls
Merging results

Execution Timeline
[Figure: execution timeline over CPU1 to CPU4. The original application repeatedly forks instrumented application slices S1 to S6 (continuations S1+ to S6+). Labels along the timeline include "record sigrN, sleep", "resume", "detect sigrN" for N = 2 to 6, and "detect exit".]

Performance – icount1
    % pin -t icount1 --

What Did We Learn Today?
Building Process VMs is only half the battle
  Robustness, correctness, performance are paramount
  Lots of “tricks” are in play
    Code caches, trace selection, etc.
Knowing about these tricks is beneficial
  Lots of research opportunities
  Understanding the inner workings often helps you write better tools

Want More Info?
Read the seminal Dynamo paper
See the more recent papers by the Pin, DynamoRIO, Valgrind teams
Relevant conferences: VEE, CGO, ASPLOS, PLDI, PACT
Course outline:
  Day 1 – What is Process Virtualization?
  Day 2 – Building Process Virtualization Systems
  Day 3 – Using Process Virtualization Systems
  Day 4 – Symbiotic Optimization