Persistent Code Caching Exploiting Code Reuse Across Executions & Applications † Harvard University ‡ University of Colorado at Boulder § Intel Corporation.

Presentation transcript:

Persistent Code Caching: Exploiting Code Reuse Across Executions & Applications. Vijay Janapa Reddi †, Dan Connors ‡, Robert Cohn §, Michael D. Smith †. † Harvard University, ‡ University of Colorado at Boulder, § Intel Corporation.

Runtime compilation systems: execution environments that provide an interface to the dynamic instruction stream of an application.
Uses of runtime compilation systems: process managers, resource management, program introspection.
Overheads: 1. Runtime compilation 2. Performance of the compiled code

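Managing compilation overhead via software code caching
[Figure: execution-time timeline comparing the original dynamic instruction stream (A B C C A) with execution under the runtime system (RS), where the translated blocks A', B', and C' are placed in the code cache and cached code is reused.]
Basis: 90% of execution time is spent in 10% (hot) code.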
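The caching idea on this slide can be sketched as a lookup table that translates each block once and reuses the result afterwards. This is a simplified model, not Pin's implementation: `CodeCache`, `lookup_or_translate`, and `run_stream` are hypothetical names, and a "translation" is reduced to handing out an integer ID.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical model of a software code cache: each original block is
// translated on first use and the translation is reused afterwards.
struct CodeCache {
    std::unordered_map<std::uint64_t, int> translated;  // block address -> translation id
    int translations = 0;  // cache misses, each one costs a compilation
    int reuses = 0;        // cache hits, no compilation needed

    // Returns the translation id for `addr`, "compiling" it on a miss.
    int lookup_or_translate(std::uint64_t addr) {
        auto it = translated.find(addr);
        if (it != translated.end()) {
            ++reuses;
            return it->second;
        }
        int id = translations++;
        translated.emplace(addr, id);
        return id;
    }
};

// Feeds a dynamic block stream through the cache and returns how many
// translations this stream triggered.
inline int run_stream(CodeCache& cache, const std::vector<std::uint64_t>& stream) {
    int before = cache.translations;
    for (std::uint64_t addr : stream) cache.lookup_or_translate(addr);
    return cache.translations - before;
}
```

For the stream A B C C A, this model performs three translations and two reuses. Caching pays off when the reuse count is high relative to the translation count.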

Problem statement: There exist execution domains where code caching is ineffective, which limits the deployment of runtime compilation systems.
Highlight of this talk: challenges in deploying dynamic binary instrumentation into production regression testing environments (case study of the Oracle database).

Caching performance varies based on program behavior. 181.mcf: loop-intensive application. 176.gcc: large code footprint and infrequent code reuse.
[Figure: execution time split into runtime compilation and code cache.]

Caching performance varies based on program behavior
[Figure: normalized execution time, split into runtime compilation and code cache, for Mcf, Eon, Vpr, Twolf, Gap, Bzip2, Gzip, Parser, Vortex, Crafty, Perl, and Gcc. The benchmarks range from loop intensive (frequent reuse) to large footprint (infrequent reuse).]

Benchmark 176.gcc is not an outlier. GUI applications: large startup cost; library initialization code executed < 10 times.
[Figure: normalized execution time, split into runtime compilation and code cache, for Oracle, Gedit, Dia, Gvim, File Roller, Gftp, and Gqview.]

Code caching suffers under certain execution behaviors: less code reuse, large code footprint, short run times.
Not uncommon! Regression testing: Oracle (100,000 tests), Gcc (4000+ tests), 176.gcc (5 SPEC reference inputs).
Cold code is hot code across executions!

Caching code across executions improves caching performance
[Figure: execution-time timelines for the original dynamic instruction stream (A B C C A), for caching in Run 1 and in Run 2 (the runtime system, RS, produces A', B', and C' in each run), and for persistent caching in Run 2 (A', B', C', A' are executed from the stored cache).]
Reduce overhead by storing and reusing caches.
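Storing and reusing a cache across runs can be sketched as a save step at the end of Run 1 and a load step at the start of Run 2. The record format below (one text line per block) and the names `Cache`, `save_cache`, and `load_cache` are assumptions made for illustration. The slides do not describe the on-disk format Pin uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <utility>
#include <vector>

// Hypothetical in-memory cache: original block address -> translated code bytes.
using Cache = std::map<std::uint64_t, std::vector<std::uint8_t>>;

// Writes one "<address> <size> <byte> <byte> ..." record per cached block.
inline void save_cache(const Cache& cache, std::ostream& out) {
    for (const auto& entry : cache) {
        out << entry.first << ' ' << entry.second.size();
        for (std::uint8_t b : entry.second) out << ' ' << static_cast<unsigned>(b);
        out << '\n';
    }
}

// Reads records written by save_cache. Stops at end of input, and returns
// the blocks read so far if a record is truncated.
inline Cache load_cache(std::istream& in) {
    Cache cache;
    std::uint64_t addr = 0;
    std::size_t size = 0;
    while (in >> addr >> size) {
        std::vector<std::uint8_t> code(size);
        for (std::size_t i = 0; i < size; ++i) {
            unsigned b = 0;
            if (!(in >> b)) return cache;
            code[i] = static_cast<std::uint8_t>(b);
        }
        cache.emplace(addr, std::move(code));
    }
    return cache;
}
```

A second run that starts from `load_cache` finds A', B', and C' already present, so it skips their translation. A real system would also need to check that the stored translations are still valid for the new process.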

Implementation framework: Pin (dynamic binary instrumentation)
[Figure: Pin architecture. Labels: Address Space, Operating System, Hardware, Application, Client, Runtime System Components, Code Cache Interface.]
Appropriate system for evaluating persistence: general model, robust design, enterprise-scale usage.

Persistent Pin
Persistent cache contents: translated code, translation data structures, correctness metadata.
[Figure: Pin architecture extended with a Persistence Mgr. and a Persistent Cache DB. Other labels: Address Space, Operating System, Hardware, Application, Client, Pin Components, Code Cache Interface.]
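The slide lists correctness metadata as part of the persistent cache but does not say what it contains. One plausible check, shown here purely as an assumption, is to record an identity for each module that translations came from and to reuse a translation only if the same module is loaded unchanged at the same address. `ModuleKey` and `module_still_valid` are hypothetical names.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical identity of one loaded image (executable or shared library).
struct ModuleKey {
    std::string path;     // file the image was loaded from
    std::uint64_t base;   // load address in the process
    std::uint64_t size;   // mapped size in bytes
    std::uint64_t mtime;  // last-modified timestamp of the file

    bool operator==(const ModuleKey& o) const {
        return path == o.path && base == o.base && size == o.size && mtime == o.mtime;
    }
};

// A cached translation is treated as reusable only if the module it was
// generated from appears, unchanged, among the currently loaded modules.
inline bool module_still_valid(const ModuleKey& cached,
                               const std::vector<ModuleKey>& loaded) {
    for (const ModuleKey& m : loaded) {
        if (m == cached) return true;
    }
    return false;
}
```

Under this scheme, a library that was rebuilt or mapped at a different base address fails the check, and its blocks are translated again instead of being reused.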

Experimental setup: IA32 Linux implementation. Bounded cache (320MB): applications ran unmodified, and no cache flushes occurred.
[Figure: a first run of Pin on Input X starts from an empty cache and produces Persistent Cache X; a second run of Pin on Input ? starts from Persistent Cache X; the improvement is measured.]

Exploiting code reuse across executions and applications: same-input, cross-input, and cross-application. Code coverage: bull's eye (100% reuse).

Persistent caching works across program classes SPEC 2000 INT (Reference inputs) Benefits large code footprint applications Persistent caching is complementary to the current code caching model

Persistent caching is effective for short-running applications. The input data set alters program behavior. Small improvements get bigger (Gap) and large improvements get even larger (Gcc).

Evaluating persistent caching across program inputs
[Figure: code coverage between inputs, on an axis from 50% to 100%, for Oracle, 175.vpr, 253.perlbmk, 176.gcc, 164.gzip, and 256.bzip2.]
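The slides do not define code coverage between inputs. A working definition, assumed here, is the fraction of the blocks a new run needs that are already present in a cache produced by an earlier run. The function name `code_coverage` and the use of block start addresses as identifiers are illustrative choices.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Fraction of the blocks in `needed` that are already in `cached`.
// An empty `needed` set is treated as fully covered.
inline double code_coverage(const std::set<std::uint64_t>& cached,
                            const std::set<std::uint64_t>& needed) {
    if (needed.empty()) return 1.0;
    std::size_t hits = 0;
    for (std::uint64_t addr : needed) {
        if (cached.count(addr) != 0) ++hits;
    }
    return static_cast<double>(hits) / static_cast<double>(needed.size());
}
```

With this definition, a same-input rerun gives a coverage of 1.0, and a cross-input run falls below 1.0 by the share of blocks that only the new input exercises.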

Production environments require runtime system improvements. Case study: regression testing of Oracle XE. Oracle: 80s. Oracle + Pin (translation): 2000s (a 25x slowdown). Oracle + Pin (translation) + instrumentation (memory tracing): 3000s (a 37.5x slowdown). And that is for one unit-test!

Oracle is a multi-process programming environment. Oracle's execution phases: Start, Mount, Open, Work, Close.
Challenges: 1. Large number of process compilations.

Processes exhibit code sharing. Oracle's execution phases: Start, Mount, Open, Work, Close.
[Figure: the block sequence A C C B Z, shown twice.]
Challenges: 1. Large number of process compilations. 2. Redundant translations across processes.

Every Oracle unit-test starts a new instance of the database. Oracle's execution phases: Start, Mount, Open, Unit-test 1, Close; then Start, Mount, Open, Unit-test 2, Close. The unit-test phase is the only phase changing across all unit-tests, and every unit-test executes all phases.
Challenges: 1. Large number of process compilations. 2. Redundant translations across processes. 3. Redundant translations across unit-tests.

Leveraging persistence across processes. Persistent cache (Start): low code coverage (15%). Persistent cache (Open): high code coverage (77%).

Persistent Cache Accumulation (PCA) addresses limited code coverage. Accumulate code across executions.
[Figure: Pin on Input X starts from an empty cache and produces Persistent Cache X; Pin on Input Y starts from Persistent Cache X and produces Persistent Cache X+Y; the timed run of Pin on Input Z uses Persistent Cache X+Y.]
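Accumulation can be sketched as growing one persistent set of translated blocks run after run, so that later runs find more of their code already present. This models only which blocks are cached, not the translated code itself. `BlockSet` and `accumulate_run` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical persistent cache contents: the set of translated block addresses.
using BlockSet = std::set<std::uint64_t>;

// Adds every block executed by `run` to `persistent` and returns how many
// of them were missing, i.e. how many translations this run had to perform.
inline std::size_t accumulate_run(BlockSet& persistent,
                                  const std::vector<std::uint64_t>& run) {
    std::size_t newly_translated = 0;
    for (std::uint64_t addr : run) {
        if (persistent.insert(addr).second) ++newly_translated;
    }
    return newly_translated;
}
```

Starting from an empty cache, Input X pays for all of its blocks and Input Y pays only for blocks X never touched. A timed run on Input Z then translates only what neither X nor Y covered.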

Persistent Cache Accumulation (PCA) improves unit-test performance Accumulated persistent caches Performance improves with more accumulation of code

Contributions: improved code caching reuse. Cold code is hot code! Persistence is effective for workloads with less code reuse, short run times, and large code footprints. Robust and performance-efficient implementation. Production environment regression testing study.

Backup Slides

Future research questions: Selective persistent caching (cache only cold/hot code). Effectiveness of optimizations across inputs and applications. Impact of excessive cache accumulation.

Persistent cache sizes: the translation data structures (DS) are larger than the code cache (CC)!

Cross-input persistence reduces re-translation across inputs. Persistence is effective even across changing input data sets.
[Figure: execution time without persistence, for re-invocation with persistence using a previously cached execution, and for re-invocation with persistence using a cache from a different input for a previously unseen input. ~30% improvement via cross-input persistence.]

Persistent instrumentation issues

    // Called upon every instruction execution
    VOID Analysis(COUNTER * counter) { (*counter)++; }

    // Called once per instruction compilation
    VOID Instrumentation(INS ins, VOID *v) {
        // Dynamically allocated memory
        STATS * stats = new STATS(INS_Address(ins));
        INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(Analysis), IARG_PTR, &stats->counter, …);
        …
    }

    VOID main(INT32 argc, CHAR *argv[]) {
        …
        INS_AddInstrumentFunction(Instrumentation, 0);
        …
        PIN_StartProgram();
    }

Issues: memory allocation happens during cache generation, so the pointer is invalid during cache reuse.
Solution: allocate memory using the Persistent Memory Allocator.
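The slide names a Persistent Memory Allocator as the fix but does not describe how it works. One way to avoid stale heap pointers, shown here only as an assumed design, is to hand out offsets into a single arena that is saved with the cache and restored in later runs. `PersistentArena` and its methods are hypothetical and are not part of the Pin API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical offset-based arena: instrumentation data lives in one buffer
// that can be saved next to the cache and restored by a later run.
class PersistentArena {
public:
    // Reserves `bytes` zero-initialized bytes and returns their offset.
    std::size_t allocate(std::size_t bytes) {
        std::size_t offset = storage_.size();
        storage_.resize(offset + bytes, 0);
        return offset;
    }

    // Reads the 64-bit counter stored at `offset`.
    std::uint64_t load_counter(std::size_t offset) const {
        assert(offset + sizeof(std::uint64_t) <= storage_.size());
        std::uint64_t value = 0;
        std::memcpy(&value, storage_.data() + offset, sizeof value);
        return value;
    }

    // Adds one to the 64-bit counter stored at `offset`.
    void increment_counter(std::size_t offset) {
        std::uint64_t value = load_counter(offset) + 1;
        std::memcpy(storage_.data() + offset, &value, sizeof value);
    }

    // Raw bytes to persist, and the matching restore step for a later run.
    const std::vector<unsigned char>& bytes() const { return storage_; }
    static PersistentArena restore(const std::vector<unsigned char>& saved) {
        PersistentArena arena;
        arena.storage_ = saved;
        return arena;
    }

private:
    std::vector<unsigned char> storage_;
};
```

An offset stays meaningful after `restore`, whereas a raw pointer obtained from `new` in the first run refers to memory that does not exist in the second process. A real allocator might instead map the arena at a fixed address so that raw pointers embedded in translated code remain valid.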

Inter-application persistence exploits redundancy of library translations. Libraries (DSO): initialization; toolkits/packages (X11, GTK+, FLTK).
[Figure: Application A (Input X) and Application B (Input Y) each run under Pin from an empty cache, producing Persistent Cache X and Persistent Cache Y. In the timed run, Application A runs Input X with Persistent Cache Y and Application B runs Input Y with Persistent Cache X.]

Inter-application persistence verifies that a large amount of time is spent initializing library routines: ~60% improvement.

Processes exhibit code sharing. Oracle's execution phases: Start, Mount, Open, Work, Close.
Challenges: 1. Large number of process compilations. 2. Redundant translations across processes.
fork() and exec(): exec() loses the parent's cache, so the child may re-translate parent code!