
Presentation transcript:

1 Improving Productivity With Fine-grain Compiler-based Checkpointing Chuck (Chengyan) Zhao Prof. Greg Steffan Prof. Cristiana Amza Allan Kielstra* Dept. of Electrical and Computer Engineering University of Toronto IBM Toronto Lab* Nov. 10, 2011

2 Productivity and Compilers. Programmer’s productivity is important: computers are fast and cheap; programmers are slow (relatively) and expensive. New way for a compiler to help? Automatic fine-grain checkpointing (CKPT); optimizations to reduce checkpoint overhead. Applications of checkpointing: accelerate the bug-finding process; automated support for backtracking algorithms. A compiler can improve programmer’s productivity via automatic CKPT.

3 Compiler Checkpointing (CKPT) Framework [diagram]. Input: annotated C/C++ source code. Stages: LLVM frontend; LLVM IR; Enable Checkpointing (Callsite Analysis, Inter-procedural Transformations, Intra-procedural Transformations, Special Cases Handling); Optimize Checkpointing (1. CKPT Inlining 2. Pre Optimize 3. Redundancy Eliminations 4. Hoisting 5. Aggregation 6. Non Rollback Exposed Store Elimination 7. Heap Optimize 8. Array Optimize 9. Post Optimize); Backend Process. Outputs: x86, x64, …, POWER, C/C++.

4 compiler-based checkpointing basics … a = 5; b = 7; … main program a: b: checkpoint buffer failure recovery (&a, 0) (&b, 0) main memory
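
The buffer on this slide can be sketched as a small undo log. This is only an illustration under assumptions: the fixed capacities, the `LogEntry` type, the `ckpt_log` array, and the `rollback()` helper are hypothetical names that do not appear in the slides; only `backup(addr, size)` mirrors the call shape shown later in the deck.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical undo log: each entry remembers an address and its old bytes. */
#define LOG_CAPACITY 64
#define MAX_BYTES 16

typedef struct {
    void         *addr;
    size_t        size;
    unsigned char old[MAX_BYTES];
} LogEntry;

static LogEntry ckpt_log[LOG_CAPACITY];
static size_t   ckpt_len = 0;

/* Save the current contents of [addr, addr + size) before it is overwritten. */
static void backup(void *addr, size_t size) {
    assert(ckpt_len < LOG_CAPACITY && size <= MAX_BYTES);
    LogEntry *e = &ckpt_log[ckpt_len++];
    e->addr = addr;
    e->size = size;
    memcpy(e->old, addr, size);
}

/* Failure recovery: undo in reverse order so the oldest saved value wins. */
static void rollback(void) {
    while (ckpt_len > 0) {
        LogEntry *e = &ckpt_log[--ckpt_len];
        memcpy(e->addr, e->old, e->size);
    }
}

/* Mirrors the slide: a and b start at 0, are backed up, then overwritten. */
static int basics_demo(void) {
    int a = 0, b = 0;
    backup(&a, sizeof a); a = 5;   /* log holds (&a, 0) */
    backup(&b, sizeof b); b = 7;   /* log holds (&a, 0), (&b, 0) */
    rollback();
    return a == 0 && b == 0;       /* nonzero when both old values are back */
}
```

Calling `basics_demo()` from a `main` is expected to return a nonzero value, since both variables are restored from the log.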

5 Transformations to Enable Checkpointing. 3 Steps: 1. Callsite analysis 2. Intra-procedural transformation 3. Inter-procedural transformation. Example: start_ckpt(); … backup(&a, sizeof(a)); a = …; handleMemcpy(…); memcpy(d, s, len); foo_ckpt(); foo(); … stop_ckpt(cond); foo(…){ /* body of foo() */ } foo_ckpt(…){ /* body of foo_ckpt() */ } …
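
A rough sketch of what the transformed code might look like at run time, under stated assumptions: the slides do not define the semantics of `stop_ckpt(cond)`, so here a nonzero `cond` is assumed to mean "roll back" and zero to mean "commit". The undo-log layout, `counter`, and `run_region()` are hypothetical; `foo()` and `foo_ckpt()` follow the slide's naming for an original function and its checkpointing clone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { LOG_CAP = 32, MAX_BYTES = 16 };
typedef struct { void *addr; size_t size; unsigned char old[MAX_BYTES]; } Entry;

static Entry  undo[LOG_CAP];
static size_t undo_len = 0;

/* Open a checkpoint region with an empty undo log. */
static void start_ckpt(void) { undo_len = 0; }

/* Record the old bytes of a location that is about to be written. */
static void backup(void *addr, size_t size) {
    assert(undo_len < LOG_CAP && size <= MAX_BYTES);
    undo[undo_len].addr = addr;
    undo[undo_len].size = size;
    memcpy(undo[undo_len].old, addr, size);
    undo_len++;
}

/* ASSUMPTION: nonzero cond means "roll back", zero means "commit". */
static void stop_ckpt(int cond) {
    while (cond && undo_len > 0) {
        undo_len--;
        memcpy(undo[undo_len].addr, undo[undo_len].old, undo[undo_len].size);
    }
    undo_len = 0;
}

static int counter = 0;

/* Original callee, used outside any checkpoint region. */
static void foo(void) { counter += 1; }

/* Clone used inside a region: same body, with a backup before the store. */
static void foo_ckpt(void) {
    backup(&counter, sizeof counter);
    counter += 1;
}

/* Hypothetical driver: one call outside the region, two inside it. */
static int run_region(int fail) {
    counter = 10;
    foo();            /* outside the region: no logging */
    start_ckpt();
    foo_ckpt();       /* call site redirected to the clone */
    foo_ckpt();
    stop_ckpt(fail);
    return counter;
}
```

With these assumptions, `run_region(0)` commits both increments made inside the region, while `run_region(1)` undoes them and keeps only the increment made before `start_ckpt()`.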

6 Optimize Checkpointing: Checkpointing Optimization Framework. 1. CKPT Inlining 2. Pre Optimization 3. Redundancy Eliminations (3 REs) 4. Hoisting 5. Aggregation 6. Non Rollback Exposed Store Elimination 7. DynMem (Heap) Optimization 8. Array Optimization 9. Post Optimization

7 Redundancy Elimination Optimization. RE1: remove all non-leading backup call(s). Algorithm: establish dominating relationship; promote leading backup call; re-establish dominating relationship among backup calls; eliminate all non-leading backup call(s). Example: start_ckpt(); … if (C){ backup(&a, sizeof(a)); a = …; } … backup(&a, sizeof(a)); a = …; … backup(&a, sizeof(a)); a = …; … stop_ckpt(cond);
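
The effect of RE1 on the example above can be sketched by hand-writing the region before and after the optimization. This is an illustration, not the compiler pass itself: the dominance analysis is done by inspection, the int-only `backup()` is a simplification of the slide's `backup(&a, sizeof(a))`, and having `stop_ckpt()` return the number of log entries used is an assumption added only to make the saving visible.

```c
#include <assert.h>
#include <stddef.h>

enum { LOG_CAP = 16 };
typedef struct { int *addr; int old; } Entry;

static Entry  undo[LOG_CAP];
static size_t undo_len = 0;

static void start_ckpt(void) { undo_len = 0; }

/* Simplified backup for int locations only. */
static void backup(int *addr) {
    assert(undo_len < LOG_CAP);
    undo[undo_len].addr = addr;
    undo[undo_len].old  = *addr;
    undo_len++;
}

/* ASSUMPTION: nonzero cond rolls back; the return value (entries used)
   is added here for illustration only. */
static size_t stop_ckpt(int cond) {
    size_t used = undo_len;
    while (cond && undo_len > 0) {
        undo_len--;
        *undo[undo_len].addr = undo[undo_len].old;
    }
    undo_len = 0;
    return used;
}

static int a;

/* Region as on the slide: one conditional and two unconditional backups. */
static size_t region_before(int c, int fail) {
    a = 1;
    start_ckpt();
    if (c) { backup(&a); a = 2; }
    backup(&a); a = 3;
    backup(&a); a = 4;
    return stop_ckpt(fail);
}

/* After RE1: a single promoted backup dominates every store to a. */
static size_t region_after(int c, int fail) {
    a = 1;
    start_ckpt();
    backup(&a);          /* leading backup, promoted above the branch */
    if (c) { a = 2; }
    a = 3;
    a = 4;
    return stop_ckpt(fail);
}
```

Both versions leave `a` with the same value on commit and on rollback; the optimized region simply logs one entry instead of two or three.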

8 Definition: Rollback Exposed Store. A Rollback Exposed Store is a store to a location with a possible previous load of that location; a Rollback Exposed Store needs backup. Example: int a, b; … start_ckpt(); … b = … a op …; … backup(&a, sizeof(a)); a = …; … stop_ckpt(cond); We must back up 'a' because the prior load of 'a' must access the "old" value on rollback, i.e., 'a' is "rollback exposed".

9 Non-Rollback Exposed Store Elimination (NRESE). NRESE is a new, checkpoint-specific optimization. Algorithm description: no use of the address (&a) on any path; the backup address (&a) isn’t aliased to anything (empty points-to set). Example: int a, b; … start_ckpt(); … backup(&a, sizeof(a)); a = …; … stop_ckpt(cond); There is no prior use of 'a', hence it is non-rollback-exposed and we can eliminate the backup of 'a'.
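
The difference between a rollback-exposed store and a non-exposed one can be seen by re-running a region after a simulated failure. The sketch below is a hand-built model, not the compiler analysis: `retry()`, `exposed_region()`, and `non_exposed_region()` are hypothetical names, and a plain local copy stands in for `backup(&a, sizeof(a))`.

```c
#include <assert.h>

static int a, b;

/* Rollback-exposed shape: 'a' is loaded before it is stored. */
static void exposed_region(void) {
    b = a + 1;   /* load of the pre-region value of a */
    a = 10;      /* rollback-exposed store */
}

/* NRESE shape: 'a' is stored before any use of it. */
static void non_exposed_region(void) {
    a = 10;      /* no earlier load or address use of a */
    b = a + 1;
}

/* Run a region, simulate a failure, optionally restore 'a' from a saved
   copy, then re-execute the region and return the resulting b. */
static int retry(void (*region)(void), int with_backup) {
    a = 1;
    b = 0;
    int saved_a = a;                /* stands in for backup(&a, sizeof(a)) */
    region();
    if (with_backup) a = saved_a;   /* rollback restores a only if backed up */
    region();                       /* re-execution after recovery */
    return b;
}
```

In this model, dropping the backup changes the re-executed result of `exposed_region` (the second run reads the overwritten `a`), while `non_exposed_region` produces the same `b` with or without the backup, which is the situation NRESE exploits.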

10 Applications

11 App1: CKPT enabled debugging. CKPT Region: T: safe point, earlier than P, that the program can reach through checkpoint recovery; P: root cause of a bug; Q: place where the bug manifests (a user or programmer notices the bug at this point). Key benefits: execution rewinding; arbitrarily large region; unlimited # of retries; no restart from beginning
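
One way to picture the T / P / Q workflow is a toy session that rewinds to T after the bug is noticed at Q and then re-executes with extra instrumentation to locate P. Everything here is an assumption made for illustration: the guard word `mem[N]` stands in for memory past a buffer, `buggy_fill()` and `debug_session()` are hypothetical, the backups are inserted by hand rather than by a compiler, and a nonzero argument to `stop_ckpt()` is assumed to mean "roll back".

```c
#include <assert.h>
#include <stddef.h>

enum { N = 4, LOG_CAP = 16, GUARD = 7 };
typedef struct { int *addr; int old; } Entry;

static Entry  undo[LOG_CAP];
static size_t undo_len = 0;

static void start_ckpt(void) { undo_len = 0; }

static void backup(int *addr) {
    assert(undo_len < LOG_CAP);
    undo[undo_len].addr = addr;
    undo[undo_len].old  = *addr;
    undo_len++;
}

/* ASSUMPTION: nonzero cond rolls back, zero commits. */
static void stop_ckpt(int cond) {
    while (cond && undo_len > 0) {
        undo_len--;
        *undo[undo_len].addr = undo[undo_len].old;
    }
    undo_len = 0;
}

/* mem[N] is a guard word standing in for whatever lives past the buffer. */
static int mem[N + 1];

/* Buggy region: the loop bound should be i < N. With watch set, report the
   first iteration that stores to the guard word (point P). */
static int buggy_fill(int watch) {
    int culprit = -1;
    for (int i = 0; i <= N; i++) {
        backup(&mem[i]);
        if (watch && culprit < 0 && &mem[i] == &mem[N]) culprit = i;
        mem[i] = 100 + i;
    }
    return culprit;
}

static int debug_session(void) {
    for (int i = 0; i < N; i++) mem[i] = 0;
    mem[N] = GUARD;

    start_ckpt();                       /* T: safe point */
    buggy_fill(0);
    int bug_seen = (mem[N] != GUARD);   /* Q: corruption is noticed */
    stop_ckpt(bug_seen);                /* rewind to T, no full restart */
    assert(mem[N] == GUARD);

    start_ckpt();                       /* retry from T with a watch enabled */
    int p = buggy_fill(1);
    stop_ckpt(1);                       /* rewind again; more retries possible */
    return p;                           /* iteration that corrupts the guard */
}
```

The second pass can be repeated as often as needed, which corresponds to the "unlimited # of retries" benefit listed on the slide.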

12 App2: CKPT enabled backtracking. CKPT Region: T: pick a pair of blocks to swap; Q: keep swap if improvement, discard otherwise. Proceed with VPR’s random/simulated-annealing based algorithm. Key benefits: automate support for backtracking (backup actions, abort, commit); cover arbitrarily complex algorithm; cleaner code, simplify programming; programmer focus on algorithm
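
The swap-and-evaluate loop can be sketched with a checkpoint region around each tentative move. This is a toy model under assumptions: the `cost()` function is a made-up placement metric, not VPR's; `try_swap()` is a hypothetical helper; the backups are written by hand where the framework would insert them automatically; and a nonzero argument to `stop_ckpt()` is assumed to mean "abort".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { LOG_CAP = 16 };
typedef struct { int *addr; int old; } Entry;

static Entry  undo[LOG_CAP];
static size_t undo_len = 0;

static void start_ckpt(void) { undo_len = 0; }

static void backup(int *addr) {
    assert(undo_len < LOG_CAP);
    undo[undo_len].addr = addr;
    undo[undo_len].old  = *addr;
    undo_len++;
}

/* ASSUMPTION: nonzero cond aborts (rolls back), zero commits. */
static void stop_ckpt(int cond) {
    while (cond && undo_len > 0) {
        undo_len--;
        *undo[undo_len].addr = undo[undo_len].old;
    }
    undo_len = 0;
}

/* Hypothetical cost: total distance of each block from its slot index. */
static int cost(const int *p, int n) {
    int c = 0;
    for (int i = 0; i < n; i++) c += abs(p[i] - i);
    return c;
}

/* Tentatively swap blocks i and j; keep the swap only if cost improves. */
static int try_swap(int *p, int n, int i, int j) {
    int before = cost(p, n);
    start_ckpt();                          /* T: pick a pair to swap */
    backup(&p[i]);
    backup(&p[j]);
    int tmp = p[i]; p[i] = p[j]; p[j] = tmp;
    int discard = cost(p, n) >= before;    /* Q: evaluate the move */
    stop_ckpt(discard);                    /* abort restores both blocks */
    return !discard;
}
```

The algorithm code never restores state explicitly: the abort path is just `stop_ckpt(1)`, which is the simplification the slide attributes to checkpoint-based backtracking.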

13 Evaluation

14 Platform and Benchmarks. Evaluation Platform: Core i7 920, 12GB DDR3, 200GB SATA; Debian6-i386, gcc/g LLVM-2.9. Benchmarks: BugBench: programs with buffer-overflow bugs, 3 CKPT regions per program (Small, Medium, Large); VPR: FPGA CAD tool, 1 CKPT region. CKPT Comparison: libCKPT (U. Tennessee); ICCSTM (Intel ICC based STM)

15 Compare with Coarse-grain Scheme: libCKPT. HUGE gain over coarse-grain libCKPT

16 Compare with Fine-grain Scheme: ICCSTM. Better than best-known fine-grain ICCSTM

17 RE1 Optimization: buffer size reduction [chart, percentage axis]. RE1 is the single most-effective optimization

18 Post RE1 Optimization: buffer size reduction [chart, percentage axis]. Other optimizations also contribute

19 Conclusion. CKPT Optimization Framework: compiler-driven, automatic, software-only; compiler analysis and optimizations; X less overhead: over coarse-grain scheme; 4-50X improvement: over fine-grain scheme. CKPT-supported Apps: debugger: execution rewind in time; up to 98% CKPT buffer size reduction; up to 95% backup call reduction; VPR: automatic software backtracking, only 15% CKPT overhead

20 Questions and Answers ?

21 Algorithm: Redundancy Elimination 1. Steps: 1. Build dominating relationship (DOM) among backup calls 2. Identify leading backup call 3. Promote suitable leading backup call 4. Remove non-leading backup call(s)

22 Algorithm: NRESE. The backup address is NOT aliased to anything (its points-to set is empty) AND, on any path from the beginning of the CKPT region to the respective write, there is no use of the backup address (the value can be independently re-generated without needing itself)

23 Buffer Schemes: 1D array vs. Hash Tables

24 Compare with Coarse-grain Scheme: libCKPT [chart, axis ticks from 10X to 100KX]. HUGE gain over coarse-grain libCKPT

25 Compiler Checkpointing (CKPT) Framework [diagram]. Input: annotated C/C++ source code. Stages: LLVM IR; Enable Checkpointing; Optimize Checkpointing (1. CKPT Inlining 2. Pre Optimize 3. Redundancy Eliminations 4. Hoisting 5. Aggregation 6. Non Rollback Exposed Store Elimination 7. Heap Optimize 8. Array Optimize 9. Post Optimize); Backend Process. Outputs: x86, x64, …, Power, C/C++.

26 CKPT Enabled Debugging. Key benefits: execution rewinding; arbitrarily large region; unlimited # of retries; no restart

27 Compare with Fine-grain Scheme: ICCSTM. Better than best-known fine-grain solution

28 Redundancy Elimination Optimization 1. RE1: keep only dominating backup call. Algorithm: establish dominating relationship among backup calls; promote leading backup call; eliminate all non-leading backup call(s). Example: start_ckpt(); … backup(&a, sizeof(a)); a = …; … backup(&a, sizeof(a)); a = …; … if (C){ backup(&a, sizeof(a)); a = …; … } … stop_ckpt(c);

29 CKPT Support for Automatic Backtracking (VPR) [flowchart]. Initial guess; obtain a new result (manual CKPT); check result; if good, commit and continue; if bad, abort and try next. CKPT automates the process, regardless of backtracking complexity


31 Key benefits automate support for backtracking backup actions abort commit cover arbitrarily complex algorithm cleaner code, simplify programming programmer focus on algorithm

32 App2: CKPT enabled backtracking [flowchart]. Initial Guess; Evaluate (manual CKPT); if bad, Reset Data; if good, Commit Data; Finish when stop condition reached. Key benefits: automate support for backtracking (backup actions, abort, commit); cover arbitrarily complex algorithm; cleaner code, simplify programming; programmer focus on algorithm

33 Key benefits automate CKPT process backup actions abort commit cover arbitrarily complex algorithm simplify programming programmer focus on algorithm

34 1. CKPT Inlining 2. Pre Optimize 3. Redundancy Eliminations 4. Hoisting 5. Aggregation 6. Non Rollback Exposed Store Elimination 7. Heap Optimize 8. Array Optimize 9. Post Optimize

35 How Can A Compiler Help Checkpointing? Enable CKPT: compiler transformations. Optimize CKPT: do standard optimizations apply? support CKPT-specific optimizations? CKPT Uses: debugging; backtracking

36 Optimization: buffer size reduction [chart, percentage axis]. Up to 98% of CKPT buffer size reduction