Efficient software checkpointing framework for speculative techniques
ECE Connections 2006
Chuck (Chengyan) Zhao
Department of Computer Science, University of Toronto
Co-Supervisors: Prof. Greg Steffan, Prof. Cristiana Amza
Jun. 09, 2006
Might need to introduce my supervisors to the audience.
Chip Multi-Processor (CMP) is now everywhere
  IBM: Power 4, Power 5
  Intel: Montecito, Smithfield
  AMD: dual-core Opteron, Athlon X2, four-core Opteron
  Sun: UltraSparc T1 (8 cores, 32 threads), UltraSparc T2 (8 cores, 64 threads)
  Sony, Toshiba, IBM: Cell (9 cores)
  …
  [Figures: Power 4, dual-core Intel chip, dual-core Opteron, Cell]
  Use CMP for single-threaded applications through parallelization
We are interested in improving the performance of a single application using the abundant CMP resources, most of which would otherwise stay idle most of the time.
Parallelization Techniques
  Automatic Parallelization
    conservative + precise: requires proof of non-dependence
    limited domain
  Speculative Parallelization
    non-conservative
    has to recover from failures
  Focus: speculative parallelization, using TLS
Thread-Level Speculation (TLS) Parallelism
  Code example:
    for ( …){
      …
      *p = …;
      … = … *q;
    }
  difficult to parallelize automatically
  uncertain dependence between *p and *q
  might be runtime or user-input dependent
1. Point at the slide while talking.
2. Turn each loop iteration into a thread.
3. Checkpointing scheme + dependence testing.
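To make the uncertainty concrete, here is a minimal sketch of a loop in the same shape as the slide's example. The names `run_loop` and `has_cross_iteration_raw`, and the index arrays `w` and `r` that decide where `*p` and `*q` point, are assumptions made for illustration and do not come from the talk. Whether a later iteration reads what an earlier one wrote depends entirely on the index arrays supplied at run time, which is why a compiler cannot prove independence statically. Only read-after-write dependences are checked here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical loop in the shape of the slide's example: each iteration
 * stores through one pointer (*p) and loads through another (*q).
 * The targets are chosen by run-time index arrays w[] and r[]. */
long run_loop(int *data, const size_t *w, const size_t *r, size_t n)
{
    long sum = 0;
    assert(data != NULL && w != NULL && r != NULL);
    for (size_t i = 0; i < n; i++) {
        int *p = &data[w[i]];        /* store target, known only at run time */
        const int *q = &data[r[i]];  /* load source, known only at run time */
        *p = (int)i + 1;
        sum += *q;
    }
    return sum;
}

/* Returns 1 if some later iteration loads a location that an earlier
 * iteration stored to (a cross-iteration read-after-write), else 0. */
int has_cross_iteration_raw(const size_t *w, const size_t *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (r[j] == w[i])
                return 1;
    return 0;
}
```

With `w = {0,1,2,3}` and `r = {4,5,6,7}` the iterations are independent and could safely run in parallel. Changing `r` to `{4,0,6,7}` makes iteration 1 read what iteration 0 wrote, so running them in parallel without checking would produce a wrong result.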
How Thread-Level Speculation works
  [Figure: TLS execution timeline (axis: Exec. Time); labels: …*q, *p…, violation, Recover, …*q]
  exploit available thread-level parallelism
We take a sequential program and carve it into threads, then execute those threads speculatively in parallel while watching for violations. The speculative part is that we do not know whether the threads are actually independent; instead, we depend on runtime support to tell us whether they were. Whenever a thread has violated a data dependence, we simply re-execute it so that it is redone with the proper value; otherwise we commit its speculative work. Even when speculation has failed, we can still reduce overall execution time by exploiting the available parallelism. If you are usually right, it is faster to apologize when you are wrong than to always ask for permission.
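The execute, detect, and re-execute cycle can be sketched as a single-threaded simulation. This is an illustrative model under stated assumptions, not the runtime described in the talk: `run_sequential`, `run_speculative`, `MAX_EPOCHS`, and the index arrays `w` and `r` are all hypothetical names. Each epoch (one loop iteration) stores to `data[w[k]]` and then loads `data[r[k]]`. All epochs first run against the pre-loop memory with their stores buffered, and then commit in program order. An epoch whose load was fed by stale memory counts as a violation and redoes its load before committing.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EPOCHS 16  /* assumed upper bound for this sketch */

/* Sequential reference: iteration k stores k+1 to data[w[k]],
 * then loads data[r[k]] and adds it to the sum. */
long run_sequential(int *data, const size_t *w, const size_t *r, size_t n)
{
    long sum = 0;
    for (size_t k = 0; k < n; k++) {
        data[w[k]] = (int)k + 1;
        sum += data[r[k]];
    }
    return sum;
}

/* Simulated speculation: returns the same sum as run_sequential and
 * reports how many epochs had to redo their load in *violations. */
long run_speculative(int *data, const size_t *w, const size_t *r,
                     size_t n, int *violations)
{
    int loaded[MAX_EPOCHS];
    long sum = 0;
    assert(n <= MAX_EPOCHS);
    *violations = 0;

    /* Phase 1: every epoch runs against pre-loop memory. Stores are
     * buffered, so a load only sees the epoch's own store. */
    for (size_t k = 0; k < n; k++)
        loaded[k] = (r[k] == w[k]) ? (int)k + 1 : data[r[k]];

    /* Phase 2: commit in program order, checking each epoch's load
     * against the stores of all earlier epochs. */
    for (size_t k = 0; k < n; k++) {
        int violated = 0;
        for (size_t i = 0; i < k; i++)
            if (r[k] == w[i] && r[k] != w[k])
                violated = 1;
        if (violated) {
            loaded[k] = data[r[k]];  /* redo the load with the committed value */
            (*violations)++;
        }
        data[w[k]] = (int)k + 1;     /* commit the buffered store */
        sum += loaded[k];
    }
    return sum;
}
```

When the index arrays never overlap across iterations, `*violations` stays at 0 and all the speculative work commits as is. When they do overlap, only the affected epochs are redone, which is the "apologize when you are wrong" case from the notes above.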
Memory Checkpointing
  Compiler Transformations
    mark region of interest
    back up each memory write (store)
    generate buffer refresh calls
    generate recovery code
    remove region marking delimiters
  Code example:
    start_instrument();
    setjmp(buf1);
    for(…){
      refresh_ckpt();
      backup_mem(a);
      a = …;
      backup_mem(b);
      b = …;
      …
    }
    if(error_spec()){
      ckp_restore();
      longjmp(buf1, 1);
    }
    stop_instrument();
Mention that those function calls are currently organized into a runtime library.
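One way such runtime calls could work is an undo log: save the old bytes before each store, then copy them back on failure. The sketch below reuses the names `refresh_ckpt`, `backup_mem`, and `ckp_restore` from the slide, but their signatures and bodies are assumptions, as are `LOG_ENTRIES`, `MAX_SIZE`, and the demo function `run_region`; the actual runtime library may be organized quite differently. The demo covers a single region rather than a loop, and a flag stands in for the slide's `error_spec()` check.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <string.h>

#define LOG_ENTRIES 64  /* assumed undo-log capacity */
#define MAX_SIZE    16  /* assumed largest single backup, in bytes */

struct undo_entry {
    void *addr;
    size_t size;
    unsigned char old[MAX_SIZE];
};

static struct undo_entry undo_log[LOG_ENTRIES];
static size_t undo_len;

/* Start a fresh checkpoint by discarding all earlier backups. */
void refresh_ckpt(void) { undo_len = 0; }

/* Save the bytes currently at addr, before the store overwrites them. */
void backup_mem(void *addr, size_t size)
{
    assert(undo_len < LOG_ENTRIES && size <= MAX_SIZE);
    undo_log[undo_len].addr = addr;
    undo_log[undo_len].size = size;
    memcpy(undo_log[undo_len].old, addr, size);
    undo_len++;
}

/* Undo the logged stores newest-first, so the oldest saved value wins. */
void ckp_restore(void)
{
    while (undo_len > 0) {
        undo_len--;
        memcpy(undo_log[undo_len].addr, undo_log[undo_len].old,
               undo_log[undo_len].size);
    }
}

/* Hypothetical instrumented region. Returns the number of attempts
 * it took; fail_first_attempt simulates a failed speculation. */
int run_region(int *a, int *b, int fail_first_attempt)
{
    jmp_buf buf;
    volatile int attempts = 0;  /* volatile: modified between setjmp and longjmp */

    refresh_ckpt();
    setjmp(buf);                /* re-entry point after recovery */
    attempts++;
    backup_mem(a, sizeof *a);
    *a += 10;
    backup_mem(b, sizeof *b);
    *b += 20;
    if (fail_first_attempt && attempts == 1) {
        ckp_restore();          /* roll memory back to the checkpoint */
        longjmp(buf, 1);        /* and re-execute the region */
    }
    return attempts;
}
```

Calling `run_region(&a, &b, 1)` with `a = 1` and `b = 2` rolls both variables back after the first attempt and re-runs the region, so the call returns 2 with `a == 11` and `b == 22`, the same final state as a run that never failed. Every store pays for a `memcpy` into the log, which is where the software overhead of this kind of checkpointing comes from.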
Preliminary Results: MCF in SPEC2KINT
  index  fname
  1      refresh_potential()
  2      bea_compute_red_cost()
  3      primal_bea_mpp()
  4      1 + 2
  5      1 + 3
  6      2 + 3
  7      1 + 2 + 3
Picked the SPEC2000 CPU INT benchmark suite (10 / 12 applications made available).
Remember to show the key point: performance degradation can be up to 50%, but there is large room for improvement.
Challenges and Future Work
  Challenges:
    software overhead
  Proposed Solutions: optimizations
    inlining
    optimal buffer sizing and refresh placement
    memory optimizations
  Applications
    value prediction
    debugging support
    reliability enhancement
    TLS (long term)
    ...
Mention that the challenge of software-only checkpointing is to significantly reduce the software overhead through aggressive optimizations.
Questions and Answers