Published byえの ちゃわんや Modified over 6 years ago
2
Implementation of Efficient Check-pointing and Restart on CPU - GPU
Sumanth Suraneni, Sharath Prasad, Harsha Sutaone
9/17/2018
3
Introduction
GPU
CPU-GPU systems
Checkpoint GPU on CPU-GPU
Restart from checkpoint on CPU
4
Motivation
CPU-GPU systems run general-purpose workloads
Dependability is an issue in future GPUs
GPU fault tolerance is a nascent field
Checkpointing implementations on GPUs are at the application level
We explore micro-architectural changes to GPUs
5
Background
OpenCL Programming Model
Southern Islands Architecture
Multi2Sim CPU-GPU Simulator
6
OpenCL Programming Model
7
OpenCL Programming Model
Simplified Mapping of OpenCL onto AMD Accelerated Parallel Processing
8
OpenCL Programming Model
Work-item Grouping into Work-groups and Wavefronts
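The grouping named on this slide can be sketched with a little arithmetic. This is a minimal illustration, not code from the presentation: it assumes a 1-D ND-Range, a global size that is a multiple of the work-group size, and the 64-work-item wavefront of AMD Southern Islands hardware. The function names `decompose` and `locate` are hypothetical.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront on AMD Southern Islands


def decompose(global_size, local_size, wavefront_size=WAVEFRONT_SIZE):
    """Return (work_groups, wavefronts_per_group) for a 1-D ND-Range."""
    if global_size % local_size != 0:
        # Assumption of this sketch: the global size divides evenly.
        raise ValueError("global size must be a multiple of the work-group size")
    work_groups = global_size // local_size
    # Ceiling division: a partly filled wavefront still occupies a slot.
    wavefronts_per_group = -(-local_size // wavefront_size)
    return work_groups, wavefronts_per_group


def locate(global_id, local_size, wavefront_size=WAVEFRONT_SIZE):
    """Map a global work-item ID to (work-group, wavefront in group, lane)."""
    group_id, local_id = divmod(global_id, local_size)
    wavefront_id, lane = divmod(local_id, wavefront_size)
    return group_id, wavefront_id, lane
```

For example, `decompose(1024, 256)` describes 4 work-groups of 4 wavefronts each, and `locate(200, 256)` places work-item 200 in work-group 0, wavefront 3, lane 8.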
9
Southern Islands Architecture
10
Southern Islands Architecture
Compute Unit
11
Southern Islands Architecture
Kernel State
12
Multi2Sim CPU-GPU Simulator
Software entities defined in the OpenCL Programming Model
An ND-Range is formed of work-groups, which are, in turn, sets of work-items executing the same OpenCL C kernel code
13
Multi2Sim CPU-GPU Simulator
Interaction between user code, OS code, and hardware, comparing native and simulated environments
14
Multi2Sim CPU-GPU Simulator
Running an OpenCL Kernel on a Southern Islands GPU
Block Diagram of a Compute Unit
15
Implementation
SIEmuCreate(): Assign global memory. List running and waiting work-groups.
SIEmuRun(): Dequeue & enqueue running work-groups and waiting work-groups. Work-group create.
si_wavefront_execute(): Instruction dump. Next PC = Current PC + Instruction Size.
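The flow on this slide (move work-groups between a waiting list and a running list, then step each wavefront with next PC = current PC + instruction size) can be sketched as a toy model. This is not Multi2Sim source: the `Wavefront` class, the `run` function, the `max_running` limit, and the instruction sizes in the usage example are all hypothetical and exist only to show the control flow.

```python
from collections import deque


class Wavefront:
    """Toy wavefront: `program` maps a byte address to an instruction size."""

    def __init__(self, program):
        self.program = program
        self.pc = 0
        self.inst_count = 0
        self.finished = False

    def execute(self):
        """Execute one instruction: next PC = current PC + instruction size."""
        size = self.program[self.pc]
        self.pc += size
        self.inst_count += 1
        if self.pc not in self.program:  # ran past the last instruction
            self.finished = True


def run(waiting, max_running=2):
    """Step work-groups until all finish; return their IDs in finish order.

    `waiting` is a deque of (work_group_id, [Wavefront, ...]) pairs.
    """
    running = deque()
    finished = []
    while waiting or running:
        # Promote waiting work-groups while there is room to run them.
        while waiting and len(running) < max_running:
            running.append(waiting.popleft())
        group_id, wavefronts = running.popleft()
        for wf in wavefronts:
            if not wf.finished:
                wf.execute()
        if all(wf.finished for wf in wavefronts):
            finished.append(group_id)
        else:
            running.append((group_id, wavefronts))  # re-enqueue for next step
    return finished
```

With a made-up program `{0: 4, 4: 8, 12: 4}`, each wavefront executes three instructions and ends with its PC at 16.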
16
Implementation
Checkpoint Implementation
ND-Range: ID, work dimension, number of VGPRs & SGPRs used
Work-group: ID, work-groups finished, wavefronts completed & at barrier, wavefront count
Wavefront: ID, SREGs, execution state of wavefront, instruction count
Work-item: ID, VREGs, global memory access size & address
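One way to picture the checkpointed state listed on this slide is as nested records that serialize to a file. The sketch below is an assumption-laden illustration, not the presenters' format: the class and field names are hypothetical, JSON is chosen only for readability, and the "work-groups finished" entry is kept as a list on the ND-Range record for convenience.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class WorkItem:
    id: int
    vregs: List[int]                 # vector registers of this work-item
    global_mem_access_size: int = 0
    global_mem_access_addr: int = 0


@dataclass
class Wavefront:
    id: int
    sregs: List[int]                 # scalar registers shared by the wavefront
    exec_state: str                  # hypothetical encoding of execution state
    inst_count: int
    work_items: List[WorkItem] = field(default_factory=list)


@dataclass
class WorkGroup:
    id: int
    wavefront_count: int
    wavefronts_completed: int
    wavefronts_at_barrier: int
    wavefronts: List[Wavefront] = field(default_factory=list)


@dataclass
class NDRange:
    id: int
    work_dim: int
    num_vgpr_used: int
    num_sgpr_used: int
    work_groups_finished: List[int] = field(default_factory=list)
    work_groups: List[WorkGroup] = field(default_factory=list)


def dump_checkpoint(ndrange):
    """Serialize the whole hierarchy to a JSON string."""
    return json.dumps(asdict(ndrange))


def load_checkpoint(text):
    """Rebuild the hierarchy from the JSON produced by dump_checkpoint."""
    raw = json.loads(text)
    groups = []
    for g in raw.pop("work_groups"):
        wavefronts = []
        for w in g.pop("wavefronts"):
            items = [WorkItem(**i) for i in w.pop("work_items")]
            wavefronts.append(Wavefront(work_items=items, **w))
        groups.append(WorkGroup(wavefronts=wavefronts, **g))
    return NDRange(work_groups=groups, **raw)
```

A round trip through `dump_checkpoint` and `load_checkpoint` should give back an equal `NDRange`, which is a cheap sanity check before comparing against a real restart.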
17
Implementation
LDS (Local Data Share): LDS module of executing work-group. All pages are stored.
Global memory: Stored until global memory top.
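The two snapshot policies on this slide differ in what bounds the copy: every allocated page for the LDS, and everything below a "top" address for global memory. A minimal sketch follows, assuming a simple page-table model of memory. `PagedMemory`, the 4096-byte page size, and the helper names are hypothetical and do not reflect the simulator's actual memory module.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class PagedMemory:
    """Sparse memory: pages are allocated on first write."""

    def __init__(self):
        self.pages = {}  # page index -> bytearray(PAGE_SIZE)

    def write(self, addr, data):
        for offset, byte in enumerate(data):
            page, pos = divmod(addr + offset, PAGE_SIZE)
            self.pages.setdefault(page, bytearray(PAGE_SIZE))[pos] = byte

    def read(self, addr, size):
        out = bytearray()
        for a in range(addr, addr + size):
            page, pos = divmod(a, PAGE_SIZE)
            # Untouched pages read back as zero.
            out.append(self.pages[page][pos] if page in self.pages else 0)
        return bytes(out)


def snapshot_all_pages(mem):
    """LDS-style snapshot: copy every allocated page."""
    return {page: bytes(data) for page, data in mem.pages.items()}


def snapshot_until_top(mem, top):
    """Global-memory-style snapshot: copy bytes in [0, top)."""
    return mem.read(0, top)


def restore_until_top(image):
    """Rebuild a memory whose first len(image) bytes match the snapshot."""
    mem = PagedMemory()
    mem.write(0, image)
    return mem
```

Storing all pages is simple but can be wasteful when few of them changed, which is what the later "Future Scope" slide proposes to reduce.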
18
Implementation
Completed work-groups: Store the list of finished work-groups in a file.
Unexecuted wavefronts during checkpoint: Store into a separate file while writing the checkpoint file. Read from the file to start execution during restart.
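The two bookkeeping files described on this slide could look like the sketch below. The file names, the JSON encoding, and the representation of an unexecuted wavefront as a (work-group ID, wavefront ID) pair are all assumptions made for illustration; the slides do not specify them.

```python
import json
import os
import tempfile

FINISHED_FILE = "finished_work_groups.json"     # hypothetical file name
UNEXECUTED_FILE = "unexecuted_wavefronts.json"  # hypothetical file name


def write_checkpoint_lists(directory, finished, unexecuted):
    """Write finished work-group IDs and unexecuted wavefronts to two files."""
    with open(os.path.join(directory, FINISHED_FILE), "w") as f:
        json.dump(sorted(finished), f)
    with open(os.path.join(directory, UNEXECUTED_FILE), "w") as f:
        json.dump([list(pair) for pair in unexecuted], f)


def read_restart_lists(directory):
    """Read both files back at restart time."""
    with open(os.path.join(directory, FINISHED_FILE)) as f:
        finished = set(json.load(f))
    with open(os.path.join(directory, UNEXECUTED_FILE)) as f:
        unexecuted = [tuple(pair) for pair in json.load(f)]
    return finished, unexecuted


def remaining_work_groups(total, finished):
    """Work-groups that still need to run after restart."""
    return [g for g in range(total) if g not in finished]


# Usage: checkpoint with work-groups 0 and 2 done, wavefront 3 of group 1 pending.
with tempfile.TemporaryDirectory() as d:
    write_checkpoint_lists(d, finished=[0, 2], unexecuted=[(1, 3)])
    done, pending = read_restart_lists(d)
```

Keeping the unexecuted wavefronts in their own file means restart can schedule them first without parsing the full checkpoint.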
19
Implementation
Checkpoint
Checkpoint Trace
20
Implementation
Restart
Restart Trace
21
Implementation
Verification Strategy
22
Evaluation
Workgroups
23
Evaluation
Instruction Count
24
Evaluation
Checkpoint Size
25
Evaluation
LDS Comparison
26
Bugs Encountered
LDS misalignment
27
Bugs Encountered
Unexecuted wavefront during checkpoint
28
Future Scope
Further minimization of LDS snapshot
  Keeping track of pages modified and storing only those
Implementing a driver call to checkpoint
Hardware complexity of the implementation
Compression algorithms during multiple checkpoints
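The first item above, tracking modified pages and storing only those, is commonly done with a dirty-page set. The sketch below shows the idea under stated assumptions: it is not something the presenters implemented, and `TrackedMemory`, `apply_delta`, and the 256-byte page size are hypothetical.

```python
PAGE_SIZE = 256  # assumed page size for this sketch


class TrackedMemory:
    """Flat memory that remembers which pages were written since the last checkpoint."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.dirty = set()

    def write(self, addr, payload):
        if addr < 0 or addr + len(payload) > len(self.data):
            raise IndexError("write outside memory")
        if not payload:
            return
        self.data[addr:addr + len(payload)] = payload
        first = addr // PAGE_SIZE
        last = (addr + len(payload) - 1) // PAGE_SIZE
        self.dirty.update(range(first, last + 1))

    def incremental_checkpoint(self):
        """Return only the pages modified since the previous call."""
        delta = {
            page: bytes(self.data[page * PAGE_SIZE:(page + 1) * PAGE_SIZE])
            for page in sorted(self.dirty)
        }
        self.dirty.clear()
        return delta


def apply_delta(image, delta):
    """Replay one incremental checkpoint onto a restored memory image."""
    for page, content in delta.items():
        start = page * PAGE_SIZE
        image[start:start + len(content)] = content
```

Replaying every delta in order onto a zeroed image of the same size reproduces the memory contents, while each checkpoint only pays for the pages touched since the last one.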
29
THANK YOU