1 An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems
Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro and Wen-mei Hwu
ASPLOS 2010 -- Pittsburgh, 3/17/2010

2 1. Introduction: Heterogeneous Computing
Heterogeneous parallel systems:
- CPU: sequential, control-intensive code
- Accelerators: massively data-parallel code
Existing programming models are DMA-based:
- Explicit memory copies
- Programmer-managed memory coherence

3 Outline
1. Introduction
2. Motivation
3. ADSM: Asymmetric Distributed Shared Memory
4. GMAC: Global Memory for ACcelerators
5. Experimental Results
6. Conclusions

4 2.1 Motivation: Reference System
CPU (N cores) with system memory: low latency, strong consistency, small page size
GPU-like accelerator with device memory: high bandwidth, weak consistency, large page size
CPU and accelerator are connected through a PCIe bus

5 2.2 Motivation: Memory Requirements
High memory bandwidth requirements
Non fully-coherent systems:
- Long-latency coherence traffic
- Different coherence protocols
Accelerator memory keeps growing (e.g. 6 GB on NVIDIA Fermi, 16 GB on PowerXCell 8i)

6 2.3 Motivation: DMA-Based Programming
Duplicated pointers: foo on the CPU, dev_foo on the GPU
Explicit coherence management
CUDA sample code:

void compute(FILE *file, int size) {
    float *foo, *dev_foo;
    foo = malloc(size);
    fread(foo, size, 1, file);
    cudaMalloc(&dev_foo, size);
    cudaMemcpy(dev_foo, foo, size, cudaMemcpyHostToDevice);
    kernel<<<Dg, Db>>>(dev_foo, size);
    cudaMemcpy(foo, dev_foo, size, cudaMemcpyDeviceToHost);
    cpuComputation(foo);
    cudaFree(dev_foo);
    free(foo);
}

7 3.1 ADSM: Unified Virtual Address Space
Unified virtual shared address space:
- CPU: accesses both system and accelerator memory
- Accelerator: accesses only its own memory
Under ADSM, both use the same virtual address when referencing a shared object
(Figure: shared data objects foo, bar and baz mapped at the same virtual addresses in system and device memory)

8 3.2 ADSM: Simplified Code
Simpler CPU code than in DMA-based programming models
Hardware-independent code
Single pointer, data assignment, peer DMA, legacy support

void compute(FILE *file, int size) {
    float *foo;
    foo = adsmMalloc(size);
    fread(foo, size, 1, file);
    kernel<<<Dg, Db>>>(foo, size);
    cpuComputation(foo);
    adsmFree(foo);
}

9 3.3 ADSM: Memory Distribution
Asymmetric Distributed Shared Memory principles:
- The CPU accesses objects in accelerator memory, but not vice versa
- All coherence actions are performed by the CPU
Thrashing is unlikely to happen:
- Synchronization variables: interrupt-based and dedicated hardware mechanisms
- False sharing: data objects are the sharing granularity

10 3.4 ADSM: Consistency and Coherence
Release consistency:
- Consistency is only relevant from the CPU perspective
- Implicit release/acquire at accelerator call/return
(Figure: ownership of foo moves to the accelerator at the call and back to the CPU at the return)
Memory coherence:
- Data ownership information enables eager data transfers
- The CPU maintains coherence
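To make these implicit consistency points concrete, here is a minimal host-side sketch with the release and acquire points annotated. The launch configuration Dg, Db is left abstract as on the earlier slides, and the explicit synchronization call is an illustrative addition; adsmMalloc/adsmFree and cpuComputation are the calls from the earlier ADSM code example.

__global__ void kernel(float *data, int size);   /* accelerator code, as on slide 8 */

void process(FILE *file, int size) {
    float *data = adsmMalloc(size);       /* one pointer names the shared object */
    fread(data, size, 1, file);           /* CPU owns the object: plain writes */
    kernel<<<Dg, Db>>>(data, size);       /* implicit release: modified data is
                                             made visible to the accelerator here */
    cudaThreadSynchronize();              /* kernel completion is the implicit
                                             acquire point for the CPU */
    cpuComputation(data);                 /* CPU reads now see the GPU's updates */
    adsmFree(data);
}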

11 4. Global Memory for Accelerators
ADSM implementation:
- User-level shared library
- GNU/Linux systems
- NVIDIA CUDA GPUs

12 4.1 GMAC: Overall Design
Layered design:
- Multiple memory consistency protocols
- Operating system and accelerator independent code
Layers (top to bottom): CUDA-like front-end; memory manager (different policies); kernel scheduler (FIFO); operating system abstraction layer; accelerator abstraction layer (CUDA)

13 4.2 GMAC: Unified Address Space
Virtual address space formed by the GPU and system physical memories
The GPU memory address range cannot be selected
Allocate the same virtual memory address range in both GPU and CPU (a sketch follows below)
Accelerator virtual memory would ease this process
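A minimal sketch of one way to obtain the identical range on both sides, assuming the CUDA device pointer value can be mirrored as a host virtual address with mmap. The helper name sharedAlloc is hypothetical, not GMAC's API, and error handling is simplified:

#include <stddef.h>
#include <sys/mman.h>
#include <cuda_runtime.h>

/* Hypothetical helper: make the host mapping land at the same virtual
 * address that the CUDA driver chose for the device allocation. */
void *sharedAlloc(size_t size)
{
    void *dev_ptr;
    if (cudaMalloc(&dev_ptr, size) != cudaSuccess)
        return NULL;

    /* Ask the OS for anonymous memory at exactly that address (hint only,
     * no MAP_FIXED, so an in-use range makes the mapping land elsewhere). */
    void *host_ptr = mmap(dev_ptr, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (host_ptr != dev_ptr) {            /* range already in use: give up */
        if (host_ptr != MAP_FAILED)
            munmap(host_ptr, size);
        cudaFree(dev_ptr);
        return NULL;
    }
    return host_ptr;   /* the same pointer now names the object on CPU and GPU */
}

If the hinted host range is already taken, the mapping lands elsewhere and the helper gives up; this is the failure mode the memory-mapping backup slide mentions.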

14 4.3 GMAC: Coherence Protocols
Batch-Update: copy all shared objects
Lazy-Update: copy modified / needed shared objects
- Data object granularity
- Detects CPU read/write accesses to shared objects
Rolling-Update: copy only modified / needed memory
- Memory block size granularity
- Fixed maximum number of modified blocks in system memory: data is flushed when the maximum is reached
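Detecting the CPU read/write accesses that Lazy-Update and Rolling-Update rely on can be done at user level with page protection and a fault handler. The sketch below shows the general technique only; the block bookkeeping, page size and handler details are illustrative assumptions, not GMAC's actual code:

#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL             /* assumed x86 page size */

static char  *obj_start;             /* one shared object, for brevity */
static size_t obj_size;

static void fault_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = (char *)info->si_addr;
    if (addr >= obj_start && addr < obj_start + obj_size) {
        /* The faulting access hit a protected shared block: mark the block
         * dirty (rolling-update) or the whole object dirty (lazy-update),
         * then unprotect the page so the access can complete. */
        mprotect((void *)((uintptr_t)addr & ~(PAGE_SIZE - 1)), PAGE_SIZE,
                 PROT_READ | PROT_WRITE);
        /* ...if the fixed maximum of dirty blocks is reached, flush the
           oldest ones to accelerator memory here (rolling-update)... */
        return;
    }
    abort();                          /* a genuine segmentation fault */
}

static void protect_shared_object(char *start, size_t size)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    obj_start = start;
    obj_size  = size;
    mprotect(start, size, PROT_NONE); /* the first CPU access will fault */
}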

15 5.1 Results: GMAC vs. CUDA
Batch-Update overheads:
- Copies output data on every call
- Copies non-used data
Similar performance for CUDA, Lazy-Update and Rolling-Update

16 5.2 Results: Lazy vs. Rolling on 3D Stencil
Extra data copies for small data objects
Trade-off between bandwidth and page-fault overhead

17 6. Conclusions
A unified virtual shared address space simplifies programming of heterogeneous systems
Asymmetric Distributed Shared Memory:
- The CPU accesses accelerator memory, but not vice versa
- Coherence actions are executed only by the CPU
Experimental results show no performance degradation
Memory translation in accelerators is key to implementing ADSM efficiently

18 Thank you for your attention
Eager to start using GMAC? http://code.google.com/p/adsm/
Contact: igelado@ac.upc.edu, adsm-users@googlegroups.com

19 Backup Slides

20 4.4 GMAC: Memory Mapping
Mapping the same virtual range might fail if the range is already in use on the host. Two alternatives (a sketch follows below):
- Software: allocate a different address range and provide a translation function (gmacSafePtr())
- Hardware: implement virtual memory in the GPU
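A sketch of the software alternative; only the gmacSafePtr() name comes from the slide, while the translation table and its layout are an illustrative assumption:

#include <stddef.h>

#define MAX_MAPPINGS 64               /* illustrative limit */

/* Host allocation -> device allocation, filled in at allocation time. */
typedef struct { char *host; char *dev; size_t size; } mapping_t;
static mapping_t table[MAX_MAPPINGS];
static int n_mappings;

/* Translate a host pointer into the device pointer to pass to a kernel.
 * When the unified mapping succeeded, host and device addresses coincide
 * and the pointer is returned unchanged. */
void *gmacSafePtr(void *ptr)
{
    char *p = ptr;
    for (int i = 0; i < n_mappings; i++) {
        if (p >= table[i].host && p < table[i].host + table[i].size)
            return table[i].dev + (p - table[i].host);
    }
    return ptr;
}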

21 4.5 GMAC: Protocol States
Protocol states: Invalid, Read-Only, Dirty
Transition events used by each protocol:
- Batch-Update: Call / Return
- Lazy-Update: Call / Return, Read / Write
- Rolling-Update: Call / Return, Read / Write, Flush
(Figure: state transition diagrams for each protocol)
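A hedged sketch of a per-block state machine with those three states; the exact transitions below are an assumption based on a typical lazy-release protocol, not read off the slide's diagrams:

typedef enum { INVALID, READ_ONLY, DIRTY } block_state_t;
typedef enum { EV_READ, EV_WRITE, EV_CALL, EV_FLUSH } event_t;

/* One plausible transition function for the Lazy/Rolling protocols. */
block_state_t next_state(block_state_t s, event_t e)
{
    switch (e) {
    case EV_READ:            /* CPU read fault: fetch the block, readable copy */
        return (s == INVALID) ? READ_ONLY : s;
    case EV_WRITE:           /* CPU write fault: CPU copy now differs from GPU */
        return DIRTY;
    case EV_CALL:            /* kernel launch: dirty blocks are copied to the
                                GPU and CPU copies are invalidated */
        return INVALID;
    case EV_FLUSH:           /* rolling-update eager flush: data transferred,
                                CPU copy kept read-only until the call */
        return (s == DIRTY) ? READ_ONLY : s;
    }
    return s;
}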

22 4.6 GMAC: Rolling vs. Lazy
Batch-Update: data is transferred at the kernel call
Rolling-Update: data is transferred while the CPU computes
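The overlap that Rolling-Update exploits can be pictured with plain CUDA asynchronous copies. The sketch below is illustrative only: GMAC drives these transfers from the coherence protocol rather than from user code, and the buffer layout, block count and produce_block() function are hypothetical.

#include <cuda_runtime.h>

void produce_block(float *block, size_t n);   /* hypothetical CPU computation */

void rolling_style_transfer(float *host_buf, float *dev_buf,
                            size_t block_elems, int n_blocks)
{
    /* host_buf is assumed to be pinned (cudaHostAlloc) so the copies can
     * really overlap with CPU work. */
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int b = 0; b < n_blocks; b++) {
        produce_block(host_buf + b * block_elems, block_elems);

        /* Start the copy of block b right away; it proceeds over PCIe
         * while the CPU moves on to block b + 1. */
        cudaMemcpyAsync(dev_buf + b * block_elems,
                        host_buf + b * block_elems,
                        block_elems * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
    }

    /* At the kernel call, only the copies still in flight need to finish;
     * the protocols that transfer at the call would do all the work here. */
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}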

23 5.3 Results: Break-down of Execution

24 5.4 Results: Rolling Size vs. Block Size
No appreciable effect on most benchmarks
Small rolling sizes lead to performance aberrations
Prefer relatively large rolling sizes

25 6.1 Conclusions: Wish-list
GPU anonymous memory mappings:
- GPU-to-CPU mappings never fail
- Dynamic memory re-allocations
GPU dynamic pinned memory:
- No intermediate data copies on flush
Peer DMA:
- Speeds up I/O operations
- No intermediate copies on GPU-to-GPU transfers

