1 An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems
Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro and Wen-mei Hwu
ASPLOS 2010 -- Pittsburgh, 3/17/2010

2 1. Introduction: Heterogeneous Computing
Heterogeneous parallel systems:
- CPU: sequential, control-intensive code
- Accelerators: massively data-parallel code
Existing programming models are DMA-based:
- Explicit memory copies
- Programmer-managed memory coherence

3 Outline
1. Introduction
2. Motivation
3. ADSM: Asymmetric Distributed Shared Memory
4. GMAC: Global Memory for ACcelerators
5. Experimental Results
6. Conclusions

4 2.1 Motivation: Reference System
CPU (N cores) with system memory: low latency, strong consistency, small page size
GPU-like accelerator with device memory: high bandwidth, weak consistency, large page size
CPU and accelerator are connected through a PCIe bus

5 2.2 Motivation: Memory Requirements
High memory bandwidth requirements
Non fully-coherent systems:
- Long-latency coherence traffic
- Different coherence protocols
Accelerator memory keeps growing (e.g. 6 GB on NVIDIA Fermi, 16 GB on PowerXCell 8i)

6 2.3 Motivation: DMA-Based Programming
Duplicated pointers: foo on the CPU, dev_foo on the GPU
Explicit coherence management
CUDA sample code:

void compute(FILE *file, int size) {
    float *foo, *dev_foo;
    foo = malloc(size);
    fread(foo, size, 1, file);
    cudaMalloc(&dev_foo, size);
    cudaMemcpy(dev_foo, foo, size, cudaMemcpyHostToDevice);
    kernel<<<Dg, Db>>>(dev_foo, size);
    cudaMemcpy(foo, dev_foo, size, cudaMemcpyDeviceToHost);
    cpuComputation(foo);
    cudaFree(dev_foo);
    free(foo);
}

7 3.1 ADSM: Unified Virtual Address Space
Unified virtual shared address space:
- CPU: accesses both system and accelerator memory
- Accelerator: accesses only its own memory
Under ADSM, both use the same virtual address when referencing a shared object
(Figure: shared data objects foo, bar and baz mapped at the same virtual addresses in system and device memory)

8 3.2 ADSM: Simplified Code
Simpler CPU code than in DMA-based programming models
Hardware-independent code
Single pointer, data assignment, peer DMA, legacy support

void compute(FILE *file, int size) {
    float *foo;
    foo = adsmMalloc(size);
    fread(foo, size, 1, file);
    kernel<<<Dg, Db>>>(foo, size);
    cpuComputation(foo);
    adsmFree(foo);
}

9 3.3 ADSM: Memory Distribution
Asymmetric Distributed Shared Memory principles:
- The CPU accesses objects in accelerator memory, but not vice versa
- All coherence actions are performed by the CPU
Thrashing is unlikely to happen:
- Synchronization variables: interrupt-based and dedicated hardware mechanisms
- False sharing: data objects are the sharing granularity

10 3.4 ADSM: Consistency and Coherence
Release consistency:
- Consistency is only relevant from the CPU perspective
- Implicit release/acquire at accelerator call/return
(Figure: ownership of foo moves to the accelerator at the call and back to the CPU at the return)
Memory coherence:
- Data ownership information enables eager data transfers
- The CPU maintains coherence
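To make these implicit consistency points concrete, here is a minimal host-side sketch with the release and acquire points annotated. The launch configuration Dg, Db is left abstract as on the earlier slides, and the explicit synchronization call is an illustrative addition; adsmMalloc/adsmFree and cpuComputation are the calls from the earlier ADSM code example.

__global__ void kernel(float *data, int size);   /* accelerator code, as on slide 8 */

void process(FILE *file, int size) {
    float *data = adsmMalloc(size);       /* one pointer names the shared object */
    fread(data, size, 1, file);           /* CPU owns the object: plain writes */
    kernel<<<Dg, Db>>>(data, size);       /* implicit release: modified data is
                                             made visible to the accelerator here */
    cudaThreadSynchronize();              /* kernel completion is the implicit
                                             acquire point for the CPU */
    cpuComputation(data);                 /* CPU reads now see the GPU's updates */
    adsmFree(data);
}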

11 4. Global Memory for Accelerators
ADSM implementation:
- User-level shared library
- GNU/Linux systems
- NVIDIA CUDA GPUs

12 4.1 GMAC: Overall Design
Layered design:
- Multiple memory consistency protocols
- Operating system and accelerator independent code
Layers (top to bottom): CUDA-like front-end; memory manager (different policies); kernel scheduler (FIFO); operating system abstraction layer; accelerator abstraction layer (CUDA)

13 4.2 GMAC: Unified Address Space
Virtual address space formed by the GPU and system physical memories
The GPU memory address range cannot be selected
Allocate the same virtual memory address range in both GPU and CPU (a sketch follows below)
Accelerator virtual memory would ease this process
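A minimal sketch of one way to obtain the identical range on both sides, assuming the CUDA device pointer value can be mirrored as a host virtual address with mmap. The helper name sharedAlloc is hypothetical, not GMAC's API, and error handling is simplified:

#include <stddef.h>
#include <sys/mman.h>
#include <cuda_runtime.h>

/* Hypothetical helper: make the host mapping land at the same virtual
 * address that the CUDA driver chose for the device allocation. */
void *sharedAlloc(size_t size)
{
    void *dev_ptr;
    if (cudaMalloc(&dev_ptr, size) != cudaSuccess)
        return NULL;

    /* Ask the OS for anonymous memory at exactly that address (hint only,
     * no MAP_FIXED, so an in-use range makes the mapping land elsewhere). */
    void *host_ptr = mmap(dev_ptr, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (host_ptr != dev_ptr) {            /* range already in use: give up */
        if (host_ptr != MAP_FAILED)
            munmap(host_ptr, size);
        cudaFree(dev_ptr);
        return NULL;
    }
    return host_ptr;   /* the same pointer now names the object on CPU and GPU */
}

If the hinted host range is already taken, the mapping lands elsewhere and the helper gives up; this is the failure mode the memory-mapping backup slide mentions.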

14 4.3 GMAC: Coherence Protocols
Batch-Update: copy all shared objects
Lazy-Update: copy modified / needed shared objects
- Data object granularity
- Detects CPU read/write accesses to shared objects
Rolling-Update: copy only modified / needed memory
- Memory block size granularity
- Fixed maximum number of modified blocks in system memory: data is flushed when the maximum is reached
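Detecting the CPU read/write accesses that Lazy-Update and Rolling-Update rely on can be done at user level with page protection and a fault handler. The sketch below shows the general technique only; the block bookkeeping, page size and handler details are illustrative assumptions, not GMAC's actual code:

#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL             /* assumed x86 page size */

static char  *obj_start;             /* one shared object, for brevity */
static size_t obj_size;

static void fault_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = (char *)info->si_addr;
    if (addr >= obj_start && addr < obj_start + obj_size) {
        /* The faulting access hit a protected shared block: mark the block
         * dirty (rolling-update) or the whole object dirty (lazy-update),
         * then unprotect the page so the access can complete. */
        mprotect((void *)((uintptr_t)addr & ~(PAGE_SIZE - 1)), PAGE_SIZE,
                 PROT_READ | PROT_WRITE);
        /* ...if the fixed maximum of dirty blocks is reached, flush the
           oldest ones to accelerator memory here (rolling-update)... */
        return;
    }
    abort();                          /* a genuine segmentation fault */
}

static void protect_shared_object(char *start, size_t size)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    obj_start = start;
    obj_size  = size;
    mprotect(start, size, PROT_NONE); /* the first CPU access will fault */
}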

15 5.1 Results: GMAC vs. CUDA
Batch-Update overheads:
- Copies output data on every call
- Copies non-used data
Similar performance for CUDA, Lazy-Update and Rolling-Update

16 5.2 Results: Lazy vs. Rolling on 3D Stencil
Extra data copies for small data objects
Trade-off between bandwidth and page-fault overhead

17 6. Conclusions
A unified virtual shared address space simplifies programming of heterogeneous systems
Asymmetric Distributed Shared Memory:
- The CPU accesses accelerator memory, but not vice versa
- Coherence actions are executed only by the CPU
Experimental results show no performance degradation
Memory translation in accelerators is key to implementing ADSM efficiently

18 Thank you for your attention
Eager to start using GMAC? http://code.google.com/p/adsm/
Contact: igelado@ac.upc.edu, adsm-users@googlegroups.com

19 Backup Slides

20 4.4 GMAC: Memory Mapping
Mapping the same virtual range might fail if the range is already in use on the host. Two alternatives (a sketch follows below):
- Software: allocate a different address range and provide a translation function (gmacSafePtr())
- Hardware: implement virtual memory in the GPU
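A sketch of the software alternative; only the gmacSafePtr() name comes from the slide, while the translation table and its layout are an illustrative assumption:

#include <stddef.h>

#define MAX_MAPPINGS 64               /* illustrative limit */

/* Host allocation -> device allocation, filled in at allocation time. */
typedef struct { char *host; char *dev; size_t size; } mapping_t;
static mapping_t table[MAX_MAPPINGS];
static int n_mappings;

/* Translate a host pointer into the device pointer to pass to a kernel.
 * When the unified mapping succeeded, host and device addresses coincide
 * and the pointer is returned unchanged. */
void *gmacSafePtr(void *ptr)
{
    char *p = ptr;
    for (int i = 0; i < n_mappings; i++) {
        if (p >= table[i].host && p < table[i].host + table[i].size)
            return table[i].dev + (p - table[i].host);
    }
    return ptr;
}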

21 4.5 GMAC: Protocol States
Protocol states: Invalid, Read-Only, Dirty
Transition events used by each protocol:
- Batch-Update: Call / Return
- Lazy-Update: Call / Return, Read / Write
- Rolling-Update: Call / Return, Read / Write, Flush
(Figure: state transition diagrams for each protocol)
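A hedged sketch of a per-block state machine with those three states; the exact transitions below are an assumption based on a typical lazy-release protocol, not read off the slide's diagrams:

typedef enum { INVALID, READ_ONLY, DIRTY } block_state_t;
typedef enum { EV_READ, EV_WRITE, EV_CALL, EV_FLUSH } event_t;

/* One plausible transition function for the Lazy/Rolling protocols. */
block_state_t next_state(block_state_t s, event_t e)
{
    switch (e) {
    case EV_READ:            /* CPU read fault: fetch the block, readable copy */
        return (s == INVALID) ? READ_ONLY : s;
    case EV_WRITE:           /* CPU write fault: CPU copy now differs from GPU */
        return DIRTY;
    case EV_CALL:            /* kernel launch: dirty blocks are copied to the
                                GPU and CPU copies are invalidated */
        return INVALID;
    case EV_FLUSH:           /* rolling-update eager flush: data transferred,
                                CPU copy kept read-only until the call */
        return (s == DIRTY) ? READ_ONLY : s;
    }
    return s;
}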

22 4.6 GMAC: Rolling vs. Lazy
Batch-Update: data is transferred at the kernel call
Rolling-Update: data is transferred while the CPU computes
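The overlap that Rolling-Update exploits can be pictured with plain CUDA asynchronous copies. The sketch below is illustrative only: GMAC drives these transfers from the coherence protocol rather than from user code, and the buffer layout, block count and produce_block() function are hypothetical.

#include <cuda_runtime.h>

void produce_block(float *block, size_t n);   /* hypothetical CPU computation */

void rolling_style_transfer(float *host_buf, float *dev_buf,
                            size_t block_elems, int n_blocks)
{
    /* host_buf is assumed to be pinned (cudaHostAlloc) so the copies can
     * really overlap with CPU work. */
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int b = 0; b < n_blocks; b++) {
        produce_block(host_buf + b * block_elems, block_elems);

        /* Start the copy of block b right away; it proceeds over PCIe
         * while the CPU moves on to block b + 1. */
        cudaMemcpyAsync(dev_buf + b * block_elems,
                        host_buf + b * block_elems,
                        block_elems * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
    }

    /* At the kernel call, only the copies still in flight need to finish;
     * the protocols that transfer at the call would do all the work here. */
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}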

23 5.3 Results: Break-down of Execution

24 5.4 Results: Rolling Size vs. Block Size
No appreciable effect on most benchmarks
Small rolling sizes lead to performance aberrations
Prefer relatively large rolling sizes

25 6.1 Conclusions: Wish-list
GPU anonymous memory mappings:
- GPU-to-CPU mappings never fail
- Dynamic memory re-allocations
GPU dynamic pinned memory:
- No intermediate data copies on flush
Peer DMA:
- Speeds up I/O operations
- No intermediate copies on GPU-to-GPU transfers

