1
Daniel Cederman and Philippas Tsigas, Distributed Computing and Systems, Chalmers University of Technology
2
System Model, The Algorithm, Experiments
3
System Model, The Algorithm, Experiments
4
Normal processors: multi-core, large cache, few threads, speculative execution
Graphics processors: many-core, small cache, wide and fast memory bus, thousands of threads hide memory latency
5
CUDA: designed for general-purpose computation, extensions to C/C++
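As a rough illustration of what those C/C++ extensions look like, here is a minimal, hypothetical CUDA sketch (not from the talk): a __global__ kernel, the built-in thread and block indices, and the <<<...>>> launch syntax. The kernel name and launch sizes are made up for the example.

    #include <cuda_runtime.h>

    // Hypothetical kernel: each thread increments one element.
    __global__ void addOne(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            data[i] += 1;
    }

    // Host side: one thread per element, 256 threads per block.
    void launchAddOne(int *d_data, int n) {
        addOne<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaDeviceSynchronize();
    }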
6
Global Memory
7
Multiprocessor 0 and Multiprocessor 1, each with its own Shared Memory
8
Global Memory is shared by all multiprocessors; Multiprocessor 0 and Multiprocessor 1 each have their own Shared Memory; Thread Blocks 0, 1, ..., x and y are assigned to the multiprocessors
9
Minimal, manual cache
32-word SIMD instruction
Coalesced memory access
No block synchronization
Expensive synchronization primitives
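A small sketch, assuming a 256-thread block, of how these constraints show up in CUDA code: __shared__ memory is the small, manually managed cache, __syncthreads() is a barrier for one block only (there is no barrier across blocks inside a kernel), and consecutive threads reading consecutive words is what makes the accesses coalesced. The kernel is illustrative, not the authors' code.

    #include <cuda_runtime.h>

    // Assumes a launch with 256 threads per block and an input of
    // gridDim.x * 256 elements (illustrative sizes).
    __global__ void blockSum(const int *in, int *blockSums) {
        __shared__ int buf[256];                 // the small, manually managed on-chip memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = in[i];                // consecutive threads read consecutive words: coalesced
        __syncthreads();                         // barrier for this block only; no inter-block barrier exists

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
            if (threadIdx.x < s)
                buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            blockSums[blockIdx.x] = buf[0];      // one result per block, combined later
    }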
10
System Model, The Algorithm, Experiments
11
[Figure: an example sequence is partitioned around a pivot by quicksort]
12
[Figure: the same example sequence] Data Parallel!
13
[Figure: the same example sequence] NOT Data Parallel!
14
Phase One ◦ Sequences divided among the thread blocks ◦ Block synchronization on the CPU
[Figure: Thread Block 1 and Thread Block 2 cooperate on the same sequence]
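A hedged sketch of the phase-one control flow implied above: because blocks cannot synchronize with each other inside a kernel, the host launches one kernel per partitioning step and the kernel boundary acts as the global barrier. The kernel, helper, and variable names are assumptions made for the sketch, not the authors' code.

    #include <cuda_runtime.h>
    #include <utility>

    // Hypothetical kernel performing one partitioning step of phase one.
    __global__ void partitionStep(const int *in, int *out, int numLargeSequences);

    // Hypothetical host-side bookkeeping: splits finished sequences and
    // returns how many are still large enough for phase one.
    int updateSequences();

    void phaseOne(int *d_bufA, int *d_bufB, int numLargeSequences) {
        while (numLargeSequences > 0) {
            partitionStep<<<numLargeSequences * 4, 256>>>(d_bufA, d_bufB, numLargeSequences);
            cudaDeviceSynchronize();          // kernel boundary = the block synchronization on the CPU
            std::swap(d_bufA, d_bufB);        // this step's output is the next step's input
            numLargeSequences = updateSequences();
        }
    }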
15
Phase Two ◦ Each block is assigned its own sequence ◦ Runs entirely on the GPU
[Figure: Thread Block 1 and Thread Block 2 each work on a separate sequence]
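A hedged sketch of the phase-two dispatch: blockIdx.x selects one independent sequence per block, so no CPU-side synchronization is needed. The Seq descriptor and sortSequence routine are illustrative placeholders.

    // Hypothetical per-sequence descriptor and device-side sort routine.
    struct Seq { int begin; int end; };

    __device__ void sortSequence(int *data, Seq s);   // e.g. quicksort with a bitonic base case

    __global__ void phaseTwo(int *data, const Seq *seqs) {
        Seq mine = seqs[blockIdx.x];   // each block picks its own sequence
        sortSequence(data, mine);      // no CPU involvement from here on
    }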
16
Example input sequence: 2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
17
Assign Thread Blocks: 2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
18
Uses an auxiliary array (in-place partitioning is too expensive): 2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
19
Fetch-And-Add on LeftPointer ◦ LeftPointer = 0 ◦ Thread Block ONE, Thread Block TWO
20
Fetch-And-Add on LeftPointer ◦ LeftPointer = 2 ◦ Thread Block ONE, Thread Block TWO ◦ 2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
21
Fetch-And-Add on LeftPointer ◦ LeftPointer = 4 ◦ Thread Block ONE, Thread Block TWO ◦ 2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
22
2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
LeftPointer = 10
[Figure: the auxiliary array after the partitioning step]
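A hedged sketch of the mechanism the LeftPointer slides show: an atomic fetch-and-add on a shared pointer reserves slots in the auxiliary array for elements smaller than the pivot. In the actual algorithm each block adds its whole element count at once (after a local prefix sum), and a matching right pointer handles the larger elements; this per-element version only illustrates the idea. Names are assumptions.

    #include <cuda_runtime.h>

    // One thread per element; leftPointer is a single int in global memory,
    // initialized to 0 before the kernel is launched.
    __global__ void partitionLeft(const int *in, int *aux, int *leftPointer,
                                  int pivot, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && in[i] < pivot) {
            int slot = atomicAdd(leftPointer, 1);   // fetch-and-add: claim the next free slot
            aux[slot] = in[i];                      // write into the auxiliary array
        }
    }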
23
2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
Three-way partition
[Figure: the sequence split into elements smaller than, equal to, and greater than the pivot]
24
2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
Each element is compared with the pivot and marked <, =, or >
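A hedged sketch of the comparison step in the figure: each element is classified as smaller than, equal to, or greater than the pivot and the per-category counts are accumulated; in the real implementation those counts feed a prefix sum that yields exact write offsets. The kernel and counter names are illustrative.

    __global__ void classify(const int *in, int pivot, int n,
                             int *countLt, int *countEq, int *countGt) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int v = in[i];
            if (v < pivot)      atomicAdd(countLt, 1);   // '<' in the figure
            else if (v > pivot) atomicAdd(countGt, 1);   // '>'
            else                atomicAdd(countEq, 1);   // '='
        }
    }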
25
Recurse on smallest sequence ◦ Max depth log(n)
Explicit stack
Alternative sorting ◦ Bitonic sort
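A hedged sketch (plain host-side C++, not the authors' GPU code) of the explicit-stack rule above: always push the larger subsequence and continue with the smaller one, so the stack never grows beyond roughly log2(n) entries, and hand sequences below a threshold to bitonic sort. partitionRange() and bitonicSort() are hypothetical helpers.

    // Hypothetical helpers: partitionRange returns a split point with
    // both halves non-empty; bitonicSort finishes small sequences.
    int  partitionRange(int *data, int begin, int end);
    void bitonicSort(int *data, int begin, int end);

    struct Range { int begin, end; };

    void quicksortExplicitStack(int *data, int n) {
        const int BITONIC_THRESHOLD = 1024;   // small sequences go to bitonic sort
        Range stack[64];                      // generous bound on the ~log2(n) depth
        int top = 0;
        Range cur = {0, n};
        for (;;) {
            if (cur.end - cur.begin <= BITONIC_THRESHOLD) {
                bitonicSort(data, cur.begin, cur.end);
                if (top == 0) break;          // nothing left to do
                cur = stack[--top];
                continue;
            }
            int mid = partitionRange(data, cur.begin, cur.end);
            Range left = {cur.begin, mid}, right = {mid, cur.end};
            if (left.end - left.begin < right.end - right.begin) {
                stack[top++] = right;         // defer the larger half...
                cur = left;                   // ...and recurse on the smallest sequence
            } else {
                stack[top++] = left;
                cur = right;
            }
        }
    }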
26
System Model, The Algorithm, Experiments
27
GPU-Quicksort: Cederman and Tsigas, ESA '08
GPUSort: Govindaraju et al., SIGMOD '05
Radix-Merge: Harris et al., GPU Gems 3, '07
Global radix: Sengupta et al., GH '07
Hybrid: Sintorn and Assarsson, GPGPU '07
Introsort: David Musser, Software: Practice and Experience
28
Input distributions: Uniform, Gaussian, Bucket, Sorted, Zero, Stanford Models
34
Pivot selection:
First Phase ◦ Average of maximum and minimum element
Second Phase ◦ Median of first, middle and last element
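A hedged sketch of the two pivot choices; the helper names are made up for the example.

    // First phase: midpoint between the minimum and maximum element seen so far.
    int phaseOnePivot(int minVal, int maxVal) {
        return minVal + (maxVal - minVal) / 2;    // midpoint (assumes minVal <= maxVal)
    }

    // Second phase: median of the first, middle and last element of the sequence.
    int phaseTwoPivot(const int *data, int begin, int end) {
        int a = data[begin];
        int b = data[begin + (end - begin) / 2];
        int c = data[end - 1];
        if (a > b) { int t = a; a = b; b = t; }
        if (b > c) { int t = b; b = c; c = t; }
        if (a > b) { int t = a; a = b; b = t; }
        return b;                                  // the middle value of the three
    }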
35
8800GTX ◦ 16 multiprocessors (128 cores) ◦ 86.4 GB/s bandwidth
8600GTS ◦ 4 multiprocessors (32 cores) ◦ 32 GB/s bandwidth
CPU-Reference ◦ AMD Dual-Core Opteron 265 at 1.8 GHz
39
CPU-Reference: ~4 s
40
GPUSort and Radix-Merge: ~1.5 s
41
GPU-Quicksort: ~380 ms ◦ Global Radix: ~510 ms
45
Reference is faster than GPUSort and Radix-Merge
46
GPU-Quicksort takes less than 200 ms
50
Bitonic is unaffected by distribution
51
Three-way partition
55
Reference equal to Radix-Merge
56
Quicksort and Hybrid are ~4 times faster
60
Hybrid needs randomization
61
Reference better
63
Similar to uniform distribution
64
Sequence needs to be a power of two in length
65
Minimal, manual cache ◦ Used only for prefix sum and bitonic sort
32-word SIMD instruction ◦ Main part executes the same instructions
Coalesced memory access ◦ All reads are coalesced
No block synchronization ◦ Only required in the first phase
Expensive synchronization primitives ◦ Two passes amortize the cost
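A hedged sketch of the two-pass idea behind the last point: the first pass only counts what each block will write, a prefix sum over the counts gives every block a private output range, and the second pass writes into that range, so only a few expensive global operations are needed per block. Kernel names and the fixed elemsPerBlock layout are assumptions of the sketch.

    // Pass 1: each block counts its elements smaller than the pivot.
    __global__ void countPass(const int *in, int pivot, int elemsPerBlock, int *blockCounts) {
        __shared__ int count;
        if (threadIdx.x == 0) count = 0;
        __syncthreads();
        int base = blockIdx.x * elemsPerBlock;
        for (int i = threadIdx.x; i < elemsPerBlock; i += blockDim.x)
            if (in[base + i] < pivot)
                atomicAdd(&count, 1);              // cheap shared-memory atomic
        __syncthreads();
        if (threadIdx.x == 0) blockCounts[blockIdx.x] = count;
    }

    // Between the passes, a prefix sum over blockCounts gives each block its
    // exclusive output offset (blockOffsets).

    // Pass 2: each block writes its elements into its reserved range.
    __global__ void writePass(const int *in, int pivot, int elemsPerBlock,
                              const int *blockOffsets, int *out) {
        __shared__ int pos;
        if (threadIdx.x == 0) pos = blockOffsets[blockIdx.x];
        __syncthreads();
        int base = blockIdx.x * elemsPerBlock;
        for (int i = threadIdx.x; i < elemsPerBlock; i += blockDim.x)
            if (in[base + i] < pivot)
                out[atomicAdd(&pos, 1)] = in[base + i];   // slot inside this block's range
    }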
66
Quicksort is a viable sorting method for graphics processors and can be implemented in a data-parallel way. It is competitive.