GPU-based Computing
Tesla C870 GPU
1.5 GB per GPU
8 KB / multiprocessor
16 KB of shared memory per multiprocessor
8192 registers per multiprocessor
up to 768 threads per multiprocessor (21 bytes of shared memory and 10 registers/thread)
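The per-thread figures on this slide follow from dividing the per-multiprocessor limits by the number of resident threads. A minimal sketch of that arithmetic, assuming the 16 KB figure is shared memory and 8192 is the register count; the helper name `per_thread_share` is hypothetical:

```c
#include <assert.h>

/* Per-multiprocessor limits quoted on the slide (assumed interpretation). */
enum {
    SHARED_BYTES_PER_SM = 16 * 1024,
    REGISTERS_PER_SM    = 8192,
    MAX_THREADS_PER_SM  = 768
};

/* Hypothetical helper: whole units of a per-multiprocessor resource
   available to each thread when `resident_threads` threads share it. */
int per_thread_share(int resource_per_sm, int resident_threads)
{
    assert(resident_threads > 0 && resident_threads <= MAX_THREADS_PER_SM);
    return resource_per_sm / resident_threads;  /* integer division rounds down */
}
```

With all 768 threads resident, `per_thread_share(SHARED_BYTES_PER_SM, 768)` gives 21 bytes and `per_thread_share(REGISTERS_PER_SM, 768)` gives 10 registers. Running fewer threads leaves more registers for each one.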
Tesla C870 Threads
Example: B blocks of T threads
Which element am I processing?
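In CUDA the usual answer is `blockIdx.x * blockDim.x + threadIdx.x`. The sketch below emulates that in plain C so it can run without a GPU; `global_index` and `emulate_launch` are hypothetical names, and the sequential loops only stand in for threads that would run concurrently on the device.

```c
#include <assert.h>

/* Plain-C stand-in for blockIdx.x * blockDim.x + threadIdx.x. */
int global_index(int block_id, int threads_per_block, int thread_id)
{
    assert(threads_per_block > 0);
    assert(block_id >= 0 && thread_id >= 0 && thread_id < threads_per_block);
    return block_id * threads_per_block + thread_id;
}

/* Emulates B blocks of T threads: each logical thread increments the one
   element it owns, skipping indices past the end of the array.
   Returns how many elements were touched. */
int emulate_launch(int blocks, int threads_per_block, int *data, int n)
{
    int touched = 0;
    for (int b = 0; b < blocks; b++) {
        for (int t = 0; t < threads_per_block; t++) {
            int i = global_index(b, threads_per_block, t);
            if (i < n) {            /* bounds guard for a partially filled last block */
                data[i] += 1;
                touched++;
            }
        }
    }
    return touched;
}
```

The bounds guard matters whenever B * T is larger than the number of elements, which is the common case when the element count is not a multiple of T.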
Case Study: Irregular Terrain Model
each profile is 1 KB
16*16 threads
128*16 threads, 45 regs
192*16 threads, 37 regs
GPU Strategies for ITM
OpenMP vs CUDA
#pragma omp parallel for shared(A) private(i,j)
for (i = 0; i < 32; i++) {
    for (j = 0; j < 32; j++)
        value = some_function(A[i][j]);
}
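A self-contained version of the OpenMP loop might look like the sketch below. It makes two assumptions that are not on the slide: `some_function` is replaced by a hypothetical stand-in, and each result is stored in its own `out[i][j]` element, because a single shared scalar written by every iteration would be a data race.

```c
#include <assert.h>
#include <stddef.h>

#define N 32

/* Hypothetical stand-in for some_function on the slide. */
static double some_function(double x)
{
    return 2.0 * x + 1.0;
}

/* Applies some_function to every element of A, writing results to out.
   The pragma goes directly before the loop it parallelizes; a compiler
   without OpenMP support ignores it and runs the loop sequentially. */
void apply_all(double A[N][N], double out[N][N])
{
    int i, j;
    assert(A != NULL && out != NULL);
    #pragma omp parallel for shared(A, out) private(i, j)
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            out[i][j] = some_function(A[i][j]);
        }
    }
}
```

Built with an OpenMP flag such as `-fopenmp` on GCC, the outer loop's rows are divided among the threads. Without the flag the result is the same, only sequential.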
Easy to learn, takes time to master
For further discussion and additional information
Tesla C870 GPU vs. CELL BE
3072 threads on GPU
15x faster than CELL BE
128x faster than GPP
Tesla C1060: 30 cores, double the number of registers ($1,500), with 8x the GFLOPS of an Intel Xeon W5590 Quad Core ($1,600)
Fermi: Next Generation GPU
32 cores, 1536 threads per core
No complex mechanisms
  – speculation, out-of-order execution, superscalar, instruction-level parallelism, branch predictors
Column access to memory
  – faster search, read
First GPU with error-correcting code
  – registers, shared memory, caches, DRAM
Languages supported
  – C, C++, FORTRAN, Java, Matlab, and Python
IEEE 754-2008 double-precision floating point
  – fused multiply-add operations
Streaming and task switching (25 microseconds)
  – launching multiple kernels (up to 16) simultaneously
  – visualization and computing
Sequence Alignment
$H_{i,j} = \max\{\, H_{i-1,j-1} + S_{i,j},\; H_{i-1,j} - G,\; H_{i,j-1} - G,\; 0 \,\}$, with $G = 10$
[Figure: alignment matrix with rows and columns labeled A B C D F W, repeated across six panels]
Substitution Matrix
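The recurrence can be sketched in sequential C as follows. This is a minimal illustration and not the GPU implementation discussed in the deck: `toy_score` is a hypothetical match/mismatch scoring (+5 / -3) used in place of a real substitution matrix, and only the best score is returned, with no traceback.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GAP 10  /* G = 10, as on the slide */

/* Hypothetical toy scoring, NOT the slide's substitution matrix. */
int toy_score(char a, char b)
{
    return a == b ? 5 : -3;
}

static int max2(int a, int b)
{
    return a > b ? a : b;
}

/* Best local alignment score of q against d, keeping only two rows of H.
   Returns -1 if memory cannot be allocated. */
int smith_waterman(const char *q, const char *d)
{
    assert(q != NULL && d != NULL);
    size_t m = strlen(q), n = strlen(d);
    int *prev = calloc(n + 1, sizeof *prev);   /* row i-1, starts as all zeros */
    int *curr = calloc(n + 1, sizeof *curr);   /* row i */
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return -1;
    }
    int best = 0;
    for (size_t i = 1; i <= m; i++) {
        curr[0] = 0;
        for (size_t j = 1; j <= n; j++) {
            int h = prev[j - 1] + toy_score(q[i - 1], d[j - 1]); /* diagonal */
            h = max2(h, prev[j] - GAP);                          /* gap, from above */
            h = max2(h, curr[j - 1] - GAP);                      /* gap, from left */
            h = max2(h, 0);                                      /* local alignment floor */
            curr[j] = h;
            best = max2(best, h);
        }
        int *tmp = prev;
        prev = curr;
        curr = tmp;
    }
    free(prev);
    free(curr);
    return best;
}
```

Each cell depends only on its left, upper, and upper-left neighbors, so two rows of storage are enough when only the score is needed.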
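Previous Methods: Cost Function
[Table: substitution matrix with A B C D … Z on both axes; some cells were lost in extraction. Surviving row values, in order:]
A: 5, -2, -2, …, -3
B: -2, 5, -3, …, -5
C: -3, 13, -4, …, -5
D: …, -5
…
Z: -3, -5, …, 15
Only need space for 26x26 matrix

Sorted Substitution Table: New Cost Function
[Table: columns A R N D … X, one row per query character; some cells were lost in extraction. Surviving row values, in order:]
Q: 1, 0, 0, …
U: …
E: 0, 0, 2, …
R: 5, 0, -2, …
Y: -2, -3, …
Space needed is 23 x (query length)
Computed new table from substitution matrix, with substitution characters for top row and query sequence for column
*Does not use modulo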
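One way to read the new cost function is as a per-query lookup table: row i holds the substitution-matrix row for the i-th query character, so scoring a database residue against query position i is a single indexed load. The sketch below assumes that reading. The 23-letter alphabet order, the flat array layout, and all function names are hypothetical, since the slide shows only "A R N D … X".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ALPHABET 23  /* residues per row, as in "23 x (query length)" */

/* Hypothetical alphabet order; the slide shows only "A R N D ... X". */
static const char RESIDUES[] = "ARNDCQEGHILKMFPSTWYVBZX";

/* Column index of residue c; characters outside the alphabet map to X. */
int residue_index(char c)
{
    const char *p = (c != '\0') ? strchr(RESIDUES, c) : NULL;
    return p ? (int)(p - RESIDUES) : ALPHABET - 1;
}

/* Builds table[i * ALPHABET + c] = subst[index(query[i]) * ALPHABET + c].
   subst is an ALPHABET x ALPHABET matrix stored row by row.
   Returns NULL on allocation failure; the caller frees the result. */
int *build_query_table(const int *subst, const char *query)
{
    assert(subst != NULL && query != NULL);
    size_t len = strlen(query);
    int *table = malloc((len ? len : 1) * ALPHABET * sizeof *table);
    if (table == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++) {
        int q = residue_index(query[i]);
        for (int c = 0; c < ALPHABET; c++)
            table[i * ALPHABET + c] = subst[q * ALPHABET + c];
    }
    return table;
}

/* Score of query position i against database residue d: one direct lookup. */
int table_score(const int *table, size_t i, char d)
{
    return table[i * ALPHABET + residue_index(d)];
}
```

The table trades memory (23 entries per query position instead of one fixed matrix) for a simpler inner loop, because the query character no longer has to be decoded for every cell.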
Results: GSW vs SSEARCH
Alignment database: Swissprot (Aug 2008), containing 392,768 sequences.
[Table columns: Protein Length | GPU (1.35 GHz) Time (s) | SSEARCH (3.2 GHz) Time (s) | Speedup | GPU Cycles (Billions) | SSEARCH Cycles (Billions) | Cycles Ratio]
From Software Perspective
A kernel is an SPMD-style (single program, multiple data) program
Programmer organizes threads into thread blocks; a kernel consists of a grid of one or more blocks
A thread block is a group of concurrent threads that can cooperate among themselves through
  – barrier synchronization
  – a per-block private shared memory space
Programmer specifies
  – number of blocks
  – number of threads per block
Each thread block = virtual multiprocessor
  – each thread has a fixed allocation of registers
  – each block has a fixed allocation of per-block shared memory
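Choosing the two launch parameters usually comes down to a ceiling division, and the fixed per-thread and per-block allocations determine how many blocks fit on a multiprocessor at once. A rough sketch of both calculations follows. The function names are hypothetical, and the model ignores hardware details such as allocation granularity and any separate cap on blocks per multiprocessor.

```c
#include <assert.h>

/* Smallest number of blocks such that blocks * threads_per_block >= n. */
int blocks_needed(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Simplified estimate of how many blocks can be resident on one
   multiprocessor: the tightest of the thread, register, and shared-memory
   limits. */
int resident_blocks(int threads_per_block, int regs_per_thread,
                    int shared_per_block, int regs_per_sm,
                    int shared_per_sm, int max_threads_per_sm)
{
    assert(threads_per_block > 0 && regs_per_thread > 0 && shared_per_block >= 0);
    int by_threads = max_threads_per_sm / threads_per_block;
    int by_regs = regs_per_sm / (threads_per_block * regs_per_thread);
    int limit = by_threads < by_regs ? by_threads : by_regs;
    if (shared_per_block > 0) {
        int by_shared = shared_per_sm / shared_per_block;
        if (by_shared < limit)
            limit = by_shared;
    }
    return limit;
}
```

For example, with 8192 registers, 16 KB of shared memory, and a 768-thread limit per multiprocessor, a block of 128 threads using 45 registers each is register-bound: `resident_blocks(128, 45, 1024, 8192, 16384, 768)` evaluates to 1.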
Efficiency Considerations
Avoid execution divergence
  – divergence occurs when threads within a warp follow different execution paths
  – divergence between warps is OK
Load a block of data into the SM
  – process it there, and then write the final result back out to external memory
Coalesce memory accesses
  – access consecutive words instead of gather-scatter
Create enough parallel work
  – 5K to 10K threads
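To see why consecutive words matter, the sketch below counts how many aligned memory segments a group of threads touches when thread t reads element t * stride. It is a simplified model with hypothetical parameters (group size, word size, segment size); actual coalescing rules differ between GPU generations.

```c
#include <assert.h>

/* Counts distinct aligned segments touched when each of `threads` threads
   reads one word of `word_bytes` bytes at element index thread_id * stride.
   Addresses increase with thread_id, so counting segment changes suffices. */
int segments_touched(int threads, int stride, int word_bytes, int segment_bytes)
{
    assert(threads > 0 && stride > 0 && word_bytes > 0);
    assert(segment_bytes >= word_bytes && segment_bytes % word_bytes == 0);
    int count = 0;
    long last = -1;
    for (int t = 0; t < threads; t++) {
        long seg = ((long)t * stride * word_bytes) / segment_bytes;
        if (seg != last) {
            count++;
            last = seg;
        }
    }
    return count;
}
```

Under this model, 16 threads reading consecutive 4-byte words fall in a single 64-byte segment, while a stride of 16 words spreads the same 16 reads over 16 segments, so the number of memory transactions grows with the stride.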
Most important factor: number of thread blocks = “processor count”
At thread block (or SM) level
  – internal communication is cheap
Focus on decomposing the work between the “p” thread blocks
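One simple way to decompose n work items between p thread blocks is to give each block a contiguous range, with the first n mod p blocks taking one extra item. This is only one possible decomposition, and `block_range` is a hypothetical helper name.

```c
#include <assert.h>

/* Computes the half-open range [*begin, *end) of items owned by block b
   when n items are split as evenly as possible among p blocks. */
void block_range(int n, int p, int b, int *begin, int *end)
{
    assert(n >= 0 && p > 0 && b >= 0 && b < p);
    assert(begin != 0 && end != 0);
    int base = n / p;    /* every block gets at least this many items */
    int extra = n % p;   /* the first `extra` blocks get one more */
    *begin = b * base + (b < extra ? b : extra);
    *end = *begin + base + (b < extra ? 1 : 0);
}
```

With n = 10 and p = 4 the ranges are [0,3), [3,6), [6,8), and [8,10). Within each range, the threads of a block can cooperate through shared memory, where communication is cheap.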