Multicore and GPU Programming


1 Multicore and GPU Programming
Multiple CPUs led to multicore, which led to the GPU
GPU – hundreds to thousands of cores
GPU – an order of magnitude faster

2 Flynn's Taxonomy of Parallel Architectures
SISD – Single Instruction, Single Data
SIMD – Single Instruction, Multiple Data
MISD – Multiple Instruction, Single Data
MIMD – Multiple Instruction, Multiple Data

3 Graphics Coding
NVIDIA – CUDA
AMD – APU (Accelerated Processing Unit), covered later
OpenCL – open standard supported by Intel, AMD, and others

4 Cell BE processor
Used in Sony's PS3
A master–worker heterogeneous MIMD machine on a chip
Master: PPE (Power Processing Element)
  Runs 2 threads; runs the OS and manages the workers
Workers: 8 SPEs (Synergistic Processing Elements)
  128-bit vector processors (SIMD)
  256K of local memory holds both data and code

5 Cell BE continued
PPE and SPE instruction sets are incompatible
Difficult to program – speed was bought at the cost of ease of programming
102.4 Gflops peak
IBM Roadrunner supercomputer (the world's fastest in 2008):
  12,240 PowerXCell 8i and 6,562 AMD Opteron processors
  PowerXCell 8i is an enhanced version of the original Cell processor
No longer built – too complex

6 Nvidia's Kepler
The third GPU architecture Nvidia designed for compute applications
Cores are arranged in groups called Streaming Multiprocessors (SMX)
  192 cores per SMX, operating in SIMD fashion
  Each SMX can run its own program
Chips in the Kepler family are distinguished by the number of SMX blocks
  The GTX Titan has 15 SMXs, 14 of which are usable: 14 × 192 = 2688 cores
  The dual-GPU GTX Titan Z has 5760 cores

7 The Process
1. Send data to the GPU
2. Launch a kernel
3. Wait and collect the results
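A minimal CUDA sketch of these three steps, assuming a hypothetical vector-add kernel (vecAdd, the array names, and the sizes are illustrative, not from the deck); error checking is omitted for brevity:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one element per thread */
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);

    /* Step 1: send data to the GPU */
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    /* Step 2: launch a kernel */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    /* Step 3: wait and collect the results; this device-to-host copy
       synchronizes with the kernel in the default stream */
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);  /* expect 3.0 */

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}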

8 AMD APUs
CPU and GPU on the same chip
Shared memory eliminates host-to-device memory transfers
Implement AMD's Heterogeneous System Architecture (HSA)
Two core types:
  Latency Compute Unit (LCU) – a generalized CPU
    Supports the native CPU instruction set and the HSA Intermediate Language (HSAIL) instruction set
  Throughput Compute Unit (TCU) – a generalized GPU
    Supports only HSAIL; targets efficient parallel execution

9 Multicore to Many-Core: Tilera's Tile-Gx8072
Tilera's many-core line dates back to 2007
Cores arranged in a 2-dimensional grid – a mesh computer
Up to 72 cores
Integrated with the CPU/OS

10 Intel's Xeon Phi (2012)
Used in 2 of the top 10 supercomputers; China's Tianhe-2 is #1
61 x86 cores, each handling 4 threads at the same time
512-bit-wide Vector Processing Unit (VPU), SIMD:
  16 single-precision or 8 double-precision floating-point numbers per clock cycle
Each core has 32K data and 32K instruction L1 caches and a 512K L2 cache
Easy to use – ordinary OpenMP works (see the sketch below)
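A minimal sketch of the kind of plain OpenMP code the slide has in mind, assuming native execution on the Phi; nothing here is Phi-specific, and the array, its size, and the reduction are illustrative:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 1000000;
    double *a = (double *)malloc(n * sizeof(double));
    double sum = 0.0;

    /* OpenMP spreads the iterations across all hardware threads;
       on a 61-core Phi that is up to 61 * 4 = 244 threads */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        a[i] = (double)i;
        sum += a[i];
    }

    printf("max threads = %d, sum = %.0f\n", omp_get_max_threads(), sum);
    free(a);
    return 0;
}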

11 PERFORMANCE
Speedup = time_seq / time_parallel (measured as wall-clock time)
Affected by:
  Programmer skill
  Compiler and compiler switches
  OS
  File system (ext4, NTFS, ...)
  Load
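A quick worked example (the numbers are illustrative, not from the deck): if the sequential version takes 20 s of wall-clock time and the parallel version takes 4 s, speedup = 20 / 4 = 5.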

12 EFFICIENCY
Efficiency = speedup / N = time_seq / (N × time_parallel)
N is the number of CPUs or cores
If speedup = N we have linear speedup – the ideal
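Continuing the illustrative numbers above: a speedup of 5 on N = 8 cores gives efficiency = 5/8 ≈ 0.63, while a speedup of 8 on 8 cores gives efficiency = 1, i.e., linear speedup.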

13 Hyperthreading
Runs 2 software threads per core by duplicating part of the CPU state
Makes the OS think there are twice as many processors
Yields up to roughly 30% speedup

14 More resources – more speedup?
Not necessarily so:
  The sequential part limits the gain
  Superlinear speedup (speedup > N) is only a fair claim if the parallel program finds the solution in exactly the same way as the sequential one

15 Scaling
More resources should yield more speedup; if there is no scaling, the design is probably poor
Strong scaling: strongScalingEfficiency(N) = time_seq / (N × time_parallel)
  The same as the general efficiency above

16 Weak Scaling
weakScalingEfficiency(N) = t_seq / t'_par
t'_par is the time to solve a problem N times bigger than the one the single machine solves in time t_seq
GPUs pose a bigger challenge for these metrics:
  One never uses a single GPU core to measure t_seq
  It is not fair to use a CPU for t_seq either
  A GPU needs a host CPU – does the host count?
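An illustrative example (numbers invented for the sake of the arithmetic): if one machine solves a problem of size S in t_seq = 10 s, and N = 8 machines solve a problem of size 8S in t'_par = 12.5 s, then weakScalingEfficiency(8) = 10 / 12.5 = 0.8.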

17 Building a parallel program
Coordination problems:
  Access to shared resources
  Load-balancing issues
  Termination problems – halting all workers in a coordinated fashion
  Etc.

18 How to build a parallel program
First build a sequential version of the desired parallel program; it
  Shows efficiency
  Shows correctness
  Shows the most time-consuming parts of the problem (via a profiler)
  Shows how much performance gain can be expected

19 Guidelines
Measure the duration of the whole execution, not just the parallel part
Create an average over several runs, excluding outliers
Scalability is important, so run on different data sizes and numbers of workers
Threads should not exceed the number of processors or cores
Hyperthreading should be disabled

20 Amdahl's law
A bunch of ants versus a herd of elephants
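The slide's metaphor contrasts many small processors with a few big ones, but the law itself is not written out. In the notation of the performance slides, with f the sequential fraction of the program and N workers:

  speedup(N) = 1 / (f + (1 - f)/N), which approaches 1/f as N grows

So if 10% of a program is sequential (f = 0.1), no number of "ants" can push the speedup past 10.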

21 Gustafson-Barsis's rebuttal
A parallel program does more than just speed up a sequential program – it can handle bigger problem instances
Rather than considering the parallel program relative to the sequential one, consider the sequential one compared to the parallel one
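Stated as a formula (not written out on the slide): if f is the sequential fraction of the parallel execution time and N the number of workers, the scaled (Gustafson-Barsis) speedup is

  speedup(N) = f + N × (1 - f) = N - f × (N - 1)

For example, with f = 0.1 and N = 100, speedup = 100 - 0.1 × 99 = 90.1 – far more optimistic than Amdahl's bound of 10 for the same f.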

