Multicore and GPU Programming


1 Multicore and GPU Programming
Multiple CPUs led to multicore, which led to the GPU
GPU – hundreds to thousands of cores
GPU – an order of magnitude faster

2 Flynn's Taxonomy of Parallel Architectures
SISD – Single Instruction, Single Data
SIMD – Single Instruction, Multiple Data
MISD – Multiple Instruction, Single Data
MIMD – Multiple Instruction, Multiple Data

3 Graphics Coding
NVIDIA – CUDA
AMD – APU (Accelerated Processing Unit), covered later
OpenCL – open standard supported by Intel, AMD, and others

4 Cell BE processor
Used in Sony's PS3
A master–worker heterogeneous MIMD machine on a chip
Master: PPE (Power Processing Element)
  Runs 2 threads; runs the OS and manages the workers
Workers: 8 SPEs (Synergistic Processing Elements)
  128-bit vector processors (SIMD)
  256K of local memory holds both data and code

5 Cell BE continued
PPE and SPE instruction sets are incompatible
Difficult to program – speed was bought at the cost of ease of programming
102.4 Gflops peak
IBM Roadrunner supercomputer (the world's fastest in 2008):
  12,240 PowerXCell 8i and 6,562 AMD Opteron processors
  PowerXCell 8i is an enhanced version of the original Cell processor
No longer built – too complex

6 Nvidia's Kepler
The third GPU architecture Nvidia designed for compute applications
Cores are arranged in groups called Streaming Multiprocessors (SMX)
  192 cores per SMX, operating in SIMD fashion
  Each SMX can run its own program
Chips in the Kepler family are distinguished by the number of SMX blocks
  The GTX Titan has 15 SMXs, 14 of which are usable: 14 × 192 = 2688 cores
  The dual-GPU GTX Titan Z has 5760 cores

7 The Process
1. Send data to the GPU
2. Launch a kernel
3. Wait and collect the results
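A minimal CUDA sketch of these three steps, assuming a hypothetical vector-add kernel (vecAdd, the array names, and the sizes are illustrative, not from the deck); error checking is omitted for brevity:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one element per thread */
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);

    /* Step 1: send data to the GPU */
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    /* Step 2: launch a kernel */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    /* Step 3: wait and collect the results; this device-to-host copy
       synchronizes with the kernel in the default stream */
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);  /* expect 3.0 */

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}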

8 AMD APUs
CPU and GPU on the same chip
Shared memory eliminates host-to-device memory transfers
Implement AMD's Heterogeneous System Architecture (HSA)
Two core types:
  Latency Compute Unit (LCU) – a generalized CPU
    Supports the native CPU instruction set and the HSA Intermediate Language (HSAIL) instruction set
  Throughput Compute Unit (TCU) – a generalized GPU
    Supports only HSAIL; targets efficient parallel execution

9 Multicore to Many-Core: Tilera's Tile-Gx8072
Tilera's many-core line dates back to 2007
Cores arranged in a 2-dimensional grid – a mesh computer
Up to 72 cores
Integrated with the CPU/OS

10 Intel's Xeon Phi (2012)
Used in 2 of the top 10 supercomputers; China's Tianhe-2 is #1
61 x86 cores, each handling 4 threads at the same time
512-bit-wide Vector Processing Unit (VPU), SIMD:
  16 single-precision or 8 double-precision floating-point numbers per clock cycle
Each core has 32K data and 32K instruction L1 caches and a 512K L2 cache
Easy to use – ordinary OpenMP works (see the sketch below)
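A minimal sketch of the kind of plain OpenMP code the slide has in mind, assuming native execution on the Phi; nothing here is Phi-specific, and the array, its size, and the reduction are illustrative:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 1000000;
    double *a = (double *)malloc(n * sizeof(double));
    double sum = 0.0;

    /* OpenMP spreads the iterations across all hardware threads;
       on a 61-core Phi that is up to 61 * 4 = 244 threads */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        a[i] = (double)i;
        sum += a[i];
    }

    printf("max threads = %d, sum = %.0f\n", omp_get_max_threads(), sum);
    free(a);
    return 0;
}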

11 PERFORMANCE
Speedup = time_seq / time_parallel (measured as wall-clock time)
Affected by:
  Programmer skill
  Compiler and compiler switches
  OS
  File system (ext4, NTFS, ...)
  Load
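A quick worked example (the numbers are illustrative, not from the deck): if the sequential version takes 20 s of wall-clock time and the parallel version takes 4 s, speedup = 20 / 4 = 5.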

12 EFFICIENCY
Efficiency = speedup / N = time_seq / (N × time_parallel)
N is the number of CPUs or cores
If speedup = N we have linear speedup – the ideal
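Continuing the illustrative numbers above: a speedup of 5 on N = 8 cores gives efficiency = 5/8 ≈ 0.63, while a speedup of 8 on 8 cores gives efficiency = 1, i.e., linear speedup.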

13 Hyperthreading
Runs 2 software threads per core by duplicating part of the CPU state
Makes the OS think there are twice as many processors
Yields up to roughly 30% speedup

14 More resources – more speedup?
Not necessarily so:
  The sequential part limits the gain
  Superlinear speedup (speedup > N) is only a fair claim if the parallel program finds the solution in exactly the same way as the sequential one

15 Scaling
More resources should yield more speedup; if there is no scaling, the design is probably poor
Strong scaling: strongScalingEfficiency(N) = time_seq / (N × time_parallel)
  The same as the general efficiency above

16 Weak Scaling
weakScalingEfficiency(N) = t_seq / t'_par
t'_par is the time to solve a problem N times bigger than the one the single machine solves in time t_seq
GPUs pose a bigger challenge for these metrics:
  One never uses a single GPU core to measure t_seq
  It is not fair to use a CPU for t_seq either
  A GPU needs a host CPU – does the host count?
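An illustrative example (numbers invented for the sake of the arithmetic): if one machine solves a problem of size S in t_seq = 10 s, and N = 8 machines solve a problem of size 8S in t'_par = 12.5 s, then weakScalingEfficiency(8) = 10 / 12.5 = 0.8.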

17 Building a parallel program
Coordination problems:
  Access to shared resources
  Load-balancing issues
  Termination problems – halting all workers in a coordinated fashion
  Etc.

18 How to build a parallel program
First build a sequential version of the desired parallel program; it
  Shows efficiency
  Shows correctness
  Shows the most time-consuming parts of the problem (via a profiler)
  Shows how much performance gain can be expected

19 Guidelines
Measure the duration of the whole execution, not just the parallel part
Create an average over several runs, excluding outliers
Scalability is important, so run on different data sizes and numbers of workers
Threads should not exceed the number of processors or cores
Hyperthreading should be disabled

20 Amdahl's law
A bunch of ants versus a herd of elephants
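The slide's metaphor contrasts many small processors with a few big ones, but the law itself is not written out. In the notation of the performance slides, with f the sequential fraction of the program and N workers:

  speedup(N) = 1 / (f + (1 - f)/N), which approaches 1/f as N grows

So if 10% of a program is sequential (f = 0.1), no number of "ants" can push the speedup past 10.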

21 Gustafson-Barsis's rebuttal
A parallel program does more than just speed up a sequential program – it can handle bigger problem instances
Rather than considering the parallel program relative to the sequential one, consider the sequential one compared to the parallel one
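Stated as a formula (not written out on the slide): if f is the sequential fraction of the parallel execution time and N the number of workers, the scaled (Gustafson-Barsis) speedup is

  speedup(N) = f + N × (1 - f) = N - f × (N - 1)

For example, with f = 0.1 and N = 100, speedup = 100 - 0.1 × 99 = 90.1 – far more optimistic than Amdahl's bound of 10 for the same f.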

