Multicore and GPU Programming
- Multiple CPUs led to multiple cores, which led to GPUs
- GPUs offer hundreds to thousands of cores
- GPUs can be an order of magnitude faster than CPUs on suitable workloads
Flynn's Taxonomy of Parallel Architectures
- SISD: Single Instruction, Single Data (a conventional uniprocessor)
- SIMD: Single Instruction, Multiple Data (vector units, GPU cores)
- MISD: Multiple Instruction, Single Data (rare in practice)
- MIMD: Multiple Instruction, Multiple Data (multicore CPUs, clusters)
Graphics Coding
- NVIDIA: CUDA
- AMD: APU (Accelerated Processing Unit)
- OpenCL: an open standard, supported by Intel and others
Cell BE Processor
- Used in Sony's PS3
- A master-worker heterogeneous MIMD machine on a chip
- Master: the PPE (Power Processing Element), running 2 threads
  - Runs the OS and manages the workers
- Workers: 8 SPEs (Synergistic Processing Elements), 128-bit vector processors
  - 256 KB of local memory holds both data and code
  - SIMD execution
Cell BE (continued)
- The PPE and SPE instruction sets are incompatible
- Difficult to program: a trade-off of speed versus ease of programming
- 102.4 Gflops peak
- IBM's Roadrunner supercomputer, the world's fastest in 2008-2009, used 12,240 PowerXCell 8i processors alongside 6,562 AMD Opteron processors
  - The PowerXCell 8i is an enhanced version of the original Cell processor
- No longer built: too complex
Nvidia's Kepler
- The third GPU architecture Nvidia designed for compute applications
- Cores are arranged in groups called Streaming Multiprocessors (SMX)
- 192 cores per SMX, executing in SIMD fashion
- Each SMX can run its own program
- Chips in the Kepler family are distinguished by the number of SMX blocks
- The GTX Titan has 15 SMXs, 14 of which are usable: 14 * 192 = 2688 cores
- The dual-GPU GTX Titan Z has 5760 cores
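The SMX count of a particular chip can be queried at run time. A minimal sketch using the CUDA runtime API (cudaGetDeviceProperties is the standard call; the figure of 192 cores per SMX is hard-coded here and holds only for Kepler):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        // Query the properties of GPU 0
        if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
            fprintf(stderr, "no CUDA device found\n");
            return 1;
        }
        // multiProcessorCount = number of SMX units on this chip;
        // 192 cores per SMX is specific to the Kepler architecture
        printf("%s: %d SMXs -> %d cores\n", prop.name,
               prop.multiProcessorCount, prop.multiProcessorCount * 192);
        return 0;
    }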
The Process
1. Send data to the GPU
2. Launch a kernel
3. Wait and collect the results
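A minimal CUDA sketch of these three steps (the scale kernel and the array size are made up for illustration):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Trivial kernel: each thread doubles one element
    __global__ void scale(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main() {
        const int N = 1 << 20;
        float *h = new float[N];
        for (int i = 0; i < N; i++) h[i] = 1.0f;
        float *d;
        cudaMalloc(&d, N * sizeof(float));

        // Step 1: send data to the GPU
        cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);

        // Step 2: launch a kernel (asynchronous with respect to the host)
        scale<<<(N + 255) / 256, 256>>>(d, N);

        // Step 3: wait and collect the results (this copy synchronizes)
        cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);

        printf("h[0] = %.1f\n", h[0]);   // expect 2.0
        cudaFree(d);
        delete[] h;
        return 0;
    }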
AMD APUs
- CPU and GPU on the same chip
- Shared memory eliminates host-device memory transfers
- APUs implement AMD's Heterogeneous System Architecture (HSA)
- Two core types:
  - Latency Compute Unit (LCU): a generalized CPU
    - Supports both the native CPU instruction set and the HSA Intermediate Language (HSAIL) instruction set
  - Throughput Compute Unit (TCU): a generalized GPU
    - Supports only HSAIL
    - Targets efficient parallel execution
Multicore to Many-Core: Tilera's Tile-GX8072
- 2007
- Cores arranged in a two-dimensional grid: a mesh computer
- Up to 72 cores
- Integrated with the CPU/OS
Intel's Xeon Phi
- Introduced in 2012
- Used in 2 of the top 10 supercomputers; China's Tianhe-2 is number one
- 61 x86 cores, each handling 4 threads at the same time
- 512-bit-wide Vector Processing Unit (VPU), SIMD: 16 single-precision or 8 double-precision floating-point numbers per clock cycle
- Each core has 32 KB data and 32 KB instruction L1 caches, plus a 512 KB L2 cache
- Easy to use: can be programmed with OpenMP (see the sketch below)
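A minimal OpenMP sketch of the style of code that runs on the Phi (ordinary C++ host code; the loop and array size are illustrative):

    #include <cstdio>
    #include <omp.h>

    int main() {
        const int N = 1 << 20;
        static float a[N];

        // One directive spreads the loop across all available hardware
        // threads (up to 61 cores * 4 threads = 244 on the Phi)
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = i * 0.5f;

        printf("up to %d threads available\n", omp_get_max_threads());
        return 0;
    }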
Performance
- Speedup = time_seq / time_par, measured as wall-clock time
- Affected by:
  - Programmer skill
  - Compiler and compiler switches
  - OS
  - File system (ext4, NTFS, ...)
  - System load
Efficiency
- Efficiency = speedup / N = time_seq / (N * time_par)
- N is the number of CPUs or cores
- If speedup = N we have linear speedup: the ideal case
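A worked example with made-up numbers: suppose the sequential program takes 100 s and the parallel version takes 25 s on N = 8 cores. Then

\[
S = \frac{t_{\mathrm{seq}}}{t_{\mathrm{par}}} = \frac{100}{25} = 4,
\qquad
E = \frac{S}{N} = \frac{4}{8} = 0.5
\]

so the program achieves only half of linear speedup.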
Hyperthreading
- Runs 2 software threads per core by duplicating part of the CPU
- Makes the OS think there are twice as many processors
- Yields about a 30% speedup, not 2x
More Resources, More Speedup?
- Not necessarily so: the sequential part of the program limits the gain
- Superlinear speedup (speedup > N) should not occur if the parallel program finds the solution in exactly the same way as the sequential one; when it is genuine, it usually comes from effects such as the larger aggregate cache
Scaling
- More resources should yield more speedup
- If a program does not scale, it is probably poorly designed
- Strong scaling efficiency(N) = time_seq / (N * time_par)
- This is the same as the general efficiency above
Weak Scaling
- Weak scaling efficiency(N) = t_seq / t'_par
- t'_par is the time to solve a problem N times bigger than the one the single machine solves in time t_seq
- GPUs pose a bigger measurement challenge:
  - One never uses a single GPU core to measure t_seq
  - It is not fair to use the CPU for t_seq either
  - The GPU needs a host CPU: does the host count as a resource?
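The two notions side by side, with an illustrative weak-scaling measurement (the numbers are made up):

\[
E_{\mathrm{strong}}(N) = \frac{t_{\mathrm{seq}}}{N \cdot t_{\mathrm{par}}},
\qquad
E_{\mathrm{weak}}(N) = \frac{t_{\mathrm{seq}}}{t'_{\mathrm{par}}}
\]

For example, if one core solves a problem in t_seq = 60 s and 16 cores solve a 16 times bigger problem in t'_par = 75 s, the weak scaling efficiency is 60/75 = 0.8.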
Building a Parallel Program
- Coordination problems:
  - Access to shared resources
  - Load-balancing issues
  - Termination problems: getting all workers to halt in a coordinated fashion
  - Etc.
How to Build a Parallel Program
- Start by building a sequential version of the desired parallel program; it:
  - Provides an efficiency baseline
  - Demonstrates correctness
  - Reveals the most time-consuming parts of the problem (via a profiler)
  - Shows how much performance gain can be expected
Guidelines
- Measure the duration of the whole execution, not just the parallel part
- Average over several runs, excluding outliers
- Scalability matters, so run on different data sizes and numbers of workers
- The number of threads should not exceed the number of processors or cores
- Hyperthreading should be disabled (a timing sketch follows below)
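A minimal host-side timing sketch along these guidelines (plain C++; run_workload is a hypothetical stand-in for the program under test):

    #include <chrono>
    #include <cstdio>

    // Hypothetical stand-in for the program being measured
    void run_workload() {
        volatile double x = 0.0;
        for (long i = 0; i < 100000000L; i++) x += i * 0.5;
    }

    int main() {
        const int RUNS = 5;
        double total = 0.0;
        for (int r = 0; r < RUNS; r++) {
            // Wall-clock time around the whole execution,
            // not just the parallel part
            auto t0 = std::chrono::steady_clock::now();
            run_workload();
            auto t1 = std::chrono::steady_clock::now();
            double s = std::chrono::duration<double>(t1 - t0).count();
            printf("run %d: %.3f s\n", r, s);
            total += s;
        }
        // A real harness would also discard outliers before averaging
        printf("average over %d runs: %.3f s\n", RUNS, total / RUNS);
        return 0;
    }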
Amdahl's Law
- A bunch of ants versus a herd of elephants: can many weak processors beat a few powerful ones?
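The law itself, stated here for completeness (standard form): if a fraction \alpha of the execution is inherently sequential, then on N processors

\[
S(N) = \frac{1}{\alpha + \frac{1 - \alpha}{N}} \le \frac{1}{\alpha}
\]

For example, with \alpha = 0.1, even infinitely many processors cannot push the speedup past 10.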
Gustafson-Barsis's Rebuttal
- A parallel program does more than just speed up a sequential program: it can handle bigger problem instances
- Rather than measuring the parallel program relative to the sequential one, consider how the sequential program would fare on the parallel program's larger workload
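Stated as a formula (the standard form of Gustafson-Barsis's law): if \alpha is the sequential fraction of the parallel execution on N processors, the scaled speedup is

\[
S(N) = N - \alpha (N - 1)
\]

which grows linearly with N for a fixed \alpha, instead of saturating at 1/\alpha as Amdahl's bound does.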