Multicore and GPU Programming
- Multiple CPUs led to multiple cores, which led to GPUs
- GPUs: hundreds to thousands of cores
- GPUs: an order of magnitude faster
Flynn's Taxonomy of Parallel Architectures
- SISD: Single Instruction, Single Data
- SIMD: Single Instruction, Multiple Data
- MISD: Multiple Instruction, Single Data
- MIMD: Multiple Instruction, Multiple Data
Graphics Coding
- NVIDIA: CUDA
- AMD: APUs (Accelerated Processing Units)
- OpenCL: open, cross-vendor standard
Cell BE Processor
- Used in Sony's PS3
- Master-worker heterogeneous MIMD machine on a chip
- Master: PPE (Power Processing Element), runs 2 threads
  - Runs the OS and manages the workers
- Workers: 8 SPEs (Synergistic Processing Elements), 128-bit vector processors
  - SIMD
  - 256 KB local memory holds both data and code
Cell BE (continued)
- PPE and SPE instruction sets are incompatible
- Difficult to program: speed versus ease of programming
- 102.4 Gflops
- IBM's Roadrunner supercomputer, fastest in 2008-2009: 12,240 PowerXCell 8i and 6,562 AMD Opteron processors
  - PowerXCell 8i is an enhanced version of the original Cell processor
- No longer built: too complex
Nvidia's Kepler
- The third GPU architecture Nvidia designed for compute applications
- Cores arranged in groups called Streaming Multiprocessors (SMX)
- 192 cores per SMX, operating in SIMD fashion
- Each SMX can run its own program
- Chips in the Kepler family are distinguished by the number of SMX blocks
- GTX Titan has 15 SMXs, 14 of which are usable: 14 × 192 = 2688 cores
- The dual-GPU GTX Titan Z has 5760 cores
The Process:
1. Send data to the GPU
2. Launch a kernel
3. Wait and collect the results
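The three steps above can be sketched in CUDA; the kernel name, array size, and launch geometry below are illustrative assumptions, not taken from the slides.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread squares one element of the array.
__global__ void square(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

int main(void) {
    const int N = 1024;
    float h[N], *d;
    for (int i = 0; i < N; i++) h[i] = (float)i;
    cudaMalloc(&d, N * sizeof(float));

    // 1. Send data to the GPU
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);

    // 2. Launch a kernel: 4 blocks of 256 threads cover all 1024 elements
    square<<<4, 256>>>(d, N);

    // 3. Wait and collect the results (this copy blocks until the kernel is done)
    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d);
    printf("h[3] = %f\n", h[3]); // 3 squared = 9.0
    return 0;
}
```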
AMD APUs
- CPU and GPU on the same chip
- Shared memory eliminates CPU-to-GPU memory transfers
- APUs implement AMD's Heterogeneous System Architecture (HSA)
- Two core types:
  - Latency Compute Unit (LCU): a generalized CPU
    - Supports the native CPU instruction set and the HSA Intermediate Language (HSAIL) instruction set
  - Throughput Compute Unit (TCU): a generalized GPU
    - Supports only HSAIL; targets efficient execution
Multicore to Many-Core: Tilera's Tile-Gx8072
- 2007
- Cores arranged in a 2-dimensional grid: a mesh computer
- Up to 72 cores
- Integrated with the CPU/OS
Intel's Xeon Phi
- 2012; used in 2 of the top 10 supercomputers (China's Tianhe-2 is #1)
- 61 x86 cores, each handling 4 threads at the same time
- 512-bit-wide Vector Processing Unit (VPU): SIMD over 16 single-precision or 8 double-precision floating-point numbers per clock cycle
- Each core has 32 KB data and 32 KB instruction L1 caches and a 512 KB L2 cache
- Easy to use: can be programmed with OpenMP
PERFORMANCE
- Speedup = time_seq / time_parallel (measured as wall-clock time)
- Affected by:
  - Programmer skill
  - Compiler and compiler switches
  - OS
  - File system (ext4, NTFS, ...)
  - System load
EFFICIENCY
- Efficiency = speedup / N = time_seq / (N × time_parallel)
- N is the number of CPUs or cores
- If speedup = N we have linear speedup: the ideal case