1
Multicore and GPU Programming
Multiple CPUs led to multicore, and multicore led to the GPU
GPU – hundreds to thousands of cores
GPU – an order of magnitude faster
2
Flynn’s Taxonomy of Parallel Architectures
SISD – Single Instruction, Single Data
SIMD – Single Instruction, Multiple Data
MISD – Multiple Instruction, Single Data
MIMD – Multiple Instruction, Multiple Data
3
Graphics Coding
NVIDIA – CUDA
Intel – APU (accelerated processing unit), OpenCL
4
Cell BE processor
Sony’s PS3
Master–worker, heterogeneous
MIMD machine on a chip
Master: 2 threads – the PPE (Power Processing Element), which runs the OS and manages the workers
Workers: 8 SPEs (Synergistic Processing Elements) – 128-bit vector processors, SIMD
Local memory: 256 KB, holds both data and code
5
Cell BE continued
PPE and SPE instruction sets are incompatible
Difficult to program – a trade-off of speed versus ease of programming
102.4 Gflops
IBM Roadrunner supercomputer, once the world’s fastest: 12,240 PowerXCell 8i and 6,562 AMD Opteron processors
PowerXCell 8i is an enhanced version of the original Cell processor
Not built anymore – too complex
6
Nvidia’s Kepler
The third GPU architecture Nvidia designed for compute applications
Cores are arranged in groups called Streaming Multiprocessors (SMX)
192 cores per SMX, operating in SIMD fashion
Each SMX can run its own program
Chips in the Kepler family are distinguished by the number of SMX blocks
GTX Titan has 15 SMXs, 14 of which are usable: 14 * 192 = 2688 cores
The dual-GPU GTX Titan Z has 5760 cores
7
The Process:
Send data to the GPU
Launch a kernel
Wait and collect the results
(a minimal CUDA sketch of these three steps follows)
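The sketch below illustrates the three steps with the CUDA runtime API. The kernel name scaleArray, the array size, and the scale factor are hypothetical placeholders chosen only for illustration; they are not part of the slides.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical kernel: scales each element of an array (placeholder workload).
__global__ void scaleArray(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;                    // 1M elements (assumed size)
    const size_t bytes = n * sizeof(float);

    float *host = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    // Step 1: send data to the GPU
    float *dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    // Step 2: launch a kernel – enough 256-thread blocks to cover all n elements
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleArray<<<blocks, threadsPerBlock>>>(dev, 2.0f, n);

    // Step 3: wait and collect the results (this copy synchronizes with the kernel)
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

    printf("host[0] = %f\n", host[0]);        // expect 2.0
    cudaFree(dev);
    free(host);
    return 0;
}
```

Each thread block is scheduled onto one of the streaming multiprocessors described on the previous slide, so launching many blocks is what keeps all of the cores busy.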
8
AMD APUs
CPU and GPU on the same chip
Shared memory eliminates memory transfers
AMD’s APUs implement the Heterogeneous System Architecture (HSA)
2 core types:
Latency Compute Unit (LCU) – a general CPU; supports the native CPU instruction set and the HSA Intermediate Language (HSAIL) instruction set
Throughput Compute Unit (TCU) – a general GPU; supports only HSAIL; targets efficient parallel execution
9
Multicore to Many-Core: Tilera’s Tile-GX8072 (2007)
2-dimensional grid – a mesh computer
Up to 72 cores
Integrated with the CPU/OS
10
Intel’s Xeon Phi (2012)
Used in 2 of the top 10 supercomputers; China’s Tianhe-2 is the top one
61 x86 cores, each handling 4 threads at the same time
512-bit-wide Vector Processing Unit (VPU), SIMD: 16 single-precision or 8 double-precision floating-point numbers per clock cycle
Each core has 32 KB data and 32 KB instruction L1 caches and a 512 KB L2 cache
Easy to use – can use OpenMP (see the sketch below)
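Because the Phi cores run standard x86 code, ordinary OpenMP is enough to use them. The following is a minimal sketch in plain C; the array size and the vector-add workload are assumed purely for illustration, and the same code runs on any multicore CPU (it can be built for the Phi with Intel’s compiler).

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000   /* assumed problem size */

static double a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0 * i; }

    /* OpenMP spreads the loop iterations across all available cores/threads */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        c[i] = a[i] + b[i];

    printf("c[10] = %f, max threads = %d\n", c[10], omp_get_max_threads());
    return 0;
}
```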
11
PERFORMANCE
Speedup = time_seq / time_parallel (measured as wall-clock time)
Affected by:
Programmer skill
Compiler
Compiler switches
OS
File system (EXT4, NTFS, …)
Load
12
EFFICIENCY
Efficiency = speedup / N = time_seq / (N * time_parallel)
N is the number of CPUs or cores
If speedup = N we have linear speedup – the ideal case (a worked example follows)
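A worked example with assumed timings (not from the slides): if the sequential version takes 80 s and the parallel version takes 10 s on N = 16 cores, then speedup = 80 / 10 = 8 and efficiency = 8 / 16 = 0.5, i.e. 50% of the linear (ideal) speedup of 16.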