Multi-/Many-Core Processors
Ana Lucia Varbanescu
A.L.Varbanescu - PP course @ VU
Why? Ultimately, we arrived at multi-cores because we are searching for performance: we want more performance for our codes. 11/8/2018 A.L.Varbanescu - PP VU
In the search for performance
We have M(o)ore transistors … how do we use them?
- Bigger cores? We hit the walls*: power, memory, parallelism (ILP)
- "Dig through"? Requires new technologies
- "Go around"? Multi-/many-cores
*David Patterson – The Future of Computer Architecture – 2006
Multi-/many-cores: in the search for performance …
- Build (HW): what architectures?
- Evaluate (HW): what metrics? How do we measure?
- Use (HW + SW): what workloads? What performance to expect?
- Program (SW (+HW)): how to program? How to optimize?
- Benchmark: how to analyze performance?
Build
Choices …
- Core type(s): fat or slim? Vectorized (SIMD)? Homogeneous or heterogeneous?
- Number of cores: few or many?
- Memory: shared-memory or distributed-memory?
- Parallelism: SIMD/MIMD, SPMD/MPMD, …
Main constraint: chip area!
A taxonomy, based on "field of origin":
- General-purpose (GPP/GPMC): Intel, AMD
- Graphics (GPUs): NVIDIA, ATI
- Embedded systems: Philips/NXP, ARM
- Servers: Sun (Oracle), IBM
- Gaming/entertainment: Sony/Toshiba/IBM
- High-performance computing: Intel, IBM, …
General Purpose Processors
Architecture: few fat cores; homogeneous; stand-alone
Memory: shared, multi-layered; per-core cache
Programming: SMP machines; both symmetrical and asymmetrical threading; OS scheduler
Gain performance by … MPMD, coarse-grain parallelism
Intel
Intel’s next gen
AMD
AMD’s next gen
Server-side: GPP-like, but with more HW threads and lower performance per thread.
Examples:
- Sun UltraSPARC T2, T2+: 8 cores × 8 threads each, for high throughput
- IBM POWER7
Graphics Processing Units
Architecture: hundreds/thousands of slim cores; homogeneous; accelerator(s)
Memory: very complex hierarchy; both shared and per-core
Programming: off-load model; (many) symmetrical threads; hardware scheduler
Gain performance by … fine-grain parallelism, SIMT
NVIDIA G80/GT200/Fermi
- SM = streaming multiprocessor; 1 SM = 8 SPs (streaming processors, i.e., CUDA cores)
- TPC = thread processing cluster; 1 TPC = 2 SMs (G80) or 3 SMs (GT200)
NVIDIA GT200
NVIDIA Fermi
ATI GPUs
Cell/B.E.
Architecture: heterogeneous; 8 vector processors (SPEs) + 1 trimmed-down PowerPC (PPE); accelerator or stand-alone
Memory: per-core only
Programming: asymmetrical multi-threading; user-controlled scheduling; 6 levels of parallelism, all under user control
Gain performance by … fine- and coarse-grain parallelism (MPMD, SPMD), SPE-specific optimizations, scheduling
Cell/B.E.
- 1 × PPE: 64-bit PowerPC; L1: 32 KB I$ + 32 KB D$; L2: 512 KB
- 8 × SPE cores: local store (LS): 256 KB; 128 × 128-bit vector registers
- Main memory access: PPE: Rd/Wr; SPEs: async DMA
- Available in: Cell blades (QS2*): 2 × Cell; PS3: 1 × Cell (6 SPEs only)
Intel Single-chip Cloud Computer
Architecture: tile-based many-core (48 cores); a tile is a dual-core; stand-alone / cluster
Memory: per-core and per-tile; shared off-chip
Programming: multi-processing with message passing; user-controlled mapping/scheduling
Gain performance by … coarse-grain parallelism (MPMD, SPMD); multi-application workloads (cluster-like)
Intel SCC
Summary: Computation
Summary: Memory
Take-home message: there is a variety of platforms, differing in
- core types & counts
- memory architecture & sizes
- parallelism layers & types
- scheduling
Open question(s): Why so many? How many platforms do we need? Can any application run on any platform?
Evaluate – in theory…
HW Performance metrics
- Clock frequency [Hz]: absolute HW speed(s) of memories, CPUs, interconnects
- Operational speed [GFLOPs]: operations per cycle
- Bandwidth [GB/s]: memory access speed(s); differs a lot between the different memories on chip
- Power: per core / per chip
- Derived metrics: FLOP/Byte, FLOP/Watt
Peak performance:
Peak = #cores × #threads_per_core × #FLOPs/cycle × clock_frequency
Examples:
- Nehalem EX: 8 × 2 × 4 × 2.26 GHz = 170 GFLOPs
- HD 5870: (20 × 16) × 5 × 0.85 GHz = 1360 GFLOPs
- GF100: (16 × 32) × 2 × 1.45 GHz = 1484 GFLOPs
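The factorization above is easy to check numerically. A minimal sketch (`peak_gflops` is a hypothetical helper name; the GPU figures reuse the slide's factor decomposition):

```python
def peak_gflops(cores, flops_per_cycle, clock_ghz, threads_per_core=1):
    """Peak = #cores * #threads_per_core * #FLOPs/cycle * clock [GHz]."""
    return cores * threads_per_core * flops_per_cycle * clock_ghz

# HD 5870: 20 SIMD engines x 16 units, VLIW5 -> 5 FLOPs/cycle, 0.85 GHz
hd5870 = peak_gflops(20 * 16, 5, 0.85)   # ~1360 GFLOPs
# GF100: 16 SMs x 32 CUDA cores, FMA -> 2 FLOPs/cycle, 1.45 GHz
gf100 = peak_gflops(16 * 32, 2, 1.45)    # ~1484.8 GFLOPs
```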
On-chip memory bandwidth
Registers and per-core caches: see the specifications.
Shared memory: Peak_Data_Rate × Data_Bus_Width = (frequency × data_rate) × data_bus_width
Example(s):
- Nehalem (DDR3): … × 2 × 64 = … GB/s
- HD 5870: … × 256 = … GB/s
- Fermi: … × 384 = … GB/s
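The bandwidth formula above can be sketched in code; the DDR3-1333 numbers below are illustrative assumptions (the slide's clock values were not captured), not figures from the slide:

```python
def peak_bandwidth_gbs(mem_clock_mhz, data_rate, bus_width_bits):
    """Peak BW [GB/s] = (clock * data_rate) * bus_width (bits -> bytes)."""
    return mem_clock_mhz * 1e6 * data_rate * (bus_width_bits / 8) / 1e9

# Hypothetical DDR3-1333 channel: 666.67 MHz clock, double data rate, 64-bit bus
ddr3 = peak_bandwidth_gbs(666.67, 2, 64)   # roughly 10.7 GB/s per channel
```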
Off-chip memory bandwidth
Depends on the interconnect:
- Intel: QPI, 25.6 GB/s
- AMD: HT3, 19.2 GB/s
- Accelerators: PCI-e 1.0 / 2.0, 8 GB/s / 16 GB/s
Summary

Platform           Cores  Threads/ALUs  GFLOPS   BW (GB/s)  FLOPS/Byte
Cell/B.E.            8        -         204.80     25.6       8.0000
Nehalem EE           4        -          57.60     25.5       2.2588
Nehalem EX           -       16         170.00     63         2.6984
Niagara              -       32           9.33     20         0.4665
Niagara 2            -       64          11.20     76         0.1474
AMD Barcelona        -        -          37.00     21.4       1.7290
AMD Istanbul         6        -          62.40      -         2.4375
AMD Magny-Cours     12        -         124.80      -         4.8750
IBM POWER7           -        -         264.96     68.22      3.8839
G80                  -      128         404.80     86.4       4.6852
GT200               30      240         933.00    141.7       6.5843
GF100                -      512           -       201.6       7.3611
ATI Radeon 4890    160      800         680.00    124.8       5.4487
HD5870             320     1600           -       153.6       8.8542
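The last column is the derived FLOP/Byte metric from the metrics slide; a quick sanity check that it equals GFLOPS / BW for a few rows of the table:

```python
# (name, GFLOPS, BW in GB/s, FLOPS/Byte) -- rows copied from the summary table
rows = [
    ("Cell/B.E.",  204.80,  25.6, 8.0000),
    ("Nehalem EE",  57.60,  25.5, 2.2588),
    ("GT200",      933.00, 141.7, 6.5843),
]
for name, gflops, bw, ratio in rows:
    # the tabulated ratio should match GFLOPS / BW to 4 decimals
    assert abs(gflops / bw - ratio) < 5e-4, name
```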
Absolute HW performance [1]
Achieved only under optimal conditions:
- processing units 100% used
- all parallelism 100% exploited
- all data transfers at maximum bandwidth
How many applications are like this? Basically none; it is even hard to build the right benchmarks …
Evaluate – in use
Workloads. For a new application:
- design the parallel algorithm
- implement
- optimize
- benchmark
Any application can run on any platform, but the choice influences performance, portability, and productivity. Ideally, we want a good fit!
Performance goals differ per role:
- Hardware designer: how fast is my hardware running?
- End-user: how fast is my application running?
- End-user's manager: how efficient is my application?
- Developer's manager: how much time does it take to program?
- Developer: how close can I get to peak performance?
SW Performance metrics
- Execution time (user)
- Speed-up vs. the best available sequential implementation
- Achieved GFLOPs (developer / user's manager): computational efficiency
- Achieved GB/s (developer): memory efficiency
- Productivity and portability (developer's manager): production costs, maintenance costs
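The first metrics above reduce to two simple ratios; a minimal sketch with hypothetical numbers (the helper names are illustrative, not from the slides):

```python
def speedup(t_seq, t_par):
    """Speed-up vs. the best available sequential implementation."""
    return t_seq / t_par

def efficiency(achieved, peak):
    """Fraction of peak actually achieved (works for GFLOPs or GB/s)."""
    return achieved / peak

# hypothetical measurements: 120 s sequential vs. 6 s parallel,
# 340 GFLOPs achieved on a 1360 GFLOPs-peak platform
s = speedup(120.0, 6.0)        # 20x
e = efficiency(340.0, 1360.0)  # 0.25 -> 25% computational efficiency
```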
For example … hundreds of applications are reported to reach speed-ups of up to two orders of magnitude! Incredible performance! Or is it?
Developer: searching for peak performance …
- Which platform to use?
- What is the maximum I can achieve? And how?
Performance models:
- Amdahl's Law
- Arithmetic intensity and the Roofline model
Amdahl’s Law: how can we apply Amdahl’s Law to multi-core applications? (Discussion)
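The slide leaves this as a discussion point; for reference, the standard statement (with parallelizable fraction p of the work and N cores) is:

```latex
S(N) = \frac{1}{(1-p) + \frac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1-p}
```

So even with unlimited cores the serial fraction bounds the speed-up: for example, p = 0.95 caps the speed-up at 20×, no matter how many cores the platform offers.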
Arithmetic intensity (AI)
AI = #OPs / Byte: how many operations are executed per transferred byte? It determines the boundary between compute-intensive and data-intensive applications.
Applications' AI: is the application compute-intensive or memory-intensive? Typical AI classes (from the figure, ordered by arithmetic intensity):
- O(1): SpMV, BLAS 1/2; stencils (PDEs); lattice methods
- O(log N): FFTs
- O(N): dense linear algebra (BLAS 3); particle methods
Example: AI(RGB-to-gray conversion) = 5/4. Read: 3 B; write: 1 B; compute: 3 MUL + 2 ADD.
Platform AI: is the application compute-intensive or memory-intensive on a given platform? (Running example: RGB-to-gray.)
The Roofline model [1]
Achievable_peak = min { PeakGFLOPs, AI × StreamBW }
- PeakGFLOPs = platform peak
- StreamBW = streaming bandwidth
- AI = application arithmetic intensity
Theoretical peak values are to be replaced by "real" values, measured without various optimizations.
Assumptions:
- bandwidth is independent of arithmetic intensity
- complete overlap of communication and computation
- computation is independent of optimization
- bandwidth is independent of optimization or access pattern
[1] D. Lazowska, J. Zahorjan, G. Graham, K. Sevcik, "Quantitative System Performance"
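The min{…} formula above is directly executable; a minimal sketch (the peak and bandwidth figures are illustrative, roughly HD 5870-class numbers from the summary table):

```python
def roofline_gflops(peak_gflops, stream_bw_gbs, ai):
    """Attainable performance = min(platform peak, AI * streaming BW)."""
    return min(peak_gflops, ai * stream_bw_gbs)

# Platform: 1360 GFLOPs peak, 153.6 GB/s streaming bandwidth.
# At low AI the memory roof dominates (memory-bound):
low  = roofline_gflops(1360.0, 153.6, 1.25)   # 1.25 * 153.6 = 192 GFLOPs
# At high AI the compute roof dominates (compute-bound):
high = roofline_gflops(1360.0, 153.6, 16.0)   # capped at 1360 GFLOPs
```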
The Roofline model [2]: a log-log plot of attainable GFLOP/s (2 to 128) vs. flop:DRAM byte ratio (1/8 to 8). In the figure:
- black: theoretical peak
- yellow: no streaming optimizations
- green: no in-core optimizations
- red: "worst case" performance zone
- dashed: the application
Use the Roofline model to determine what to do first to gain performance:
- increase arithmetic intensity
- increase the streaming rate
- apply in-core optimizations
… and these are topics for your next lecture.
Samuel Williams et al.: "Roofline: an insightful visual performance model for multicore architectures"
Take-home message: performance evaluation depends on your goals:
- execution time (users)
- GFLOPs and GB/s (developers)
- efficiency (budget holders)
Stop tweaking when you reach your performance goal, or when you are constrained by the capabilities of the (application, platform) pair, e.g., as predicted by the Roofline model.
Choose the platform to fit the application: parallelism layers, arithmetic intensity, streaming capabilities.
Questions
Ana Lucia Varbanescu