Multi-/Many-Core Processors


Ana Lucia Varbanescu (analucia@cs.vu.nl)

Why?
Ultimately, we arrived at multi-cores in the search for performance: we want more performance for our codes.

In the search for performance

In the search for performance
We have M(o)ore transistors … how do we use them?
Bigger cores? They hit the walls*: power, memory, parallelism (ILP).
"Dig through"? Requires new technologies.
"Go around"? Multi-/many-cores.
*David Patterson, The Future of Computer Architecture, 2006: http://www.slidefinder.net/f/future_computer_architecture_david_patterson/6912680

Multi-/many-cores
In the search for performance …
Build (HW): what architectures?
Evaluate (HW): what metrics? How do we measure?
Use (HW + SW): what workloads? What performance to expect?
Program (SW (+HW)): how to program? How to optimize?
Benchmark: how to analyze performance?

Build

Choices …
Core type(s): fat or slim? Vectorized (SIMD)? Homogeneous or heterogeneous?
Number of cores: few or many?
Memory: shared-memory or distributed-memory?
Parallelism: SIMD/MIMD, SPMD/MPMD, …
Main constraint: chip area!

A taxonomy
Based on "field of origin":
General-purpose (GPP/GPMC): Intel, AMD
Graphics (GPUs): NVIDIA, ATI
Embedded systems: Philips/NXP, ARM
Servers: Sun (Oracle), IBM
Gaming/entertainment: Sony/Toshiba/IBM
High-performance computing: Intel, IBM, …

General-Purpose Processors
Architecture: few fat cores; homogeneous; stand-alone.
Memory: shared, multi-layered; per-core cache.
Programming: SMP machines; both symmetrical and asymmetrical threading; OS scheduler.
Gain performance from: MPMD, coarse-grain parallelism.

Intel

Intel's next gen

AMD

AMD's next gen

Server-side
GPP-like, with more HW threads and lower performance per thread.
Examples: Sun UltraSPARC T2, T2+ (8 cores x 8 threads each; high throughput); IBM POWER7.

Graphics Processing Units
Architecture: hundreds/thousands of slim cores; homogeneous; used as accelerator(s).
Memory: very complex hierarchy; both shared and per-core.
Programming: off-load model; (many) symmetrical threads; hardware scheduler.
Gain performance from: fine-grain parallelism, SIMT.

NVIDIA G80/GT200/Fermi
SM = streaming multiprocessor; 1 SM = 8 SPs (streaming processors / CUDA cores).
TPC = thread processing cluster; 1 TPC = 2 SMs (G80) or 3 SMs (GT200).

NVIDIA GT200

NVIDIA Fermi

ATI GPUs

Cell/B.E.
Architecture: heterogeneous; 8 vector processors (SPEs) + 1 trimmed-down PowerPC (PPE); accelerator or stand-alone.
Memory: per-core only.
Programming: asymmetrical multi-threading; user-controlled scheduling; 6 levels of parallelism, all under user control.
Gain performance from: fine- and coarse-grain parallelism (MPMD, SPMD); SPE-specific optimizations; scheduling.

Cell/B.E.
1 x PPE: 64-bit PowerPC; L1: 32 KB I$ + 32 KB D$; L2: 512 KB.
8 x SPE cores: local store (LS): 256 KB; 128 x 128-bit vector registers.
Main memory access: PPE: read/write; SPEs: asynchronous DMA.
Available in: Cell blades (QS2*): 2 x Cell; PS3: 1 x Cell (6 SPEs only).

Intel Single-chip Cloud Computer
Architecture: tile-based many-core (48 cores); a tile is a dual-core; stand-alone or cluster.
Memory: per-core and per-tile; shared off-chip.
Programming: multi-processing with message passing; user-controlled mapping/scheduling.
Gain performance from: coarse-grain parallelism (MPMD, SPMD); multi-application workloads (cluster-like).

Intel SCC

Summary: Computation
(Replace with LCPC table)

Summary: Memory

Take-home message
Variety of platforms: core types & counts, memory architecture & sizes, parallelism layers & types, scheduling.
Open question(s): Why so many? How many platforms do we need? Can any application run on any platform?

Evaluate – in theory…

HW performance metrics
Clock frequency [Hz]: absolute HW speed(s) of memories, CPUs, interconnects.
Operational speed [GFLOPS]: operations per cycle.
Bandwidth [GB/s]: memory access speed(s); differs a lot between the different memories on a chip.
Power: per core / per chip.
Derived metrics: FLOP/Byte, FLOP/Watt.

Peak performance
Peak = #cores * #threads_per_core * #FLOPS/cycle * clock_frequency
Examples:
Nehalem EX: 8 * 2 * 4 * 2.26 GHz = 170 GFLOPS
HD 5870: (20*16) * 5 * 0.85 GHz = 1360 GFLOPS
GF100: (16*32) * 2 * 1.45 GHz = 1484 GFLOPS
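A minimal sketch of this calculation (the helper below is illustrative, not from the slides); it reproduces the HD 5870 and GF100 examples above, with threads_per_core set to 1 for both:

```c
/* Minimal sketch (not from the slides) of the peak-performance formula:
 * peak = cores * threads_per_core * flops_per_cycle * clock_frequency. */
#include <stdio.h>

static double peak_gflops(int cores, int threads_per_core,
                          int flops_per_cycle, double clock_ghz)
{
    return (double)cores * threads_per_core * flops_per_cycle * clock_ghz;
}

int main(void)
{
    /* HD 5870: 20 SIMD engines x 16 units, 5 FLOPS/cycle, 0.85 GHz (slide values). */
    printf("HD 5870: %.0f GFLOPS\n", peak_gflops(20 * 16, 1, 5, 0.85));
    /* GF100: 16 SMs x 32 cores, 2 FLOPS/cycle, 1.45 GHz (slide values). */
    printf("GF100:   %.0f GFLOPS\n", peak_gflops(16 * 32, 1, 2, 1.45));
    return 0;
}
```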

On-chip memory bandwidth
Registers and per-core caches: see the specification.
Shared memory: Peak_Data_Rate x Data_Bus_Width = (frequency * data_rate) * data_bus_width
Example(s):
Nehalem DDR3: 1.333 * 2 * 64 bits = 21 GB/s
HD 5870: 4.800 * 256 bits = 153.6 GB/s
Fermi: 4.200 * 384 bits = 201.6 GB/s
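The same back-of-the-envelope style works for the bandwidth formula; a minimal sketch, assuming the effective data rate is given in GT/s and the bus width in bits:

```c
/* Minimal sketch of the bandwidth formula:
 * peak BW [GB/s] = effective data rate [GT/s] * bus width [bits] / 8. */
#include <stdio.h>

static double peak_bw_gbs(double data_rate_gts, int bus_width_bits)
{
    return data_rate_gts * bus_width_bits / 8.0;
}

int main(void)
{
    /* Slide values: DDR3-1333 (1.333 GHz x 2) on a 64-bit channel. */
    printf("Nehalem DDR3: %.1f GB/s\n", peak_bw_gbs(1.333 * 2, 64));
    /* HD 5870: 4.8 GT/s effective on a 256-bit bus. */
    printf("HD 5870:      %.1f GB/s\n", peak_bw_gbs(4.8, 256));
    /* Fermi: 4.2 GT/s effective on a 384-bit bus. */
    printf("Fermi:        %.1f GB/s\n", peak_bw_gbs(4.2, 384));
    return 0;
}
```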

Off-chip memory bandwidth
Depends on the interconnect:
Intel's technology: QPI, 25.6 GB/s
AMD's technology: HT3, 19.2 GB/s
Accelerators: PCIe 1.0 or 2.0, 8 GB/s or 16 GB/s

Summary

Platform          Cores   Threads/ALUs    GFLOPS   BW (GB/s)   FLOPS/Byte
Cell/B.E.             8                   204.80      25.6        8.0000
Nehalem EE            4                    57.60      25.5        2.2588
Nehalem EX                      16        170.00      63          2.6984
Niagara                         32          9.33      20          0.4665
Niagara 2                       64         11.20      76          0.1474
AMD Barcelona                              37.00      21.4        1.7290
AMD Istanbul          6                    62.40                  2.4375
AMD Magny-Cours      12                   124.80                  4.8750
IBM Power 7                               264.96      68.22       3.8839
G80                            128        404.80      86.4        4.6852
GT200                30        240        933.00     141.7        6.5843
GF100                          512       1484.00     201.6        7.3611
ATI Radeon 4890     160        800        680.00     124.8        5.4487
HD5870              320       1600       1360.00     153.6        8.8542

Absolute HW performance [1]
Achieved only under optimal conditions:
Processing units 100% used
All parallelism 100% exploited
All data transfers at maximum bandwidth
How many applications are like this? Basically none; it is even hard to build the right benchmarks …

Evaluate – in use

Workloads
For a new application: design the parallel algorithm, implement, optimize, benchmark.
Any application can run on any platform … but the choice influences performance, portability, and productivity.
Ideally, we want a good fit!

Performance goals
Hardware designer: how fast is my hardware running?
End-user: how fast is my application running?
End-user's manager: how efficient is my application?
Developer's manager: how much time does it take to program it?
Developer: how close can I get to the peak performance?

SW performance metrics
Execution time (user)
Speed-up vs. the best available sequential application (user)
Achieved GFLOPS (developer / user's manager): computational efficiency
Achieved GB/s (developer): memory efficiency
Productivity and portability (developer's manager): production costs, maintenance costs
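As an illustration of how a developer turns a measured run into these metrics, a minimal sketch; the kernel numbers are hypothetical placeholders, and the peak values reuse the GF100 figures from the summary table:

```c
/* Minimal sketch: deriving SW performance metrics from a measured run.
 * The operation counts and time below are hypothetical placeholders. */
#include <stdio.h>

int main(void)
{
    double flops_executed = 2.0e12;   /* total floating-point operations */
    double bytes_moved    = 0.5e12;   /* total off-chip traffic in bytes */
    double time_s         = 4.0;      /* measured execution time         */

    double peak_gflops = 1484.0;      /* GF100 peak (summary table)      */
    double peak_gbs    = 201.6;       /* GF100 peak bandwidth            */

    double achieved_gflops = flops_executed / time_s / 1e9;
    double achieved_gbs    = bytes_moved    / time_s / 1e9;

    printf("Achieved %.1f GFLOPS (%.1f%% computational efficiency)\n",
           achieved_gflops, 100.0 * achieved_gflops / peak_gflops);
    printf("Achieved %.1f GB/s  (%.1f%% memory efficiency)\n",
           achieved_gbs, 100.0 * achieved_gbs / peak_gbs);
    return 0;
}
```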

For example …
Hundreds of applications reach speed-ups of up to two orders of magnitude! Incredible performance! Or is it?

Developer
Searching for peak performance: which platform to use? What is the maximum I can achieve, and how?
Performance models: Amdahl's Law; arithmetic intensity and the Roofline model.

Amdahl's Law
How can we apply Amdahl's law to multi-core applications? (discussion)
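The slide leaves the multi-core application of the law as a discussion point; as a reminder, a minimal sketch of the classic formulation, speedup(N) = 1 / ((1 - f) + f/N), with an illustrative parallel fraction f = 0.95 that is not taken from the slides:

```c
/* Minimal sketch of Amdahl's Law: speedup(N) = 1 / ((1 - f) + f / N),
 * where f is the parallelizable fraction of the sequential run time.
 * The fraction and core counts below are illustrative only. */
#include <stdio.h>

static double amdahl_speedup(double f, int n_cores)
{
    return 1.0 / ((1.0 - f) + f / n_cores);
}

int main(void)
{
    int cores[] = {2, 8, 64, 512};
    for (int i = 0; i < 4; i++)
        printf("f=0.95, N=%3d cores -> speedup %.1fx\n",
               cores[i], amdahl_speedup(0.95, cores[i]));
    /* Even with 95% parallel code, the speedup saturates near 20x,
     * which is one way to frame the multi-core discussion. */
    return 0;
}
```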

Arithmetic intensity (AI)
AI = #OP / Byte: how many operations are executed per transferred byte?
Determines the boundary between compute-intensive and data-intensive applications.

Applications' AI
Is the application compute-intensive or memory-intensive?
(Figure: typical arithmetic intensities. O(1): SpMV, BLAS 1/2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods.)
Example: AI(RGB-to-gray conversion) = 5/4
Read: 3 B; write: 1 B; compute: 3 MUL + 2 ADD.
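To make the 5/4 count concrete, a minimal per-pixel sketch; the kernel shape follows the slide's accounting (3 bytes read, 1 byte written, 3 MUL + 2 ADD), while the luminance weights themselves are the standard ones and are assumed here, not given in the slides:

```c
/* Minimal RGB-to-gray sketch illustrating AI = 5/4:
 * per pixel: read 3 bytes, write 1 byte, do 3 MUL + 2 ADD.
 * The weights are the usual ITU-R BT.601 ones (assumed, not from the slides). */
#include <stddef.h>

void rgb_to_gray(const unsigned char *rgb, unsigned char *gray, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; i++) {
        float r = rgb[3 * i + 0];   /* 3 bytes read ... */
        float g = rgb[3 * i + 1];
        float b = rgb[3 * i + 2];
        /* 3 MUL + 2 ADD = 5 floating-point operations per pixel */
        gray[i] = (unsigned char)(0.299f * r + 0.587f * g + 0.114f * b); /* ... 1 byte written */
    }
}
```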

Platform AI
Is the application compute-intensive or memory-intensive on a given platform? (Figure: the RGB-to-gray example.)

The Roofline model [1]
Achievable_peak = min { Peak_GFLOPS, AI * StreamBW }
Peak_GFLOPS = platform peak compute
StreamBW = streaming (memory) bandwidth
AI = application arithmetic intensity
Theoretical peak values can be replaced by "real" (measured) values, obtained without various optimizations.
Assumptions:
Bandwidth is independent of arithmetic intensity
Complete overlap of communication and computation
Computation is independent of optimization
Bandwidth is independent of optimization and access pattern
Reference: D. Lazowska, J. Zahorjan, G. Graham, K. Sevcik, "Quantitative System Performance"
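A minimal sketch of evaluating the min{} expression; the peak and bandwidth values reuse the GF100 numbers from the summary table, and the arithmetic intensity is the RGB-to-gray example (5/4):

```c
/* Minimal Roofline sketch: attainable = min(peak_gflops, ai * stream_bw). */
#include <stdio.h>

static double roofline(double peak_gflops, double stream_bw_gbs, double ai)
{
    double memory_bound = ai * stream_bw_gbs;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void)
{
    double peak = 1484.0;    /* GF100 peak GFLOPS (summary table)            */
    double bw   = 201.6;     /* GF100 peak memory bandwidth in GB/s          */
    double ai   = 5.0 / 4.0; /* RGB-to-gray arithmetic intensity (FLOP/Byte) */

    printf("Attainable: %.1f GFLOPS\n", roofline(peak, bw, ai));
    /* 1.25 * 201.6 = 252 GFLOPS << 1484 GFLOPS: this kernel is memory-bound. */
    return 0;
}
```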

The Roofline model [2]
(Figure: attainable GFLOPS vs. flop:DRAM byte ratio, log-log axes.)
Black: theoretical peak
Yellow: no streaming optimizations
Green: no in-core optimizations
Red: "worst case" performance zone
Dashed: the application

Use the Roofline model
To determine what to do first to gain performance: increase arithmetic intensity, increase the streaming rate, or apply in-core optimizations … and these are topics for your next lecture.
Samuel Williams et al., "Roofline: an insightful visual performance model for multicore architectures"

Take-home message
Performance evaluation depends on the goal:
Execution time (users)
GFLOPS and GB/s (developers)
Efficiency (budget holders)
Stop tweaking when you reach your performance goal, or when you are constrained by the capabilities of the (application, platform) pair, e.g., as predicted by the Roofline model.
Choose the platform to fit the application: parallelism layers, arithmetic intensity, streaming capabilities.

Questions?
Ana Lucia Varbanescu, analucia@cs.vu.nl