Introduction.


Introduction

What is Parallel Architecture?
Why Parallel Architecture?
Evolution and Convergence of Parallel Architectures
Fundamental Design Issues

What is Parallel Architecture?
A parallel computer is a collection of processing elements that cooperate to solve large problems fast.
Some broad issues:
- Resource allocation: how large a collection? how powerful are the elements? how much memory?
- Data access, communication and synchronization: how do the elements cooperate and communicate? how are data transmitted between processors? what are the abstractions and primitives for cooperation?
- Performance and scalability: how does it all translate into performance? how does it scale?

Why Study Parallel Architecture?
Role of a computer architect: to design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost.
Parallelism:
- Provides an alternative to a faster clock for performance
- Applies at all levels of system design
- Is a fascinating perspective from which to view architecture
- Is increasingly central in information processing

Why Study It Today?
History: diverse and innovative organizational structures, often tied to novel programming models.
Rapidly maturing under strong technological constraints:
- The "killer micro" is ubiquitous; laptops and supercomputers are fundamentally similar!
- Technological trends cause diverse approaches to converge
- Technological trends make parallel computing inevitable; it is now in the mainstream
Need to understand fundamental principles and design tradeoffs, not just taxonomies: naming, ordering, replication, communication performance.

Inevitability of Parallel Computing
Application demands: our insatiable need for computing cycles
- Scientific computing: CFD (computational fluid dynamics), biology, chemistry, physics, ...
- General-purpose computing: video, graphics, CAD, databases, transaction processing, ...
Technology trends
- Number of transistors on chip growing rapidly
- Clock rates expected to go up only slowly
Architecture trends
- Instruction-level parallelism valuable but limited
- Coarser-level parallelism, as in multiprocessors, is the most viable approach
Economics and current trends
- Today's microprocessors have multiprocessor support
- Servers and workstations becoming multiprocessors: Sun, SGI, DEC, COMPAQ, ...
- Tomorrow's microprocessors are multiprocessor/multicore systems

Application Trends
Demand for cycles fuels advances in hardware, and vice versa: this cycle drives the exponential increase in microprocessor performance, and drives parallel architecture even harder, since parallel machines serve the most demanding applications.
Range of performance demands: need a range of system performance with progressively increasing cost (the platform pyramid).
Goal of applications in using parallel machines: speedup.
    Speedup(p processors) = Performance(p processors) / Performance(1 processor)
For a fixed problem size (input data set), performance = 1/time:
    Speedup_fixed-problem(p processors) = Time(1 processor) / Time(p processors)
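
A minimal sketch (in C, not from the slides) of how fixed-problem-size speedup and parallel efficiency fall out of measured wall-clock times; the timings and the 8-processor count are made-up placeholders:

    #include <stdio.h>

    /* Fixed-problem-size speedup: same input data set, varying processor count.
       The timings are hypothetical placeholders, not measured results. */
    int main(void) {
        double time_1p = 120.0;                /* seconds on 1 processor       */
        double time_8p = 18.0;                 /* seconds on 8 processors      */
        double speedup = time_1p / time_8p;    /* Speedup(p) = Time(1)/Time(p) */
        double efficiency = speedup / 8.0;     /* fraction of ideal speedup    */
        printf("speedup on 8 processors: %.2f\n", speedup);
        printf("parallel efficiency:     %.2f\n", efficiency);
        return 0;
    }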

Computer Technology
Performance improvements come from:
- Improvements in semiconductor technology: feature size, clock speed
- Improvements in computer architectures: enabled by HLL compilers and UNIX, leading to RISC architectures
Together these have enabled:
- Lightweight computers
- Productivity-based managed/interpreted programming languages

Scientific Computing Demand

Engineering Computing Demand
Large parallel machines are a mainstay in many industries:
- Petroleum (reservoir analysis)
- Automotive (crash simulation, drag analysis, combustion efficiency)
- Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
- Computer-aided design
- Pharmaceuticals (molecular modeling)
- Visualization in all of the above, entertainment (films like Toy Story), architecture (walk-throughs and rendering)
- Financial modeling (yield and derivative analysis)
- etc.

Applications: Speech and Image Processing
Also CAD, databases, ...
100 processors gets you 10 years ahead; 1000 gets you 20!


Learning Curve for Parallel Applications
AMBER molecular dynamics simulation program
- Starting point was vector code for the Cray-1
- 145 MFLOPS on the Cray90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D

Commercial Computing
Also relies on parallelism for the high end
- Scale not so large, but use much more widespread
- Computational power determines the scale of business that can be handled
Databases, online transaction processing, decision support, data mining, data warehousing, ...
TPC benchmarks (TPC-C order entry, TPC-D decision support)
- TPC-C simulates a complete computing environment where a population of users executes transactions against a database
- Explicit scaling criteria provided: the size of the enterprise scales with the size of the system
- Problem size is no longer fixed as p increases, so throughput is used as the performance measure (transactions per minute, or tpm)

TPC-C Results for March 1996
- Parallelism is pervasive
- Small- to moderate-scale parallelism is very important
- Difficult to obtain a snapshot to compare across vendor platforms

TPC-C Results for December 2010
Oracle uses 9x the CPUs to achieve only 3x the performance on the TPC-C benchmark:
- Oracle's benchmark result uses 27 64-core Sun SPARC T3-4 servers to process more than 30M tpmC*.
- In contrast, IBM's most recent clustered TPC-C result uses 3 64-core IBM Power 780 servers to process more than 10M tpmC**.
Results on the Transaction Processing Performance Council web site at http://www.tpc.org, as of 12/02/10.
* Oracle SPARC SuperCluster with T3-4 servers (27 x 64-core: 108 chips, 1728 cores, 13824 threads); 30,249,688 tpmC; $1.01/tpmC; available 6/1/11.
** IBM Power 780 cluster (3 x 64-core: 24 chips, 192 cores, 768 threads); 10,366,254 tpmC; $1.38/tpmC; available 10/13/10.

TPC-C Results for June 2011
Results on the Transaction Processing Performance Council web site at http://www.tpc.org, as of 06/08/11.
- Oracle SPARC SuperCluster (108 chips, 1728 cores, 13824 threads); 30,249,688 tpmC; $1.01/tpmC; available 6/1/11.
- IBM Power 780 cluster (24 chips, 192 cores, 768 threads); 10,366,254 tpmC; $1.38/tpmC; available 10/13/10.
- HP Integrity Superdome (64 chips, 128 cores, 256 threads); 4,092,799 tpmC; $2.93/tpmC; available 08/06/07.

Summary of Application Trends
- The transition to parallel computing has occurred for scientific and engineering computing
- Rapid progress in commercial computing: databases and transactions as well as financial; usually smaller scale, but large-scale systems are also used
- Desktops also run multithreaded programs, which are a lot like parallel programs
- Demand for improving throughput on sequential workloads: the greatest use of small-scale multiprocessors
- Solid application demand exists and will increase

Technology Trends
The natural building block for multiprocessors (the microprocessor) is now also about the fastest!

Comparing Systems
Example: Opteron X2 vs. Opteron X4
- 2-core vs. 4-core, 2x FP performance per core, 2.2 GHz vs. 2.3 GHz
- Same memory system
To get higher performance on the X4 than the X2:
- Need high arithmetic intensity, or
- The working set must fit in the X4's 2 MB L3 cache
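
To make "arithmetic intensity" concrete, here is a rough C sketch for a daxpy-style loop (y[i] = a*x[i] + y[i]); the kernel choice and the flop/byte counts are textbook-style assumptions, not figures from the slides:

    #include <stddef.h>

    /* Arithmetic intensity = floating-point operations per byte of memory traffic.
       For y[i] = a*x[i] + y[i]: 2 flops per element; read x[i], read y[i],
       write y[i] at 8 bytes each, ignoring cache effects. */
    double daxpy_arithmetic_intensity(size_t n) {
        double flops = 2.0 * n;
        double bytes = 3.0 * n * 8.0;
        return flops / bytes;   /* = 1/12 flop per byte, independent of n */
    }

A kernel this memory-bound gains little from the X4's extra cores unless its working set fits in the larger L3 cache, which is exactly the slide's point.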

Optimizing Performance
Optimize FP performance:
- Balance adds and multiplies
- Improve superscalar ILP and the use of SIMD instructions
Optimize memory usage:
- Software prefetch: avoid load stalls
- Memory affinity: avoid non-local data accesses
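
A minimal sketch of the software-prefetch idea, assuming GCC or Clang and their __builtin_prefetch builtin; the loop and the prefetch distance of 16 elements are illustrative values, not tuned recommendations:

    #include <stddef.h>

    /* Request data a few iterations ahead so the loads are less likely
       to stall the pipeline when the loop reaches them. */
    double sum_with_prefetch(const double *a, size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0 /* read */, 1 /* low locality */);
            sum += a[i];
        }
        return sum;
    }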

Optimizing Performance
The choice of optimization depends on the arithmetic intensity of the code:
- Arithmetic intensity is not always fixed; it may scale with problem size
- Caching reduces memory accesses and so increases arithmetic intensity

General Technology Trends
- Microprocessor performance increases 50%-100% per year
- Transistor count doubles every 3 years
- DRAM size quadruples every 3 years
- Huge investment per generation is carried by a huge commodity market
- Not that single-processor performance is plateauing, but parallelism is a natural way to improve it
[Chart: integer and FP performance, 1987-1992, for DEC Alpha, IBM RS6000/540, HP 9000/750, MIPS M2000, MIPS M/120, and Sun 4/260]

Dimension Increase in Metal-Oxide-Semiconductor Memories and Transistors

Single Processor Performance
[Chart: growth in single-processor performance; annotations: RISC, move to multiprocessor]

Moore's Law: by 2015, the "mouse brain" level has been reached.

Technology: A Closer Look
The basic advance is decreasing feature size (λ):
- Circuits become either faster or lower in power
- Die size is growing too
- Clock rate improves roughly in proportion to the improvement in λ
- Number of transistors improves like λ² (or faster)
- Performance > 100x per decade; clock rate contributes about 10x, the rest comes from transistor count
How to use more transistors?
- Parallelism in processing: multiple operations per cycle reduces CPI
- Locality in data access: avoids latency and reduces CPI; also improves processor utilization
- Both need resources, so there is a tradeoff
The fundamental issue is resource distribution, as in uniprocessors (processor, cache, interconnect).
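
The "reduces CPI" remarks refer to the standard processor-performance equation, which the slide leaves implicit; in LaTeX notation:

    \text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}
                    = \frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}}

More operations per cycle lower CPI, better locality lowers the memory-stall component of CPI, and a faster clock shortens the cycle time.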

Clock Frequency Growth Rate
About 30% per year

Trends in Power and Energy
- The Intel 80386 consumed ~2 W
- A 3.3 GHz Intel Core i7 consumes 130 W
- Heat must be dissipated from a 1.5 cm x 1.5 cm chip
- This is about the limit of what can be cooled by air
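
The slide's numbers follow from the usual CMOS dynamic-power relation (a standard result from the quantitative-design literature, not stated on the slide); in LaTeX notation:

    P_{\text{dynamic}} \approx \tfrac{1}{2} \, C_{\text{load}} \times V^{2} \times f_{\text{switched}}

Because power grows linearly with switching frequency and quadratically with supply voltage, pushing the clock higher without lowering the voltage quickly runs into the air-cooling limit mentioned above.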

Transistor Count Growth Rate
- 100 million transistors on chip by the early 2000s
- Transistor count grows much faster than clock rate: about 40% per year, an order of magnitude more contribution over two decades


i7-960 vs. NVIDIA Tesla 280/480


Similar Story for Storage
- Divergence between memory capacity and speed is even more pronounced: capacity increased by 1000x from 1980-95, speed by only 2x
- Gigabit DRAM by c. 2000, but the gap with processor speed grew much greater
- Larger memories are slower, while processors get faster
- Need to transfer more data in parallel; need deeper cache hierarchies
- How to organize caches?
- Parallelism increases the effective size of each level of the hierarchy without increasing access time
- Parallelism and locality within memory systems too: new designs fetch many bits within the memory chip, followed by a fast pipelined transfer across a narrower interface; buffer caches hold the most recently accessed data
- Disks too: parallel disks plus caching
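
On the software side, the same locality argument is usually illustrated with cache blocking (tiling); the sketch below is a generic C example under assumed sizes, not something taken from the slides:

    #include <stddef.h>

    /* Visit the matrix in B x B tiles so that both the source rows and the
       destination columns stay resident in the cache. N and B are hypothetical;
       B would be tuned to the cache size in practice (here N % B == 0). */
    #define N 1024
    #define B 32

    void transpose_blocked(double dst[N][N], const double src[N][N]) {
        for (size_t ii = 0; ii < N; ii += B)
            for (size_t jj = 0; jj < N; jj += B)
                for (size_t i = ii; i < ii + B; i++)
                    for (size_t j = jj; j < jj + B; j++)
                        dst[j][i] = src[i][j];
    }

Each tile is reused while it is still cached, so far fewer lines have to travel across the narrow memory interface.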

The new generation of high-performance, power-efficient memory that delivers greater reliability

Architectural Trends
Architecture translates technology's gifts into performance and capability.
- Resolves the tradeoff between parallelism and locality: a microprocessor is roughly 1/3 compute, 1/3 cache, 1/3 off-chip connect; the tradeoffs may change with scale and technology advances
Understanding microprocessor architectural trends:
- Helps build intuition about the design issues of parallel machines
- Shows the fundamental role of parallelism even in "sequential" computers
Four generations of architectural history: tube, transistor, IC, VLSI. Here we focus only on the VLSI generation, where the greatest delineation has been in the type of parallelism exploited.

Architectural Trends
The greatest trend in the VLSI generation is the increase in parallelism.
Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
- Slows after 32-bit; adoption of 64-bit now under way, 128-bit far off (not a performance issue)
- Great inflection point when a 32-bit micro and cache fit on a chip
Mid 80s to mid 90s: instruction-level parallelism
- Pipelining and simple instruction sets, plus compiler advances (RISC)
- On-chip caches and functional units => superscalar execution
- Greater sophistication: out-of-order execution, speculation, prediction to deal with control-transfer and latency problems
Next step: thread-level parallelism (see the sketch below)
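
As a minimal sketch of what thread-level parallelism looks like to the programmer (not from the slides), the following POSIX-threads example splits an array sum across threads; the array size and thread count are arbitrary, and it must be compiled with -pthread:

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread sums its own slice of the array into a private slot,
       so no locking is needed; the main thread combines the partial sums. */
    #define NTHREADS 4
    #define N 1000000

    static double data[N];
    static double partial[NTHREADS];

    static void *worker(void *arg) {
        long id = (long)arg;
        size_t lo = (size_t)id * (N / NTHREADS);
        size_t hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
        double s = 0.0;
        for (size_t i = lo; i < hi; i++)
            s += data[i];
        partial[id] = s;
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (size_t i = 0; i < N; i++) data[i] = 1.0;
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        double total = 0.0;
        for (int t = 0; t < NTHREADS; t++) total += partial[t];
        printf("sum = %.0f\n", total);
        return 0;
    }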

Phases in VLSI Generation
How good is instruction-level parallelism? Is thread-level parallelism needed in microprocessors?

Architectural Trends: ILP
Reported speedups for superscalar processors:
- Horst, Harris, and Jardine [1990]: 1.37
- Wang and Wu [1988]: 1.70
- Smith, Johnson, and Horowitz [1989]: 2.30
- Murakami et al. [1989]: 2.55
- Chang et al. [1991]: 2.90
- Jouppi and Wall [1989]: 3.20
- Lee, Kwok, and Briggs [1991]: 3.50
- Wall [1991]: 5
- Melvin and Patt [1991]: 8
- Butler et al. [1991]: 17+
The large variance is due to differences in the application domain investigated (numerical versus non-numerical) and in the capabilities of the processor modeled.

ILP Ideal Potential
Assumes infinite resources and fetch bandwidth, perfect branch prediction and renaming, but real caches and non-zero miss latencies.

Results of ILP Studies
- Concentrate on parallelism for 4-issue machines
- Realistic studies show only a 2-fold speedup
- Recent studies show that finding more ILP requires looking across threads

Pipeline Stages
Year  Microarchitecture      Pipeline stages                     Max. clock      Pipeline latency
1989  486 (80486)            3                                   100 MHz         30 ns
1993  P5 (Pentium)           5                                   300 MHz         16.7 ns
1995  P6 (Pentium Pro/II)    14 (17 with load & store/retire)    450 MHz         31 ns
1999  Pentium III            12 (15 with load & store/retire)    450~1400 MHz
2000  NetBurst               20                                  800~3000 MHz
2003  Pentium M              10 (12 with fetch/retire)           400~1000 MHz
2004  Prescott               31                                  3800 MHz
2006  Intel Core             12 (14 with fetch/retire)           3000 MHz        4 ns
2008  Nehalem / Bonnell      16 (20 with prediction miss)        2100 MHz
2011  Sandy Bridge           14 (16 with fetch/retire)           3600 MHz
2013  Haswell                                                    ≈4000 MHz       3.5 ns
2015  Skylake

Architectural Trends: Bus-based MPs
- A micro on a chip makes it natural to connect many to shared memory; this dominates the server and enterprise market and is moving down to the desktop
- Faster processors began to saturate the bus, then bus technology advanced
- Today there is a range of sizes for bus-based systems, from desktops to large servers
[Chart: number of processors in fully configured commercial shared-memory systems]

Trends in Technology
[Chart: log-log plot of bandwidth and latency milestones]

Economics
Commodity microprocessors are not only fast but CHEAP
- Development cost is tens of millions of dollars ($5-100 million typical)
- BUT many more are sold compared to supercomputers
- Crucial to take advantage of the investment and use the commodity building block
- Exotic parallel architectures amount to no more than special-purpose machines
Multiprocessors are being pushed by software vendors (e.g. database) as well as hardware vendors
Standardization by Intel makes small, bus-based SMPs a commodity
Desktop: a few smaller processors versus one larger one? Multiprocessor on a chip

Consider Scientific Supercomputing
- Proving ground and driver for innovative architecture and techniques
- Market smaller relative to commercial computing as MPs become mainstream
- Dominated by vector machines starting in the 70s
- Microprocessors have made huge gains in floating-point performance: high clock rates, pipelined floating-point units (e.g., a multiply-add every cycle), instruction-level parallelism, effective use of caches (e.g., automatic blocking)
- Plus economics
- Large-scale multiprocessors replace vector supercomputers; this is well under way already

Concluding Remarks
Goal: higher performance by using multiple processors
Difficulties:
- Developing parallel software
- Devising appropriate architectures
SaaS importance is growing, and clusters are a good match
Performance per dollar and performance per joule drive both mobile and WSC (warehouse-scale computer) designs

Concluding Remarks (cont'd)
SIMD and vector operations match multimedia applications and are easy to program.
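
A minimal sketch of the kind of SIMD operation the slide has in mind, written in C with x86 SSE intrinsics; the kernel and function name are illustrative assumptions, and n is assumed to be a multiple of 4:

    #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_* */

    /* Add two float arrays four elements at a time; unaligned loads and
       stores are used so no alignment requirement is imposed. */
    void add_arrays_simd(float *dst, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&dst[i], _mm_add_ps(va, vb));
        }
    }

One instruction operates on four packed floats at a time, which is why such operations map naturally onto multimedia-style loops.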

Raw Uniprocessor Performance: LINPACK

Raw Parallel Performance: LINPACK
- Even vector Crays became parallel: X-MP (2-4), Y-MP (8), C-90 (16), T94 (32)
- Since 1993, Cray has produced MPPs too (T3D, T3E)

The top 10


Summary: Why Parallel Architecture?
- Increasingly attractive: economics, technology, architecture, application demand
- Increasingly central and mainstream
- Parallelism exploited at many levels: instruction-level parallelism, multiprocessor servers, large-scale multiprocessors ("MPPs")
- Focus of this class: the multiprocessor level of parallelism
- Same story from the memory-system perspective: increase bandwidth, reduce average latency with many local memories
- A wide range of parallel architectures makes sense, with different cost, performance, and scalability