Chapter 1: Perspectives
Yan Solihin

Copyright notice: No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means (electronic, mechanical, photocopying, recording, or otherwise) without the prior written permission of the author. An exception is granted for academic lectures at universities and colleges, provided that the following text is included in such copy: "Source: Yan Solihin, Fundamentals of Parallel Computer Architecture, 2008".

Evolution in Microprocessors

Key Points
- More and more components can be integrated on a single chip
- Speed of integration tracks Moore's law: doubling every 18-24 months
- Performance tracked the speed of integration up until recently
- At the architecture level, there are two techniques:
  - Instruction Level Parallelism
  - Cache memory
- Performance gains from uniprocessor systems were so significant that multiprocessor systems were not profitable

Illustration
- 100-processor system with perfect speedup
- Compared to a single-processor system:
  - Year 1: 100x faster
  - Year 2: 62.5x faster
  - Year 3: 39x faster
  - ...
  - Year 10: 0.9x faster
- Single-processor performance catches up in just a few years!
- Even worse:
  - It takes longer to develop a multiprocessor system
  - Low volume means prices must be very high
  - High prices delay adoption
  - Perfect speedup is unattainable
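Working out the arithmetic behind these figures (an inference; the assumed growth rate is not stated on the slide): they are consistent with single-processor performance improving by roughly 60% per year, so the 100-processor machine's relative advantage shrinks by a factor of about 1.6 with each year of uniprocessor growth:

    S(n) = \frac{100}{1.6^{\,n}}, \qquad S(1) \approx 62.5, \quad S(2) \approx 39, \quad \dots, \quad S(10) \approx 0.9

After about ten such years the advantage is gone entirely, which is the point of the illustration.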

Why did uniprocessor performance grow so fast?
- ~half from circuit improvement (smaller transistors, faster clock, etc.)
- ~half from architecture/organization:
  - Instruction Level Parallelism (ILP)
    - Pipelining: RISC, and CISC with a RISC back end
    - Superscalar execution
    - Out-of-order execution
  - Memory hierarchy (caches)
    - Exploiting spatial and temporal locality
    - Multiple cache levels

But uniprocessor performance growth is stalling
- Source of uniprocessor performance growth: instruction-level parallelism (ILP)
  - Parallel execution of independent instructions from a single thread
- ILP growth has slowed abruptly
  - Memory wall: processor speed grows at 55%/year, memory speed grows at 7%/year
  - ILP wall: achieving higher ILP requires quadratically increasing complexity (and power)
- Power efficiency
- Thermal packaging limit vs. cost

Types of parallelism
- Instruction level (ECE 521)
  - Pipelining
[Pipeline diagram: instructions A (a load), B, and C each pass through the IF, ID, EX, MEM, and WB stages, overlapped one stage apart]

- Superscalar / VLIW
- Original:
    LD   F0, 34(R2)
    ADDD F4, F0, F2
    LD   F7, 45(R3)
    ADDD F8, F7, F6
- Schedule as:
    LD   F0, 34(R2)  |  LD   F7, 45(R3)
    ADDD F4, F0, F2  |  ADDD F8, F7, F6
+ Moderate degree of parallelism (sometimes 50)
- Requires fast communication (register level)

Why ILP is slowing
- Branch prediction accuracy is already > 90%
  - Hard to improve it even more
- Number of pipeline stages is already deep (~20-30 stages)
  - But critical dependence loops do not change
  - Memory latency requires more clock cycles to satisfy
- Processor width is already high
  - Quadratically increasing complexity to increase the width
- Cache size
  - Effective, but also shows diminishing returns
  - In general, the size must be doubled to reduce the miss rate by half

Current Trend: Multicore and Manycore

    Aspect       Intel Clovertown    AMD Barcelona                          IBM Cell
    # cores      4                   4                                      8+1
    Clock freq   2.66 GHz            2.3 GHz                                3.2 GHz
    Core type    OOO superscalar     OOO superscalar                        2-issue SIMD
    Caches       2x4MB L2            512KB L2 (private), 2MB L3 (shared)    256KB local store
    Chip power   120 Watts           95 Watts                               100 Watts

Historical Perspectives
- 80s - early 90s: prime time for parallel architecture research
  - A processor could not fit on a single chip, so multiple chips (and hence multiple processors) were natural
  - J-machine, M-machine, Alewife, Tera, HEP, etc.
- 90s: at the low end, uniprocessor systems' speed grew much faster than parallel systems' speed
  - A microprocessor fits on a chip. So do the branch predictor, multiple functional units, large caches, etc.!
  - Microprocessors also exploit parallelism (pipelining, multiple issue, VLIW), forms of parallelism originally invented for multiprocessors
  - Many parallel computer vendors went bankrupt
  - Prestigious but small high-performance computing market

"If the automobile industry advanced as rapidly as the semiconductor industry, a Rolls Royce would get ½ million miles per gallon and it would be cheaper to throw it away than to park it."
-- Gordon Moore, Intel Corporation

- 90s: emergence of distributed (vs. parallel) machines
  - Progress in network technologies:
    - Network bandwidth grows faster than Moore's law
    - Fast interconnection networks are getting cheap
  - Connects cheap uniprocessor systems into a large distributed machine
  - Network of Workstations, Clusters, Grid
- 00s: parallel architectures are back
  - Transistors per chip >> transistors needed by a single microprocessor
  - Harder to get more performance from a uniprocessor
  - SMT (Simultaneous Multithreading), CMP (Chip Multiprocessor), ultimately massive CMP
  - E.g., Intel Pentium D, Core Duo, AMD Dual Core, IBM Power5, Sun Niagara, etc.

What is a Parallel Architecture?
"A parallel computer is a collection of processing elements that can communicate and cooperate to solve a large problem fast."
-- Almasi & Gottlieb

Parallel computers
- A parallel computer is a collection of processing elements that can communicate and cooperate to solve a large problem fast. [Almasi & Gottlieb]
- "collection of processing elements"
  - How many? How powerful is each? Scalability?
  - A few very powerful ones (e.g., Altix) vs. many small ones (BlueGene)
- "that can communicate"
  - How do PEs communicate? (shared memory vs. message passing)
  - Interconnection network (bus, multistage, crossbar, ...)
  - Evaluation criteria: cost, latency, throughput, scalability, and fault tolerance

- "and cooperate"
  - Issues: granularity, synchronization, and autonomy
  - Synchronization allows sequencing of operations to ensure correctness
  - Granularity up => parallelism down, communication down, overhead down
    - Statement/instruction level: 2-10 instructions (ECE 521)
    - Loop level: 10-1K instructions
    - Task level: 1K-1M instructions
    - Program level: > 1M instructions
  - Autonomy
    - SIMD (single instruction stream) vs. MIMD (multiple instruction streams)

- "solve a large problem fast"
  - General vs. special-purpose machine?
    - Any machine can solve certain problems well
  - What domains?
    - Highly (embarrassingly) parallel apps: many scientific codes
    - Moderately parallel apps: many engineering apps (finite elements, VLSI CAD)
    - Non-parallel apps: compilers, editors (do we care?)

Why parallel computers?
- Absolute performance: can we afford to wait?
  - Folding a single protein takes years to simulate on the most advanced microprocessor; it takes only days on a parallel computer
  - Weather forecasting: timeliness is crucial
- Cost/performance
  - Harder to improve performance on a single processor
  - One big monolithic processor vs. many simple processors
- Power/performance
- Reliability and availability
- Key enabling technologies:
  - Advances in microprocessor and interconnect technology
  - Advances in software technology

Scope of CSC/ECE 506
- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - Shared memory machines (SMP and DSM)
    - Clusters
- Programming model:
  - Shared memory
  - Message passing
  - Hybrid

Loop level parallelism
- Each iteration can be computed independently:
    for (i=0; i<8; i++)
      a[i] = b[i] + c[i];
- Each iteration cannot be computed independently, thus there is no loop-level parallelism:
    for (i=0; i<8; i++)
      a[i] = b[i] + a[i-1];
+ Very high parallelism (> 1K)
+ Often easy to achieve load balance
- Some loops are not parallel
- Some apps do not have many loops
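As an illustrative sketch (not part of the original slides), the independent loop above is exactly the kind a shared-memory compiler directive can parallelize; this assumes an OpenMP-capable compiler (e.g., gcc -fopenmp):

    #include <stdio.h>

    #define N 8

    int main(void) {
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

        /* Each iteration is independent, so OpenMP may split the
           iterations across threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        for (int i = 0; i < N; i++)
            printf("a[%d] = %g\n", i, a[i]);
        return 0;
    }

The second loop on the slide carries a dependence through a[i-1], so this directive would not be valid for it.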

Task level parallelism
- Arbitrary code segments in a single program
- Across loops:
    ...
    for (i=0; i<n; i++)
      sum = sum + a[i];
    for (i=0; i<n; i++)
      prod = prod * a[i];
    ...
- Subroutines:
    Cost = getCost();
    A = computeSum();
    B = A + Cost;
- Threads: e.g., an editor: GUI, printing, parsing
+ Larger granularity => lower overheads and less communication
- Low degree of parallelism
- Hard to balance
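A minimal sketch (not from the slides) of running the two loops above as independent coarse-grained tasks using OpenMP sections; the array contents are arbitrary:

    #include <stdio.h>

    #define N 1000

    int main(void) {
        double a[N], sum = 0.0, prod = 1.0;
        for (int i = 0; i < N; i++) a[i] = (i + 2.0) / (i + 1.0);

        /* The sum and product loops update different results, so they
           can execute concurrently as two tasks. */
        #pragma omp parallel sections
        {
            #pragma omp section
            for (int i = 0; i < N; i++) sum = sum + a[i];

            #pragma omp section
            for (int i = 0; i < N; i++) prod = prod * a[i];
        }

        printf("sum = %f, prod = %f\n", sum, prod);
        return 0;
    }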

Program level parallelism
- Various independent programs execute together
- gmake:
    gcc -c code1.c    // assign to proc1
    gcc -c code2.c    // assign to proc2
    gcc -c main.c     // assign to proc3
    gcc main.o code1.o code2.o
+ No communication
- Hard to balance
- Few opportunities

Scope of CSC/ECE 506
- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - *Shared memory machines (SMP and DSM)
    - Clusters
- Programming model:
  - Shared memory
  - Message passing
  - Hybrid

Taxonomy of Parallel Computers
The Flynn taxonomy:
- Single or multiple instruction streams
- Single or multiple data streams
1. SISD machine (most desktops, laptops)
  - Only one instruction fetch stream
  - Most of today's workstations or desktops

SIMD
- Examples: vector processors, SIMD extensions (MMX)
- A single instruction operates on multiple data items
- Pseudo-SIMD is popular for multimedia extensions

SISD:
    for (i=0; i<8; i++)
      a[i] = b[i] + c[i];

SIMD:
    a = b + c;  // vector addition
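As a hedged, concrete example of a multimedia SIMD extension (not shown on the slide, which mentions only MMX), the same eight-element addition can be written with x86 SSE intrinsics, four floats per instruction:

    #include <stdio.h>
    #include <xmmintrin.h>   /* x86 SSE intrinsics */

    int main(void) {
        float a[8], b[8], c[8];
        for (int i = 0; i < 8; i++) { b[i] = i; c[i] = 10.0f * i; }

        /* Each _mm_add_ps adds four floats in one SIMD instruction,
           so the eight-element loop needs only two additions. */
        for (int i = 0; i < 8; i += 4) {
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_loadu_ps(&c[i]);
            _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc));
        }

        for (int i = 0; i < 8; i++)
            printf("a[%d] = %g\n", i, a[i]);
        return 0;
    }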

MISD machine
- Example: CMU Warp
- Systolic arrays

Systolic Arrays (contd.)
- Practical realizations (e.g., iWARP) use quite general processors
  - Enable a variety of algorithms on the same hardware
- But dedicated interconnect channels
  - Data transferred directly from register to register across a channel
- Specialized, and same problems as SIMD
  - General-purpose systems work well for the same algorithms (locality, etc.)
- Example: systolic array for 1-D convolution

MIMD machine
- Independent processors connected together to form a multiprocessor system
- Physical organization: determines which memory hierarchy level is shared
- Programming abstraction:
  - Shared memory:
    - On a chip: Chip Multiprocessor (CMP)
    - Interconnected by a bus: Symmetric Multiprocessor (SMP)
    - Point-to-point interconnection: Distributed Shared Memory (DSM)
  - Distributed memory:
    - Clusters, Grid

MIMD Physical Organization
[Diagram: (left) several processors (P) sharing caches and memory (M) on one chip; (right) processors with private caches connected to a shared memory (M) through an interconnection network]

Shared Cache Architecture:
- CMP (or Simultaneous Multi-Threading)
- e.g., Pentium 4 chip, IBM Power4 chip, Sun Niagara, Pentium D, etc.
- Implies shared memory hardware

UMA (Uniform Memory Access) Shared Memory:
- Pentium Pro Quad, Sun Enterprise, etc.
- What interconnection network?
  - Bus
  - Multistage
  - Crossbar
  - etc.
- Implies shared memory hardware

MIMD Physical Organization (2)
[Diagram: nodes, each with a processor (P), caches, and local memory (M), connected by an interconnection network]

NUMA (Non-Uniform Memory Access) Shared Memory:
- SGI Origin, Altix, IBM p690, AMD Hammer-based systems
- What interconnection network?
  - Crossbar
  - Mesh
  - Hypercube
  - etc.
- Also referred to as Distributed Shared Memory

MIMD Physical Organization (3)
[Diagram: complete computers, each with a processor (P), caches, memory (M), and I/O, connected by a network]

Distributed System/Memory:
- Also called clusters, grid
- Don't confuse it with distributed shared memory

Parallel vs. Distributed Computers
[Two plots: cost vs. system size and performance vs. system size, each comparing parallel and distributed computers]
- Small-scale machines: parallel system cheaper
- Large-scale machines: distributed system cheaper
- Performance: parallel system better (but more expensive)
- System size: parallel system limited, and cost grows fast
- However, must also consider software cost

Scope of CSC/ECE 506
- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - MIMD
    - Shared memory machines (SMP and DSM)
- Programming model:
  - Shared memory
  - Message passing
  - Hybrid (e.g., UPC)
  - Data parallel

Programming Models
- Shared Memory / Shared Address Space:
  - Each processor can see the entire memory
  - Programming model = thread programming, as in uniprocessor systems
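A minimal sketch of the shared-memory model (illustrative, not from the slides): every thread reads and writes the same arrays directly, just as in uniprocessor thread programming. The two-thread split below is an arbitrary choice.

    #include <stdio.h>
    #include <pthread.h>

    #define N 8
    #define NTHREADS 2

    double a[N], b[N], c[N];   /* shared: visible to every thread */

    /* Each thread adds its part of the arrays; no explicit messages,
       communication happens through the shared memory itself. */
    void *add_part(void *arg) {
        long t = (long)arg;
        for (int i = t * (N / NTHREADS); i < (t + 1) * (N / NTHREADS); i++)
            a[i] = b[i] + c[i];
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 10.0 * i; }

        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, add_part, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);

        for (int i = 0; i < N; i++)
            printf("a[%d] = %g\n", i, a[i]);
        return 0;
    }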

- Distributed Memory / Message Passing / Multiple Address Space:
  - A processor can only directly access its own local memory
  - All communication happens through explicit messages
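A minimal message-passing sketch (illustrative, not from the slides), assuming an MPI implementation and a launch with at least two processes (e.g., mpirun -np 2 ./a.out): the value lives in rank 0's private memory until it is explicitly sent.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank;
        double local = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Rank 0's memory is private; the value must be sent explicitly. */
            local = 3.14;
            MPI_Send(&local, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %f from rank 0\n", local);
        }

        MPI_Finalize();
        return 0;
    }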

Shared Memory compared to Message Passing
+ Can easily be automated (parallelizing compiler, OpenMP)
+ Shared variables are not communicated, but must be guarded
- How to provide shared memory? Complex hardware
- Synchronization overhead grows fast with more processors
+/- Difficult to debug, not intuitive for users
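To make the "shared variables must be guarded" point concrete, here is a small sketch (not from the slides) of guarding a shared counter with a mutex:

    #include <stdio.h>
    #include <pthread.h>

    long counter = 0;                                  /* shared variable */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* guards counter  */

    /* Each thread adds to the shared counter; the mutex serializes the
       read-modify-write so no update is lost. */
    void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 200000)\n", counter);
        return 0;
    }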

Data Parallel Programming Paradigm & Systems
- Programming model
  - Operations performed in parallel on each element of a data structure
  - Logically a single thread of control, performing sequential or parallel steps
  - Conceptually, a processor is associated with each data element
- Architectural model
  - Array of many simple, cheap processors, each with little memory
    - Processors don't sequence through instructions
  - Attached to a control processor that issues instructions
  - Specialized and general communication, cheap global synchronization
- Original motivations
  - Matches simple differential equation solvers
  - Centralizes the high cost of instruction fetch/sequencing

Application of Data Parallelism
- Each PE contains an employee record with his/her salary
    If salary > 100K then
      salary = salary * 1.05
    else
      salary = salary * 1.10
  - Logically, the whole operation is a single step
  - Some processors are enabled for the arithmetic operation, others disabled
- Other examples:
  - Finite differences, linear algebra, ...
  - Document searching, graphics, image processing, ...
- Some recent machines:
  - Thinking Machines CM-1, CM-2 (and CM-5)
  - Maspar MP-1 and MP-2
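A sequential C sketch of the same salary update (illustrative only): on a data-parallel machine the whole loop is one logical step, with PEs whose element fails the test simply masked off.

    #include <stdio.h>

    #define NUM_EMPLOYEES 4

    int main(void) {
        double salary[NUM_EMPLOYEES] = { 80000, 120000, 95000, 150000 };

        /* On a data-parallel machine this loop is one logical step:
           PEs whose element fails the test are disabled (masked) while
           the others apply the multiplication. */
        for (int i = 0; i < NUM_EMPLOYEES; i++) {
            if (salary[i] > 100000)
                salary[i] *= 1.05;
            else
                salary[i] *= 1.10;
        }

        for (int i = 0; i < NUM_EMPLOYEES; i++)
            printf("salary[%d] = %.2f\n", i, salary[i]);
        return 0;
    }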

Common Today
- Systolic arrays: idea adopted in graphics and network processors
- Dataflow: idea adopted in superscalar processors
- Shared memory: most small-scale servers (up to 128 processors)
  - Now in workstations/desktops/laptops, too
- Message passing: most large-scale systems
  - Clusters, grid (hundreds to thousands of processors)
- Data parallel/SIMD:
  - Small scale: SIMD multimedia extensions (MMX, VIS)
  - Large scale: vector processors

Top 500 Supercomputers
- Let's look at the Earth Simulator
  - Was #1 in 2004, now #10 in 2006
- Hardware:
  - 5,120 (640 8-way nodes) 500 MHz NEC CPUs
  - 8 GFLOPS per CPU (41 TFLOPS total)
    - 30+ TFLOPS sustained performance!
  - 2 GB (4 x 512 MB FPLRAM modules) per CPU (10 TB total)
  - Shared memory inside a node
  - 10 TB total memory
  - 640 x 640 crossbar switch between the nodes
  - 16 GB/s inter-node bandwidth
  - 20 kVA power consumption per node

- Programming model
  - Within a CPU: data parallel, using automatic vectorization (instruction level)
  - Within a node (8 CPUs): shared memory using OpenMP (loop level)
  - Across nodes: message passing using MPI-2 or HPF (algorithm level)
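A hedged sketch of such a hybrid hierarchy (illustrative; it does not model the Earth Simulator's vectorizing compilers): MPI ranks stand in for nodes and OpenMP threads for the CPUs within a node.

    #include <stdio.h>
    #include <mpi.h>

    #define N 1024

    int main(int argc, char **argv) {
        int rank, nprocs;
        double local_sum = 0.0, global_sum = 0.0;
        double a[N];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        for (int i = 0; i < N; i++) a[i] = rank + 1.0;

        /* Loop level: OpenMP threads share the node's memory. */
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < N; i++)
            local_sum += a[i];

        /* Across nodes: explicit message passing (here, a reduction). */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum over %d ranks = %f\n", nprocs, global_sum);

        MPI_Finalize();
        return 0;
    }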

"The machine room sits at approximately 4th floor level. The 3rd floor level is taken by hundreds of kilometers of copper cabling, and the lower floors house the air conditioning and electrical equipment. The structure is enclosed in a cooling shell, with the air pumped from underneath through the cabinets, and collected to the two long sides of the building. The aeroshell gives the building its 'pumped-up' appearance. The machine room is electromagnetically shielded to prevent interference from the nearby expressway and rail. Even the halogen light sources are outside the shield, and the light is distributed by a grid of scattering pipes under the ceiling. The entire structure is mechanically isolated from the surroundings, suspended in order to make it less prone to earthquake damage. All attachments (power, cooling, access walkways) are flexible."


- Linpack performance: 40 TFlops, 80% of peak
- Real-world performance: 33-66% of peak (vs. less than 15% for clusters)
- Cost? Hint: it starts with a 4
- Maintenance: $15M per year
- Failures: about one processor per week
- A distributed-memory parallel computing system in which 640 processor nodes are interconnected by a single-stage crossbar network

Fastest (#1 as of Aug 2006)
- BlueGene/L: 65,536 processors
- Each processor: PowerPC 440, 700 MHz (2.8 GFlops)
- Rpeak: 183 TFLOPS
- Rmax: 136 TFLOPS

Limitations of very large machines
- Niche market
- Power wall
  - By using low-power processors, BlueGene can scale to a very large processor count
  - Many practical issues: electricity, cooling, etc.
- Programming wall
  - Extremely hard to extract performance out of a very large machine