Chapter 1: Perspectives


Copyright © 2005-2008 Yan Solihin
Copyright notice: No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means (electronic, mechanical, photocopying, recording, or otherwise) without the prior written permission of the author. An exception is granted for academic lectures at universities and colleges, provided that the following text is included in such copy: “Source: Yan Solihin, Fundamentals of Parallel Computer Architecture, 2008”.

Outline for Lecture 1
- Introduction
- Types of parallelism
- Architectural trends
- Why parallel computers?
- Scope of CSC/ECE 506

Key Points
- More and more components can be integrated on a single chip.
- Speed of integration tracks Moore's law, doubling every 18–24 months.
  Exercise: Look up how the number of transistors per chip has changed, especially since 2006. Submit here.
- Until recently, performance tracked speed of integration.
- At the architectural level, two techniques facilitated this:
  - Instruction-level parallelism
  - Cache memory
- Performance gain from uniprocessor systems was high enough that multiprocessor systems were not viable for most uses.

Illustration
- A 100-processor system with perfect speedup
- Compared to a single-processor system:
  - Year 1: 100x faster
  - Year 2: 62.5x faster
  - Year 3: 39x faster
  - ...
  - Year 10: 0.9x faster
- Single-processor performance catches up in just a few years!
- Even worse:
  - It takes longer to develop a multiprocessor system.
  - Low volume means prices must be very high.
  - High prices delay adoption.
  - Perfect speedup is unattainable.
(The arithmetic behind these numbers is sketched just below.)
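The numbers above follow from a simple geometric argument. The sketch below is not from the slides: it assumes single-processor performance improves by roughly 60% per year and treats the slide's "Year 1" as the starting point (year 0), which reproduces the 62.5x, 39x, and 0.9x figures.

    /* Sketch: relative speedup of a 100-processor system (perfect speedup,
     * fixed performance) over a single processor that improves ~60%/year.
     * Assumption: "Year N" on the slide = N-1 years of improvement.
     * Compile with: gcc speedup.c -lm */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double annual_growth = 1.6;   /* assumed 60%/year improvement */
        for (int year = 0; year <= 10; year++) {
            double speedup = 100.0 / pow(annual_growth, year);
            printf("After %2d years: %.1fx faster\n", year, speedup);
        }
        return 0;
    }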

Why did uniprocessor performance grow so fast?
- ≈ half from circuit improvement (smaller transistors, faster clock, etc.)
- ≈ half from architecture/organization:
  - Instruction-level parallelism (ILP)
    - Pipelining: RISC, CISC with RISC back-end
    - Superscalar
    - Out-of-order execution
  - Memory hierarchy (caches)
    - Exploits spatial and temporal locality
    - Multiple cache levels

But uniprocessor performance growth is stalling
- The source of performance growth had been ILP: parallel execution of independent instructions from a single thread.
- But ILP growth has slowed abruptly:
  - Memory wall: processor speed grows at 55%/year, memory speed grows at only 7%/year (see the sketch below).
  - ILP wall: achieving higher ILP requires quadratically increasing complexity (and power).
  - Power efficiency: thermal packaging limit vs. cost.
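To make the memory-wall point concrete: if processors improve 55%/year and memory only 7%/year, the gap between them grows as (1.55/1.07)^n. The following sketch is not from the slides; it just evaluates that ratio for a few values of n.

    /* Sketch: processor-memory performance gap under the growth rates
     * quoted above (55%/year for processors, 7%/year for memory).
     * Compile with: gcc gap.c -lm */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        for (int years = 0; years <= 10; years += 5) {
            double gap = pow(1.55 / 1.07, years);   /* relative gap growth */
            printf("After %2d years the gap has grown %.1fx\n", years, gap);
        }
        return 0;
    }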

Types of parallelism
- Instruction level (cf. ECE 521)
  - Pipelining: successive instructions overlap in the pipeline stages, e.g.

    A (a load)  IF  ID  EX  MEM WB
    B               IF  ID  EX  MEM WB
    C                   IF  ID  EX  MEM WB

Types of parallelism, cont.
- Superscalar / VLIW
  Original:

    LD   F0, 34(R2)
    ADDD F4, F0, F2
    LD   F7, 45(R3)
    ADDD F8, F7, F6

  Schedule as:

    LD   F0, 34(R2)  |  LD   F7, 45(R3)
    ADDD F4, F0, F2  |  ADDD F8, F7, F6

  + Moderate degree of parallelism
  – Requires fast communication (register level)

Why ILP is slowing
- Branch-prediction accuracy is already > 90%; hard to improve it even more.
- The number of pipeline stages is already deep (≈ 20–30 stages):
  - But critical dependence loops do not change.
  - Memory latency requires more clock cycles to satisfy.
- Processor width is already high:
  - Quadratically increasing complexity to increase the width.
- Cache size:
  - Effective, but also shows diminishing returns.
  - In general, the size must be doubled to reduce the miss rate by a half.

Current trends: multicore and manycore

    Aspect            Intel Clovertown    AMD Barcelona                           IBM Cell
    # cores           4                   4                                       8+1
    Clock frequency   2.66 GHz            2.3 GHz                                 3.2 GHz
    Core type         OOO superscalar     OOO superscalar                         2-issue SIMD
    Caches            2 x 4 MB L2         512 KB L2 (private), 2 MB L3 (shared)   256 KB local store
    Chip power        120 W               95 W                                    100 W

Exercise: Browse the Web (or the textbook) for information on more recent processors, and for each processor, fill out this form. (You can view the submissions.)

Scope of CSC/ECE 506
- Parallelism:
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - Shared-memory machines (SMP and DSM)
    - Clusters
- Programming model:
  - Shared memory
  - Message-passing
  - Hybrid

Loop-level parallelism
- Sometimes each iteration can be performed independently:

    for (i=0; i<8; i++)
      a[i] = b[i] + c[i];

- Sometimes iterations cannot be performed independently → no loop-level parallelism:

    for (i=0; i<8; i++)
      a[i] = b[i] + a[i-1];

+ Very high parallelism (> 1K)
+ Often easy to achieve load balance
– Some loops are not parallel
– Some apps do not have many loops
(A sketch of exploiting the independent loop follows below.)
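As a concrete, hedged illustration of exploiting loop-level parallelism on a shared-memory machine, the sketch below uses an OpenMP work-sharing pragma; OpenMP is only one of several ways to express this and is not otherwise introduced in these slides.

    /* Sketch: the independent loop above, parallelized with OpenMP.
     * Each iteration writes a distinct a[i], so iterations can be
     * distributed across threads safely. Compile with: gcc -fopenmp */
    #include <stdio.h>

    int main(void) {
        double a[8], b[8], c[8];
        for (int i = 0; i < 8; i++) { b[i] = i; c[i] = 2 * i; }

        #pragma omp parallel for      /* independent iterations */
        for (int i = 0; i < 8; i++)
            a[i] = b[i] + c[i];

        for (int i = 0; i < 8; i++)
            printf("a[%d] = %.1f\n", i, a[i]);
        return 0;
    }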

Task-level parallelism
- Arbitrary code segments in a single program
- Across loops:

    for (i=0; i<n; i++)
      sum = sum + a[i];
    for (i=0; i<n; i++)
      prod = prod * a[i];

- Subroutines:

    Cost = getCost();
    A = computeSum();
    B = A + Cost;

- Threads: e.g., editor: GUI, printing, parsing
+ Larger granularity => low overheads and communication
– Low degree of parallelism
– Hard to balance
(A sketch of running the two loops as parallel tasks follows below.)
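A hedged sketch of task-level parallelism across the two loops above, again expressed with OpenMP (here, parallel sections); it assumes sum and prod are otherwise independent.

    /* Sketch: the sum loop and the prod loop are independent tasks,
     * so they can run concurrently as two OpenMP sections.
     * Compile with: gcc -fopenmp */
    #include <stdio.h>

    int main(void) {
        enum { n = 8 };
        double a[n], sum = 0.0, prod = 1.0;
        for (int i = 0; i < n; i++) a[i] = i + 1;

        #pragma omp parallel sections
        {
            #pragma omp section          /* task 1: sum */
            for (int i = 0; i < n; i++) sum = sum + a[i];

            #pragma omp section          /* task 2: prod */
            for (int i = 0; i < n; i++) prod = prod * a[i];
        }
        printf("sum = %.1f, prod = %.1f\n", sum, prod);
        return 0;
    }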

Program-level parallelism
- Various independent programs execute together, e.g., gmake:

    gcc -c code1.c   // assign to proc1
    gcc -c code2.c   // assign to proc2
    gcc -c main.c    // assign to proc3
    gcc main.o code1.o code2.o

+ No communication
– Hard to balance
– Few opportunities

Scope of CSC/ECE 506
- Parallelism:
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - Shared-memory machines (SMP and DSM)
    - Clusters
- Programming model:
  - Shared memory
  - Message-passing
  - Hybrid
  - (Data parallel)

Taxonomy of parallel computers
The Flynn taxonomy classifies machines by:
- Single or multiple instruction streams.
- Single or multiple data streams.

1. SISD machine (desktops, laptops)
- Only one instruction fetch stream
- Most of today's workstations or desktops

SIMD
- Examples: vector processors, SIMD extensions (MMX), GPUs
- A single instruction operates on multiple data items.

    SISD:  for (i=0; i<8; i++)
             a[i] = b[i] + c[i];

    SIMD:  a = b + c;  // vector addition

(An SSE sketch of the same computation follows below.)
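As a hedged, concrete example of a SIMD extension in use, the sketch below performs the same eight-element addition with x86 SSE intrinsics, four floats per instruction. SSE is just one possible SIMD target; the header and intrinsic names apply to x86 compilers that support it.

    /* Sketch: a = b + c with x86 SSE intrinsics, 4 floats per instruction. */
    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, ... */

    int main(void) {
        float a[8], b[8] = {0,1,2,3,4,5,6,7}, c[8] = {7,6,5,4,3,2,1,0};

        for (int i = 0; i < 8; i += 4) {
            __m128 vb = _mm_loadu_ps(&b[i]);   /* load 4 floats */
            __m128 vc = _mm_loadu_ps(&c[i]);
            __m128 va = _mm_add_ps(vb, vc);    /* one add, 4 lanes */
            _mm_storeu_ps(&a[i], va);
        }

        for (int i = 0; i < 8; i++) printf("%.1f ", a[i]);
        printf("\n");
        return 0;
    }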

MISD
- Example: CMU Warp
- Systolic arrays

MIMD
- Independent processors connected together to form a multiprocessor system.
- Physical organization: determines which memory hierarchy level is shared.
- Programming abstraction:
  - Shared memory:
    - On a chip: chip multiprocessor (CMP)
    - Interconnected by a bus: symmetric multiprocessors (SMP)
    - Point-to-point interconnection: distributed shared memory (DSM)
  - Distributed memory: clusters, grid

MIMD Physical Organization
- Shared-cache architecture: CMP (or simultaneous multithreading)
  - e.g.: Pentium 4 chip, IBM Power4 chip, Sun Niagara, Pentium D, etc.
  - Implies shared-memory hardware
  [Diagram: processors sharing a cache and memory]
- UMA (Uniform Memory Access) shared memory: Pentium Pro Quad, Sun Enterprise, etc.
  - What interconnection network? Bus, multistage, crossbar
  - Implies shared-memory hardware
  [Diagram: processors with private caches connected through a network to memory]

MIMD Physical Organization (2)
- NUMA (Non-Uniform Memory Access) shared memory: SGI Origin, Altix, IBM p690, AMD Hammer-based systems
  - What interconnection network? Crossbar, mesh, hypercube, etc.
  [Diagram: nodes, each with a processor, caches, and local memory, connected by a network]

MIMD Physical Organization (3)
- Distributed system/memory: also called clusters, grid
  [Diagram: complete nodes (processor, caches, memory, I/O) connected by a network]

Scope of CSC/ECE 506
- Parallelism:
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - MIMD
    - Shared-memory machines (SMP and DSM)
- Programming model:
  - Shared memory
  - Message-passing
  - Hybrid (e.g., UPC)
  - Data parallel

Programming models: shared memory
- Shared memory / shared address space: each processor can see the entire memory.
- Programming model = thread programming, as in uniprocessor systems (a minimal threading sketch follows below).
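A minimal sketch of the shared-memory model using POSIX threads: the arrays are visible to every thread, so each thread simply writes its own slice. The names and the two-thread split are illustrative only.

    /* Sketch: shared-memory programming with POSIX threads.
     * Both threads see the same arrays; each computes half of a = b + c.
     * Compile with: gcc -pthread */
    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    double a[N], b[N], c[N];           /* shared by all threads */

    void *worker(void *arg) {
        int id = *(int *)arg;          /* 0 or 1 */
        for (int i = id * N / 2; i < (id + 1) * N / 2; i++)
            a[i] = b[i] + c[i];
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int ids[2] = {0, 1};
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

        for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, &ids[i]);
        for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);

        for (int i = 0; i < N; i++) printf("%.1f ", a[i]);
        printf("\n");
        return 0;
    }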

Programming models: message-passing
- Distributed memory / message passing / multiple address spaces: a processor can only directly access its own local memory.
- All communication happens by explicit messages (a minimal sketch follows below).
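A minimal, hedged sketch of the message-passing model using MPI (a widely used message-passing library, though the slides do not prescribe one): rank 0 sends an array to rank 1, which can only see the data after the explicit receive.

    /* Sketch: message passing with MPI. Rank 0 owns the data and sends it;
     * rank 1 has its own address space and receives an explicit message.
     * Build/run (typical): mpicc msg.c && mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank;
        double buf[8];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 8; i++) buf[i] = i;
            MPI_Send(buf, 8, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received buf[7] = %.1f\n", buf[7]);
        }

        MPI_Finalize();
        return 0;
    }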

Programming models: data parallel
- Operations performed in parallel on each element of a data structure
- Logically a single thread of control, performing sequential or parallel steps
- Conceptually, a processor is associated with each data element
- Architectural model:
  - Array of many simple, cheap processing elements (PEs), each with little memory
  - Processing elements don't sequence through instructions
  - PEs are attached to a control processor that issues instructions
  - Specialized and general communication, cheap global synchronization
- Original motivation:
  - Matches simple differential equation solvers
  - Centralizes the high cost of instruction fetch/sequencing

Top 500 supercomputers
- http://www.top500.org
- Let's look at the Earth Simulator
  - Was #1 in 2004, #10 in 2006
- Hardware:
  - 5,120 NEC CPUs at 500 MHz (640 8-way nodes)
  - 8 GFLOPS per CPU (41 TFLOPS peak total)
  - Sustained performance in the 30s of TFLOPS!
  - 2 GB per CPU (4 x 512 MB FPLRAM modules), 10 TB total memory
  - Shared memory inside each node
  - 640 x 640 crossbar switch between the nodes
  - 16 GB/s inter-node bandwidth
  - 16 kW power consumption per node
(The aggregate figures are multiplied out in the sketch below.)
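The aggregate figures follow directly from the per-CPU numbers; this small check is not from the slides and simply multiplies them out.

    /* Sketch: aggregate Earth Simulator figures from the per-CPU numbers above. */
    #include <stdio.h>

    int main(void) {
        const int cpus = 5120;                 /* 640 nodes x 8 CPUs */
        const double gflops_per_cpu = 8.0;
        const double gb_per_cpu = 2.0;

        /* Using decimal (1000x) prefixes for the rough totals. */
        printf("Peak:   %.2f TFLOPS\n", cpus * gflops_per_cpu / 1000.0);  /* ~40.96 */
        printf("Memory: %.2f TB\n", cpus * gb_per_cpu / 1000.0);          /* ~10.24 */
        return 0;
    }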


Exercise
Go to http://www.top500.org and look at the Statistics menu in the top menu bar. From the dropdown, choose one of the statistics, e.g., Vendors or Processor Architecture, and examine what kind of systems are prevalent. Then do the same for earlier lists, and report on the trend. You may find interesting results by clicking on “Historical charts”. You can go all the way back to the first list, from 1993. Submit your results here.