
-1- Faculty of Information Technology – HCMC University of Technology
PARALLEL COMPUTER ARCHITECTURES (2nd week)
– References
– Flynn's Taxonomy
– Classification of Parallel Computers Based on Architectures
– Trend in Parallel Computer Architectures

-2- REFERENCES
– Scalable Parallel Computing – Kai Hwang and Zhiwei Xu, McGraw-Hill, Chapter 1
– Parallel Computing: Theory and Practice – Michael J. Quinn, McGraw-Hill, Chapter 3
– Parallel Processing Course – Yang-Suk Kee, School of EECS, Seoul National University

-3- FLYNN'S TAXONOMY
A way to classify parallel computers (Flynn, 1972), based on the notions of instruction streams and data streams:
– SISD (Single Instruction stream over a Single Data stream)
– SIMD (Single Instruction stream over Multiple Data streams)
– MISD (Multiple Instruction streams over a Single Data stream)
– MIMD (Multiple Instruction streams over Multiple Data streams)
Popularity: MIMD > SIMD > MISD
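The four classes follow mechanically from two attributes: how many concurrent instruction streams and how many concurrent data streams a machine sustains. A tiny illustrative helper (hypothetical, not from the slides) makes the mapping explicit:

```python
# Hypothetical helper: a machine's Flynn class is fixed by how many
# concurrent instruction streams and data streams it sustains.
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return i + "I" + d + "D"

print(flynn_class(1, 1))  # conventional uniprocessor -> SISD
print(flynn_class(1, 8))  # vector computer           -> SIMD
print(flynn_class(4, 4))  # parallel computer         -> MIMD
```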

-4- SISD (Single Instruction Stream over a Single Data Stream)
SISD – conventional sequential machines
[Diagram: CU sends the instruction stream (IS) to the PU, which exchanges the data stream (DS) with the MU and performs I/O]
CU: Control Unit, PU: Processing Unit, MU: Memory Unit
IS: Instruction Stream, DS: Data Stream

-5- SIMD (Single Instruction Stream over Multiple Data Streams)
SIMD – vector computers; special-purpose computations
[Diagram: SIMD architecture with distributed memory; one CU broadcasts the instruction stream (IS) to processing elements PE_1 ... PE_n, each with its own local memory LM_1 ... LM_n holding its data stream (DS); the program and the data sets are loaded from the host]
PE: Processing Element, LM: Local Memory

-6- MISD (Multiple Instruction Streams over a Single Data Stream)
MISD – processor arrays, systolic arrays; special-purpose computations
Is it different from a pipelined computer?
[Diagram: MISD architecture (the systolic array); control units CU_1 ... CU_n issue distinct instruction streams (IS) to processing units PU_1 ... PU_n, which pass a single data stream (DS) from one to the next; program and data reside in memory, with I/O at the ends]

-7- MIMD (Multiple Instruction Streams over Multiple Data Streams)
MIMD – general-purpose parallel computers
[Diagram: MIMD architecture with shared memory; each control unit CU_i issues its own instruction stream (IS) to its processing unit PU_i, and all PUs exchange data streams (DS) and I/O with the shared memory]

-8- CLASSIFICATION BASED ON ARCHITECTURES
– Pipelined Computers
– Dataflow Architectures
– Data Parallel Systems
– Multiprocessors
– Multicomputers

-9- PIPELINED COMPUTERS
Instructions are divided into a number of steps (segments, stages).
At the same time, several instructions can be loaded in the machine and be executed in different steps.
[Diagram: five instructions i ... i+4 flowing through the IF, ID, EX, WB stages; instruction i+1 enters IF while instruction i is in ID, and so on, so one instruction completes per cycle once the pipeline is full]
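The overlap in the diagram can be sketched numerically: instruction j enters the first stage at cycle j, so n instructions on a k-stage pipeline finish after n + k - 1 cycles rather than n * k. A minimal sketch using the slide's four stages:

```python
# Sketch: cycle-by-cycle occupancy of the slide's 4-stage pipeline.
# Instruction i occupies stage s during cycle i + s.
STAGES = ["IF", "ID", "EX", "WB"]

def schedule(n_instructions):
    # rows: instructions; columns: cycles; entry = stage name or ""
    total_cycles = n_instructions + len(STAGES) - 1
    table = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage
        table.append(row)
    return table

for row in schedule(5):
    print(" ".join(f"{c:>2}" for c in row))
```

With 5 instructions the table is 8 cycles wide (5 + 4 - 1), matching the overlapped execution the slide shows.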

-10- DATAFLOW ARCHITECTURES
Data-driven model
– A program is represented as a directed acyclic graph in which a node represents an instruction and an edge represents the data-dependency relationship between the connected nodes
– Firing rule: a node can be scheduled for execution if and only if its input data become valid for consumption
Dataflow languages
– Id, SISAL, Silage, ...
– Single assignment, applicative (functional) languages, no side effects
– Explicit parallelism

-11- DATAFLOW GRAPH
z = (a + b) * c
[Diagram: the dataflow representation of the arithmetic expression; a and b feed a '+' node, whose result together with c feeds a '*' node producing z]
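The firing rule on this graph can be sketched directly: each node fires as soon as all of its input tokens are present, with no program counter involved. A minimal interpreter for the expression above (node and token names are assumptions for illustration):

```python
# Sketch: firing-rule evaluation of the dataflow graph for z = (a + b) * c.
import operator

# node: (operation, input token names, output token name)
nodes = [
    (operator.add, ("a", "b"), "t"),   # '+' node: t = a + b
    (operator.mul, ("t", "c"), "z"),   # '*' node: z = t * c
]

tokens = {"a": 2, "b": 3, "c": 4}      # initial data tokens (sample values)
fired = set()
while len(fired) < len(nodes):
    for idx, (op, ins, out) in enumerate(nodes):
        # firing rule: all inputs valid and node not yet executed
        if idx not in fired and all(i in tokens for i in ins):
            tokens[out] = op(*(tokens[i] for i in ins))
            fired.add(idx)

print(tokens["z"])  # (2 + 3) * 4 -> 20
```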

-12- DATAFLOW COMPUTER
Execution of instructions is driven by data availability.
Basic components
– Data are held directly inside instructions
– Data-availability check unit
– Token-matching unit
– Chain reaction of asynchronous instruction executions
What is the difference between this and normal (control-flow) computers?

-13- DATAFLOW COMPUTERS
Advantages
– Very high potential for parallelism
– High throughput
– Free from side effects
Disadvantages
– Time lost waiting for unneeded arguments
– High control overhead
– Difficulty in manipulating data structures

-14- DATAFLOW MACHINE (Manchester Dataflow Computer)
The first actual hardware implementation.
[Diagram: a ring of units: I/O switch (from/to host), token queue, matching unit (with overflow unit), instruction store, functional units func 1 ... func k, and a network back to the switch; tokens circulate and are matched before firing instructions]

-15- DATAFLOW REPRESENTATION
input d, e, f
c_0 := 0
for i from 1 to 4 do begin
  a_i := d_i / e_i
  b_i := a_i * f_i
  c_i := b_i + c_(i-1)
end
output a, b, c
[Diagram: the corresponding dataflow graph; each iteration i has a '/' node (d_i, e_i -> a_i), a '*' node (a_i, f_i -> b_i), and a '+' node (b_i, c_(i-1) -> c_i); the four division/multiplication pairs are independent, while the '+' nodes form a chain from c_0 to c_4]
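The loop can be written out in runnable form to make the dependencies visible: each a_i and b_i depends only on its own iteration, while each c_i needs c_(i-1). The slide leaves d, e, f unspecified; the sample values below are assumed for illustration.

```python
# The slide's loop in runnable form (sample inputs, assumed).
d = [8.0, 6.0, 4.0, 2.0]
e = [2.0, 3.0, 4.0, 1.0]
f = [1.0, 2.0, 3.0, 4.0]

a, b, c = [], [], [0.0]            # c[0] = 0
for i in range(4):
    a.append(d[i] / e[i])          # a_i = d_i / e_i     (independent across i)
    b.append(a[i] * f[i])          # b_i = a_i * f_i     (independent across i)
    c.append(b[i] + c[i])          # c_i = b_i + c_(i-1) (serial chain)

print(a)      # -> [4.0, 2.0, 1.0, 2.0]
print(b)      # -> [4.0, 4.0, 3.0, 8.0]
print(c[1:])  # -> [4.0, 8.0, 11.0, 19.0]
```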

-16- EXECUTION ON CONTROL FLOW MACHINES
Assume all the external inputs are available before entering the do loop, and that + takes 1 cycle, * takes 2 cycles, and / takes 3 cycles.
Sequential execution on a uniprocessor takes 24 cycles: each of the 4 iterations computes a_i (3 cycles), b_i (2 cycles), and c_i (1 cycle) one after another, i.e. 4 x 6 cycles.
How long will it take to execute this program on a dataflow computer with 4 processors?

-17- EXECUTION ON A DATAFLOW MACHINE
Data-driven execution on a 4-processor dataflow computer takes 9 cycles: all four divisions a_1 ... a_4 run in parallel (cycles 1-3), all four multiplications b_1 ... b_4 run in parallel (cycles 4-5), and the additions c_1 ... c_4 run one after another (cycles 6-9) because each depends on the previous one.
Can we further reduce the execution time of this program?
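The 9-cycle figure can be checked with a small earliest-finish-time computation over the dataflow graph: with 4 processors no resource conflicts arise here (at most four operations are ready at any level), so the critical path determines the total time. A sketch:

```python
# Sketch: earliest finish times for the loop's dataflow graph with the
# slide's latencies (/ : 3 cycles, * : 2 cycles, + : 1 cycle).
LAT = {"/": 3, "*": 2, "+": 1}

# op name -> (operation kind, list of predecessor op names)
graph = {}
for i in range(1, 5):
    graph[f"a{i}"] = ("/", [])
    graph[f"b{i}"] = ("*", [f"a{i}"])
    graph[f"c{i}"] = ("+", [f"b{i}"] + ([f"c{i-1}"] if i > 1 else []))

finish = {}
def finish_time(op):
    # an op starts once all its predecessors have finished
    if op not in finish:
        kind, preds = graph[op]
        start = max((finish_time(p) for p in preds), default=0)
        finish[op] = start + LAT[kind]
    return finish[op]

print(max(finish_time(op) for op in graph))  # -> 9 cycles
```

The divisions finish at cycle 3, the multiplications at cycle 5, and the chained additions at cycles 6, 7, 8, 9, reproducing the slide's schedule.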

-18- PROBLEMS OF DATAFLOW COMPUTERS
– Excessive copying of large data structures in dataflow operations
  I-structure: a tagged memory unit for overlapped usage by the producer and consumer; a retreat from the pure dataflow approach (shared memory)
– Handling complex data structures
– Chain-reaction control is difficult to implement
  Complexity of the matching store and memory units
– Exposes too much parallelism (?)

-19- DATA PARALLEL SYSTEMS
Programming model
– Operations performed in parallel on each element of a data structure
– Logically a single thread of control, performing sequential or parallel steps
– Conceptually, a processor is associated with each data element

-20- DATA PARALLEL SYSTEMS (cont'd)
SIMD architectural model
– An array of many simple, cheap processors, each with little memory
  Processors do not sequence through instructions
– Attached to a control processor that issues instructions
– Specialized and general communication, cheap global synchronization

-21- VECTOR PROCESSOR
The instruction set includes operations on vectors as well as scalars.
Two types of vector computers:
– Processor arrays
– Pipelined vector processors

-22- PROCESSOR ARRAY
A sequential computer connected to a set of identical processing elements that simultaneously perform the same operation on different data, e.g. the CM-200.
[Diagram: a front-end computer (CPU, program and data memory, I/O processor) sends instructions down the instruction path to the processor array, where each processing element with its own data memory is linked to the others through an interconnection network and to I/O via the data path]
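The execution model above can be sketched in a few lines: the front end broadcasts one instruction, and every processing element applies it to its own data item in lockstep. A minimal illustration (the helper name and data are assumptions, not CM-200 specifics):

```python
# Sketch of the SIMD processor-array model: one broadcast instruction,
# applied by every processing element to its own local data item.
def simd_step(instruction, per_pe_data):
    # all PEs execute the same instruction in lockstep
    return [instruction(x) for x in per_pe_data]

data = [1, 2, 3, 4]                       # one data item per PE
print(simd_step(lambda x: x * 2, data))   # -> [2, 4, 6, 8]
```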

-23- PIPELINED VECTOR PROCESSOR
– Streams vectors from memory to the CPU
– Uses pipelined arithmetic units to manipulate the data
E.g. the Cray-1 and Cyber-205.

-24- MULTIPROCESSOR
Consists of many fully programmable processors, each capable of executing its own program.
Shared address space architecture.
Classified into two types:
– Uniform Memory Access (UMA) multiprocessors
– Non-Uniform Memory Access (NUMA) multiprocessors

-25- SHARED ADDRESS SPACE ARCHITECTURE
Virtual address spaces for a collection of processes communicating via shared addresses.
[Diagram: the private virtual address spaces of processes P_0 ... P_n each map a shared portion onto common physical addresses in the machine's physical address space, alongside a private portion; a Store by one process and a Load by another meet at the same shared location]
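The Store/Load communication in the figure can be sketched with threads standing in for processors that share physical memory (a simplification: real machines also need cache coherence, discussed on the next slide):

```python
# Sketch: communication through a shared address, as in the figure.
# Python threads stand in for processors sharing physical memory.
import threading

shared = {"x": 0}             # stands in for the common (shared) portion

def producer():
    shared["x"] = 42          # Store through a shared address

t = threading.Thread(target=producer)
t.start()
t.join()                      # synchronize before reading
print(shared["x"])            # the Load sees the stored value -> 42
```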

-26- UMA MULTIPROCESSOR
– Uses a central switching mechanism to reach a centralized shared memory
– All processors have equal access time to global memory
– Tightly coupled system
– Problem: cache consistency
[Diagram: processors P_1 ... P_n, each with its cache C_1 ... C_n, connected through the switching mechanism to the memory banks and I/O]

-27- UMA MULTIPROCESSOR (cont'd)
Crossbar switching mechanism
[Diagram: each processor with its cache, and each I/O module, connects through a crossbar switch to the memory]

-28- UMA MULTIPROCESSOR (cont'd)
Shared-bus switching mechanism
[Diagram: the processors with their caches, the I/O modules, and the memory all attach to a single shared bus]

-29- UMA MULTIPROCESSOR (cont'd)
Packet-switched network
[Diagram: processors with caches and memory modules exchange packets through a network]

-30- NUMA MULTIPROCESSOR
– Distributed shared memory formed by the local memories of all processors
– Memory access time depends on whether the address is local to the processor
– Caching shared (particularly nonlocal) data?
[Diagram: nodes, each a processor with cache and local memory, joined by a network; the local memories together form the distributed shared memory]

-31- CURRENT TYPES OF MULTIPROCESSORS
PVP (Parallel Vector Processor)
– A small number of proprietary vector processors connected by a high-bandwidth crossbar switch
SMP (Symmetric Multiprocessor)
– A small number of COTS microprocessors connected by a high-speed bus or crossbar switch
DSM (Distributed Shared Memory)
– Similar to SMP, but the memory is physically distributed among the nodes

-32- PVP (Parallel Vector Processor)
[Diagram: vector processors (VP) connected through a crossbar switch to shared-memory modules (SM)]
VP: Vector Processor, SM: Shared Memory

-33- SMP (Symmetric Multiprocessor)
[Diagram: microprocessor-and-cache nodes (P/C) connected through a bus or crossbar switch to shared-memory modules (SM)]
P/C: Microprocessor and Cache

-34- DSM (Distributed Shared Memory)
[Diagram: nodes, each with a microprocessor and cache (P/C), local memory (LM), cache directory (DIR), and network interface circuitry (NIC) on a memory bus (MB), connected by a custom-designed network]
DIR: Cache Directory

-35- MULTICOMPUTERS
– Consists of many processors, each with its own memory
– No shared memory
– Processors interact via message passing: a loosely coupled system
[Diagram: processor-memory nodes (P, M) attached around a message-passing interconnection network]
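The multicomputer interaction style can be sketched with operating-system processes standing in for nodes: each has private memory, and the only way data moves between them is an explicit message. A minimal sketch (the function names are illustrative, not a real multicomputer API):

```python
# Sketch: message passing between processes with private memory,
# as on a multicomputer node pair.
from multiprocessing import Pipe, Process

def worker(conn):
    n = conn.recv()            # data arrives only as an explicit message
    conn.send(n * n)           # the reply is a message as well
    conn.close()

def remote_square(n):
    parent, child = Pipe()     # the "interconnection network"
    p = Process(target=worker, args=(child,))
    p.start()
    parent.send(n)             # send the request to the other node
    result = parent.recv()     # receive the reply
    p.join()
    return result

if __name__ == "__main__":
    print(remote_square(7))  # -> 49
```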

-36- CURRENT TYPES OF MULTICOMPUTERS
MPP (Massively Parallel Processing)
– Total number of processors > 1000
Cluster
– Each node in the system has fewer than 16 processors
Constellation
– Each node in the system has 16 or more processors

-37- MPP (Massively Parallel Processing)
[Diagram: nodes, each with a microprocessor and cache (P/C), local memory (LM), and network interface circuitry (NIC) on a memory bus (MB), connected by a custom-designed network]
MB: Memory Bus, NIC: Network Interface Circuitry

-38- CLUSTER
[Diagram: nodes, each with a microprocessor and cache (P/C), memory (M), a local disk (LD), and a NIC on an I/O bus (IOB) behind a bridge from the memory bus (MB), connected by a commodity network (Ethernet, ATM, Myrinet, VIA)]
LD: Local Disk, IOB: I/O Bus

-39- CONSTELLATION
[Diagram: nodes, each with 16 or more processors with caches (P/C) around a hub, shared memory (SM), a local disk (LD), an I/O controller (IOC), and a NIC, connected by a custom or commodity network]
IOC: I/O Controller

-40- TREND IN PARALLEL COMPUTER ARCHITECTURES
[Chart: number of HPC systems of each type (MPPs, Constellations, Clusters, SMPs) over the years]

-41- TOWARD ARCHITECTURAL CONVERGENCE
The evolution and role of software have blurred the boundary
– Send/recv is supported on shared-address-space (SAS) machines via buffers
– A global address space can be constructed on message-passing (MP) machines
Hardware organization is converging too
– Tighter network-interface integration for MP (lower latency, higher bandwidth)
– At the lower level, hardware SAS passes hardware messages
– Nodes connected by a general network and communication assists
Clusters of workstations/SMPs
– Emergence of fast SANs (System Area Networks)
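The first convergence point above, send/recv layered on a shared address space via buffers, can be sketched in a few lines: a message channel is just a shared buffer guarded by a lock, with receivers blocking until a message arrives. The `Channel` class below is an illustrative sketch, not a real library API.

```python
# Sketch: message passing built on shared memory, as on an SAS machine.
from collections import deque
import threading

class Channel:
    """send/recv layered on a shared in-memory buffer."""
    def __init__(self):
        self.buf = deque()                 # the shared buffer
        self.cv = threading.Condition()    # guards the buffer

    def send(self, msg):
        with self.cv:
            self.buf.append(msg)           # store the message
            self.cv.notify()               # wake a waiting receiver

    def recv(self):
        with self.cv:
            while not self.buf:            # block until a message arrives
                self.cv.wait()
            return self.buf.popleft()

ch = Channel()
threading.Thread(target=lambda: ch.send("hello")).start()
print(ch.recv())  # -> hello
```

The dual construction, a global address space on a message-passing machine, works the same way in reverse: loads and stores to remote addresses are translated into request/reply messages.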

-42- CONVERGED ARCHITECTURE OF CURRENT SUPERCOMPUTERS
[Diagram: multiprocessor nodes, each with several processors and a memory, connected by an interconnection network]