Multiprocessor Architectures and Parallel Programs


What's It All About, Then?
- An old definition: a parallel computer is a "collection of processing elements that communicate and cooperate to solve large problems fast."
- Is life that easy?! The definition says nothing about:
  - "Collection" size and scalability
  - Processing element power
  - Communication infrastructure
  - Communication protocols for cooperation
  - Data sharing
  - ...
- The answer (and more) is in ECE 5504, so stick around!
- Note: the definition alone is not enough to describe a parallel architecture. Life is not as simple as it appears in the definition; we need to answer many, many questions, and that is what this course is here for.

Why Are We Going Parallel?
- Application trends
  - Performance: "travel into the future using Moore's Law!" The ticket price is an exponential function of distance! (how?)
  - Parallelizable examples: scientific/engineering and commercial applications (do you know Toy Story's Buzz Lightyear?!)
- Technology trends (VLSI)
  - Component size decreasing
  - Useful chip area increasing
  - Multiple processors on a chip possible
- Architectural trends: see the next couple of slides
- Notes: The wording of the quote above is mine, and I take the blame or the credit for it; the main idea, however, is not something I came up with. The Toy Story movie was produced on a parallel system composed of hundreds of Sun workstations. The inflection point for microprocessor design in the mid-1980s came with the arrival of microprocessors with 32-bit words and caches.

Recent History of Microprocessor Design
- Up to the mid-1980s: bit-level parallelism (word size); limited gain after a certain point
- Mid-1980s to mid-1990s: instruction-level parallelism (pipelined and superscalar processors)
  - Bigger caches to keep more instructions ready
  - Branch predictors
  - Replacement of CISC by RISC
  - Problems: costly cache misses (performance-wise) and very costly design (money-wise)
- Natural question: "What is next?" Answer: process/thread-level parallelism

Recent History of Computing Systems
- Since the mid-1980s, microprocessors have supported multiprocessor configurations
- The multiprocessor trend migrated to the desktop
- Mid-1990s: the shared-memory multiprocessor trend covered everything from servers to desktops
- Case study: Pentium Pro (1994): four processors wired directly together, with no additional logic and no bus drivers
- Software vendors (e.g., database companies) called for parallel architectures side by side with the hardware
- The difference between large servers and small desktops is in the number of processors per system: desktops support a few, large servers support tens, and large commercial systems support hundreds

Supercomputing
- Used mostly for scientific computing
- Two main trends:
  - Vector processors, e.g. the Cray X-MP (2, then 4 processors), Cray Y-MP (8), and Cray T94 (32)
  - Microprocessor-based massively parallel processors (MPPs), with hundreds and then thousands of processors
- MPPs now dominate; even Cray Research announced the T3D, based on DEC Alpha processors

Vector Processors vs. MPP
- Cray started producing MPPs too, in 1993!

Common Parallel Architectures
- Shared address space
- Message passing
- Data parallel
- Dataflow
- Systolic

Shared Address Space (1)
- Remember "shared-memory" multiprocessing for interprocess communication and logical/physical address spaces?
- Processes define shared portions of their address spaces: process i and process j each keep private regions but map a shared region onto the same physical memory (figure omitted)
- A little history:
  - Increase memory capacity (and maybe bandwidth) by adding memory blocks
  - The same for I/O; I/O requires direct memory access
  - The same for processing capacity
  - Several memory blocks may be present largely for historical reasons

Shared Address Space (2)
- An interconnect is required between I/O and memory, and between processors and memory
- Dancehall organization (also UMA: Uniform Memory Access): processors with their caches and I/O on one side of the interconnect, the memory modules on the other (figure omitted)
- Interconnect options, each with different cost and performance scalability:
  1. Crossbar
  2. Multistage network
  3. Bus (SMP: Symmetric Multiprocessor)

Shared Address Space (3)
- Dancehall UMA model vs. NUMA model (Nonuniform Memory Access)
- In the NUMA model, each processor has a local memory block, and the processor/memory nodes are connected by a scalable network (figure omitted)
- In either model, the user-level operations are load and store (see the sketch below)
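The following is a minimal C sketch of the shared-address-space model using POSIX threads; the variable names and the toy summation are purely illustrative, not taken from the slides. The point is that the only user-level operations are ordinary loads and stores to memory visible to all threads, plus synchronization.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 4

static long shared_sum = 0;                      /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long my_part = (long)(intptr_t)arg;          /* private to this thread */
    pthread_mutex_lock(&lock);                   /* synchronize access to shared data */
    shared_sum += my_part;                       /* an ordinary store to shared memory */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)(i + 1));
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("shared_sum = %ld\n", shared_sum);    /* an ordinary load from shared memory */
    return 0;
}

Something like "gcc -pthread shared_sum.c" would build it on a typical POSIX system.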

Message Passing (1)
- Like "message passing" for interprocess communication, except that communication is among processors
- Like the shared address space's NUMA model, except that communication is at the I/O level
- Like networks/clusters, except that:
  - Node packaging is tighter
  - There are no human I/O devices per node
  - The network is much faster than a LAN

Message Passing (2)
- The user-level operations are send and receive (see the sketch below)
  - FIFO and blocking operations
  - DMA and non-blocking operations
- Topologies: hypercube, ring, grid
- Communication:
  - Between neighbors only (classic)
  - Between arbitrary nodes via store-and-forward (modern)
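A minimal message-passing sketch in C using MPI, where send and receive really are the user-level operations; the value 42 and the two-rank setup are arbitrary choices for illustration. MPI_Send/MPI_Recv are the blocking forms mentioned above; MPI_Isend/MPI_Irecv would be their non-blocking counterparts.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* which processor am I? */

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);        /* blocking send to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                               /* blocking receive from rank 0 */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with at least two ranks (e.g. mpirun -np 2), the program moves the data by explicit messages rather than shared loads and stores.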

Got Confused Enough?
- That's the point; architectures are converging!
- The boundary is blurry at the user level:
  - Message passing is supported on shared-memory machines
  - A shared address space is supported on message-passing machines, either user-defined or at the paging level (like virtual memory, but with remote memory in place of the disk)
- Fast LAN-based clusters are converging toward parallel machines

Data Parallel and Flynn's Taxonomy (1972)
- The taxonomy is based on:
  - The number of instruction streams issued
  - The number of data elements worked on
- The resulting classes:
  - SISD: single instruction stream, single data stream
  - SIMD: single instruction stream, multiple data streams
  - MISD: multiple instruction streams, single data stream (?)
  - MIMD: multiple instruction streams, multiple data streams
- "I thought you were gonna talk about data parallel!!" Well, SIMD is also called data-parallel processing (see the sketch below); the compiler generates the low-level parallel code
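As a rough illustration of the data-parallel (SIMD) style in C, the loop below applies the same operation to every array element and asks OpenMP to spread the elements across SIMD lanes and threads; the arrays and sizes are made up for this sketch.

#include <stdio.h>

#define N 1000000

static float a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Conceptually one instruction stream drives all data elements; the
       compiler/runtime maps the element-wise work onto lanes and threads. */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %f\n", c[10]);
    return 0;
}

Compiled with -fopenmp the loop runs in parallel; without it the pragma is ignored and the same code runs sequentially.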

Dataflow Architecture
- A program is represented by a graph
- Simple example: z = (x + y) * (u - v); one node computes x + y, another computes u - v, and a third node multiplies the two results, with operands flowing between nodes over a network (figure omitted)
- Static vs. dynamic operation-to-processor mapping

Systolic Architecture
- An array of simple processing elements (PEs)
- Like SIMD data-parallel processing, except that each PE may perform a different operation

What About Distributed Systems?
- The main purpose of a distributed system is to make use of resources that are distributed geographically or across several machines; exploiting parallelism is not necessarily the purpose of using one
- Distributed object programming examples:
  - CORBA (Common Object Request Broker Architecture)
  - DCOM (Distributed Component Object Model)
  - Java RMI (Remote Method Invocation)
- Using distributed object programming does not guarantee parallelism

Parallel Programs

Why Bother?
- This is an "architecture" course; why would we care about software?
- As a parallel computer architect:
  - To understand the effect of the target software on the design's degrees of freedom
  - To understand the role and the limitations of the architecture
- As an algorithm designer:
  - To understand how to design effective parallel algorithms (parallel algorithm design can be significantly different from sequential algorithm design!)

Why Bother? (Continued)
- As a programmer:
  - To obtain the best performance out of a parallel system through careful coding
- For parallel programs and systems:
  - The interaction between program and system is very strong
  - There is at least one more degree of freedom: the number of processors

Parallelizing a Once-Sequential Program
- A sequential program consists of a parallelizable part and an inherently sequential part; parallelization shortens only the former (figure omitted)
- Speedup = ts / tp, where ts is the sequential execution time and tp is the parallel execution time

Example: Simulating Ocean Currents
- (Figure: (a) cross sections; (b) spatial discretization of a cross section)
- Model the ocean as two-dimensional grids, discretized in space and time; finer spatial/temporal resolution means greater accuracy
- Many different computations per time step: set up and solve equations
- There is concurrency both across grid computations and within each grid computation (a sketch of one grid sweep follows below)
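Below is a hedged sketch, in C with OpenMP, of one time step over a 2-D grid in the spirit of the ocean example; the 5-point averaging rule, the grid size, and the iteration count are illustrative stand-ins, not the actual ocean-model equations.

#include <stdio.h>

#define N 256

static double grid[N][N], next[N][N];

static void time_step(void)
{
    /* Every interior point of a time step can be computed concurrently. */
    #pragma omp parallel for collapse(2)
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            next[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                 grid[i][j - 1] + grid[i][j + 1]);

    #pragma omp parallel for collapse(2)
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            grid[i][j] = next[i][j];            /* copy back for the next step */
}

int main(void)
{
    grid[N / 2][N / 2] = 1.0;                   /* arbitrary initial condition */
    for (int step = 0; step < 100; step++)
        time_step();
    printf("grid[%d][%d] = %g\n", N / 2, N / 2, grid[N / 2][N / 2]);
    return 0;
}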

How Is It Done?
- Manually: by the programmer at design time
- Automatically:
  - By the compiler at compile time
  - By the operating system at run time
  - Still a research problem

Terminology
- Task: the smallest unit of concurrency a parallel program can exploit (performed by one processor)
- Process: an abstract entity performing a subset of the tasks (this is not an OS definition)
- Thread: a process (again in the non-OS sense)
- Processor: the physical entity executing processes

Parallelization Steps (1)
- Partitioning:
  - Decomposition of the computation into tasks
  - Assignment of tasks to processes
- Orchestration: communication/synchronization among processes
- Mapping: of processes to processors

Parallelization Steps (2)
- (Figure: sequential computation -> decomposition -> tasks -> assignment -> processes p0..p3 -> orchestration -> parallel program -> mapping -> parallel architecture)

Decomposition
- Description: break the sequential computation down into tasks
- Goal: to expose concurrency
- Limitation: the concurrency available is limited
- Architecture dependence: usually independent

Assignment
- Description: specifying how tasks are distributed over processes
- Goals:
  - Balance the workload (computation, data, I/O, communication)
  - Reduce interprocess communication
  - Reduce the run-time overhead of managing the assignment
- Types: static, or dynamic (reassign for better balance); see the sketch after this list
- Architecture dependence: usually independent
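The sketch below contrasts static and dynamic assignment using OpenMP scheduling clauses in C; work() is a hypothetical task whose cost varies from iteration to iteration, chosen only to make the load imbalance visible.

#include <stdio.h>
#include <omp.h>

static double work(int i)                       /* hypothetical task with uneven cost */
{
    double x = 0.0;
    for (int k = 0; k < (i % 8 + 1) * 100000; k++)
        x += k * 1e-9;
    return x;
}

int main(void)
{
    double total = 0.0;

    /* Static assignment: iterations are divided among threads up front. */
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 0; i < 1024; i++)
        total += work(i);

    /* Dynamic assignment: idle threads grab the next chunk at run time,
       rebalancing uneven work at the cost of extra management overhead. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < 1024; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}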

Orchestration
- Description: specifying how processes communicate and synchronize
- Goals:
  - Reduce processor communication/synchronization
  - Preserve locality of data reference
  - Optimize the schedule
  - Reduce parallelism-management overhead
- Dependent on:
  - The architecture
  - The programming model (coding style and primitives provided)
  - The programming language
- Related issues:
  - Data organization
  - Local task scheduling (within a process) for locality exploitation
  - Explicit vs. implicit communication
  - Message size
  - Choice of communication/synchronization primitives

Mapping
- Description: specifying which processes run on which processors; a form of resource allocation
- Schemes:
  - Space sharing
  - Static vs. dynamic
  - Program vs. OS control (a program-controlled sketch follows below)
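As one concrete, Linux-specific illustration of program-controlled mapping, the sketch below pins the calling process to processor 0 with sched_setaffinity; in most programs this decision is simply left to the OS scheduler.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                                      /* run only on processor 0 */

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {   /* pid 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }
    printf("pid %d pinned to CPU 0\n", (int)getpid());
    return 0;
}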

General Notes on Parallelization
- A task is the smallest unit of computation, so decomposition determines the number of processes (and hence processors) that can be used effectively
- Complete workload balance may be inherently unachievable for some computations
- Mapping, if not done by the user, is done by the OS (and is not necessarily optimized)

Speedup Analysis (1)
- Speedup = "sequential time" / "parallel time"
- Assume:
  - A system with p processors
  - Tp: the parallel time (Tp = Tcomp + Tcomm)
  - Ts: the sequential time
  - Tcomp = s*Ts + (1 - s)*Ts/p, where s is the inherently sequential fraction of the program
- Then Speedup = Ts / Tp (the derivation is reconstructed below)
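The closing expression on this slide was a figure; the following LaTeX is a reconstruction consistent with the definitions above (S = speedup, s = inherently sequential fraction, p = number of processors).

\[
  T_p = T_{comp} + T_{comm}
      = s\,T_s + \frac{(1-s)\,T_s}{p} + T_{comm}
\]
\[
  S = \frac{T_s}{T_p}
    = \frac{1}{\,s + \frac{1-s}{p} + \frac{T_{comm}}{T_s}\,}
\]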

Speedup Analysis (2)
- Amdahl's Law: ignore the communication time
- Amdahl's limit: plotting speedup against the number of processors for several values of the inherently sequential fraction s shows that speedup is bounded by 1/s (see the limit below)
- For the case where s = 0.5, the best you can do is a speedup of 2, and that is with a virtually infinite number of processors
- s:   0.5   0.2   0.1   0.01   0.001
  1/s: 2     5     10    100    1000
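In LaTeX, Amdahl's limit follows from the speedup formula by dropping the communication term and letting p grow; the numeric examples match the table above.

\[
  S(p) = \frac{1}{\,s + \frac{1-s}{p}\,}
  \qquad\Longrightarrow\qquad
  \lim_{p \to \infty} S(p) = \frac{1}{s}
\]
% e.g. s = 0.5 gives a limit of 2, and s = 0.01 gives a limit of 100.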

Speedup Analysis (3)
- Communication limit:
  - Assume Tcomm(p) = f(p)*Ts, where f is the communication-to-computation ratio
  - Assume a perfectly parallelizable program (s = 0) and do the math (see below)
- Degree-of-parallelism limit: the degree of parallelism is the number of parallel operations in a program, and speedup cannot exceed it
- Efficiency: speedup divided by the number of processors
- Linear speedup: speedup equal to p (efficiency of 1)
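A reconstruction of "the math" for the communication limit, together with the standard definitions of efficiency and linear speedup, in LaTeX; it assumes the communication model stated above, Tcomm(p) = f(p)*Ts.

\[
  S(p) = \frac{T_s}{\frac{T_s}{p} + f(p)\,T_s}
       = \frac{1}{\frac{1}{p} + f(p)}
  \;\le\; \frac{1}{f(p)}
\]
\[
  E(p) = \frac{S(p)}{p},
  \qquad \mbox{linear speedup: } S(p) = p \iff E(p) = 1
\]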