Parallel Architectures and Performance Analysis
Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

Parallel computer: a multiple-processor system supporting parallel programming. Three principal types of architecture:
Vector computers, in particular processor arrays
Shared memory multiprocessors: specially designed and manufactured systems
Distributed memory multicomputers: message passing systems readily formed from a cluster of workstations
Parallel Architectures and Performance Analysis – Slide 2

Vector computer: instruction set includes operations on vectors as well as scalars. Two ways to implement vector computers:
Pipelined vector processor (e.g. Cray): streams data through pipelined arithmetic units
Processor array: many identical, synchronized arithmetic processing elements
Parallel Architectures and Performance Analysis – Slide 3
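As a small, hedged illustration (not from the original slides; the function and array names are hypothetical), this is the kind of element-wise loop that a single vector instruction or a processor array evaluates across all elements at once, written in plain C:

    /* Element-wise vector addition c = a + b.
       A pipelined vector processor streams a and b through its arithmetic
       pipeline; a processor array assigns one element to each processing
       element. */
    void vector_add(const double *a, const double *b, double *c, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }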

Type 1: Centralized Multiprocessor
Natural way to extend the single processor model: have multiple processors connected to multiple memory modules such that each processor can access any memory module.
The so-called shared memory configuration:
Parallel Architectures and Performance Analysis – Slide 4

Parallel Architectures and Performance Analysis – Slide 5
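Since the course pairs these machines with OpenMP, here is a minimal, hedged C/OpenMP sketch of shared-memory programming (not from the original slides; the array size and loop body are hypothetical): every thread reads and writes the same array directly through the shared address space.

    /* Shared-memory parallelism: threads of one process update a common
       array that lives in the shared address space.
       Build with an OpenMP-enabled compiler, e.g. gcc -fopenmp. */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        static double a[1000];

        #pragma omp parallel for      /* loop iterations split across threads */
        for (int i = 0; i < 1000; i++)
            a[i] = 2.0 * i;           /* all threads access the same array a */

        printf("a[999] = %f\n", a[999]);
        return 0;
    }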

Type 2: Distributed Multiprocessor
Distribute primary memory among processors
Increase aggregate memory bandwidth and lower average memory access time
Allow greater number of processors
Also called non-uniform memory access (NUMA) multiprocessor
Parallel Architectures and Performance Analysis – Slide 6

Parallel Architectures and Performance Analysis – Slide 7

Multicomputers: complete computers connected through an interconnection network.
Parallel Architectures and Performance Analysis – Slide 8

Distributed memory multiple-CPU computer
Same address on different processors refers to different physical memory locations
Processors interact through message passing
Commercial multicomputers
Commodity clusters
Parallel Architectures and Performance Analysis – Slide 9
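A hedged sketch of message passing on a multicomputer, assuming MPI as used later in the course (not from the original slides): because no memory is shared, rank 0 must explicitly send a value that rank 1 explicitly receives. Run with at least two processes, e.g. mpirun -np 2.

    /* Message passing between two processes on a distributed memory machine. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                                   /* local to rank 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Rank 1 received %d\n", value);        /* its own copy */
        }

        MPI_Finalize();
        return 0;
    }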

Parallel Architectures and Performance Analysis – Slide 10

Parallel Architectures and Performance Analysis – Slide 11

Parallel Architectures and Performance Analysis – Slide 12

Michael Flynn (1966) created a classification for computer architectures based upon two characteristics: instruction streams and data streams. Also important are the number of processors, the number of programs that can be executed, and the memory structure.
Parallel Architectures and Performance Analysis – Slide 13

[Figure: SISD organization. A single control unit sends control signals to one arithmetic processor; the instruction stream and data stream flow between memory and the processor, and results are returned to memory.]
Parallel Architectures and Performance Analysis – Slide 14

[Figure: SIMD organization. One control unit broadcasts a common control signal to processing elements PE 1 through PE n, each operating on its own data stream (Data Stream 1 through Data Stream n).]
Parallel Architectures and Performance Analysis – Slide 15

[Figure: MISD organization. Control Units 1 through n drive Processing Elements 1 through n with distinct Instruction Streams 1 through n, all of which operate on a single data stream.]
Parallel Architectures and Performance Analysis – Slide 16

Serial execution of two processes with 4 stages each: time to execute is T = 8t, where t is the time to execute one stage.
Pipelined execution of the same two processes: T = 5t.
[Figure: timing diagrams showing the stages S1 through S4 of both processes executed back to back (serial) and overlapped stage by stage (pipelined).]
Parallel Architectures and Performance Analysis – Slide 17
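More generally (a standard pipeline timing result, not stated on the slide), for n processes of k stages each:
T_serial = n k t
T_pipelined = (k + n − 1) t
With k = 4 and n = 2 these give 8t and 5t, matching the slide.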

[Figure: MIMD organization. Control Units 1 through n each drive their own Processing Element with a separate Instruction Stream, and each Processing Element works on its own Data Stream.]
Parallel Architectures and Performance Analysis – Slide 18

Multiple Program Multiple Data (MPMD) Structure
Within the MIMD classification, which we are concerned with, each processor will have its own program to execute.
Parallel Architectures and Performance Analysis – Slide 19

Single Program Multiple Data (SPMD) Structure
A single source program is written and each processor executes its personal copy of this program, although independently and not in synchronism.
The source program can be constructed so that parts of the program are executed by certain computers and not others, depending upon the identity of the computer.
Software equivalent of SIMD; can perform SIMD calculations on MIMD hardware.
Parallel Architectures and Performance Analysis – Slide 20
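A hedged C/MPI sketch of the SPMD structure (not from the original slides): every process executes the same source program, and the process identity (its rank) selects which part of the program it performs.

    /* SPMD: one source program, many processes; behavior branches on rank. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* identity of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            printf("Rank 0 of %d: executing the coordinator part\n", size);
        else
            printf("Rank %d of %d: executing a worker part\n", rank, size);

        MPI_Finalize();
        return 0;
    }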

Architectures
Vector computers
Shared memory multiprocessors: tightly coupled
Centralized/symmetrical multiprocessor (SMP): UMA
Distributed multiprocessor: NUMA
Distributed memory/message-passing multicomputers: loosely coupled
Asymmetrical vs. symmetrical
Flynn’s Taxonomy
SISD, SIMD, MISD, MIMD (MPMD, SPMD)
Parallel Architectures and Performance Analysis – Slide 21

A sequential algorithm can be evaluated in terms of its execution time, which can be expressed as a function of the size of its input. The execution time of a parallel algorithm depends not only on the input size of the problem but also on the architecture of a parallel computer and the number of available processing elements. Parallel Architectures and Performance Analysis – Slide 22

The speedup factor is a measure that captures the relative benefit of solving a computational problem in parallel. The speedup factor of a parallel computation utilizing p processors is defined as the following ratio:
S(p) = ts / tp, where ts is the execution time using a single processor and tp is the execution time using a multiprocessor with p processors.
In other words, S(p) is defined as the ratio of the sequential processing time to the parallel processing time.
Parallel Architectures and Performance Analysis – Slide 23

The speedup factor can also be cast in terms of computational steps:
S(p) = (number of computational steps using one processor) / (number of parallel computational steps using p processors)
Maximum speedup is (usually) p with p processors (linear speedup).
Parallel Architectures and Performance Analysis – Slide 24

Given a problem of size n on p processors, let
Inherently sequential computations: σ(n)
Potentially parallel computations: φ(n)
Communication operations: κ(n,p)
Then:
S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))
Parallel Architectures and Performance Analysis – Slide 25
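As a feel for the model (hypothetical numbers, not from the slides): with σ(n) = 10, φ(n) = 90 and κ(n,p) = 5 time units on p = 8 processors, S(8) ≤ (10 + 90) / (10 + 90/8 + 5) = 100 / 26.25 ≈ 3.8, well below the linear bound of 8.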

[Figure: speedup plotted against the number of processors, showing the curve bending away from linear speedup ("elbowing out") as the processor count grows.]
Parallel Architectures and Performance Analysis – Slide 26

The efficiency of a parallel computation is defined as the ratio between the speedup factor and the number of processing elements in a parallel system:
E(p) = S(p) / p
Efficiency is a measure of the fraction of time for which a processing element is usefully employed in a computation.
Parallel Architectures and Performance Analysis – Slide 27
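A quick hypothetical illustration (numbers not from the slides): if the sequential run takes ts = 100 seconds and the parallel run on p = 8 processors takes tp = 20 seconds, then S(8) = 100/20 = 5 and E(8) = 5/8 = 0.625; each processor is usefully employed about 62.5% of the time.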

Since E(p) = S(p)/p, by what we did earlier
E(p) ≤ (σ(n) + φ(n)) / (p σ(n) + φ(n) + p κ(n,p))
Since all terms are positive, E(p) > 0.
Furthermore, since the denominator is larger than the numerator, E(p) < 1.
Parallel Architectures and Performance Analysis – Slide 28

Parallel Architectures and Performance Analysis – Slide 29

As before,
S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p)
since the communication time must be non-trivial.
Let f = σ(n) / (σ(n) + φ(n)) represent the inherently sequential portion of the computation; then
S(p) ≤ 1 / (f + (1 − f)/p)
Parallel Architectures and Performance Analysis – Slide 30
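A hedged numeric illustration (values not from the slides): with f = 0.1 and p = 8, Amdahl’s Law gives S(8) ≤ 1 / (0.1 + 0.9/8) ≈ 4.7, and no matter how large p becomes the speedup is bounded by 1/f = 10.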

Limitations
Ignores communication time
Overestimates speedup achievable
Amdahl Effect
Typically κ(n,p) has lower complexity than φ(n)/p
So as n increases, φ(n)/p dominates κ(n,p)
Thus as n increases, speedup increases
Parallel Architectures and Performance Analysis – Slide 31

As before,
S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p)
Let s = σ(n) / (σ(n) + φ(n)/p) represent the fraction of time spent in the parallel computation performing inherently sequential operations; then
σ(n) = s (σ(n) + φ(n)/p) and φ(n)/p = (1 − s)(σ(n) + φ(n)/p)
Parallel Architectures and Performance Analysis – Slide 32

Then
S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p)
     = (s + p(1 − s))(σ(n) + φ(n)/p) / (σ(n) + φ(n)/p)
     = p + (1 − p) s
which is Gustafson-Barsis’ Law for the scaled speedup.
Parallel Architectures and Performance Analysis – Slide 33

Begin with parallel execution time instead of sequential time
Estimate sequential execution time to solve same problem
Problem size is an increasing function of p
Predicts scaled speedup
Parallel Architectures and Performance Analysis – Slide 34
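A hedged numeric illustration (values not from the slides): if s = 0.05 of the parallel execution time is spent on inherently sequential operations and p = 64, Gustafson-Barsis’ Law predicts a scaled speedup of at most 64 + (1 − 64)(0.05) = 64 − 3.15 = 60.85.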

Both Amdahl’s Law and Gustafson-Barsis’ Law ignore communication time
Both overestimate speedup or scaled speedup achievable
[Photos: Gene Amdahl and John L. Gustafson]
Parallel Architectures and Performance Analysis – Slide 35

Performance terms: speedup, efficiency
Model of speedup: serial, parallel and communication components
What prevents linear speedup?
Serial and communication operations
Process start-up
Imbalanced workloads
Architectural limitations
Analyzing parallel performance
Amdahl’s Law
Gustafson-Barsis’ Law
Parallel Architectures and Performance Analysis – Slide 36

Based on original material from
The University of Akron: Tim O’Neil, Kathy Liszka
Hiram College: Irena Lomonosov
The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
Oregon State University: Michael Quinn
Revision history: last updated 7/28/2011.
Parallel Architectures and Performance Analysis – Slide 37