Computer Science and Engineering Copyright by Hesham El-Rewini Advanced Computer Architecture CSE 8383 February 28 2008 Session 7.


Advanced Computer Architecture CSE 8383, February 28, 2008, Session 7

Contents: Performance Evaluation (cont.); Shared Memory Systems; Cache Coherence Protocols

Speedup: S = Speed(new) / Speed(old) = (Work/time(new)) / (Work/time(old)) = time(old) / time(new) = time(before improvement) / time(after improvement)

Speedup: time on one CPU: T(1); time on n CPUs: T(n); speedup S = T(1)/T(n)
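The definition above can be sketched directly; a minimal example (the 60 s and 20 s timings are made up for illustration):

```python
# Speedup compares execution time before and after an improvement.
def speedup(t_old, t_new):
    """S = time(old) / time(new); with n CPUs, S = T(1) / T(n)."""
    return t_old / t_new

# e.g. a job taking 60 s on one CPU and 20 s on four CPUs:
print(speedup(60.0, 20.0))  # 3.0
```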

Two Important Laws Have Influenced Parallel Computing

Argument Against Massively Parallel Processing (Gene Amdahl, 1967): "For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution... The nature of this overhead (in parallelism) appears to be sequential so that it is unlikely to be amenable to parallel processing techniques. Overhead alone would then place an upper limit on throughput of five to seven times the sequential processing rate, even if the housekeeping were done in a separate processor... At any point in time it is difficult to foresee how the previous bottlenecks in a sequential computer will be effectively overcome."

What does that mean? The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. The unparallelizable part of the code severely limits the speedup.

Trip Analogy: from A to B, a 20-hour stretch must be walked; the remaining 200 miles can be covered by: Walk, 4 miles/hour; Bike, 10 miles/hour; Car-1, 50 miles/hour; Car-2, 120 miles/hour; Car-3, 600 miles/hour.

Speedup Analysis: Walk (4 miles/hour): Time = 70 hours. Bike (10 miles/hour): Time = 40 hours, S = 1.8. Car-1 (50 miles/hour): Time = 24 hours, S = 2.9. Car-2 (120 miles/hour): Time = 21.7 hours, S = 3.2. Car-3 (600 miles/hour): Time = 20.3 hours, S = 3.4.

Amdahl's Law: S = T(1)/T(N), where T(N) = T(1)·α + T(1)·(1-α)/N. Hence S = 1 / (α + (1-α)/N) = N / (N·α + (1-α)). α: the fraction of the program that is naturally serial; (1-α): the fraction of the program that is naturally parallel.
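A small numeric sketch of the formula above (the α and N values are illustrative):

```python
# Amdahl's law: the serial fraction alpha caps the speedup on N processors.
def amdahl_speedup(alpha, n):
    """S = 1 / (alpha + (1 - alpha)/n) = n / (n*alpha + (1 - alpha))."""
    return 1.0 / (alpha + (1.0 - alpha) / n)

# Even with 1000 CPUs, a 10% serial fraction keeps the speedup below 10:
print(round(amdahl_speedup(0.10, 1000), 2))  # 9.91
```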

Amdahl's Law [chart: speedup vs. % serial (10%-99%), curves for 4 CPUs, 16 CPUs, and 1000 CPUs]

Gustafson-Barsis Law (1988): won the Gordon Bell Prize; overcame the conceptual barrier established by Amdahl's law by scaling the problem to the size of the parallel system (no fixed-size problem). With α the fraction of the program that is naturally serial and the parallel time normalized to T(N) = 1: T(1) = α + (1-α)·N, so S = α + (1-α)·N = N - (N-1)·α.
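The scaled-speedup formula can be sketched the same way (illustrative values):

```python
# Gustafson-Barsis: scale the problem with the machine (T(N) normalized to 1).
def gustafson_speedup(alpha, n):
    """S = alpha + (1 - alpha)*n = n - (n - 1)*alpha."""
    return n - (n - 1) * alpha

# The same 10% serial fraction now allows large scaled speedup on 1000 CPUs:
print(round(gustafson_speedup(0.10, 1000), 1))  # 900.1
```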

Amdahl vs. Gustafson-Barsis [chart: speedup vs. % serial (10%-99%), comparing the Gustafson-Barsis and Amdahl curves]

Data Parallelism, Scale-Up: parallelism is in the data, not in the control portion of the application; the problem size scales up to the size of the system. Data parallelism is to the 1990s what vector parallelism was to the 1970s. Supercomputing has come to mean data-parallel computing.

Problem: Assume that a switching component such as a transistor can switch in zero time. We propose to construct a disk-shaped computer chip with such components. The only limitation is the time it takes to send electronic signals from one edge of the chip to the other. Make the simplifying assumption that electronic signals travel at 300,000 kilometers per second. What must the diameter of a round chip be so that it can switch 10^9 times per second? What would the diameter be if the switching requirement were ___ times per second?
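A worked sketch of the first part of the problem (10^9 switches per second): the signal must be able to cross one diameter within one switching period.

```python
# Diameter bound: the chip can switch only as fast as a signal can cross it.
c = 3.0e8          # signal speed in m/s (300,000 km/s)
f = 1.0e9          # required switches per second
period = 1.0 / f   # time available for the signal to cross, in seconds
diameter = c / f   # meters the signal travels in one period
print(diameter)    # 0.3, i.e. a 30 cm chip
```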

MIMD Shared Memory Systems [diagram: processors (P) connected through an interconnection network to shared memory modules (M)]

Shared Memory: single address space; communication via reads and writes; synchronization via locks.

Classification: Uniform Memory Access (UMA); Non-Uniform Memory Access (NUMA); Cache-Only Memory Architecture (COMA). [diagram: processors P1 and P2 sharing a multi-port memory M]

Uniform Memory Access (UMA) [diagram: processors with caches (P/C) sharing memory modules (M) over a common bus]

Non-Uniform Memory Access (NUMA) [diagram: processor-memory pairs (P/M) connected by an interconnection network]

Cache-Only Memory Architecture (COMA) [diagram: nodes, each with a processor (P), cache (C), and data memory (D), connected by an interconnection network]

Bus-Based and Switch-Based Shared Memory Systems [diagrams: processors with caches (P/C) sharing a global memory over a bus; processors with caches connected to memory modules (M) through a switch]

Bus-Based Shared Memory [diagram: processors (P) sharing a global memory over a bus]: a bus is a collection of wires and connectors that carries only one transaction at a time, so it becomes a bottleneck. How can we solve the problem?

Using Caches [diagram: processors P1 ... Pn with caches C1 ... Cn on a bus to global memory]: caches reduce bus traffic but introduce the cache coherence problem. How many processors can the bus support?

Group Activity: Variables: number of processors (n), hit rate (h), bus bandwidth (B), processor speed (V). What is the maximum number of processors n?

Group Activity
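One plausible way to attack the group activity, under a saturation model that is an assumption here (it is not spelled out on the slide): each of the n processors issues V memory references per second, a fraction (1 - h) of which miss in the cache and must use the bus, which can serve B references per second; the bus saturates when n(1 - h)V = B.

```python
# Bus saturation bound (assumed model, not from the slide):
# n processors each issue V refs/s; misses (1 - h) go to a bus of bandwidth B.
def max_processors(bandwidth_b, hit_rate_h, speed_v):
    """Largest n with n * (1 - h) * V <= B, i.e. n = B / ((1 - h) * V)."""
    return bandwidth_b / ((1.0 - hit_rate_h) * speed_v)

# e.g. B = 10^9 refs/s, h = 0.95, V = 10^8 refs/s per processor:
print(round(max_processors(1e9, 0.95, 1e8)))  # 200
```

Note how strongly the bound depends on the hit rate: raising h from 0.95 to 0.99 quintuples the number of processors the same bus can carry.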

Single-Processor Caching [diagram: processor P with a cache and memory, each holding x]: Hit: the data is in the cache. Miss: the data is not in the cache. Hit rate: h; miss rate: m = (1 - h).

Cache Coherence Policies: writing to the cache in the one-processor case: Write Through; Write Back.

Writing in the Cache: Before: cache and memory both hold x. Write through: the processor writes x'; the cache and memory both hold x'. Write back: the processor writes x'; the cache holds x' while memory still holds the old x.

Cache Coherence [diagram: copies of x in the caches of P1, P3, ..., Pn and in memory]: multiple copies of x exist. What if P1 updates x?

Cache Coherence Policies: writing to the cache in the n-processor case: Write Update vs. Write Invalidate, each combined with Write Through or Write Back.

Write Invalidate: Before: P1 and P3 cache x; memory holds x. With write through: P1 writes x'; P3's copy is invalidated (I); memory is updated to x'. With write back: P1 writes x'; P3's copy is invalidated; memory still holds x.

Write Update: Before: P1 and P3 cache x; memory holds x. With write through: P1 writes x'; P3's copy is updated to x'; memory is updated to x'. With write back: P1 writes x'; P3's copy is updated to x'; memory still holds x.

Snooping Protocols: snooping protocols are based on watching bus activities and carrying out the appropriate coherence commands when necessary. Global memory is moved in blocks, and each block has a state associated with it, which determines what happens to the entire contents of the block. The state of a block may change as a result of the operations Read-Miss, Read-Hit, Write-Miss, and Write-Hit.

Write Invalidate, Write Through: multiple processors can read block copies from main memory safely until one processor updates its copy. At that time, all other cache copies are invalidated and the memory is updated to remain consistent.

Write Through - Write Invalidate (cont.)
State: Description
VALID: the copy is consistent with global memory.
INV: the copy is inconsistent.

Write Through - Write Invalidate (cont.)
Event: Actions
Read Hit: use the local copy from the cache.
Read Miss: fetch a copy from global memory; set the state of this copy to Valid.
Write Hit: perform the write locally; broadcast an Invalid command to all caches; update the global memory.
Write Miss: get a copy from global memory; broadcast an Invalid command to all caches; update the global memory; update the local copy and set its state to Valid.
Replace: since memory is always consistent, no write back is needed when a block is replaced.

Example 1 [diagram: processors P and Q, each with a cache; memory holds X = 5]: P reads X; Q reads X; Q updates X; Q reads X; Q updates X; P updates X; Q reads X.

Complete the table (write through - write invalidate):
Event | Memory X | P's cache: X, State | Q's cache: X, State
0 Original value | 5 | -, - | -, -
1 P reads X (Read Miss) | 5 | 5, VALID | -, -
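The protocol actions in the table can be traced with a small sketch. This follows the first three events of Example 1; the value 10 written by Q is made up for illustration.

```python
# Toy model of one memory block X under write-through / write-invalidate.
INV, VALID = "INV", "VALID"

class WriteThroughInvalidate:
    def __init__(self, n_caches, mem):
        self.mem = mem                                   # global memory copy of X
        self.caches = [{"state": INV, "x": None} for _ in range(n_caches)]

    def read(self, p):
        c = self.caches[p]
        if c["state"] == INV:                            # Read Miss: fetch from memory
            c["x"], c["state"] = self.mem, VALID
        return c["x"]                                    # Read Hit: use the local copy

    def write(self, p, value):
        for i, c in enumerate(self.caches):              # broadcast Invalid command
            if i != p:
                c["state"], c["x"] = INV, None
        self.caches[p] = {"state": VALID, "x": value}    # local copy stays Valid
        self.mem = value                                 # write through to memory

smp = WriteThroughInvalidate(2, mem=5)   # P is cache 0, Q is cache 1, X = 5
P, Q = 0, 1
smp.read(P)                              # P reads X: miss, P caches 5 (VALID)
smp.read(Q)                              # Q reads X: miss, Q caches 5 (VALID)
smp.write(Q, 10)                         # Q updates X: P invalidated, memory = 10
print(smp.mem, smp.caches[P]["state"], smp.caches[Q]["x"])   # 10 INV 10
```

Because every write goes through to memory, a later replacement never needs a write back, which is exactly the Replace row of the table above.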