ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines

Two Parallel Architectures Shared memory machines. Distributed memory machines.

Shared Memory: Logical View [Diagram: processors proc1, proc2, proc3, …, procN all connected to a single shared memory space.]

Shared Memory Machines Small number of processors: shared memory with coherent caches (SMP). Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).

SMPs 2- or 4-processor PCs are now commodity hardware. Good price/performance ratio. Memory is sometimes a bottleneck (see later). Typical price (8-node): ~ $20-40k.

Physical Implementation [Diagram: processors proc1 … procN, each with its own cache (cache1 … cacheN), connected by a bus to the shared memory.]

Shared Memory Machines Small number of processors: shared memory with coherent caches (SMP). Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).

CC-NUMA: Physical Implementation [Diagram: nodes proc1 … procN, each with its own cache (cache1 … cacheN) and its own local memory (mem1 … memN), connected by an interconnect.]

Caches in Multiprocessors Suffer from the coherence problem: –the same line appears in two or more caches –one processor writes a word in the line –other processors can now read stale data This leads to the need for a coherence protocol –which avoids coherence problems Many protocols exist; we will look at just a simple one.
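
As a concrete (hypothetical) picture of the stale-data problem, the C sketch below has one thread raise a flag that another thread spins on. Both processors may hold a cached copy of flag; the coherence protocol discussed on the following slides is what propagates the write so that the spinning processor stops seeing the stale value. The variable and function names are invented for this illustration, and volatile is used only to keep the compiler from caching the value in a register (production code would use C11 atomics).

/* Hypothetical sketch of the stale-data scenario (names invented).
 * The setter thread writes flag = 1; the waiter thread spins reading
 * its (possibly cached) copy of flag.  Hardware cache coherence is what
 * ensures the waiter's cached copy is invalidated/updated so the loop
 * terminates. */
#include <pthread.h>
#include <stdio.h>

volatile int flag = 0;        /* shared line, cached by both processors */

void *setter(void *arg)
{
    (void)arg;
    flag = 1;                 /* one processor writes a word in the line */
    return NULL;
}

void *waiter(void *arg)
{
    (void)arg;
    while (!flag)             /* the other processor keeps reading its copy;   */
        ;                     /* without coherence it could read stale data forever */
    printf("saw flag = 1\n");
    return NULL;
}

int main(void)
{
    pthread_t s, w;
    pthread_create(&w, NULL, waiter, NULL);
    pthread_create(&s, NULL, setter, NULL);
    pthread_join(s, NULL);
    pthread_join(w, NULL);
    return 0;
}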

What is coherence? What does it mean for memory to be shared? Intuitively, a read returns the last value written. This notion is not well-defined in a system without a global clock.

The Notion of “last written” in a Multi-processor System [Diagram: separate timelines for processors P0–P3; a write w(x) appears on one processor’s timeline and a read r(x) on another’s.]

The Notion of “last written” in a Single-machine System [Diagram: a single timeline on which the write w(x) is followed by the read r(x).]

Coherence: a Clean Definition Is achieved by referring back to the single machine case. Called sequential consistency.

Sequential Consistency (SC) Memory is sequentially consistent if and only if it behaves “as if” the processors were executing in a time-shared fashion on a single machine.

Returning to our Example [Diagram: the same multiprocessor timelines for P0–P3, with the write w(x) on one processor and the read r(x) on another.]

Another Way of Defining SC All memory references of a single process execute in program order. All writes are globally ordered.

SC: Example 1
P0: w(x,1) w(y,1)
P1: r(x) r(y)
Initial values of x,y are 0. What are possible final values?

SC: Example 2
P0: w(x,1) w(y,1)
P1: r(y) r(x)
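
Example 2 is the interesting case. Written out as a (hypothetical) two-thread C program, sequential consistency forbids one of the four outcomes: because w(x,1) precedes w(y,1) in P0's program order and r(y) precedes r(x) in P1's, any interleaving in which r(y) returns 1 must also have r(x) return 1. The combination r(y)=1, r(x)=0 is therefore impossible, while (0,0), (0,1) and (1,1) are all allowed. The code below is only a sketch of the example, with thread and variable names invented; on real hardware you would also need fences or atomics to actually guarantee SC behaviour.

/* Sketch of SC Example 2 (thread/variable names invented).
 * P0: w(x,1); w(y,1).   P1: r(y); r(x).
 * Under SC, (r(y), r(x)) can be (0,0), (0,1) or (1,1), but never (1,0). */
#include <pthread.h>
#include <stdio.h>

int x = 0, y = 0;             /* initial values are 0, as in Example 1 */
int rx, ry;

void *p0(void *arg)
{
    (void)arg;
    x = 1;                    /* w(x,1) */
    y = 1;                    /* w(y,1) */
    return NULL;
}

void *p1(void *arg)
{
    (void)arg;
    ry = y;                   /* r(y) */
    rx = x;                   /* r(x) */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, p0, NULL);
    pthread_create(&b, NULL, p1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r(y)=%d r(x)=%d\n", ry, rx);
    return 0;
}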

SC: Example 3
P0: w(x,1)
P1: w(y,1)
P2: r(y) r(x)

SC: Example 4
P0: w(x,1)
P1: w(x,2)
P2: r(x)

Implementation There are many ways of implementing SC; in fact, implementations sometimes enforce even stronger conditions. We will look at a simple one: the MSI protocol.

Physical Implementation [Diagram: processors proc1 … procN, each with its own cache (cache1 … cacheN), connected by a bus to the shared memory.]

Fundamental Assumption The bus is a reliable, ordered broadcast bus. –Every message sent by a processor is received by all other processors in the same order. Also called a snooping bus –Processors (or caches) snoop on the bus.

States of a Cache Line Invalid Shared –read-only, one of many cached copies Modified –read-write, sole valid copy

Processor Transactions processor read(x) processor write(x)

Bus Transactions bus read(x) –asks for copy with no intent to modify bus read-exclusive(x) –asks for copy with intent to modify

State Diagram: Steps 0–9 [Diagram, built up one transition at a time; the completed version is reproduced here.] States: I (Invalid), S (Shared), M (Modified). Edge labels are written as observed event / resulting bus action, with “-” meaning no bus action.
Processor-initiated transitions:
–I → S on PrRd/BuRd (read miss: fetch a shared copy)
–I → M and S → M on PrWr/BuRdX (write: obtain an exclusive copy)
–S → S on PrRd/- (read hit, no bus traffic)
–M → M on PrRd/- and PrWr/- (hits in Modified, no bus traffic)
Bus-initiated (snooped) transitions:
–M → S on BuRd/Flush (another cache reads: supply the line)
–S → S on BuRd/- (another cache reads: no action needed)
–S → I on BuRdX/- (another cache writes: invalidate)
–M → I on BuRdX/Flush (another cache writes: supply the line and invalidate)
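
To make the transitions concrete, here is a minimal C sketch of the protocol as a per-line state machine. The enum and function names are invented for this illustration, and the data movement implied by Flush is only reported as a bus action rather than performed; a real controller does much more.

/* Minimal sketch of the per-line MSI state machine (names invented). */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } line_state;
typedef enum { BUS_NONE, BUS_RD, BUS_RDX, BUS_FLUSH } bus_action;

/* Transition taken when the local processor reads (is_write=0) or
 * writes (is_write=1) the line. */
line_state on_processor_event(line_state s, int is_write, bus_action *out)
{
    *out = BUS_NONE;
    if (is_write) {
        if (s != MODIFIED)
            *out = BUS_RDX;          /* PrWr from I or S: issue BuRdX */
        return MODIFIED;             /* every write ends in M         */
    }
    if (s == INVALID) {
        *out = BUS_RD;               /* PrRd miss: issue BuRd, go to S */
        return SHARED;
    }
    return s;                        /* PrRd hit in S or M: no bus action */
}

/* Transition taken when a request from another cache is snooped on the
 * bus (is_rdx=0 for BuRd, is_rdx=1 for BuRdX). */
line_state on_bus_event(line_state s, int is_rdx, bus_action *out)
{
    *out = (s == MODIFIED) ? BUS_FLUSH : BUS_NONE;  /* M supplies the data */
    if (is_rdx)
        return INVALID;                   /* BuRdX invalidates S and M        */
    return (s == MODIFIED) ? SHARED : s;  /* BuRd: M -> S, others unchanged   */
}

int main(void)
{
    bus_action act;
    line_state s = INVALID;
    s = on_processor_event(s, 0, &act);   /* PrRd:  I -> S, BuRd  */
    s = on_processor_event(s, 1, &act);   /* PrWr:  S -> M, BuRdX */
    s = on_bus_event(s, 0, &act);         /* BuRd:  M -> S, Flush */
    s = on_bus_event(s, 1, &act);         /* BuRdX: S -> I        */
    printf("final state = %d\n", s);      /* prints 0 (INVALID)   */
    return 0;
}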

In Reality Most machines use a slightly more complicated protocol (4 states instead of 3). See architecture books (MESI protocol).

Problem: False Sharing Occurs when two or more processors access different data in the same cache line, and at least one of them writes. Leads to a ping-pong effect.

False Sharing: Example (1 of 3) for( i=0; i<n; i++ ) a[i] = b[i]; Let’s assume we parallelize the code: –p = 2 –an element of a takes 4 words –a cache line has 32 words

False Sharing: Example (2 of 3) [Diagram: a single cache line holding a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7]; elements written by processor 0 and elements written by processor 1 lie in the same cache line.]

False Sharing: Example (3 of 3) [Diagram: timelines for P0 and P1; P0 writes a[0], a[2], a[4], … while P1 writes a[1], a[3], a[5], …; each write invalidates (inv) the other processor’s copy of the line, which then has to be re-fetched (data), so the line ping-pongs between the two caches.]
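
A sketch of a parallelization that produces this behaviour is shown below, with thread code and array sizes invented for the illustration: each of the p=2 threads writes every other element of a, so elements written by the two processors share cache lines and each write invalidates the other processor’s copy. A block distribution (thread 0 takes the first half of the array, thread 1 the second) would confine the false sharing to the one line that straddles the boundary.

/* Sketch of the falsely shared loop (names and sizes illustrative). */
#include <pthread.h>

#define N 1024
static double a[N], b[N];

void *copy_cyclic(void *arg)
{
    int tid = *(int *)arg;            /* 0 or 1: which thread this is      */
    /* cyclic split: thread 0 writes even i, thread 1 writes odd i         */
    for (int i = tid; i < N; i += 2)
        a[i] = b[i];                  /* each write invalidates the other
                                         thread's copy of the cache line   */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int tid[2] = { 0, 1 };
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, copy_cyclic, &tid[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}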

Summary Sequential consistency. Bus-based coherence protocols. False sharing.