Embedded Computer Architecture 5KK73: Going Multi-Core. Henk Corporaal, TU Eindhoven, December 2014 / January 2015

Trends: the number of transistors still follows Moore's law, but clock frequency and performance per core no longer do. [Figure: technology scaling trends; annotations: Core i7, 3 GHz, 100 W, "Crisis?"]

Multi-core [Figure: multi-core examples; ECA 5KK73, H. Corporaal and B. Mesman]

Topics
– Communication models of parallel systems:
  – communicating via shared memory
  – communicating via message passing
– The challenge of parallel processing
– Shared-memory issues:
  1. the coherence problem
  2. the consistency problem
  3. synchronization

Parallel Architecture
A parallel architecture extends traditional computer architecture with a communication network:
– abstractions (HW/SW interface)
– an organizational structure to realize the abstraction efficiently
[Figure: processing nodes connected by a communication network]

Communication models: Shared Memory
Processes P1 and P2 communicate by reading and writing a shared memory. This raises three issues: the coherence problem, the memory consistency issue, and the synchronization problem.
[Figure: P1 and P2 performing (read, write) operations on a shared memory]

Communication models: Shared memory
– Shared address space
– Communication primitives: load, store, atomic swap
Two varieties:
– physically shared => Symmetric Multi-Processors (SMP), usually combined with local caching
– physically distributed => Distributed Shared Memory (DSM)

SMP: Symmetric Multi-Processor
Memory: centralized, with uniform memory access time (UMA), a bus interconnect, and shared I/O.
Examples: Sun Enterprise 6000, SGI Challenge, Intel SMPs.
[Figure: processors, each with one or more cache levels, connected to main memory and the I/O system; the interconnect can be 1 bus, N busses, or any network]

DSM: Distributed Shared Memory
Memory is physically distributed over the nodes, giving non-uniform memory access time (NUMA), with a scalable interconnect.
[Figure: nodes, each with a processor, cache, and local memory, connected by an interconnection network (e.g. a shared bus), plus I/O system]

Shared Address Model Summary
– Each processor can address every physical location in the machine
– Each process can address all data it shares with other processes
– Data transfer via load and store
– Data size: byte, word, ..., or cache blocks
– The memory hierarchy model applies: communication moves data into the local processor cache

Communication models: Message Passing
– Communication primitives: e.g. send and receive library calls; standard: MPI, the Message Passing Interface
– Note that message passing can be built on top of shared memory, and vice versa!
[Figure: process P1 sends to process P2 through a FIFO]

Message Passing Model
– Explicit message send and receive operations
– A send specifies the local buffer plus the receiving process on the remote computer
– A receive specifies the sending process on the remote computer plus the local buffer into which the data is placed
– Typically blocking communication, but DMA may be used
Message structure: header, data, trailer
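As a concrete illustration, here is a minimal sketch of blocking send/receive in MPI, matching the model above (the tag and payload are arbitrary illustrative choices, not from the slides):

/* Minimal MPI sketch: process 0 sends an integer to process 1.
   Compile with mpicc, run with: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;                     /* local send buffer */
        MPI_Send(&value, 1, MPI_INT, 1 /*dest*/, 0 /*tag*/, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocks until the message from process 0 has arrived */
        MPI_Recv(&value, 1, MPI_INT, 0 /*src*/, 0 /*tag*/, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}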

Message passing communication
[Figure: four nodes, each containing a processor, cache, memory, and DMA engine, attached through a network interface to an interconnection network]

Communication Models: Comparison
Shared memory (used by e.g. OpenMP):
– compatibility with well-understood (language) mechanisms
– ease of programming for complex or dynamic communication patterns
– suits applications that share large data structures
– efficient for small items
– supports hardware caching
Message passing (used by e.g. MPI):
– simpler hardware
– explicit communication
– implicit synchronization (with every communication)

Challenges of parallel processing
Q1: Can we get linear speedup? Suppose we want a speedup of 80 with 100 processors. What fraction of the original computation may be sequential (i.e. non-parallel)? Answer: f_seq ≈ 0.25%.
Q2: How important is communication latency? Suppose 0.2% of all accesses are remote and require 100 cycles, on a processor with base CPI = 0.5. What is the communication impact?
Note: study Amdahl's and Gustafson's laws.
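Both questions follow directly from Amdahl's law; a worked derivation (treating the remote-access penalty in Q2 as extra cycles per instruction):

\[
\text{Q1:}\quad \frac{1}{f_{seq} + \frac{1 - f_{seq}}{100}} = 80
\;\Rightarrow\; f_{seq} + \frac{1 - f_{seq}}{100} = \frac{1}{80} = 0.0125
\;\Rightarrow\; f_{seq} = \frac{0.0125 - 0.01}{0.99} \approx 0.0025 = 0.25\%
\]

\[
\text{Q2:}\quad CPI_{eff} = 0.5 + 0.002 \times 100 = 0.7
\;\Rightarrow\; \text{slowdown} = \frac{0.7}{0.5} = 1.4\times
\]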

Network: Performance metrics
Network bandwidth:
– need high bandwidth for communication
– how does it scale with the number of nodes?
Communication latency:
– affects performance, since the processor may have to wait
– affects ease of programming, since overlapping communication and computation requires more thought
How can a mechanism help hide latency?
– overlap message send with computation
– prefetch data
– switch to another task or thread
– pipeline tasks, or pipeline iterations
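In the message-passing model, MPI's nonblocking operations are one such latency-hiding mechanism; a sketch under the assumption that the computation does not touch the send buffer (function and variable names are illustrative):

/* Latency hiding with nonblocking MPI: start the send, compute, then wait. */
#include <mpi.h>

void send_and_compute(double *buf, int n, int dest, double *local, int m) {
    MPI_Request req;
    /* start the transfer; the call returns immediately */
    MPI_Isend(buf, n, MPI_DOUBLE, dest, 0 /*tag*/, MPI_COMM_WORLD, &req);
    /* useful work overlaps with the transfer */
    for (int i = 0; i < m; i++)
        local[i] *= 2.0;
    /* block only when completion of the send actually matters */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}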

Three fundamental issues for shared-memory multiprocessors
1. Coherence: do I see the most recent data?
2. Consistency: when do I see a written value? E.g., do different processors see writes at the same time (w.r.t. other memory accesses)?
3. Synchronization: how do we synchronize processes? How do we protect access to shared data?

Coherence problem in a single-CPU system
[Figure: a CPU with a cache holding copies a' and b' of memory locations a and b, plus I/O.
1) The CPU writes to a: the cached copy a' is updated, but memory still holds the old a.
2) I/O writes b in memory: the cached copy b' becomes stale.
In both cases cache and memory are no longer coherent.]

Coherence problem in a multiprocessor system
[Figure: CPU-1 holds cached copies a' and b', and CPU-2 holds cached copies a'' and b'', of the same memory locations a and b; a write by either CPU leaves the other CPU's copies stale.]

What does coherency mean?
Informally: "any read must return the value of the most recent write (to the same address)". This is too strict and too difficult to implement.
Better:
– A write followed by a read by the same processor P, with no writes in between, returns the value written.
– "Any write must eventually be seen by a read": if P writes X and P' reads X, then P' will see the value written by P if the read and write are sufficiently separated in time.
– Writes to the same location by different processors are seen in the same order by all processors ("serialization"). Suppose P1 writes location X, followed by P2 also writing X. Without serialization, some processors would read the value written by P1 and others the value written by P2; serialization guarantees that all processors see the same sequence of writes.

Two rules to ensure coherency
– "If P1 writes x and P2 reads it, P1's write will be seen by P2 if the read and write are sufficiently far apart."
– Writes to a single location are serialized: they are seen in one order by all processors. The 'latest' write will be seen; otherwise processors could see writes in an illogical order (an older value after a newer one).

Potential HW coherency solutions
Snooping solution (snoopy bus):
– send all requests for data to all processors (i.e. their local caches)
– processors snoop the bus to see if they have a copy, and respond accordingly
– requires broadcast, since caching information resides at the processors
– works well with a bus (a natural broadcast medium)
– dominates for small-scale machines (most of the market)
Directory-based schemes:
– keep track of what is being shared in one centralized place, the directory
– distributed memory => distributed directory, for scalability (avoids bottlenecks and hot spots)
– scales better than snooping
– actually existed BEFORE snooping-based schemes

Example snooping protocol
– 3 states for each cache line: invalid, shared (read-only), modified (also called exclusive: you may write it)
– one FSM per cache, receiving requests from both the processor and the bus
[Figure: four processors with caches on a shared bus, plus main memory and I/O system]

Snooping protocol: write invalidate
– Get exclusive access to a cache block (invalidate all other copies) before writing it.
– When a processor reads an invalid cache block, it is forced to fetch a new copy.
– If two processors attempt to write simultaneously, one of them goes first (bus arbitration). The other must obtain a new copy, thereby enforcing serialization.

Example: address X in memory initially contains the value 0.

Processor activity    | Bus activity        | Cache CPU A | Cache CPU B | Memory addr. X
(initial)             |                     |             |             | 0
CPU A reads X         | cache miss for X    | 0           |             | 0
CPU B reads X         | cache miss for X    | 0           | 0           | 0
CPU A writes 1 to X   | invalidation for X  | 1           | invalidated | 0
CPU B reads X         | cache miss for X    | 1           | 1           | 1

Basics of write invalidate
– Use the bus to perform invalidates: to invalidate, acquire bus access and broadcast the address to be invalidated. All processors snoop the bus, listening for addresses; if the address is in my cache, I invalidate my copy.
– Serialization of bus access enforces write serialization.
Where is the most recent value?
– Easy for write-through caches: in memory.
– For write-back caches, again use snooping.
Cache tags can be used to implement snooping:
– this might interfere with cache accesses coming from the CPU
– therefore duplicate the tags, or employ a multilevel cache with inclusion
[Figure: core i with processor and cache on the bus, plus memory]

Snoopy-cache state machine I: CPU requests, for each cache block
States: invalid, shared (read-only), exclusive (read/write). Transitions:
– invalid, CPU read: place read miss on bus; go to shared
– invalid, CPU write: place write miss on bus; go to exclusive
– shared, CPU read hit: no action
– shared, CPU read miss: place read miss on bus; stay in shared
– shared, CPU write: place write miss on bus; go to exclusive
– exclusive, CPU read hit or write hit: no action
– exclusive, CPU read miss: write back block, place read miss on bus; go to shared
– exclusive, CPU write miss: write back cache block, place write miss on bus; stay in exclusive

Snoopy-cache state machine II: bus requests, for each cache block
– shared, write miss for this block: invalidate the copy; go to invalid
– exclusive, read miss for this block: write back block (abort memory access); go to shared
– exclusive, write miss for this block: write back block (abort memory access); go to invalid

Responses to events caused by the processor

Event       | State of block in cache | Action
Read hit    | shared or exclusive     | read data from cache
Read miss   | invalid                 | place read miss on bus
Read miss   | shared                  | wrong block (conflict miss): place read miss on bus
Read miss   | exclusive               | conflict miss: write back block, then place read miss on bus
Write hit   | exclusive               | write data in cache
Write hit   | shared                  | place write miss on bus (invalidates all other copies)
Write miss  | invalid                 | place write miss on bus
Write miss  | shared                  | conflict miss: place write miss on bus
Write miss  | exclusive               | conflict miss: write back block, then place write miss on bus

Responses to events on the bus

Event       | State of addressed cache block | Action
Read miss   | shared    | no action: memory services the read miss
Read miss   | exclusive | attempt to share data: place block on bus and change state to shared
Write miss  | shared    | another processor attempts to write: invalidate my block
Write miss  | exclusive | another processor attempts to write "my" block: write back the block and invalidate it
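The processor-side table can be captured in a small transition function; a simplified sketch in C (the enum names and Action struct are illustrative modeling choices, not hardware):

/* Simplified MSI snooping protocol: processor-side transitions only. */
#include <stdbool.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } State;
typedef enum { NONE, READ_MISS, WRITE_MISS } BusMsg;

typedef struct {
    State next;        /* next state of the block */
    BusMsg bus;        /* message to place on the bus */
    bool write_back;   /* must the victim block be written back first? */
} Action;

Action on_cpu_read(State s, bool hit) {
    if (hit)  /* shared or exclusive: read data from cache */
        return (Action){ s, NONE, false };
    /* read miss: an exclusive victim (conflict miss) is written back first */
    return (Action){ SHARED, READ_MISS, s == EXCLUSIVE };
}

Action on_cpu_write(State s, bool hit) {
    if (hit && s == EXCLUSIVE)  /* write hit in exclusive: write in cache */
        return (Action){ EXCLUSIVE, NONE, false };
    /* shared hit or any miss: gain ownership by placing a write miss on
       the bus; an exclusive victim (conflict miss) is written back first */
    return (Action){ EXCLUSIVE, WRITE_MISS, !hit && s == EXCLUSIVE };
}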

Three fundamental issues for shared-memory multiprocessors (recap)
– Coherence: do I see the most recent data?
– Synchronization: how do we synchronize processes? How do we protect access to shared data?
– Consistency: when do I see a written value? E.g., do different processors see writes at the same time (w.r.t. other memory accesses)?

What is the synchronization problem?
Assume the computer system of a bank has a credit process (P_c) and a debit process (P_d), sharing int balance, each with a private int amount:

/* Process P_c */        /* Process P_d */
balance += amount;       balance -= amount;

lw  $t0, balance         lw  $t2, balance
lw  $t1, amount          lw  $t3, amount
add $t0, $t0, $t1        sub $t2, $t2, $t3
sw  $t0, balance         sw  $t2, balance

If the two load/store sequences interleave, one of the two updates can be lost: a race condition on balance.

Critical section problem
– n processes all compete to use some shared data.
– Each process has a code segment, called the critical section, in which the shared data is accessed.
– Problem: ensure that when one process is executing in its critical section, no other process is allowed to execute in its critical section. A lock-based sketch follows below.

Structure of each process:

while (TRUE) {
    entry_section();
    critical_section();
    exit_section();
    remainder_section();
}
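A minimal sketch of the bank example made safe with a lock: the entry and exit sections become lock and unlock. The use of POSIX mutexes here is an illustrative assumption, not part of the slides:

/* Credit/debit race fixed with a mutex: the balance update is the
   critical section; lock/unlock are the entry/exit sections. */
#include <pthread.h>

int balance = 0;                                  /* shared data */
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void credit(int amount) {
    pthread_mutex_lock(&m);      /* entry_section */
    balance += amount;           /* critical_section */
    pthread_mutex_unlock(&m);    /* exit_section */
}

void debit(int amount) {
    pthread_mutex_lock(&m);
    balance -= amount;
    pthread_mutex_unlock(&m);
}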

HW support for synchronization
Atomic instructions to fetch and update memory:
– Atomic exchange: interchange a value in a register with a value in memory (0 => the synchronization variable is free; 1 => it is locked and unavailable)
– Test-and-set: test a value and set it if the value passes the test (similar: compare-and-swap)
– Fetch-and-increment: return the value of a memory location and atomically increment it (0 => the synchronization variable is free)
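For illustration, a spinlock built on atomic exchange, following the 0 = free / 1 = locked convention above; this sketch uses C11 atomics as a stand-in for the hardware primitive, and real implementations add backoff and fairness:

/* Spinlock via atomic exchange (C11): 0 = free, 1 = locked. */
#include <stdatomic.h>

atomic_int lock_var = 0;

void acquire(atomic_int *l) {
    /* atomically swap in 1; if the old value was 1, someone else
       holds the lock, so keep spinning */
    while (atomic_exchange(l, 1) == 1)
        ;  /* busy-wait */
}

void release(atomic_int *l) {
    atomic_store(l, 0);  /* mark the lock free again */
}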

Three fundamental issues for shared-memory multiprocessors (recap)
– Coherence: do I see the most recent data?
– Synchronization: how do we synchronize processes? How do we protect access to shared data?
– Consistency: when do I see a written value? E.g., do different processors see writes at the same time (w.r.t. other memory accesses)?

Memory consistency: the problem

Process P1:             Process P2:
A = 0;                  B = 0;
...                     ...
A = 1;                  B = 1;
L1: if (B == 0) ...     L2: if (A == 0) ...

Observation: if writes take effect immediately (are immediately seen by all processors), it is impossible for both if-statements to evaluate to true.
But what if the write invalidate is delayed? Should this be allowed, and if so, under what conditions?

Sequential consistency: definition
Lamport (1979): a multiprocessor is sequentially consistent if the result of any execution is the same as if the (memory) operations of all processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program.
This means that all processors 'see' all loads and stores happening in the same order!
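The example from the previous slide, expressed with C11 atomics (<threads.h>, where available); with the default memory_order_seq_cst the outcome r1 == 0 && r2 == 0 is impossible, exactly as sequential consistency demands. The scaffolding is an illustrative sketch:

/* Dekker-style litmus test under sequential consistency (C11). */
#include <stdatomic.h>
#include <threads.h>

atomic_int A = 0, B = 0;
int r1, r2;

int p1(void *arg) {
    atomic_store(&A, 1);      /* seq_cst by default */
    r1 = atomic_load(&B);     /* L1: if (B == 0) ... */
    return 0;
}

int p2(void *arg) {
    atomic_store(&B, 1);
    r2 = atomic_load(&A);     /* L2: if (A == 0) ... */
    return 0;
}

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, p1, NULL);
    thrd_create(&t2, p2, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    /* every interleaving leaves at least one of r1, r2 equal to 1 */
    return 0;
}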

How to implement sequential consistency
– Delay the completion of any memory access until all invalidations caused by that access have completed.
– Delay the next memory access until the previous one has completed: in the example, delay the reads of A and B (the tests A == 0 and B == 0) until the writes (A = 1, B = 1) have finished.
Note: under sequential consistency, we cannot place the (local) write in a write buffer and continue.

Is sequential consistency overkill? Schemes exist for faster execution than sequential consistency.
Observation: most programs are synchronized. A program is synchronized if all accesses to shared data are ordered by synchronization operations. Example:

P1:                        P2:
write(x)
...
release(s) {unlock}
                           acquire(s) {lock}
                           ...
                           read(x)

The release/acquire pair orders P1's write to x before P2's read of x.

Cost of sequential consistency (SC)
Enforcing SC can be quite expensive. Assume:
– a write miss takes 40 cycles to get ownership,
– 10 cycles to issue each invalidate, and
– 50 cycles for an invalidate to complete and be acknowledged.
If 4 processors share a cache block, how long does a write miss take for the writing processor if it is sequentially consistent?
Waiting for the invalidates, each write costs:
40 (ownership) + 4 x 10 (issuing the invalidates) + 50 (completion and acknowledgement) = 130 cycles. Very long!
Solutions:
– exploit latency-hiding techniques
– employ relaxed consistency

Relaxed memory consistency models
Key idea: (partially) allow reads and writes to complete out of order. Orderings that can be relaxed:
– relax W→R: allows reads to bypass earlier writes (to different memory locations); called processor consistency or total store ordering
– relax W→W: allows writes to bypass earlier writes; called partial store order
– relax R→W and R→R: weak ordering, release consistency; e.g. Alpha, PowerPC
Note: sequential consistency means satisfying all orderings: W→R, W→W, R→W, and R→R.
A code sketch of acquire/release ordering follows below.
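In the synchronized-program style shown earlier, a relaxed model only needs ordering at the synchronization points; a C11 sketch using release/acquire (the data and ready names are illustrative):

/* Release/acquire ordering (C11): only the synchronization pair is
   ordered; unrelated accesses may be reordered by hardware or compiler. */
#include <stdatomic.h>

int data;                  /* ordinary shared data */
atomic_int ready = 0;      /* synchronization variable s */

void producer(void) {      /* P1: write(x) ... release(s) */
    data = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

void consumer(void) {      /* P2: acquire(s) ... read(x) */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;  /* spin until released */
    /* the acquire synchronizes with the release, so data == 42 here */
    int x = data;
    (void)x;
}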

What have you learned so far?

MPSoCs integrate multiple processors, often in hybrid architectures.
Two communication models:
– message passing
– shared memory
Both models have their pros and cons; huge systems can be locally shared memory and globally message passing.
The shared memory model has 3 fundamental issues:
– synchronization => atomic operations needed
– coherence => cache coherence protocols: snooping or directory-based
– consistency => from sequential (strong) consistency to weak consistency models