Advanced Computer Architecture 5MD00 / 5Z032 Multi-Processing
Henk Corporaal, TU Eindhoven, 2007

Topics
Why parallel processors
Communication models
Challenge of parallel processing
Coherence problem
Consistency problem
Synchronization
Fundamental design issues
Interconnection networks
Book: Chapter 4, appendix E, H
4/12/2019 ACA H.Corporaal

Which parallelism are we talking about
Classification: Flynn categories
SISD (Single Instruction Single Data)
  Uniprocessors
MISD (Multiple Instruction Single Data)
  Systolic arrays / stream based processing
SIMD (Single Instruction Multiple Data = DLP)
  Examples: Illiac-IV, CM-2 (Thinking Machines), Xetal (Philips), Imagine (Stanford), vector machines, Cell architecture (Sony)
  Simple programming model
  Low overhead
  Now applied as sub-word parallelism
MIMD (Multiple Instruction Multiple Data)
  Examples: Sun Enterprise 5000, Cray T3D, SGI Origin, multi-core Pentiums, and many more
  NoCs (Networks-on-Chip)
  Flexible
  Use off-the-shelf processor cores

Why parallel processing
Performance drive
Diminishing returns for exploiting ILP and OLP
Multiple processors fit easily on a chip
Cost effective (just connect existing processors or processor cores)
Low power: parallelism may allow lowering Vdd
However: parallel programming is hard

Low power through parallelism
Sequential processor
  Switching capacitance C
  Frequency f
  Voltage V
  $P_1 = f C V^2$
Parallel processor (two times the number of units)
  Switching capacitance 2C
  Frequency f/2
  Voltage V' < V
  $P_2 = (f/2) \cdot 2C \cdot V'^2 = f C V'^2 < P_1$
[Figure: one CPU versus two CPUs (CPU1, CPU2)]
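The power comparison above can be checked numerically. This is a minimal sketch; the frequency, capacitance and both supply voltages are assumed example values, not figures from the slides.

```python
def dynamic_power(freq_hz, capacitance_f, vdd):
    """Dynamic switching power P = f * C * V^2."""
    return freq_hz * capacitance_f * vdd ** 2

# Assumed example values; none of these numbers come from the slides.
F, C, V = 1.0e9, 1.0e-9, 1.2   # 1 GHz, 1 nF switched capacitance, 1.2 V
V_SCALED = 0.9                 # hypothetical lower supply voltage at f/2

p1 = dynamic_power(F, C, V)                 # sequential processor
p2 = dynamic_power(F / 2, 2 * C, V_SCALED)  # two units at half the frequency
ratio = p2 / p1                             # reduces to (V_SCALED / V) ** 2

print(f"P1 = {p1:.3f} W, P2 = {p2:.3f} W, P2/P1 = {ratio:.4f}")
```

The frequency and capacitance factors cancel, so the saving comes only from the lower voltage that the halved clock rate permits.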

Parallel Architecture
Parallel architecture extends traditional computer architecture with a communication network
  abstractions (HW/SW interface)
  organizational structure to realize the abstraction efficiently
[Figure: processing nodes connected by a communication network]

Communication models: Shared Memory
[Figure: processes P1 and P2 each read and write a shared memory]
Coherence problem
Memory consistency issue
Synchronization problem

Communication models: Shared memory
Shared address space
Communication primitives: load, store, atomic swap
Two varieties:
  Physically shared => Symmetric Multi-Processors (SMP)
    usually combined with local caching
  Physically distributed => Distributed Shared Memory (DSM)
Models: 1st is easy, still useful: workstations within a building (entertainment)

SMP: Symmetric Multi-Processor
Memory: centralized with uniform access time (UMA) and bus interconnect, I/O
Examples: Sun Enterprise 6000, SGI Challenge, Intel
[Figure: four processors, each with one or more cache levels, sharing main memory and the I/O system]

DSM: Distributed Shared Memory
Nonuniform access time (NUMA) and scalable interconnect (distributed memory)
T3E:
  480 MB/sec per link, 3 links per node
  memory on node
  switch based
  up to 2048 nodes
  $30M to $50M
[Figure: four nodes, each with processor, cache and memory, on an interconnection network with main memory and the I/O system]

Shared Address Model Summary
Each processor can name every physical location in the machine
Each process can name all data it shares with other processes
Data transfer via load and store
Data size: byte, word, ... or cache blocks
Memory hierarchy model applies: communication moves data to local processor cache

Communication models: Message Passing
Communication primitives
  e.g., send, receive library calls
Note that MP can be built on top of SM and vice versa
[Figure: processes P1 and P2 connected by a FIFO, with send on one side and receive on the other]

Message Passing Model
Explicit message send and receive operations
Send specifies local buffer + receiving process on remote computer
Receive specifies sending process on remote computer + local buffer to place data
Typically blocking communication, but may use DMA
Message structure: Header | Data | Trailer
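The send/receive model can be sketched with two Python threads standing in for processes P1 and P2 and a `queue.Queue` standing in for the FIFO. The message fields (`src`, `len`, `data`) and the `None` end-of-stream marker are illustrative assumptions, not part of the slides. Because the queue lives in shared memory, this is also a small instance of message passing built on top of shared memory.

```python
import queue
import threading

def run_message_passing(values):
    """Send each value from a sender thread to a receiver thread over a FIFO."""
    channel = queue.Queue()   # FIFO between the two "processes"
    received = []

    def sender():
        for v in values:
            # Hypothetical message layout: a small header plus the data.
            channel.put({"src": "P1", "len": 1, "data": v})
        channel.put(None)     # assumed end-of-stream marker

    def receiver():
        while True:
            msg = channel.get()   # blocking receive
            if msg is None:
                break
            received.append(msg["data"])

    threads = [threading.Thread(target=sender), threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received

result = run_message_passing([1, 2, 3])
print(result)
```

The blocking `get` also illustrates the implicit synchronization of message passing: the receiver cannot run ahead of the sender.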

Message passing communication
[Figure: four nodes, each with processor, cache, memory, DMA and network interface, connected by an interconnection network]

Communication Models: Comparison
Shared memory
  Compatibility with well-understood (language) mechanisms
  Ease of programming for complex or dynamic communication patterns
  Shared-memory applications; sharing of large data structures
  Efficient for small items
  Supports hardware caching
Message passing
  Simpler hardware
  Explicit communication
  Implicit synchronization (with any communication)

Network: Performance metrics
Network bandwidth
  Need high bandwidth in communication
  How does it scale with number of nodes?
Communication latency
  Affects performance, since processor may have to wait
  Affects ease of programming, since it requires more thought to overlap communication and computation
How can a mechanism help hide latency?
  overlap message send with computation
  prefetch data
  switch to other task or thread

Challenges of parallel processing
Q1: can we get linear speedup?
  Suppose we want speedup 80 with 100 processors. What fraction of the original computation can be sequential (i.e. non-parallel)?
Q2: how important is communication latency?
  Suppose 0.2 % of all accesses are remote, and require 100 cycles on a processor with base CPI = 0.5
  What's the communication impact?
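Both questions can be worked through with a few lines of arithmetic. The sketch below uses Amdahl's law for Q1. For Q2 it assumes that the 0.2 % applies per instruction (one remote access for every 500 instructions); the slide does not state the number of accesses per instruction, so that reading is an assumption.

```python
def max_sequential_fraction(target_speedup, processors):
    """Solve speedup = 1 / (s + (1 - s) / p) for the sequential fraction s."""
    return (processors / target_speedup - 1) / (processors - 1)

def effective_cpi(base_cpi, remote_fraction, remote_cycles):
    """Base CPI plus the average stall cycles per instruction for remote accesses."""
    return base_cpi + remote_fraction * remote_cycles

# Q1: speedup 80 on 100 processors.
seq_fraction = max_sequential_fraction(80, 100)

# Q2: assumed to mean 0.2 % of instructions make a 100-cycle remote access.
cpi = effective_cpi(base_cpi=0.5, remote_fraction=0.002, remote_cycles=100)
slowdown = cpi / 0.5

print(f"Q1: sequential fraction <= {seq_fraction:.4%}")
print(f"Q2: CPI = {cpi:.2f}, slowdown = {slowdown:.2f}x")
```

Under these assumptions only about a quarter of a percent of the computation may be sequential, and the rare remote accesses raise the CPI from 0.5 to 0.7.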

Three fundamental issues for shared memory multiprocessors
Coherence, about: Do I see the most recent data?
Consistency, about: When do I see a written value?
  e.g. do different processors see writes at the same time (w.r.t. other memory accesses)?
Synchronization
  How to synchronize processes?
  How to protect access to shared data?

Coherence problem, in single CPU system
[Figure: CPU with cache (a', b'), memory (a, b) and I/O, in three situations]
  Coherent: cache a' = 100, b' = 200; memory a = 100, b = 200
  CPU writes to a: cache a' = 550, memory a = 100 (not coherent)
  I/O writes b: memory b = 440, cache b' = 200 (not coherent)

Coherence problem, in Multi-Proc system
[Figure: two CPUs with private caches above a shared memory]
  CPU-1 cache: a' = 550, b' = 200
  CPU-2 cache: a'' = 100, b'' = 200
  Memory: a = 100, b = 200

What Does Coherency Mean?
Informally: “Any read must return the most recent write”
  Too strict and too difficult to implement
Better: “Any write must eventually be seen by a read”
  All writes are seen in proper order (“serialization”)

Two rules to ensure coherency
“If P writes x and P1 reads it, P’s write will be seen by P1 if the read and write are sufficiently far apart”
Writes to a single location are serialized: seen in one order
  Latest write will be seen
  Otherwise could see writes in illogical order (could see older value after a newer value)

Potential HW Coherency Solutions
Snooping solution (snoopy bus):
  Send all requests for data to all processors (or local caches)
  Processors snoop to see if they have a copy and respond accordingly
  Requires broadcast, since caching information is at processors
  Works well with bus (natural broadcast medium)
  Dominates for small scale machines (most of the market)
Directory-based schemes:
  Keep track of what is being shared in one centralized place
  Distributed memory => distributed directory for scalability (avoids bottlenecks)
  Send point-to-point requests to processors via network
  Scales better than snooping
  Actually existed BEFORE snooping-based schemes

Example Snooping protocol
3 states for each cache line: invalid, shared, modified (exclusive)
FSM per cache, receives requests from both processor and bus
[Figure: four processors with caches on a bus to main memory and the I/O system]

Cache coherence protocol
Write invalidate protocol for write-back cache
Showing state transitions for each block in the cache
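A three-state write-invalidate protocol for a write-back cache can be sketched as a toy simulator for a single memory block. The class name `SnoopyBus`, the bus event names and the initial value 100 are illustrative assumptions; the state diagram from the slide is not reproduced here, so this follows the textbook MSI transitions rather than the author's exact figure.

```python
INVALID, SHARED, MODIFIED = "I", "S", "M"

class SnoopyBus:
    """Toy MSI model: one block, one state and one copy per cache."""

    def __init__(self, n_caches, initial=100):
        self.state = [INVALID] * n_caches
        self.data = [None] * n_caches
        self.memory = initial
        self.bus_log = []          # (event, cpu) pairs seen on the bus

    def _write_back_owner(self, new_state):
        # A cache holding the block in MODIFIED must supply it to memory.
        for c, s in enumerate(self.state):
            if s == MODIFIED:
                self.memory = self.data[c]
                self.state[c] = new_state

    def read(self, cpu):
        if self.state[cpu] == INVALID:          # read miss goes on the bus
            self.bus_log.append(("read_miss", cpu))
            self._write_back_owner(SHARED)      # owner downgrades to SHARED
            self.data[cpu] = self.memory
            self.state[cpu] = SHARED
        return self.data[cpu]                   # hit in SHARED or MODIFIED

    def write(self, cpu, value):
        if self.state[cpu] != MODIFIED:
            kind = "write_miss" if self.state[cpu] == INVALID else "invalidate"
            self.bus_log.append((kind, cpu))
            self._write_back_owner(INVALID)
            for other in range(len(self.state)):
                if other != cpu:                # all other copies are dropped
                    self.state[other] = INVALID
        self.state[cpu] = MODIFIED
        self.data[cpu] = value                  # memory stays stale (write-back)

bus = SnoopyBus(2)
bus.write(0, 550)            # CPU-0 takes the block in MODIFIED
first_read = bus.read(1)     # CPU-1 misses; CPU-0 writes back and shares
bus.write(1, 7)              # CPU-1 invalidates CPU-0's copy
second_read = bus.read(0)    # CPU-0 misses and sees the new value
print(first_read, second_read, bus.state, bus.bus_log)
```

The invariant worth noticing is that at most one cache is ever in MODIFIED, which is what keeps writes to one location serialized.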

Synchronization problem
Computer system of bank has credit process (P_c) and debit process (P_d)

  /* Process P_c */           /* Process P_d */
  shared int balance          shared int balance
  private int amount          private int amount

  balance += amount           balance -= amount

  lw  $t0,balance             lw  $t2,balance
  lw  $t1,amount              lw  $t3,amount
  add $t0,$t0,$t1             sub $t2,$t2,$t3
  sw  $t0,balance             sw  $t2,balance
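The hazard in the two instruction sequences is a lost update: if both processes load the balance before either stores it, one update disappears. The sketch below replays the load, add/sub and store steps of P_c and P_d in every possible interleaving. The starting balance and the two amounts are assumed example values, and loading `amount` is folded into the arithmetic step because it is private.

```python
from itertools import permutations

def run_schedule(schedule, balance=100, credit=50, debit=30):
    """Replay one interleaving; 'c' and 'd' pick which process steps next."""
    memory = {"balance": balance}
    regs = {"c": None, "d": None}
    pc = {"c": 0, "d": 0}
    delta = {"c": +credit, "d": -debit}
    for p in schedule:
        step = pc[p]
        if step == 0:                 # lw: load balance into a register
            regs[p] = memory["balance"]
        elif step == 1:               # add / sub on the private register
            regs[p] += delta[p]
        elif step == 2:               # sw: store the register back
            memory["balance"] = regs[p]
        pc[p] += 1
    return memory["balance"]

serial = run_schedule("cccddd")       # P_c runs to completion, then P_d
interleaved = run_schedule("cdcdcd")  # both load before either stores

# Every ordering of three 'c' steps and three 'd' steps.
all_outcomes = {run_schedule(s) for s in set(permutations("cccddd"))}
print(serial, interleaved, sorted(all_outcomes))
```

Only the final balance 120 is correct; the other two outcomes each drop one of the updates, which is why the read-modify-write has to be a critical section.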

Critical Section Problem
n processes all competing to use some shared data
Each process has a code segment, called critical section, in which shared data is accessed
Problem: ensure that when one process is executing in its critical section, no other process is allowed to execute in its critical section
Structure of process:

  while (TRUE) {
    entry_section();
    critical_section();
    exit_section();
    remainder_section();
  }

Attempt 1 – Strict Alternation
Process P0:

  shared int turn;
  while (TRUE) {
    while (turn != 0);
    critical_section();
    turn = 1;
    remainder_section();
  }

Process P1:

  shared int turn;
  while (TRUE) {
    while (turn != 1);
    critical_section();
    turn = 0;
    remainder_section();
  }

Two problems:
  Satisfies mutual exclusion, but not progress (works only when both processes strictly alternate)
  Busy waiting

Attempt 2 – Warning Flags
Process P0:

  shared int flag[2];
  while (TRUE) {
    flag[0] = TRUE;
    while (flag[1]);
    critical_section();
    flag[0] = FALSE;
    remainder_section();
  }

Process P1:

  shared int flag[2];
  while (TRUE) {
    flag[1] = TRUE;
    while (flag[0]);
    critical_section();
    flag[1] = FALSE;
    remainder_section();
  }

Satisfies mutual exclusion
  P0 in critical section: flag[0] && !flag[1]
  P1 in critical section: !flag[0] && flag[1]
However, contains a deadlock (both flags may be set to TRUE !!)

Software solution: Peterson’s Algorithm
(combining warning flags and alternation)

Process P0:

  shared int flag[2];
  shared int turn;
  while (TRUE) {
    flag[0] = TRUE;
    turn = 0;
    while (turn == 0 && flag[1]);
    critical_section();
    flag[0] = FALSE;
    remainder_section();
  }

Process P1:

  shared int flag[2];
  shared int turn;
  while (TRUE) {
    flag[1] = TRUE;
    turn = 1;
    while (turn == 1 && flag[0]);
    critical_section();
    flag[1] = FALSE;
    remainder_section();
  }

Software solution is slow !
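The claims about Peterson's algorithm and about the warning-flags attempt can be checked by exhaustively exploring every interleaving of the two processes. The sketch below is a small state-space search, not a proof. It assumes sequentially consistent memory and treats each statement, including the evaluation of the whole `while` condition, as one atomic step. The program counter encoding (0 to 3) and the function names are illustrative.

```python
from collections import deque

def peterson_step(state, i):
    """Successor state after process i executes one statement."""
    pcs, flags, turn = list(state[0]), list(state[1]), state[2]
    j = 1 - i
    if pcs[i] == 0:                    # flag[i] = TRUE
        flags[i] = True
        pcs[i] = 1
    elif pcs[i] == 1:                  # turn = i
        turn = i
        pcs[i] = 2
    elif pcs[i] == 2:                  # while (turn == i && flag[j]);
        if not (turn == i and flags[j]):
            pcs[i] = 3                 # enter the critical section
    else:                              # leave: flag[i] = FALSE
        flags[i] = False
        pcs[i] = 0
    return (tuple(pcs), tuple(flags), turn)

def flags_only_step(state, i):
    """Same encoding for the warning-flags attempt (no turn variable)."""
    pcs, flags, turn = list(state[0]), list(state[1]), state[2]
    j = 1 - i
    if pcs[i] == 0:                    # flag[i] = TRUE
        flags[i] = True
        pcs[i] = 2
    elif pcs[i] == 2:                  # while (flag[j]);
        if not flags[j]:
            pcs[i] = 3
    else:                              # leave: flag[i] = FALSE
        flags[i] = False
        pcs[i] = 0
    return (tuple(pcs), tuple(flags), turn)

def explore(step):
    """Breadth-first search; returns (both_in_cs, stuck) over reachable states."""
    start = ((0, 0), (False, False), 0)
    seen, todo = {start}, deque([start])
    both_in_cs = stuck = False
    while todo:
        s = todo.popleft()
        both_in_cs |= s[0] == (3, 3)
        successors = [step(s, 0), step(s, 1)]
        stuck |= all(n == s for n in successors)   # neither process can move
        for n in successors:
            if n not in seen:
                seen.add(n)
                todo.append(n)
    return both_in_cs, stuck

peterson_result = explore(peterson_step)
flags_result = explore(flags_only_step)
print("Peterson:", peterson_result, "warning flags:", flags_result)
```

Under these assumptions the search finds no state with both processes in the critical section for either algorithm, and it finds a state where both spin forever only for the warning-flags attempt.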

Hardware solution for Synchronization
For large scale MPs, synchronization can be a bottleneck; techniques are needed to reduce contention and latency of synchronization
Hardware primitives needed
  all solutions based on "atomically inspect and update a memory location"
Higher level synchronization solutions can be built on top

Uninterruptable Instructions to Fetch and Update Memory
Atomic exchange: interchange a value in a register for a value in memory
  0 => synchronization variable is free
  1 => synchronization variable is locked and unavailable
Test-and-set: tests a value and sets it if the value passes the test (also compare-and-swap)
Fetch-and-increment: returns the value of a memory location and atomically increments it

Build a 'spin-lock' using exchange primitive
Spin locks: processor continuously tries to acquire, spinning around a loop trying to get the lock

          LI    R2,#1        ;load immediate
  lockit: EXCH  R2,0(R1)     ;atomic exchange
          BNEZ  R2,lockit    ;already locked?

What about MP with cache coherency?
  Want to spin on cache copy to avoid full memory latency
  Likely to get cache hits for such variables
Problem: exchange includes a write, which invalidates all other copies; this generates considerable bus traffic
Solution: start by simply repeatedly reading the variable; when it changes, then try exchange (“test and test&set”):

  try:    LI    R2,#1        ;load immediate
  lockit: LW    R3,0(R1)     ;load var
          BNEZ  R3,lockit    ;not free => spin
          EXCH  R2,0(R1)     ;atomic exchange
          BNEZ  R2,try       ;already locked?
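The test-and-test&set loop can be mirrored in a high-level sketch. Python has no atomic exchange instruction, so the `AtomicWord` class below emulates one with a `threading.Lock`; that class, the thread counts and the `time.sleep(0)` yield are assumptions made for illustration and say nothing about cache or bus behavior.

```python
import threading
import time

class AtomicWord:
    """One memory word with an atomic exchange, emulated with a Python lock."""

    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def load(self):
        return self._value

    def exchange(self, new_value):
        with self._guard:
            old, self._value = self._value, new_value
            return old

class SpinLock:
    def __init__(self):
        self.word = AtomicWord()         # 0 = free, 1 = locked

    def acquire(self):
        while True:
            while self.word.load() != 0:     # "test": spin on plain reads
                time.sleep(0)                # yield to other Python threads
            if self.word.exchange(1) == 0:   # "test&set": try the exchange
                return

    def release(self):
        self.word.exchange(0)

def count_with_spinlock(n_threads=4, n_increments=200):
    """Increment a shared counter from several threads under the spin lock."""
    lock, total = SpinLock(), [0]

    def worker():
        for _ in range(n_increments):
            lock.acquire()
            total[0] += 1                # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]

final_count = count_with_spinlock()
print(final_count)
```

The inner read-only loop corresponds to the `LW`/`BNEZ` pair and the `exchange(1)` call to `EXCH`; a waiter only attempts the write after it has seen the lock free.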

Alternative to Fetch and Update
Hard to have read & write in 1 instruction: use 2 instead
Load linked (or load locked) + store conditional
  Load linked returns the initial value
  Store conditional returns 1 if it succeeds (no other store to same memory location since preceding load) and 0 otherwise
Example doing atomic swap with LL & SC:

  try: OR    R3,R4,R0     ; R3 = R4 (value to exchange)
       LL    R2,0(R1)     ; load linked
       SC    R3,0(R1)     ; store conditional
       BEQZ  R3,try       ; branch if store fails (R3=0)

Example doing fetch & increment with LL & SC:

  try: LL    R2,0(R1)     ; load linked
       ADDUI R3,R2,#1     ; increment
       SC    R3,0(R1)     ; store conditional
       BEQZ  R3,try       ; branch if store fails (R3=0)
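The retry behavior of load linked / store conditional can be illustrated with a toy single-word memory model. The `Memory` class, its reservation set and the `interfere` hook are hypothetical constructs for this sketch; real hardware tracks the link per processor and may fail a store conditional for other reasons as well.

```python
class Memory:
    """Single memory word with per-CPU reservations, a toy model of LL/SC."""

    def __init__(self, value=0):
        self.value = value
        self.linked = set()            # CPUs holding a valid reservation

    def load_linked(self, cpu):
        self.linked.add(cpu)
        return self.value

    def store(self, value):
        # Any store to the location breaks every outstanding reservation.
        self.value = value
        self.linked.clear()

    def store_conditional(self, cpu, value):
        if cpu not in self.linked:
            return 0                   # fails: a store intervened since LL
        self.store(value)
        return 1

def fetch_and_increment(mem, cpu, interfere=None):
    """Retry loop mirroring the LL / ADDUI / SC / BEQZ sequence."""
    attempts = 0
    while True:
        attempts += 1
        old = mem.load_linked(cpu)                 # LL R2,0(R1)
        if interfere is not None and attempts == 1:
            interfere(mem)                         # another CPU stores here
        if mem.store_conditional(cpu, old + 1):    # SC R3,0(R1)
            return old, attempts

quiet = Memory(5)
quiet_result = fetch_and_increment(quiet, cpu=0)

contended = Memory(10)
contended_result = fetch_and_increment(
    contended, cpu=0, interfere=lambda m: m.store(41))
print(quiet_result, quiet.value, contended_result, contended.value)
```

In the contended run the first store conditional fails because of the intervening store, so the loop reloads and increments the newer value instead of overwriting it.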

Another MP Issue: Memory Consistency
What is consistency? When must a processor see a new memory value?
Example:

  P1: A = 0;                 P2: B = 0;
      A = 1;                     B = 1;
  L1: if (B == 0) ...        L2: if (A == 0) ...

Seems impossible for both if-statements L1 & L2 to be true?
  What if write invalidate is delayed & processor continues?
Memory consistency models: what are the rules for such cases?

Sequential Consistency (SC)
Result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved
  => finish assignments before if-statements above
SC: delay all memory accesses until all invalidates done
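The A/B example can be enumerated directly. The sketch below lists every interleaving of the two writes and two reads, once with each processor's program order kept (sequential consistency) and once with it dropped, which stands in for a read overtaking a delayed write. The operation tuples and the `keep_program_order` flag are illustrative, and the relaxed case is a simplification rather than a model of any specific machine.

```python
from itertools import permutations

# Each operation: (processor, kind, variable). A and B start at 0.
P1 = [("P1", "write", "A"), ("P1", "read", "B")]   # A = 1; if (B == 0)
P2 = [("P2", "write", "B"), ("P2", "read", "A")]   # B = 1; if (A == 0)

def outcomes(keep_program_order):
    """Set of (value of B read by P1, value of A read by P2)."""
    results = set()
    for order in permutations(P1 + P2):
        in_order = (order.index(P1[0]) < order.index(P1[1])
                    and order.index(P2[0]) < order.index(P2[1]))
        if keep_program_order and not in_order:
            continue
        mem = {"A": 0, "B": 0}
        seen = {}
        for proc, kind, var in order:
            if kind == "write":
                mem[var] = 1
            else:
                seen[proc] = mem[var]
        results.add((seen["P1"], seen["P2"]))
    return results

sc_outcomes = outcomes(keep_program_order=True)
relaxed_outcomes = outcomes(keep_program_order=False)
print(sorted(sc_outcomes), sorted(relaxed_outcomes))
```

The pair (0, 0), meaning both if-statements are true, never appears when program order is kept, and it becomes possible as soon as a read may complete before the same processor's earlier write.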

Sequential consistency overkill?
Schemes for faster execution than sequential consistency
Most programs are synchronized
  A program is synchronized if all accesses to shared data are ordered by synchronization operations
Example:
  P1: write (x) ... release (s) {unlock} ...
  P2: ... acquire (s) {lock} ... read (x)
  The write and the read are ordered by release (s) and acquire (s)

Relaxed Memory Consistency Models
Several relaxed models for memory consistency, since most programs are synchronized
Key: (partially) allow reads and writes to complete out-of-order
Models are characterized by their attitude towards:
  W → R : total store ordering
  W → W : partial store ordering
  R → W and R → R : weak ordering, and others
  to different addresses
Note, seq. consistency means: W → R, W → W, R → W and R → R

Fundamental MP design decision
We have already discussed:
  Shared memory versus message passing
  Coherence, consistency and synchronization issues
Other extremely important decisions:
  Processing units:
    Homogeneous versus heterogeneous?
    Generic versus application specific?
  Interconnect:
    Bus versus network?
    Type (topology) of network
  What types of parallelism to support?
  Focus on performance, power or cost?
  Memory organization?

Homogeneous or Heterogeneous
Homogeneous:
  replication effect
  memory dominated anyway
  solve realization issues once and for all
  less flexible

Heterogeneous:
  better fit to application domain
  smaller increments

Middle of the road approach:
  Flexible tiles
  Fixed tile structure at top level

Bus (shared) or Network (switched)
Network:
  claimed to be more scalable
  no bus arbitration
  point-to-point connections
  but router overhead
[Figure: example NoC with 2x4 mesh routing network; nodes attached to routers (R)]

Network design parameters
Important network design space:
  topology, degree
  routing algorithm
    path, path control, collision resolution, network support, deadlock handling, livelock handling
  virtual layer support
  flow control, buffering
  QoS guarantees
  error handling
  etc.

Switch / Network Topology
Topology determines:
  Degree: number of links from a node
  Diameter: max number of links crossed between nodes
  Average distance: number of links to random destination
  Bisection: minimum number of links that separate the network into two halves
  Bisection bandwidth = link bandwidth x bisection

Common Topologies
Type        Degree   Diameter          Ave Dist        Bisection
1D mesh     2        N-1               N/3             1
2D mesh     4        2(N^(1/2) - 1)    2N^(1/2) / 3    N^(1/2)
3D mesh     6        3(N^(1/3) - 1)    3N^(1/3) / 3    N^(2/3)
nD mesh     2n       n(N^(1/n) - 1)    nN^(1/n) / 3    N^((n-1)/n)
Ring        2        N/2               N/4             2
2D torus    4        N^(1/2)           N^(1/2) / 2     2N^(1/2)
Hypercube   log2 N   n = log2 N        n/2             N/2
2D Tree     3        2 log2 N          ~2 log2 N       1
Crossbar    N                                          N^2/2
N = number of nodes, n = dimension

Topology examples: Hypercube, Grid/Mesh, Torus
Assume 64 nodes:
Criteria                   Bus   Ring   Mesh   2D torus   6-cube   Fully connected
Performance
  Bisection bandwidth      1     2      8      16         32       1024
Cost
  Ports/switch                   3      5      5          7        64
  Total #links                   128    176    192        256      2080
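The 64-node figures can be recomputed from the topology definitions. The sketch below assumes that each switch has one extra port for its local node and that the link totals count one node-to-switch link per node in addition to the switch-to-switch links; those two conventions are assumptions chosen because they reproduce the ring, mesh, torus, 6-cube and fully connected numbers listed above. It also assumes N is both a perfect square and a power of two.

```python
import math

def topology_metrics(n_nodes=64):
    """Bisection width, ports per switch and total links for common topologies."""
    side = math.isqrt(n_nodes)          # nodes per row of the 2D mesh / torus
    dim = n_nodes.bit_length() - 1      # hypercube dimension, log2(N)
    # name: (degree, switch-to-switch links, bisection width)
    raw = {
        "ring": (2, n_nodes, 2),
        "2D mesh": (4, 2 * side * (side - 1), side),
        "2D torus": (4, 2 * n_nodes, 2 * side),
        "hypercube": (dim, n_nodes * dim // 2, n_nodes // 2),
        "fully connected": (n_nodes - 1,
                            n_nodes * (n_nodes - 1) // 2,
                            (n_nodes // 2) ** 2),
    }
    table = {}
    for name, (degree, links, bisection) in raw.items():
        table[name] = {
            "bisection": bisection,
            "ports": degree + 1,          # assumed: one port to the local node
            "links": links + n_nodes,     # assumed: plus one link per node
        }
    return table

metrics = topology_metrics(64)
for name, row in metrics.items():
    print(f"{name:16s} {row}")
```

Multiplying a bisection width by the per-link bandwidth gives the bisection bandwidth, so the table effectively assumes unit link bandwidth.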

Butterfly or Omega Network
All paths equal length
Unique path from any input to any output
Try to avoid conflicts
[Figure: 8 x 8 butterfly switch; label: N/2 Butterfly]
How to make a bigger butterfly network?

Multistage Fat Tree
A multistage fat tree (CM-5) avoids congestion at the root node
Randomly assign packets to different paths on way up to spread the load
Increase degree near root, decrease congestion

Old (off-chip) MP Networks
[Table: columns Name, Number, Topology, Bits, Clock (MHz), Link (MBytes/s), Bis. BW (MBytes/s), Year; surviving entries only]
  nCube/ten: cube
  iPSC/: cube
  MP: D grid, link 3
  Delta: number 540, 2D grid
  CM: fat tree, link 20
  CS: fat tree, link 50
  Paragon: D grid, link 200
  T3D: D Torus
No standard topology!
However, for on-chip: mesh and torus are in favor!

QoS: Quality-of-Service
Hard and soft real-time applications require QoS guarantees
  Predictable delays
  Guaranteed throughput
Issues:
  Resource manager interface between applications and platform resources (processing elements, network, memory, I/O)
  Do we allow caches software controlled
  Different traffic service types, including GT (guaranteed throughput / latency traffic)

Generic or Specialized? Intrinsic computational efficiency

Which types of parallelism to support?
ILP/OLP: instruction/operation level parallelism
DLP: data level parallelism
  special case: subword SIMD
TLP: task level parallelism
Heavy pipelining
