
EE 382 Processor Design, Winter 98/99, Michael Flynn. Chapter 8 Lectures: Multiprocessors, Part II.


1 EE382 Processor Design, Winter 1998. Chapter 8 Lectures: Multiprocessors, Part II

2 Illinois [figure]

3 Write-invalidate [figure]

4 Synchronization/coherency
Synchronization: the means to ensure that multiple processors have the same (coherent) view of critical values in memory.
Coherency: the property of memory that ensures that the value returned by a read is the same as the value of the latest write to that location, within the part of memory over which coherency is maintained.

5 Consistency of memory ops
Sequential consistency (strong ordering)
 –all memory ops execute in some sequential order; the memory ops of each processor appear in program order
Processor consistency (buffered writes)
 –LD sequences appear in program order, as do ST sequences, but a LD may precede (bypass) an earlier ST
 –different processors may see a different op order
 –requires explicit synchronization

6 Weak consistency
Other forms are possible, e.g. weak ordering
 –all pending memory ops are completed before a synchronization op (forced completion is called a fence op)
 –synch ops are completed before any other memory ops
 –synch ops are sequentially consistent

7 Outline
Partitioning
 –Granularity
 –Overhead and efficiency
Multi-threaded MP
Shared Bus
 –Coherency
 –Synchronization
 –Consistency
Scalable MP
 –Cache directories
 –Interconnection networks
 –Trends and tradeoffs
Additional References
 –Hennessy and Patterson, CAQA, Chapter 8
 –Culler, Singh, Gupta, Parallel Computer Architecture: A Hardware/Software Approach, http://HTTP.CS.Berkeley.EDU/~culler/book.alpha/index.html

8 Scalable MP
Bandwidth of a single bus limits scalability
 –Can use two (or more) buses, e.g. one for even and one for odd cache lines
 –This extends system size incrementally, at substantial cost
 –Use a low-degree MP on a shared bus as a cluster, and connect clusters with a scalable interconnect

9 Coherency for Scalable MP
Maintain a single, coherent memory address space
 –There is no longer a shared bus accessed by all processors for synchronization and communication through memory
 –Use a directory to track the processors using each memory line
   central directory: kept with the memory module
   distributed directory: kept with the individual caches
 –Shared lines can be invalidated or updated on a write
 –4 possible protocols: CD-INV, CD-UP, DD-INV, DD-UP
   CD-INV and DD-INV (Scalable Coherent Interconnect) are the most common

10 Central Directory [figure]

11 Central Directory
Typically uses a bit vector stored with each line in memory
 –Each bit indicates whether the corresponding cluster has cached a copy of the line
 –Various optimizations to reduce storage overhead are possible
 –Used in Stanford DASH/FLASH, MIT Alewife, SGI Origin
When a processor needs to write a line it does not own
 –It requests the line from memory
 –CD sends invalidates to all caches that hold the line
 –All relevant caches invalidate the line and acknowledge
 –The requesting processor is then allowed to take ownership and modify the line
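The write sequence for a central directory can be sketched in a few lines. This is a minimal illustration, not the protocol of any of the machines named on the slide: the class name `CentralDirectory`, its methods, and the use of a Python integer as the per-line bit vector are all assumptions made for the example.

```python
class CentralDirectory:
    """Toy central directory: one presence bit per cluster for each memory line.

    Hypothetical sketch; real directories also track line state, pending
    acknowledgements, and races between requests.
    """

    def __init__(self, num_clusters):
        self.num_clusters = num_clusters
        self.sharers = {}  # line address -> bit vector of clusters holding a copy
        self.owner = {}    # line address -> cluster currently allowed to modify it

    def read(self, line, cluster):
        # A read sets the requester's presence bit; nobody holds it exclusively.
        self.sharers[line] = self.sharers.get(line, 0) | (1 << cluster)
        self.owner.pop(line, None)

    def write(self, line, cluster):
        # Invalidate every other cluster whose presence bit is set.
        vector = self.sharers.get(line, 0)
        invalidated = [c for c in range(self.num_clusters)
                       if (vector >> c) & 1 and c != cluster]
        # After all acknowledgements, only the writer holds (and owns) the line.
        self.sharers[line] = 1 << cluster
        self.owner[line] = cluster
        return invalidated
```

With clusters 0, 2 and 3 sharing a line, a write by cluster 2 returns the clusters that must invalidate their copies, here 0 and 3, and leaves cluster 2 as the only sharer.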

12 Distributed Directory (Part I)
A linked list is used to keep track of the caches holding a line
 –Singly- or doubly-linked (SCI) lists are used
 –A pointer to the head of the list is stored with the line in memory
 –Used in IEEE-SCI and Sequent NUMA-Q
When a processor (P) needs to write a line it does not own
 –If P holds a shared copy in its cache, P removes itself from the linked list of caches for the line
 –P notifies the memory of its intention to write the line and becomes the head of the list
 –P sends an invalidation signal to the next cache on the list
   The next cache invalidates the line and returns an acknowledge to P, along with a pointer to the next cache on the list
 –When all the caches have been invalidated, P can take ownership and write the line
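The list walk described above can be sketched as follows. This is a simplified model with a singly linked list, assuming that new sharers are prepended at the head; the class `SharingList` and its method names are hypothetical and do not reproduce the SCI message formats.

```python
class SharingList:
    """Toy sharing list for one memory line (simplified, singly linked)."""

    def __init__(self):
        self.head = None  # pointer stored with the line in memory
        self.next = {}    # cache id -> next cache id on the list (None at the tail)

    def add_sharer(self, cache):
        # Assumption: a new sharer becomes the head of the list.
        self.next[cache] = self.head
        self.head = cache

    def _remove(self, cache):
        # Unlink one cache from the list.
        if self.head == cache:
            self.head = self.next.pop(cache)
            return
        prev = self.head
        while self.next[prev] != cache:
            prev = self.next[prev]
        self.next[prev] = self.next.pop(cache)

    def write(self, p):
        # Step 1: if P holds a shared copy, it removes itself from the list.
        if p in self.next:
            self._remove(p)
        # Step 2: P tells memory it will write and becomes the new head.
        victim = self.head
        self.head = p
        # Step 3: invalidate one cache at a time; each acknowledge carries
        # the pointer to the next cache on the list.
        invalidated = []
        while victim is not None:
            successor = self.next.pop(victim)
            invalidated.append(victim)
            victim = successor
        # Step 4: the list now holds only P, which may write the line.
        self.next[p] = None
        return invalidated
```

Because the invalidations are issued one after another, the cost of a write grows with the length of the list, which is why list length matters for performance.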

13 Distributed Directory (Part II) [figure]

14 Distributed Directory (Part II)
Performance Issues
 –Linked lists are generally short for shared data that is being modified
 –When data is shared, it is important to minimize synchronization and communication overhead
Queue on Lock Bit (QOLB)
 –Hardware maintains a queue of the caches waiting on a lock
 –Software spins on a shadow copy of the line in its local cache
 –Lock and data are stored in the same cache line
 –A single line transfer is required for each processor to synchronize/communicate
 –“An Analysis of Synchronization Mechanisms in Shared-Memory Multiprocessors”, Woest and Goodman
Efficient algorithms can be quite complex
 –FLASH uses a programmable protocol processor
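The queueing idea behind QOLB can be illustrated with a software model. This is only a sketch under stated assumptions: QOLB is a hardware mechanism, and the class `QueuedLock`, its methods, and the `line_transfers` counter are hypothetical names used to show the first-come, first-served hand-off and the one-transfer-per-acquisition property.

```python
from collections import deque


class QueuedLock:
    """Toy model of a queue-based lock where lock and data share one cache line."""

    def __init__(self):
        self.holder = None       # cache that currently holds the lock line
        self.waiters = deque()   # caches queued for the lock, in arrival order
        self.line_transfers = 0  # number of times the line moved between caches

    def acquire(self, cache):
        if self.holder is None:
            self.holder = cache
            self.line_transfers += 1
            return True
        # The requester joins the queue and spins on its local shadow copy,
        # generating no further network traffic while it waits.
        self.waiters.append(cache)
        return False

    def release(self):
        # Hand the line (lock plus data) directly to the next waiter, if any.
        self.holder = self.waiters.popleft() if self.waiters else None
        if self.holder is not None:
            self.line_transfers += 1
        return self.holder
```

In this model three caches that acquire the lock in turn cause exactly three line transfers, one per acquisition.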

15 Interconnect Networks
Each network node consists of a processor, a cache, and part of the global memory
A node may also include a switch (direct network)
 –For indirect networks, the switches are removed from the nodes
Networks may be static (fixed links between nodes) or dynamic (switches configure the path)
Only direct-static and indirect-dynamic networks are commonly used

16 Interconnect Networks [figure panels: Direct, Indirect]

17 Static, Direct Networks
Includes ring, linear array, star, mesh, ...
We consider only hypertorus (k,n) topologies
 –n dimensions, k elements per dimension
 –k-ary n-cubes with end-around connection
Terms
 –distance: smallest number of links/hops between 2 nodes
 –diameter: largest distance between 2 nodes
 –number of nodes: $N = k^n$ for a (k,n) network
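The terms defined for a (k,n) hypertorus can be checked with a short sketch. It assumes bidirectional links with end-around connections and represents each node as a tuple of n coordinates in the range 0 to k-1; the function names are illustrative.

```python
def torus_nodes(k, n):
    """Number of nodes N = k**n in a (k, n) hypertorus."""
    return k ** n


def torus_distance(a, b, k):
    """Smallest number of hops between nodes a and b (coordinate tuples).

    In each dimension a message can travel either way around the ring,
    so the per-dimension cost is the shorter of the two directions.
    """
    return sum(min((x - y) % k, (y - x) % k) for x, y in zip(a, b))


def torus_diameter(k, n):
    """Largest distance between any two nodes: floor(k/2) hops per dimension."""
    return n * (k // 2)
```

For a 4-ary 2-cube, `torus_nodes(4, 2)` is 16, the distance from (0, 0) to (3, 1) is 2 because both dimensions wrap around in one hop, and the diameter is 4.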

18 Static, Direct Networks [figure panels: Linear Array, 2D torus, Ring, Grid (2D Mesh)]

19 Links (Channels) and Nodes
Link characteristics
 –cycle time: $T_{ch} = 1/\mathrm{BW}$ of a link wire
 –width of link: $w$ = number of wires in the link
 –directionality: unidirectional or bidirectional links
Node buffering (static networks)
 –Store and Forward
 –Wormhole (cut-through) routing

20 Links (Channels) and Nodes [figure panels: Store and Forward, Wormhole]

21 Communication Latency for Static Network
Assume a (k,n) network with dimensional closure and bidirectional links. If a message has $H$ header bits and $l$ “payload” bits, the number of channel cycles to transmit the message over one link is $(l + H)/w$.
If the distance between source and destination nodes is $d$ links and $h = H/w$, then
$T_{\text{store-and-forward}} = T_{ch}\,[\,d\,(l + H)/w\,] = T_{ch}\,[\,d\,(l/w) + d\,h\,]$
For wormhole routing, once a message header is received at a node the message proceeds to an output channel and is transmitted, so
$T_{\text{wormhole}} = T_{ch}\,[\,d\,h + (l/w)\,]$
Note: Both formulas above refer to communication latency in the absence of contention (i.e., no queuing delay).
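The two latency formulas are easy to compare numerically. The sketch below evaluates them directly; the function names and the default `t_ch=1.0` (latency expressed in channel cycles) are assumptions for the example, and the message sizes in the usage note are made up.

```python
def t_store_and_forward(d, l, H, w, t_ch=1.0):
    """Contention-free latency when every node receives the whole message first.

    d: distance in links, l: payload bits, H: header bits, w: link width in wires.
    """
    return t_ch * d * (l + H) / w


def t_wormhole(d, l, H, w, t_ch=1.0):
    """Contention-free latency when only the header is delayed at each hop."""
    h = H / w  # channel cycles needed to transmit the header
    return t_ch * (d * h + l / w)
```

For d = 4 hops, l = 200 payload bits, H = 16 header bits and w = 16 wires, store-and-forward takes 54 channel cycles while wormhole routing takes 16.5, since the payload term l/w is paid once rather than at every hop.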

22 Dynamic, Indirect Networks
Switches are separate from the nodes and centralized as a MIN (Multistage Interconnection Network)
 –A switch is a $k \times k$ crossbar with no storage
 –An N-node (1 channel/node) network has $(N/k)\,w$ switches per stage
 –The minimum number of stages to connect N to N is $\lceil \log_k N \rceil$

23 Dynamic, Indirect Networks [figure panels: Multi-Stage Network, Crossbar Switch]

24 Baseline Dynamic Network
The destination node address sets the switch routing for each stage
In the simpler baseline network, messages can block
 –There is no storage in the switch
The cost of a baseline network is $w \times (N/k) \times \lceil \log_k N \rceil$, counted in $k \times k$ switches
Assume each switch has a delay of one channel cycle, $T_{ch}$
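The stage count, switch cost, and destination-based routing can be sketched as below. The function names are illustrative, and consuming the destination address most-significant digit first is an assumption: the digit order depends on how the stages of a particular network are wired.

```python
def min_stages(N, k):
    """Smallest n with k**n >= N, i.e. ceil(log_k N), using integer arithmetic."""
    stages, reach = 0, 1
    while reach < N:
        reach *= k
        stages += 1
    return stages


def switch_cost(N, k, w=1):
    """Number of k x k switches: w * (N/k) per stage times the number of stages."""
    return w * (N // k) * min_stages(N, k)


def routing_digits(dest, N, k):
    """Base-k digits of the destination address, one output-port choice per stage.

    Assumption: stage 0 uses the most significant digit.
    """
    n = min_stages(N, k)
    return [(dest // k ** (n - 1 - s)) % k for s in range(n)]
```

A 1024-node network built from 2 x 2 switches with w = 1 needs 10 stages and 5120 switches; in an 8-node network, a message for node 5 selects output ports 1, 0, 1 at the three stages.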

25 Baseline Dynamic Network [figure]

26 Other Dynamic Networks
Other MIN configurations include additional stages and switches for less blocking (redundant paths), but at more cost
Dynamic networks generally have Uniform Memory Access (UMA)
 –Equal time to access any part of memory
Static networks can be optimized for memory local to the processor
 –Static networks are generally NUMA

27 Other Dynamic Networks [figure]

28 Network Tradeoffs
Direct Networks
 –Enable placement for communication affinity (NUMA)
 –Low incremental costs for small systems and expansion
 –Require closely-coupled processor/switch design
 –High-dimensional networks map inefficiently to physical wiring

29 Network Tradeoffs
Indirect Networks
 –Can be built from standard processors and switches
 –Large fixed cost in switches, even for small systems
The trend is toward direct networks with low dimensionality

30 Dynamic Network Analysis
Time to transmit a message without contention ($T_c$)
 –$n$ is the number of stages
 –$T_c = n + (l/w) + 1$ (for $h = 1$); usually $n + (l/w) \gg 1$, so
 –$T_c = n + (l/w)$ network cycles
Model contention with $M_B/D/1$
 –$p = \rho/k$ (going to k inputs)
 –$\rho$ is the probability that a processor is sending a message
 –$\rho = m \times (l/w)$ (service time = $l/w$)
 –$m$ = prob(a particular node makes a request in a cycle)

31 Dynamic Network Analysis
Queueing delay: $T_{dynamic} = T_c + T_w$
 –$T_w = \dfrac{\rho\,(l/w)\,(1 - 1/k)}{2\,(1 - \rho)}$
 –$T_c = n + l/w$
 –All expressed in network cycles ($T_{ch}$)
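The dynamic-network model can be evaluated with a short function. This sketch reads the occupancy symbol in the waiting-time expression as ρ = m (l/w), which is an interpretation of the extracted text; the function name and the numbers in the usage note are made up for illustration.

```python
def dynamic_latency(n, l, w, k, m):
    """Message latency in network cycles for an n-stage MIN of k x k switches.

    n: stages, l: payload bits, w: link width, k: switch degree,
    m: probability that a node makes a request in a cycle.
    """
    service = l / w    # service time of one message, in cycles
    rho = m * service  # occupancy (assumed reading of the lost symbol)
    if not 0 <= rho < 1:
        raise ValueError("occupancy must satisfy 0 <= rho < 1")
    t_c = n + service  # contention-free time, with the +1 term neglected
    t_w = rho * service * (1 - 1 / k) / (2 * (1 - rho))  # queueing delay
    return t_c + t_w
```

With n = 10 stages, l = 200 bits, w = 20, k = 2 and m = 0.05, the occupancy is 0.5, the contention-free time is 20 cycles, and the waiting time adds 2.5 cycles for a total of 22.5.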

32 Static Network Analysis
For a static (k,n) network
 –let $k_d$ be the average number of network hops for a message to transit a single dimension
 –for a bidirectional network with closure, $k_d = k/4$ (k even)
Time to transmit a message without contention ($T_c$)
 –$T_c = n \times k_d + (l/w)$ in network cycles (for $h = 1$)

33 Static Network Analysis
Model contention with M/G/1 for k large (k > 8) and M/D/1 for k smaller
 –$\lambda = m\,n\,k_d$ ($n\,k_d$ is the average number of hops for a message)
 –$\mu = 2nw/l$ (each node has 2n channels)
 –$\rho = \lambda/\mu = m\,k_d\,(l/2w)$
 –For M/G/1: $T_w = \dfrac{\rho}{1 - \rho}\,\dfrac{l}{w}\,\dfrac{k_d - 1}{k_d^{2}}\,\left(1 + \dfrac{1}{n}\right)$
 –For M/D/1: $T_w = \dfrac{\rho}{2\,(1 - \rho)}\,\dfrac{l}{w}$
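The static-network model can be put into code in the same way. This sketch reads the three garbled symbols as arrival rate λ = m n k_d, service rate μ = 2nw/l and occupancy ρ = λ/μ, which is the only reading consistent with the third relation on the slide; the function name and the example values are assumptions.

```python
def static_latency(n, k, l, w, m):
    """Message latency in network cycles for a bidirectional (k, n) torus.

    n: dimensions, k: nodes per dimension (even), l: payload bits,
    w: link width, m: probability that a node makes a request in a cycle.
    """
    k_d = k / 4                   # average hops per dimension
    service = l / w               # cycles to push the payload through one link
    rho = m * k_d * l / (2 * w)   # occupancy, lambda / mu
    if not 0 <= rho < 1:
        raise ValueError("occupancy must satisfy 0 <= rho < 1")
    t_c = n * k_d + service       # contention-free time (h = 1)
    if k > 8:
        # M/G/1 waiting time, used for large k
        t_w = (rho / (1 - rho)) * service * ((k_d - 1) / k_d ** 2) * (1 + 1 / n)
    else:
        # M/D/1 waiting time, used for smaller k
        t_w = (rho / (2 * (1 - rho))) * service
    return t_c + t_w
```

For a 32 x 32 torus (n = 2, k = 32) with l = 200, w = 20 and m = 0.0125, the occupancy is 0.5 and the latency is 26 + 1.640625 cycles; for an 8 x 8 torus with m = 0.05 the M/D/1 branch gives 14 + 5 = 19 cycles.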

34 Static vs. Dynamic Network Example
[figure panels: With locality, Without locality]
 –N = 1024 processing elements
 –l = 200 bits
 –Pins per switch = 64 (fan-in + fan-out)

35 Bisection Width
Bisection width is the minimum number of wires cut when a network is divided into two equal halves
If links (rather than nodes) dominate cost, then network comparisons should be based on equivalent bisection width, B
 –For static (k,n): $B(k,n) = 2wN/k$
 –For dynamic with $k \times k = 2 \times 2$: $B = wN$
So higher-dimensional static networks have shorter “virtual” latency (number of hops) than lower-dimensional networks, but the planar (or even 3D) realization of the physical wiring reduces performance
 –w is reduced for the same number of interconnect layers
 –wires are longer/slower
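The two bisection-width expressions can be used to size links for a fair comparison. The sketch below is illustrative: the function names are hypothetical, and `matching_static_width` simply solves 2 w_static N / k = w_dynamic N for w_static, assuming both networks connect the same N nodes.

```python
def bisection_static(k, n, w):
    """B(k, n) = 2 w N / k with N = k**n, for a bidirectional (k, n) torus."""
    return 2 * w * k ** (n - 1)


def bisection_dynamic_2x2(N, w):
    """B = w N for a dynamic network built from 2 x 2 switches."""
    return w * N


def matching_static_width(k, w_dynamic):
    """Static link width that equals the bisection width of a 2 x 2 dynamic network."""
    return w_dynamic * k / 2
```

For N = 1024 arranged as a 32 x 32 torus, a link width of 1 gives B = 64, while a 2 x 2 dynamic network with w = 1 gives B = 1024; the torus would need 16-wire links to match.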

36 Hotspots and Combining
Network traffic (especially synchronization) may be directed to a single location in memory, creating a hotspot
Hotspots can be mitigated by adding logic to the switch
 –Fetch and Add instructions directed to a hotspot can be combined in the switch, and the fetched result is later updated and split in the switch
t = fraction of references going to the hotspot
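Combining can be illustrated with a small model of one switch. This is a sketch under simplifying assumptions: all requests arrive together, target the same address, and are answered as if they had executed in arrival order. The names `combine`, `split` and `fetch_and_add_combined` are hypothetical.

```python
def combine(increments):
    """Merge several Fetch-and-Add increments into one request to memory."""
    return sum(increments)


def split(old_value, increments):
    """Rebuild the individual replies from the single value memory returned.

    Each requester sees the value it would have fetched had the requests
    been applied one after another in arrival order.
    """
    replies = []
    running = old_value
    for inc in increments:
        replies.append(running)
        running += inc
    return replies


def fetch_and_add_combined(memory, addr, increments):
    """One memory access serves all requests; returns the per-requester replies."""
    old_value = memory[addr]
    memory[addr] = old_value + combine(increments)
    return split(old_value, increments)
```

With the hotspot location holding 10 and three requests adding 1, 2 and 3, memory is accessed once, ends at 16, and the requesters receive 10, 11 and 13.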

37 Multiprocessing Summary
Multi-Threaded
 –Area of research and potential future practical application
 –Driven by diminishing returns of single-threaded performance/cost and by emerging programming environments
Shared-Bus
 –Established mainstream technology for all but the most cost-sensitive applications
 –Building block for scalable MP
Scalable
 –Technology in stages of advanced research and early adoption
 –Static, direct networks with low dimensionality are winning
Massively Parallel
 –Remains the “holy grail”
