ECE669 L25: Final Exam Review May 6, 2004 ECE 669 Parallel Computer Architecture Lecture 25 Final Exam Review.

Slides:



Advertisements
Similar presentations
Comparison Of Network On Chip Topologies Ahmet Salih BÜYÜKKAYHAN Fall.
Advertisements

1 Uniform memory access (UMA) Each processor has uniform access time to memory - also known as symmetric multiprocessors (SMPs) (example: SUN ES1000) Non-uniform.
SE-292 High Performance Computing
Cache Coherent Distributed Shared Memory. Motivations Small processor count –SMP machines –Single shared memory with multiple processors interconnected.
Latency Tolerance: what to do when it just won’t go away CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.
1 Lecture 12: Interconnection Networks Topics: dimension/arity, routing, deadlock, flow control.
1 Interconnection Networks Direct Indirect Shared Memory Distributed Memory (Message passing)
ECE669 L20: Evaluation and Message Passing April 13, 2004 ECE 669 Parallel Computer Architecture Lecture 20 Evaluation and Message Passing.
1 Lecture 23: Interconnection Networks Topics: communication latency, centralized and decentralized switches (Appendix E)
ECE669 L12: Interconnection Network Performance March 9, 2004 ECE 669 Parallel Computer Architecture Lecture 12 Interconnection Network Performance.
CIS629 Coherence 1 Cache Coherence: Snooping Protocol, Directory Protocol Some of these slides courtesty of David Patterson and David Culler.
NUMA Mult. CSE 471 Aut 011 Interconnection Networks for Multiprocessors Buses have limitations for scalability: –Physical (number of devices that can be.
CS 258 Parallel Computer Architecture Lecture 4 Network Topology and Routing February 4, 2008 Prof John D. Kubiatowicz
CS 258 Parallel Computer Architecture Lecture 5 Routing February 6, 2008 Prof John D. Kubiatowicz
1 Lecture 24: Interconnection Networks Topics: topologies, routing, deadlocks, flow control Final exam reminders:  Plan well – attempt every question.
CS252 Graduate Computer Architecture Lecture 15 Multiprocessor Networks (con’t) March 15 th, 2010 John Kubiatowicz Electrical Engineering and Computer.
1 Lecture 24: Interconnection Networks Topics: communication latency, centralized and decentralized switches (Sections 8.1 – 8.5)
CS 258 Parallel Computer Architecture Lecture 3 Introduction to Scalable Interconnection Network Design January 30, 2008 Prof John D. Kubiatowicz
ECE669 L18: Scalable Parallel Caches April 6, 2004 ECE 669 Parallel Computer Architecture Lecture 18 Scalable Parallel Caches.
EECS 570: Fall rev1 1 Chapter 10: Scalable Interconnection Networks.
1 Distributed Memory Computers and Programming. 2 Outline Distributed Memory Architectures Topologies Cost models Distributed Memory Programming Send.
1 CSE SUNY New Paltz Chapter Nine Multiprocessors.
NUMA coherence CSE 471 Aut 011 Cache Coherence in NUMA Machines Snooping is not possible on media other than bus/ring Broadcast / multicast is not that.
Interconnection Network Topology Design Trade-offs
1 Lecture 24: Interconnection Networks Topics: topologies, routing, deadlocks, flow control.
CPE 731 Advanced Computer Architecture Snooping Cache Multiprocessors Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
1 Lecture 25: Interconnection Networks Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) Review session,
CS252/Patterson Lec /28/01 CS162 Computer Architecture Lecture 16: Multiprocessor 2: Directory Protocol, Interconnection Networks.
1 Static Interconnection Networks CEG 4131 Computer Architecture III Miodrag Bolic.
ECE669 L17: Memory Systems April 1, 2004 ECE 669 Parallel Computer Architecture Lecture 17 Memory Systems.
ECE669 L19: Processor Design April 8, 2004 ECE 669 Parallel Computer Architecture Lecture 19 Processor Design.
ECE669 L16: Interconnection Topology March 30, 2004 ECE 669 Parallel Computer Architecture Lecture 16 Interconnection Topology.
CS252 Graduate Computer Architecture Lecture 14 Multiprocessor Networks March 9 th, 2011 John Kubiatowicz Electrical Engineering and Computer Sciences.
John Kubiatowicz Electrical Engineering and Computer Sciences
Interconnect Network Topologies
CS252 Graduate Computer Architecture Lecture 15 Multiprocessor Networks March 14 th, 2011 John Kubiatowicz Electrical Engineering and Computer Sciences.
CS252 Graduate Computer Architecture Lecture 15 Multiprocessor Networks March 12 th, 2012 John Kubiatowicz Electrical Engineering and Computer Sciences.
Interconnect Networks
Network Topologies Topology – how nodes are connected – where there is a wire between 2 nodes. Routing – the path a message takes to get from one node.
1 Interconnects Shared address space and message passing computers can be constructed by connecting processors and memory unit using a variety of interconnection.
CSE Advanced Computer Architecture Week-11 April 1, 2004 engr.smu.edu/~rewini/8383.
Course Wrap-Up Miodrag Bolic CEG4136. What was covered Interconnection network topologies and performance Shared-memory architectures Message passing.
Chapter 6 Multiprocessor System. Introduction  Each processor in a multiprocessor system can be executing a different instruction at any time.  The.
Multiprocessor Interconnection Networks Todd C. Mowry CS 740 November 3, 2000 Topics Network design issues Network Topology.
ECE669 L21: Routing April 15, 2004 ECE 669 Parallel Computer Architecture Lecture 21 Routing.
Anshul Kumar, CSE IITD CSL718 : Multiprocessors Interconnection Mechanisms Performance Models 20 th April, 2006.
Ch4. Multiprocessors & Thread-Level Parallelism 2. SMP (Symmetric shared-memory Multiprocessors) ECE468/562 Advanced Computer Architecture Prof. Honggang.
InterConnection Network Topologies to Minimize graph diameter: Low Diameter Regular graphs and Physical Wire Length Constrained networks Nilesh Choudhury.
Anshul Kumar, CSE IITD ECE729 : Advanced Computer Architecture Lecture 27, 28: Interconnection Mechanisms In Multiprocessors 29 th, 31 st March, 2010.
Super computers Parallel Processing
Topology How the components are connected. Properties Diameter Nodal degree Bisection bandwidth A good topology: small diameter, small nodal degree, large.
1 Lecture 24: Interconnection Networks Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix F)
Spring EE 437 Lillevik 437s06-l22 University of Portland School of Engineering Advanced Computer Architecture Lecture 22 Distributed computer Interconnection.
1 Lecture 14: Interconnection Networks Topics: dimension vs. arity, deadlock.
COMP8330/7330/7336 Advanced Parallel and Distributed Computing Tree-Based Networks Cache Coherence Dr. Xiao Qin Auburn University
Interconnection Networks Communications Among Processors.
1 Computer Architecture & Assembly Language Spring 2001 Dr. Richard Spillman Lecture 26 – Alternative Architectures.
Parallel Architecture
Lecture 23: Interconnection Networks
Interconnection Network Routing, Topology Design Trade-offs
John Kubiatowicz Electrical Engineering and Computer Sciences
CMSC 611: Advanced Computer Architecture
Interconnection Network Design Lecture 14
Static Interconnection Networks
Interconnection Networks Contd.
Embedded Computer Architecture 5SAI0 Interconnection Networks
CS 6290 Many-core & Interconnect
Static Interconnection Networks
Networks: Routing and Design
Presentation transcript:

ECE669 L25: Final Exam Review May 6, 2004 ECE 669 Parallel Computer Architecture Lecture 25 Final Exam Review

ECE669 L25: Final Exam Review May 6, 2004 Bandwidth & Latency Latency Average latency k d : Bandwidth per node - more complex - if all near neighbor messages -If average travel dist then? -Let Direct k nk  worst case n (k  1) 2  torus  1 dir channels ~n k 4  torus  2 dir channels ~n k 3  no end conns  2 dir channels 1 B Avg DIST  nk 3 Msg size  B

ECE669 L25: Final Exam Review May 6, 2004 Analogy –If each student takes 8 years to graduate –And if a Prof. can support 10 students max at any time How many new students can Prof. take on in a year? Each year take on °Similarly: Network has channels # of flits it can sustain = # of msgs it can concurrently sustain= Each msg flit uses channels to dest So Nn N x nk 3  Nn B or BW per node  x  3 kB Nn B nk 3

ECE669 L25: Final Exam Review May 6, 2004 Another way of getting BW is: Max # msgs in net at any time These take cycles to get delivered, during which time no new mgs can get in I.e. we can inject or # injected per node per cycle °Note: We have not considered contention thus far. °In practice, latency shoots up much before we achieve the theoretical due to contention. = kn 3  Nn B B msgs every kn 3 cycles = Nn B · 3 kn · 1 N = 3 Bk

ECE669 L25: Final Exam Review May 6, 2004 Deriving U is not easy °Notice m = Probability of a message on a useful processor cycle m eff = m U = probability of msg on any cycle T = f (m eff ) = network delay as a function of m or # of useful processor cycles depends on T and m eff °Cyclic dependence! T BW m U T T1T1 m1m1 m2m2 m0m0 U 1 m 2 U 0 m 1 =U 0 m 0 T0T0 T0T0 T1T1 U= mTmT

ECE669 L25: Final Exam Review May 6, 2004 Linear Arrays and Rings °Linear Array Diameter? Average Distance? Bisection bandwidth? Route A -> B given by relative address R = B-A °Torus? °Examples: FDDI, SCI, KSR1

ECE669 L25: Final Exam Review May 6, 2004 Multidimensional Meshes and Tori °n-dimensional k-ary mesh: N = k n k = n  N described by n-vector of radix k coordinate °n-dimensional k-ary torus (or k-ary n-cube)? 2D Grid 3D Cube

ECE669 L25: Final Exam Review May 6, 2004 Trees °Diameter and ave distance logarithmic k-ary tree, height d = log k N address specified d-vector of radix k coordinates describing path down from root °Fixed degree °H-tree space is O(N) with O(  N) long wires °Bisection BW?

ECE669 L25: Final Exam Review May 6, 2004 Fat-Trees °Fatter links (really more of them) as you go up, so bisection BW scales with N

ECE669 L25: Final Exam Review May 6, 2004 Butterflies °Tree with lots of roots! °N log N (actually N/2 x logN) °Exactly one route from any source to any dest °Bisection N/2 16 node butterfly building block

ECE669 L25: Final Exam Review May 6, 2004 Hypercubes °Also called binary n-cubes. # of nodes = N = 2 n. °O(logN) Hops °Good bisection BW °Complexity Out degree is n = logN correct dimensions in order with random comm. 2 ports per processor 0-D1-D2-D3-D 4-D 5-D !

ECE669 L25: Final Exam Review May 6, 2004 Toplology Summary °All have some “bad permutations” many popular permutations are very bad for meshs (transpose) ramdomness in wiring or routing makes it hard to find a bad one! TopologyDegreeDiameterAve DistBisectionD (D P=1024 1D Array2N-1N / 31huge 1D Ring2N/2N/42 2D Mesh42 (N 1/2 - 1)2/3 N 1/2 N 1/2 63 (21) 2D Torus4N 1/2 1/2 N 1/2 2N 1/2 32 (16) k-ary n-cube2nnk/2nk/4nk/415 Hypercuben =log Nnn/2N/210 (5)

ECE669 L25: Final Exam Review May 6, 2004 Example Cache Coherence Problem Processors see different values for u after event 3 With write back caches, value written back to memory depends on happenstance of which cache flushes or writes back value when -Processes accessing main memory may see very stale value Unacceptable to programs, and frequent! I/O devices Memory P 1 $$ $ P 2 P 3 5 u = ? 4 u u :5 1 u 2 u 3 u = 7

ECE669 L25: Final Exam Review May 6, 2004 Snoopy Cache-Coherence Protocols °Bus is a broadcast medium & Caches know what they have °Cache Controller “snoops” all transactions on the shared bus relevant transaction if for a block it contains take action to ensure coherence -invalidate, update, or supply value depends on state of the block and the protocol State Address Data

ECE669 L25: Final Exam Review May 6, A= 1 x= LOOP:If (x= = 0) GOTO LOOP; b=A C P1P1 P2P2 M A=0 M x=0 C A=0 x=0 A=0 x=0 Does caching violate this model? If b = = 0 at the end, sequential consistency is violated

ECE669 L25: Final Exam Review May 6, 2004 If b = = 0 at the end, sequential consistency is violated A= 1 x= A= 1 x= b=0!VIOLATION! 5 3 inv x 4 x delay LOOP:If (x= = 0) GOTO LOOP; b=A C P1P1 P2P2 M A=0 M x=0 C A=0 x=0 A=0 x=0 Does caching violate this model?

ECE669 L25: Final Exam Review May 6, 2004 Does caching violate this model? LOOP:If (x= = 0) GOTO LOOP; b=A b= 1 ! o.k. C P1P1 P2P2 M A=0 M x=0 C A=0 x=0 A=0 x=0 x= 1 3 delay... A= 1 ACK A= 1 x= 1 A= 1 x= fence inv x 3

ECE669 L25: Final Exam Review May 6, 2004 Coherence in small machines: Snooping Caches Broadcast address on shared write Everyone listens (snoops) on bus to see if any of their own addresses match How do you know when to broadcast, invalidate... -State associated with each cache line cache directory cache directory cache snoop Dual ported M a a a a Broadcast write Processor Purge Match a 5

ECE669 L25: Final Exam Review May 6, 2004 State diagram for ownership protocols For each shared data cache block Ownership In ownership protocol: writer owns exclusive copy invalid write-dirtyread-clean Local Read Remote Write (Replace) Remote Write (Replace) Remote Read Local Write Local Write Broadcast a

ECE669 L25: Final Exam Review May 6, 2004 Network °LimitLESS directories: Limited directories Locally Extended through Software Support Trap processor when 5th request comes Processor extends directory into local memory M M M CCC P PP... 5th req Trap 2 3 1

ECE669 L25: Final Exam Review May 6, 2004 Zero pointer LimitLESS: All software coherence mem cache proc comm directory trap always mem cache proc comm remote mem cache proc comm local

ECE669 L25: Final Exam Review May 6, 2004 Hierarchical - E.g. KSR (actually has rings...) cache P PP PPP P disks? MEMORY RD A A A A A A A?

ECE669 L25: Final Exam Review May 6, 2004 Include network latency °Each request suffers T cycles of latency °Processor utilization = Processor utilization Network bandwidth also wasted because of lost issue opportunities! FIX? Proc. Net. t BK d Processor idle Latency T

ECE669 L25: Final Exam Review May 6, 2004 One solution Overlap communication with computation. Multithread” the processor And/or allow multiple outstanding requests -- non-blocking memory Net. T BK d Processor utilization or Context switch interval Z  pt t  T ifpt  (t  T)

ECE669 L25: Final Exam Review May 6, 2004 Synchronization delays °If no multithreading Fault Wasted processor cycles Satisfied Process 1 Process 2 Process 3 Synchronization fault 1 Synchronization fault 1 satisfied

ECE669 L25: Final Exam Review May 6, Full system simulation (coupled) Processor Memory Inter- connection network # # Wait, traps Acknowledgements, responses 4000x slowdown per node Very accurate Compiled application Memory requests Synchronization requests Network requests Processor simulator Cache and memory systems simulator Interconnection network simulator Measures: Speedup, runtime proc. util. net latency cache miss rate......

ECE669 L25: Final Exam Review May 6, 2004 Active Messages °User-level analog of network transaction transfer data packet and invoke handler to extract it from the network and integrate with on-going computation °Request/Reply °Event notification: interrupts, polling, events? °May also perform memory-to-memory transfer Request handler Reply

ECE669 L25: Final Exam Review May 6, 2004 Deadlock Freedom °How can it arise? necessary conditions: -shared resource -incrementally allocated -non-preemptible think of a channel as a shared resource that is acquired incrementally -source buffer then dest. buffer -channels along a route °How do you avoid it? constrain how channel resources are allocated ex: dimension order °How do you prove that a routing algorithm is deadlock free

ECE669 L25: Final Exam Review May 6, 2004 More examples: °Consider other topologies butterfly? tree? fat tree? °Any assumptions about routing mechanism? amount of buffering? °What about wormhole routing on a ring?

ECE669 L25: Final Exam Review May 6, 2004 Deadlock free wormhole networks? °Basic dimension order routing techniques don’t work for k-ary n-cubes only for k-ary n-arrays (bi-directional) °Idea: add channels! provide multiple “virtual channels” to break the dependence cycle good for BW too! Do not need to add links, or xbar, only buffer resources

ECE669 L25: Final Exam Review May 6, 2004 Breaking deadlock with virtual channels

ECE669 L25: Final Exam Review May 6, 2004 Turn Restrictions in X,Y °XY routing forbids 4 of 8 turns and leaves no room for adaptive routing °Can you allow more turns and still be deadlock free

ECE669 L25: Final Exam Review May 6, 2004 How do you build a crossbar

ECE669 L25: Final Exam Review May 6, 2004 Examples °Short Links °long links several flits on the wire

ECE669 L25: Final Exam Review May 6, 2004 Modulo Unrolling – Smart Memory °Loop unrolling relies on dependencies °Allow maximum parallelism °Minimize communication