COMP8330/7330/7336 Advanced Parallel and Distributed Computing: Tree-Based Networks and Cache Coherence. Dr. Xiao Qin, Auburn University

Recap: Multistage (Omega) Networks; Completely-Connected and Star-Connected Networks; Linear Arrays, Meshes, and k-d Meshes; Hypercubes

Tree-Based Networks. Complete binary tree networks: (a) a static tree network; (b) a dynamic tree network. Trees can be laid out in 2D with no wire crossings.

Tree Properties. The distance between any two nodes is no more than 2 log p. Links higher up the tree potentially carry more traffic than those at the lower levels. Q1: Solution? A fat tree, which fattens the links as we go up the tree.
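A quick worked check of the 2 log p bound, with the node count chosen purely for illustration:

```latex
% Illustrative example: p = 1024 processing nodes at the leaves of a complete binary tree.
% A message climbs at most \log_2 p links to the root and descends at most \log_2 p links.
d_{\max} \le \log_2 p + \log_2 p = 2\log_2 1024 = 20 \ \text{links}
```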

Fat Trees. A fat tree network of 16 processing nodes.

Evaluating Static Interconnection Networks. Diameter: the distance between the two farthest nodes in the network. Bisection Width: the minimum number of wires that must be cut to divide the network into two equal halves. Cost: the number of links or switches (whichever is asymptotically higher) is a meaningful measure of cost. Q2: What is the purpose of each metric?

Evaluating Static Interconnection Networks NetworkDiameter Bisection Width Arc Connectivity Cost (No. of links) Completely-connected Star Complete binary tree Linear array 2-D mesh, no wraparound 2-D wraparound mesh Hypercube Wraparound k-ary d-cube 7

Evaluating Dynamic Interconnection Networks NetworkDiameter Bisection Width Arc Connectivity Cost (No. of links) Crossbar Omega Network Dynamic Tree 8 Q3: Which network provides the best tradeoff?

Cache Coherence in Multiprocessor Systems. Hardware is required to coordinate access to data that might have multiple copies in the network. [Figure: processors P1, P2, and P3, each with a private cache, connected to memory and I/O devices; P1 and P3 read u = 5 into their caches, P3 then writes u = 7, so later reads of u may see a stale value.]

Cache Coherence: Two Strategies. Cache coherence in multiprocessor systems: (a) the invalidate protocol; (b) the update protocol for shared variables. Q4: Can you compare the two?

Cache Coherence: Update and Invalidate Protocols. If a processor reads a value once and does not need it again, an update protocol may generate significant overhead. If two processors make interleaved tests and updates to a variable, an update protocol is better. Both protocols suffer from false-sharing overhead: two processors access different words that are not actually shared but happen to lie on the same cache line. Most current machines use invalidate protocols.
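A minimal OpenMP sketch of the false-sharing effect described above (the thread count, iteration count, and 64-byte line size are assumptions for this example, not values from the slides): each thread increments only its own counter, yet the unpadded counters share a cache line, so the coherence protocol bounces the line between caches; padding each counter to a full line removes that traffic.

```c
#include <omp.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   10000000
#define LINE     64   /* assumed cache-line size in bytes */

/* Unpadded: all counters fit on one or two cache lines -> false sharing.
 * volatile keeps the compiler from collapsing the increment loops. */
volatile long counters[NTHREADS];

/* Padded: each counter occupies its own cache line. */
struct padded { volatile long value; char pad[LINE - sizeof(long)]; };
struct padded padded_counters[NTHREADS];

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < NITERS; i++)
            counters[id]++;                  /* line ping-pongs between caches */
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < NITERS; i++)
            padded_counters[id].value++;     /* each thread owns its own line  */
    }
    double t2 = omp_get_wtime();

    printf("with false sharing: %.3f s    padded: %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}
```

Compile with `gcc -fopenmp`; on most multicore machines the padded loop runs noticeably faster even though both loops do the same work.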

Maintaining Coherence Using Invalidate Protocols. Each copy of a data item is associated with a state. One example of such a set of states is shared, invalid, and dirty. Q5: What candidate states can you propose?

The Invalidate Protocol. In the shared state, there are multiple valid copies of the data item, so an invalidate must be generated on an update. In the dirty state, only one copy exists, so no invalidates need to be generated. In the invalid state, the data copy is invalid, so a read generates a data request (and the associated state changes).
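One way to make these transitions concrete is the sketch below; the event and state names are invented for this example rather than taken from the slides, and a real controller would also move data and send bus messages at each transition.

```c
#include <stdio.h>

/* Illustrative three-state (shared / dirty / invalid) protocol sketch. */
typedef enum { INVALID, SHARED, DIRTY } line_state;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event;

line_state next_state(line_state s, event e) {
    switch (s) {
    case INVALID:
        if (e == LOCAL_READ)  return SHARED;   /* fetch a valid copy            */
        if (e == LOCAL_WRITE) return DIRTY;    /* fetch and invalidate others   */
        return INVALID;
    case SHARED:
        if (e == LOCAL_WRITE)  return DIRTY;   /* send invalidate, become owner */
        if (e == REMOTE_WRITE) return INVALID; /* another cache invalidated us  */
        return SHARED;                         /* reads keep the copy shared    */
    case DIRTY:
        if (e == REMOTE_READ)  return SHARED;  /* write back, share the copy    */
        if (e == REMOTE_WRITE) return INVALID; /* write back, then invalidate   */
        return DIRTY;                          /* local accesses stay dirty     */
    }
    return INVALID;
}

int main(void) {
    line_state s = INVALID;
    event trace[] = { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE };
    for (int i = 0; i < 4; i++) {
        s = next_state(s, trace[i]);
        printf("after event %d: state %d\n", i, s);
    }
    return 0;
}
```

Drawing these transitions as edges between the three states gives the kind of state diagram asked for in Q6 on the next slide.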

The Invalidate Protocol. Q6: Design a state diagram of a simple three-state coherence protocol.

Example: the Invalidate Protocol. Example of parallel program execution with the simple three-state coherence protocol.

Snoopy Cache Systems. How are invalidates sent to the right processors? There is a broadcast medium (a bus); every cache listens to all invalidates and read requests on it and performs the appropriate coherence operations locally. A simple snoopy-bus-based cache coherence system.

Performance of Snoopy Caches. Once copies of data are tagged dirty, all subsequent operations can be performed locally on the cache without generating external traffic. If a data item is read by a number of processors, it transitions to the shared state in the cache and all subsequent read operations become local. If processors read and update data at the same time, they generate coherence requests on the bus, which is ultimately bandwidth limited.

Directory-Based Systems. An inherent limitation of snoopy caches is that each coherence operation is sent to all processors. Q7: Solution? Send coherence requests only to those processors that need to be notified. This is done using a directory, which maintains a presence vector for each data item (cache line) along with its global state.
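A minimal sketch of the directory entry just described (the struct and function names are invented for illustration): one presence bit per processor plus a global state, so a write triggers invalidations only for caches whose bit is set.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_PROCS 64   /* illustrative machine size */

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_DIRTY } dir_state;

/* One directory entry per memory block (cache line). */
typedef struct {
    uint64_t  presence;   /* bit i set => processor i holds a copy */
    dir_state state;      /* global state of the block             */
} dir_entry;

/* Handle a write request from processor `writer`: invalidate only the
 * caches whose presence bit is set, then mark the block dirty. */
void handle_write(dir_entry *e, int writer) {
    for (int p = 0; p < NUM_PROCS; p++) {
        if (p != writer && (e->presence & (1ULL << p)))
            printf("send invalidate to processor %d\n", p);  /* stand-in for a network message */
    }
    e->presence = 1ULL << writer;
    e->state = DIR_DIRTY;
}

int main(void) {
    dir_entry e = { .presence = (1ULL << 3) | (1ULL << 7), .state = DIR_SHARED };
    handle_write(&e, 3);   /* only processor 7 receives an invalidate */
    return 0;
}
```

The point of the presence vector is visible in the example: the write by processor 3 generates exactly one invalidation instead of a broadcast to all processors.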

Directory-Based Systems: (a) a centralized directory; (b) a distributed directory. Q8: What are the problems with a centralized directory?

Performance of Directory-Based Schemes. The need for a broadcast medium is replaced by the directory. The additional bits needed to store the directory may add significant overhead. The directory is a point of contention; therefore, distributed directory schemes must be used. The underlying network must be able to carry all the coherence requests.
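A rough back-of-the-envelope illustration of that storage overhead (processor count and line size are chosen purely for illustration, assuming one presence bit per processor per cache line):

```latex
% Illustrative directory overhead: p = 256 processors, 64-byte (512-bit) cache lines,
% full presence vector of p bits (plus a few state bits) per line.
\text{overhead} \approx \frac{p\ \text{bits}}{512\ \text{bits}}
                = \frac{256}{512} = 50\% \ \text{of memory capacity}
```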

Communication Costs in Parallel Machines. Communication is a major overhead in parallel programs. The cost of communication depends on a variety of features: programming-model semantics, network topology, data handling and routing, and the associated software protocols.

Summary: Tree-Based Networks; Cache Coherence: Update vs. Invalidate Protocols; A state diagram of a simple three-state coherence protocol; Directory-Based Systems