COMP8330/7330/7336 Advanced Parallel and Distributed Computing Communication Costs in Parallel Machines Dr. Xiao Qin Auburn University

Recap: Tree-Based Networks. A state diagram of a simple three-state coherence protocol.

The Invalidate Protocol

Example: the Invalidate Protocol. An example of parallel program execution with the simple three-state coherence protocol.
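The execution example above can be sketched as a toy simulation. The class and method names below are illustrative assumptions, not code from the lecture; the three states (invalid, shared, dirty) follow the protocol on the slide.

```python
# Minimal sketch of an invalidate-based three-state coherence
# protocol. States: "invalid", "shared", "dirty". Illustrative only.

class Cache:
    def __init__(self, name):
        self.name = name
        self.state = "invalid"

    def read(self, peers):
        # A read of an invalid line fetches a shared copy; a peer
        # holding a dirty copy must first write back and downgrade.
        if self.state == "invalid":
            for p in peers:
                if p.state == "dirty":
                    p.state = "shared"
            self.state = "shared"

    def write(self, peers):
        # A write invalidates every other copy and marks this one dirty.
        for p in peers:
            p.state = "invalid"
        self.state = "dirty"

c0, c1 = Cache("P0"), Cache("P1")
c0.read([c1]); c1.read([c0])     # both now hold shared copies
c1.write([c0])                   # P1's write invalidates P0's copy
print(c0.state, c1.state)        # invalid dirty
```

Note how a subsequent read by P0 would force P1 to downgrade from dirty to shared, matching the state diagram.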

Snoopy Cache Systems How are invalidates sent to the right processors? All caches are attached to a broadcast medium; each cache listens to all invalidates and read requests on the bus and performs the appropriate coherence operations locally. A simple snoopy bus-based cache coherence system.

Performance of Snoopy Caches Once copies of data are tagged dirty, all subsequent operations can be performed locally on the cache without generating external traffic. If a data item is read by a number of processors, it transitions to the shared state in each cache and all subsequent read operations become local. If processors read and update data at the same time, however, they generate coherence requests on the bus, which is ultimately bandwidth limited.

Directory Based Systems An inherent limitation of snoopy caches: each coherence operation is sent to all processors. Q1: Solution? – Send coherence requests to only those processors that need to be notified. – This is done using a directory, which maintains a presence vector for each data item (cache line) along with its global state.
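The presence-vector idea can be sketched as follows. The entry states and method names are illustrative assumptions; the point is that a write consults the presence bits and sends invalidations only where a copy actually exists.

```python
# Sketch of one directory entry: a global state for the cache line
# plus a presence vector marking which processors hold a copy.

class DirectoryEntry:
    def __init__(self, num_procs):
        self.state = "uncached"            # uncached / shared / dirty
        self.presence = [False] * num_procs

    def read(self, proc):
        # A read adds the requester to the set of sharers.
        self.presence[proc] = True
        self.state = "shared"

    def write(self, proc):
        # A write sends invalidations only to processors whose
        # presence bit is set, then records a single dirty owner.
        targets = [p for p, bit in enumerate(self.presence)
                   if bit and p != proc]
        self.presence = [False] * len(self.presence)
        self.presence[proc] = True
        self.state = "dirty"
        return targets                     # invalidation targets

entry = DirectoryEntry(4)
entry.read(0); entry.read(2)
print(entry.write(1))                      # [0, 2]
```

Compare this with the snoopy scheme: no broadcast is needed, at the cost of storing one presence bit per processor per cache line.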

Directory Based Systems (a) a centralized directory; (b) a distributed directory. Q2: What are the problems of the centralized directory?

Performance of Directory Based Schemes The need for a broadcast medium is replaced by the directory. The additional bits to store the directory may add significant overhead. The directory is a point of contention; therefore, distributed directory schemes must be used. The underlying network must be able to carry all the coherence requests.

Communication Costs in Parallel Machines Communication is a major overhead in parallel programs. Q3: What features does the cost of communication depend on? – programming model semantics – network topology – data handling and routing – software protocols

Message Passing Costs in Parallel Computers The total time to transfer a message over a network comprises: – Startup time (t_s): time spent at the sending and receiving nodes (executing the routing algorithm, programming routers). – Per-hop time (t_h): a function of the number of hops; includes switch latencies and network delays. – Per-word transfer time (t_w): per-word overhead determined by the length of the message.

Store-and-Forward Routing The total communication cost for a message of size m words to traverse l communication links is t_comm = t_s + (t_h + t_w m) l. In most platforms, t_h is small, and we have t_comm ≈ t_s + t_w m l.
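As a quick check of the formula, here is a minimal cost calculator. The parameter values are illustrative assumptions, not figures from the lecture.

```python
# Store-and-forward cost model from the slide:
#   t_comm = t_s + (t_h + t_w * m) * l
# The whole m-word message is received and forwarded at each of the
# l links, so the per-link term scales with the message size m.

def store_and_forward(ts, th, tw, m, l):
    return ts + (th + tw * m) * l

# Example: a 100-word message over 4 links (illustrative parameters).
print(store_and_forward(ts=50.0, th=1.0, tw=0.5, m=100, l=4))  # 254.0
```

Doubling the number of links roughly doubles the dominant t_w·m·l term, which is exactly the weakness the next slide asks about.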

Routing Techniques Q4: What is the problem of store-and-forward routing? Solution?

Packet Routing The total communication time for packet routing is approximated by t_comm = t_s + t_h l + t_w m. Compare: the total communication time for store-and-forward routing is t_comm = t_s + (t_h + t_w m) l.

Cut-Through Routing Takes packet routing to an extreme: messages are further divided into basic units called flits. All flits are forced to take the same path. Since flits are typically small, the header information must be minimized. A tracer message first programs all intermediate routers; all flits then take the same route. Error checks are performed on the entire message, as opposed to individual flits. No sequence numbers are needed. Q5: Why the same path? Any benefit?

Cut-Through Routing The total communication time for cut-through routing is approximated by t_comm = t_s + l t_h + t_w m. This is identical to packet routing, but t_w is typically much smaller. Q6: What's new?
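The gap between the two cost models is easy to see numerically. The parameter values below are illustrative assumptions, reused from the earlier example.

```python
# Cut-through vs. store-and-forward, per the slides' cost models.

def cut_through(ts, th, tw, m, l):
    # Per-hop and per-word terms are additive, not multiplied:
    #   t_comm = t_s + l*t_h + t_w*m
    return ts + l * th + tw * m

def store_and_forward(ts, th, tw, m, l):
    #   t_comm = t_s + (t_h + t_w*m) * l
    return ts + (th + tw * m) * l

# Same illustrative parameters: cut-through avoids paying the
# t_w*m term at every one of the l hops.
print(cut_through(50.0, 1.0, 0.5, 100, 4))        # 104.0
print(store_and_forward(50.0, 1.0, 0.5, 100, 4))  # 254.0
```

As m or l grows, the advantage of cut-through (and packet) routing over store-and-forward widens accordingly.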

Routing Mechanisms for Interconnection Networks How does one compute the route that a message takes from source to destination? – Must prevent deadlocks – Must avoid hot-spots

Routing Mechanisms for Interconnection Networks Routing a message from node P_s (010) to node P_d (111) in a three-dimensional hypercube using E-cube routing.
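E-cube routing can be sketched by repeatedly flipping the least significant bit in which the current node and the destination differ. Encoding node labels as integers is an assumption made for the sketch.

```python
# E-cube routing in a d-dimensional hypercube: correct the
# dimensions in a fixed (lowest-first) order, which makes the
# route deterministic and deadlock-free.

def e_cube_route(src, dst):
    path = [src]
    node = src
    while node != dst:
        diff = node ^ dst
        lowbit = diff & -diff       # least significant differing bit
        node ^= lowbit              # traverse that dimension's link
        path.append(node)
    return path

# The slide's example: P_s = 010 to P_d = 111.
print([format(n, "03b") for n in e_cube_route(0b010, 0b111)])
# ['010', '011', '111']
```

The route length equals the Hamming distance between source and destination, which is the minimum possible in a hypercube.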

Mapping Techniques for Graphs Embed a known communication pattern into a given interconnection topology. We may have an algorithm designed for one network, which we are porting to another topology. Why is mapping between graphs important?

Mapping Techniques for Graphs: Metrics Mapping a graph G(V, E) into G'(V', E'): The maximum number of edges mapped onto any edge in E' is called the congestion of the mapping. The maximum number of links in E' that any edge in E is mapped onto is called the dilation of the mapping. The ratio of the number of nodes in set V' to that in set V is called the expansion of the mapping.
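These three metrics can be computed for a concrete mapping. The ring-onto-line example below is an illustrative assumption, not one from the slides: each ring edge is mapped onto the shortest path in a 4-node linear array.

```python
# Congestion, dilation, and expansion for mapping a 4-node ring
# onto a 4-node linear array (identity mapping of nodes).

from collections import Counter

ring_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
n_ring, n_line = 4, 4

def line_path(u, v):
    # The line-array edges on the unique path between nodes u and v.
    lo, hi = min(u, v), max(u, v)
    return [(i, i + 1) for i in range(lo, hi)]

load = Counter()        # how many ring edges cross each line edge
dilation = 0            # longest image path of any ring edge
for u, v in ring_edges:
    path = line_path(u, v)
    dilation = max(dilation, len(path))
    for e in path:
        load[e] += 1

congestion = max(load.values())
expansion = n_line / n_ring
print(congestion, dilation, expansion)   # 2 3 1.0
```

The wrap-around edge (3, 0) stretches across all three line links, which is what drives both the dilation of 3 and the congestion of 2 in this mapping.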

Summary An Example of the Invalidate Protocol Message Passing Costs in Parallel Computers Introduction to Mapping Techniques for Graphs