ECE669 L20: Evaluation and Message Passing April 13, 2004 ECE 669 Parallel Computer Architecture Lecture 20 Evaluation and Message Passing.

Slides:

Advertisements

Similar presentations

MPI Message Passing Interface

Advertisements

CS252 Graduate Computer Architecture Lecture 16 Multiprocessor Networks (con’t) March 16 th, 2011 John Kubiatowicz Electrical Engineering and Computer.

A Coherent and Managed Runtime for ML on the SCC KC SivaramakrishnanLukasz Ziarek Suresh Jagannathan Purdue University SUNY Buffalo Purdue University.

Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.

1 Lecture 4: Directory Protocols Topics: directory-based cache coherence implementations.

Graduate Computer Architecture, Fall 2005 Lecture 10 Distributed Memory Multiprocessors Shih-Hao Hung Computer Science & Information Engineering National.

AMLAPI: Active Messages over Low-level Application Programming Interface Simon Yau, Tyson Condie,

Lecture Objectives: 1)Explain the limitations of flash memory. 2)Define wear leveling. 3)Define the term IO Transaction 4)Define the terms synchronous.

Multiple Processor Systems

Cache Coherent Distributed Shared Memory. Motivations Small processor count –SMP machines –Single shared memory with multiple processors interconnected.

Realizing Programming Models CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.

Scalability CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.

Scalable Distributed Memory Multiprocessors Todd C. Mowry CS 495 October 24 & 29, 2002.

1 Lecture 2: Snooping and Directory Protocols Topics: Snooping wrap-up and directory implementations.

1 Lecture 1: Parallel Architecture Intro Course organization:  ~5 lectures based on Culler-Singh textbook  ~5 lectures based on Larus-Rajwar textbook.

I/O Hardware n Incredible variety of I/O devices n Common concepts: – Port – connection point to the computer – Bus (daisy chain or shared direct access)

Active Messages: a Mechanism for Integrated Communication and Computation von Eicken et. al. Brian Kazian CS258 Spring 2008.

CS252 Graduate Computer Architecture Lecture 17 Multiprocessor Networks (con’t) March 31 th, 2010 John Kubiatowicz Electrical Engineering and Computer.

Distributed Memory Multiprocessors CS 252, Spring 2005 David E. Culler Computer Science Division U.C. Berkeley.

ECE669 L25: Final Exam Review May 6, 2004 ECE 669 Parallel Computer Architecture Lecture 25 Final Exam Review.

Chapter 13: I/O Systems I/O Hardware Application I/O Interface

ECE669 L18: Scalable Parallel Caches April 6, 2004 ECE 669 Parallel Computer Architecture Lecture 18 Scalable Parallel Caches.

1 Distributed Memory Computers and Programming. 2 Outline Distributed Memory Architectures Topologies Cost models Distributed Memory Programming Send.

General Purpose Node-to-Network Interface in Scalable Multiprocessors CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.

7/2/2015 slide 1 PCOD: Scalable Parallelism (ICs) Per Stenström (c) 2008, Sally A. McKee (c) 2011 Scalable Multiprocessors What is a scalable design? (7.1)

Node-to-Network Interface in Scalable Multiprocessors CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.

CS 258 Parallel Computer Architecture Lecture 8 Network Interface Design February 20, 2008 Prof John D. Kubiatowicz

ECE669 L17: Memory Systems April 1, 2004 ECE 669 Parallel Computer Architecture Lecture 17 Memory Systems.

ECE669 L23: Parallel Compilation April 29, 2004 ECE 669 Parallel Computer Architecture Lecture 23 Parallel Compilation.

ECE669 L19: Processor Design April 8, 2004 ECE 669 Parallel Computer Architecture Lecture 19 Processor Design.

CS 258 Parallel Computer Architecture LimitLESS Directories: A Scalable Cache Coherence Scheme David Chaiken, John Kubiatowicz, and Anant Agarwal Presented:

1 What is message passing? l Data transfer plus synchronization l Requires cooperation of sender and receiver l Cooperation not always apparent in code.

MULTICOMPUTER 1. MULTICOMPUTER, YANG DIPELAJARI Multiprocessors vs multicomputers Interconnection topologies Switching schemes Communication with messages.

1 (Based on text: David A. Patterson & John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 3 rd Ed., Morgan Kaufmann,

Hardware Definitions –Port: Point of connection –Bus: Interface Daisy Chain (A=>B=>…=>X) Shared Direct Device Access –Controller: Device Electronics –Registers:

CSE 486/586, Spring 2012 CSE 486/586 Distributed Systems Distributed Shared Memory Steve Ko Computer Sciences and Engineering University at Buffalo.

User Datagram Protocol (UDP) Chapter 11. Know TCP/IP transfers datagrams around Forwarded based on destination’s IP address Forwarded based on destination’s.

Univ. of TehranAdv. topics in Computer Network1 Advanced topics in Computer Networks University of Tehran Dept. of EE and Computer Engineering By: Dr.

Analytic Evaluation of Shared-Memory Systems with ILP Processors Daniel J. Sorin, Vijay S. Pai, Sarita V. Adve, Mary K. Vernon, and David A. Wood Presented.

Chapter 8-2 : Multicomputers Multiprocessors vs multicomputers Multiprocessors vs multicomputers Interconnection topologies Interconnection topologies.

CS533 - Concepts of Operating Systems 1 The Mach System Presented by Catherine Vilhauer.

Graduate Computer Architecture I Lecture 11: Distribute Memory Multiprocessors.

The influence of system calls and interrupts on the performances of a PC cluster using a Remote DMA communication primitive Olivier Glück Jean-Luc Lamotte.

Computer Network Lab. Korea University Computer Networks Labs Se-Hee Whang.

Lecture 3 : Performance of Parallel Programs Courtesy : MIT Prof. Amarasinghe and Dr. Rabbah’s course note.

CS252 Graduate Computer Architecture Lecture 24 Network Interface Design Memory Consistency Models Prof John D. Kubiatowicz

Spring EE 437 Lillevik 437s06-l22 University of Portland School of Engineering Advanced Computer Architecture Lecture 22 Distributed computer Interconnection.

SPL/2010 Reactor Design Pattern 1. SPL/2010 Overview ● blocking sockets - impact on server scalability. ● non-blocking IO in Java - java.niopackage ●

Communication in Distributed Systems. . The single most important difference between a distributed system and a uniprocessor system is the interprocess.

Chapter 13: I/O Systems.

Module 12: I/O Systems I/O hardware Application I/O Interface

John Kubiatowicz Electrical Engineering and Computer Sciences

Chapter 13: I/O Systems I/O Hardware Application I/O Interface

Alternative system models

CMSC 611: Advanced Computer Architecture

Architecture of Parallel Computers CSC / ECE 506 Summer 2006 Scalable Programming Models Lecture 11 6/19/2006 Dr Steve Hunter.

CS 258 Reading Assignment 4 Discussion Exploiting Two-Case Delivery for Fast Protected Messages Bill Kramer February 13, 2002 #

Architectural Interactions in High Performance Clusters

Operating System Concepts

13: I/O Systems I/O hardwared Application I/O Interface

CS703 - Advanced Operating Systems

Computer Science Division

By Brian N. Bershad, Thomas E. Anderson, Edward D

CS 213 Lecture 11: Multiprocessor 3: Directory Organization

Chapter 13: I/O Systems I/O Hardware Application I/O Interface

Lecture 24: Multiprocessors

Chapter 13: I/O Systems I/O Hardware Application I/O Interface

Lecture 18: Coherence and Synchronization

Module 12: I/O Systems I/O hardwared Application I/O Interface

Multiprocessors and Multi-computers

Presentation transcript:

ECE669 L20: Evaluation and Message Passing April 13, 2004 ECE 669 Parallel Computer Architecture Lecture 20 Evaluation and Message Passing

ECE669 L20: Evaluation and Message Passing April 13, 2004 Performance Evaluation °Why? Evaluate tradeoffs Estimate machine performance Measure application behavior °How? Ask an expert --- hire a consultant? Measure existing machines --- (What?!) Build simulators Analytical models Hybrids --- combination of 2-3-4

ECE669 L20: Evaluation and Message Passing April 13, 2004 In the lab Parallel Traces Network Model Cache coherence simulator Parallel Address Trace Processor Utilization Network Model Msg probability, size Event frequencies Compute

ECE669 L20: Evaluation and Message Passing April 13, 2004 Example 1. Trace simulation, awk filter 2. Compute network parameters m,B 3. Compute processor utilization Given m,B and k d, n, N Derive U. Event %Probout-msgout-sizin-msgin-sizflits read miss in1.2507yes4yes* (see write mode formula) other msgs2.8483yes4yes * : in-siz: 4+ (block size/flit size)

ECE669 L20: Evaluation and Message Passing April 13, 2004 Evaluation Barriers implemented using distributed trees Read-only sharing marked Processor utilization U Weather pointers U Speech pointers No coherence 1 chain 2 chain Write through = limited dir No coherence 1 chain 2 chain Write through = limited dir LimitLESS

ECE669 L20: Evaluation and Message Passing April 13, Processor Memory Inter- connection network # # Evaluation

ECE669 L20: Evaluation and Message Passing April 13, Full system simulation (coupled) Processor Memory Inter- connection network # # Wait, traps Acknowledgements, responses 4000x slowdown per node Very accurate Compiled application Memory requests Synchronization requests Network requests Processor simulator Cache and memory systems simulator Interconnection network simulator Measures: Speedup, runtime proc. util. net latency cache miss rate......

ECE669 L20: Evaluation and Message Passing April 13, Processor Memory Inter- connection network # # Trace-driven simulation (coupled) Port ready Acknowledgements, responses Address trace Network requests Sequential address trace Cache and memory systems simulator Network simulator Trace scheduler E.g.From M.I.T. many parallel traces exist Proc0 Addr. Proc 1 Addr. Proc2 Addr x slowdown per node Very accurate

ECE669 L20: Evaluation and Message Passing April 13, Hybrid Network Model Processor Memory Inter- connection network # # Wait Response, latency Compiled application Parallel address trace Requests Processor simulator Cache and memory system simulator Network model Time window average 800x slowdown per node Fair accuracy

ECE669 L20: Evaluation and Message Passing April 13, 2004 Many hybrids possible 24 Processor Memory Inter- connection network # # Parallel address traces Events counts Cache and memory system simulator Analytical network model Request rate, message size 200x slowdown per node Fair accuracy # traces Network model

ECE669 L20: Evaluation and Message Passing April 13, Processor Memory Inter- connection network # # Compiled application Memory requests Network requests, time window ave Cache and memory system simulator Network model Hybrid - Network model Direct execution - Round robin/process - Switch on mem. req. Responses, latency Switch Fair accuracy 100x slowdown (30x with threads)

ECE669 L20: Evaluation and Message Passing April 13, 2004 Trace-driven (decoupled) 24 Processor Memory Inter- connection network # # Address trace Network requests Cache and memory system simulator Network simulator Responses No trace scheduler! No feedback Synchronization constraints may be violated Garbage!

ECE669 L20: Evaluation and Message Passing April 13, 2004 Message passing °Bulk transfers °Complex synchronization semantics more complex protocols More complex action °Synchronous Send completes after matching recv and source data sent Receive completes after data transfer complete from matching send °Asynchronous Send completes after send buffer may be reused

ECE669 L20: Evaluation and Message Passing April 13, 2004 Synchronous Message Passing °Constrained programming model. °Deterministic! What happens when threads added? °Destination contention very limited. °User/System boundary? Processor Action?

ECE669 L20: Evaluation and Message Passing April 13, 2004 Asynchronous Message Passing: Optimistic °More powerful programming model °Wildcard receive => non-deterministic °Storage required within msg layer? Source Destination Time Send (P dest, local VA, len) (1) Initiate send (2) Address translation (4) Send data Recv P src, local VA, len Data-xfer req Tag match Allocate buffer (3) Local/remote check (5) Remote check for posted receive; on fail, allocate data buffer

ECE669 L20: Evaluation and Message Passing April 13, 2004 Asynchronous Message Passing: Conservative °Where is the buffering? °Contention control? Receiver initiated protocol? °Short message optimizations Source Destination Time Send P dest, local VA, len Send-rdy req Tag check (1) Initiate send (2) Address translation on P dest (4) Send-ready request (6) Receive-ready request Return and compute Recv P src local VA, len Recv-rdy req Data-xfer reply (3) Local/remote check (5) Remote check for posted receive (assume fail); record send-ready (7) Bulk data reply Source VA Dest VA or ID

ECE669 L20: Evaluation and Message Passing April 13, 2004 Key Features of Message Passing Abstraction °Source knows send data address, dest. knows receive data address after handshake they both know both °Arbitrary storage “outside the local address spaces” may post many sends before any receives non-blocking asynchronous sends reduces the requirement to an arbitrary number of descriptors -fine print says these are limited too °Fundamentally a 3-phase transaction includes a request / response can use optimisitic 1-phase in limited “Safe” cases -credit scheme

ECE669 L20: Evaluation and Message Passing April 13, 2004 Active Messages °User-level analog of network transaction transfer data packet and invoke handler to extract it from the network and integrate with on-going computation °Request/Reply °Event notification: interrupts, polling, events? °May also perform memory-to-memory transfer Request handler Reply

ECE669 L20: Evaluation and Message Passing April 13, 2004 Common Challenges °Input buffer overflow N-1 queue over-commitment => must slow sources reserve space per source(credit) -when available for reuse? –Ack or Higher level Refuse input when full -backpressure in reliable network -tree saturation -deadlock free -what happens to traffic not bound for congested dest? Reserve ack back channel drop packets Utilize higher-level semantics of programming model

ECE669 L20: Evaluation and Message Passing April 13, 2004 Summary °Evaluation – important to understand intermediate messages in cache protocol °Message sizes may vary based on function °Two main types of message passing protocols Synchronous and asynchronous °Active messages involve remote operations °Message techniques depend on network reliability