Latency Tolerance: what to do when it just won’t go away


CS 258, Spring 99
David E. Culler
Computer Science Division, U.C. Berkeley

Reducing Communication Cost
- Reducing effective latency
- Avoiding latency
- Tolerating latency
  - communication latency vs. synchronization latency vs. instruction latency
  - sender-initiated vs. receiver-initiated communication

Communication Pipeline

Approaches to Latency Tolerance
- Block data transfer
  - make individual transfers larger
- Precommunication
  - generate communication before the point where it is actually needed
- Proceeding past an outstanding communication event
  - continue with independent work in the same thread while the event is outstanding
- Multithreading: finding independent work
  - switch the processor to another thread

Fundamental Requirements
- extra parallelism
- bandwidth
- storage
- sophisticated protocols

How Much Can You Gain?
- Overlapping computation with all communication
  - with one communication event at a time?
- Overlapping communication with communication
  - let C be the fraction of time in computation...
- Let L be the latency and r the run length of computation between messages: how many messages must be outstanding to hide L?
- What limits outstanding messages?
  - overhead
  - occupancy
  - bandwidth
  - network capacity?

Communication Pipeline Depth

Block Data Transfer
- Message passing
  - fragmentation
- Shared address space
  - local coherence problem
  - global coherence problem

Benefits Under CPS Scaling

Proceeding Past Long-Latency Events
- Message passing (MP)
  - posted sends and receives
- Shared address space (SAS)
  - write buffers and consistency models
  - non-blocking reads

Precommunication
- Prefetching: HW vs. SW
- Non-blocking loads and speculative execution

Multithreading in SAS
- Basic idea
  - multiple register sets in the processor
  - fast context switch
- Approaches
  - switch on miss
  - switch on load
  - switch on cycle
  - hybrids: switch on notice
  - simultaneous multithreading

What Is the Gain?
- Let R be the run length between switches, C the switch cost, and L the latency
- What is the maximum processor utilization?
- What is the speedup as a function of multithreading?
- Tolerating local latency vs. remote latency

Latency/Volume-Matched Networks
- k-ary d-cubes?
- Butterflies?
- Fat-trees?
- Sparse cubes?

What about synchronization latencies?

Burton Smith’s MT Architectures
- Denelcor HEP
- Horizon
- Tera

Tera System Pipeline
- 64-bit, 3-way LIW
- 128 threads per processor
- 32 registers per thread
- 8 branch target registers
- ~16-deep instruction pipeline
- 70-cycle memory latency
- up to 8 concurrent memory references per thread
  - look-ahead field limit
- speculative loads / poison registers
- full/empty bits on every memory word

A Perspective on Memory Hierarchies
- regs / SRAM / local DRAM / global DRAM
- Designing for the hard case has huge value!
- Speedup due to multithreading is proportional to the number of messages outstanding
  - each takes state; the capacity of a given level limits the number (as do occupancy and frequency)
- Achieve more tolerance by discarding lower levels
  - huge opportunity cost
- Putting it back in enhances ‘local’ performance and reduces frequency
- Natural inherent limits on speedup