Latency Tolerance: what to do when it just won’t go away
CS 258, Spring 99
David E. Culler
Computer Science Division, U.C. Berkeley

Slide 2: Reducing Communication Cost
Reducing effective latency
Avoiding latency
Tolerating latency
communication latency vs. synchronization latency vs. instruction latency
sender-initiated vs. receiver-initiated communication

Slide 3: Communication Pipeline

Slide 4: Approaches to Latency Tolerance
block data transfer
–make individual transfers larger
precommunication
–generate communication before the point where it is actually needed
proceeding past an outstanding communication event
–continue with independent work in the same thread while the event is outstanding
multithreading – finding independent work
–switch the processor to another thread

Slide 5: Fundamental Requirements
extra parallelism
bandwidth
storage
sophisticated protocols

Slide 6: How much can you gain?
Overlapping computation with all communication
–with one communication event at a time?
Overlapping communication with communication
–let C be the fraction of time in computation...
Let L be the latency and r the run length of computation between messages: how many messages must be outstanding to hide L? (see the sketch below)
What limits outstanding messages?
–overhead
–occupancy
–bandwidth
–network capacity
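A minimal back-of-the-envelope relation using only the slide’s L and r (this derivation is my sketch, not from the lecture): latency is hidden when the computation issued while messages are in flight covers L, so with m messages outstanding

\[
  m \cdot r \;\ge\; L
  \quad\Longrightarrow\quad
  m \;\ge\; \left\lceil \frac{L}{r} \right\rceil .
\]

For example, with L = 70 cycles (the memory latency quoted later for the Tera) and r = 10 cycles of work per message, roughly 7 messages must be kept outstanding.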

Slide 7: Communication Pipeline Depth

Slide 8: Block Data Transfer
Message passing
–fragmentation
Shared address space
–local coherence problem
–global coherence problem
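A hedged cost sketch of why block transfer helps, in the spirit of the LogP-style models used in this course (the per-message overhead o, latency L, and bandwidth B are my notation, not from the slide): sending n words as n separate messages versus one block gives

\[
  T_{\text{n messages}} \;=\; n\left(o + L + \tfrac{1}{B}\right),
  \qquad
  T_{\text{one block}} \;=\; o + L + \tfrac{n}{B},
\]

so overhead and latency are paid once rather than n times. Fragmentation (in message passing) and the local/global coherence problems (in a shared address space) are the prices of making the blocks large.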

Slide 9: Benefits Under CPS Scaling

Slide 10: Proceeding Past Long-Latency Events
Message passing (MP)
–posted sends, receives
Shared address space (SAS)
–write buffers and consistency models
–non-blocking reads
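A minimal C sketch of the message-passing case using MPI’s standard non-blocking primitives (the partner rank, buffer sizes, and the do_independent_work()/consume() helpers are illustrative placeholders, not part of the lecture): the send and receive are posted early, independent work proceeds while they are outstanding, and the wait happens only at the point of use.

/* Posted (non-blocking) sends and receives: proceed past the communication
 * event with independent work, then wait only when the result is required. */
#include <mpi.h>

extern void do_independent_work(void);          /* work that touches neither buffer */
extern void consume(const double *buf, int n);  /* uses the received data */

void exchange(double *sendbuf, double *recvbuf, int n, int partner)
{
    MPI_Request reqs[2];

    MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

    do_independent_work();                       /* latency is overlapped here */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* block only when data is needed */
    consume(recvbuf, n);
}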

Slide 11: Precommunication
Prefetching
–hardware vs. software
Non-blocking loads and speculative execution
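A short C sketch of the software-prefetching variant, using the GCC/Clang __builtin_prefetch intrinsic (the prefetch distance of 16 elements is an assumed tuning knob, not from the slide): the prefetch is issued far enough ahead that the miss latency overlaps with the ongoing computation.

/* Software precommunication: request a[i + DIST] while still computing on a[i],
 * so the memory latency is overlapped with useful work. */
#include <stddef.h>

#define DIST 16   /* prefetch distance in elements (assumed; tune to latency / run length) */

double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/1);
        s += a[i];   /* this load likely hits because of the earlier prefetch */
    }
    return s;
}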

Slide 12: Multithreading in SAS
Basic idea
–multiple register sets in the processor
–fast context switch
Approaches
–switch on miss
–switch on load
–switch on cycle
–hybrids
»switch on notice
»simultaneous multithreading

Slide 13: What is the gain?
Let R be the run length between switches, C the switch cost, and L the latency.
What is the maximum processor utilization?
What is the speedup as a function of multithreading?
Tolerating local latency vs. remote latency
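A hedged sketch of the usual blocked-multithreading analysis with the slide’s R, C, and L (N is the number of threads; this follows the standard textbook derivation rather than anything stated on the slide): each thread runs for R cycles, pays a switch cost C, and its latency L must be covered by the other threads’ (R + C) cycles, so

\[
  U_{\max} \;=\; \frac{R}{R + C},
  \qquad
  N_{\text{sat}} \;\approx\; \frac{R + C + L}{R + C},
  \qquad
  U(N) \;\approx\; \frac{N\,R}{R + C + L} \quad \text{for } N < N_{\text{sat}} .
\]

Remote latencies (larger L) therefore demand proportionally more threads, while the switch cost C caps utilization no matter how many threads are available.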

Slide 14: Latency/Volume-Matched Networks
k-ary d-cubes?
Butterflies?
Fat trees?
Sparse cubes?

Slide 15: What about synchronization latencies?

Slide 16: Burton Smith’s MT Architectures
Denelcor HEP
Horizon
Tera

Slide 17: Tera System Pipeline
64-bit, 3-way LIW
128 threads per processor
32 registers per thread
8 branch target registers
~16-deep instruction pipeline
70-cycle memory
up to 8 concurrent memory references per thread
–look-ahead field limit
–speculative loads / poison registers
full/empty bits on every memory word

Slide 18: A Perspective on Memory Hierarchies
Designing for the hard case has huge value!
Speedup due to multithreading is proportional to the number of messages outstanding
Each takes state
–capacity of a given level limits the number (as do occupancy and frequency)
–achieve more tolerance by discarding a lower level
–huge opportunity cost
Putting it back in enhances ‘local’ performance and reduces frequency
Natural inherent limits on speedup
(Figure: memory hierarchy from processor and registers through SRAM and local DRAM to global DRAM)