
Dataflow Processors ECE/CS 757 Spring 2007 J. E. Smith Copyright (C) 2007 by James E. Smith (unless noted otherwise) All rights reserved. Except for use in ECE/CS 757, no part of these notes may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior written permission from the author.

Dataflow Architectures
 Overview
 Static dataflow
 Dynamic dataflow
 Explicit token store
 Recent case study: WaveScalar

Overview
 A global version of Tomasulo's algorithm (although Tomasulo's algorithm came first)
 Example:
     input a, b
     y := (a+b)/x
     x := (a*(a+b)) + b
     output y, x
   Note: the ordering of statements in the program is irrelevant; execution is driven by data availability
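Below is a minimal sketch (not from the slides; the node list and variable names are mine) of how the Overview example executes as a dataflow graph in Python: each operation fires as soon as all of its input values have arrived, so swapping the two assignment statements changes nothing.

import operator

# node: (operation, input names, output name)
nodes = [
    (operator.add, ("a", "b"), "t1"),      # t1 = a + b
    (operator.mul, ("a", "t1"), "t2"),     # t2 = a * (a + b)
    (operator.add, ("t2", "b"), "x"),      # x  = a*(a+b) + b
    (operator.truediv, ("t1", "x"), "y"),  # y  = (a+b) / x
]

def run(a, b):
    values = {"a": a, "b": b}          # tokens that have arrived so far
    pending = list(nodes)
    while pending:
        for node in pending:
            op, ins, out = node
            if all(i in values for i in ins):   # all input tokens present?
                values[out] = op(*(values[i] for i in ins))
                pending.remove(node)
                break
    return values["y"], values["x"]

print(run(2, 3))   # y = 5/13, x = 13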

Loop Example
     input y, x
     n := 0
     while y < x do
         y := y + x
         n := n + 1
     end
     output y, n
 (Use of arrays was intentionally avoided)

Static Dataflow
 Combine control and data into a template
   - like a reservation station, except templates are held in memory
   - can inhibit parallelism among loop iterations (re-use of the template)
 Acknowledgement ("ack") signals indicate when a template may be re-used
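As a rough illustration (the field names are mine, not from the slides), a static-dataflow template can be pictured as a reservation-station-like record held in memory; since there is only one template per instruction, a later loop iteration cannot fire it until the acks for the previous firing have returned.

from dataclasses import dataclass, field

@dataclass
class Template:
    opcode: str
    operands: list = field(default_factory=lambda: [None, None])  # operand slots
    dests: list = field(default_factory=list)    # (template, slot) destinations
    acks_needed: int = 0                         # acks required before re-firing
    acks_seen: int = 0

    def ready(self):
        # fires only when both operands have arrived and all acks are back
        return all(v is not None for v in self.operands) and \
               self.acks_seen >= self.acks_needed

t = Template("add", dests=[("mul_template", 0)], acks_needed=1)
t.operands = [4, 5]
print(t.ready())    # False: waits for the ack from the previous firing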

Dynamic Dataflow
 Separate data tokens and control
   - token: a labeled (tagged) packet of information
 Allows multiple loop iterations to be simultaneously active
   - shared control (instructions), separate data tokens
   - a data token can carry a loop iteration number
 Match tokens' tags in a matching store via associative search
   - if no match is found, make an entry and wait for the partner token
 When there is a match, fetch the corresponding instruction from program memory
 Requires a large associative search to match tags
 Adds "structure storage", accessed via a select function with an index and a structure descriptor as inputs
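A minimal Python sketch of a tagged-token matching store (the class and the tag layout are mine; real hardware performs an associative search over the tags). A token's tag is (instruction, iteration); the first token of a pair waits in the store, and the arrival of its partner triggers the instruction.

class MatchingStore:
    def __init__(self):
        self.waiting = {}          # tag -> value of the token waiting for a partner

    def arrive(self, tag, value):
        """Return both operands when a pair matches, else None."""
        if tag in self.waiting:                    # partner already waiting
            return self.waiting.pop(tag), value    # fire the instruction
        self.waiting[tag] = value                  # no match: make entry, wait
        return None

store = MatchingStore()
# two iterations of the same add can be in flight at once,
# because the iteration number is part of the tag
print(store.arrive(("add1", 0), 5))    # None: waits for its partner
print(store.arrive(("add1", 1), 7))    # None: different iteration, also waits
print(store.arrive(("add1", 0), 3))    # (5, 3): iteration 0 fires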

Dataflow: Advantages/Disadvantages
 Advantages:
   - no program counter; data-driven execution is inhibited only by true data dependences
   - stateless / side-effect free, which further enhances parallelism
 Disadvantages:
   - no program counter leads to very long fetch/execute latency; spatial locality in instruction fetch is hard to exploit
   - requires matching (e.g., via associative compares)
   - stateless / side-effect free: no shared data structures, no pointers into data structures (pointers imply state)
     - in theory, take an entire data structure as an input "token" and emit a new version
   - I/O is difficult because it depends on state
   - virtual memory??

Explicit Token Store
 Used in Monsoon
 Implement the matching store with RAM
 A frame is associated with an instruction block (e.g., a loop iteration)
   - at loop invocation, spin off multiple frames
 Tag contains an instruction pointer (IP) and a frame pointer (FP)
 Instructions include a frame offset giving the location for token "rendezvous"
 An instruction indicates its destination instructions
   - when a token arrives at the instruction unit, the instruction is fetched
   - the instruction indicates the frame offset for token rendezvous
   - the frame location has a full/empty bit
     - empty => no match: put the token in the frame, set full
     - full  => match: feed both tokens to the instruction for execution
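A minimal sketch of the explicit-token-store rendezvous (the function and the dictionary standing in for frame RAM are mine): the tag <IP, FP> plus the frame offset from the instruction at IP names one frame slot with a full/empty bit.

EMPTY, FULL = 0, 1

frame_memory = {}                 # models the frame RAM: address -> (state, value)

def token_arrives(ip, fp, value, frame_offset):
    """frame_offset comes from the instruction at ip."""
    addr = fp + frame_offset
    state, waiting = frame_memory.get(addr, (EMPTY, None))
    if state == EMPTY:
        frame_memory[addr] = (FULL, value)      # no match: store token, set full
        return None
    frame_memory[addr] = (EMPTY, None)          # match: clear the slot
    return (waiting, value)                     # both operands go to execution

print(token_arrives(ip=7, fp=0x100, value=5, frame_offset=3))   # None (waits)
print(token_arrives(ip=7, fp=0x100, value=9, frame_offset=3))   # (5, 9) fires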

Hybrid Solution
 Used in *T
 Use dataflow at a high level
   - dataflow functions are coarser-grain operations
 Use conventional control flow to implement the coarse-grain operations

WaveScalar
 Recent research project at the other UW (University of Washington)
 Goal: minimize communication costs
   - transistors get faster more quickly than wires do
 Keep the von Neumann memory model
   - state + side effects "solves" the data structure problem (but gives up the side-effect-free advantage)
   - memory ordering must be explicitly expressed
 Control still follows the dataflow model
   - essentially what is done with Tomasulo's algorithm, but on a much larger scale
 Attempts to exploit "dataflow locality" in the instruction stream
   - predictability of data dependences; avoid register file/bypass network complexity
   - example: 92% of inputs come from one of the last two producers of that input
   - claim: register renaming destroys dataflow locality

Basic Concepts
 Intelligent instruction words
   - each instruction in memory has its own functional unit (in theory)
 In practice
   - use a WaveCache to hold the working set of instructions near the functional units
 Control flow is broken into "waves"
   - a maximal loop-free section of the dataflow graph
   - single entry point
   - may contain internal branches and joins
 Each dynamic wave has a wave number
   - every value has a wave number (like tagged tokens)
   - same wave number for all instructions in a wave
   - WAVE-ADVANCE instruction at the top of a wave; wave numbers then percolate through the rest of the wave
 Indirect jumps
   - for dynamic linking, procedure returns, etc.
   - INDIRECT-SEND instruction takes an instruction address as one of its arguments
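A small sketch (names are mine) of how wave numbers behave like token tags: a WAVE-ADVANCE at the top of a wave bumps the wave number of every value entering it, and ordinary instructions only combine operands whose wave numbers match.

from collections import namedtuple

Token = namedtuple("Token", ["wave", "value"])

def wave_advance(token):
    # entering a new dynamic wave: increment the wave number, keep the value
    return Token(token.wave + 1, token.value)

def add(a, b):
    # ordinary instructions operate on tokens with matching wave numbers
    assert a.wave == b.wave
    return Token(a.wave, a.value + b.value)

x = Token(0, 10)
y = Token(0, 3)
x, y = wave_advance(x), wave_advance(y)   # top of the next loop iteration
print(add(x, y))                          # Token(wave=1, value=13)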

Memory Ordering
 Implements wave-ordered memory
 Each memory operation is annotated with ordering information
   - wave number
   - compiler-generated sequence number, relative to the wave
     - traverse the control flow graph breadth-first
     - assign sequential numbers to instructions within a basic block
   - label each memory op with <predecessor, this, successor> sequence numbers
 Memory system reconstructs the "correct" order
 May need memory no-op instructions to fill "gaps"
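A much-simplified sketch of wave-ordered memory for a single wave (the class and the heap-based buffering are mine; the real scheme uses the <predecessor, this, successor> links): memory operations arrive out of order but are released to the cache strictly in sequence-number order, with MEMORY-NOPs standing in for numbers on untaken paths.

import heapq

class WaveOrderedMemory:
    def __init__(self):
        self.pending = []       # min-heap of (wave, seq, op)
        self.next_seq = 0       # next sequence number expected in this wave

    def arrive(self, wave, seq, op):
        heapq.heappush(self.pending, (wave, seq, op))
        issued = []
        # release every operation whose turn has come
        while self.pending and self.pending[0][1] == self.next_seq:
            _, _, ready = heapq.heappop(self.pending)
            if ready != "MEMORY-NOP":        # no-ops just advance the order
                issued.append(ready)
            self.next_seq += 1
        return issued

mem = WaveOrderedMemory()
print(mem.arrive(0, 2, "store B"))     # [] : waits for seq 0 and 1
print(mem.arrive(0, 0, "load A"))      # ['load A']
print(mem.arrive(0, 1, "MEMORY-NOP"))  # ['store B'] : gap filled, store released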

WaveCache
 The WaveCache is the processor implementation
 Grid of approximately 2K PEs, in clusters of 16 each
 Each PE has buffering and storage for 8 instructions
   - total of 16K instructions (similar capacity to a 64 KB I-cache)
 Input queues are 16 deep; indexed relative to the current wave
   - full/empty bits, updated by the match unit at the input
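The capacity claim follows from simple arithmetic (the 4-byte instruction size below is my assumption, consistent with an Alpha-style encoding):

pes               = 2 * 1024          # ~2K processing elements
insts_per_pe      = 8
total_insts       = pes * insts_per_pe
print(total_insts)                    # 16384 -> "16K instructions"
print(total_insts * 4 // 1024, "KB")  # 64 KB -> comparable to a 64KB I-cache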

Memory Hierarchy
 L1 data cache per 4 clusters
 L2 cache: shared, conventional
 Interconnect
   - set of shared busses within a cluster
   - dynamically routed network for inter-cluster communication
 Store buffers: distributed
   - each set of four clusters shares a store buffer
   - each dynamic wave is bound to a store buffer
   - MEM-COORDINATE instruction acquires a store buffer for a wave and passes its number to the memory instructions in the wave, OR a hash table managed in main memory maps waves to store buffers
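A speculative sketch (the constants and names are mine) of the hash-table flavor of store-buffer binding, where a wave's number selects the store buffer used by all memory instructions in that wave:

NUM_STORE_BUFFERS = 4          # assumption: one per set of four clusters

wave_to_buffer = {}            # "hash table managed in main memory"

def store_buffer_for(wave_number):
    # the first memory op of a wave binds it to a buffer; later ops reuse it
    if wave_number not in wave_to_buffer:
        wave_to_buffer[wave_number] = wave_number % NUM_STORE_BUFFERS
    return wave_to_buffer[wave_number]

print(store_buffer_for(17))    # buffer 1
print(store_buffer_for(17))    # same buffer for every memory op in wave 17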

Example

Example: Compiler Output
 Compiler steps:
   - decompose the dataflow graph into waves
   - insert memory sequence numbers and MEMORY-NOPs
   - insert WAVE-ADVANCE instructions
   - convert branch instructions into predicates (de-mux); see the sketch below
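A minimal sketch (the function name is mine) of the de-mux idea: rather than branching, the predicate steers a value to exactly one of two destination instructions, and the other path simply never receives a token.

def steer(predicate, value):
    """Return (token_for_true_path, token_for_false_path)."""
    return (value, None) if predicate else (None, value)

# if (a < b) the true-path consumer gets a+b; otherwise the false path fires
a, b = 3, 8
to_true, to_false = steer(a < b, a + b)
print(to_true, to_false)        # 11 None : only the taken path receives a token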

Instructions
 Encoding
   - opcode
   - list of targets: the address of each destination instruction and the input number to which the output value is directed
   - three inputs, two outputs
 Loading, finding, and placing instructions
   - a miss causes an instruction to be loaded into the WaveCache
   - rewrite targets to point to locations in the WaveCache; if a target is not in the WaveCache, mark it "not-loaded"
   - access to a "not-loaded" instruction causes additional miss(es)
   - a cache miss is heavyweight: must move queue state, mark input instructions "not-loaded", etc.
   - "idle" instructions are easier to replace (empty queue state)
   - eventually the instruction working set stabilizes
 Program termination
   - the OS must forcibly remove instructions belonging to a process
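As a rough illustration of the encoding (field names are mine; the slide specifies only the opcode, a target list of destination address plus input number, and the three-input/two-output limit), an instruction names its consumers directly instead of writing a register:

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class WaveInst:
    opcode: str
    # up to three inputs; each slot is filled by a producer's output
    inputs: List[object] = field(default_factory=lambda: [None, None, None])
    # up to two outputs, each with a list of (dest instruction addr, input slot)
    targets: List[List[Tuple[int, int]]] = field(default_factory=lambda: [[], []])
    loaded: bool = True       # False = "not-loaded": touching it triggers a miss

add = WaveInst("add", targets=[[(0x40, 0), (0x44, 1)], []])
print(add)   # output 0 feeds input 0 of inst 0x40 and input 1 of inst 0x44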

Example: Stable Working Set

Performance Evaluation
 Use the WaveScalar system as described earlier
 Compare with an "absurd" superscalar
   - 16-way issue, 1024-entry issue buffer, 1024 physical registers, etc., with a realistic branch predictor
   - no memory dependence speculation (in either system)
 Use ideal L1 data caches
 Separate store addresses/data (as often done in superscalar processors)
 No speculation on memory dependences
   - authors give results for perfect predicate prediction and memory disambiguation, BUT don't discuss how it would be implemented
 Benchmarks
   - vpr, twolf, mcf, equake, art, mpeg2, fft
   - select functions that consume 95% of execution time
     - as measured on what machine? a good way to beat Amdahl's law
 Performance results in Alpha-equivalent instructions per cycle

Performance Comparison vs. Superscalar
 Avg. 3.1x faster than the superscalar
   - over 4x on vectorizable apps (equake and fft); others may be vectorizable as well
   - claim this is because WaveScalar is not constrained by control dependences
   - might get the same effect in a superscalar by using Java-like convergent path rules (research topic)
 % static instruction count overhead
 16-PE clusters work well
   - fewer than 15% of values leave the cluster of origin
   - fewer than 10% go more than one cluster away

Performance as a Function of Cluster Size
 From 1-PE to 256-PE clusters (128 to 32K instructions)
 Intra-cluster communication latency goes up for larger clusters (1 to 24 cycles)
 Performance peaks and then drops off