Spring 2003, CSE P548, Slide 1: WaveScalar and the WaveCache
Steven Swanson, Ken Michelson, Mark Oskin, Tom Anderson, Susan Eggers
University of Washington

Spring 2003, CSE P548, Slide 2: Worries to Keep You Up at Night
- In ,000 RISC-1 processors will fit on a die.
- It will take 36 cycles to cross the die.
- Still a lack of ILP.
- Memory latency is still a problem.
- For reasonable yields, only 1 transistor in 24 billion may be broken (if one flaw breaks a chip).

Spring 2003, CSE P548, Slide 3: WaveScalar's Solution: Utilize Die Capability
- A sea of simple, RISC-like processors: in-order, single-issue
- Takes advantage of billions of transistors without exacerbating the other problems
- Short design & implementation time
- Operates with a short cycle time
- Does not need lots of ILP
- Fewer defects

Spring 2003, CSE P548, Slide 4: WaveScalar Processing Element

Spring 2003, CSE P548, Slide 5: WaveScalar's Solution: Short Wires
- Dataflow execution model
  - Each processor executes when its operands have arrived
  - Same principle as out-of-order execution, but applied to the whole processor, including fetch
  - No single program counter
- Short wires
  - No long control lines
  - No centralized hardware data structures
  - No need for sequential, individual instruction fetches

Spring 2003, CSE P548, Slide 6: WaveScalar's Solution: Short Wires
- Dataflow execution model, cont'd: differs from original dataflow computers
  - Distributed tag management (matching between renamed producer-consumer registers)
  - Special WaveScalar instructions assign a number to all operands in a wave (think iteration or trace) & coordinate wave execution
  - All instructions in a "wave" execute on data with the same wave number
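The tag-matching idea above can be sketched in a few lines. This is a toy interpreter (the class, method, and field names are mine, not the WaveCache microarchitecture): an instruction fires only once every operand carrying the same wave number has arrived, and its result flows straight to its consumers with no program counter anywhere.

```python
from collections import defaultdict

class Instruction:
    """A dataflow instruction fires when all operands with a matching
    wave number have arrived; there is no program counter."""
    def __init__(self, name, n_inputs, op):
        self.name = name
        self.n_inputs = n_inputs
        self.op = op
        self.pending = defaultdict(dict)  # wave number -> {input slot: value}
        self.consumers = []               # (instruction, input slot) pairs

    def receive(self, wave, slot, value):
        self.pending[wave][slot] = value
        if len(self.pending[wave]) == self.n_inputs:   # all operands matched
            args = [self.pending[wave][i] for i in range(self.n_inputs)]
            result = self.op(*args)
            del self.pending[wave]
            for consumer, in_slot in self.consumers:   # send directly to consumers
                consumer.receive(wave, in_slot, result)
            return result

# wiring: mul consumes add's result; tokens from two waves interleave safely
add = Instruction("add", 2, lambda a, b: a + b)
mul = Instruction("mul", 2, lambda a, b: a * b)
add.consumers.append((mul, 0))

add.receive(wave=0, slot=0, value=3)
add.receive(wave=1, slot=0, value=10)            # a second wave's token arrives early
add.receive(wave=0, slot=1, value=4)             # wave 0 completes: add -> 7, sent to mul
result = mul.receive(wave=0, slot=1, value=2)    # mul fires for wave 0: 7 * 2 = 14
```

Because operands are matched per wave number, the stray token from wave 1 sits in `add.pending` without ever being confused with wave 0's operands.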

Spring 2003, CSE P548, Slide 7: WaveScalar's Solution: Short Wires
- Dataflow execution model differs from original dataflow computers: explicit wave-ordered memory
  - The compiler assigns a sequence number to each memory operation in a breadth-first manner
  - The sequence numbers of an operation, its predecessor & its successor are all sent with the produced data
  - Wave & sequence numbers provide a total order on memory operations through any traversal of a wave
  - + Normal memory semantics
  - + No need for special dataflow languages; C & C++ programs execute just fine
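A minimal sketch of the ordering rule, as a toy model of my own (not the actual WaveCache memory interface; the real scheme also carries successor links and wildcards for branches): each request names its predecessor's sequence number, so the memory can replay the compiler's total order even when requests arrive scrambled.

```python
class WaveOrderedMemory:
    """Toy wave-ordered memory: requests buffer until their predecessor
    (by compiler-assigned sequence number) has issued."""
    def __init__(self):
        self.mem = {}
        self.waiting = {}     # seq -> (pred, op, addr, value), buffered requests
        self.issued = {None}  # None marks "no predecessor" (first op in the wave)
        self.log = []         # (seq, op, addr, value seen) in issue order

    def request(self, seq, pred, op, addr, value=None):
        # Requests may arrive in any order; buffer until the predecessor issues.
        self.waiting[seq] = (pred, op, addr, value)
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for seq in sorted(self.waiting):
                pred, op, addr, value = self.waiting[seq]
                if pred in self.issued:        # predecessor already performed
                    if op == "store":
                        self.mem[addr] = value
                    self.log.append((seq, op, addr, self.mem.get(addr)))
                    self.issued.add(seq)
                    del self.waiting[seq]
                    progress = True
                    break

mem = WaveOrderedMemory()
# Compiler order for one wave: 0: store x=5, 1: load x, 2: store x=9.
# The requests arrive scrambled, yet issue in compiler order.
mem.request(seq=2, pred=1, op="store", addr="x", value=9)
mem.request(seq=1, pred=0, op="load", addr="x")
mem.request(seq=0, pred=None, op="store", addr="x", value=5)
```

The load at sequence number 1 observes the value 5 from the store before it, never the 9 from the store after it, exactly the normal memory semantics the slide promises.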

Spring 2003, CSE P548, Slide 8: WaveScalar's Solution: Short Wires
- Nearest-neighbor communication
  - Code placement to locate consumers near their producers
  - Short, fast node-to-node links rather than slow broadcast networks
  - Exploits dataflow locality: the probability of producing a value for a particular consumer instruction & therefore register (register renaming can destroy this)
  - Instructions can dynamically migrate toward their neighbors during execution
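One way to picture placement for dataflow locality is a greedy sketch like the following (a hypothetical heuristic of mine, not the actual WaveCache placer): visit instructions in dependence order and put each one in the free grid cell with the smallest total Manhattan distance to its producers, so the common producer-to-consumer hops stay short.

```python
from itertools import product

def place(graph, grid_w, grid_h):
    """Greedily place a dataflow graph on a grid_w x grid_h grid of PEs,
    keeping each consumer close to its producers."""
    free = set(product(range(grid_w), range(grid_h)))
    loc = {}
    for inst, producers in graph:          # graph listed in topological order
        if producers:
            spots = [loc[p] for p in producers]
            cost = lambda c: sum(abs(c[0] - x) + abs(c[1] - y) for x, y in spots)
            cell = min(sorted(free), key=cost)   # nearest free cell (ties: lowest coord)
        else:
            cell = min(free)               # entry instruction: any free cell
        loc[inst] = cell
        free.remove(cell)
    return loc

# tiny dataflow graph: c consumes a and b, d consumes c
graph = [("a", []), ("b", []), ("c", ["a", "b"]), ("d", ["c"])]
placement = place(graph, 4, 4)
```

On this input, d lands one hop from its sole producer c, so their link is a single node-to-node wire rather than a trip across the die.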

Spring 2003, CSE P548, Slide 9: Dynamic Optimization
- The common case has higher costs, and the branch can detect this…
[figure: dataflow graph with a Branch and a Join, showing a common-case and a rare-case path]

Spring 2003, CSE P548, Slide 10: Dynamic Optimization
- …and fix it, by moving. The join can do the same.
[figure: the same graph after the branch migrates toward the common-case path]
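The migration idea in the two slides above can be written as a toy policy (hypothetical; the function, parameters, and the 80% threshold are mine, not the actual hardware): if one consumer receives most of an instruction's tokens, hop one grid step toward it, shortening the common-case wire at the expense of the rare-case one.

```python
def migrate(pos, consumer_pos, traffic, threshold=0.8):
    """Return the instruction's new grid position: one hop toward the
    dominant consumer, or unchanged if no consumer dominates."""
    total = sum(traffic.values())
    hot, count = max(traffic.items(), key=lambda kv: kv[1])
    if total == 0 or count / total < threshold:
        return pos                                # no dominant consumer: stay put
    (tx, ty), (x, y) = consumer_pos[hot], pos
    step = lambda a, b: a + (b > a) - (b < a)     # move one hop toward the target
    return (step(x, tx), step(y, ty))

# a branch at (0, 0) whose common-case join receives 90% of its tokens
new_pos = migrate(
    pos=(0, 0),
    consumer_pos={"common": (3, 0), "rare": (0, 3)},
    traffic={"common": 90, "rare": 10},
)
```

Repeated over many execution intervals, steps like this drift instructions toward their hot consumers, which is exactly the self-optimizing placement the slides describe.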

Spring 2003, CSE P548, Slide 11: WaveScalar's Solution: Short Wires
- PE Domain

Spring 2003, CSE P548, Slide 12: WaveScalar's Solution: Short Wires
- Cluster

Spring 2003, CSE P548, Slide 13: WaveScalar's Solution: Creative Use of Untapped Parallelism
- Expand the window for exploiting ILP
  - No in-order fetch using only one PC (sucking through a straw)
  - Place instructions with the processing elements
  - Out-of-order execution on a grand scale
- Allow multiple threads to execute concurrently
  - OS & applications
  - Multiple applications, parallel threads

Spring 2003, CSE P548, Slide 14: WaveScalar's Solution: The I-Cache Is the Processor
- Model is processor-in-memory (PIM): a processing element is associated with each instruction
- WaveScalar version: processing elements are placed in the I-cache to reduce latency

Spring 2003, CSE P548, Slide 15: WaveScalar's Solution: Design to Compensate for Circuit Unreliability
- Fewer design & implementation errors, thanks to the grid of simple, uniform elements
- Route around processors with flaws
  - Decentralized control
  - Dynamic instruction migration

Spring 2003, CSE P548, Slide 16: Research Agenda: Architecture
- WaveScalar ISA
- Microarchitecture design
  - Node design
  - Domain size
  - Cache coherence across clusters
  - Cluster arrangement
- Control & memory speculation
- WaveScalar instruction management
  - Hardware for instruction placement & replacement
  - Hardware for dynamic, self-optimizing placement

Spring 2003, CSE P548, Slide 17: Research Agenda: Architecture
- Multithreaded WaveScalar
- Design of the network & routing issues
- Power management
- Static & dynamic fault detection & recovery (rerouting instructions)
- System-level design
- Application to non-silicon designs

Spring 2003, CSE P548, Slide 18: Research Agenda: Compilers
- Instruction placement
- Revisit classic optimizations
  - Code savings vs. communication costs
  - Cache pollution vs. loop parallelism
- New opportunities for optimization
  - A match between the compiler & execution models
  - WaveScalar-specific instructions

Spring 2003, CSE P548, Slide 19: Research Agenda: OS & Networking
- Tension between facilitating short routines & poor instruction locality
- The software side of thread management
- A bunch of stuff I don't know about
  - Optimizing the OS interface
  - New thread protection policies
  - Memory management issues
  - Security
  - Lazy context switching
  - Utilizing virtual machines

Spring 2003, CSE P548, Slide 20: Putting It All Together
- Grid of hundreds (maybe thousands) of simple, dataflow processing nodes
  - No centralized control; scalable
  - Few design errors; increase in yield
- Processing nodes embedded in the I-cache
- Instructions execute in place
- Results sent directly to the consumers over short, point-to-point links
- Instructions can dynamically migrate
  - Reduce latency to hot consumers
  - Map around defects
- 3X performance without any prediction mechanisms; more with them
