CS252 Graduate Computer Architecture, Fall 2015
Lecture 5: Out-of-Order Processing
Krste Asanovic
krste@berkeley.edu
http://inst.eecs.berkeley.edu/~cs252/fa15

Last Time in Lecture 4
- Iron Law of processor performance (restated below)
- Pipelining: reduce cycle time, try to keep CPI low
- Hazards:
  - Structural hazards: interlock or add more hardware
  - Data hazards: interlocks, bypass, speculate
  - Control hazards: interlock, speculate
- Precise traps/interrupts for an in-order pipeline
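
For reference, the Iron Law mentioned above is the standard decomposition (restated here for convenience, not taken from the slide itself):

\[
\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}
\]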

IBM 7030 “Stretch” (1954-1961)
- Original goal was to use new transistor technology to give 100x the performance of the tube-based IBM 704
- Design based around 4 stages of “lookahead” pipelining
- More than just pipelining: a simple form of decoupled execution, with indexing and branch operations performed speculatively ahead of data operations
- Also had a simple store buffer
- Very complex design for the time; difficult to explain the performance of a pipelined machine to users
- When finally delivered, was benchmarked at only 30x the 704, which embarrassed IBM and caused withdrawal of the machine after initial deliveries

Simple vector-vector add code example

# for(i=0; i<N; i++)
#   A[i] = B[i] + C[i];

loop: fld    f0, 0(x2)      // x2 points to B
      fld    f1, 0(x3)      // x3 points to C
      fadd.d f2, f0, f1
      fsd    f2, 0(x1)      // x1 points to A
      addi   x1, x1, 8      // Bump pointer
      addi   x2, x2, 8      // Bump pointer
      addi   x3, x3, 8      // Bump pointer
      bne    x1, x4, loop   // x4 holds end

Simple Pipeline Scheduling

Can reschedule code to try to reduce pipeline hazards:

loop: fld    f0, 0(x2)      // x2 points to B
      fld    f1, 0(x3)      // x3 points to C
      addi   x3, x3, 8      // Bump pointer
      addi   x2, x2, 8      // Bump pointer
      fadd.d f2, f0, f1
      addi   x1, x1, 8      // Bump pointer
      fsd    f2, -8(x1)     // x1 points to A
      bne    x1, x4, loop   // x4 holds end

Long-latency loads and floating-point operations limit parallelism within a single loop iteration.

Loop Unrolling

Can unroll to expose more parallelism:

loop: fld    f0, 0(x2)       // x2 points to B
      fld    f1, 0(x3)       // x3 points to C
      fld    f10, 8(x2)
      fld    f11, 8(x3)
      addi   x3, x3, 16      // Bump pointer
      addi   x2, x2, 16      // Bump pointer
      fadd.d f2, f0, f1
      fadd.d f12, f10, f11
      addi   x1, x1, 16      // Bump pointer
      fsd    f2, -16(x1)     // x1 points to A
      fsd    f12, -8(x1)
      bne    x1, x4, loop    // x4 holds end

- Unrolling is limited by the number of architectural registers
- Unrolling increases instruction cache footprint
- More complex code generation for the compiler, which has to understand pointers
- Can also software pipeline, but with similar concerns

Decoupling (lookahead, runahead) in µarchitecture

Can separate control and memory address operations from data computations:

loop: fld    f0, 0(x2)      // x2 points to B
      fld    f1, 0(x3)      // x3 points to C
      fadd.d f2, f0, f1
      fsd    f2, 0(x1)      // x1 points to A
      addi   x1, x1, 8      // Bump pointer
      addi   x2, x2, 8      // Bump pointer
      addi   x3, x3, 8      // Bump pointer
      bne    x1, x4, loop   // x4 holds end

The control and address operations do not depend on the data computations, so they can be computed early relative to the data computations, which can be delayed until later.

Simple Decoupled Machine

[Block diagram: an in-order integer pipeline (F D X M W) issues µOps into a µOp queue that feeds an in-order floating-point pipeline (D X1 X2 X3 W). Load data returns through a load data queue ({load data writeback µOp}), arithmetic operations come from the µOp queue ({compute µOp}), and store data is read through a store data queue ({store data read µOp}). Store addresses wait in a store address queue, and each load address is checked against the queued store addresses.]

Decoupled Execution

fld f0    Send load to memory, queue up write to f0
fld f1    Send load to memory, queue up write to f1
fadd.d    Queue up fadd.d
fsd f2    Queue up store address, wait for store data
add x1    Bump pointer
add x2    Bump pointer
add x3    Bump pointer
bne       Take branch
fld f0    Send load to memory, queue up write to f0
fld f1    Send load to memory, queue up write to f1
fadd.d    Queue up fadd.d
fsd f2    Queue up store address, wait for store data
…

Many writes to f0 can be in the queue at the same time. Each load address is checked against the queued pending store addresses.
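
To make the queueing behaviour concrete, here is a minimal Python sketch of access/execute decoupling (function and queue names are mine; the load-address check against pending store addresses and the separate store data queue are omitted for brevity):

from collections import deque

# Two-queue model of a decoupled machine: the "access" side runs ahead,
# issuing loads and queueing store addresses; the "execute" side consumes
# load data in order and produces the values to be stored.

def access_side(B, C, load_data_q, store_addr_q):
    for i in range(len(B)):
        load_data_q.append(B[i])      # fld f0: load data queued for execute side
        load_data_q.append(C[i])      # fld f1
        store_addr_q.append(i)        # fsd f2: store address queued, data comes later

def execute_side(A, load_data_q, store_addr_q):
    while store_addr_q:
        b = load_data_q.popleft()     # operands arrive in order from the load data queue
        c = load_data_q.popleft()
        A[store_addr_q.popleft()] = b + c   # fadd.d result becomes the store data

B, C = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
A = [0.0] * len(B)
ldq, saq = deque(), deque()
access_side(B, C, ldq, saq)           # access side can run arbitrarily far ahead
execute_side(A, ldq, saq)
print(A)                              # [11.0, 22.0, 33.0]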

Supercomputers

Definitions of a supercomputer:
- Fastest machine in the world at a given task
- A device to turn a compute-bound problem into an I/O-bound problem
- Any machine costing $30M+
- Any machine designed by Seymour Cray

CDC 6600 (Cray, 1964) regarded as the first supercomputer.

CDC 6600, Seymour Cray, 1963
- A fast pipelined machine with 60-bit words
  - 128 Kword main memory capacity, 32 banks
- Ten functional units (parallel, unpipelined)
  - Floating point: adder, 2 multipliers, divider
  - Integer: adder, 2 incrementers, ...
- Hardwired control (no microcoding)
- Scoreboard for dynamic scheduling of instructions
- Ten peripheral processors for input/output
  - A fast multi-threaded 12-bit integer ALU
- Very fast clock, 10 MHz (FP add in 4 clocks)
- >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based technology for cooling
- Fastest machine in the world for 5 years (until the 7600); over 100 sold ($7-10M each)

CDC 6600: A Load/Store Architecture
- Separate instructions to manipulate three types of registers:
  - 8 x 60-bit data registers (X)
  - 8 x 18-bit address registers (A)
  - 8 x 18-bit index registers (B)
- All arithmetic and logic instructions are register-to-register
- Only load and store instructions refer to memory!
- Touching address registers A1 to A5 initiates a load, A6 to A7 initiates a store
  - Very useful for vector operations

Instruction formats:
  opcode(6) i(3) j(3) k(3)        Ri ← Rj op Rk
  opcode(6) i(3) j(3) disp(18)    Ri ← M[Rj + disp]
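
As a rough illustration of the two formats above, a hypothetical decode in Python (the field widths follow the slide, but the bit positions and example opcode values are assumptions for illustration only):

# Hypothetical decode of the two CDC 6600-style instruction formats shown above.

def decode_short(word15):
    """15-bit format: opcode(6) i(3) j(3) k(3)  ->  Ri <- Rj op Rk"""
    opcode = (word15 >> 9) & 0x3F
    i = (word15 >> 6) & 0x7
    j = (word15 >> 3) & 0x7
    k = word15 & 0x7
    return opcode, i, j, k

def decode_long(word30):
    """30-bit format: opcode(6) i(3) j(3) disp(18)  ->  Ri <- M[Rj + disp]"""
    opcode = (word30 >> 24) & 0x3F
    i = (word30 >> 21) & 0x7
    j = (word30 >> 18) & 0x7
    disp = word30 & 0x3FFFF
    return opcode, i, j, disp

print(decode_short(0b010101_001_010_011))                    # (21, 1, 2, 3)
print(decode_long(0b010101_001_010_000000000000001000))      # (21, 1, 2, 8)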

CDC 6600: Datapath

[Datapath diagram: central memory (128K words, 32 banks, 1µs cycle) exchanges operands and results with 8 x 60-bit operand registers (X), which feed the 10 functional units; 8 x 18-bit address registers (A) and 8 x 18-bit index registers (B) supply operand and result addresses back to memory, and instructions are fetched through an instruction stack into the IR.]

CDC 6600 ISA designed to simplify high-performance implementation
- Use of three-address, register-register ALU instructions simplifies pipelined implementation
  - Only 3-bit register specifier fields checked for dependencies
  - No implicit dependencies between inputs and outputs
- Decoupling the setting of the address register (Ar) from retrieving the value from the data register (Xr) simplifies providing multiple outstanding memory accesses
  - Software can schedule the load of the address register before the use of the value
  - Can interleave independent instructions in between
- CDC 6600 had multiple parallel but unpipelined functional units
  - E.g., 2 separate multipliers
  - Follow-on machine CDC 7600 used pipelined functional units
- Foreshadows later RISC designs

CDC 6600: Vector Addition

      B0 ← -n
loop: JZE B0, exit
      A0 ← B0 + a0      // load X0
      A1 ← B0 + b0      // load X1
      X6 ← X0 + X1
      A6 ← B0 + c0      // store X6
      B0 ← B0 + 1
      jump loop

Ai = address register, Bi = index register, Xi = data register

CDC 6600 Scoreboard
- Instructions dispatched in order to functional units provided there is no structural hazard or WAW hazard
  - Stall on structural hazard (no functional unit available)
  - Only one pending write to any register
- Instructions wait for input operands (RAW hazards) before execution
  - Can execute out of order
- Instructions wait for the output register to be read by preceding instructions (WAR hazards)
  - Result held in the functional unit until the register is free
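
A minimal Python sketch of the dispatch and completion rules above (the class and method names are mine; the real scoreboard is a set of hardware tables, and several timing subtleties are glossed over):

# Sketch of CDC 6600-style scoreboard hazard checks (illustrative only).

class Scoreboard:
    def __init__(self, units):
        self.busy = {u: False for u in units}  # structural hazard: unit in use
        self.writer = {}    # reg -> unit with the single pending write (WAW rule)
        self.readers = {}   # reg -> units that still must read the old value (WAR rule)
        self.waiting = {}   # unit -> {src reg: producing unit} not yet available (RAW rule)

    def can_dispatch(self, unit, dest):
        # In-order dispatch stalls on a structural hazard or a WAW hazard.
        return (not self.busy[unit]) and (dest not in self.writer)

    def dispatch(self, unit, dest, srcs):
        assert self.can_dispatch(unit, dest)
        self.busy[unit] = True
        self.waiting[unit] = {s: self.writer[s] for s in srcs if s in self.writer}
        for s in srcs:
            if s not in self.writer:                        # will read the current value,
                self.readers.setdefault(s, set()).add(unit) # so later writers must wait
        self.writer[dest] = unit

    def can_read_operands(self, unit):
        # RAW rule: wait until every source operand has been produced;
        # ready instructions may proceed out of order.
        return not self.waiting[unit]

    def read_operands(self, unit, srcs):
        assert self.can_read_operands(unit)
        for s in srcs:
            self.readers.get(s, set()).discard(unit)

    def can_write_result(self, unit, dest):
        # WAR rule: the result is held in the functional unit until all earlier
        # readers of dest have picked up the old value.
        return not self.readers.get(dest, set())

    def write_result(self, unit, dest):
        assert self.can_write_result(unit, dest)
        self.busy[unit] = False
        del self.writer[dest]
        for consumer, w in self.waiting.items():   # wake up consumers of this result
            if w.pop(dest, None) is not None:
                self.readers.setdefault(dest, set()).add(consumer)

# Example: X6 <- X0 * X1 goes to the multiplier, then X0 <- X2 + X3 to the adder.
# The adder's write to X0 must wait until the multiplier has read the old X0 (WAR).
sb = Scoreboard(["mult1", "adder"])
sb.dispatch("mult1", dest="X6", srcs=["X0", "X1"])
sb.dispatch("adder", dest="X0", srcs=["X2", "X3"])
sb.read_operands("adder", ["X2", "X3"])
print(sb.can_write_result("adder", "X0"))   # False: mult1 has not read old X0 yet
sb.read_operands("mult1", ["X0", "X1"])
print(sb.can_write_result("adder", "X0"))   # True: register is now free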

IBM Memo on CDC 6600

Thomas Watson Jr., IBM CEO, August 1963: “Last week, Control Data ... announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers... Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world's most powerful computer.”

To which Cray replied: “It seems like Mr. Watson has answered his own question.”

IBM 360/91 Floating-Point Unit, R. M. Tomasulo, 1967

[Diagram: the floating-point register file and the load buffers (from memory) feed reservation stations ("instruction templates") distributed by functional unit, three in front of the adder and two in front of the multiplier, plus store buffers (to memory). Every register, buffer, and reservation-station operand holds a presence bit "p" plus a tag/data field.]

Results are broadcast as <tag, result> on a common bus. The common bus ensures that data is made available immediately to all the instructions waiting for it: each waiting entry matches the tag and, if equal, copies the value and sets the presence bit "p".
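
Here is a minimal Python sketch of the tag/data fields and the common-bus broadcast described above (station counts, tag names, and field names are illustrative assumptions, not the 360/91's actual organization):

# Sketch of Tomasulo-style reservation stations and the common data bus.
# Each operand is (p, tag, data): if p is set the data is present, otherwise the
# entry waits for a result with a matching tag to appear on the common bus.

class Operand:
    def __init__(self, p, tag=None, data=None):
        self.p, self.tag, self.data = p, tag, data

class ReservationStation:
    def __init__(self, tag):
        self.tag = tag                    # name broadcast with this station's result
        self.op = None
        self.src1 = self.src2 = None

    def ready(self):
        return self.op is not None and self.src1.p and self.src2.p

    def snoop(self, tag, result):
        # Match tag; if equal, copy value and set presence bit "p".
        for s in (self.src1, self.src2):
            if s is not None and not s.p and s.tag == tag:
                s.p, s.data = True, result

def broadcast(stations, regfile, tag, result):
    """Common data bus: the result goes to every waiting consumer at once."""
    for rs in stations:
        rs.snoop(tag, result)
    for r, v in regfile.items():
        if not v.p and v.tag == tag:
            regfile[r] = Operand(True, data=result)

# Usage: fadd.d f2, f0, f1 waits in an adder station for two outstanding loads.
regfile = {"f0": Operand(False, tag="load1"), "f1": Operand(False, tag="load2"),
           "f2": Operand(False, tag="add1")}
add1 = ReservationStation("add1")
add1.op, add1.src1, add1.src2 = "fadd.d", Operand(False, tag="load1"), Operand(False, tag="load2")

broadcast([add1], regfile, "load1", 1.5)      # first load returns
broadcast([add1], regfile, "load2", 2.5)      # second load returns
if add1.ready():
    broadcast([add1], regfile, add1.tag, add1.src1.data + add1.src2.data)
print(regfile["f2"].p, regfile["f2"].data)    # True 4.0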

IBM ACS
- Second supercomputer project (“Y”) started at IBM in response to the CDC 6600
- Dynamic instruction scheduling (DIS) invented by Lynn Conway for ACS
- Used unary encoding of register specifiers and wired-OR logic to detect any hazards (similar design used in the Alpha 21264 in 1995!)
- Seven-issue, out-of-order processor
- Two decoupled streams, each with DIS
- Cancelled in favor of IBM 360-compatible machines
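
As a rough sketch of the unary-encoding idea (one-hot specifiers compared with ANDs whose outputs are OR-ed together; the Python below is illustrative, the real designs use wired-OR gates):

# Hazard detection with unary (one-hot) register specifiers: each specifier is a
# one-hot bit vector, and a hazard between any pending destination and any new
# source is simply the OR of all the pairwise bitwise ANDs.

def one_hot(reg_num):
    return 1 << reg_num

def raw_hazard(pending_dests, new_srcs):
    hazard_bits = 0
    for d in pending_dests:
        for s in new_srcs:
            hazard_bits |= d & s          # per-register AND
    return hazard_bits != 0               # wired-OR of all match signals

pending = [one_hot(2), one_hot(5)]        # older instructions will write x2 and x5
print(raw_hazard(pending, [one_hot(5), one_hot(1)]))   # True: x5 matches
print(raw_hazard(pending, [one_hot(3), one_hot(4)]))   # False: no overlap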

Precise Traps and Interrupts
- This was the remaining challenge for early out-of-order machines
- Technology scaling meant plenty of performance improvement was available with simple in-order pipelining and cache improvements
- Out-of-order machines disappeared from the 1960s until the 1990s

Acknowledgements

This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)