Drinking from the Firehose Decode in the Mill™ CPU Architecture

Stanford EE380, 5/29/2013

Instructions: format and decoding in the Mill Architecture.
New to the Mill:
- Dual code streams
- No-parse instruction shifting
- Double-ended decode
- Zero-length no-ops
- In-line constants to 128 bits
Example operation: addsx(b2, b5)

What chip architecture is this?
cores: 4
issue width: 4 operations
clock rate: 3300 MHz
power: 130 Watts
performance: 52.8 Gips
price: $885
Answer: a general-purpose out-of-order superscalar (Intel Xeon E5-2643)

What chip architecture is this?
cores: 1
issue width: 8 operations
clock rate: 456 MHz
power: 1.1 Watts
performance: 3.6 Gips
price: $17
Answer: an in-order VLIW signal processor (Texas Instruments TMS320C6748)

Which is better? The metric: performance per Watt per dollar.
out-of-order superscalar: 4 cores, 4 operations issued, 3300 MHz, 130 Watts, 52.8 Gips, $885
in-order VLIW DSP: 1 core, 8 operations issued, 456 MHz, 1.1 Watts, 3.6 Gips, $17

Which is better?
out-of-order superscalar: 52.8 Gips at 130 Watts and $885, giving 0.46 mips/W/$
in-order VLIW DSP: 3.6 Gips at 1.1 Watts and $17, giving 195 mips/W/$
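
The two figures of merit are straightforward arithmetic; a quick sketch (chip numbers copied from the slides above) reproduces them:

```python
# Performance per Watt per dollar, as on the slides.
# Gips = giga-instructions per second; the slide metric is mips/W/$.

def mips_per_watt_per_dollar(gips, watts, dollars):
    """Convert Gips to mips, then divide by power and price."""
    return gips * 1000 / (watts * dollars)

xeon = mips_per_watt_per_dollar(52.8, 130, 885)   # out-of-order superscalar
dsp  = mips_per_watt_per_dollar(3.6, 1.1, 17)     # in-order VLIW DSP

print(round(xeon, 2))  # 0.46
print(round(dsp))      # 193 (the slide rounds this figure to ~195)
```

The roughly 400X gap between the two quotients is the point of the comparison, not the third significant digit.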

Which is better? Why the 400X difference? It is not 32- vs. 64-bit, and not raw throughput (3,600 mips vs. 52,800 mips): the workloads are incompatible. Signal processing ≠ general-purpose. The goal, and the technical challenge: DSP numbers on general-purpose workloads.

Our result: OOTBC Mill Gold.x2
cores: 2
issue width: 33 operations
clock rate: 1200 MHz
power: 28 Watts
performance: 79.3 Gips
price: $85
That is 33 mips/W/$, vs. 0.46 for the superscalar and 195 for the DSP.

33 operations per cycle peak??? Why?
80% of code is in loops, and pipelined loops have unbounded ILP. DSP loops are software-pipelined, but few general-purpose loops can be piped, at least on conventional architectures.
Solution: pipeline (almost) all loops, and throw function hardware at the pipe.
Result: loops now take < 15% of cycles.
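
The payoff from software pipelining can be made concrete with a toy schedule. The three-stage loop body here (load, add, store, each one cycle) is an illustrative assumption, not a Mill specific: unpipelined, 5 iterations cost 15 cycles; pipelined, overlapped iterations finish in 7, with the steady state retiring one iteration per cycle.

```python
# Toy software-pipelining schedule: a loop body with three dependent
# one-cycle stages. Unpipelined, each iteration takes 3 cycles;
# software-pipelined, iterations overlap and the steady state
# completes one iteration per cycle.

STAGES = ["load", "add", "store"]

def pipelined_schedule(iterations):
    """Rows are cycles; each row lists (stage, iteration) issued that cycle."""
    depth = len(STAGES)
    cycles = iterations + depth - 1   # fill + steady state + drain
    sched = []
    for c in range(cycles):
        row = [(STAGES[c - i], i) for i in range(iterations)
               if 0 <= c - i < depth]
        sched.append(row)
    return sched

sched = pipelined_schedule(5)
print(len(sched))     # 7 cycles for 5 iterations, instead of 15
print(len(sched[2]))  # 3 -- steady-state cycles issue all three stages at once
```

The wider the machine, the more such overlapped stages it can issue per cycle, which is where a 33-operation peak gets used.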

33 operations per cycle peak??? How? The biggest problem is decode.
Fixed-length instructions are easy to parse, but instruction size balloons: 32 bits × 33 ops = 132 bytes. Ouch! Instruction cache pressure: a 32K iCache holds only 248 instructions. Ouch!!

33 operations per cycle peak??? How? The biggest problem is decode.
Variable-length instructions are hard to parse: x86 decode heroics get only 4 ops. Instruction size is better: Mill ~15 bits × 33 ops = 61 bytes. Ouch! Instruction cache pressure: a 32K iCache holds only 537 instructions. Ouch!!
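
The cache-pressure arithmetic on these two slides is easy to check; a small sketch using the slides' numbers:

```python
# Instruction-cache pressure for a 33-operation-wide machine, using
# the slides' figures: fixed 32-bit encodings vs. the Mill's ~15-bit ops.

ICACHE_BYTES = 32 * 1024
OPS_PER_INSTRUCTION = 33

def instructions_in_icache(bits_per_op):
    """Return (bytes per instruction, whole instructions that fit in the iCache)."""
    bytes_per_instruction = OPS_PER_INSTRUCTION * bits_per_op // 8
    return bytes_per_instruction, ICACHE_BYTES // bytes_per_instruction

print(instructions_in_icache(32))  # (132, 248) -- fixed-length encoding
print(instructions_in_icache(15))  # (61, 537)  -- Mill variable-length encoding
```

Halving the per-op size roughly doubles the instructions the cache can hold, but 537 is still painfully few; hence the split-stream encoding that follows.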

A stream of instructions.
Logical model: a single stream of instructions; the program counter selects one, which is decoded and executed.
Physical model: the program counter selects a bundle of instructions, which is decoded as a unit and executed on many parallel execute units.

Fixed-length instructions: every instruction in the bundle starts at a known offset, so the decoders all work in parallel and feed their execute units directly. Easy! (and BIG)

Variable-length instructions: where does the next one start? Each decoder must know where the previous instruction ends before it can begin, so parsing a wide bundle has polynomial cost!

Polynomial cost is OK if N=3, not if N=30. BUT two bundles of length N are much easier than one bundle of length 2N. So split each bundle in half, and have two streams of half-bundles.
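
The split-stream argument can be sketched with a crude cost model. The quadratic exponent here is an illustrative assumption standing in for "polynomial"; the point is only that parse cost is superlinear in bundle width, so two half-size problems are cheaper than one full-size one:

```python
# Crude cost model for variable-length parse: to decode position k in
# one cycle, hardware must speculatively parse at every possible start
# offset, so cost grows roughly with the square of the ops per bundle
# (the exponent is assumed for illustration). Splitting into two
# half-bundles decoded by independent units turns one (2N)^2 problem
# into two N^2 problems.

def parse_cost(ops_per_bundle):
    return ops_per_bundle ** 2

n = 16
one_wide_bundle  = parse_cost(2 * n)   # 1024 cost units
two_half_bundles = 2 * parse_cost(n)   # 512 cost units

print(one_wide_bundle // two_half_bundles)  # 2
```

With a quadratic model the split wins a factor of two; any superlinear exponent gives some saving, and the saving grows as the exponent does.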

Two streams of half-bundles: one logical instruction stream, carried as two physical streams, each with its own program counter and decoder feeding the shared execute units. But how do you branch two streams?

Extended Basic Blocks (EBBs): group each stream into Extended Basic Blocks, single-entry multiple-exit sequences of bundles. Branches can only target EBB entry points; it is not possible to jump into the middle of an EBB. Control flow is thus a chain of branches from EBB to EBB.

Take two half-EBBs, each laid out in memory from its head bundle toward higher addresses, with execution order following the addresses. Then reverse one of them in memory, so its execution order runs from higher addresses toward lower. (In the figures, the two halves of each instruction have the same color.)

And join them head-to-head: the two EBB heads meet at a single entry point, one half-EBB extending toward lower memory and the other toward higher.

Take a branch, say a "jump loop" following an add and a load: its effective address is the joint entry point, and the transfer sets both program counters to that one address.

Take a branch… From the entry point, the two decoders then consume half-bundles in opposite directions, one program counter walking toward lower addresses and the other toward higher, feeding the execute units each cycle.

After a branch: a transfer of control sets both program counters, XPC and FPC, to the entry point. XPC (exucode) moves forward through increasing addresses; FPC (flowcode) moves backwards through decreasing addresses. By cycle n the two have diverged in opposite directions from the shared entry point.

Physical layout: a conventional design has one iCache feeding one decoder, with a critical wire distance from cache to decode to the execute units. The Mill places an iCache and decoder on each side of the execute units, one per stream, so each critical distance is short.

Generic Mill bundle format. The Mill issues one instruction (two half-bundles) per cycle. That one instruction can call for many independent operations, all of which issue together and execute in parallel. Each instruction bundle runs from one byte boundary to the next: a fixed-length header, followed by variable-length blocks of operations (block 1 … block n), with an alignment hole at the end. All operations in a block use the same format. The header contains the byte count of the whole bundle and an operation count for each block, so parsing reduces to isolating blocks.
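
As a sketch of why block isolation is cheap, here is a toy bundle parser. The layout is an assumption for illustration (a header carrying the total byte count and per-block op counts, with a single fixed op size per block); real Mill formats differ per family member:

```python
# Sketch of block isolation in a Mill-like bundle, under an assumed
# layout: the header holds the bundle's byte count and one op count
# per block, and every op in a block is the same (assumed) size.
# Parsing then reduces to slicing -- no per-op length decode needed.

from typing import NamedTuple

class Bundle(NamedTuple):
    byte_count: int        # whole-bundle length, from the header
    block_op_counts: list  # ops per block, from the header
    payload: bytes         # the blocks, concatenated

OP_BYTES = 2               # assumed: every op in this sketch is 2 bytes

def isolate_blocks(bundle):
    """Slice the payload into per-block byte ranges using only header info."""
    blocks, offset = [], 0
    for count in bundle.block_op_counts:
        size = count * OP_BYTES
        blocks.append(bundle.payload[offset:offset + size])
        offset += size
    return blocks

b = Bundle(byte_count=10, block_op_counts=[2, 3],
           payload=bytes(range(10)))
print([len(blk) for blk in isolate_blocks(b)])  # [4, 6]
```

Because every block boundary is computable from the header alone, the blocks can be peeled off without ever decoding an individual operation first.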

Generic instruction decode. The bundle is parsed from both ends toward the middle; two blocks are isolated per cycle. In cycle 0, the header's byte count goes to the instruction shifter, the block 1 count goes to the block 2 shifter, and block 1 goes to its decoder. In cycle 1, block 2 (from the block 2 buffer) goes to its decoder, and block 3 (from the far end of the bundle buffer, next to the hole) goes to the block 3 decoder.

Elided No-ops. Sometimes a cycle has work only for Exu, only for Flow, or neither. Rather than spending an explicit no-op, the number of cycles to skip is encoded in the alignment hole of the other code stream. Only rarely, when there are not enough hole bits, must explicit no-ops still be used; otherwise, no-ops cost nothing.
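
A sketch of how the elision might work. The pairing and field names here are invented for illustration: assume each flow-side bundle's alignment hole carries an "exu skip" count telling the exu decoder how many cycles it has nothing to issue:

```python
# Sketch of elided no-ops: one instruction (a flow half-bundle plus an
# exu half-bundle) issues per cycle. When the exu side has no work, an
# explicit no-op would waste fetch bytes; instead (assumed encoding)
# the flow bundle's hole says "exu idles for N cycles".

def replay(flow_bundles, exu_bundles):
    """flow_bundles: list of (flow_ops, exu_skip_after) pairs.
    Returns the per-cycle (flow_issue, exu_issue) schedule."""
    schedule, exu_idle, exu_iter = [], 0, iter(exu_bundles)
    for flow_ops, exu_skip_after in flow_bundles:
        if exu_idle > 0:
            exu_ops, exu_idle = None, exu_idle - 1   # elided no-op: zero bytes
        else:
            exu_ops = next(exu_iter)
            exu_idle = exu_skip_after                # skip count from the hole
        schedule.append((flow_ops, exu_ops))
    return schedule

# exu has work in cycles 0 and 3; cycle 0's flow bundle says "then skip
# 2", so cycles 1-2 carry no exu bytes at all.
sched = replay([("f0", 2), ("f1", 0), ("f2", 0), ("f3", 0)],
               ["e0", "e3"])
print(sched)  # [('f0', 'e0'), ('f1', None), ('f2', None), ('f3', 'e3')]
```

The idle cycles cost no fetch bandwidth and no cache bytes on the skipped side, which is the sense in which elided no-ops are free.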

Mill pipeline (phase / cycles / what flows):
- prefetch: <varies>, lines from mem/L2 into the L1 I$
- fetch: F0-F2, lines from the L1 I$ into the L0 I$
- shift: D0, bundles from the L0 I$ through the shifter
- decode: D0-D2, bundles into operations
- issue: <none>
- execute: X0-X4+, operations into results
- retire: <none>
- reuse: results
4-cycle mispredict penalty from the top cache.

Split-stream, double-ended encoding. One Mill thread has:
- Two program counters
- Following two instruction half-bundle streams
- Drawn from two instruction caches
- Feeding two decoders, one of which runs backwards
- And each half-bundle is parsed from both ends
For each side: instruction size ~15 bits × 17 ops = 32 bytes; instruction cache pressure: a 32K iCache holds 1024 instructions; decode rate: 30+ operations per cycle.

Want more? ootbcomp.com
USENIX Vail, June 23-26: Belt machines, performance with no registers.
IEEE Computer Society SVC, September 10: sequentially consistent, stall-free, in-order memory access.
Sign up for technical announcements, white papers, etc.: ootbcomp.com