CSE502: Computer Architecture Superscalar Decode.


CSE502: Computer Architecture Superscalar Decode for RISC ISAs
Decode X insns. per cycle (e.g., 4-wide)
– Just duplicate the hardware
– Instructions aligned at 32-bit boundaries
[Figure: a scalar decoder turns one 32-bit inst into one decoded inst per cycle; a 4-wide superscalar fetch feeds four parallel decoders, producing four decoded insts per cycle]

CSE502: Computer Architecture uop Limits
How many uops can a decoder generate?
– For complex x86 insts, many are needed (10s, 100s?)
– Makes the decoder horribly complex
– Typically there's a limit to keep complexity under control
One x86 instruction → 1-4 uops
– Most instructions translate to only a few uops
What if a complex insn. needs more than 4 uops?

CSE502: Computer Architecture UROM/MS for Complex x86 Insts
UROM (microcode-ROM) stores the uop equivalents
– Used for nasty x86 instructions
– Complex (> 4 uops, e.g., PUSHA or REP.MOV)
– Obsolete (like AAA)
Microsequencer (MS) handles the UROM interaction

CSE502: Computer Architecture UROM/MS Example (3 uop-wide)
Complex instructions get their uops from the microcode sequencer
[Figure: cycle-by-cycle decode. Simple insts (ADD, STORE, SUB) produce uops directly (ADD; STA, STD; SUB). When REP.MOV reaches decode, the MS streams its uops (LOAD, STORE, INC, mJCC, ...) from the UROM over many cycles while the following insts (ADD [ ], XOR) wait; normal decode resumes at cycle n+1]
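The split between the simple decoders and the microsequencer can be sketched as follows. This is an illustrative model only: the uop counts and naming are hypothetical, not real x86 microcode data.

```python
# Hypothetical uop counts per instruction (illustrative, not real x86 data).
UOP_COUNT = {"ADD": 1, "STORE": 2, "SUB": 1, "PUSHA": 18, "REP.MOV": 40}
MAX_DECODER_UOPS = 4   # decoder limit from the slide

def decode(inst):
    """Simple decoders emit short uop flows; the MS reads long ones from UROM."""
    n = UOP_COUNT[inst]
    if n <= MAX_DECODER_UOPS:
        return [f"{inst}.uop{i}" for i in range(n)]        # direct decode
    return [f"UROM[{inst}].uop{i}" for i in range(n)]      # microsequenced

print(len(decode("STORE")))    # 2 uops, decoded directly
print(decode("PUSHA")[0])      # first uop streamed from the microcode ROM
```

In real hardware the MS occupies the decode slots for as many cycles as the uop flow needs, which is why a long REP.MOV stalls the instructions behind it.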

CSE502: Computer Architecture Superscalar CISC Decode
Instruction Length Decode (ILD)
– Where are the instructions?
– Limited decode – just enough to parse prefixes, modes
Shift/Alignment
– Get the right bytes to the decoders
Decode
– Crack into uops
Do this for N instructions per cycle!

CSE502: Computer Architecture ILD Recurrence/Loop
PC_i = X
PC_{i+1} = PC_i + sizeof( Mem[PC_i] )
PC_{i+2} = PC_{i+1} + sizeof( Mem[PC_{i+1}] ) = PC_i + sizeof( Mem[PC_i] ) + sizeof( Mem[PC_{i+1}] )
Can't find the start of the next insn. before decoding the first
– Must do ILD serially
– ILD of 4 insns/cycle implies cycle time will be 4x
Critical component is not pipeline-able
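The recurrence can be made concrete with a toy model. This is not real x86 length decoding: here each "instruction" simply encodes its own length in its first byte, just to show that each start address depends on every length before it.

```python
# Toy model of serial ILD: each instruction's start address depends on the
# length of every instruction before it, so length decode is inherently serial.

def sizeof(mem, pc):
    """Hypothetical length decoder: length is encoded in the first byte."""
    return mem[pc]

def serial_ild(mem, pc, n):
    """Find the start PCs of the next n instructions, one sizeof() at a time."""
    starts = []
    for _ in range(n):
        starts.append(pc)
        pc = pc + sizeof(mem, pc)   # PC_{i+1} = PC_i + sizeof(Mem[PC_i])
    return starts

# Toy byte stream: the first byte of each "instruction" is its length.
mem = {0: 3, 3: 2, 5: 6, 11: 1}
print(serial_ild(mem, 0, 4))   # [0, 3, 5, 11]
```

The loop-carried dependence on `pc` is exactly what prevents pipelining the ILD stage: no amount of extra hardware lets iteration i+1 start before iteration i finishes.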

CSE502: Computer Architecture Bad x86 Decode Implementation
[Figure: 16 instruction bytes enter an ILD (limited decode) that finds Length 1; a left shifter strips those bytes and the next ILD finds Length 2, and so on. Three serial ILD + shift stages feed Decoders 1-3, leaving the remainder for the next cycle]
ILD dominates cycle time; not scalable

CSE502: Computer Architecture Hardware-Intensive Decode
Decode from every possible instruction starting point!
Giant MUXes to select instruction bytes
[Figure: an ILD + decoder pair at every byte offset, with MUXes selecting the results at the true instruction boundaries]

CSE502: Computer Architecture ILD in Hardware-Intensive Approach
[Figure: per-byte ILDs report the lengths (6, 4, and 3 bytes); a chain of adders combines the lengths to locate each instruction's start]
Previous: 3 ILD + 2 add
Now: 1 ILD + 2 (mux+add)

CSE502: Computer Architecture Predecoding
ILD loop is hardware intensive
– Impacts latency
– Consumes substantial power
If instructions A, B, and C are decoded...
– ... lengths for A, B, and C will still be the same next time
– No need to repeat ILD
Possible to cache the ILD work
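The caching idea can be sketched in a few lines. A dict stands in for the predecode bits that real hardware stores alongside the line in the I$; `sizeof_slow` is a placeholder for the expensive limited decode.

```python
# Sketch: do ILD once, cache the result, and skip the serial work on reuse.
ild_cache = {}   # pc -> instruction length (stands in for predecode bits in the I$)

def sizeof_slow(mem, pc):
    """Placeholder for the expensive limited decode of prefixes and modes."""
    return mem[pc]

def sizeof_cached(mem, pc):
    if pc not in ild_cache:                  # first decode (e.g., on I$ fill)
        ild_cache[pc] = sizeof_slow(mem, pc)
    return ild_cache[pc]                     # repeat decodes are free

mem = {0: 3, 3: 2}
sizeof_cached(mem, 0)            # computes and caches length 3
print(sizeof_cached(mem, 0))     # 3, served from the cache
```

Since hot code is fetched many times, the amortized ILD cost drops toward zero; the price is the extra storage, as the AMD K5 slides below quantify.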

CSE502: Computer Architecture Decoder Example: AMD K5 (1/2)
Compute ILD on fetch, store ILD in the I$
[Figure: predecode logic sits between memory and the I$; each 8-bit instruction byte is stored with 5 predecode bits, so an 8-byte fill becomes 13 bytes in the I$. Decode reads 16 (8-bit inst + 5-bit predecode) entries and emits up to 4 ROPs]

CSE502: Computer Architecture Decoder Example: AMD K5 (2/2)
Predecode makes decode easier by providing:
– Instruction start/end location (ILD)
– Number of ROPs needed per inst
– Opcode and prefix locations
Power/performance tradeoffs
– Larger I$ (increases array by 62.5%)
– Longer I$ latency, more I$ power consumption
– Removes logic from decode
– Shorter pipeline, simpler logic
– Cache and reuse decode work → less decode power
– Longer effective L1-I miss latency (ILD on fill)
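The 62.5% figure follows directly from the storage layout on the previous slide:

```python
# The K5 stores 5 predecode bits next to each 8-bit instruction byte,
# which is where the 62.5% array growth comes from.
inst_bits = 8
predecode_bits = 5
overhead = predecode_bits / inst_bits
print(f"I$ array grows by {overhead:.1%}")   # I$ array grows by 62.5%
```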

CSE502: Computer Architecture Decoder Example: Intel P-Pro
Only Decoder 0 interfaces with the uROM and MS
If an insn. in Decoder 1 or Decoder 2 requires > 1 uop:
1) do not generate output
2) shift to Decoder 0 on the next cycle
[Figure: 16 raw instruction bytes feed Decoder 0 (up to 4 uops, connected to the uROM) and Decoders 1 and 2 (1 uop each); a Branch Address Calculator resteers fetch if needed]

CSE502: Computer Architecture Fetch Rate is an ILP Upper Bound
Instruction fetch limits performance
– To sustain an IPC of N, must sustain a fetch rate of N per cycle
– If you consume 1500 calories per day but burn 2000 calories per day, you will eventually starve.
– Need to fetch N on average, not on every cycle
N-wide superscalar ideally fetches N insns. per cycle
This doesn't happen in practice due to:
– Instruction cache organization
– Branches
– ... and the interaction between the two

CSE502: Computer Architecture Instruction Cache Organization
To fetch N instructions per cycle...
– L1-I line must be wide enough for N instructions
PC register selects the L1-I line
A fetch group is the set of insns. starting at PC
– For an N-wide machine, [PC, PC+N-1]
[Figure: the PC indexes a cache line of tagged instructions, which feeds the decoders]

CSE502: Computer Architecture Fetch Misalignment (1/2)
If PC = xxx01001 and N=4:
– Ideal fetch group is xxx01001 through xxx01100 (inclusive)
[Figure: the fetch group straddles the line boundary, so only part of it lies in the selected line]
Misalignment reduces fetch width

CSE502: Computer Architecture Fetch Misalignment (2/2)
Now takes two cycles to fetch N instructions
– ½ fetch bandwidth!
[Figure: cycle 1 fetches the tail of the first line; cycle 2 fetches the rest of the group from the next line]
Might not be ½: the leftover slots can be combined with the next fetch
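The bandwidth loss is easy to model. Assuming (for illustration) a line that holds 8 instructions and a 4-wide fetch, the number of instructions delivered in one cycle is capped by the distance from PC to the end of the line:

```python
# Sketch: how many of the N wanted instructions fit before the line ends?
N = 4          # fetch width
LINE = 8       # instructions per L1-I line (assumed for illustration)

def fetched_this_cycle(pc):
    """Instructions delivered in one cycle, limited by the end of the line."""
    offset = pc % LINE
    return min(N, LINE - offset)

print(fetched_this_cycle(0))   # 4: aligned, full fetch group
print(fetched_this_cycle(5))   # 3: group spills into the next line
print(fetched_this_cycle(7))   # 1: worst case, one insn. this cycle
```

Averaging this function over all start offsets gives the effective fetch bandwidth of the naive design.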

CSE502: Computer Architecture Reducing Fetch Fragmentation (1/2)
Make |Fetch Group| < |L1-I Line|
[Figure: the cache line holds more instructions than the fetch group, so the group can start at many offsets and still fit within the line]
Can deliver N insns. whenever PC is at least N insns. from the end of the line

CSE502: Computer Architecture Reducing Fetch Fragmentation (2/2)
Needs a rotator to deliver insns. to the decoders in the correct order
[Figure: the rotator realigns the fetch group from its offset within the line, producing an aligned fetch group for the decoders]
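The rotator's job can be sketched as a barrel rotate followed by truncation. This is a simplified model: in real hardware the wrapped-around slots would be filled from the next sequential line rather than from the start of the same line.

```python
# Sketch of the rotator: the fetch group may start mid-line, so rotate the
# line until the instruction at PC sits in decoder slot 0, then keep n slots.
def rotate_fetch_group(line, pc_offset, n):
    rotated = line[pc_offset:] + line[:pc_offset]   # barrel-rotate the line
    return rotated[:n]                              # aligned group of n insns

line = ["i0", "i1", "i2", "i3", "i4", "i5", "i6", "i7"]
print(rotate_fetch_group(line, 2, 4))   # ['i2', 'i3', 'i4', 'i5']
# With offset 6, slots 2-3 wrap; hardware would source them from the next line:
print(rotate_fetch_group(line, 6, 4))   # ['i6', 'i7', 'i0', 'i1']
```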