Finishing out EECS 470 A few snapshots of the real world.


Real processors: how they differ from your project
What we've talked about so far isn't grounded in the real world in any meaningful way.
– That is, we haven't really looked at how real processors do things.
Today we'll look at two processors.
– We'll start with a 2003 core from AMD. Lots of details are available, and it is close to your project.
– Then we'll jump to the latest Intel core and look at a performance issue.

AMD 64-bit core Most taken from

Bit-interleaved busses running “North-South”

Integer Decode/Dispatch
3 types of instructions:
– Direct path: RISC-like.
– Vector path: broken into smaller instructions via microcode.
– Double: 128-bit instructions that can be broken into two independent 64-bit instructions are called Doubles; the others are done via microcode. Most 128-bit SSE and SSE2 instructions are made into Doubles.

RS
Each cycle an instruction is issued into one of 3 lanes.
– Each lane has 8 RSs, 1 ALU, and 1 AGU (Address Generation Unit).
– Each RS sees broadcasts from all ALUs, AGUs, L/S units, etc.

Rename
Break the physical register file into 2 parts (sort of like the P6 scheme with ARF/RoB).
– 72 in-flight instructions are kept in the RoB.
The other structure is the IFFRF: Integer Future File and Register File.
– 16 registers of committed state
– 16 "future registers"
– 8 scratch-pad registers

Future file
In the P6 scheme we had to look in 3 places for the data:
– The PRF
– The RoB
– The CDB (later)
Here we look in the FF, or in the CDB-like things later.
– The FF holds the speculative value if it is known.
– At execution complete, instructions check whether they were the last thing to dispatch that writes to a given physical register. This is done by tagging the FF with the RoB number.
– If they were the last to have that AR as a destination, they update the FF.

How does the FF work?
At issue we:
– Check the FF for source operands
– Reserve a spot in the RoB
– Place our tag (RoB number) in the FF
– Mark the FF entry as invalid
At EX complete we:
– Send RoB number and data to the CDB
– Send data to the RoB
– Update the FF if the tag matches
At retire:
– Update the ARF value (from the RoB)
At mispredict:
– Copy the ARF value into the FF
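The issue / EX complete / retire / mispredict steps above can be sketched in C roughly as follows. This is a minimal model, not the AMD design: the struct layouts, the function names (`core_init`, `read_source`, `issue`, `complete`, `retire`, `mispredict`), and the single flat circular RoB are all assumptions made for illustration. Only the sizes (16 architectural registers, 72 RoB entries) come from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_ARCH_REGS 16  /* 16 committed + 16 future registers (from the slides) */
#define ROB_SIZE      72  /* 72 in-flight instructions (from the slides) */

/* Hypothetical layouts, chosen only for this sketch. */
typedef struct { uint64_t value; bool valid; int tag; } FFEntry;
typedef struct { int dest; uint64_t value; bool done; } RobEntry;

typedef struct {
    uint64_t arf[NUM_ARCH_REGS];  /* committed state */
    FFEntry  ff[NUM_ARCH_REGS];   /* future file: newest speculative value */
    RobEntry rob[ROB_SIZE];
    int head, tail, count;
} Core;

void core_init(Core *c) {
    memset(c, 0, sizeof *c);
    for (int r = 0; r < NUM_ARCH_REGS; r++) {
        c->ff[r].valid = true;
        c->ff[r].tag = -1;
    }
}

/* Issue, step 1: check the FF for a source operand.
 * Returns true with the value if known, else false with the tag to wait on. */
bool read_source(const Core *c, int src, uint64_t *value, int *tag) {
    if (c->ff[src].valid) { *value = c->ff[src].value; return true; }
    *tag = c->ff[src].tag;
    return false;
}

/* Issue, steps 2-4: reserve a RoB slot, place our tag in the FF,
 * mark the FF entry invalid. Returns the RoB number, or -1 if the RoB is full. */
int issue(Core *c, int dest) {
    if (c->count == ROB_SIZE) return -1;
    int id = c->tail;
    c->rob[id] = (RobEntry){ .dest = dest, .value = 0, .done = false };
    c->tail = (c->tail + 1) % ROB_SIZE;
    c->count++;
    c->ff[dest].tag = id;
    c->ff[dest].valid = false;
    return id;
}

/* EX complete: write the RoB, and update the FF only if we are still
 * the newest writer of that register (tag match). */
void complete(Core *c, int id, uint64_t value) {
    c->rob[id].value = value;
    c->rob[id].done = true;
    FFEntry *e = &c->ff[c->rob[id].dest];
    if (!e->valid && e->tag == id) { e->value = value; e->valid = true; }
}

/* Retire: copy the head RoB entry into the ARF. Returns false if nothing retired. */
bool retire(Core *c) {
    if (c->count == 0 || !c->rob[c->head].done) return false;
    c->arf[c->rob[c->head].dest] = c->rob[c->head].value;
    c->head = (c->head + 1) % ROB_SIZE;
    c->count--;
    return true;
}

/* Mispredict: drop everything in flight and copy the ARF into the FF. */
void mispredict(Core *c) {
    c->head = c->tail = c->count = 0;
    for (int r = 0; r < NUM_ARCH_REGS; r++)
        c->ff[r] = (FFEntry){ .value = c->arf[r], .valid = true, .tag = -1 };
}
```

With two in-flight writers of the same register, only the younger one's `complete` updates the FF; the older value still reaches the ARF through `retire`, which is why the RoB data is written once and read once.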

What did the FF buy us?
P6-like advantages:
– No free list for the PRF
– Can just clear the RAT on a mispredict
But no need to access the RoB looking for data.
– RoB data is only written once (EX complete) and only read once (Commit)
Some pain:
– Early branch resolution looks hard

ROB It uses an 8-bit descriptor for 72 entries.

Re-Order-Buffer Tag definition
Tag fields, from bit 7 down to bit 0: wrap bit | re-order buffer index | sub-index 0..2. The re-order buffer index and sub-index together form the Instruction In Flight Number.
1) A sub-index 0, 1 or 2 which identifies from which of the three lanes the instruction was dispatched.
2) A value that identifies the "cycle" in which the instruction was dispatched. The "cycle counter" wraps to 0 after reaching 23.
3) A wrap bit. When two instructions have different wrap bits, the cycle counter has wrapped between the dispatches.
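As a rough illustration of how such an 8-bit tag could be packed and compared, here is a C sketch. The exact bit positions (wrap in bit 7, cycle counter in bits 6..2, sub-index in bits 1..0) are an assumption read off the flattened diagram; they are the only split that fits 1 + 5 + 2 bits. The names `RobTag`, `tag_pack`, `tag_unpack`, `tag_next_cycle`, `tag_slot` and `tag_older` are hypothetical, and so is the choice that lane 0 is the oldest within a cycle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { LANES = 3, CYCLES = 24 };  /* 3 lanes x 24 cycles = 72 RoB entries */

typedef struct {
    unsigned wrap;   /* 1 bit: flips each time the cycle counter wraps */
    unsigned cycle;  /* 0..23: dispatch "cycle" (re-order buffer index) */
    unsigned lane;   /* 0..2: sub-index, the lane the instruction used */
} RobTag;

/* Assumed layout: bit 7 = wrap, bits 6..2 = cycle, bits 1..0 = lane. */
uint8_t tag_pack(RobTag t) {
    return (uint8_t)(((t.wrap & 1u) << 7) | ((t.cycle & 0x1Fu) << 2) | (t.lane & 3u));
}

RobTag tag_unpack(uint8_t raw) {
    RobTag t = { (raw >> 7) & 1u, (raw >> 2) & 0x1Fu, raw & 3u };
    return t;
}

/* Advance the dispatch cycle counter: wraps to 0 after 23 and flips the wrap bit. */
RobTag tag_next_cycle(RobTag t) {
    if (++t.cycle == CYCLES) { t.cycle = 0; t.wrap ^= 1u; }
    return t;
}

/* Flat RoB slot number, 0..71. */
unsigned tag_slot(RobTag t) { return t.cycle * LANES + t.lane; }

/* True if a was dispatched before b. Valid only while both are in flight,
 * so they are at most one wrap apart. Lane order within a cycle is assumed. */
bool tag_older(RobTag a, RobTag b) {
    if (a.wrap == b.wrap) return tag_slot(a) < tag_slot(b);
    return tag_slot(a) > tag_slot(b);  /* counter wrapped between the two */
}
```

The wrap bit is what makes the age comparison work across the 23-to-0 boundary: without it, an instruction in cycle 0 would look older than one in cycle 23 even when it was dispatched later.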

More on the RoB
What is basically happening is that we have three RoBs.
– Each one is of size 24.
– We cycle through each one so that none gets ahead of the others.
– Reduces read/write ports!

Mispredictions
It looks like they wait until retirement to resolve all exceptions.
– Mispredictions are treated as exceptions!
They just clear everything and have the retired registers overwrite the speculative ones in the IFFRF.

More details
Each x86 instruction can launch both an ALU and an AGU operation.
– Because x86 has lots of memory operations, this makes sense.
ALUs broadcast the result tag one cycle early.
– So an RS can launch a dependent instruction to the ALU before the data arrives.
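The effect of the early tag broadcast can be sketched as a tiny wakeup/select loop in C. This is an illustrative model only: it assumes single-cycle ALU operations and one lane of 8 RSs, and the names `RsEntry`, `rs_wakeup`, `rs_select` and `rs_cycle` are hypothetical, not taken from any AMD documentation.

```c
#include <assert.h>
#include <stdbool.h>

#define RS_PER_LANE 8  /* each lane has 8 RSs (from the slides) */

typedef struct {
    bool busy;
    int  src_tag[2];    /* RoB tag each source is waiting on */
    bool src_ready[2];
    int  dest_tag;      /* RoB tag of the result this entry produces */
} RsEntry;

/* Tag broadcast: every waiting source that matches the tag becomes ready. */
void rs_wakeup(RsEntry rs[], int n, int tag) {
    for (int i = 0; i < n; i++) {
        if (!rs[i].busy) continue;
        for (int s = 0; s < 2; s++)
            if (!rs[i].src_ready[s] && rs[i].src_tag[s] == tag)
                rs[i].src_ready[s] = true;
    }
}

/* Select: pick the first busy entry whose sources are both ready and free it.
 * Returns its index, or -1 if nothing can launch this cycle. */
int rs_select(RsEntry rs[], int n) {
    for (int i = 0; i < n; i++) {
        if (rs[i].busy && rs[i].src_ready[0] && rs[i].src_ready[1]) {
            rs[i].busy = false;
            return i;
        }
    }
    return -1;
}

/* One scheduling cycle. The destination tag is broadcast in the same cycle
 * the producer is selected, i.e. one cycle before its result exists, so a
 * dependent entry can be selected in the very next cycle and meet the data
 * at the ALU input. */
int rs_cycle(RsEntry rs[], int n) {
    int i = rs_select(rs, n);
    if (i >= 0) rs_wakeup(rs, n, rs[i].dest_tag);
    return i;
}
```

In this model a producer and its consumer launch in back-to-back cycles; if the tag were broadcast only together with the data, the consumer would launch one cycle later.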

8 Lane

Intel's Haswell
Latest Intel microarchitecture
– 22nm process
– 4-wide OoO processor
– x86
An evolution, not a revolution
– Very similar to architectures from the last 8 years.

Intel

Basics
Converts x86 instructions into micro-ops
– RISC-like instructions
– Even more basic than RISC in some cases
Loads and stores generally turn into two instructions
– Address compute and memory access

What's interesting?
– Seeing how things have changed compared to previous microarchitectures
– Transactional support
– Power issues

The three recent frontends

Buffer sizes
– 192 RoB entries
– 60 RS
– 72 loads
– 42 stores

Other key features
Transactional synchronization
– Execute the lock-protected section
– Don't acquire the lock
– If someone else is doing the same thing at the same time: undo all memory accesses, then do it again with locks.
Why?
New sleep states
– More like handheld devices.

Microarchitecture and performance

    /* N and counter are defined elsewhere; they are not shown on the slide. */
    void tightloop() {
        unsigned j;
        for (j = 0; j < N; ++j)
            counter += j;
    }

    void foo() { }

    void loop_with_extra_call() {
        unsigned j;
        for (j = 0; j < N; ++j) {
            __asm__("call foo");   /* extra call on every iteration */
            counter += j;
        }
    }

tightloop() runs in 0.68 sec
loop_with_extra_call() runs in 0.60 sec
Why?

    :
    :        xor    %eax,%eax
    :        nopw   0x0(%rax,%rax,1)
    :        mov    0x200b01(%rip),%rdx    #
    f:       add    %rax,%rdx
    :        add    $0x1,%rax
    :        cmp    $0x17d78400,%rax
    40054c:  mov    %rdx,0x200aed(%rip)    #
    :        jne
    :        repz retq
    :        nopw   0x0(%rax,%rax,1)

    :
    :        repz retq

    :
    :        xor    %eax,%eax
    :        nopw   0x0(%rax,%rax,1)
    :        callq
    d:       mov    0x200abc(%rip),%rdx    #
    :        add    %rax,%rdx
    :        add    $0x1,%rax
    40058b:  cmp    $0x17d78400,%rax
    :        mov    %rdx,0x200aa8(%rip)    #
    :        jne
    a:       repz retq
    40059c:  nopl   0x0(%rax)