AMD Opteron Overview Michael Trotter (mjt5v) Tim Kang (tjk2n) Jeff Barbieri (jjb3v)

Introduction AMD Opteron – This overview focuses on Barcelona, AMD's 65nm quad-core CPU

Fetch Fetches 32B from the L1 instruction cache into the pre-decode/pick buffer
– Because x86 instructions are variable-length, Barcelona uses pre-decode information to mark where each instruction ends, simplifying decode

Inst. Decode The instruction cache contains a pre-decoder which scans 4B of the instruction stream each cycle
– Pre-decode information is stored, alongside each line of instructions, in the ECC bits of the L1I, L2 and L3 caches
Instructions are then passed through the sideband stack optimizer
– x86 includes instructions that directly manipulate each thread's stack
– AMD introduced the sideband stack optimizer to remove these stack-pointer manipulations from the instruction stream
– Thus, many stack operations can be processed in parallel, freeing up the reservation stations, re-order buffer, and regular ALUs for other work
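The idea behind the sideband stack optimizer can be sketched in a few lines: rather than issuing an ALU micro-op to adjust the stack pointer for every push/pop, a small tracker accumulates the running rsp delta so each stack access can resolve its offset immediately. This is a minimal sketch of the concept only; the micro-op format and function names are illustrative assumptions, not AMD's internal representation.

```python
# Sketch: precompute each stack op's rsp offset so pushes/pops need not
# serialize on explicit stack-pointer arithmetic.
def optimize_stack_ops(uops, word_size=8):
    """Replace serial rsp adjustments with precomputed rsp offsets."""
    delta = 0
    out = []
    for op in uops:
        if op == "push":
            delta -= word_size
            out.append(("store", f"rsp{delta:+d}"))   # store at rsp + delta
        elif op == "pop":
            out.append(("load", f"rsp{delta:+d}"))    # load from rsp + delta
            delta += word_size
        else:
            # A non-stack op: materialize the accumulated delta once
            if delta:
                out.append(("add_rsp", delta))
                delta = 0
            out.append((op,))
    if delta:
        out.append(("add_rsp", delta))
    return out
```

With this transformation, a push/push/pop/pop sequence becomes four independent memory micro-ops with known offsets, so they no longer form a dependency chain through rsp.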

Branch Prediction A branch selector chooses between a bi-modal predictor and a global predictor
– The bi-modal predictor and branch selector are both stored as pre-decode information in the ECC bits of the instruction cache
– The global predictor combines the relative instruction pointer (RIP) of a conditional branch with a global history register
Tracks the last 12 branches with a 16K-entry prediction table of 2-bit saturating counters
– The branch target address calculator (BTAC) checks the targets of relative branches
Can correct mis-predictions with a two-cycle penalty
Barcelona uses an indirect predictor
– Specifically designed to handle branches with multiple targets (e.g. switch or case statements)
The return address stack has 24 entries
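The global predictor described above can be sketched as a gshare-style structure: a 12-bit history register combined with the branch RIP indexes a 16K-entry table of 2-bit saturating counters. The XOR hash and update policy here are common textbook choices, assumed for illustration; AMD's exact indexing function is not given in the slides.

```python
# Sketch of a global branch predictor: 12-bit history, 16K-entry table
# of 2-bit saturating counters, RIP XOR history as the index.
TABLE_SIZE = 16 * 1024   # 16K entries
HISTORY_BITS = 12        # tracks the last 12 branch outcomes

class GlobalPredictor:
    def __init__(self):
        self.table = [1] * TABLE_SIZE   # counters start weakly not-taken
        self.history = 0                # global history register

    def _index(self, rip):
        # Combine the branch RIP with global history (assumed XOR hash)
        return (rip ^ self.history) % TABLE_SIZE

    def predict(self, rip):
        return self.table[self._index(rip)] >= 2   # >= 2 means "taken"

    def update(self, rip, taken):
        i = self._index(rip)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0
        # Shift the outcome into the 12-bit history register
        self.history = ((self.history << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)
```

After the predictor has seen a stable pattern long enough for the history register and counters to warm up, it predicts repeating branches correctly.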

Pipeline Uses a 12-stage pipeline

OO (ROB) The Pack Buffer (post-decoding buffer) sends groups of 3 micro-ops to the re-order buffer (ROB)
– The re-order buffer contains 24 entries, with 3 lanes per entry, holding a total of 72 instructions
– Instructions can be moved between lanes to avoid a congested reservation station or to observe issue restrictions
From the ROB, instructions issue to the appropriate scheduler
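The 24-entry, 3-lane organization above can be sketched as a FIFO of dispatch lines: each line holds up to 3 micro-ops, and the oldest line retires only when all of its lanes have completed. Field names and the completion interface are illustrative assumptions; the lane-to-scheduler binding is simplified away.

```python
# Sketch of a 24-line x 3-lane reorder buffer: dispatch in groups of
# up to 3 micro-ops, retire in order one whole line at a time.
from collections import deque

ROB_LINES, LANES = 24, 3          # 24 entries x 3 lanes = 72 micro-ops

class ReorderBuffer:
    def __init__(self):
        self.lines = deque()      # FIFO of dispatch lines (oldest first)

    def dispatch(self, group):
        """Insert up to 3 micro-ops as one ROB line (short groups padded)."""
        assert len(group) <= LANES
        if len(self.lines) == ROB_LINES:
            raise RuntimeError("ROB full: stall dispatch")
        padded = list(group) + [None] * (LANES - len(group))
        self.lines.append([{"uop": u, "done": u is None} for u in padded])

    def complete(self, uop):
        for line in self.lines:
            for slot in line:
                if slot["uop"] == uop:
                    slot["done"] = True

    def retire(self):
        """Retire the oldest line only when every lane in it has completed."""
        if self.lines and all(s["done"] for s in self.lines[0]):
            return [s["uop"] for s in self.lines.popleft() if s["uop"]]
        return []
```

Retiring a whole line at once keeps retirement in program order even though the micro-ops in younger lines may have completed earlier.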

ROB

Integer Future File and Register File (IFFRF) The IFFRF contains 40 registers broken into three distinct sets
– The Architectural Register File contains 16 64-bit non-speculative registers
Instructions may modify the Architectural Register File only when they commit
– Speculative instructions read from and write to the Future File
Contains the most recent speculative state of the 16 architectural registers
– The last 8 registers are scratchpad registers used by microcode
Should a branch mis-prediction or an exception occur, the pipeline rolls back and the Architectural Register File overwrites the contents of the Future File
There are three reservation stations, i.e. schedulers, within the integer cluster
– Each station is tied to a specific lane in the ROB and holds 8 instructions
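The future-file scheme above can be sketched directly: speculative results go to the Future File, commits update the Architectural Register File, and recovery copies the architectural state back over the Future File. The register count comes from the slide; the method names are illustrative assumptions.

```python
# Sketch of the future-file recovery scheme: speculative state in one
# file, committed state in another, rollback by copying committed state.
NUM_ARCH_REGS = 16

class IntegerRegisterFiles:
    def __init__(self):
        self.arch = [0] * NUM_ARCH_REGS     # committed, non-speculative state
        self.future = [0] * NUM_ARCH_REGS   # most recent speculative state

    def spec_write(self, reg, value):
        self.future[reg] = value            # speculative results land here

    def spec_read(self, reg):
        return self.future[reg]             # consumers see the newest values

    def commit(self, reg, value):
        self.arch[reg] = value              # updated only at retirement

    def rollback(self):
        # Branch mispredict or exception: the architectural file
        # overwrites the Future File, discarding speculative state.
        self.future = list(self.arch)
```

The appeal of this design is that recovery is a bulk copy rather than a walk of in-flight instructions, at the cost of keeping two copies of the register state.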

Integer Execution Barcelona uses three symmetric ALUs, each of which can execute almost any integer instruction
– Three full-featured ALUs require more die area and power
– But they provide higher performance for certain edge cases and enable a simpler design for the ROB and schedulers

Floating Point Execution Floating-point operations are first sent to the FP mapper and renamer
– In the renamer, up to 3 FP instructions per cycle are assigned a destination register from the 120 FP register file entries
Once the micro-ops have been renamed, they may be issued to the three FP schedulers
– Operands can be obtained from either the FP register file or the forwarding network

Floating Point Execution (SIMD) The FPUs are 128 bits wide, so Streaming SIMD Extension (SSE) instructions can execute in a single pass
– Similarly, the load-store units and the FMISC unit handle 128-bit wide data to improve SSE performance

Memory Overview

Memory Hierarchy Four separate 128KB 2-way set associative L1 caches (one per core)
– Latency = 3 cycles
– Write-back to L2
– The data paths to and from the L1D cache were also widened to 256 bits (128 bits transmit and 128 bits receive)
Four separate 512KB 16-way set associative L2 caches (one per core)
– Latency = 12 cycles
– Line size is 64B

L3 Cache Shared 2MB 32-way set associative L3
– Latency = 38 cycles
– Uses 64B lines
The L3 cache was designed with data sharing in mind
– When a requested line is likely to be shared, a copy remains in the L3
– This leads to duplication that would not happen in an exclusive hierarchy
– In the past, a pseudo-LRU algorithm would simply evict the oldest line; Barcelona's L3 replacement algorithm instead prefers to evict unshared lines
Access to the L3 must be arbitrated, since the L3 is shared between four cores
– A round-robin algorithm grants access to one of the four cores each cycle
Each core has 8 data prefetchers (a total of 32 per device)
– They fill the L1D cache
– Each can have up to 2 outstanding fetches to any address
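The sharing-aware replacement idea can be captured in a tiny sketch: among the candidate lines in a set, evict an unshared line if one exists, falling back to the oldest line otherwise. Real hardware approximates this with pseudo-LRU state bits; the linear scan and the `age`/`shared` fields here are illustrative simplifications.

```python
# Sketch of sharing-aware victim selection: prefer evicting lines that
# are not shared between cores; otherwise evict the oldest line.
def choose_victim(lines):
    """lines: list of dicts with 'age' (higher = older) and 'shared'."""
    unshared = [l for l in lines if not l["shared"]]
    candidates = unshared if unshared else lines
    return max(candidates, key=lambda l: l["age"])
```

Under this policy a young unshared line is evicted before an old shared one, which keeps data that multiple cores are using resident in the shared L3.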

Memory Controllers Each memory controller supports independent 64B transactions
– The integrated DDR2 memory controller ensures that an L3 cache miss is resolved in less than 60 nanoseconds

TLB Barcelona offers non-speculative memory access re-ordering in the form of Load-Store Units (LSUs)
– Thus, some memory operations can be issued out-of-order
In the 12-entry LSU1, the oldest operations translate their addresses from the virtual to the physical address space using the L1 DTLB
During this translation, the lower 12 bits of the load operation's address are compared against previously stored addresses
– If they differ, the load proceeds ahead of the store
– If they match, load-store forwarding occurs
Should a miss in the L1 DTLB occur, the L2 DTLB is checked
– Once a load or store has located its address in the cache, the operation moves on to LSU2
LSU2 holds up to 32 memory accesses, where they stay until they are removed
– LSU2 handles any cache or TLB misses via scheduling and probing
– In the case of a cache miss, LSU2 looks in the L2, the L3 and then memory
– In the case of a TLB miss, it looks in the L2 TLB and then main memory
– LSU2 also holds store instructions, which are not allowed to modify the caches until retirement, to ensure correctness
– Thus, LSU2 contains the majority of the complexity in the memory pipeline
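The 12-bit comparison described above works because the low 12 bits of an address are the page offset, which is unchanged by virtual-to-physical translation, so the check can run before the TLB lookup completes. This sketch shows the decision only; the queue format and return convention are illustrative assumptions, and a real LSU would also check access sizes for partial overlap.

```python
# Sketch of partial-address load-store disambiguation: compare only the
# translation-invariant low 12 bits of the addresses.
PAGE_OFFSET_MASK = (1 << 12) - 1    # page offset survives translation

def resolve_load(load_addr, older_stores):
    """older_stores: list of (addr, data) pairs, oldest first."""
    forwarded = None
    for store_addr, data in older_stores:
        if (store_addr & PAGE_OFFSET_MASK) == (load_addr & PAGE_OFFSET_MASK):
            forwarded = data        # possible overlap: forward newest match
    if forwarded is not None:
        return ("forward", forwarded)
    return ("bypass", None)         # proven disjoint: load may go ahead
```

Note the conservatism: two different pages can share the same low 12 bits, so a match only *might* be a real overlap, but a mismatch is a guarantee of no overlap, which is what makes the early re-ordering safe.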

HyperTransport Barcelona has four HyperTransport 3.0 links for inter-processor communication and I/O devices
– HyperTransport 3.0 adds a feature called 'unganging', or lane-splitting
– Each HT 3.0 link is composed of two 16-bit lanes (one in each direction)
Each lane can be split into a pair of independent 8-bit wide links

Shanghai The latest model in the Opteron series
Several improvements over Barcelona:
– 45nm process
– 6MB L3 cache
– Improved clock speeds
– A host of other improvements