Understanding The Nehalem Core
Note: The examples herein are mostly illustrative. They have shortcomings compared to the real implementation in favour of easier visibility.

Presentation transcript:

Understanding The Nehalem Core Note: The examples herein are mostly illustrative. They have shortcomings compared to the real implementation in favour of easier visibility.

April Agenda
1. Pipelines
2. Branch Prediction
3. Superscalar execution
4. Out-Of-Order execution

Understanding CPUs Optimizing performance is closely tied to understanding what a CPU actually does. VTune, Thread Profiler and other tools are mere helpers for understanding the operation of the CPU. The following gives some illustrative ideas of how the most important concepts of modern CPU design work. This does not in every case correspond to the actual implementation, but rather outlines some basic ideas.

Architecture Block Diagram Disclaimer: This block diagram is for example purposes only. Significant hardware blocks have been arranged or omitted for clarity.
[Block diagram: Branch Target Buffer, Microcode Sequencer, Register Allocation Table (RAT), 32 KB Instruction Cache, Next IP, Instruction Decode (4 issue), Fetch/Decode, Retire, Re-Order Buffer (ROB, 128 entries), IA Register Set, Bus Unit to L2 Cache/Memory, Reservation Stations (RS, 32 entries), Scheduler/Dispatch Ports, 32 KB Data Cache, execute ports (FP Add, FP Div/Mul, SIMD Integer Arithmetic, Integer Shift/Rotate), Memory Order Buffer (MOB) with Load, Store Address and Store Data ports]
This is too complicated for the beginning! The best way to understand it is to construct the CPU yourself.

Our CPU Construction Let's assume nothing for the moment – the CPU is just a black box with data coming in and data going out.

Modern Processor Technology In order to make use of modern CPUs, some concepts must be clear:
1. Pipelines
2. Prediction
3. Superscalarity
4. Out-Of-Order Execution
5. Vectorization
We will go through these concepts step by step and connect them to indicators of sub-optimal performance on the Nehalem architecture.

Pipelines – What a CPU really understands … A CPU distinguishes between different concepts:
Instructions – what to do with the arguments
Arguments – usually registers that contain numbers or addresses
A register can be thought of as a piece of memory directly in the CPU. It is as big as the architecture is wide (64 bit). Specialized registers can be wider (SSE registers: 128 bit). The ALU can only operate on registers!

Pipelines – What a CPU really understands …
d 63 db 4a 8d 04 9f 4f 8d 1c 9a 0f 28 c ff f ed 85 f6
movslq %r11d,%r11
lea (%rdi,%r11,4),%rax
lea (%r10,%r11,4),%r11
movaps %xmm0,%xmm1
xor %r15d,%r15d
xor %r14d,%r14d
xor %r13d,%r13d
test %esi,%esi
This is the output of "objdump -d a.out" for some binary a.out. Try it yourself …
1. FETCH: get a binary number from memory
2. DECODE: translate the binary number into a meaningful instruction

Pipelines – Execution and Memory Access So we can fetch the next number and translate it into a meaningful instruction. What now? If the instruction is arithmetic, just execute it, e.g. mul %r14d,%r15d (multiply register 14 with register 15). If it is a memory reference, load the data, e.g. a mov from memory into %r15d. Some memory references involve computation (arithmetic): lea (%r10,%r11,4),%r11
3. EXECUTE: execute the instruction
4. MEMORY ACCESS: load and store data from/to memory; the address may be computed in step 3.
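The address arithmetic hidden inside an instruction like lea can be sketched in a few lines. This is a minimal illustration, not hardware: the function name and example register values are made up, but the base + index * scale formula is exactly what the x86 addressing mode computes.

```python
# Sketch of the address arithmetic behind "lea (%r10,%r11,4),%r11":
# effective address = base + index * scale (+ optional displacement).
def effective_address(base, index, scale, disp=0):
    assert scale in (1, 2, 4, 8)  # x86 allows only these scale factors
    return base + index * scale + disp

# Hypothetical values: base register %r10 = 0x1000, index %r11 = 3,
# scale 4 (e.g. the element size of a 32-bit array):
addr = effective_address(0x1000, 3, 4)
print(hex(addr))  # 0x100c
```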

Pipelines – Write Back This is the final step of our primitive pipeline. Once the result is computed or loaded, write it into the register specified, e.g. mul %r14d,%r15d puts the result into %r15d. The need for this step is not immediately clear, but will become important later on …
5. WRITE BACK: save the result as specified

Our CPU construction – functional parts
1. Fetch a new instruction
2. Decode instruction
3. Execution
4. Memory access
5. Write register data
Let's assume nothing for the moment – the CPU is just a black box with data coming in and data going out.

… a CPU with 5 pipeline stages: Instruction Fetch – Instruction Decode – Execute – Memory Access – Write Back

Pipelining 1 – No Pipelining [Diagram: stages F D E M W; each instruction occupies the whole pipeline for time T_p]
Execution time = N instructions * T_p

Processor Technology Pipelining 1 – No Pipelining The concept of a non-pipelining CPU is very inefficient! While, e.g., some comparison is performed in the ALU, the other functional units are running idle. With very little effort we can use all functional units at the same time and increase the performance tremendously. Pipelining lets different commands use different functional units at the same time. There are problems connected to pipelined operation; we will address them later.

Pipelining 2 – Pipelining [Diagram: stages F D E M W; a new instruction enters the pipeline every stage time T_s]
Execution time = N instructions * T_s

Our Core2 Construction [Diagram: Instruction Fetch – Instruction Decode – ALU – Memory Access – Write Registers]

Pipelining 3 – Pipeline Stalls Example code:
c=a/b
d=c/b
e=c/d
f=e/d
g=f/e
Pipeline stalls can have two reasons – competition for resources and data dependencies.
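The data dependencies in this example can be made explicit with a small check. This is a sketch, not a real hazard-detection unit: it only looks for read-after-write dependencies between consecutive instructions, which is exactly what forces the stalls in the slide's code.

```python
# Sketch: detect read-after-write (RAW) dependencies in the slide's example –
# each line reads a value that the previous line has not yet written back.
code = [("c", ("a", "b")),   # c = a / b
        ("d", ("c", "b")),   # d = c / b  -> needs c from the previous line
        ("e", ("c", "d")),   # e = c / d
        ("f", ("e", "d")),   # f = e / d
        ("g", ("f", "e"))]   # g = f / e

for i, (dst, srcs) in enumerate(code[1:], start=1):
    prev_dst = code[i - 1][0]
    if prev_dst in srcs:
        print(f"instr {i} reads {prev_dst!r} written by instr {i-1}: stall")
```

Every consecutive pair in this example has such a dependency, so the pipeline can never overlap two of these divisions.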

Pipelining 4 – Branches
CMP: compare two values and write the result into flag F
JZ: jump if flag F is 0
Branches are a considerable obstacle for performance!

Processor Technology Avoiding Pipeline Stalls – Branch Prediction Pipeline stalls due to branches can be avoided by introducing a functional unit that "guesses" the next branch. [State diagram: a two-bit counter; states 00 and 01 predict "no jump", states 10 and 11 predict "jump"; a taken branch moves the counter up, a non-taken branch moves it down]
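The four-state diagram above is the classic two-bit saturating counter, which can be sketched directly. The class name and the sample outcome history are made up for illustration; the counter logic follows the slide's state machine.

```python
# Sketch of a 2-bit saturating-counter branch predictor:
# counter 0/1 -> predict "no jump", counter 2/3 -> predict "jump".
class TwoBitPredictor:
    def __init__(self):
        self.counter = 0                 # start at "strongly not taken"

    def predict(self):
        return self.counter >= 2         # True means: predict "jump"

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)   # saturate at 3
        else:
            self.counter = max(0, self.counter - 1)   # saturate at 0

# A loop-like branch: mostly taken, with one exit in the middle.
history = [True, True, True, True, True, False, True, True, True, True]
p = TwoBitPredictor()
hits = 0
for taken in history:
    hits += (p.predict() == taken)
    p.update(taken)
print(f"{hits}/{len(history)} predicted correctly")   # 7/10
```

The two hysteresis states are the point: a single mispredicted iteration (the loop exit) flips the prediction only after two misses, so the predictor stays right for the loop body.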

Pipelining 4 – Branch Prediction
CMP: compare two values and write the result into flag F
JZ: jump if flag F is 0
Branch prediction units can predict with a probability of >>90%. On a wrongly predicted branch the pipeline needs to be flushed! (The guesses come from the Branch Target Buffer.)

Our CPU Construction [Diagram: Instruction Fetch – Decode – ALU – Memory Access – Write Registers, plus a Branch Target Buffer]

Superscalar Processors 1 The superscalar concept is an extension of the pipeline concept. Pipelining allows for "hiding" the latency of an instruction. Pipelining ideally achieves 1 instruction per clock cycle. Higher rates can only be achieved if more than one instruction can be executed at the same time.
1. Superscalar architecture – Xeon
2. Other idea: Very Long Instruction Word (VLIW) – EPIC/Itanium

Superscalar Processors 2 [Diagram: Fetch Instruction + Decode – Dispatch – parallel ALU / Memory Access pipelines – Write Register]
Execution time = N instructions * T_s / N pipelines
The problems of the pipeline remain!
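The superscalar formula is a one-line extension of the earlier pipelined timing. Again purely illustrative, with an arbitrary T_s; the pipeline-fill time is ignored here, as in the slide's formula.

```python
# Sketch of the slide's superscalar formula:
# with N_pipelines parallel pipelines the ideal execution time shrinks
# to N_instructions * T_s / N_pipelines.
T_S = 1.0

def time_superscalar(n_instructions, n_pipelines):
    return n_instructions * T_S / n_pipelines

print(time_superscalar(1000, 1))  # 1000.0 (plain pipeline)
print(time_superscalar(1000, 2))  # 500.0  ("2x at best", as the next slide puts it)
```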

Processor Technology Superscalar Processors 3 The superscalar concept dramatically improves performance (2x at best). The different pipelines don't need to have the same functionalities. The superscalar concept can be extended easily.

Our CPU Construction [Diagram: Instruction Fetch + Decode – Dispatch – two ALUs / Memory Access – Write Registers, plus Branch Target Buffer and Read Registers stage]

Processor Technology Out-Of-Order Execution Speculative computing provides an important way of reducing pipeline stalls in the case of branches. Still, data dependencies cause pipeline stalls. There is no good reason why the CPU shouldn't execute instructions which don't have dependencies, or for which the dependencies are already computed. Out-Of-Order execution is a way to minimize pipeline stalls by reordering the instructions at entry to the pipeline and restoring the original order when exiting the pipeline.

Processor Technology Out-Of-Order Execution [Pipeline diagram with an Instruction Queue at entry and a Reorder Buffer at exit]
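The idea of reordering at entry and restoring order at exit can be sketched with a toy scheduler. Everything here is a simplification made up for illustration – the latencies, the one-issue-per-cycle rule and the register names are arbitrary – but it shows the core mechanism: an independent instruction overtakes one that is waiting for its inputs, while retirement still follows program order.

```python
# Sketch: out-of-order issue with in-order retirement (reorder buffer idea).
# The divide is slow, so the independent instruction 2 issues before
# instruction 1, which must wait for c.
program = [("c", ("a", "b"), 4),   # c = a / b   (4-cycle divide)
           ("d", ("c", "b"), 4),   # needs c -> must wait
           ("x", ("a", "a"), 1),   # independent -> can overtake
           ("e", ("c", "d"), 4)]   # needs c and d

ready_at = {"a": 0, "b": 0}        # cycle at which each register value is ready
issue_order, cycle = [], 0
pending = list(range(len(program)))

while pending:
    for i in list(pending):
        dst, srcs, latency = program[i]
        if all(s in ready_at and ready_at[s] <= cycle for s in srcs):
            ready_at[dst] = cycle + latency   # result available after latency
            issue_order.append(i)
            pending.remove(i)
            break                             # one issue per cycle in this sketch
    cycle += 1

print("issue order :", issue_order)               # [0, 2, 1, 3]
print("retire order:", [p[0] for p in program])   # program order: c, d, x, e
```

Instruction 2 executes out of order, yet the reorder buffer would still retire results in the original program order – which is what makes precise exceptions possible.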

Our Construction [Diagram: Instruction Fetch + Decode – Dispatch – two ALUs / Memory Access, with Branch Target Buffer, Read Registers, Reorder Buffer and Retire – Write Registers]

Processor Technology Register Allocation IA32 provides a very limited number of registers for local storage. This is only due to the definition in the instruction set architecture (ISA); internally there are more registers available. The CPU can internally re-assign the registers in order to make best use of the given resources. This re-allocation is tracked in the Register Allocation Table.
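The re-assignment tracked by the Register Allocation Table can be sketched as a simple renaming map. This is a toy model with made-up names ("p0", "p1", …) for the internal physical registers; the point it illustrates is that every new write gets a fresh physical register, so two writes to the same architectural register no longer depend on each other.

```python
# Sketch of register renaming with a Register Allocation Table (RAT):
# each write to an architectural register allocates a fresh physical
# register, removing false write-after-write / write-after-read hazards.
next_phys = 0
rat = {}                                        # architectural -> physical

def rename(dst, srcs):
    global next_phys
    phys_srcs = [rat.get(s, s) for s in srcs]   # read the current mappings
    rat[dst] = f"p{next_phys}"                  # fresh physical register for dst
    next_phys += 1
    return rat[dst], phys_srcs

# Both writes target %eax, but after renaming they use different physical
# registers, so the second need not wait for the first to retire.
print(rename("eax", ["ebx", "ecx"]))   # ('p0', ['ebx', 'ecx'])
print(rename("eax", ["edx", "edx"]))   # ('p1', ['edx', 'edx'])
print(rename("esi", ["eax", "ebx"]))   # ('p2', ['p1', 'ebx'])
```

Note the last line: the read of %eax is transparently redirected to p1, the most recent version – exactly the bookkeeping the RAT does in hardware.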

Our CPU Construction [Diagram: Instruction Fetch – Dispatch – two ALUs / Memory Access, with Branch Target Buffer, Register Allocation, Read Registers, Reorder Buffer and Retire – Write Registers] Now let's compare to the original...

Block Diagram [Same block diagram as before: Branch Target Buffer, Microcode Sequencer, Register Allocation Table (RAT), 32 KB Instruction Cache, Next IP, Instruction Decode (4 issue), Fetch/Decode, Retire (Write back), Re-Order Buffer (ROB, 128 entries), IA Register Set, Bus Unit to L2 Cache, Reservation Stations (RS, 32 entries), Scheduler/Dispatch Ports, 32 KB Data Cache, execute ports, Memory Order Buffer (MOB) with Load, Store Address and Store Data ports, Memory Dispatch]
Not very different, right?

Summary Nehalem can be thought of very much in the same way as standard textbooks present a CPU:
1. Instruction Fetch and Decode (Frontend)
2. Dispatch
3. Execute
4. Memory Access
5. Retire (Write Back)
If you want to dive deeper into the subject: Hennessy and Patterson, Computer Architecture – A Quantitative Approach.
