1
Understanding The Nehalem Core Note: The examples herein are mostly illustrative. They have shortcomings compared to the real implementation in favour of easier understanding.
2
Agenda: Pipelines, Branch Prediction, Superscalar Execution, Out-Of-Order Execution
3
Understanding CPUs Optimizing performance is pretty much tied to understanding what a CPU actually does. VTune, Thread Profiler and other tools are mere helpers for understanding the operation of the CPU. The following gives some illustrative ideas of how the most important concepts of modern CPU design work. It does not in every case correspond to the actual implementation, but rather outlines some basic ideas.
4
Architecture Block Diagram Disclaimer: This block diagram is for example purposes only. Significant hardware blocks have been rearranged or omitted for clarity. [Diagram: Fetch/Decode (Next IP, Branch Target Buffer, 32 KB Instruction Cache, 4-issue Instruction Decode, Microcode Sequencer, Register Allocation Table (RAT)); Scheduler/Dispatch ports (32-entry Reservation Stations (RS) issuing to FP Add, FP Div/Mul, SIMD Integer Arithmetic, Integer Shift/Rotate, Load, Store Address and Store Data units); memory side (Memory Order Buffer (MOB), 32 KB Data Cache, Bus Unit to L2 Cache/Memory); Retire (128-entry Re-Order Buffer (ROB), IA Register Set).] This is too complicated for the beginning! The best way to understand it is to construct the CPU yourself.
5
Our CPU Construction [Diagram: a black box labelled CPU, with data flowing in and data flowing out.] Let's assume nothing for the moment – the CPU is just a black box with data coming in and data going out.
6
Modern Processor Technology In order to make use of modern CPUs, some concepts must be clear:
1. Pipelines
2. Prediction
3. Superscalarity
4. Out-Of-Order Execution
5. Vectorization
We will go step by step through these concepts and connect them to indicators of sub-optimal performance on the Nehalem architecture.
7
Pipelines – What a CPU really understands … A CPU distinguishes between different concepts: Instructions – what to do with the arguments. Arguments – usually registers that contain numbers or addresses. A register can be thought of as a piece of memory directly in the CPU. It is as big as the architecture is wide (64 bit); specialized registers can be wider (SSE registers: 128 bit). The ALU can only operate on registers!
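To make the register idea concrete, here is a minimal C sketch (function and variable names are made up); the assembly in the comment is roughly what a typical x86-64 compiler emits, though exact output varies with compiler and flags:

/* One C statement and (roughly) the register operations it becomes. */
long scale(long a, long b) {
    return a * b + 1;   /* mov  %rdi,%rax   ; argument a arrives in register %rdi
                           imul %rsi,%rax   ; multiply by b (in %rsi)
                           add  $1,%rax     ; the ALU works on registers only
                           ret              ; result is returned in %rax */
}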
8
Pipelines – What a CPU really understands …

binary machine code                hex bytes     disassembly
010011010110001111011011           4d 63 db      movslq %r11d,%r11
01001010100011010000010010011111   4a 8d 04 9f   lea (%rdi,%r11,4),%rax
01001111100011010001110010011010   4f 8d 1c 9a   lea (%r10,%r11,4),%r11
000011110010100011001000           0f 28 c8      movaps %xmm0,%xmm1
010001010011001111111111           45 33 ff      xor %r15d,%r15d
010001010011001111110110           45 33 f6      xor %r14d,%r14d
010001010011001111101101           45 33 ed      xor %r13d,%r13d
1000010111110110                   85 f6         test %esi,%esi

This is the output of "objdump -d a.out" for some binary a.out. Try it yourself … 1. FETCH: get a binary number from memory. 2. DECODE: translate the binary number into a meaningful instruction.
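To reproduce such output yourself, here is a minimal sketch (file and function names are hypothetical):

/* fetchdecode.c - compile and disassemble it:
 *     gcc -O2 -c fetchdecode.c
 *     objdump -d fetchdecode.o
 * The middle column of objdump's output shows the raw bytes the CPU
 * fetches; the right column shows the decoded instructions. */
int sum(const int *v, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];          /* becomes add/cmp/jne style instructions */
    return s;
}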
9
Pipelines – Execution and Memory Access So we can fetch the next number and translate it into a meaningful instruction. What now? If the instruction is arithmetic, just execute it, e.g. imul %r14d,%r15d (multiply register %r15d by %r14d). If it is a memory reference, load the data, e.g. a mov from memory into %r15d. Some memory references involve computation (arithmetic), e.g. lea (%r10,%r11,4),%r11 (computes %r10 + 4·%r11 and writes it to %r11). 3. EXECUTE: execute the instruction. 4. MEMORY ACCESS: load and store data from/to memory; the address may be computed in step 3.
10
Pipelines – Write back This is the final step of our primitive pipeline. Once the result is computed or loaded, write it into the register specified, e.g. imul %r14d,%r15d puts the result into %r15d. The need for this step is not immediately clear, but it will become important later on … 5. WRITE BACK: save the result as specified.
11
Our CPU construction – functional parts 1. Fetch a new instruction 2. Decode instruction 3. Execution 4. Memory access 5. Write register [Diagram: the black box from before, now filled with these five functional units; data still flows in and out.]
12
… a CPU with 5 pipeline stages [Diagram: instructions and data flow through Instr. Fetch → Instr. Decode → Execute → Memory Access → Write Back.]
13
Pipelining 1 – No Pipelining [Diagram: instructions 1…5 pass through the stages F (Fetch), D (Decode), E (Execute), M (Memory Access), W (Write Back) one after another; a new instruction only starts once the previous one has left the last stage.] Execution time = N instructions × T_p, where T_p is the time one instruction needs for all five stages.
14
Processor Technology Pipelining 1 – No Pipelining The concept of a non-pipelined CPU is very inefficient! While, e.g., some comparison is performed in the ALU, the other functional units are running idle. With very little effort we can use all functional units at the same time and increase performance tremendously. Pipelining lets different instructions use different functional units at the same time. There are problems connected to pipelined operation; we will address them later.
15
Pipelining 2 - Pipelining [Diagram: instructions 1…5 in the stages F, D, E, M, W, staggered by one stage each; every cycle a new instruction enters the pipeline while the previous ones move on.] Execution time ≈ N instructions × T_s, where T_s is the time for a single pipeline stage.
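To put assumed numbers on the two formulas: take N = 1000 instructions and five stages of T_s = 1 ns each, so T_p = 5 ns. Without pipelining the run takes 1000 × 5 ns = 5000 ns. With pipelining the first instruction needs 5 ns to fill the pipeline, and afterwards one instruction completes every nanosecond: (5 + 999) × 1 ns ≈ 1000 ns. The speedup approaches the number of pipeline stages.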
16
Our Core2 Construction [Diagram: Instr. Fetch → Instr. Decode → ALU → Memory Access → Write Registers, with instructions and data flowing in and out.]
17
Pipelining 3 – Pipeline Stalls [Diagram: pipeline timing with stages F, D, E, M, W; dependent instructions must wait (R) until their operands are ready.] Example code: … c=a/b d=c/b e=c/d f=e/d g=f/e … Pipeline stalls can have two causes – competition for resources and data dependencies.
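Written as a C sketch (variable names taken from the slide), the example is one long dependency chain – no two of the divisions can overlap in the pipeline:

/* Every statement consumes the result of an earlier one, so each
 * division has to wait for the previous one to leave the pipeline. */
double dependent(double a, double b) {
    double c = a / b;
    double d = c / b;   /* needs c */
    double e = c / d;   /* needs c and d */
    double f = e / d;   /* needs e and d */
    double g = f / e;   /* needs f and e */
    return g;
}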
18
Pipelining 4 – Branches [Diagram: a CMP instruction (compare two values and write the result into flag F) is followed by JZ 5 (jump to instruction 5 if flag F is 0); the instructions after the branch cannot be fetched safely until the branch is resolved.] Branches are a considerable obstacle for performance!
19
Processor Technology Avoiding Pipeline Stalls – Branch Prediction Pipeline stalls due to branches can be avoided by introducing a functional unit that "guesses" the next branch. [Diagram: 2-bit saturating counter with four states – 00 (no jump predicted), 01 (no jump predicted, weakly), 10 (jump predicted, weakly), 11 (jump predicted); every taken jump moves the state towards 11, every not-taken jump towards 00.]
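A minimal C sketch of this two-bit saturating counter (state encoding as in the diagram; the names are made up):

#include <stdbool.h>

/* States: 0 = no jump (strong), 1 = no jump (weak),
 *         2 = jump (weak),      3 = jump (strong).   */
typedef struct { unsigned state; } bpred;

bool predict_jump(const bpred *p) {
    return p->state >= 2;                    /* states 10 and 11 predict a jump */
}

void train(bpred *p, bool jumped) {
    if (jumped  && p->state < 3) p->state++; /* move towards "jump" */
    if (!jumped && p->state > 0) p->state--; /* move towards "no jump" */
}

For a loop branch that is taken many times in a row, the counter saturates at 11, and only the final, fall-through iteration is mispredicted.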
20
Pipelining 4 – Branch Prediction [Diagram: the CMP/JZ pipeline from before, now with a Branch Target Buffer feeding the fetch stage.] Branch prediction units can predict with a probability of >>90%. On a wrongly predicted branch the pipeline must be flushed!
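One way to observe the flush penalty yourself is to run the same branchy loop over random and over sorted data – the work is identical, only the predictability of the branch changes. This is a sketch under the assumption that the compiler keeps the branch (at high optimization levels it may emit a branch-free conditional move, which hides the effect):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sum all elements above a threshold; the if() is the branch of interest. */
static long sum_above(const int *v, int n, int threshold) {
    long s = 0;
    for (int i = 0; i < n; i++)
        if (v[i] > threshold)   /* unpredictable on random data, predictable on sorted */
            s += v[i];
    return s;
}

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    enum { N = 1 << 20 };
    static int v[N];
    for (int i = 0; i < N; i++) v[i] = rand() % 256;

    clock_t t0 = clock();
    long r1 = sum_above(v, N, 128);      /* random order: many mispredictions */
    clock_t t1 = clock();
    qsort(v, N, sizeof v[0], cmp_int);
    clock_t t2 = clock();
    long r2 = sum_above(v, N, 128);      /* sorted: the branch is predictable */
    clock_t t3 = clock();

    printf("random %ld (%ld ticks), sorted %ld (%ld ticks)\n",
           r1, (long)(t1 - t0), r2, (long)(t3 - t2));
    return 0;
}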
21
Our CPU Construction [Diagram: Instr. Fetch → Decode → ALU → Memory Access → Write Registers, now with a Branch Target Buffer attached to the fetch stage.]
22
Superscalar Processors 1 The superscalar concept is an extension of the pipeline concept. Pipelining allows for "hiding" the latency of an instruction. Pipelining ideally achieves 1 instruction per clock cycle. Higher rates can only be achieved if more than one instruction can be executed at the same time. 1. Superscalar architecture – Xeon 2. Other idea: Very Long Instruction Word (VLIW) – EPIC/Itanium
23
Superscalar Processors 2 [Diagram: a shared Fetch Instruction + Decode stage feeds a Dispatch stage, which issues instructions 1–4 into two parallel pipelines of ALU, Memory Access and Write Register stages.] Execution time ≈ N instructions × T_s / N_pipelines. The problems of the pipeline remain!
24
Processor Technology Superscalar Processors 3 The superscalar concept dramatically improves performance (2x at best for two pipelines). The different pipelines don't need to have the same functionalities. The superscalar concept can be extended easily.
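A C sketch of what this means for code (hypothetical function names): a single accumulator forms one long dependency chain, while two accumulators give the dispatcher two independent additions that can go to two ALUs in the same cycle.

/* One chain: every add waits for the previous add. */
long sum1(const long *v, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Two chains: the adds into s0 and s1 are independent and can
 * issue to two ALUs in the same cycle. */
long sum2(const long *v, int n) {
    long s0 = 0, s1 = 0;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n) s0 += v[i];      /* odd remainder */
    return s0 + s1;
}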
25
Our CPU Construction [Diagram: Instr. Fetch + Decode → Dispatch, which feeds two ALUs; Read Registers before execution, Memory Access and Write Registers after it; the Branch Target Buffer assists the fetch stage.]
26
Processor Technology Out-Of-Order Execution Speculative computing provides an important way of reducing pipeline stalls in case of branches. Still, data dependencies cause pipeline stalls. There is no good reason why the CPU shouldn't execute instructions which don't have dependencies, or for which the dependencies are already computed. Out-of-order execution is a way to minimize pipeline stalls by reordering the instructions at entry to the pipeline and restoring the original order when exiting the pipeline.
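A toy C model of the effect (entirely illustrative – real hardware tracks this per instruction in reservation stations): the same five-instruction stream finishes earlier when instructions may start as soon as their operands are ready. The in-order model issues one instruction per cycle and stalls everything behind a waiting instruction; the out-of-order model (with an assumed unlimited number of execution slots) waits only for true dependencies.

#include <stdio.h>

#define N 5

/* Toy instruction stream: dep[i] is the index of the instruction whose
 * result instruction i needs (-1 = none); lat[i] is its execute latency. */
static const int dep[N] = { -1, 0, -1, -1, 2 };
static const int lat[N] = {  4, 1,  4,  1, 1 };

int main(void) {
    int done_in[N], done_ooo[N];

    /* In-order: each instruction issues one cycle after the previous one,
     * but must also wait for its dependency; a stall delays everything. */
    int issue = 0;
    for (int i = 0; i < N; i++) {
        int start = issue;
        if (dep[i] >= 0 && done_in[dep[i]] > start) start = done_in[dep[i]];
        done_in[i] = start + lat[i];
        issue = start + 1;              /* the next instruction issues after us */
    }

    /* Out-of-order: an instruction may start as soon as its dependency is
     * done, regardless of program order; the reorder buffer would still
     * retire the results in program order. */
    for (int i = 0; i < N; i++) {
        int start = (dep[i] >= 0) ? done_ooo[dep[i]] : 0;
        done_ooo[i] = start + lat[i];
    }

    for (int i = 0; i < N; i++)
        printf("inst %d: in-order done at %d, out-of-order done at %d\n",
               i, done_in[i], done_ooo[i]);
    return 0;
}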
27
Processor Technology Out-Of-Order Execution [Diagram: pipeline with stages F, R (read), A (ALU), M (memory), W (write back); an instruction queue feeds instructions 1–5 into the pipeline out of order, and a reorder buffer restores program order at write back.]
28
Our Construction [Diagram: Instr. Fetch + Decode → Dispatch → two ALUs with Read Registers and Memory Access; results go into a Reorder Buffer, which retires them in order (Retire – Write Registers); the Branch Target Buffer sits at the front end.]
29
Processor Technology Register Allocation IA32 provides a very limited number of registers for local storage. This is only due to the definition of the instruction set architecture (ISA); internally there are more registers available. The CPU can internally re-assign the registers in order to make best use of the given resources. This re-allocation is tracked in the Register Allocation Table.
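A minimal C sketch of the renaming idea (entirely illustrative; a real RAT does this in hardware for every decoded instruction): every write to an architectural register is given a fresh physical register, so two unrelated computations that both use the same architectural register no longer depend on each other.

#include <stdio.h>

#define ARCH_REGS 4         /* architectural registers defined by the ISA */

static int rat[ARCH_REGS];  /* register allocation table: arch -> phys */
static int next_phys = 0;   /* naive allocator for physical registers  */

/* Writing an architectural register allocates a fresh physical one. */
static int rename_dst(int arch) { return rat[arch] = next_phys++; }
/* Reading goes through the RAT to the latest physical register. */
static int rename_src(int arch) { return rat[arch]; }

int main(void) {
    for (int r = 0; r < ARCH_REGS; r++) rename_dst(r);  /* initial mapping */

    /* Two unrelated computations that both write arch register 0:
     *   r0 = r1 + r2   followed later by   r0 = r3 * r3
     * Without renaming, the second write must wait for all users of the
     * first r0; with renaming they use different physical registers. */
    int s1 = rename_src(1), s2 = rename_src(2);
    int d1 = rename_dst(0);
    printf("r0 = r1 + r2   ->   p%d = p%d + p%d\n", d1, s1, s2);

    int s3 = rename_src(3);
    int d2 = rename_dst(0);
    printf("r0 = r3 * r3   ->   p%d = p%d * p%d\n", d2, s3, s3);
    return 0;
}

Because the two writes land in different physical registers, the write-after-write and write-after-read hazards on r0 disappear.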
30
Our CPU Construction [Diagram: the complete construction – Instr. Fetch with Branch Target Buffer, Register Allocation, Dispatch, two ALUs with Read Registers and Memory Access, Reorder Buffer, Retire – Write Registers.] Now let's compare to the original …
31
Block Diagram [Diagram: the block diagram from the beginning – Fetch/Decode (Next IP, Branch Target Buffer, 32 KB Instruction Cache, 4-issue Instruction Decode, Microcode Sequencer, Register Allocation Table (RAT)); Scheduler/Dispatch ports (32-entry Reservation Stations (RS) issuing to FP Add, FP Div/Mul, SIMD Integer Arithmetic, Integer Shift/Rotate, Load, Store Address and Store Data units); memory side (Memory Order Buffer (MOB), 32 KB Data Cache, Bus Unit to L2 Cache); Retire (Write back, 128-entry Re-Order Buffer (ROB), IA Register Set).] Not very different, right?
32
Summary Nehalem can be thought of in much the same way as standard textbooks present a CPU: 1. Instruction Fetch and Decode (Frontend) 2. Dispatch 3. Execute 4. Memory Access 5. Retire (Write Back) If you want to dive deeper into the subject: Hennessy and Patterson: Computer Architecture – A Quantitative Approach
33