MAMAS – Computer Architecture Pentium® 4 Processor Based on The Microarchitecture of the Pentium® 4 Processor – ITJ Q1/2001 Hyper-Threading Technology Architecture and Microarchitecture – ITJ 02/2002 The Microarchitecture of the Intel® Pentium® 4 Processor on 90nm Technology – ITJ 02/2004 Dr. Lihu Rappoport
NetBurst® Microarchitecture: Trace Cache, Out-of-order core, Rapid Execution, Store-to-load forwarding enhancements
Pentium® 4
                 Willamette        Northwood              Prescott
Transistors      42M               55M                    125M
Process          180 nm            130 nm                 90 nm
Die size         217 mm2           145 mm2                112 mm2
Peak power       ~70 Watt          ~50 Watt               ~100 Watt
Freq             ≤ 2.0 GHz         ≤ 3.4 GHz              2.8 – 3.8 GHz
Bus              400 MHz           400/533/800 MHz        533/800 MHz
L1 cache         8KB 4-way         8KB 4-way              16KB 8-way
L2 cache         256KB             512KB                  1MB
HT               No                Yes                    Yes
Architecture     MMX, SSE, SSE2    MMX, SSE, SSE2         MMX, SSE, SSE2, SSE3
Pentium® 4 Block Diagram [figure: front end with Instruction TLB & Prefetcher, Decoder, uCode ROM, Trace Cache (12K uops), FE BTB (4K entries), TC BTB (2K entries), Uop Queue; out-of-order core with Rename/Alloc, memory and Int/FP uop queues, fast/slow/FP/memory schedulers, integer and FP register files with bypass networks, load/store AGUs, fast ALUs, complex-integer unit, Fmul/Fadd/MMX/SSE/FP-move units; L1 D-Cache (16KB, 8-way) and D-TLB; L2 Cache (1MB, 8-way, 108 GB/sec, 256-bit path); Bus Interface Unit to the quad-pumped 64-bit system bus (6.4 GB/sec)]
Block Diagram Explained: The BTB and iTLB determine the next instructions to be fetched from the L2 cache in case of a TC miss. The instruction decoder takes bytes delivered from the L2 cache and decodes them into uops. The Trace Cache caches uops from the instruction decoder, serves as the L1 instruction cache, and delivers 3 uops/clock. The micro-code ROM holds the complex micro-code sequences. The uop queue holds uops from the TC, ucode ROM, or decode logic, and decouples the FE from the OOO execution engine.
Block Diagram Explained (cont.): The Renamer maps architectural registers onto a 128-deep physical register file. The Allocator assigns all the hardware buffers in the machine that each uop needs in order to execute. The Schedulers determine when a uop is ready to execute. Execution units: up to 4 integer arithmetic operations per clock cycle, 1 floating-point operation (including 128-bit SSE) per clock cycle, and a memory load and a store operation (up to 128 bits) each clock cycle. 3.2 GB/sec system bus.
Basic Pentium® 4 Processor Pipeline
Basic P6 pipeline (intro at 733MHz, 0.18µ): 1–2 Fetch, 3–5 Decode, 6 Rename, 7 ROB Rd, 8 Rdy/Sch, 9 Dispatch, 10 Exec
Basic Pentium® 4 processor pipeline (intro at 1.4GHz, 0.18µ): 1–2 TC Nxt IP, 3–4 TC Fetch, 5 Drive, 6 Alloc, 7–8 Rename, 9 Que, 10–12 Sch, 13–14 Disp, 15–16 RF, 17 Ex, 18 Flgs, 19–20 Br Ck
Trace Cache: Decoding several IA-32 instructions per clock at high frequency is difficult: instructions have a variable length and many different options, decoding requires a significant amount of logic and intermediate state, and it takes several pipeline stages, which adds to the branch mis-prediction penalty. The decode logic takes instruction bytes from the streaming buffers and decodes them into uops. The Trace Cache caches the uops of previously decoded instructions; it bypasses the instruction decoder most of the time and provides most of the uops. Decoding is only needed for instructions that miss the TC, which allows the decoder to be simplified (it decodes 1 instruction per cycle) and reduces the branch misprediction penalty.
Trace Cache (cont.): The Trace Cache is the primary (L1) instruction cache. It holds 12K uops and is 8-way set associative with LRU replacement; its hit rate is similar to that of an 8KB–16KB conventional instruction cache. It delivers 3 uops/clock. Most uops are fetched and executed from the TC; only when the TC misses are instructions fetched and decoded from the L2 cache.
Trace Cache (cont.): Conventional instruction caches cannot provide instructions before and after a taken branch in the same cycle, so fetch bandwidth is limited to basic blocks. The TC builds traces: program-ordered sequences of uops. This allows the target of a branch to be included in the same TC line as the branch itself. [figure: a jump into a trace line and a jump out of a trace line]
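A minimal C sketch of the trace-line idea described above; the structure, field names, and sizes are illustrative assumptions, not the actual hardware layout. The point is that a line holds already-decoded uops in program order, so it can contain uops from both sides of a taken branch.

```c
#include <stdint.h>

#define UOPS_PER_TRACE_LINE 6           /* six uops per trace line (per the slide)    */

typedef struct {
    uint64_t decoded_uop;               /* an already-decoded uop, not raw x86 bytes  */
    uint64_t original_ip;               /* IP of the x86 instruction it came from     */
} Uop;

typedef struct {
    Uop      uops[UOPS_PER_TRACE_LINE]; /* program order; may cross a taken branch    */
    int      valid_count;               /* how many uops in this line are valid       */
    uint32_t next_line;                 /* index of the next trace line in the trace  */
} TraceLine;
```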
Trace Cache (cont.): Traces have variable length and are broken into trace lines, six uops per trace line; there can be many trace lines in a single trace. The TC has its own branch predictor (Trace BTB), which predicts branches that hit in the TC and directs where instruction fetching needs to go next in the TC. Prescott encodes more types of uops inside the trace cache: if an instruction uses a uop that cannot be encoded in the trace cache, all uops for the instruction have to be sequenced from the Microcode ROM, and these transitions to the Microcode ROM reduce average FE BW. New instructions that can now be encoded in the trace cache include indirect calls with a register source operand and software prefetch instructions.
Microcode ROM: Used for complex x86 instructions such as string move, and for fault and interrupt handling. When a complex instruction is encountered, the TC jumps into the microcode ROM. After the microcode ROM finishes sequencing uops for the current x86 instruction, the front end resumes fetching uops from the TC. Uops from the TC and the microcode ROM are buffered in a simple, in-order uop queue, which helps smooth the flow of uops going to the out-of-order execution engine.
iTLB and Front-End BTB: If there is a TC miss, the FE BTB predicts the next instruction IP, and the iTLB translates the next IP's linear address to a physical address; it also performs page-level protection checking. Instruction bytes are fetched from the L2 cache and placed into streaming buffers until they can be decoded into uops to be placed in the TC. Hardware instruction pre-fetching logic fetches, from the L2 cache, the x86 instruction bytes that are predicted to be executed next; it is guided by the front-end BTB and attempts to keep the instruction decoder fed with the next x86 instructions the program needs to execute.
Static Branch Prediction at Decode: If the decoder realizes that an instruction is a branch that was missed by the BTB, a static branch prediction is made; this allows a faster restart than waiting for detection at execution time. The simplest static prediction algorithm: predict a backward branch to be taken and a forward branch to be not taken; this correctly predicts taken for the first iteration of most loops. However, not all backward branches are loop-ending branches: if the distance between a backward branch and its target is larger than some threshold, the branch is unlikely to be a loop-ending branch, so a backward branch is statically predicted taken only if the branch distance is less than this threshold. In addition, branches with certain condition codes are more often not taken, regardless of their direction and distance; these conditions are not common loop-ending conditions, so the static prediction for branches with these conditions is not taken.
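A minimal sketch of this static heuristic in C. The threshold value and the notion of a "rarely taken" condition set are illustrative assumptions, not the actual Prescott parameters.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOOP_DISTANCE_THRESHOLD 4096   /* assumed threshold in bytes */

bool static_predict_taken(uint64_t branch_ip, uint64_t target_ip,
                          bool condition_rarely_ends_loops)
{
    /* Branches on conditions that rarely end loops: predicted not taken,
     * regardless of direction and distance. */
    if (condition_rarely_ends_loops)
        return false;

    /* Forward branches: predicted not taken. */
    if (target_ip > branch_ip)
        return false;

    /* Backward branches: predicted taken only if they look like a
     * loop-ending branch, i.e., the distance back to the target is short. */
    return (branch_ip - target_ip) < LOOP_DISTANCE_THRESHOLD;
}
```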
OOO Core: RAT and Alloc: The RAT renames architectural registers onto 128 physical registers; it tracks the latest version of each architectural register and tells the next instruction where to get its input operands. The Allocator allocates uops from the uop queue into 126 ROB entries, 128 integer and 128 floating-point physical registers, 48 load buffer entries, and 32 store buffer entries (up from 24 in previous P4's). After allocation and renaming, uops are placed in one of 2 queues: a memory instruction queue for memory operations (loads and stores), and a general instruction queue for all other operations. These two queues send uops to five scheduler queues; each scheduler has its own 8–12 entry queue from which it selects uops to be executed. [figure: Uop Queue → RAT/Alloc → Gen inst Q and Mem inst Q → Sched Q 1–5]
OOO Core – Schedulers: Five uop schedulers schedule different uop types to the execution units. Collectively, the schedulers can dispatch up to 6 uops/cycle, twice the rate at which the OOO core can receive uops from the Trace Cache; this allows higher flexibility to issue ready uops on the different ports. Schedulers determine when uops are ready to execute based on the readiness of their input register operands and the availability of the execution unit resources. The 5 schedulers are connected to 4 dispatch ports; when multiple schedulers have uops ready to execute to the same port, the schedulers arbitrate for the dispatch port.
OOO Core – Dispatch Ports:
Port 0: in the 1st half of a cycle, can dispatch a uop either to the FP/Media Move unit (FP/SSE register move, store data, FXch) or to the double-speed fast ALU (add/sub, logic, branch, some of shift/rotate); in the 2nd half of the cycle, can dispatch one more uop to the fast ALU.
Port 1: in the 1st half of a cycle, can dispatch a uop either to the complex/slow units (FP/SSE add, FP/SSE mul, FP/SSE div, MMX, integer shift/rotate, integer multiply) or to the fast ALU.
Port 2: can dispatch one load per cycle (all loads, LEA, SW prefetch).
Port 3: can dispatch one store address per cycle.
OOO Core – Retirement After execution, uops are placed in the ROB, waiting to be retired Retirement logic commits the architecture state in program order Once stores have retired, store data needs to be written into the L1 D-cache
Memory System: Store instructions are broken up into two uops: store address, which calculates the address of the store, and store data, which writes the data into the Store Data buffer. The actual write to memory is done when the store retires. Separating store-data and store-address is important for memory disambiguation: it allows the store-address uop to dispatch earlier, even before the stored data is known, and resolving address conflicts earlier opens the memory pipeline for other loads. Store-data and store-address can be issued to the execution units in parallel: the store-address uop is dispatched to the AGU when its sources (base and index registers) are ready, and the store-data uop is dispatched to the store data buffer unit independently, when its source operand is available.
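A small C illustration of why the split helps; slow_function is a hypothetical stand-in for a long-latency computation. The address of the store is known as soon as the base and index are, so the store-address uop can resolve disambiguation for later loads long before the data is ready.

```c
/* Stand-in for a long-latency computation (illustrative only). */
static double slow_function(double x)
{
    return x * x;
}

void split_store_example(double *a, int i, double x)
{
    /* store-address uop: computes &a[i] as soon as `a` and `i` are ready;
     * store-data uop: waits for slow_function's result before writing the
     * value into the store data buffer. */
    a[i] = slow_function(x);
}
```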
Memory System (cont.): Loads are scheduled asynchronously to store-data uops, so a load can execute before a store-data uop even if it needs to get its data from that store; such a load has to be re-executed after the store-data uop finally executes. This causes two problems: it adds latency, because the latency seen by a re-executed load is much longer than the common-case store-to-load latency, and re-executing a load wastes precious load bandwidth. A simple predictor was added that marks whether a load uop is likely to receive forwarded data and, if so, from which store. A load that is predicted to forward is held in the load scheduler until the store-data uop that produces the data it depends on is scheduled. Both of the performance penalties mentioned above are significantly reduced.
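A minimal sketch, with assumed sizes and indexing, of such a load/store dependence predictor: a small table indexed by a hash of the load's IP, remembering whether that load recently received forwarded data and from which older store.

```c
#include <stdbool.h>
#include <stdint.h>

#define PRED_ENTRIES 256               /* assumed table size */

typedef struct {
    bool    likely_forwards;           /* load usually gets its data from a store      */
    uint8_t store_distance;            /* which older store, in allocation order       */
} ForwardPredEntry;

static ForwardPredEntry pred[PRED_ENTRIES];

static ForwardPredEntry *lookup(uint64_t load_ip)
{
    return &pred[load_ip % PRED_ENTRIES];   /* simple IP hash */
}

/* Training side: when a load had to be replayed because it ran ahead of the
 * store-data uop, remember that it tends to forward. */
void train_on_replay(uint64_t load_ip, uint8_t store_distance)
{
    ForwardPredEntry *e = lookup(load_ip);
    e->likely_forwards = true;
    e->store_distance  = store_distance;
}

/* Scheduling side: a load whose entry says likely_forwards is held in the
 * load scheduler until the matching store-data uop has been scheduled. */
bool should_hold_load(uint64_t load_ip)
{
    return lookup(load_ip)->likely_forwards;
}
```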
SW prefetch instructions: Get data into the L2 cache before the data is actually used. They opportunistically look up the L2 cache and, on a miss, initiate a data prefetch cycle on the front-side bus. Data is filled only into the L2 cache, so as not to pollute the small L1 D-cache. If the prefetch access misses the DTLB: in previous P4's, the prefetch operation is dropped; in Prescott, a page table walk is initiated and the DTLB is filled. If the prefetch has a page fault, it is silently dropped: the fault is not reported to the OS, and the prefetch is not performed.
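An example of issuing software prefetches from C with the SSE prefetch intrinsic. The prefetch distance of 16 elements is an illustrative choice, not a tuned value; on the Pentium 4 the prefetched line is filled into the L2 cache rather than the L1 D-cache, as noted above.

```c
#include <xmmintrin.h>

float sum_with_prefetch(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        /* Ask for a line several iterations ahead of the current access. */
        if (i + 16 < n)
            _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T1);
        s += a[i];
    }
    return s;
}
```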
Hardware Prefetch: Reduces the time spent waiting for DRAM. It can prefetch both code and data streams; a data stream can be accessed by loads and/or stores. The hardware prefetcher looks for streams of data, tries to predict what data will be needed next by the processor, and proactively fetches that data. It can be superior to software prefetching: it requires no effort from the programmer and can improve performance on code that has no software prefetch instructions. Prescott improved the prefetcher.
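A minimal sketch of the stream/stride detection a hardware prefetcher performs; the structure, thresholds, and lookahead here are illustrative assumptions, not the actual Pentium 4 design.

```c
#include <stdint.h>

typedef struct {
    uint64_t last_addr;     /* last miss address seen for this stream         */
    int64_t  stride;        /* candidate distance between consecutive misses  */
    int      confidence;    /* how many times the stride has repeated         */
} StreamEntry;

/* Called on a cache miss; returns an address to prefetch, or 0 for none. */
uint64_t on_miss(StreamEntry *s, uint64_t miss_addr)
{
    int64_t new_stride = (int64_t)(miss_addr - s->last_addr);

    if (new_stride != 0 && new_stride == s->stride)
        s->confidence++;               /* same stride again: stream confirmed */
    else {
        s->stride = new_stride;        /* start tracking a new candidate      */
        s->confidence = 0;
    }
    s->last_addr = miss_addr;

    /* Once confident, fetch ahead of the program along the detected stride. */
    return (s->confidence >= 2) ? miss_addr + 2 * s->stride : 0;
}
```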
Low Latency L1 Data Cache: 16K-byte, 8-way set-associative, 64 bytes per cache line, virtually addressed and physically tagged, write-through (writes to the L1 D-cache are always copied into the L2 cache). It supports one load and one store per clock cycle. Load latency is a key aspect of processor performance: IA-32 code has many loads and stores due to the small number of registers. The L1 D-cache has a 2-clock load-use latency for integer loads. The time from the load scheduler to execution is longer than the load execution latency itself; load-to-use latency is minimized by using a partial virtual address match to detect early in the pipeline whether a load is likely to hit or miss in the cache.
Low Latency L1 Data Cache (cont.): The scheduler assumes that the load will hit the L1 D-cache and dispatches dependent operations before the parent load has finished executing; this is a form of data speculation. If the load eventually misses the L1 D-cache, dependent operations that have started execution use incorrect data. Replay logic tracks and re-executes instructions that used incorrect data; only dependent operations are replayed, while independent ops are allowed to complete. Up to 4 outstanding L1 D-cache load misses can be pending at any one time.
Store-to-Load Forwarding: Stores write to memory (the L1 D-cache) only when they retire. Due to the deep pipeline, a long time can pass from when a store has valid data to when the store retires and the data is written into the cache, and stores that are ready to retire often have to wait for previous stores to retire. If a load dependent on a store had to wait for the store to commit before it could get its data, there would be a significant performance reduction. Each store is therefore allocated an entry in a Store Forwarding Buffer (SFB); the SFB has 32 entries, so up to 32 stores can be in the pipeline (in various stages). Once the store data is known, the store writes the data into its SFB entry. For a load, the load address is matched against all older stores in the SFB, in parallel with the load's L1 D-cache access; if it matches, the load gets its data directly from the SFB and does not need to wait for the store to retire.
Store-to-Load Forwarding (cont.): Forwarding is allowed if the load hits the same address as a preceding completed store in the SFB, the load has the same size as or is smaller than the pending store, and the load has the same beginning physical address as the store. Otherwise, the load gets its data from the cache, and cannot complete until the store has committed its state to the cache. The forwarding mechanism has the same latency as a cache lookup, so the SFB does not have time to do a full address match and access-size check; the load address is only partially matched against the store addresses, which can cause wrong forwarding. Later in the pipeline, the MOB performs a full address match and verifies that the load got its data from the most recent dependent store. If forwarding was incorrect, the load re-executes after the dependent store writes to the L1 cache, and gets its data from the cache.
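A small C example of which loads these rules allow to forward; the comments describe the expected hardware behavior under the rules above, and the upper-DWord case anticipates the force-forwarding enhancement covered later in these slides.

```c
#include <stdint.h>

static union {
    uint64_t q;
    uint32_t d[2];
    uint16_t w[4];
    uint8_t  b[8];
} m;

void forwarding_cases(void)
{
    uint64_t x;

    m.q = 0x1122334455667788ULL;  /* 8-byte store                                   */

    x = m.d[0];   /* 4-byte load, same start address, smaller: can forward          */
    x = m.b[0];   /* 1-byte load, same start address: can forward                   */
    x = m.w[1];   /* 2-byte load at offset 2: different start address, so it cannot
                     forward and waits for the store to commit to the cache         */
    x = m.d[1];   /* 4-byte load of the upper DWord: not forwardable by the basic
                     rules; Prescott's force-forwarding mechanism handles this case */
    (void)x;
}
```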
Bus: Memory requests not satisfied by the cache hierarchy are serviced by the bus logic. The bus logic interfaces to the interrupt controller (APIC) and to off-chip system memory and I/O space. It also deals with snooping (cache coherency of requests originated by other external bus agents) and with incoming interrupt request delivery via the local APICs.
Hyper-Threading Technology – based on "Hyper-Threading Technology Architecture and Microarchitecture", Intel Technology Journal
Thread-Level Parallelism: Multiprocessor systems have been used for many years, and there are known techniques to exploit multiprocessors. Software trends: applications consist of multiple threads or processes that can be executed in parallel on multiple processors, e.g., on-line transaction processing and web services. Threads can be from the same application, from different applications running simultaneously, from operating system services, or from operating system threads doing background maintenance. Thread-level parallelism (TLP) gives a better ratio of performance to transistor count and power; increasing single-thread performance becomes harder and harder and is less and less power efficient.
Chip Multi-Processing (CMP): In CMP, two processors are put on a single die. Each processor has a full set of execution and architectural resources; the processors may or may not share a large on-chip cache. CMP is orthogonal to multi-processor systems: a system can have multiple CMP processors in a multiprocessor configuration. A CMP chip is significantly larger than a single-core chip, and therefore more expensive to manufacture; however, process technology enables it. CMP chips have been announced or are being developed by all major companies.
Multi-Threading: In multi-threading, a single processor executes multiple threads. Time-slice multithreading: the processor switches between software threads after a fixed period; it can effectively minimize the effects of long latencies to memory, but might result in wasted execution slots. Switch-on-event multithreading: switch threads on long-latency events such as cache misses; works well for server applications that have many cache misses and where the two threads are executing similar tasks. Neither time-slice nor switch-on-event multithreading optimally overlaps the use of inefficiently used resources, e.g., slots wasted on branch mis-predictions and instruction dependencies. Simultaneous multi-threading (SMT): multiple threads execute on a single processor simultaneously without switching; this makes the most effective use of processor resources and maximizes performance vs. transistor count and power.
Hyper-threading (HT) Technology: HT is SMT. It makes a single processor appear as multiple logical processors. Each logical processor keeps its own copy of the architecture state, which takes a small fraction of the die. Each logical processor has its own interrupt controller, so interrupts sent to a specific logical processor are handled only by it. The architecture state consists of registers, including the general-purpose registers, the control registers, and some machine state registers. From a software or architecture perspective, the OS views the logical processors as physical processors and schedules threads to logical processors as in a multiprocessor system. From a micro-architecture perspective, the logical processors share a single set of physical resources: caches, execution units, branch predictors, control logic, and buses.
Two Important Goals: (1) When one logical processor is stalled, the other logical processor can continue to make progress. A logical processor may be temporarily stalled for a variety of reasons: cache misses, branch mispredictions, waiting for results from previous instructions. Buffering queues separate major pipeline logic blocks; independent progress is ensured either by partitioning buffering queues and limiting the number of entries each thread can use, or by duplicating buffering queues. (2) A single active thread running on a processor with HT runs at the same speed as without HT: partitioned resources are recombined when only one thread is active.
A high-level view of the pipeline [figure: pipeline stages Fetch, Decode, Queue, TC / MS-ROM, Rename / Allocate, OOO Scheduler / Execute, Retirement, plus Arch State, Physical Registers, and APIC, with each structure marked as Shared, Duplicated, or Partitioned]
Front End: Each logical processor manages its own next-instruction-pointer. [figure: front-end paths for a TC hit vs. a TC miss]
TC and Ucode-ROM: The logical processors arbitrate for TC access every cycle; if both want to access the TC, access is granted in alternating cycles. If one logical processor is stalled or is unable to use the TC, the other logical processor can use the full TC bandwidth. TC entries are tagged with a thread-ID and are dynamically allocated as needed, which allows one logical processor to have more entries than the other. Both logical processors share the ucode ROM; each manages its own ucode-ROM pointer, and ucode-ROM access alternates between logical processors just as with the TC.
BPU and IFU: Branch prediction structures are either duplicated or shared. The return stack buffer is duplicated: it is a very small structure, and call/return pairs are better predicted per thread. Global history is tracked for each logical processor; the large global history array is shared, with entries tagged with a logical processor ID. Each logical processor has its own ITLB. The logic in charge of sending fetch requests to the L2 cache arbitrates on a first-come first-served basis, always reserving at least one request slot for each logical processor, so both logical processors can have fetches pending simultaneously. Two 64-byte streaming buffers per logical processor hold instruction bytes in preparation for instruction decode; these are small structures, so the die-size cost of duplicating them is low.
Decoder and Uop Queue: Both logical processors share the same decoder logic; if only one needs the decode logic, it gets the full decode bandwidth. The state needed by the decoder is duplicated. Several instructions are decoded for one logical processor before switching to the other. Decoded instructions are written into the TC and forwarded to the uop queue. The uop queue is hard-partitioned, which allows both logical processors to make independent forward progress regardless of FE stalls (e.g., TC miss) or EXE stalls.
Out-of-order Execution Engine Allocation, register renaming, scheduling, execution
Allocator and RAT: The ROB and MOB are hard-partitioned to enforce fairness and prevent deadlocks. The Allocator: if there are uops for both logical processors in the uop queue, it alternates between logical processors every cycle to assign resources; if a logical processor has reached the limit in one of the buffers, the Allocator continues to assign resources to the other logical processor; if the uop queue only contains uops for one logical processor, it assigns resources to that logical processor every cycle (the resource limits are still enforced). Register renaming: there are two RATs, one for each logical processor; renaming is done in parallel to the allocator logic, for the same uops to which the allocator is assigning resources; registers are renamed to a shared physical register pool.
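A minimal sketch of this allocation policy: round-robin between the logical processors each cycle when both have uops, and skip a thread that has hit its per-thread limit in a partitioned buffer. The structure and limit value are illustrative assumptions.

```c
#include <stdbool.h>

#define NUM_THREADS        2
#define ROB_LIMIT_PER_TID 63           /* assumed per-thread share of a partitioned ROB */

typedef struct {
    int uops_waiting;                  /* uops in the uop queue for this thread  */
    int rob_entries_in_use;            /* ROB entries this thread currently holds */
} ThreadState;

/* Returns the thread that gets allocation this cycle, or -1 if none can. */
int pick_thread_to_allocate(ThreadState t[NUM_THREADS], int last_tid)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int tid = (last_tid + i) % NUM_THREADS;   /* round-robin after the last pick */
        if (t[tid].uops_waiting > 0 &&
            t[tid].rob_entries_in_use < ROB_LIMIT_PER_TID)
            return tid;
    }
    return -1;                                    /* both stalled or empty this cycle */
}
```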
Instruction Scheduling: After allocation and renaming, uops are placed in one of 2 queues, the memory instruction queue or the general instruction queue. The two queues are hard-partitioned. They send uops to the scheduler queues as fast as they can, alternating between the two logical processors every clock cycle. The schedulers are oblivious to logical processors and choose uops regardless of which logical processor they belong to; uops are evaluated only based on dependent inputs and availability of execution resources, and uops from the two logical processors can be dispatched in the same cycle. To avoid deadlock and ensure fairness, there is a limit on the number of active entries that a logical processor can have in each scheduler's queue; this limit depends on the size of the scheduler queue.
Execution Units: The execution core and memory hierarchy are also largely oblivious to logical processors. Registers are renamed to a shared physical register pool, so uops merely access the physical register file to get their operands and write results back to the physical register file. Forwarding logic compares physical register numbers and forwards results to other uops without any knowledge of logical processors.
Retirement: Retirement logic tracks when uops from the two logical processors are ready to be retired. It retires uops in program order for each logical processor by alternating between the two logical processors every cycle; if one logical processor is not ready to retire uops, all retirement bandwidth is dedicated to the other logical processor. Once stores have retired, the store data is written into the L1 D$; selection logic alternates between the two logical processors to commit store data to the cache.
Memory Subsystem: The memory subsystem is largely oblivious to logical processors: the schedulers send load and store uops regardless of logical processor, and the memory subsystem handles them as they come. DTLB: the DTLB translates addresses to physical addresses; it has 64 fully associative entries, and each entry can map either a 4K or a 4MB page. Although the DTLB is a structure shared between the two logical processors, each entry includes a logical processor ID tag. Each logical processor also has a reservation register to ensure fairness and forward progress in processing DTLB misses.
Memory Subsystem (cont): L1 data cache, L2 cache, L3 cache: both logical processors share all entries in all 3 cache levels, regardless of which logical processor's uops initially brought the data into the cache. This creates the potential for cache conflicts, which can result in lower performance, but also the possibility of sharing data in the cache: one logical processor may prefetch instructions or data needed by the other into the cache, which is common in server application code, and in a producer-consumer model one logical processor may produce data that the other logical processor uses.
Single-task And Multi-task Modes: MT-mode (multi-task mode): two active logical processors; some of the resources are partitioned as described earlier. ST-mode (single-task mode): optimizes performance for the case of a single thread. There are two flavors of ST-mode: single-task logical processor 0 (ST0), where only logical processor 0 is active, and single-task logical processor 1 (ST1), where only logical processor 1 is active. Resources that were partitioned in MT-mode are re-combined to give the single active logical processor use of all of the resources.
Single-task And Multi-task Modes (cont): The HALT instruction stops processor execution and allows the processor to go into a lower-power mode. HALT is a privileged instruction (executed by the OS or ring-0 processes only). Moving the processor from MT mode to ST0/ST1 mode: if logical processor 0 executes HALT, move to ST1-mode; if logical processor 1 executes HALT, move to ST0-mode; if the remaining active logical processor also executes HALT, the physical processor can go into a lower-power mode. Moving the processor from ST0/ST1 mode to MT mode: send an interrupt to the HALTed logical processor.
Operating System And Applications: An HT processor appears to the OS and application SW as 2 processors. The OS manages logical processors as it does physical processors, scheduling runnable tasks or threads to logical processors. The OS should implement two optimizations: (1) Use HALT if only one logical processor is active, allowing the processor to transition to either the ST0 or ST1 mode; otherwise the OS would execute, on the idle logical processor, a sequence of instructions that repeatedly checks for work to do. This so-called "idle loop" can consume significant execution resources that could otherwise be used by the other, active logical processor. (2) On a multi-processor system, schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor; this allows software threads to use different physical execution resources when possible.
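A sketch contrasting the two idle strategies described above; work_available is a hypothetical placeholder for checking the run queue, and the HLT-based version can only actually run in ring 0 (i.e., inside the OS).

```c
#include <stdbool.h>

bool work_available(void);        /* hypothetical: checks the OS run queue */

/* Naive idle loop: keeps fetching and executing uops on the idle logical
 * processor, taking resources away from the active one. */
void idle_spin(void)
{
    while (!work_available())
        ;                         /* burns fetch/execute bandwidth */
}

/* HT-aware idle: the OS executes the privileged HLT instruction instead,
 * letting the processor drop into ST0/ST1 mode and recombine partitioned
 * resources for the remaining active logical processor; it wakes on the
 * next interrupt. */
void idle_halt(void)
{
    while (!work_available())
        __asm__ volatile("hlt");  /* ring-0 only; faults in user mode */
}
```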
HT in Prescott: Many of Prescott's enhancements were motivated by HT performance (and have little impact on the majority of single-threaded apps): the number of outstanding loads that miss the L1 D-cache was increased from 4 to 8; the queue between the front end and the Alloc/RAT was made larger; a page table walk and a memory access that splits a cache line can now be handled in parallel (rarely seen with a single thread, much more common with two threads); and new page table walks can be started while handling a page table walk that missed all the caches and had to go to DRAM. Changes were also made to some of the thread selection points in order to improve overall bandwidth: the TC responds faster to stalling events in the core, dedicating all of its resources to the thread that is not stalled, thereby generating better overall performance.
Backup
Schedulers – Prescott Increased the size of the two schedulers used to hold x87/SSE/SSE2/SSE3 uops Increased the window of opportunity to find parallelism in multimedia algorithms Increased the effective size of the queues that feed all the schedulers more uops can be buffered between the allocator and the scheduler before the allocator has to stall Allows Allocator and Renamer to continue to look ahead in the instruction stream even when the schedulers are full
XOR reg, reg: XOR of a register with itself is used to zero the register, rather than moving an immediate 0 into it. The result is logically equivalent, and the idiom is preferred because of the smaller resulting code size. However, it adds a dependency on the previous contents of the register, which can result in a performance loss. The processor recognizes when XOR is used in this manner and removes the dependency on the source register, since the result (0) does not depend on the value of the register. Prescott recognizes this for more XOR and SUB types.
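The idiom in practice: for the C function below, compilers typically emit the shorter xor form rather than moving an immediate 0; the dependence-breaking described above is what lets the hardware treat that xor as if it had no input operands.

```c
int zero(void)
{
    /* Typically compiled to:
     *     xor eax, eax        ; 2-byte encoding, result is always 0
     *     ret
     * rather than:
     *     mov eax, 0          ; 5-byte encoding
     * The processor recognizes the xor idiom and breaks the false
     * dependence on eax's previous contents. */
    return 0;
}
```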
Memory System: The number of write-combining buffers, which track streams of stores, was increased from 6 to 8; this alleviates pressure on the number of stores that can be in the machine simultaneously by allowing stores to be processed faster. The number of unique outstanding loads that have missed the L1 data cache and can be serviced was increased from 4 to 8. The L2 cache size was increased from 256/512 Kbytes to 1MB: 8-way set associative, 128-byte lines, write-back, physically addressed.
Memory System: A performance feature was added to enhance SSE/SSE2/SSE3 instructions. On the x87 side, the FP Control Word (FCW) is often modified when changing the rounding mode and precision; to avoid serializing the processor each time the FCW is modified, a simple predictor captures common renaming cases. This same idea is now extended to also handle the MXCSR, which is the counterpart of the FCW for instructions that use the SSE registers. On prior implementations, changes to the MXCSR would serialize the machine; Prescott also avoids serialization for common-case modifications of the MXCSR.
Pentium® 4 Block Diagram
90nm Pentium® 4 Block Diagram
Force Forwarding: Allows the MOB to control forwarding in the SFB. Two new selection points were added to the store-forwarding path: the forwarding-entry-selection mux, which allows the MOB to override the SFB's partial-address-match-based entry selection, and the data alignment mux, which allows misaligned data to be rotated, based on shift information provided by the MOB. The MOB can decide to re-execute a load in 2 cases: a partial address match caused an incorrect dependency in the SFB, or the SFB failed to detect a dependency when a dependency did exist. If the SFB's dependency check is wrong, the MOB can correct the forwarding logic when the load re-executes by directing the SFB in one of two ways: forward to the load from the right entry and rotate the data as necessary, or disable forwarding to the load if there is no dependent store in the SFB.
Forced Forwarding Entry [figure: the MOB alongside the Store Forwarding Buffer (SFB); an entry mux selects the forwarding entry, normally based on a partial virtual address match, but the MOB can supply a forced forwarding entry; an alignment mux, under MOB shift control, rotates the data to produce the load's forwarded data]
Force forwarding: Supported cases of misaligned forwarding, i.e., misaligned address cases fixed by force forwarding. For each load starting address, there is a set of data access sizes for which force forwarding is supported. These cases can be categorized as follows: a DWord/QWord store forwarding to Byte/Word loads whose data are fully contained in either the lower or upper DWord, and a QWord store forwarding to a DWord load from the upper DWord of the store. For each of these cases, the MOB "forces" the SFB to forward from a specific store with a given shift amount, in order to align the store's data to the load.
Force forwarding: False forwarding occurs when the SFB detects a partial address match between a load and a store, but their full addresses do not match. The MOB detects the false-forward condition and determines whether there exists another store that the load should have forwarded from. If such a store exists, the MOB directs the SFB to forward from that store entry, using the force-forwarding mechanism, when the load re-executes. If the MOB detects that there is no dependent store in the forwarding buffer, it instructs the SFB not to forward to this load; when the load is re-executed, it can then pick up its data from the cache instead.
Bus: Requests from the logical processors are treated on a first-come basis, with queue and buffering space appearing shared; priority is not given to one logical processor above the other. Bus queues distinguish between requests from the two logical processors. Requests to the local APIC and interrupt delivery resources are unique and separate per logical processor. The bus logic also carries out portions of barrier fence and memory ordering operations, applied to the bus request queues on a per-logical-processor basis. The logical processor ID is visibly sent onto the processor external bus in the request phase portion of a transaction, for debug purposes and as an aid to forward-progress mechanisms in clustered multiprocessor implementations. Other bus transactions, such as cache line eviction or prefetch transactions, inherit the logical processor ID of the request that generated the transaction.
HT – Context Identifier Bit: The partial virtual address indexing used for the L1 cache creates conflicts when each logical processor's access pattern matches the partial virtual tag even when accessing separate regions of physical memory. E.g., if the stacks of the 2 threads are offset by a fixed amount that is greater than the size of the partial match, the two addresses, although different, alias to the same partial tag; this causes contention in the cache, leading to a reduced cache hit rate. A context identifier bit is added to the partial virtual tag, aimed at reducing the likelihood of contention in the cache. It is dynamically set or reset based on the page-table initialization for each logical processor and serves as an indication of data-sharing intent across logical processors. If the two logical processors share the same page directory base in physical memory, this gives a strong indication that data are intended to be shared between the logical processors; the context-identifier bit for each logical processor is set to the same value, allowing sharing of the L1 data cache. If the page-directory bases are different, it is likely that the logical processors are working on separate data regions, and sharing of the L1 data cache is disallowed by keeping the context-identifier bit different across logical processors. There may be uncommon cases where logical processors use different page directory bases but still share the same physical memory region through page-table aliasing (two different page table entries across logical processors pointing to the same physical page frame); the processor detects such cases and implements a reservation mechanism to prevent repetitive L1 cache access conflicts among different logical processors.
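A small illustration, with assumed parameters, of the aliasing problem the context-identifier bit addresses: if the partial virtual tag covers, say, address bits 15..6, then two threads whose stacks are offset by a multiple of 64KB produce identical partial tags for corresponding stack slots even though the physical pages differ.

```c
#include <stdint.h>
#include <stdio.h>

#define PARTIAL_TAG_BITS 0xFFC0u   /* assumed: bits 15..6 take part in the partial match */

int main(void)
{
    uint32_t stack_a = 0x7FFE1000;             /* a stack slot of thread 0            */
    uint32_t stack_b = stack_a + 0x00100000;   /* thread 1 stack, offset by 1MB       */

    printf("partial tag A = %#x\n", stack_a & PARTIAL_TAG_BITS);
    printf("partial tag B = %#x\n", stack_b & PARTIAL_TAG_BITS);
    /* Both print the same value: without the context-identifier bit, the two
     * accesses collide on the partial tag and contend in the L1 D-cache. */
    return 0;
}
```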