1
Nicolas Tjioe, CSE 520, Wednesday 11/12/2008
Hyper-Threading in NetBurst Microarchitecture
David Koufaty and Deborah T. Marr, Intel
Published by the IEEE Computer Society, IEEE Micro, Volume 23, Issue 2, March-April 2003, pages 56-65
2
Traditional Processor Design
- Higher clock speed: pipeline the microarchitecture into finer-grained stages, a technique called superpipelining.
- Instruction-level parallelism (ILP): in-order vs. out-of-order execution.
- Cache hierarchy: data held in the cache reduces the frequency of accesses to the slower main memory.
3
Design (cont.)
Existing techniques add die-size and power costs:
- Chip multiprocessing (CMP): each core carries a full set of execution and architectural resources.
- Time-slice multithreading.
- Simultaneous multithreading (SMT).
4
Hyper-Threading (HT)
- HT brings the SMT approach to the Intel architecture: a single physical processor appears to software as two logical processors.
- There is one copy of the architectural state for each logical processor, and both share a single set of physical execution resources.
- Hardware sees more instructions to execute; software can schedule more threads.
- HT added less than 5% to the relative chip size and maximum power requirements.
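The split between duplicated architectural state and shared execution resources can be pictured with a toy model (purely illustrative; the class and field names are invented, not Intel's design):

```python
# Conceptual sketch: two logical processors each keep a private
# architectural state (instruction pointer, registers) while funneling
# work through one shared set of execution resources.
from dataclasses import dataclass, field

@dataclass
class ArchState:
    ip: int = 0                              # per-logical-processor state
    regs: dict = field(default_factory=dict)

class PhysicalCore:
    def __init__(self):
        self.logical = [ArchState(), ArchState()]  # duplicated arch state
        self.executed = []                         # one shared execution stream

    def execute(self, lp: int, op: str):
        # Both logical processors advance their own state but share
        # the same execution resources underneath.
        self.logical[lp].ip += 1
        self.executed.append((lp, op))

core = PhysicalCore()
core.execute(0, "add")
core.execute(1, "mul")
print(core.logical[0].ip, core.logical[1].ip)  # each state advances independently
print(len(core.executed))                      # both ops went through one core
```

The point of the model is that duplicating only the architectural state (not the execution units) is what keeps HT's area cost under 5%.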
5
Microarchitecture Choices and Tradeoffs
- Partition: dedicate equal resources to each logical processor. Simple and low complexity; good for structures with high utilization and unpredictable demand. E.g., pipeline queues.
- Threshold: flexible resource sharing with a limit on the maximum resource usage. Ideal for small structures where resource utilization is bursty and predictable. E.g., the processor scheduler.
- Full sharing: flexible resource sharing with no limit on the maximum resource usage. Good for large structures whose working-set sizes are variable. E.g., processor caches.
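The threshold policy can be sketched as a shared queue with a per-thread cap (a minimal illustration; the sizes and names are made up, not NetBurst's actual parameters):

```python
# Threshold sharing: entries are shared, but no logical processor may
# hold more than a fixed cap, so one thread cannot monopolize the queue.
class ThresholdQueue:
    def __init__(self, size, cap):
        self.size, self.cap = size, cap
        self.entries = []              # (thread_id, uop) pairs

    def push(self, tid, uop):
        held = sum(1 for t, _ in self.entries if t == tid)
        if len(self.entries) >= self.size or held >= self.cap:
            return False               # this thread must stall
        self.entries.append((tid, uop))
        return True

q = ThresholdQueue(size=12, cap=8)
ok = [q.push(0, f"uop{i}") for i in range(10)]
print(ok.count(True))     # thread 0 stops at its cap of 8
print(q.push(1, "uopA"))  # thread 1 still finds room
```

Contrast with the two extremes: partitioning is a hard cap of size/2 per thread; full sharing is no cap at all.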
6
Shared vs. Partitioned Queue
Dark color: slower thread; light color: faster thread.
7
HT Resources
Duplicated:
- Register renaming logic
- Instruction pointer
- ITLB
- Return stack predictor
Partitioned:
- Reorder buffer (ROB)
- Load/store buffers
- Scheduling queues, uop queues
Shared:
- Caches: trace cache, L2
- Execution units
- Microarchitectural registers
8
Front-End Pipeline
- Execution trace cache: the trace cache (TC) stores decoded instructions, called micro-operations (uops).
- Microcode ROM: for complex instructions, the TC sends a microcode instruction pointer to the microcode ROM.
- Instruction Translation Lookaside Buffer (ITLB): on a trace cache miss, the ITLB receives the request from the TC to deliver new instructions, translating the next-instruction pointer address to a physical address. Each streaming buffer is 64 bytes.
- IA-32 instruction decode: decoding is needed only for instructions that miss the TC. The decoder alternates between threads, which requires two copies of the decode state.
- Uop queue: partitioned, so each logical processor gets half the entries. It sends uops from the front-end pipeline to the out-of-order execution engine.
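The hit/miss paths through the front end can be sketched as follows (a toy stand-in: the addresses, table contents, and `decode` helper are invented for illustration):

```python
# Front-end flow sketch: a TC hit delivers uops directly; a TC miss goes
# through the ITLB for address translation and then the IA-32 decoder.
trace_cache = {0x100: ["uop_a", "uop_b"]}  # decoded traces, keyed by IP
itlb = {0x200: 0xBEEF0200}                 # virtual -> physical mapping

def decode(phys_addr):
    # Stand-in for the IA-32 decoder (only exercised on a TC miss).
    return [f"uop@{phys_addr:#x}"]

def fetch_uops(ip):
    if ip in trace_cache:          # TC hit: no decode needed
        return trace_cache[ip]
    phys = itlb[ip]                # TC miss: translate via the ITLB
    uops = decode(phys)            # decode the fetched instructions
    trace_cache[ip] = uops         # fill the trace cache for next time
    return uops

print(fetch_uops(0x100))  # hit path
print(fetch_uops(0x200))  # miss path: ITLB + decoder, then TC fill
```

This is why the decoder can be shared and used only occasionally: most fetches are TC hits that bypass decode entirely.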
9
Out-of-Order Execution Engine
- Allocator: alternates selecting uops from each logical processor every clock cycle, and signals a stall if a resource limit is reached.
- Register rename: renames the 8 IA-32 architectural registers onto the 128 physical machine registers, allowing an instruction to run at the same time as another instruction that uses the same IA-32 registers. A register alias table (RAT) keeps track of the mappings.
- Instruction scheduling: four uop schedulers dispatch different types of uops to the different execution units. Each scheduler has its own queue of 8-12 entries.
- Retirement: the retirement logic alternates between the two logical processors to track which uops are ready to retire. Results are written to the L1 data cache. Each logical processor can use at most 63 ROB entries, 24 load buffers, and 12 store buffers.
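Register renaming with a RAT can be sketched in a few lines (illustrative only; real rename logic also handles readers, reclamation, and branch recovery):

```python
# Toy register renamer: map the 8 IA-32 architectural registers onto a
# 128-entry physical register file so that two writes to the same
# architectural register get independent physical storage.
free_list = [f"p{i}" for i in range(128)]  # free physical registers
rat = {}                                   # architectural -> physical map

def rename_write(arch_reg):
    phys = free_list.pop(0)   # allocate a fresh physical register
    rat[arch_reg] = phys      # later readers of arch_reg see this one
    return phys

r1 = rename_write("eax")  # first write to EAX
r2 = rename_write("eax")  # independent write: gets its own register
print(r1, r2)             # the two writes no longer conflict
print(rat["eax"])         # readers are steered to the newest value
```

Because each logical processor has its own renaming logic (a duplicated resource, per slide 7), both threads can reuse the same IA-32 register names without interfering.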
10
Dispatch and Execution Units
The maximum number of microinstructions that can be dispatched per cycle is six:
- Two microinstructions on port 0
- Two microinstructions on port 1
- One microinstruction on port 2
- One microinstruction on port 3
A port combines a fast execution unit with a slow unit. Ports 2 and 3 are used for memory operations (loads and stores). After execution, uops are placed in the ROB.
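The per-port dispatch limits above can be modeled with a small sketch (port widths are taken from the slide; the uop names are invented):

```python
# Per-cycle dispatch limit: at most two uops each on ports 0 and 1,
# one each on ports 2 and 3, for a maximum of six per cycle.
PORT_WIDTH = {0: 2, 1: 2, 2: 1, 3: 1}

def dispatch(ready):
    """ready: list of (port, uop); returns the uops dispatched this cycle."""
    used = {p: 0 for p in PORT_WIDTH}
    issued = []
    for port, uop in ready:
        if used[port] < PORT_WIDTH[port]:
            used[port] += 1
            issued.append(uop)   # port still has a slot this cycle
        # otherwise the uop waits for a later cycle
    return issued

ready = [(0, "fadd"), (0, "fmul"), (0, "fsub"),  # third uop on port 0 waits
         (1, "add"), (2, "load"), (3, "store")]
print(dispatch(ready))  # five uops issue; "fsub" exceeds port 0's width
```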
11
Single-Task (ST) and Multi-Task (MT) Modes
- Two ST modes, ST0 and ST1: only one logical processor is active, a low-power mode.
- Resources that were partitioned in MT mode are recombined to give the single active logical processor the entire set of resources.
- The HALT instruction transitions from MT to ST mode. It is privileged; only ring 0 (the OS) can execute it.
12
Experiment Setup
Processor: Intel Pentium 4 with HT enabled/disabled
Motherboard: Intel Desktop Board D850EMV2
RAM: 256-Mbyte PC1066 RDRAM
GPU: Leadtek WinFast A250 Ultra TD (nVidia GeForce4, 4x AGP graphics)
Software: Intel Application Accelerator v2.2.2128; Intel C and Fortran compilers 5.01 for SPEC; DirectX 8.1; Intel Chipset Software Installation Utility v4.00.1009
OS: Windows XP (build 2600)
13
Results
Cache hit rate and overall performance impact for a fully shared cache, normalized against the values for a partitioned cache.
14
Multithreading and Multitasking Performance
- HT performance on multithreaded software packages
- HT performance on multitasking workloads
15
Conclusions
- HT improves multithreaded applications by having each logical processor run software threads from the same application.
- HT speeds up multitasking workloads by having each logical processor run threads from different applications.
- Nehalem (Intel Core i7), planned for release in Q4 2008, scales up to 8 physical cores (16 logical processors).
16
Additional References
- Hyper-Threading Technology Architecture and Microarchitecture: ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.pdf
- http://www.hardwaresecrets.com/article/235/6