Intel Multimedia Extensions and Hyper-Threading
Michele Co
CS451
Outline
  Evolution of Intel multimedia extensions
    –x87 (386)
    –MMX (Pentium MMX, Pentium II)
    –SSE (Pentium III)
    –SSE2 (Pentium 4 – Willamette)
    –SSE3 (Pentium 4 – Prescott)
  Hyper-Threading
x87 FPU
  8 80-bit data registers (double extended-precision floating point)
  Data registers treated as a stack
  Control register – FP precision, rounding, …
  Status register – FPU busy, TOS, CC, error, exception, …
  Tag register – 2 bits per data register: valid, zero, special, empty
  Last instruction pointer register
  Last data (operand) pointer register
  Opcode register
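To make "data registers treated as a stack" concrete, here is a minimal C sketch of stack-relative register addressing. It is a toy model, not the hardware: the type and function names (`X87Stack`, `x87_fld`, `x87_faddp`) are made up for illustration, `long double` stands in for the 80-bit format, and a simple depth counter stands in for the tag register.

```c
#include <assert.h>

/* Toy model of the x87 data-register stack (illustrative names only). */
typedef struct {
    long double reg[8]; /* physical registers R0..R7 */
    unsigned top;       /* TOS field of the status register, 0..7 */
    unsigned depth;     /* count of non-empty registers (stand-in for the tag register) */
} X87Stack;

static void x87_init(X87Stack *s) { s->top = 0; s->depth = 0; }

/* ST(i) names the register i slots from the top, wrapping modulo 8. */
static long double *x87_st(X87Stack *s, unsigned i) {
    assert(i < s->depth);
    return &s->reg[(s->top + i) & 7];
}

/* FLD-style push: decrement TOS, then write the new ST(0). */
static void x87_fld(X87Stack *s, long double v) {
    assert(s->depth < 8);          /* a real FPU would signal stack overflow */
    s->top = (s->top - 1) & 7;
    s->reg[s->top] = v;
    s->depth++;
}

/* FADDP-style add and pop: ST(1) = ST(1) + ST(0), then pop ST(0). */
static void x87_faddp(X87Stack *s) {
    assert(s->depth >= 2);
    *x87_st(s, 1) += *x87_st(s, 0);
    s->top = (s->top + 1) & 7;
    s->depth--;
}
```

Pushing 1.5 and 2.25 and then calling `x87_faddp` leaves a single value, 3.75, in ST(0). Operands are always named relative to the top of the stack, never by physical register number.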
x87 FPU State
x87 Data Types
x87 Instructions
  Data transfer (load, store, move)
  Basic arithmetic
  Comparison
  Transcendental (trigonometric, log, exp)
  Load constant
  x87 FPU control
MMX
  SIMD execution
  8 64-bit data registers (MMX registers MM0–MM7)
    –Aliased to x87 FPU registers
  Randomly accessible (addressed directly by register number, not as a stack)
SIMD Execution
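As a rough illustration of SIMD execution on a 64-bit MMX-style register, the following portable C sketch models a packed 16-bit add with wraparound (the behavior of a PADDW-style instruction). It is a scalar model for reading, not MMX code, and the function name `packed_add16` is made up for this example.

```c
#include <stdint.h>

/* Scalar model of a packed 16-bit add with wraparound.
   The 64-bit value holds four independent 16-bit lanes;
   a carry out of one lane never reaches the next lane. */
static uint64_t packed_add16(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        uint16_t sum = (uint16_t)(x + y);        /* wraps modulo 2^16 */
        result |= (uint64_t)sum << (16 * lane);
    }
    return result;
}
```

The hardware performs all four lane additions with one instruction. The loop here only spells out the lane boundaries, which is what distinguishes the packed add from an ordinary 64-bit add.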
MMX State
MMX Registers
MMX Data Types
MMX Instructions
  Data transfer
  Arithmetic
  Comparison
  Conversion
  Unpacking
  Logical
  Shift
  Empty MMX state
SSE
  Pentium III
  8 128-bit data registers (XMM)
    –Independent of x87 FPU and MMX registers
  SSE instructions can be executed in parallel with MMX/x87
  MXCSR register – control and status for XMM registers (similar to the x87 control and status registers)
  EFLAGS register – results of compare ops
  128-bit packed single-precision fp data type
  Prefetching, cacheability, store ordering control instructions
SSE State
XMM Registers
SSE Data Type
SSE Instructions
  Packed and scalar single-precision floating point
  Logical
  Conversion
  64-bit SIMD integer
  MXCSR management
  State management
  Cacheability control, prefetch, memory ordering
    –SFENCE (store fence)
  FXSAVE, FXRSTOR
    –Fast save and restore of x87 and MMX state, extended to also save/restore the XMM and MXCSR registers
Packed Single-Precision FP Operation
Scalar Single-Precision FP Operation
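The difference between the packed and scalar single-precision forms can be sketched in portable C. This models the semantics of ADDPS and ADDSS on a four-lane register. The `Xmm` struct and the lowercase function names are illustrative stand-ins, not intrinsics.

```c
/* Four single-precision lanes; lane[0] is the least-significant element. */
typedef struct { float lane[4]; } Xmm;

/* Packed add (ADDPS semantics): every lane of dst is added to the
   matching lane of src. */
static Xmm addps(Xmm dst, Xmm src) {
    for (int i = 0; i < 4; i++)
        dst.lane[i] += src.lane[i];
    return dst;
}

/* Scalar add (ADDSS semantics): only lane 0 is added; the upper three
   lanes pass through unchanged from the destination operand. */
static Xmm addss(Xmm dst, Xmm src) {
    dst.lane[0] += src.lane[0];
    return dst;
}
```

With dst = {1, 2, 3, 4} and src = {10, 20, 30, 40}, the packed form yields {11, 22, 33, 44} and the scalar form yields {11, 2, 3, 4}.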
Shuffle
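A shuffle selects lanes by an immediate control byte. The sketch below models the SHUFPS selection rule in portable C: the two low result lanes are chosen from the destination operand and the two high result lanes from the source operand, two control bits per lane. The `Xmm` struct and the function name are illustrative, not intrinsics.

```c
#include <stdint.h>

/* Four single-precision lanes; lane[0] is the least-significant element. */
typedef struct { float lane[4]; } Xmm;

/* SHUFPS-style shuffle: imm8 holds four 2-bit lane selectors. */
static Xmm shufps(Xmm dst, Xmm src, uint8_t imm8) {
    Xmm r;
    r.lane[0] = dst.lane[imm8 & 3];          /* bits 1:0 pick from dst */
    r.lane[1] = dst.lane[(imm8 >> 2) & 3];   /* bits 3:2 pick from dst */
    r.lane[2] = src.lane[(imm8 >> 4) & 3];   /* bits 5:4 pick from src */
    r.lane[3] = src.lane[(imm8 >> 6) & 3];   /* bits 7:6 pick from src */
    return r;
}
```

Passing the same register as both operands with control byte 0x1B reverses the four lanes, and control byte 0x00 broadcasts lane 0.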
Unpack and Interleave
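Unpack instructions interleave elements from two registers. As a sketch, the portable C below models UNPCKLPS (interleave the low halves) and UNPCKHPS (interleave the high halves). The `Xmm` struct and the function names are illustrative.

```c
/* Four single-precision lanes; lane[0] is the least-significant element. */
typedef struct { float lane[4]; } Xmm;

/* UNPCKLPS semantics: interleave the two low lanes of dst and src. */
static Xmm unpcklps(Xmm dst, Xmm src) {
    Xmm r = {{ dst.lane[0], src.lane[0], dst.lane[1], src.lane[1] }};
    return r;
}

/* UNPCKHPS semantics: interleave the two high lanes of dst and src. */
static Xmm unpckhps(Xmm dst, Xmm src) {
    Xmm r = {{ dst.lane[2], src.lane[2], dst.lane[3], src.lane[3] }};
    return r;
}
```

Applied to separate arrays of x and y coordinates, for example, the pair of calls produces interleaved (x, y) pairs.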
SSE2
  Pentium 4
  More data types
  More instructions to support new data types
SSE2 State
SSE2 Data Types
SSE2 Instructions
  Support for additional types
  CLFLUSH (cache line flush)
  LFENCE (load fence)
  MFENCE (load + store fence)
Packed Double-Precision FP Operations
Scalar Double-Precision FP Operations
SSE3
  Pentium 4 (Prescott)
    –Support for Hyper-Threading
  13 new instructions
    –10 SIMD support instructions
    –1 x87 accelerating instruction (fp to int conversion)
    –Synchronization of threads
        MONITOR (monitor write-back stores)
        MWAIT (wait for write-back store)
  No new state
Asymmetric Processing
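Asymmetric processing means different lanes of one instruction perform different operations. A minimal portable C model of the SSE3 ADDSUBPS instruction shows the idea: even-numbered lanes subtract and odd-numbered lanes add. The `Xmm` struct and the function name are illustrative.

```c
/* Four single-precision lanes; lane[0] is the least-significant element. */
typedef struct { float lane[4]; } Xmm;

/* ADDSUBPS semantics: subtract in lanes 0 and 2, add in lanes 1 and 3. */
static Xmm addsubps(Xmm dst, Xmm src) {
    Xmm r;
    r.lane[0] = dst.lane[0] - src.lane[0];
    r.lane[1] = dst.lane[1] + src.lane[1];
    r.lane[2] = dst.lane[2] - src.lane[2];
    r.lane[3] = dst.lane[3] + src.lane[3];
    return r;
}
```

This alternating pattern matches the sign structure of complex multiplication, where the real part is a difference of products and the imaginary part is a sum.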
Horizontal Data Movement
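Horizontal operations combine adjacent elements within a register instead of matching lanes across two registers. The portable C sketch below models the SSE3 HADDPS instruction. The `Xmm` struct and the function name are illustrative.

```c
/* Four single-precision lanes; lane[0] is the least-significant element. */
typedef struct { float lane[4]; } Xmm;

/* HADDPS semantics: the low half of the result holds the pairwise sums
   of dst, the high half holds the pairwise sums of src. */
static Xmm haddps(Xmm dst, Xmm src) {
    Xmm r;
    r.lane[0] = dst.lane[0] + dst.lane[1];
    r.lane[1] = dst.lane[2] + dst.lane[3];
    r.lane[2] = src.lane[0] + src.lane[1];
    r.lane[3] = src.lane[2] + src.lane[3];
    return r;
}
```

Applying `haddps(v, v)` twice leaves the sum of all four elements of `v` in every lane, a common way to finish a dot product.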
Hyper-Threading
Terminology
  Process
    –Program associated with a context (state: registers, program counter, flags, etc.)
    –Consists of one or more threads
  Thread
    –“Lightweight process” (less state)
Hyper-Threading
  Single physical processor appears as 2 logical processors
  Thread-level parallelism (TLP)
    –Many applications have software threads that can be executed simultaneously
        Online transaction processing
        Web services
  Latency can leave execution units idle
    –Cache misses
    –Branch mispredictions
    –Waiting for loads/stores
Techniques for Minimizing the Effect of Long Latency
  Chip multiprocessing (CMP)
    –2 processors on a single die
    –Larger than a single-core chip, so more expensive to manufacture
  Time-slice or switch-on-event multithreading
    –Switch threads after a fixed time period or on long-latency events such as cache misses
    –Doesn’t address other sources of inefficient resource usage (branch mispredictions, instruction dependencies, etc.)
  Simultaneous multithreading (SMT)
    –Multiple threads execute on a single processor without switching
    –Hyper-Threading is Intel’s implementation
Intel Hyper-Threading Demo
Resource Requirements for HT
  Need to maintain 2 contexts
  Replicated
    –Register renaming logic (RAT)
    –Instruction pointer
    –ITLB
    –Return stack predictor
    –Various other architectural registers (GP, control, APIC, machine state)
  Partitioned
    –Re-order buffers (ROBs)
    –Load/store buffers
    –Various queues, such as the scheduling queues, uop queue, etc.
  Shared
    –Caches: trace cache, L1, L2, L3, microcode ROM
    –Microarchitectural registers
    –Execution units
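One way to picture a partitioned resource is a buffer in which each logical processor may hold at most half of the entries, so a stalled thread cannot starve the other. The C sketch below is only an illustration of that policy. The queue size, the type `PartitionedQueue`, and the function names are hypothetical and do not describe the actual Pentium 4 structures.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_ENTRIES    8                    /* hypothetical size, not a real Pentium 4 figure */
#define PER_THREAD_LIMIT (QUEUE_ENTRIES / 2)  /* each logical processor gets half */

/* Entries currently held by logical processor 0 and 1. */
typedef struct { unsigned in_use[2]; } PartitionedQueue;

/* Try to allocate one entry for logical processor lp.
   Returns false when lp has used up its own half, even if the
   other half of the queue is empty. */
static bool pq_alloc(PartitionedQueue *q, int lp) {
    assert(lp == 0 || lp == 1);
    if (q->in_use[lp] >= PER_THREAD_LIMIT)
        return false;
    q->in_use[lp]++;
    return true;
}

/* Release one entry previously allocated by logical processor lp. */
static void pq_release(PartitionedQueue *q, int lp) {
    assert(lp == 0 || lp == 1);
    assert(q->in_use[lp] > 0);
    q->in_use[lp]--;
}
```

If logical processor 0 fills its half while waiting on a cache miss, further allocations for it fail, but logical processor 1 can still allocate and keep making forward progress.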
Hyper-Threading Goals
  Minimize the die area cost of implementation
  Ensure forward progress by at least one logical processor
  Maintain single-threaded performance
Frontend Changes
  2 PCs
  Arbitration for shared resource access
    –Trace cache, microcode ROM, caches
    –One logical processor at a time per structure
  Thread tags per trace cache entry
  Microcode ROM – 2 microcode instruction pointers
  Wider pipeline latches to hold state for 2 contexts
  Branch prediction
    –RAS and branch history buffer duplicated
    –Global history shared, but tagged with logical processor ID
Trace Cache Hit
Trace Cache Miss
Hyper-threaded Execution
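Execution Modes
  Single-task (ST), Multi-task (MT)
    –ST0, ST1: only logical processor 0 or only logical processor 1 is active
    –HALT: transitions from MT to ST0 or ST1, depending on which logical processor executes it
    –Interrupt sent to the halted logical processor transitions back to MT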
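The mode transitions can be written as a small state machine. The C sketch below covers only the transitions listed on this slide and assumes that ST0 means logical processor 0 is the one still active (and ST1 likewise for logical processor 1). The enum and function names are made up for illustration.

```c
#include <assert.h>

/* MT: both logical processors active.
   ST0 / ST1: only logical processor 0 / 1 active (naming assumed). */
typedef enum { MODE_MT, MODE_ST0, MODE_ST1 } HtMode;

/* Logical processor lp executes HALT. */
static HtMode ht_on_halt(HtMode mode, int lp) {
    assert(lp == 0 || lp == 1);
    if (mode == MODE_MT)
        return lp == 0 ? MODE_ST1 : MODE_ST0;  /* the other one stays active */
    return mode;  /* HALT by the sole active logical processor is not modeled */
}

/* An interrupt is delivered to logical processor lp. */
static HtMode ht_on_interrupt(HtMode mode, int lp) {
    assert(lp == 0 || lp == 1);
    /* Waking the halted logical processor returns the core to MT. */
    if ((mode == MODE_ST0 && lp == 1) || (mode == MODE_ST1 && lp == 0))
        return MODE_MT;
    return mode;  /* interrupt to an already-active processor: no mode change */
}
```

Starting in MT, a HALT on logical processor 0 moves the core to ST1, and a later interrupt sent to logical processor 0 brings it back to MT.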
HT Performance - OLTP
HT Performance – Web Server