Intel Multimedia Extensions and Hyper-Threading
Michele Co, CS451


Outline
Evolution of Intel multimedia extensions
  – x87 (386)
  – MMX (Pentium MMX, Pentium II)
  – SSE (Pentium III)
  – SSE2 (Pentium 4 – Willamette)
  – SSE3 (Pentium 4 – Prescott)
Hyper-Threading

x87 FPU
8 80-bit data registers (double extended-precision floating point)
Data registers treated as a stack
Control register – FP precision, rounding, …
Status register – FPU busy, TOS, CC, error, exception, …
Tag register – 2 bits per data register: valid, zero, special, empty
Last instruction pointer register
Last data (operand) pointer register
Opcode register
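The stack behaviour of the data registers can be illustrated with a small software model. This is only a sketch: the struct and function names are hypothetical, `long double` stands in for the 80-bit format, and the model never assigns the "special" tag (NaN, infinity, denormal), which real hardware does.

```c
#include <stdint.h>

/* Tag values, 2 bits per data register (names follow the slide). */
enum x87_tag { TAG_VALID = 0, TAG_ZERO = 1, TAG_SPECIAL = 2, TAG_EMPTY = 3 };

/* Hypothetical model of the data-register part of the x87 state. */
struct x87_model {
    long double st[8];   /* physical registers R0..R7 */
    uint8_t     tag[8];  /* one 2-bit tag per physical register */
    unsigned    top;     /* TOS field of the status register, 0..7 */
};

void x87_init(struct x87_model *f) {
    for (int i = 0; i < 8; i++) { f->st[i] = 0.0L; f->tag[i] = TAG_EMPTY; }
    f->top = 0;
}

/* FLD-like push: decrement TOP (mod 8), then write the new ST(0).
   Returns -1 on stack overflow (the target register is not empty). */
int x87_push(struct x87_model *f, long double v) {
    unsigned new_top = (f->top + 7u) & 7u;
    if (f->tag[new_top] != TAG_EMPTY) return -1;
    f->top = new_top;
    f->st[new_top] = v;
    /* Simplification: TAG_SPECIAL is never assigned in this model. */
    f->tag[new_top] = (v == 0.0L) ? TAG_ZERO : TAG_VALID;
    return 0;
}

/* FSTP-like pop: read ST(0), mark it empty, increment TOP (mod 8).
   Returns -1 on stack underflow (ST(0) is empty). */
int x87_pop(struct x87_model *f, long double *out) {
    if (f->tag[f->top] == TAG_EMPTY) return -1;
    *out = f->st[f->top];
    f->tag[f->top] = TAG_EMPTY;
    f->top = (f->top + 1u) & 7u;
    return 0;
}

/* ST(i) is addressed relative to TOP, not by physical register number. */
long double x87_st(const struct x87_model *f, unsigned i) {
    return f->st[(f->top + i) & 7u];
}
```

After `x87_init`, the first push lands in physical register R7 because TOP wraps from 0 to 7; a ninth push without an intervening pop reports overflow.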

x87 FPU State

x87 Data Types

x87 Instructions
Data transfer (load, store, move)
Basic arithmetic
Comparison
Transcendental (trigonometric, log, exp)
Load constant
x87 FPU control

MMX
SIMD execution
8 64-bit data registers (MMX)
  – Aliased to the low 64 bits of the x87 FPU data registers
Randomly accessible (addressed directly, not as a stack)

SIMD Execution
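As a rough illustration of SIMD execution on a 64-bit MMX register, the sketch below models two packed byte additions in portable C: a wraparound add in the style of PADDB and an unsigned saturating add in the style of PADDUSB. The function names are hypothetical and the loops only model the per-lane semantics; real hardware performs all eight lane operations with one instruction.

```c
#include <stdint.h>

/* Model of PADDB: eight independent 8-bit adds, each wrapping modulo 256. */
uint64_t paddb_model(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t x = (uint8_t)(a >> (8 * lane));
        uint8_t y = (uint8_t)(b >> (8 * lane));
        uint8_t s = (uint8_t)(x + y);          /* the carry never leaves the lane */
        r |= (uint64_t)s << (8 * lane);
    }
    return r;
}

/* Model of PADDUSB: same lanes, but each result clamps to 255 instead of wrapping. */
uint64_t paddusb_model(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned x = (uint8_t)(a >> (8 * lane));
        unsigned y = (uint8_t)(b >> (8 * lane));
        unsigned s = x + y;
        if (s > 255u) s = 255u;                /* unsigned saturation */
        r |= (uint64_t)s << (8 * lane);
    }
    return r;
}
```

For inputs `0x01FF` and `0x0101`, an ordinary 64-bit add gives `0x0300`, the wraparound model gives `0x0200`, and the saturating model gives `0x02FF`, which shows that each lane is computed in isolation.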

MMX State

MMX Registers

MMX Data Types

MMX Instructions
Data transfer
Arithmetic
Comparison
Conversion
Unpacking
Logical
Shift
Empty MMX state

SSE
Pentium III
8 128-bit data registers (XMM)
  – Independent of x87 FPU and MMX registers
SSE instructions can be executed in parallel with MMX/x87
MXCSR register – control and status for XMM registers (similar to the x87 control and status registers)
EFLAGS register – results of compare ops
128-bit packed single-precision fp data type
Prefetching, cacheability, store ordering control instructions

SSE State

XMM Registers

SSE Data Type

SSE Instructions
Packed and scalar single-precision floating point
Logical
Conversion
64-bit SIMD integer
MXCSR management
State management
Cacheability control, prefetch, memory ordering
  – SFENCE (store fence)
FXSAVE, FXRSTOR
  – Extension of the fast save and restore of x87 and MMX registers to also include save/restore of the XMM and MXCSR registers

Packed Single-Precision FP Operation

Scalar Single-Precision FP Operation
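The difference between the packed and scalar forms can be sketched with the SSE intrinsics from `<xmmintrin.h>`. The wrapper names `add_packed` and `add_scalar` are hypothetical; the sketch assumes an x86 target with SSE enabled, which is the default on x86-64 compilers.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* ADDPS: out[i] = a[i] + b[i] for all four lanes. */
void add_packed(const float a[4], const float b[4], float out[4]) {
    __m128 va = _mm_loadu_ps(a);   /* unaligned 128-bit load */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* ADDSS: out[0] = a[0] + b[0]; lanes 1..3 are passed through from a. */
void add_scalar(const float a[4], const float b[4], float out[4]) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));
}
```

With `a = {1, 2, 3, 4}` and `b = {10, 20, 30, 40}`, the packed form yields `{11, 22, 33, 44}` and the scalar form yields `{11, 2, 3, 4}`.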

Shuffle

Unpack and Interleave
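A minimal sketch of shuffle and unpack using the SSE intrinsics `_mm_shuffle_ps`, `_mm_unpacklo_ps` and `_mm_unpackhi_ps`. The wrapper names and the particular shuffle selector are illustrative choices, not taken from the slides; an x86 target with SSE is assumed.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* SHUFPS: the two low result lanes are selected from a, the two high lanes from b.
   _MM_SHUFFLE lists the selectors from result lane 3 down to result lane 0,
   so this call produces { a[3], a[2], b[1], b[0] }. */
void shuffle_example(const float a[4], const float b[4], float out[4]) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_shuffle_ps(va, vb, _MM_SHUFFLE(0, 1, 2, 3)));
}

/* UNPCKLPS: interleave the low halves -> { a[0], b[0], a[1], b[1] }. */
void unpack_low(const float a[4], const float b[4], float out[4]) {
    _mm_storeu_ps(out, _mm_unpacklo_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* UNPCKHPS: interleave the high halves -> { a[2], b[2], a[3], b[3] }. */
void unpack_high(const float a[4], const float b[4], float out[4]) {
    _mm_storeu_ps(out, _mm_unpackhi_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

The shuffle selector must be a compile-time constant, which is why each wrapper fixes one pattern rather than taking the selector as a parameter.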

SSE2
Pentium 4
More data types
More instructions to support new data types

SSE2 State

SSE2 Data Types

SSE2 Instructions
Support for additional types
CLFLUSH (cache line flush)
LFENCE (load fence)
MFENCE (load + store fence)

Packed Double-Precision FP Operations

Scalar Double-Precision FP Operations
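One way the packed and scalar double-precision operations might be combined is an array sum: packed adds (ADDPD) process two elements per step, and scalar adds (ADDSD) fold the result and handle an odd leftover element. The function name `sum_pd` is hypothetical; the intrinsics come from `<emmintrin.h>` and the sketch assumes an x86 target with SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Sum n doubles using two-lane packed adds plus scalar adds for the tail. */
double sum_pd(const double *x, size_t n) {
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;
    for (; i + 2 <= n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(x + i));      /* ADDPD: two lanes at once */

    /* Fold the high lane into the low lane with a scalar add (ADDSD). */
    __m128d hi = _mm_unpackhi_pd(acc, acc);
    __m128d total = _mm_add_sd(acc, hi);

    if (i < n)                                           /* odd leftover element */
        total = _mm_add_sd(total, _mm_load_sd(x + i));

    return _mm_cvtsd_f64(total);                         /* extract the low lane */
}
```

Because the two lanes accumulate separately, the order of additions differs from a plain left-to-right loop, so results can differ in the last bits for inputs that are not exactly representable.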

SSE3
Pentium 4 (Prescott)
  – Support for Hyper-Threading
13 new instructions
  – 10 SIMD support instructions
  – 1 x87 accelerating instruction (FISTTP, fp to int conversion)
  – 2 thread synchronization instructions
      MONITOR (monitor write-back stores)
      MWAIT (wait for write-back store)
No new state

Asymmetric Processing

Horizontal Data Movement
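The lane patterns of asymmetric processing (ADDSUBPS) and horizontal data movement (HADDPS) can be written out in plain C. This is a model of the semantics with hypothetical function names, not the instructions themselves; the corresponding intrinsics are `_mm_addsub_ps` and `_mm_hadd_ps` in `<pmmintrin.h>`, which typically require SSE3 to be enabled at compile time.

```c
/* Model of ADDSUBPS: even lanes subtract, odd lanes add. */
void addsubps_model(const float a[4], const float b[4], float out[4]) {
    float r[4];
    r[0] = a[0] - b[0];
    r[1] = a[1] + b[1];
    r[2] = a[2] - b[2];
    r[3] = a[3] + b[3];
    for (int i = 0; i < 4; i++) out[i] = r[i];   /* copy last so out may alias an input */
}

/* Model of HADDPS: adjacent pairs within each source are summed.
   The low half of the result comes from a, the high half from b. */
void haddps_model(const float a[4], const float b[4], float out[4]) {
    float r[4];
    r[0] = a[0] + a[1];
    r[1] = a[2] + a[3];
    r[2] = b[0] + b[1];
    r[3] = b[2] + b[3];
    for (int i = 0; i < 4; i++) out[i] = r[i];
}
```

With `a = {1, 2, 3, 4}` and `b = {10, 20, 30, 40}`, the first model gives `{-9, 22, -27, 44}` and the second gives `{3, 7, 30, 70}`.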

Hyper-Threading

Terminology
Process
  – Program associated with a context (state: registers, program counter, flags, etc.)
  – Consists of one or more threads
Thread
  – “Lightweight process” (less state)

Hyper-Threading
Single physical processor appears as 2 logical processors
Thread-level parallelism (TLP)
  – Many applications have software threads that can be executed simultaneously
      Online transaction processing
      Web services
Latency can leave execution units idle
  – Cache misses
  – Branch mispredictions
  – Waiting for loads/stores

Techniques for Minimizing Effect of Long Latency
Chip multiprocessing (CMP)
  – 2 processors on a single die
  – Larger than a single-core chip, so more expensive to manufacture
Time-slice or switch-on-event multithreading
  – Switch threads after a fixed time period or on long-latency events such as cache misses
  – Doesn’t address other sources of inefficient resource usage (branch mispredictions, instruction dependencies, etc.)
Simultaneous multithreading (SMT)
  – Multiple threads execute on a single processor without switching
  – Hyper-Threading is Intel’s implementation

Intel Hyper-Threading Demo

Resource Requirements for HT
Need to maintain 2 contexts
Replicated
  – Register renaming logic (RAT)
  – Instruction pointer
  – ITLB
  – Return stack predictor
  – Various other architectural registers (GP, control, APIC, machine state)
Partitioned
  – Re-order buffers (ROBs)
  – Load/store buffers
  – Various queues, like the scheduling queues, uop queue, etc.
Shared
  – Caches: trace cache, L1, L2, L3, microcode ROM
  – Microarchitectural registers
  – Execution units
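To make the idea of a partitioned resource concrete, the sketch below models a re-order buffer whose entries are capped per logical processor while both are active. Everything here is hypothetical: the struct, the function names, and the sizes (126 entries in total, at most half per logical processor) are assumptions for illustration and are not stated on the slide.

```c
#include <stdbool.h>

/* Assumed sizes, for illustration only. */
enum { ROB_ENTRIES = 126, ROB_PER_LP_MT = ROB_ENTRIES / 2 };

struct rob_model {
    int  used[2];   /* entries currently held by logical processor 0 and 1 */
    bool mt_mode;   /* true: both logical processors active, buffer partitioned */
};

/* Try to allocate one entry for logical processor lp (0 or 1). */
bool rob_alloc(struct rob_model *r, int lp) {
    int limit = r->mt_mode ? ROB_PER_LP_MT : ROB_ENTRIES;
    if (r->used[lp] >= limit) return false;                      /* own share is full */
    if (r->used[0] + r->used[1] >= ROB_ENTRIES) return false;    /* buffer is full */
    r->used[lp]++;
    return true;
}

/* Release one entry when an instruction from lp retires. */
void rob_retire(struct rob_model *r, int lp) {
    if (r->used[lp] > 0) r->used[lp]--;
}
```

The cap means a stalled logical processor can fill only its own half, so the other one can still allocate entries and make forward progress; with `mt_mode` false, a single thread may use the whole buffer.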

Hyper-Threading Goals
Minimize die area cost of implementing Hyper-Threading
Ensure forward progress by at least one logical processor
Maintain single-threaded performance

Frontend Changes
2 PCs (one per logical processor)
Arbitration for shared resource access
  – Trace cache, microcode ROM, caches
  – One logical processor at a time per structure
Thread tags per trace cache entry
Microcode ROM – 2 microcode instruction pointers
Wider pipeline latches to hold state for 2 contexts
Branch prediction
  – Return address stack (RAS) and branch history buffer duplicated
  – Global history shared, but entries tagged with logical processor ID
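Arbitration for a shared front-end structure could look roughly like the following. The slide only says that one logical processor at a time accesses each structure; the alternate-on-contention policy and the function name `arbitrate` are assumptions made for this sketch.

```c
/* Decide which logical processor (0 or 1) may access a shared structure
   this cycle. req0/req1 are nonzero when that logical processor wants
   access. *last remembers the previous grant so contention alternates.
   Returns -1 when nobody requests access (and leaves *last unchanged). */
int arbitrate(int req0, int req1, int *last) {
    int grant;
    if (req0 && req1)
        grant = 1 - *last;      /* both want it: take turns */
    else if (req0)
        grant = 0;              /* only LP0 wants it: full bandwidth to LP0 */
    else if (req1)
        grant = 1;              /* only LP1 wants it: full bandwidth to LP1 */
    else
        return -1;
    *last = grant;
    return grant;
}
```

Under this policy a logical processor that is stalled and not requesting costs the other nothing, while two busy logical processors each receive every other cycle.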

Trace Cache Hit

Trace Cache Miss

Hyper-threaded Execution

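Execution Modes
Single-task (ST), Multi-task (MT)
  – ST0, ST1: only logical processor 0 or only logical processor 1 is active
  – HALT: transitions from MT to ST0 or ST1, depending on which logical processor executes it
  – Interrupt sent to the halted logical processor transitions back to MT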
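The mode transitions can be sketched as a small state machine. The enum and function names are hypothetical, and the `MODE_BOTH_HALTED` state (both logical processors halted) is an assumption added so the model is complete; the slide only describes the MT, ST0 and ST1 modes.

```c
enum ht_mode { MODE_MT, MODE_ST0, MODE_ST1, MODE_BOTH_HALTED };

/* Logical processor lp (0 or 1) executes HALT. */
enum ht_mode on_halt(enum ht_mode m, int lp) {
    switch (m) {
    case MODE_MT:  return lp == 0 ? MODE_ST1 : MODE_ST0;   /* the other LP stays active */
    case MODE_ST0: return lp == 0 ? MODE_BOTH_HALTED : m;  /* LP1 is already halted */
    case MODE_ST1: return lp == 1 ? MODE_BOTH_HALTED : m;  /* LP0 is already halted */
    default:       return m;
    }
}

/* An interrupt is delivered to logical processor lp (0 or 1). */
enum ht_mode on_interrupt(enum ht_mode m, int lp) {
    switch (m) {
    case MODE_ST0:         return lp == 1 ? MODE_MT : m;   /* wakes halted LP1 */
    case MODE_ST1:         return lp == 0 ? MODE_MT : m;   /* wakes halted LP0 */
    case MODE_BOTH_HALTED: return lp == 0 ? MODE_ST0 : MODE_ST1;
    default:               return m;                       /* MT: both already active */
    }
}
```

For example, starting in MT mode, a HALT on logical processor 0 leaves only logical processor 1 running (ST1), and a later interrupt to logical processor 0 returns the machine to MT.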

HT Performance – OLTP

HT Performance – Web Server
