Published by Roderick Cole. Modified over 9 years ago.
1
Intel Multimedia Extensions and Hyper-Threading
Michele Co
CS451
2
Outline
Evolution of Intel multimedia extensions
  – x87 (386)
  – MMX (Pentium MMX, Pentium II)
  – SSE (Pentium III)
  – SSE2 (Pentium 4 – Willamette)
  – SSE3 (Pentium 4 – Prescott)
Hyper-Threading
4
x87 FPU
Eight 80-bit data registers (double extended-precision floating point)
Data registers treated as a stack
Control register – FP precision, rounding, …
Status register – FPU busy, TOS (top of stack), CC (condition codes), error, exception, …
Tag register – 2 bits per data register (valid, zero, special, empty)
Last instruction pointer register
Last data (operand) pointer register
Opcode register
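The stack addressing of the data registers can be sketched with a small model. This is an illustrative toy, not the hardware's behavior in every detail: the class name `X87Stack` and its methods are hypothetical, `None` stands in for an "empty" tag, and a Python exception replaces the FPU's stack-fault handling.

```python
class X87Stack:
    """Toy model of the eight x87 data registers addressed as a stack."""

    def __init__(self):
        self.regs = [None] * 8  # physical registers R0..R7; None models an "empty" tag
        self.top = 0            # models the TOS field of the status register

    def push(self, value):
        # A load decrements TOS (modulo 8) and writes the new ST(0).
        self.top = (self.top - 1) % 8
        if self.regs[self.top] is not None:
            raise OverflowError("stack overflow: target register is not empty")
        self.regs[self.top] = value

    def pop(self):
        # A pop reads ST(0), marks it empty, and increments TOS (modulo 8).
        value = self.regs[self.top]
        if value is None:
            raise IndexError("stack underflow: ST(0) is empty")
        self.regs[self.top] = None
        self.top = (self.top + 1) % 8
        return value

    def st(self, i):
        # ST(i) is addressed relative to the current top of stack.
        return self.regs[(self.top + i) % 8]


stack = X87Stack()
stack.push(1.5)                    # roughly what a load (FLD) does
stack.push(2.0)
total = stack.pop() + stack.pop()  # two operands consumed from the top
```

Operands are named relative to the top of stack, so the same instruction refers to different physical registers as TOS moves.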
5
x87 FPU State
6
x87 Data Types
7
x87 Instructions
Data transfer (load, store, move)
Basic arithmetic
Comparison
Transcendental (trigonometric, log, exp)
Load constant
x87 FPU control
8
MMX
SIMD execution
Eight 64-bit data registers (MMX registers MM0–MM7)
  – Aliased to x87 FPU registers
Randomly accessible (not stack-addressed, unlike x87)
9
SIMD Execution
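A minimal sketch of SIMD execution on a 64-bit value: one operation is applied to eight independent 8-bit lanes. The helper `packed_add_bytes` is a hypothetical name; it models unsigned byte lanes in plain Python, in the spirit of MMX's wraparound (PADDB) and unsigned-saturating (PADDUSB) byte adds, and does not execute any MMX instruction.

```python
def packed_add_bytes(a, b, saturate=False):
    """Add two 64-bit values as eight independent unsigned 8-bit lanes."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        s = x + y
        # No carry crosses a lane boundary: either wrap or clamp within the lane.
        s = min(s, 0xFF) if saturate else s & 0xFF
        result |= s << shift
    return result


a = 0x01_02_03_04_F0_FF_10_20
b = 0x01_01_01_01_20_01_01_01
wrapped = packed_add_bytes(a, b)                    # overflowing lanes wrap around
saturated = packed_add_bytes(a, b, saturate=True)   # overflowing lanes clamp to 0xFF
```

Only the two lanes that overflow (0xF0 + 0x20 and 0xFF + 0x01) differ between the two results.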
10
MMX State
11
MMX Registers
12
MMX Data Types
13
MMX Instructions
Data transfer
Arithmetic
Comparison
Conversion
Unpacking
Logical
Shift
Empty MMX state
14
SSE
Pentium III
Eight 128-bit data registers (XMM)
  – Independent of x87 FPU and MMX registers
SSE instructions can be executed in parallel with MMX/x87
MXCSR register – control and status for XMM registers (similar to the x87 control and status registers)
EFLAGS register – results of scalar compare ops (COMISS, UCOMISS)
128-bit packed single-precision FP data type
Prefetching, cacheability, store ordering control instructions
15
SSE State
16
XMM Registers
17
SSE Data Type
18
SSE Instructions
Packed and scalar single-precision floating point
Logical
Conversion
64-bit SIMD integer
MXCSR management
State management
Cacheability control, prefetch, memory ordering
  – SFENCE (store fence)
FXSAVE, FXRSTOR
  – Extension of the fast save and restore of x87 and MMX registers to also save/restore the XMM and MXCSR registers
19
Packed Single-Precision FP Operation
20
Scalar Single-Precision FP Operation
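The difference between packed and scalar operations can be sketched as follows, modeling an XMM register as a four-element list with index 0 as the lowest element. The function names mirror the ADDPS and ADDSS mnemonics, but these are plain-Python models that use Python floats and ignore single-precision rounding.

```python
def addps(dst, src):
    """Packed add: all four lanes are added independently."""
    return [d + s for d, s in zip(dst, src)]


def addss(dst, src):
    """Scalar add: only the lowest lane is added; the upper three
    lanes of the destination pass through unchanged."""
    return [dst[0] + src[0]] + dst[1:]


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
packed = addps(x, y)
scalar = addss(x, y)
```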
21
Shuffle
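A sketch of the shuffle selection rule as used by SHUFPS, again with registers modeled as four-element lists (index 0 lowest). Each 2-bit field of the 8-bit immediate picks one source element: the two low result lanes come from the destination operand and the two high lanes from the source operand. The function is a plain-Python model, not an intrinsic.

```python
def shufps(dst, src, imm8):
    """Model of SHUFPS: four 2-bit selectors taken from imm8, low bits first."""
    sel = [(imm8 >> (2 * i)) & 0b11 for i in range(4)]
    return [dst[sel[0]], dst[sel[1]], src[sel[2]], src[sel[3]]]


v = [0.0, 1.0, 2.0, 3.0]
reversed_v = shufps(v, v, 0b00_01_10_11)   # selectors 3, 2, 1, 0
broadcast = shufps(v, v, 0b00_00_00_00)    # every lane selects element 0
```

Passing the same register as both operands gives an arbitrary permutation or broadcast of its four elements.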
22
Unpack and Interleave
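Unpack and interleave can be sketched the same way. The functions below model UNPCKLPS and UNPCKHPS on four-element lists (index 0 lowest); they are illustrative Python, not intrinsics.

```python
def unpcklps(dst, src):
    """Interleave the two low elements of dst and src."""
    return [dst[0], src[0], dst[1], src[1]]


def unpckhps(dst, src):
    """Interleave the two high elements of dst and src."""
    return [dst[2], src[2], dst[3], src[3]]


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
low = unpcklps(a, b)
high = unpckhps(a, b)
```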
23
SSE2
Pentium 4
More data types
More instructions to support new data types
24
SSE2 State
25
SSE2 Data Types
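One way to picture the data types is as different lane layouts over the same 128 bits. The sketch below reinterprets one 16-byte value with Python's `struct` module, assuming little-endian layout as on x86. The packed double and the 128-bit packed integer views are the SSE2 additions; the packed single view dates from SSE. The dictionary keys are descriptive labels, not official type names.

```python
import struct

raw = bytes(range(16))  # one 128-bit register image: bytes 0x00 .. 0x0F

views = {
    "16 x byte": struct.unpack("<16B", raw),
    "8 x word": struct.unpack("<8H", raw),
    "4 x doubleword": struct.unpack("<4I", raw),
    "2 x quadword": struct.unpack("<2Q", raw),
    "4 x single-precision": struct.unpack("<4f", raw),
    "2 x double-precision": struct.unpack("<2d", raw),
}

for name, lanes in views.items():
    print(f"{name:>22}: {len(lanes)} lanes")
```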
26
SSE2 Instructions
Support for additional types
CLFLUSH (cache line flush)
LFENCE (load fence)
MFENCE (load + store fence)
27
Packed Double-Precision FP Operations
28
Scalar Double-Precision FP Operations
29
SSE3
Pentium 4 (Prescott)
  – Support for Hyper-Threading
13 new instructions
  – 10 SIMD support instructions
  – 1 x87 accelerating instruction (FP to int conversion)
  – 2 thread synchronization instructions: MONITOR (monitor write-back stores), MWAIT (wait for write-back store)
No new state
30
Asymmetric Processing
31
Horizontal Data Movement
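Asymmetric processing and horizontal data movement can be sketched with models of two SSE3 instructions, ADDSUBPS and HADDPS. Registers are four-element lists with index 0 as the lowest element; the functions are plain-Python illustrations that ignore single-precision rounding.

```python
def addsubps(dst, src):
    """Asymmetric: subtract in even lanes, add in odd lanes."""
    return [dst[0] - src[0], dst[1] + src[1],
            dst[2] - src[2], dst[3] + src[3]]


def haddps(dst, src):
    """Horizontal: add adjacent pairs within each operand.
    The low half of the result comes from dst, the high half from src."""
    return [dst[0] + dst[1], dst[2] + dst[3],
            src[0] + src[1], src[2] + src[3]]


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
asym = addsubps(a, b)
horiz = haddps(a, b)
```

Earlier packed operations work "vertically" (lane i of one operand with lane i of the other); the horizontal form combines neighbors inside one register, which is convenient for reductions and complex arithmetic.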
32
Hyper-Threading
33
Terminology
Process
  – Program associated with a context (state: registers, program counter, flags, etc.)
  – Consists of one or more threads
Thread
  – “Lightweight process” (less state)
34
Hyper-Threading
Single physical processor appears as 2 logical processors
Thread-level parallelism (TLP)
  – Many applications have software threads that can be executed simultaneously (e.g., online transaction processing, web services)
Latency can leave execution units idle
  – Cache misses
  – Branch mispredictions
  – Waiting for loads/stores
35
Techniques for Minimizing Effect of Long Latency
Chip multiprocessing (CMP)
  – 2 processors on a single die
  – Larger than a single-core chip, so more expensive to manufacture
Time-slice or switch-on-event multithreading
  – Switch threads after a fixed time period or on long-latency events like cache misses
  – Doesn’t address other sources of inefficient resource usage (branch mispredictions, instruction dependencies, etc.)
Simultaneous multithreading (SMT)
  – Multiple threads execute on a single processor without switching
  – Hyper-Threading is Intel’s implementation
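The latency-hiding idea can be illustrated with a deliberately crude cycle model. Everything here is an assumption for illustration: there is a single issue slot, each op is written as the number of stall cycles it causes after issuing (0 for an ordinary op, 3 for a hypothetical cache miss), and ready threads are picked round-robin. It captures only how a second thread fills cycles the first leaves idle, not the same-cycle sharing of multiple execution units that real SMT provides.

```python
def run(threads):
    """Return (total_cycles, busy_cycles) for one shared issue slot."""
    n = len(threads)
    next_op = [0] * n    # index of the next op for each thread
    ready_at = [0] * n   # first cycle at which each thread may issue again
    cycle = busy = 0
    last = -1            # thread that issued most recently
    while any(next_op[t] < len(threads[t]) for t in range(n)):
        # Round-robin: start looking just after the last thread that issued.
        for k in range(1, n + 1):
            t = (last + k) % n
            if next_op[t] < len(threads[t]) and ready_at[t] <= cycle:
                stall = threads[t][next_op[t]]
                next_op[t] += 1
                ready_at[t] = cycle + 1 + stall
                busy += 1
                last = t
                break
        cycle += 1
    return cycle, busy


workload = [0, 3, 0, 3, 0]            # two ops out of five miss the cache
solo = run([workload])                # one thread alone
shared = run([workload, workload])    # two threads sharing the slot
print("one thread: ", solo)
print("two threads:", shared)
```

In this model, running the two threads together takes fewer cycles than running them back to back, because one thread issues while the other waits on its miss.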
36
Intel Hyper-Threading Demo
37
Resource Requirements for HT
Need to maintain 2 contexts
Replicated
  – Register renaming logic (register alias table, RAT)
  – Instruction pointer
  – ITLB
  – Return stack predictor
  – Various other architectural registers (general-purpose, control, APIC, machine state)
Partitioned
  – Re-order buffers (ROBs)
  – Load/store buffers
  – Various queues, like the scheduling queues, uop queue, etc.
Shared
  – Caches: trace cache, L1, L2, L3
  – Microcode ROM
  – Microarchitectural registers
  – Execution units
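A partitioned resource can be sketched as a buffer in which each logical processor may occupy at most its own share of the entries. The class `PartitionedQueue`, the even split, and the entry count below are hypothetical choices for illustration, not the actual hardware sizes or policy.

```python
class PartitionedQueue:
    """Toy buffer whose entries are split evenly between logical processors."""

    def __init__(self, total_entries, logical_processors=2):
        self.limit = total_entries // logical_processors
        self.entries = {lp: [] for lp in range(logical_processors)}

    def try_allocate(self, lp, uop):
        # A logical processor that has used up its share must stall,
        # even if the other processor's share is empty.
        if len(self.entries[lp]) >= self.limit:
            return False
        self.entries[lp].append(uop)
        return True

    def retire(self, lp):
        # Free the oldest entry belonging to this logical processor.
        return self.entries[lp].pop(0)


queue = PartitionedQueue(total_entries=8)
lp0_results = [queue.try_allocate(0, f"uop{i}") for i in range(5)]
lp1_result = queue.try_allocate(1, "uop0")
```

Because one logical processor cannot fill the whole structure, a stalled thread cannot block the other, which is one way to meet the forward-progress goal.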
38
Hyper-Threading Goals
Minimize die area cost of the implementation
Ensure forward progress by at least one logical processor
Maintain single-threaded performance
39
Frontend Changes
2 program counters (one per logical processor)
Arbitration for shared resource access
  – Trace cache, microcode ROM, caches
  – One logical processor at a time per structure
Thread tags per trace cache entry
Microcode ROM – 2 microcode instruction pointers
Wider pipeline latches to hold state for 2 contexts
Branch prediction
  – Return address stack (RAS) and branch history buffer duplicated
  – Global history array shared, but tagged with logical processor ID
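Arbitration for a shared front-end structure can be sketched as a per-cycle grant to one requester. The alternating (round-robin) policy and the function name `arbitrate` are assumptions made for this sketch; the slide states only that one logical processor accesses a structure at a time.

```python
def arbitrate(requests, last_granted):
    """Grant one requesting logical processor, starting the search
    just after the one granted most recently. Returns None if idle."""
    n = len(requests)
    for k in range(1, n + 1):
        lp = (last_granted + k) % n
        if requests[lp]:
            return lp
    return None


# Both logical processors request every cycle: grants alternate.
grants = []
last = 1
for _ in range(4):
    last = arbitrate([True, True], last)
    grants.append(last)

# Only logical processor 1 requests: it is granted every cycle.
only_lp1 = [arbitrate([False, True], 1) for _ in range(3)]
```

Under this policy a lone requester gets every cycle, so a single running thread is not slowed down by the arbitration itself.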
40
Trace Cache Hit
41
Trace Cache Miss
42
Hyper-threaded Execution
43
Execution Modes
Single-task (ST), Multi-task (MT)
  – Two single-task modes: ST0, ST1
  – HALT: transitions from MT to ST0 or ST1, depending on which logical processor executes it
  – Interrupt sent to the halted logical processor transitions back to MT
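The mode transitions can be sketched as a small state machine. The class `HTModes` is hypothetical, and the sketch assumes that the digit in ST0/ST1 names the logical processor that remains active. Cases the slide does not describe, such as HALT on the last active processor, are left as no-ops.

```python
class HTModes:
    """Toy state machine for the MT / ST0 / ST1 execution modes."""

    def __init__(self):
        self.mode = "MT"  # both logical processors active

    def halt(self, lp):
        # Assumption: ST0 means only logical processor 0 stays active,
        # so a HALT executed by processor 1 leads to ST0, and vice versa.
        if self.mode == "MT":
            self.mode = "ST1" if lp == 0 else "ST0"

    def interrupt(self, lp):
        # An interrupt to the halted processor resumes multi-task mode.
        halted = {"ST0": 1, "ST1": 0}.get(self.mode)
        if halted == lp:
            self.mode = "MT"


cpu = HTModes()
trace = [cpu.mode]
cpu.halt(1)
trace.append(cpu.mode)
cpu.interrupt(0)       # processor 0 is already active: no mode change
trace.append(cpu.mode)
cpu.interrupt(1)       # wakes the halted processor
trace.append(cpu.mode)
```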
44
HT Performance – OLTP
45
HT Performance – Web Server