Intel Multimedia Extensions and Hyper-Threading Michele Co CS451.


1 Intel Multimedia Extensions and Hyper-Threading Michele Co CS451

2 Outline
Evolution of Intel multimedia extensions
–x87 (386)
–MMX (Pentium MMX, Pentium II)
–SSE (Pentium III)
–SSE2 (Pentium 4 – Willamette)
–SSE3 (Pentium 4 – Prescott)
Hyper-Threading

3

4 x87 FPU
8 80-bit data registers (double extended-precision floating point)
Data registers treated as a stack
Control register – FP precision, rounding, …
Status register – FPU busy, TOS, CC, error, exception, …
Tag register – 2 bits per data register: valid, zero, special, empty
Last instruction pointer register
Last data (operand) pointer register
Opcode register
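
The stack addressing described on this slide can be sketched with a toy model. The names below (`X87Model`, `x87_push`, and so on) are hypothetical, and a `double` stands in for the 80-bit registers; the only point is that ST(i) refers to physical register (TOP + i) mod 8, so pushes and pops move TOP rather than the data.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the x87 register stack. Names are illustrative only;
   real hardware holds 80-bit values and a tag per register. */
typedef struct {
    double  phys[8];  /* stand-in for the eight 80-bit data registers */
    uint8_t top;      /* TOS field, held in the status register */
} X87Model;

static void x87_init(X87Model *f) {
    f->top = 0;
    for (int i = 0; i < 8; i++) f->phys[i] = 0.0;
}

/* ST(i) names physical register (TOP + i) mod 8. */
static double *x87_st(X87Model *f, unsigned i) {
    assert(i < 8);
    return &f->phys[(f->top + i) & 7];
}

/* A load (FLD-style) decrements TOP, then writes the new ST(0). */
static void x87_push(X87Model *f, double v) {
    f->top = (uint8_t)((f->top - 1) & 7);
    *x87_st(f, 0) = v;
}

/* A popping store (FSTP-style) reads ST(0), then increments TOP. */
static double x87_pop(X87Model *f) {
    double v = *x87_st(f, 0);
    f->top = (uint8_t)((f->top + 1) & 7);
    return v;
}

/* FADDP-style step: ST(1) = ST(1) + ST(0), then pop. */
static void x87_addp(X87Model *f) {
    *x87_st(f, 1) += *x87_st(f, 0);
    (void)x87_pop(f);
}
```

Pushing 1.5 and 2.0 and then calling `x87_addp` leaves 3.5 in ST(0); overflow and the tag register are deliberately left out of this sketch.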

5 x87 FPU State

6 x87 Data Types

7 x87 Instructions
Data transfer (load, store, move)
Basic arithmetic
Comparison
Transcendental (trigonometric, log, exp)
Load constant
x87 FPU control

8 MMX
SIMD execution
8 64-bit data registers (MMX)
–Aliased to x87 FPU registers
Randomly accessible (addressed directly by name, not as a stack)

9 SIMD Execution
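
As a rough illustration of SIMD execution on a 64-bit MMX-sized value, the sketch below models a wraparound packed byte add in portable C. It is a model of the semantics, not MMX code: `get_lane_u8` and `packed_add_u8` are hypothetical helper names, and real code would use the packed-add instruction itself.

```c
#include <assert.h>
#include <stdint.h>

/* Extract one 8-bit lane (0 = least significant byte) from a packed word. */
static uint8_t get_lane_u8(uint64_t packed, unsigned lane) {
    assert(lane < 8);
    return (uint8_t)(packed >> (8 * lane));
}

/* Lane-wise add of eight unsigned bytes, modeling a wraparound packed
   byte add: each lane wraps on its own and no carry crosses lanes. */
static uint64_t packed_add_u8(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (unsigned lane = 0; lane < 8; lane++) {
        uint8_t sum = (uint8_t)(get_lane_u8(a, lane) + get_lane_u8(b, lane));
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}
```

The difference from an ordinary 64-bit add shows up when a lane overflows: 0xFF + 0x01 wraps to 0x00 inside its lane instead of carrying into the neighbouring byte.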

10 MMX State

11 MMX Registers

12 MMX Data Types

13 MMX Instructions
Data transfer
Arithmetic
Comparison
Conversion
Unpacking
Logical
Shift
Empty MMX state

14 SSE
Pentium III
8 128-bit data registers (XMM)
–Independent of x87 FPU and MMX registers
SSE instructions can be executed in parallel with MMX/x87
MXCSR register – control and status for XMM registers (similar to x87 control and status registers)
EFLAGS register – results of scalar compare ops (COMISS, UCOMISS)
128-bit packed single-precision fp data type
Prefetching, cacheability, store ordering control instructions

15 SSE State

16 XMM Registers

17 SSE Data Type

18 SSE Instructions
Packed and scalar single-precision floating point
Logical
Conversion
64-bit SIMD integer
MXCSR management
State management
Cacheability control, prefetch, memory ordering
–SFENCE (store fence)
FXSAVE, FXRSTOR
–Extend the fast save and restore of x87 and MMX registers to also cover the XMM and MXCSR registers
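
One place the cacheability and store-ordering instructions meet is a streaming fill. The sketch below assumes an x86 compiler that provides `<xmmintrin.h>`; `fill_streaming` is a hypothetical helper, not something from the slides. It writes with non-temporal stores and then issues SFENCE so those stores are ordered before anything that follows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Fill a 16-byte-aligned float buffer with non-temporal stores (MOVNTPS),
   then fence so the stores are ordered before later stores. */
static void fill_streaming(float *dst, size_t count, float value) {
    assert(((uintptr_t)dst % 16) == 0);  /* MOVNTPS requires 16-byte alignment */
    assert(count % 4 == 0);              /* whole 4-float vectors only */
    __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < count; i += 4) {
        _mm_stream_ps(dst + i, v);       /* store that bypasses the cache */
    }
    _mm_sfence();                        /* SFENCE */
}
```

Whether streaming stores actually help depends on buffer size and access pattern; the example only shows how the pieces fit together.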

19 Packed Single-Precision FP Operation

20 Scalar Single-Precision FP Operation
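
The packed and scalar forms can be compared side by side with compiler intrinsics. This is a minimal sketch assuming an x86 compiler with `<xmmintrin.h>`; the wrapper names `add_packed` and `add_scalar` are made up for the example.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Packed add (ADDPS): all four single-precision lanes are summed. */
static void add_packed(const float *a, const float *b, float *out) {
    assert(a && b && out);
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Scalar add (ADDSS): only lane 0 is summed; lanes 1..3 are copied
   unchanged from the first operand. */
static void add_scalar(const float *a, const float *b, float *out) {
    assert(a && b && out);
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed form yields {11, 22, 33, 44} and the scalar form yields {11, 2, 3, 4}.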

21 Shuffle
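
A shuffle picks the two low result lanes from the first operand and the two high result lanes from the second, under control of an immediate. The sketch below uses the `_mm_shuffle_ps` intrinsic (assuming `<xmmintrin.h>` is available); `reverse_lanes` and `low_from_a_high_from_b` are hypothetical wrappers.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* _MM_SHUFFLE(z, y, x, w) builds the immediate so that
   result = { a[w], a[x], b[y], b[z] } (lane 0 listed first). */

/* Reverse the four lanes of one vector by using it as both operands. */
static void reverse_lanes(const float *in, float *out) {
    assert(in && out);
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)));
}

/* Keep lanes 0..1 of a and lanes 2..3 of b. */
static void low_from_a_high_from_b(const float *a, const float *b, float *out) {
    assert(a && b && out);
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_shuffle_ps(va, vb, _MM_SHUFFLE(3, 2, 1, 0)));
}
```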

22 Unpack and Interleave
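
Unpack-and-interleave takes either the low or the high half of two registers and alternates their elements. A minimal sketch with the SSE intrinsics, assuming `<xmmintrin.h>`; the wrapper names are illustrative.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* UNPCKLPS: interleave the low halves -> { a0, b0, a1, b1 }. */
static void interleave_low(const float *a, const float *b, float *out) {
    assert(a && b && out);
    _mm_storeu_ps(out, _mm_unpacklo_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* UNPCKHPS: interleave the high halves -> { a2, b2, a3, b3 }. */
static void interleave_high(const float *a, const float *b, float *out) {
    assert(a && b && out);
    _mm_storeu_ps(out, _mm_unpackhi_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

A typical use is converting between separate arrays of x and y values and an interleaved x, y, x, y layout.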

23 SSE2
Pentium 4
More data types (packed double-precision floating point, 128-bit packed integers)
More instructions to support new data types

24 SSE2 State

25 SSE2 Data Types
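
The SSE2 integer data types are different views of the same 128 bits; the instruction chooses the lane width. The sketch below assumes an x86 compiler with `<emmintrin.h>` and uses hypothetical wrapper names to add the same 16 bytes once as sixteen 8-bit lanes and once as four 32-bit lanes.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* PADDB on XMM registers: sixteen independent 8-bit lanes. */
static void add_as_bytes(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* PADDD on XMM registers: four independent 32-bit lanes. */
static void add_as_dwords(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

Adding 1 to a block of 0xFF bytes makes the distinction visible: as bytes only the first byte wraps to 0, while as doublewords the carry ripples through the whole first 32-bit lane.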

26 SSE2 Instructions
Support for additional types
CLFLUSH (cache line flush)
LFENCE (load fence)
MFENCE (load + store fence)

27 Packed Double-Precision FP Operations

28 Scalar Double-Precision FP Operations

29 SSE3
Pentium 4 (Prescott)
–Support for Hyper-Threading
13 new instructions
–10 SIMD support instructions
–1 x87 accelerating instruction (FISTTP, fp to int conversion)
–2 thread synchronization instructions: MONITOR (monitor write-back stores) and MWAIT (wait for write-back store)
No new state

30 Asymmetric Processing

31 Horizontal Data Movement
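
The two SSE3 ideas named on these slides, asymmetric processing and horizontal data movement, can be modeled in portable C. The functions below are models of what ADDSUBPS and HADDPS compute on four-float operands, written as plain loops so they build without SSE3 compiler flags; the names are hypothetical.

```c
#include <assert.h>

/* Model of ADDSUBPS (asymmetric): even lanes subtract, odd lanes add.
   result = { a0-b0, a1+b1, a2-b2, a3+b3 } */
static void addsubps_model(const float *a, const float *b, float *out) {
    assert(a && b && out);
    float r[4];
    for (int i = 0; i < 4; i++) {
        r[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
    }
    for (int i = 0; i < 4; i++) out[i] = r[i];
}

/* Model of HADDPS (horizontal): adjacent lanes within each source are summed.
   result = { a0+a1, a2+a3, b0+b1, b2+b3 } */
static void haddps_model(const float *a, const float *b, float *out) {
    assert(a && b && out);
    float r[4] = { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
    for (int i = 0; i < 4; i++) out[i] = r[i];
}
```

The add/subtract pattern lines up with complex multiplication on interleaved real and imaginary parts, and the horizontal add with reductions such as dot products.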

32 Hyper-Threading

33 Terminology
Process
–Program associated with a context (state: registers, program counter, flags, etc.)
–Consists of one or more threads
Thread
–“lightweight process” (less state)

34 Hyper-Threading
Single physical processor appears as 2 logical processors
Thread Level Parallelism (TLP)
–Many applications have software threads that can be executed simultaneously: online transaction processing, web services
Latency can leave execution units idle
–Cache misses
–Branch mispredictions
–Waiting for loads/stores
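
How a second thread fills cycles that latency would otherwise waste can be shown with a deliberately crude model. Everything here is an assumption made for illustration: one issue slot per cycle, a per-instruction stall count standing in for cache misses and similar delays, and alternating priority between two threads. It is not a model of the Pentium 4 pipeline, and with a single issue slot it captures only latency hiding, not same-cycle issue from both threads.

```c
#include <assert.h>

/* Cycles for one thread running alone: each instruction takes one issue
   cycle plus stall[i] idle cycles before the next one can issue. */
static int cycles_alone(const int *stall, int n) {
    int cycles = 0;
    for (int i = 0; i < n; i++) {
        assert(stall[i] >= 0);
        cycles += 1 + stall[i];
    }
    return cycles;
}

/* Cycles for two threads sharing the issue slot: each cycle, issue from
   whichever thread is not stalled, alternating priority between them. */
static int cycles_shared(const int *s0, int n0, const int *s1, int n1) {
    const int *stall[2] = { s0, s1 };
    int count[2] = { n0, n1 };
    int next[2] = { 0, 0 };   /* index of the next instruction per thread */
    int wait[2] = { 0, 0 };   /* remaining stall cycles per thread */
    int priority = 0;
    int cycles = 0;

    while (next[0] < count[0] || next[1] < count[1] || wait[0] > 0 || wait[1] > 0) {
        int issued = -1;
        cycles++;
        for (int k = 0; k < 2 && issued < 0; k++) {
            int t = (priority + k) % 2;
            if (next[t] < count[t] && wait[t] == 0) issued = t;
        }
        for (int t = 0; t < 2; t++) {
            if (t == issued) {
                wait[t] = stall[t][next[t]];  /* stall begins next cycle */
                next[t]++;
            } else if (wait[t] > 0) {
                wait[t]--;
            }
        }
        if (issued >= 0) priority = 1 - issued;
    }
    return cycles;
}
```

For two threads that each run three instructions with a 3-cycle stall after the second one, running them back to back costs 6 + 6 = 12 cycles in this model, while sharing the slot costs 8, because one thread's issue cycles overlap the other's stalls.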

35 Techniques for Minimizing Effect of Long Latency
Chip multiprocessing (CMP)
–2 processors on single die
–Larger than a single-core chip and more expensive to manufacture
Time-slice or switch-on-event multithreading
–Switch threads after a fixed time period or on long-latency events like cache misses
–Doesn’t address other sources of inefficient resource usage (branch mispredictions, instruction dependencies, etc.)
Simultaneous multithreading (SMT)
–Multiple threads execute on a single processor without switching
–Hyper-Threading is Intel’s implementation

36 Intel Hyper-Threading Demo

37 Resource Requirements for HT
Need to maintain 2 contexts
Replicated
–Register renaming logic (RAT)
–Instruction pointer
–ITLB
–Return stack predictor
–Various other architectural registers (GP, control, APIC, machine state)
Partitioned
–Re-order buffers (ROBs)
–Load/store buffers
–Various queues, like the scheduling queues, uop queue, etc.
Shared
–Caches: trace cache, L1, L2, L3, microcode ROM
–Microarchitectural registers
–Execution units
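
What "partitioned" means for a buffer or queue can be sketched as a per-logical-processor cap. The structure below is hypothetical (the size, names, and fields are not from the slides): with both logical processors active, each may hold at most half the entries, so a stalled one cannot starve the other. Lifting the cap when only one logical processor is active is an assumption based on Intel's published description of single-task mode.

```c
#include <assert.h>

enum { QUEUE_ENTRIES = 8 };  /* illustrative size, not a real structure's */

typedef struct {
    int used[2];     /* entries currently held by logical processor 0 and 1 */
    int multi_task;  /* 1 = both logical processors active, 0 = only one */
} PartitionedQueue;

/* Per-logical-processor limit: half the entries when both are active. */
static int pq_limit(const PartitionedQueue *q) {
    return q->multi_task ? QUEUE_ENTRIES / 2 : QUEUE_ENTRIES;
}

/* Try to allocate one entry for logical processor lp; returns 1 on success. */
static int pq_alloc(PartitionedQueue *q, int lp) {
    assert(lp == 0 || lp == 1);
    if (q->used[lp] >= pq_limit(q)) return 0;  /* this partition is full */
    q->used[lp]++;
    return 1;
}

/* Release one entry held by logical processor lp. */
static void pq_free(PartitionedQueue *q, int lp) {
    assert(lp == 0 || lp == 1);
    assert(q->used[lp] > 0);
    q->used[lp]--;
}
```

A shared resource, by contrast, would have a single `used` counter and no per-processor limit, and a replicated one would simply exist twice.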

38 Hyper-Threading Goals
Minimize the die area cost of implementing Hyper-Threading
Ensure forward progress by at least one logical processor
Maintain single-threaded performance

39 Frontend Changes
2 PCs (one per logical processor)
Arbitration for shared resource access
–Trace cache, microcode ROM, caches
–One logical processor at a time per structure
Thread tags per trace cache entry
Microcode ROM – 2 microcode instruction pointers
Wider pipeline latches to hold state for 2 contexts
Branch prediction
–RAS and branch history buffer duplicated
–Global history shared, but tagged with logical processor ID

40 Trace Cache Hit

41 Trace Cache Miss

42 Hyper-threaded Execution

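43 Execution Modes
Single-task (ST), Multi-task (MT)
–ST0, ST1: only logical processor 0 or only logical processor 1 is active
–HALT: transitions from MT to ST0 or ST1, depending on which logical processor executed it
–Interrupt sent to the halted logical processor transitions back to MT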
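
The mode transitions can be written as a small state machine. This is a sketch with hypothetical names; the `MODE_HALTED` state, for the case where both logical processors have executed HALT, is not on the slide and is an assumption based on Intel's published description.

```c
#include <assert.h>

typedef enum {
    MODE_MT,      /* both logical processors active */
    MODE_ST0,     /* only logical processor 0 active */
    MODE_ST1,     /* only logical processor 1 active */
    MODE_HALTED   /* assumption: both logical processors halted */
} ExecMode;

/* Logical processor lp executes HALT. */
static ExecMode on_halt(ExecMode mode, int lp) {
    assert(lp == 0 || lp == 1);
    switch (mode) {
    case MODE_MT:  return lp == 0 ? MODE_ST1 : MODE_ST0;  /* the other stays active */
    case MODE_ST0: return lp == 0 ? MODE_HALTED : MODE_ST0;
    case MODE_ST1: return lp == 1 ? MODE_HALTED : MODE_ST1;
    default:       return mode;
    }
}

/* An interrupt is delivered to logical processor lp, waking it if halted. */
static ExecMode on_interrupt(ExecMode mode, int lp) {
    assert(lp == 0 || lp == 1);
    switch (mode) {
    case MODE_ST0:    return lp == 1 ? MODE_MT : MODE_ST0;
    case MODE_ST1:    return lp == 0 ? MODE_MT : MODE_ST1;
    case MODE_HALTED: return lp == 0 ? MODE_ST0 : MODE_ST1;
    default:          return mode;
    }
}
```

For example, HALT on logical processor 0 in MT mode gives ST1, and a later interrupt to logical processor 0 returns the processor to MT.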

44 HT Performance - OLTP

45 HT Performance – Web Server

