Intel Multimedia Extensions and Hyper-Threading Michele Co CS451.

Outline
Evolution of Intel multimedia extensions
  – x87 (386)
  – MMX (Pentium MMX, Pentium II)
  – SSE (Pentium III)
  – SSE2 (Pentium 4 – Willamette)
  – SSE3 (Pentium 4 – Prescott)
Hyper-Threading

x87 FPU
Eight 80-bit data registers (double extended-precision floating point)
Data registers are treated as a stack
Control register: FP precision, rounding, …
Status register: FPU busy, top of stack (TOS), condition codes (CC), error, exception, …
Tag register: 2 bits per data register (valid, zero, special, empty)
Last instruction pointer register
Last data (operand) pointer register
Opcode register
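
The stack behaviour of the x87 data registers can be illustrated with a small Python model. This is only a sketch under stated assumptions: the class name `X87Stack` and its method names are hypothetical, values are ordinary Python floats rather than 80-bit extended precision, the tag is simplified to empty/valid, and only a push (FLD-like), a pop (FSTP-like) and an add-and-pop (FADDP-like) are modelled.

```python
class X87Stack:
    """Toy model of the eight x87 data registers addressed relative to TOS."""

    def __init__(self):
        self.regs = [0.0] * 8      # physical registers R0..R7
        self.empty = [True] * 8    # simplified tag: empty or valid
        self.top = 0               # TOS field of the status register

    def fld(self, value):
        """Push: decrement TOS (mod 8) and store into the new ST(0)."""
        new_top = (self.top - 1) % 8
        if not self.empty[new_top]:
            raise OverflowError("x87 stack overflow")
        self.top = new_top
        self.regs[new_top] = float(value)
        self.empty[new_top] = False

    def st(self, i):
        """Read ST(i), the register i slots below the top of stack."""
        idx = (self.top + i) % 8
        if self.empty[idx]:
            raise IndexError("ST(%d) is empty" % i)
        return self.regs[idx]

    def fstp(self):
        """Pop: return ST(0), mark it empty, increment TOS (mod 8)."""
        value = self.st(0)
        self.empty[self.top] = True
        self.top = (self.top + 1) % 8
        return value

    def faddp(self):
        """ST(1) = ST(1) + ST(0), then pop, so the sum becomes the new ST(0)."""
        total = self.st(1) + self.st(0)
        self.regs[(self.top + 1) % 8] = total
        self.fstp()
```

Pushing 1.5 and 2.0 and then calling `faddp()` leaves 3.5 in ST(0); a ninth push without an intervening pop raises the modelled stack overflow.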

x87 FPU State

x87 Data Types

x87 Instructions
Data transfer (load, store, move)
Basic arithmetic
Comparison
Transcendental (trigonometric, log, exp)
Load constant
x87 FPU control

MMX
SIMD execution
Eight 64-bit data registers (MMX)
  – Aliased to the low 64 bits of the x87 FPU data registers
Randomly accessible (addressed directly by register number, not as a stack)
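
To make the SIMD idea concrete, the following Python sketch treats a 64-bit MMX register as eight independent byte lanes. It is a plain-Python model, not real MMX code: the helper names `unpack_bytes` and `pack_bytes` are hypothetical, and `paddb` / `paddusb` only imitate the lane-wise wraparound and unsigned-saturating byte adds that the instructions of those names perform.

```python
def unpack_bytes(reg64):
    """Split a 64-bit value into eight unsigned bytes (lane 0 = least significant)."""
    return [(reg64 >> (8 * i)) & 0xFF for i in range(8)]


def pack_bytes(lanes):
    """Reassemble eight byte lanes into one 64-bit value."""
    out = 0
    for i, b in enumerate(lanes):
        out |= (b & 0xFF) << (8 * i)
    return out


def paddb(a, b):
    """Lane-wise add with wraparound: carries never cross a lane boundary."""
    return pack_bytes([(x + y) & 0xFF
                       for x, y in zip(unpack_bytes(a), unpack_bytes(b))])


def paddusb(a, b):
    """Lane-wise unsigned saturating add: each lane clamps at 0xFF."""
    return pack_bytes([min(x + y, 0xFF)
                       for x, y in zip(unpack_bytes(a), unpack_bytes(b))])
```

With `a = 0x01FF` and `b = 0x0101`, an ordinary integer add gives `0x0300`, whereas `paddb` gives `0x0200` (lane 0 wraps to zero without carrying into lane 1) and `paddusb` gives `0x02FF` (lane 0 saturates).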

SIMD Execution

MMX State

MMX Registers

MMX Data Types

MMX Instructions Data transfer Arithmetic Comparison Conversion Unpacking Logical Shift Empty MMX state

SSE
Pentium III
Eight 128-bit data registers (XMM)
  – Independent of x87 FPU and MMX registers
SSE instructions can be executed in parallel with MMX/x87
MXCSR register: control and status for XMM registers (similar to the x87 control and status registers)
EFLAGS register: results of compare ops
128-bit packed single-precision FP data type
Prefetching, cacheability, and store-ordering control instructions

SSE State

XMM Registers

SSE Data Type

SSE Instructions
Packed and scalar single-precision floating point
Logical
Conversion
64-bit SIMD integer
MXCSR management
State management
Cacheability control, prefetch, memory ordering
  – SFENCE (store fence)
FXSAVE, FXRSTOR
  – Extend the fast save and restore of the x87 and MMX registers to also cover the XMM and MXCSR registers

Packed Single-Precision FP Operation

Scalar Single-Precision FP Operation
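
The difference between the packed and scalar single-precision forms can be sketched in Python. This is a model only, assuming a 128-bit register is represented as a list of four floats with index 0 as the lowest element; `f32` is a hypothetical helper that rounds to single precision, and `addps` / `addss` imitate the behaviour of the instructions of those names.

```python
import struct


def f32(x):
    """Round a Python float to IEEE single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def addps(dest, src):
    """Packed add: all four lanes are added independently."""
    return [f32(d + s) for d, s in zip(dest, src)]


def addss(dest, src):
    """Scalar add: only lane 0 is added; lanes 1..3 pass through from dest."""
    return [f32(dest[0] + src[0])] + list(dest[1:])
```

For `dest = [1.0, 2.0, 3.0, 4.0]` and `src = [10.0, 20.0, 30.0, 40.0]`, the packed form yields `[11.0, 22.0, 33.0, 44.0]` and the scalar form yields `[11.0, 2.0, 3.0, 4.0]`.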

Shuffle
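
A shuffle can be modelled in a few lines of Python. The sketch below follows the selection rule of SHUFPS as I understand it from Intel's manuals (the two low result elements are chosen from the destination, the two high ones from the source, each by a 2-bit field of the immediate); registers are assumed to be lists of four values with index 0 as the lowest element.

```python
def shufps(dest, src, imm8):
    """Model of SHUFPS: pick four elements using four 2-bit selectors."""
    sel = [(imm8 >> (2 * i)) & 0b11 for i in range(4)]
    # Result lanes 0 and 1 come from dest, lanes 2 and 3 come from src.
    return [dest[sel[0]], dest[sel[1]], src[sel[2]], src[sel[3]]]
```

Using the same register for both operands, an immediate of `0x1B` reverses the four elements and `0x00` broadcasts element 0 to every lane.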

Unpack and Interleave
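
Unpack-and-interleave can be sketched the same way. The functions below model the single-precision forms UNPCKLPS and UNPCKHPS, assuming registers are lists of four values with index 0 as the lowest element; this is an illustration, not real SSE code.

```python
def unpcklps(dest, src):
    """Interleave the two low elements of dest and src."""
    return [dest[0], src[0], dest[1], src[1]]


def unpckhps(dest, src):
    """Interleave the two high elements of dest and src."""
    return [dest[2], src[2], dest[3], src[3]]
```

With `dest = [1, 2, 3, 4]` and `src = [10, 20, 30, 40]`, the low form gives `[1, 10, 2, 20]` and the high form gives `[3, 30, 4, 40]`.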

SSE2
Pentium 4
More data types
More instructions to support the new data types

SSE2 State

SSE2 Data Types

SSE2 Instructions
Support for additional types
CLFLUSH (cache line flush)
LFENCE (load fence)
MFENCE (load + store fence)

Packed Double-Precision FP Operations

Scalar Double-Precision FP Operations

SSE3
Pentium 4 (Prescott)
  – Support for Hyper-Threading
13 new instructions
  – 10 SIMD support instructions
  – 1 x87 accelerating instruction (FISTTP, FP-to-integer conversion)
  – 2 thread synchronization instructions
      – MONITOR (monitor write-back stores)
      – MWAIT (wait for write-back store)
No new state

Asymmetric Processing

Horizontal Data Movement
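
The two SSE3 ideas named on the slides above, asymmetric processing and horizontal data movement, can be modelled with the single-precision instructions ADDSUBPS and HADDPS. The Python below is a sketch, assuming registers are lists of four values with index 0 as the lowest element; it is not real SSE3 code and ignores single-precision rounding.

```python
def addsubps(dest, src):
    """Asymmetric: subtract in even lanes (0, 2), add in odd lanes (1, 3)."""
    return [dest[0] - src[0], dest[1] + src[1],
            dest[2] - src[2], dest[3] + src[3]]


def haddps(dest, src):
    """Horizontal: add adjacent pairs within each operand."""
    return [dest[0] + dest[1], dest[2] + dest[3],
            src[0] + src[1], src[2] + src[3]]
```

With `dest = [1, 2, 3, 4]` and `src = [10, 20, 30, 40]`, `addsubps` gives `[-9, 22, -27, 44]` and `haddps` gives `[3, 7, 30, 70]`. The asymmetric form is convenient for complex multiplication, and the horizontal form for reductions such as dot products.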

Hyper-Threading

Terminology
Process
  – Program associated with a context (state: registers, program counter, flags, etc.)
  – Consists of one or more threads
Thread
  – “Lightweight process” (less state)

Hyper-threading
Single physical processor appears as 2 logical processors
Thread Level Parallelism (TLP)
  – Many applications have software threads that can be executed simultaneously
      – Online transaction processing
      – Web services
Latency can leave execution units idle
  – Cache misses
  – Branch mispredictions
  – Waiting for loads/stores

Techniques for Minimizing Effect of Long Latency
Chip multiprocessing (CMP)
  – 2 processors on a single die
  – Larger die than a single-core chip, so more expensive to manufacture
Time-slice or switch-on-event multithreading
  – Switch threads after a fixed time period or on long-latency events such as cache misses
  – Does not recover execution resources left idle by other causes (branch mispredictions, instruction dependencies, etc.)
Simultaneous multithreading (SMT)
  – Multiple threads execute on a single processor without switching
  – Hyper-Threading is Intel’s implementation
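
Why interleaving threads hides latency can be shown with a deliberately tiny Python model. Everything here is an assumption for illustration: each thread is a list of per-instruction stall counts (cycles the thread must wait after that instruction issues, for example on a cache miss), the machine issues at most one instruction per cycle, and the function names are hypothetical. It says nothing about the real Pentium 4 pipeline.

```python
def cycles_back_to_back(threads):
    """Run each thread to completion in turn; stalls are fully exposed."""
    return sum(1 + stall for thread in threads for stall in thread)


def cycles_smt(threads):
    """Each cycle, issue from any thread that is ready (round-robin priority)."""
    n = len(threads)
    next_op = [0] * n     # index of the next instruction per thread
    ready_at = [0] * n    # first cycle at which each thread may issue again
    cycle = 0
    last = n - 1          # so that thread 0 gets first priority
    while any(next_op[t] < len(threads[t]) for t in range(n)):
        for k in range(n):
            t = (last + 1 + k) % n
            if next_op[t] < len(threads[t]) and ready_at[t] <= cycle:
                stall = threads[t][next_op[t]]
                next_op[t] += 1
                ready_at[t] = cycle + 1 + stall
                last = t
                break
        cycle += 1
    return max(ready_at)
```

For two identical threads `[0, 3, 0]` (the middle instruction stalls for three cycles), the back-to-back schedule takes 12 cycles while the interleaved schedule takes 8, because one thread issues while the other waits. A single thread gains nothing: both functions return 6.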

Intel Hyper-Threading Demo

Resource Requirements for HT
Need to maintain 2 contexts
Replicated
  – Register alias table (RAT) used for register renaming
  – Instruction pointer
  – ITLB
  – Return stack predictor
  – Various other architectural registers (GP, control, APIC, machine state)
Partitioned
  – Re-order buffers (ROBs)
  – Load/store buffers
  – Various queues, such as the scheduling queues and the uop queue
Shared
  – Caches (trace cache, L1, L2, L3) and microcode ROM
  – Microarchitectural registers
  – Execution units
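
The effect of partitioning a buffer can be sketched in Python. The class below is hypothetical and assumes a simple static split in which each logical processor may use at most half of the entries; the real structures and their sizes are not described on the slide.

```python
class PartitionedQueue:
    """Toy queue whose entries are split evenly between logical processors."""

    def __init__(self, total_entries, logical_processors=2):
        self.limit = total_entries // logical_processors
        self.entries = {lp: [] for lp in range(logical_processors)}

    def try_insert(self, lp, uop):
        """Accept a uop only if this logical processor is under its limit."""
        if len(self.entries[lp]) >= self.limit:
            return False
        self.entries[lp].append(uop)
        return True

    def remove_oldest(self, lp):
        """Retire the oldest uop of one logical processor, if any."""
        return self.entries[lp].pop(0) if self.entries[lp] else None
```

In an 8-entry queue, logical processor 0 can occupy only 4 entries even if it is stalled and never drains them, so logical processor 1 can always insert its own uops. A fully shared queue would not give that guarantee.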

Hyper-Threading Goals
Minimize the die area cost of the implementation
Ensure forward progress by at least one logical processor
Maintain single-threaded performance

Frontend Changes
2 PCs (one per logical processor)
Arbitration for shared resource access
  – Trace cache, microcode ROM, caches
  – One logical processor at a time per structure
Thread tags per trace cache entry
Microcode ROM: 2 microcode instruction pointers
Wider pipeline latches to hold state for 2 contexts
Branch prediction
  – Return address stack (RAS) and branch history buffer duplicated
  – Global history shared, but tagged with logical processor ID

Trace Cache Hit

Trace Cache Miss

Hyper-threaded Execution

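Execution Modes
Single-task (ST), Multi-task (MT)
  – ST0, ST1: only logical processor 0, or only logical processor 1, is active
  – HALT: transitions from MT to ST0 or ST1, depending on which logical processor executed it
  – An interrupt sent to the halted logical processor transitions back to MT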
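
The mode transitions can be written as a small state function. This Python sketch assumes, following Intel's published description of Hyper-Threading as I recall it, that ST0 means only logical processor 0 is active and ST1 means only logical processor 1 is active; the function name and event strings are hypothetical, and cases the slide does not describe (such as the last active processor halting) are simply left unchanged.

```python
def next_mode(mode, event, lp):
    """Return the next execution mode.

    mode:  "MT", "ST0" or "ST1"
    event: "HALT" (executed by lp) or "INTERRUPT" (delivered to lp)
    lp:    logical processor number, 0 or 1
    """
    if mode == "MT" and event == "HALT":
        # The processor that halts drops out; the other one stays active.
        return "ST1" if lp == 0 else "ST0"
    if event == "INTERRUPT":
        halted_lp = {"ST0": 1, "ST1": 0}.get(mode)
        if lp == halted_lp:
            return "MT"  # the halted processor wakes up
    return mode  # transitions not covered by the slide are not modelled
```

For example, `next_mode("MT", "HALT", 0)` gives `"ST1"`, and a later `next_mode("ST1", "INTERRUPT", 0)` returns to `"MT"`.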

HT Performance - OLTP

HT Performance – Web Server