CS 6354 by WeiKeng Qin, Jian Xiang, & Ren Xu December 8, 2009.

Introduction
- Motivation
  - A multi-core on our desks
  - A new microarchitecture to replace NetBurst
- Intel Core 2 Duo
  - A dual-core CPU
  - ISA with SIMD extensions
  - Intel Core microarchitecture
  - Memory hierarchy system

Instruction Set Architecture
- Base: x86-64
- No VLIW (Itanium)
- SIMD extensions: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1
  - Pentium MMX, MMX, 1996: 8 new registers, packed data types, integer operations
  - Pentium III, SSE, 1999: 8 new registers, floating-point operations
  - Pentium 4, SSE2, 2001: double precision, 128-bit register support
  - Prescott, SSE3, 2004: DSP-oriented math, process management
  - Core 2, SSSE3, July 2006: e.g. permuting bytes in a word
  - Wolfdale, SSE4.1, Sep 2006

Streaming SIMD Extension (SSE) 4.1
- Introduced with the 45 nm processors
- 47 instructions that improve performance of media data manipulation
- e.g. fast and efficient bit-width conversions: convert single-byte values to word (16-bit) values

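SSE2 Code
    MOVQ      XMM0, M64     ; load the 8 source bytes into the low half of XMM0
    PXOR      XMM1, XMM1    ; XMM1 = 0
    PUNPCKLBW XMM0, XMM1    ; interleave data bytes with zero bytes: 8 bytes become 8 words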

SSE4.1 Code
    PMOVZXBW XMM0, M64

    DEST[15:0]    <-- ZeroExtend(SRC[7:0]);
    DEST[31:16]   <-- ZeroExtend(SRC[15:8]);
    DEST[47:32]   <-- ZeroExtend(SRC[23:16]);
    DEST[63:48]   <-- ZeroExtend(SRC[31:24]);
    DEST[79:64]   <-- ZeroExtend(SRC[39:32]);
    DEST[95:80]   <-- ZeroExtend(SRC[47:40]);
    DEST[111:96]  <-- ZeroExtend(SRC[55:48]);
    DEST[127:112] <-- ZeroExtend(SRC[63:56]);

Benefits
- Reduced instruction count (3 --> 1)
- Better performance (~40% speedup each loop)
- Reduced register pressure (2 --> 1)
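To make the equivalence of the two code sequences concrete, here is a minimal Python sketch that models both paths on plain byte strings. It is only a software model of the semantics listed above, not how the hardware works; the function names `punpcklbw_with_zero` and `pmovzxbw` are illustrative choices, not library calls.

```python
def punpcklbw_with_zero(src8: bytes) -> bytes:
    """Model of the SSE2 path: interleave 8 data bytes with 8 zero bytes."""
    zero = bytes(8)  # stands in for the register cleared by PXOR
    out = bytearray()
    for data_byte, zero_byte in zip(src8, zero):
        # Little-endian 16-bit word: low byte is the data, high byte is zero.
        out += bytes((data_byte, zero_byte))
    return bytes(out)


def pmovzxbw(src8: bytes) -> bytes:
    """Model of the SSE4.1 path: DEST[16i+15:16i] <-- ZeroExtend(SRC[8i+7:8i])."""
    src = int.from_bytes(src8, "little")
    dest = 0
    for i in range(8):
        byte = (src >> (8 * i)) & 0xFF
        dest |= byte << (16 * i)
    return dest.to_bytes(16, "little")


data = bytes([0x01, 0x7F, 0x80, 0xFF, 0x10, 0x20, 0x30, 0x40])
# Both paths widen the same 8 bytes into the same 8 words.
assert punpcklbw_with_zero(data) == pmovzxbw(data)
```

The single-instruction form needs no zeroed helper register, which is where the register-pressure benefit on the slide comes from.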

Microarchitecture: The Cores
- Single die (107 mm²)
- Two identical cores (L1 cache 64K x 2)
- Shared L2 cache, 6M
- No Hyper-Threading, no L3 cache
- Keeps the front-side bus
- Larger L2 cache

Microarchitecture
- 14-stage pipeline
- 4-wide decode
- 4-wide retire
- Macro-fusion
- Enhanced ALUs
- Deeper buffers

Another View

Decode Hardware
- 128-bit fetch bandwidth
- 18-entry IQ
- Complex decoder: produces 1-4 micro-ops
- Microcode sequencer

Macro-fusion
- New micro-op: represents an instruction pair as a single micro-op
- Enhanced ALUs: execute the new compare-and-jump (CMPJCC) micro-op in one clock
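The pairing idea can be illustrated with a toy decoder. This is a sketch under simplifying assumptions: instructions are hypothetical `(mnemonic, operands)` tuples, only a `CMP` immediately followed by a conditional jump is fused, and the real fusion rules in the hardware are more detailed than this.

```python
def decode(instrs):
    """Turn a list of (mnemonic, operands) tuples into a list of micro-ops.

    A CMP directly followed by a conditional jump becomes one CMPJCC micro-op;
    every other instruction is passed through as its own micro-op.
    """
    uops = []
    i = 0
    while i < len(instrs):
        mnemonic = instrs[i][0]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        # Assumption: any J* mnemonic other than the unconditional JMP is a Jcc.
        is_cond_jump = nxt is not None and nxt[0].startswith("J") and nxt[0] != "JMP"
        if mnemonic == "CMP" and is_cond_jump:
            uops.append(("CMPJCC", instrs[i][1], nxt[0], nxt[1]))
            i += 2  # both instructions were consumed by the fused micro-op
        else:
            uops.append(instrs[i])
            i += 1
    return uops


program = [
    ("MOV", "eax, [esi]"),
    ("CMP", "eax, ebx"),
    ("JNE", "loop_top"),
    ("ADD", "ecx, 1"),
]
fused = decode(program)
# Four instructions decode into three micro-ops.
assert len(fused) == 3
```

Fewer micro-ops per loop iteration means less pressure on the decode, rename, and retire bandwidth downstream.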

Out-of-Order Execution
- 96-entry ROB
- 32-entry reservation station

Execution Units
- 6 dispatch ports (1 load, 2 store, 3 universal)
- 3 integer ALUs, 2 floating-point ALUs

Branch Predictor
- Loop detector: tracks the number of loop iterations for future reference
- The branch prediction unit (BPU) selects among the following for every branch:
  - bimodal predictor
  - global predictor
  - loop detector
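Why a loop detector helps can be seen in a small simulation. The sketch below compares a textbook 2-bit bimodal counter with a simple trip-count tracker on a loop branch that is taken 7 times and then falls through. Both classes are illustrative models with assumed details (counter width, initial state, how the trip count is learned); the actual predictor design is not described in these slides.

```python
class Bimodal:
    """Textbook 2-bit saturating counter per branch address."""

    def __init__(self):
        self.counters = {}  # branch address -> counter in 0..3

    def predict(self, pc):
        return self.counters.get(pc, 2) >= 2  # assumed start: weakly taken

    def update(self, pc, taken):
        c = self.counters.get(pc, 2)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)


class LoopDetector:
    """Remembers how many times a loop branch was taken before it exited."""

    def __init__(self):
        self.trip = {}   # branch address -> learned taken-count per loop run
        self.count = {}  # branch address -> taken-count in the current run

    def predict(self, pc):
        if pc not in self.trip:
            return None  # nothing learned yet
        return self.count.get(pc, 0) < self.trip[pc]

    def update(self, pc, taken):
        if taken:
            self.count[pc] = self.count.get(pc, 0) + 1
        else:
            self.trip[pc] = self.count.get(pc, 0)  # loop exited: record trip count
            self.count[pc] = 0


def count_misses(predictor, outcomes, pc=0x400):
    misses = 0
    for taken in outcomes:
        guess = predictor.predict(pc)
        if guess is None:
            guess = True  # fall back to "taken" until a trip count is known
        misses += guess != taken
        predictor.update(pc, taken)
    return misses


trace = ([True] * 7 + [False]) * 10  # 10 runs of a loop with 7 back-edges
bimodal_misses = count_misses(Bimodal(), trace)
loop_misses = count_misses(LoopDetector(), trace)
```

On this trace the bimodal counter mispredicts every loop exit, while the trip-count tracker misses only the first exit, before it has learned the iteration count.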

Cache Organization
- Private L1 DCache and ICache: 32K/core, 8-way, 64B line size, write-back (directory-based coherence)
- Shared L2 cache: 8-way, 64B line size (E8xxx)
  - pros: could be less bus traffic
  - cons: longer access latency than a private L2 cache; potential conflict between threads
- FSB: 1333 MHz (E8xxx)

Memory Disambiguation
- Aggressive memory dependence speculation
- Based on a hash table indexed by the load's EIP address
- Watchdog mechanism

Prediction Implementation
- History table indexed by instruction pointer
- Each entry in the history array has a saturating counter
- Once the counter saturates, disambiguation is possible on this load (taking effect from the next iteration): the load is allowed to go even if it meets unknown store addresses
- When a particular load fails disambiguation: reset its counter
- Each time a particular load is correctly disambiguated: increment its counter
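The counter rules above can be sketched as a small table-based predictor. The table size (256 entries), the counter ceiling (15), and the modulo hash are assumptions chosen for illustration; the slides do not give the real parameters.

```python
class DisambiguationPredictor:
    """IP-indexed table of saturating counters, one per hashed load address."""

    def __init__(self, entries=256, max_count=15):
        # entries and max_count are assumed values, not figures from the slides.
        self.entries = entries
        self.max_count = max_count
        self.table = [0] * entries

    def _index(self, load_ip):
        return load_ip % self.entries  # simple stand-in for the hardware hash

    def may_bypass(self, load_ip):
        """True once the counter is saturated: the load may pass unknown stores."""
        return self.table[self._index(load_ip)] == self.max_count

    def update(self, load_ip, conflicted):
        """Called when the load commits, with the verified outcome."""
        i = self._index(load_ip)
        if conflicted:
            self.table[i] = 0  # failed disambiguation: reset
        else:
            self.table[i] = min(self.max_count, self.table[i] + 1)


predictor = DisambiguationPredictor()
load_ip = 0x401A2C  # hypothetical instruction address of one load

for _ in range(15):
    predictor.update(load_ip, conflicted=False)
saturated = predictor.may_bypass(load_ip)   # counter reached the ceiling

predictor.update(load_ip, conflicted=True)  # one real conflict with an older store
after_conflict = predictor.may_bypass(load_ip)
```

Resetting on a single conflict makes the predictor conservative: a load has to build up a full run of clean commits again before it is allowed to speculate past unknown store addresses.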

Predictor Lookup
- When a load is sent from the RS, set its disambiguation bit

Load Dispatch
- If the load meets an older unknown store address, set "update"
- If the prediction is "go", dispatch and set "done"; else the load is blocked

Prediction Verification
- A store scans the loads in the Load Buffer; if a match is found, the "reset" bit is set
- When the load commits, update the history

Execute Disable Bit
- Counterparts: AMD Enhanced Virus Protection; ARM eXecute Never
- Helps prevent buffer overflow attacks
- Reduces the need for software patches against buffer overflow attacks
- Segregates memory by whether it stores code or data
- With OS support, the processor disables code execution when malicious worms try to insert code into data buffers

Instruction Pointer Based Prefetcher
- L1 DCache: 2 IP prefetchers/core
- L1 ICache: 1 traditional prefetcher
- L2 cache: 2 IP prefetchers
- Predicts what memory address will be used and delivers it in time
- Records every load's history using the instruction pointer (IP history array)
- Parameters for prefetch traffic control are fine-tuned for different platforms
- Prefetch monitor
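One common way to use a per-IP history is stride detection, sketched below. The slides only say that each load's history is recorded by instruction pointer, so the stride rule, the 256-entry table, and the "two equal strides in a row" trigger are assumptions made for illustration.

```python
class IPPrefetcher:
    """Per-load history keyed by instruction pointer, with simple stride detection."""

    def __init__(self, entries=256):
        self.entries = entries  # assumed table size
        self.history = {}       # table index -> (last address, last stride)

    def access(self, load_ip, addr):
        """Record one load; return an address to prefetch, or None."""
        idx = load_ip % self.entries
        prefetch = None
        if idx in self.history:
            last_addr, last_stride = self.history[idx]
            stride = addr - last_addr
            # Assumed trigger: the same non-zero stride seen twice in a row.
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride
            self.history[idx] = (addr, stride)
        else:
            self.history[idx] = (addr, 0)
        return prefetch


pf = IPPrefetcher()
load_ip = 0x10  # hypothetical address of a load inside a loop
requests = [pf.access(load_ip, a) for a in (1000, 1064, 1128, 1192)]
# requests == [None, None, 1192, 1256]
```

The first two accesses only establish the stride; from the third access on, the sketch requests the next line one stride ahead of the load.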


Questions?
