CS 6354 by WeiKeng Qin, Jian Xiang, & Ren Xu December 8, 2009.

1 CS 6354 by WeiKeng Qin, Jian Xiang, & Ren Xu December 8, 2009

2 Introduction
- Motivation
  - A multi-core on our desks
  - A new microarchitecture to replace NetBurst
- Intel Core 2 Duo
  - A dual-core CPU
  - ISA with SIMD extensions
  - Intel Core microarchitecture
  - Memory hierarchy system

3 Instruction Set Architecture
- Base: x86-64
- No VLIW (Itanium)
- SIMD extensions: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1
  - Pentium MMX, MMX, 1996: 8 new registers, packed data types, integer operations
  - Pentium III, SSE, 1999: 8 new registers, floating-point operations
  - Pentium 4, SSE2, 2000: double precision, 128-bit register support
  - Prescott, SSE3, 2004: DSP-oriented math, process management
  - Core 2, SSSE3, July 2006: e.g. permuting bytes in a word
  - Wolfdale, SSE4.1, January 2008

4 Streaming SIMD Extensions (SSE) 4.1
- Available beginning with the 45 nm processors
- 47 instructions that improve the performance of media data manipulation
- e.g. fast and efficient bit-width conversions: convert single-byte values to word (16-bit) values

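5 SSE2 Code
    MOVQ      XMM0, M64     ; load 8 bytes into the low half of XMM0 (MOVDQU would need a 128-bit operand)
    PXOR      XMM1, XMM1    ; XMM1 = 0
    PUNPCKLBW XMM0, XMM1    ; interleave with zero bytes: 8 bytes become 8 words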

6 SSE4.1 Code
    PMOVZXBW XMM0, M64
    DEST[15:0]    <-- ZeroExtend(SRC[7:0]);
    DEST[31:16]   <-- ZeroExtend(SRC[15:8]);
    DEST[47:32]   <-- ZeroExtend(SRC[23:16]);
    DEST[63:48]   <-- ZeroExtend(SRC[31:24]);
    DEST[79:64]   <-- ZeroExtend(SRC[39:32]);
    DEST[95:80]   <-- ZeroExtend(SRC[47:40]);
    DEST[111:96]  <-- ZeroExtend(SRC[55:48]);
    DEST[127:112] <-- ZeroExtend(SRC[63:56]);
Benefits
- Reduced instruction count (3 -> 1)
- Better performance (~40% speedup each loop)
- Reduced register pressure (2 -> 1)
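
The equivalence between the three-instruction SSE2 sequence and the single SSE4.1 instruction can be checked with a small software model. This is only a sketch: the function names are illustrative, registers are modeled as 16-byte Python `bytes` objects in little-endian order, and the load is assumed to fill only the low 64 bits of the register.

```python
def punpcklbw(dst: bytes, src: bytes) -> bytes:
    """Model of PUNPCKLBW: interleave the low 8 bytes of dst and src."""
    assert len(dst) == 16 and len(src) == 16
    out = bytearray()
    for d, s in zip(dst[:8], src[:8]):
        out += bytes([d, s])          # dst byte is the low byte of each word
    return bytes(out)


def pmovzxbw(src8: bytes) -> bytes:
    """Model of PMOVZXBW: zero-extend 8 bytes to eight 16-bit words."""
    assert len(src8) == 8
    out = bytearray()
    for b in src8:
        out += b.to_bytes(2, "little")
    return bytes(out)


def sse2_sequence(m64: bytes) -> bytes:
    """The three-instruction version: load, clear a second register, unpack."""
    xmm0 = m64 + bytes(8)             # 64-bit load into the low half, upper half zero
    xmm1 = bytes(16)                  # PXOR XMM1, XMM1
    return punpcklbw(xmm0, xmm1)


data = bytes([1, 2, 0x80, 0xFF, 5, 6, 7, 8])
assert sse2_sequence(data) == pmovzxbw(data)
```

The model also makes the register-pressure point visible: the SSE2 path needs a second, zeroed register, while the SSE4.1 path needs only the destination.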

7 Microarchitecture: The Cores
- Single die (107 mm²)
- Two identical cores (L1 cache 64K x 2)
- Shared 6 MB L2 cache
- No Hyper-Threading, no L3 cache
- Keeps the front-side bus
- Larger L2 cache

8 Microarchitecture
- 14-stage pipeline
- 4-wide decode
- 4-wide retire
- Macro-fusion
- Enhanced ALUs
- Deeper buffers

9 Another View

10 Decode Hardware
- 128-bit fetch bandwidth
- 18-entry instruction queue (IQ)
- Complex decoder: produces 1-4 micro-ops
- Microcode sequencer

11 Macro-fusion
- New micro-op: represents an instruction pair as a single micro-op
- Enhanced ALUs: execute the new compare-and-jump (CMPJCC) micro-op in one clock
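
As a rough illustration of macro-fusion, the sketch below walks a decoded instruction stream and merges each compare that is immediately followed by a conditional jump into one fused entry. The tuple encoding, the `fuse` function, and the `cmpjcc` tag are hypothetical, and the real hardware's eligibility rules are more restrictive than this simple adjacency check.

```python
def fuse(instrs):
    """Merge each CMP + conditional-jump pair into one 'cmpjcc' entry.

    instrs is a list of tuples such as ("cmp", "eax", "ebx") or ("jne", "loop").
    Only adjacent pairs are fused; everything else passes through unchanged.
    """
    out = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        is_cond_jump = (
            nxt is not None and nxt[0].startswith("j") and nxt[0] != "jmp"
        )
        if cur[0] == "cmp" and is_cond_jump:
            # One fused micro-op carries the compare operands and the branch.
            out.append(("cmpjcc", *cur[1:], nxt[0], *nxt[1:]))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out


program = [
    ("mov", "eax", "1"),
    ("cmp", "eax", "ebx"),
    ("jne", "loop"),
    ("add", "eax", "1"),
]
fused = fuse(program)
print(len(program), "instructions ->", len(fused), "micro-ops")
```

Fewer entries in the fused stream means fewer slots consumed in the decode, rename, and retire stages for the same work.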

12 Out-of-Order Execution
- 96-entry reorder buffer (ROB)
- 32-entry reservation station

13 Execution Units
- 6 dispatch ports (1 load, 2 store, 3 universal ports)
- 3 integer ALUs, 2 floating-point ALUs

14 Branch Predictor
- Loop detector: tracks the number of loop iterations for future reference
- For every branch, the branch prediction unit (BPU) selects among:
  - bimodal predictor
  - global predictor
  - loop detector
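
The sketch below shows how a loop detector can remove the exit misprediction that a bimodal predictor suffers on a fixed-trip-count loop. It is a simplified model, not Intel's design: the class names, table size, and the selection rule (use the loop detector whenever it has history for the branch, otherwise fall back to bimodal) are assumptions, and the global predictor is left out.

```python
class Bimodal:
    """2-bit saturating counters indexed by branch address."""

    def __init__(self, size=1024):
        self.table = [2] * size               # start at "weakly taken"

    def predict(self, ip):
        return self.table[ip % len(self.table)] >= 2

    def update(self, ip, taken):
        i = ip % len(self.table)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


class LoopDetector:
    """Learns how many times a loop branch is taken before it falls through."""

    def __init__(self):
        self.trip = {}                        # ip -> learned taken count
        self.count = {}                       # ip -> taken count in current run

    def predict(self, ip):
        if ip not in self.trip:
            return None                       # no history yet
        return self.count.get(ip, 0) < self.trip[ip]

    def update(self, ip, taken):
        if taken:
            self.count[ip] = self.count.get(ip, 0) + 1
        else:
            self.trip[ip] = self.count.get(ip, 0)
            self.count[ip] = 0


def mispredictions_per_pass(outcomes, passes, ip=0x400):
    """Run the same branch outcome pattern several times; count misses per pass."""
    bimodal, loops = Bimodal(), LoopDetector()
    result = []
    for _ in range(passes):
        misses = 0
        for taken in outcomes:
            guess = loops.predict(ip)
            if guess is None:                 # assumed selection rule
                guess = bimodal.predict(ip)
            if guess != taken:
                misses += 1
            bimodal.update(ip, taken)
            loops.update(ip, taken)
        result.append(misses)
    return result


# A loop whose back-edge is taken three times, then falls through.
print(mispredictions_per_pass([True, True, True, False], passes=3))
```

In this model the first pass mispredicts the loop exit once; after the trip count is learned, later passes predict the exit correctly.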

15 Cache Organization
- Private L1 DCache and ICache: 32K each per core, 8-way, 64B line size, write-back (directory-based coherence)
- Shared L2 cache: 24-way, 64B line size (E8xxx)
  - pros: could mean less bus traffic
  - cons: longer access latency than a private L2 cache; potential conflicts between threads
- FSB: 1333 MHz (E8xxx)
- Memory disambiguation
  - aggressive memory dependence speculation based on a hash table indexed by the load's EIP address
  - watchdog mechanism
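
For the L1 figures on this slide (32K, 8-way, 64B lines), the set count and the address split follow from simple arithmetic. The helper names below are illustrative, and the example address is arbitrary.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# L1 data cache: 32 KiB, 8-way, 64-byte lines.
sets, offset_bits, index_bits = cache_geometry(32 * 1024, 8, 64)
print(sets, offset_bits, index_bits)            # 64 sets, 6 offset bits, 6 index bits
print(split_address(0x12345, offset_bits, index_bits))
```

With 64 sets and 64-byte lines, the index and offset together cover 12 address bits, which is exactly one 4 KiB page.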

16 Prediction Implementation
- History table indexed by instruction pointer
- Each entry in the history array has a saturating counter
- Once the counter saturates, disambiguation is possible on this load (taking effect from the next iteration)
  - the load is allowed to proceed even if it meets unknown store addresses
- When a particular load fails disambiguation: reset its counter
- Each time a particular load is correctly disambiguated: increment its counter
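
The counter rules on this slide can be sketched as a small table of saturating counters. The table size, the counter maximum, and the modulo hash of the instruction pointer are assumptions chosen for illustration; the slide does not give the real values.

```python
class DisambiguationPredictor:
    """History table of saturating counters indexed by a hash of the load's IP."""

    def __init__(self, entries=256, max_count=15):
        self.counters = [0] * entries
        self.max_count = max_count            # assumed saturation value

    def _index(self, ip):
        return ip % len(self.counters)        # assumed hash function

    def may_bypass(self, ip):
        """True if the load may issue ahead of stores with unknown addresses."""
        return self.counters[self._index(ip)] == self.max_count

    def update(self, ip, conflicted):
        """Called when the load's outcome is known."""
        i = self._index(ip)
        if conflicted:
            self.counters[i] = 0              # failed disambiguation: reset
        else:
            self.counters[i] = min(self.max_count, self.counters[i] + 1)


predictor = DisambiguationPredictor(entries=16, max_count=3)
load_ip = 0x40
for _ in range(3):
    predictor.update(load_ip, conflicted=False)
print(predictor.may_bypass(load_ip))          # counter has saturated
predictor.update(load_ip, conflicted=True)
print(predictor.may_bypass(load_ip))          # reset after a conflict
```

Requiring saturation before speculating, and resetting on any conflict, keeps the predictor conservative: one bad outcome costs several clean iterations before the load is allowed to bypass again.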

17
- Predictor Lookup: when a load is sent from the RS, set its disambiguation bit
- Load Dispatch:
  - if the load meets an older unknown store address, set "update"
  - if the prediction is "go", dispatch the load and set "done"
  - else the load is blocked
- Prediction Verification:
  - a store scans all previous loads in the Load Buffer; if a match is found, the "reset" bit is set
  - when the load commits, update the history

18 Execute Disable Bit
- Comparable features: AMD Enhanced Virus Protection; ARM eXecute Never
- Helps prevent buffer overflow attacks
- No need for software patches against buffer overflow attacks
- Segregates memory by whether it stores code or data
- With OS support, the processor disables code execution when malicious worms try to insert code into data buffers

19 Instruction Pointer Based Prefetcher
- L1 DCache: 2 IP prefetchers per core
- L1 ICache: 1 traditional prefetcher
- L2 cache: 2 IP prefetchers
- Predict what memory address will be used and deliver the data in time
- Record every load's history in an IP history array, indexed by instruction pointer
- Parameters for prefetch traffic control are fine-tuned for different platforms
- Prefetch monitor
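
One common way to use a per-instruction-pointer history is stride detection: if the same load advances by a constant amount on consecutive executions, the next address can be requested early. The sketch below assumes that scheme. The class name, the unbounded dictionary standing in for the IP history array, and the "same stride twice in a row" rule are illustrative choices, not details taken from the slides.

```python
class IPStridePrefetcher:
    """Tracks the last address and stride seen for each load instruction."""

    def __init__(self):
        self.history = {}                     # ip -> (last_addr, last_stride)

    def access(self, ip, addr):
        """Record a load; return an address to prefetch, or None."""
        prefetch = None
        stride = 0
        previous = self.history.get(ip)
        if previous is not None:
            last_addr, last_stride = previous
            stride = addr - last_addr
            # Prefetch only when the same nonzero stride repeats.
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride
        self.history[ip] = (addr, stride)
        return prefetch


prefetcher = IPStridePrefetcher()
load_ip = 0x401000
for address in (1000, 1064, 1128, 1192):
    print(address, "->", prefetcher.access(load_ip, address))
```

In this model the first two accesses only train the entry; from the third access on, the prefetcher requests the line one stride ahead.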

22 Questions?
