1
CS 6354 by WeiKeng Qin, Jian Xiang, & Ren Xu December 8, 2009
2
Introduction
Motivation:
- A multi-core on our desks
- A new microarchitecture to replace NetBurst
Intel Core 2 Duo, a dual-core CPU:
- ISA with SIMD extensions
- Intel Core microarchitecture
- Memory hierarchy system
3
Instruction Set Architecture
Base: x86-64; no VLIW (unlike Itanium)
SIMD extensions: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1 (a runtime feature check is sketched after this list)
- MMX (Pentium MMX, 1996): 8 new registers, packed data types, integer operations
- SSE (Pentium III, 1999): 8 new registers, floating-point operations
- SSE2 (Pentium 4, 2001): double precision, 128-bit register support
- SSE3 (Prescott, 2004): DSP-oriented math, process management
- SSSE3 (Core 2, July 2006): e.g. permuting bytes in a word
- SSE4.1 (Wolfdale, Jan 2008)
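Not from the original slides: a small sketch of how software can check at run time which of the extensions above the CPU supports. It assumes a GCC or Clang toolchain (the __builtin_cpu_supports builtin); other compilers would query CPUID directly.

    #include <stdio.h>

    /* Check which SIMD levels from the timeline above are available.
     * __builtin_cpu_supports is a GCC/Clang builtin that takes a string
     * literal; __builtin_cpu_init() primes the detection. */
    int main(void)
    {
        __builtin_cpu_init();
        printf("mmx:    %s\n", __builtin_cpu_supports("mmx")    ? "yes" : "no");
        printf("sse:    %s\n", __builtin_cpu_supports("sse")    ? "yes" : "no");
        printf("sse2:   %s\n", __builtin_cpu_supports("sse2")   ? "yes" : "no");
        printf("sse3:   %s\n", __builtin_cpu_supports("sse3")   ? "yes" : "no");
        printf("ssse3:  %s\n", __builtin_cpu_supports("ssse3")  ? "yes" : "no");
        printf("sse4.1: %s\n", __builtin_cpu_supports("sse4.1") ? "yes" : "no");
        return 0;
    }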
4
Streaming SIMD Extensions (SSE) 4.1
- Introduced with the 45 nm processors
- 47 instructions that improve the performance of media data manipulation
- Example: fast and efficient bit-width conversions, such as converting single-byte values to word (16-bit) values by filling the upper byte with zeros (00000000)
5
SSE2 Code
  MOVDQU    XMM0, M64     ; load the source bytes (unaligned move; only the low 8 bytes are used)
  PXOR      XMM1, XMM1    ; zero XMM1
  PUNPCKLBW XMM0, XMM1    ; interleave low bytes with zeros, giving eight zero-extended 16-bit words
6
SSE4.1 Code
  PMOVZXBW XMM0, M64      ; zero-extend eight packed bytes to eight words in one instruction
    DEST[15:0]    <-- ZeroExtend(SRC[7:0]);
    DEST[31:16]   <-- ZeroExtend(SRC[15:8]);
    DEST[47:32]   <-- ZeroExtend(SRC[23:16]);
    DEST[63:48]   <-- ZeroExtend(SRC[31:24]);
    DEST[79:64]   <-- ZeroExtend(SRC[39:32]);
    DEST[95:80]   <-- ZeroExtend(SRC[47:40]);
    DEST[111:96]  <-- ZeroExtend(SRC[55:48]);
    DEST[127:112] <-- ZeroExtend(SRC[63:56]);
Benefits
- Reduced instruction count (3 → 1)
- Better performance (~40% speedup per loop)
- Reduced register pressure (2 → 1)
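Not part of the original slides: a minimal C intrinsics sketch of the same two paths, assuming a compiler with <emmintrin.h>/<smmintrin.h> and -msse4.1. It uses a 64-bit load (_mm_loadl_epi64) in place of the slide's MOVDQU; _mm_unpacklo_epi8 with a zeroed register corresponds to the SSE2 sequence, and _mm_cvtepu8_epi16 maps to PMOVZXBW.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <smmintrin.h>   /* SSE4.1 intrinsics (compile with -msse4.1) */

    /* SSE2 path: load, zero a register, interleave -> three instructions. */
    static __m128i bytes_to_words_sse2(const void *src)
    {
        __m128i v    = _mm_loadl_epi64((const __m128i *)src); /* 8 source bytes */
        __m128i zero = _mm_setzero_si128();                   /* PXOR           */
        return _mm_unpacklo_epi8(v, zero);                    /* PUNPCKLBW      */
    }

    /* SSE4.1 path: PMOVZXBW zero-extends the 8 bytes in one instruction. */
    static __m128i bytes_to_words_sse41(const void *src)
    {
        return _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i *)src));
    }

In practice the compiler can fold the load into a PMOVZXBW memory operand, matching the slide's single-instruction form.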
7
Microarchitecture: The Cores
- Single die (107 mm²), two identical cores (L1 cache 64 KB × 2), shared 6 MB L2 cache
- No Hyper-Threading, no L3 cache
- Keeps the front-side bus
- Larger L2 cache
8
Microarchitecture
- 14-stage pipeline
- 4-wide decode
- 4-wide retire
- Macro-fusion
- Enhanced ALUs
- Deeper buffers
9
Another View
10
Decode Hardware
- 128-bit fetch bandwidth
- 18-entry instruction queue (IQ)
- Complex decoder produces 1 to 4 micro-ops
- Microcode sequencer
11
Macro-fusion
- New micro-op: represents an instruction pair as a single micro-op
- Enhanced ALUs execute the new compare-and-jump (CMPJCC) micro-op in one clock
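Not from the slides: a small, hedged C illustration of the kind of instruction pair macro-fusion targets. The loop's termination test typically compiles to a CMP followed immediately by a conditional jump; the commented assembly is only what a compiler might emit, not guaranteed output.

    /* The "i != n" test below typically becomes a CMP followed directly by
     * a conditional jump, e.g.
     *     cmp  rcx, rsi
     *     jne  .loop_top
     * which the Core decoders can fuse into a single CMPJCC micro-op. */
    long sum(const long *a, long n)
    {
        long s = 0;
        for (long i = 0; i != n; i++)
            s += a[i];
        return s;
    }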
12
Out-of-Order Execution
- 96-entry ROB (reorder buffer)
- 32-entry reservation station
13
Execution Units
- 6 dispatch ports (1 load, 2 store, 3 universal ports)
- 3 integer ALUs, 2 floating-point ALUs
14
Branch Predictor
- Loop detector: tracks the number of loop iterations for future reference
- The branch prediction unit (BPU) selects among the following for every branch:
  - bimodal predictor
  - global predictor
  - loop detector
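Not in the slides: a minimal sketch of the bimodal component only, a table of 2-bit saturating counters indexed by the branch address. The table size and the simple modulo indexing are illustrative assumptions, not Core's actual parameters.

    #include <stdint.h>
    #include <stdbool.h>

    #define BIMODAL_ENTRIES 4096              /* assumed table size */

    static uint8_t counter[BIMODAL_ENTRIES];  /* 2-bit counters, values 0..3 */

    /* Predict taken when the counter is in a "taken" state (2 or 3). */
    bool bimodal_predict(uint64_t branch_pc)
    {
        return counter[branch_pc % BIMODAL_ENTRIES] >= 2;
    }

    /* Train toward 3 on taken, toward 0 on not taken (saturating). */
    void bimodal_update(uint64_t branch_pc, bool taken)
    {
        uint8_t *c = &counter[branch_pc % BIMODAL_ENTRIES];
        if (taken && *c < 3)  (*c)++;
        if (!taken && *c > 0) (*c)--;
    }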
15
Cache Organization
- Private L1 DCache and ICache: 32 KB per core, 8-way, 64 B line size, write-back (directory-based coherence); the set count these parameters imply is worked out below
- Shared L2 cache: 8-way, 64 B line size (E8xxx)
  - Pros: can mean less bus traffic
  - Cons: longer access latency than a private L2 cache; potential conflicts between threads
- FSB 1333 MHz (E8xxx)
Memory Disambiguation
- Aggressive memory dependence speculation, based on a hash table indexed by the load's EIP address
- Watchdog mechanism
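Not in the slides: a short worked computation of what the L1 parameters above imply, using the standard relation sets = cache size / (ways * line size); the address-bit split is just log2 of those counts.

    #include <stdio.h>

    /* Set count and address-bit split implied by the slide's L1 numbers. */
    int main(void)
    {
        unsigned size = 32 * 1024;   /* 32 KB per core */
        unsigned ways = 8;
        unsigned line = 64;          /* bytes */

        unsigned sets = size / (ways * line);                   /* 32768 / 512 = 64 */

        unsigned offset_bits = 0, index_bits = 0;
        for (unsigned v = line; v > 1; v >>= 1) offset_bits++;  /* log2(64) = 6 */
        for (unsigned v = sets; v > 1; v >>= 1) index_bits++;   /* log2(64) = 6 */

        printf("L1: %u sets, %u offset bits, %u index bits\n",
               sets, offset_bits, index_bits);
        return 0;
    }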
16
Prediction Implementation
- History table indexed by the load's instruction pointer
- Each entry in the history array has a saturating counter
- Once the counter saturates, disambiguation is enabled for this load (taking effect from the next iteration): the load is allowed to go even when it meets unknown store addresses
- When a particular load fails disambiguation, its counter is reset
- Each time a particular load is correctly disambiguated, its counter is incremented
17
Predictor Lookup
- When the load is sent from the RS, its disambiguation bit is set
Load Dispatch
- If the load meets an older unknown store address, the "update" bit is set
- If the prediction is "go", the load dispatches and the "done" bit is set; otherwise it is blocked
Prediction Verification
- A store scans all previous loads in the load buffer; if a match is found, the "reset" bit is set
- When the load commits, the history is updated
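Not from the slides: a minimal software sketch of the saturating-counter disambiguation predictor described above, indexed by the load's EIP. The table size, counter width, and saturation threshold are illustrative assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    #define DISAMBIG_ENTRIES 256        /* assumed table size */
    #define SATURATE_AT      15         /* assumed 4-bit saturating counter */

    static uint8_t history[DISAMBIG_ENTRIES];

    /* Predict "go" only once the counter for this load EIP has saturated. */
    bool may_speculate(uint64_t load_eip)
    {
        return history[load_eip % DISAMBIG_ENTRIES] == SATURATE_AT;
    }

    /* Correctly disambiguated load: move the counter toward saturation. */
    void on_correct(uint64_t load_eip)
    {
        uint8_t *c = &history[load_eip % DISAMBIG_ENTRIES];
        if (*c < SATURATE_AT) (*c)++;
    }

    /* Failed disambiguation (an older store actually conflicted): reset. */
    void on_misspeculation(uint64_t load_eip)
    {
        history[load_eip % DISAMBIG_ENTRIES] = 0;
    }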
18
Execute Disable Bit Support
- Comparable features: AMD Enhanced Virus Protection, ARM eXecute Never
- Helps prevent buffer overflow attacks, with no need for per-exploit software patches
- Memory is segregated into regions that hold either code or data
- With OS support, the processor disables code execution when malicious worms try to insert code into data buffers
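Not part of the slides: a user-level illustration (assuming a Linux-style mmap with MAP_ANONYMOUS) of the policy the XD bit makes enforceable. The OS maps a data buffer without execute permission, so control transfers into injected bytes fault; the buffer and sizes here are hypothetical.

    #include <sys/mman.h>
    #include <string.h>
    #include <stdio.h>

    /* A data page mapped read/write but not executable: with the XD/NX
     * bit honored by the OS, jumping into bytes "injected" here faults. */
    int main(void)
    {
        size_t len = 4096;
        unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        memset(buf, 0xC3, len);   /* fill with RET opcodes, i.e. "injected code" */

        /* Calling into the buffer, e.g. ((void (*)(void))buf)(), would
         * fault here because the page has no execute permission; only an
         * explicit mprotect(buf, len, PROT_READ | PROT_EXEC) could change that. */

        munmap(buf, len);
        return 0;
    }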
19
Instruction Pointer Based Prefetcher
- L1 DCache: 2 IP prefetchers per core
- L1 ICache: 1 traditional prefetcher
- L2 cache: 2 IP prefetchers
- Predicts what memory address will be used next and delivers it in time
- Records every load's history using its instruction pointer (IP history array)
- Parameters for prefetch traffic control, fine-tuned for different platforms (prefetch monitor)
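Not from the slides: a minimal sketch of the idea behind an IP-indexed prefetcher, keeping the last address and stride per load IP and predicting the next address once the stride repeats. Table size and the confirmation rule are illustrative assumptions, not Intel's implementation.

    #include <stdint.h>
    #include <stdbool.h>

    #define PREFETCH_ENTRIES 128            /* assumed history-array size */

    struct ip_entry {
        uint64_t last_addr;                 /* previous address touched by this load IP */
        int64_t  last_stride;               /* last observed stride */
        bool     confirmed;                 /* stride seen twice in a row */
    };

    static struct ip_entry table[PREFETCH_ENTRIES];

    /* Called on every load; returns an address worth prefetching, or 0. */
    uint64_t ip_prefetch(uint64_t load_ip, uint64_t addr)
    {
        struct ip_entry *e = &table[load_ip % PREFETCH_ENTRIES];
        int64_t stride = (int64_t)(addr - e->last_addr);

        uint64_t candidate = 0;
        if (e->confirmed && stride == e->last_stride && stride != 0)
            candidate = addr + (uint64_t)stride;    /* predict the next access */

        e->confirmed   = (stride == e->last_stride && stride != 0);
        e->last_stride = stride;
        e->last_addr   = addr;
        return candidate;
    }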
21
References
- Intel's Next Generation Microarchitecture Unveiled, by David Kanter, Real World Technologies
- Intel Core Microarchitecture Briefing, by Stephen Smith and Bob Valentine, Intel
- Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient Performance, by Ofri Wechsler, Technology@Intel Magazine
- Intel Core: A Next-Generation Microarchitecture, by Alan Zeichick, DevX
- too many…
22
Questions?