CS 6354 by WeiKeng Qin, Jian Xiang, & Ren Xu December 8, 2009.

Presentation transcript:

Introduction
- Motivation
  - A multi-core on our desks
  - A new microarchitecture to replace NetBurst
- Intel Core 2 Duo
  - A dual-core CPU
  - ISA with SIMD extensions
  - Intel Core microarchitecture
  - Memory hierarchy system

Instruction Set Architecture
- Base: x86-64; no VLIW (unlike Itanium)
- SIMD extensions: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1
  - MMX (Pentium MMX, 1996): 8 new registers, packed data types, integer operations
  - SSE (Pentium III, 1999): new registers, floating-point operations
  - SSE2 (Pentium 4, 2001): double precision, 128-bit register support
  - SSE3 (Prescott, 2004): DSP-oriented math, process management
  - SSSE3 (Core 2, July 2006): e.g. permuting bytes in a word
  - SSE4.1 (Wolfdale, Sep)

Streaming SIMD Extensions (SSE) 4.1
- Introduced with the 45 nm processors
- 47 instructions that improve the performance of media data manipulation
- e.g. fast and efficient bit-width conversions: convert single-byte values to word (16-bit) values

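SSE2 Code
    MOVDQU    XMM0, M64     ; load the packed source bytes
    PXOR      XMM1, XMM1    ; XMM1 = 0
    PUNPCKLBW XMM0, XMM1    ; interleave the low 8 bytes with zero bytes: byte -> word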

SSE4.1 Code
    PMOVZXBW XMM0, M64
    DEST[15:0]    <-- ZeroExtend(SRC[7:0]);
    DEST[31:16]   <-- ZeroExtend(SRC[15:8]);
    DEST[47:32]   <-- ZeroExtend(SRC[23:16]);
    DEST[63:48]   <-- ZeroExtend(SRC[31:24]);
    DEST[79:64]   <-- ZeroExtend(SRC[39:32]);
    DEST[95:80]   <-- ZeroExtend(SRC[47:40]);
    DEST[111:96]  <-- ZeroExtend(SRC[55:48]);
    DEST[127:112] <-- ZeroExtend(SRC[63:56]);
Benefits
- Fewer instructions (3 → 1)
- Better performance (~40% speedup per loop iteration)
- Reduced register pressure (2 → 1)
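To make the equivalence of the two sequences concrete, here is a minimal Python sketch that models the registers as plain integers. The function names `pmovzxbw`, `punpcklbw`, and `sse2_sequence` are illustrative helpers, not part of any library, and the model only captures the data movement, not timing.

```python
def pmovzxbw(src):
    """Model of PMOVZXBW: DEST[16i+15:16i] = ZeroExtend(SRC[8i+7:8i]), i = 0..7."""
    dest = 0
    for i in range(8):
        byte = (src >> (8 * i)) & 0xFF
        dest |= byte << (16 * i)
    return dest


def punpcklbw(dest, src):
    """Model of PUNPCKLBW: interleave the low 8 bytes of dest and src."""
    out = 0
    for i in range(8):
        d = (dest >> (8 * i)) & 0xFF
        s = (src >> (8 * i)) & 0xFF
        out |= d << (16 * i)        # byte from dest goes in the low half of each word
        out |= s << (16 * i + 8)    # byte from src goes in the high half
    return out


def sse2_sequence(m64):
    """The three-instruction SSE2 path: load, zero a register, unpack."""
    xmm0 = m64 & ((1 << 64) - 1)    # load the 8 source bytes
    xmm1 = 0                        # PXOR XMM1, XMM1
    return punpcklbw(xmm0, xmm1)    # PUNPCKLBW XMM0, XMM1


src = int.from_bytes(bytes([1, 2, 3, 0xFF, 5, 6, 7, 0x80]), "little")
assert sse2_sequence(src) == pmovzxbw(src)
```

Both paths yield the same eight 16-bit words; the SSE4.1 form simply does it without the extra zeroed register.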

Microarchitecture: The Cores
- Single die (107 mm²)
- Two identical cores (L1 cache 64K x 2)
- Shared 6M L2 cache
- No Hyper-Threading, no L3 cache
- Keeps the front-side bus
- Larger L2 cache

Microarchitecture
- 14-stage pipeline
- 4-wide decode
- 4-wide retire
- Macro-fusion
- Enhanced ALUs
- Deeper buffers

Another View

Decode Hardware
- 128-bit fetch bandwidth
- 18-entry instruction queue (IQ)
- Complex decoder: produces 1-4 micro-ops
- Microcode sequencer

Macro-fusion
- New micro-op: represents an instruction pair as a single micro-op
- Enhanced ALUs: execute the new compare-and-jump (CMPJCC) micro-op in one clock
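The effect of macro-fusion on the micro-op stream can be sketched in a few lines of Python. This is a simplified model, assuming instructions are represented only by their mnemonics and that any CMP immediately followed by a conditional jump is fusible; the helper names `is_cond_jump` and `fuse` are hypothetical, and real hardware places further restrictions on which pairs qualify.

```python
def is_cond_jump(mnemonic):
    """Treat any J* mnemonic other than the unconditional JMP as a conditional jump."""
    return mnemonic.startswith("J") and mnemonic != "JMP"


def fuse(instrs):
    """Return the micro-op stream with each CMP + Jcc pair replaced by one CMPJCC."""
    uops = []
    i = 0
    while i < len(instrs):
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if instrs[i] == "CMP" and nxt is not None and is_cond_jump(nxt):
            uops.append("CMPJCC")   # the pair becomes a single micro-op
            i += 2
        else:
            uops.append(instrs[i])  # everything else passes through unchanged
            i += 1
    return uops


loop_body = ["MOV", "ADD", "CMP", "JNE"]
fused = fuse(loop_body)
```

In this toy loop body the four instructions become three micro-ops, which is where the decode, ROB, and retire bandwidth savings come from.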

Out-of-Order Execution
- 96-entry reorder buffer (ROB)
- 32-entry reservation station

Execution Units
- 6 dispatch ports (1 load, 2 store, 3 universal)
- 3 integer ALUs, 2 floating-point ALUs

Branch Predictor
- Loop detector: tracks the number of loop iterations for future reference
- For every branch, the branch prediction unit (BPU) selects among:
  - bimodal predictor
  - global predictor
  - loop detector
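A small simulation shows why a loop detector helps on top of a bimodal predictor. The sketch below is an assumption-laden toy: the bimodal predictor is a single 2-bit saturating counter, the loop detector simply remembers the last observed trip count, and the BPU's selection logic and the global predictor are not modeled. Class and function names are illustrative.

```python
class Bimodal:
    """One 2-bit saturating counter; predicts taken when the counter is 2 or 3."""

    def __init__(self):
        self.counter = 2  # start weakly taken (assumed initial state)

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


class LoopDetector:
    """Remembers how many times the loop branch was taken before the last exit."""

    def __init__(self):
        self.trip = None  # learned number of taken iterations, unknown at first
        self.count = 0    # taken iterations seen in the current run

    def predict(self):
        if self.trip is None:
            return True   # no history yet: guess that the loop continues
        return self.count < self.trip

    def update(self, taken):
        if taken:
            self.count += 1
        else:
            self.trip = self.count  # loop exited: record the trip count
            self.count = 0


def mispredictions(predictor, outcomes):
    """Run the predictor over a sequence of branch outcomes and count misses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch taken 5 times and then not taken, executed 4 times over.
outcomes = ([True] * 5 + [False]) * 4
bimodal_misses = mispredictions(Bimodal(), outcomes)
loop_misses = mispredictions(LoopDetector(), outcomes)
```

In this trace the bimodal counter misses every loop exit, while the loop detector misses only the first one, before it has learned the trip count.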

Cache Organization
- Private L1 DCache and ICache: 32K/core, 8-way, 64B line size, write-back (directory-based coherence)
- Shared L2 cache: 8-way, 64B line size (E8xxx)
  - pros: could mean less bus traffic
  - cons: longer access latency than a private L2 cache; potential conflicts between threads
- FSB: 1333 MHz (E8xxx)

Memory Disambiguation
- Aggressive memory dependence speculation
- Based on a hash table indexed by the load's EIP (instruction address)
- Watchdog mechanism

Prediction Implementation
- History table indexed by instruction pointer
- Each entry in the history array has a saturating counter
- Once the counter saturates, disambiguation is possible on this load (taking effect from the next iteration): the load is allowed to go even if it meets unknown store addresses
- When a particular load fails disambiguation: reset its counter
- Each time a particular load is correctly disambiguated: increment its counter

- When a load is sent from the RS, set its disambiguation bit
- If it meets an older unknown store address, set "update"
- If the prediction is "go", dispatch and set "done"; otherwise the load is blocked
- A store scans all previous loads in the Load Buffer; if a match is found, the "reset" bit is set
- When the load commits, update the history
Stages: Predictor Lookup, Prediction Verification, Load Dispatch
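The counter policy described for the disambiguation predictor can be sketched as follows. This is a minimal model, not Intel's implementation: the table size (256 entries), the saturation threshold (15), and the modulo hash are all assumptions chosen for illustration, and the class and method names are hypothetical.

```python
class DisambiguationPredictor:
    """History table of saturating counters indexed by a hash of the load's IP."""

    def __init__(self, entries=256, threshold=15):
        # entries and threshold are assumed values, not documented hardware sizes
        self.entries = entries
        self.threshold = threshold
        self.counters = [0] * entries

    def _index(self, ip):
        return ip % self.entries  # assumed hash: low bits of the instruction pointer

    def predict_go(self, ip):
        """True when the load may bypass older stores with unknown addresses."""
        return self.counters[self._index(ip)] >= self.threshold

    def update(self, ip, conflicted):
        """Called when the load commits, with the verified outcome."""
        i = self._index(ip)
        if conflicted:
            self.counters[i] = 0  # failed disambiguation: reset
        else:
            self.counters[i] = min(self.threshold, self.counters[i] + 1)


predictor = DisambiguationPredictor()
load_ip = 0x401000  # hypothetical instruction address of a load

# The load must behave well many times before it is allowed to speculate.
history = []
for _ in range(15):
    history.append(predictor.predict_go(load_ip))
    predictor.update(load_ip, conflicted=False)
saturated = predictor.predict_go(load_ip)

# A single conflict with an older store sends it back to the start.
predictor.update(load_ip, conflicted=True)
after_conflict = predictor.predict_go(load_ip)
```

The asymmetry (slow to trust, quick to distrust) reflects the cost structure: a wrong "go" forces a pipeline flush, while a wrong "block" only delays one load.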

Execute Disable Bit
- Comparable to AMD Enhanced Virus Protection and ARM eXecute Never
- Helps prevent buffer overflow attacks
- No need for software patches against buffer overflow attacks
- Segregates memory by whether it stores code or data
- The processor disables code execution when malicious worms try to insert code into data buffers (with OS support)
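The idea of segregating code pages from data pages can be illustrated with a toy model. This sketch assumes a per-page flag set by the operating system and a fault raised on instruction fetch from a flagged page; `Page`, `ExecuteDisableFault`, and `fetch_instruction` are hypothetical names, and real page-table handling is far more involved.

```python
from dataclasses import dataclass


class ExecuteDisableFault(Exception):
    """Raised when an instruction fetch targets a page marked execute-disable."""


@dataclass
class Page:
    data: bytes
    execute_disable: bool  # set by the OS for pages that hold data, not code


def fetch_instruction(page, offset):
    """Return the byte at offset, refusing to fetch from execute-disable pages."""
    if page.execute_disable:
        raise ExecuteDisableFault("instruction fetch from a data page")
    return page.data[offset]


code_page = Page(data=bytes([0x90, 0xC3]), execute_disable=False)
stack_page = Page(data=bytes([0x90, 0xC3]), execute_disable=True)  # injected bytes

first_opcode = fetch_instruction(code_page, 0)
try:
    fetch_instruction(stack_page, 0)
    blocked = False
except ExecuteDisableFault:
    blocked = True
```

The same bytes are fetched normally from the code page but rejected on the stack page, which is why injected code in a data buffer cannot run.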

Instruction Pointer Based Prefetcher
- L1 DCache: 2 IP prefetchers per core
- L1 ICache: 1 traditional prefetcher
- L2 cache: 2 IP prefetchers
- Predicts which memory addresses will be used and delivers them in time
- Records every load's history using the instruction pointer: IP history array
- Parameters for prefetch traffic control are fine-tuned for different platforms
- Prefetch monitor
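One common way to use a per-IP history is stride detection, sketched below. The slides do not say which prediction rule the hardware uses, so the stride rule, the unbounded dictionary standing in for the IP history array, and the class name `IPStridePrefetcher` are all assumptions for illustration.

```python
class IPStridePrefetcher:
    """Tracks the last address and stride seen for each load instruction pointer."""

    def __init__(self):
        self.history = {}  # ip -> (last_addr, last_stride); stands in for the IP history array

    def access(self, ip, addr):
        """Record a load and return an address to prefetch, or None."""
        prefetch = None
        entry = self.history.get(ip)
        if entry is None:
            self.history[ip] = (addr, 0)  # first sighting: no stride yet
            return None
        last_addr, last_stride = entry
        stride = addr - last_addr
        if stride != 0 and stride == last_stride:
            prefetch = addr + stride      # same stride twice in a row: predict the next
        self.history[ip] = (addr, stride)
        return prefetch


prefetcher = IPStridePrefetcher()
load_ip = 0x400  # hypothetical instruction address of a load inside a loop
issued = [prefetcher.access(load_ip, addr) for addr in (1000, 1064, 1128, 1192)]
```

Indexing by instruction pointer keeps the streams of different loads apart, so one load walking an array does not disturb the history of another load with an irregular pattern.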

Questions?