Published by Gilbert Richardson. Modified over 8 years ago.
1
Modern general-purpose processors
2
Post-RISC architecture Instruction & arithmetic pipelining Superscalar architecture Data flow analysis Branch prediction Cache memory Multimedia extensions
3
General structure Front end Fetching Decoding Dispatching for execution Back end / Execution engine Execution
4
Pipelining & superscalar execution
5
Deep & Narrow (Intel Pentium 4) Extremely long pipelines Lower number of execution units Wide & Shallow (Motorola 74xx, a.k.a. PowerPC G4) Relatively shallow pipelines Higher number of execution units
6
Wide & Shallow vs. Deep & Narrow Motorola 74xx’s approach vs. Intel Pentium 4’s approach
7
Post-RISC Architecture Motorola 74xx aka PowerPC G4 Intel Pentium 4
8
Data flow analysis Instruction parallelism vs machine parallelism Instruction-issue policies In-Order Issue with In-Order Completion In-Order Issue with Out-of-Order Completion Out-of-Order Issue with Out-of-Order Completion
9
Data flow analysis Example: i1 requires 2 cycles to execute; i3 and i4 are executed by the same execution unit (U3); i5 uses the result of i4; i5 and i6 are executed by the same execution unit (U2). Hypothetical processor: parallel fetching and decoding of 2 instructions (Decode), three execution units (Execute), parallel storing of 2 results (Writeback)
10
Data flow analysis In-Order Issue with In-Order Completion
11
Data flow analysis In-Order Issue with Out-of-Order Completion
12
Data flow analysis In-Order Issue with Out-of-Order Completion Problem of Output Dependency (Write-Write Dependency) Example:
R3 := R3 op R5 (i1)
R4 := R3 + 1 (i2)
R3 := R5 + 1 (i3)
R7 := R3 op R4 (i4)
i1 and i2, i3 and i4 – True Data Dependency (Flow Dependency, Write-Read Dependency)
i1 and i3 – Output Dependency (Write-Write Dependency)
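The dependency classes in the example above can be derived mechanically from each instruction's read and write sets. A minimal Python sketch (the instruction encoding and the `classify` helper are illustrative, not part of any real issue logic):

```python
# Classify data dependencies between two instructions, each given as a
# (writes, reads) pair of register-name sets.
def classify(first, second):
    """Return the dependency types of `second` on the earlier `first`."""
    w1, r1 = first
    w2, r2 = second
    deps = []
    if w1 & r2:
        deps.append("RAW (true/flow dependency)")
    if w1 & w2:
        deps.append("WAW (output dependency)")
    if r1 & w2:
        deps.append("WAR (antidependency)")
    return deps

# Slide example: R3 := R3 op R5 (i1); R4 := R3 + 1 (i2);
#                R3 := R5 + 1 (i3);  R7 := R3 op R4 (i4)
i1 = ({"R3"}, {"R3", "R5"})
i2 = ({"R4"}, {"R3"})
i3 = ({"R3"}, {"R5"})
i4 = ({"R7"}, {"R3", "R4"})

print("i1->i2:", classify(i1, i2))   # RAW: i2 reads the R3 written by i1
print("i1->i3:", classify(i1, i3))   # WAW: both write R3 (plus a WAR, since i1 also reads R3)
print("i2->i3:", classify(i2, i3))   # WAR: i3 overwrites the R3 that i2 reads
```

In-order completion hides the WAW hazard automatically; once completion goes out of order, the hardware must track it explicitly.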
13
Data flow analysis Out-of-Order Issue with Out-of-Order Completion
14
Data flow analysis Out-of-Order Issue with Out-of-Order Completion Problem of Antidependency (Read-Write Dependency) Example:
R3 := R3 op R5 (i1)
R4 := R3 + 1 (i2)
R3 := R5 + 1 (i3)
R7 := R3 op R4 (i4)
i1 and i2, i3 and i4 – True Data Dependency (Flow Dependency, Write-Read Dependency)
i2 and i3 – Antidependency (Read-Write Dependency)
15
Data flow analysis Duplication of resources, register renaming Problem of Antidependency (Read-Write Dependency) Example:
R3b := R3a op R5a (i1)
R4a := R3b + 1 (i2)
R3c := R5a + 1 (i3)
R7b := R3c op R4a (i4)
i1 and i2, i3 and i4 – True Data Dependency (Flow Dependency, Write-Read Dependency)
i1 and i3 – Output Dependency (Write-Write Dependency)
i2 and i3 – Antidependency (Read-Write Dependency)
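Register renaming can be sketched in a few lines: every architectural write is given a fresh physical register, and reads are redirected to the most recent mapping, so only the true (RAW) dependencies survive. A minimal Python sketch (the instruction format and physical-register names `P0`, `P1`, … are illustrative):

```python
from itertools import count

def rename(program):
    """program: list of (dest, sources) over architectural registers.
    Returns the same instructions over freshly allocated physical registers."""
    fresh = count()
    mapping = {}                      # architectural -> current physical register

    def phys(r):
        if r not in mapping:          # first read of an unseen register
            mapping[r] = f"P{next(fresh)}"
        return mapping[r]

    renamed = []
    for dest, sources in program:
        srcs = tuple(phys(s) for s in sources)   # read the current mappings first
        mapping[dest] = f"P{next(fresh)}"        # then allocate a new physical dest
        renamed.append((mapping[dest], srcs))
    return renamed

# Slide example: R3 := R3 op R5; R4 := R3 + 1; R3 := R5 + 1; R7 := R3 op R4
prog = [("R3", ("R3", "R5")), ("R4", ("R3",)), ("R3", ("R5",)), ("R7", ("R3", "R4"))]
for dest, srcs in rename(prog):
    print(dest, "<-", srcs)
```

After renaming, the two writes to R3 land in different physical registers (mirroring the slide's R3b/R3c subscripts), so the WAW and WAR hazards disappear and i3 can issue out of order.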
16
Branch prediction Various techniques Predict Never Taken Predict Always Taken Predict by Opcode Taken/Not Taken Switch Branch History Table & Branch Target Buffer Additional enhancements Advanced branch prediction algorithms Loop predictor Indirect branch predictor
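The "Taken/Not Taken Switch" above is classically implemented as a 2-bit saturating counter, the building block of a Branch History Table. A minimal Python sketch (the class name and interface are illustrative, not taken from any real design):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken.
    The hysteresis means one atypical outcome does not flip the prediction."""

    def __init__(self):
        self.state = 2                      # start weakly taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 3)   # saturate at strongly taken
        else:
            self.state = max(self.state - 1, 0)   # saturate at strongly not-taken

p = TwoBitPredictor()
history = [True, True, False, True, True, True]   # e.g. a loop branch with one exit-like miss
hits = 0
for taken in history:
    hits += (p.predict() == taken)
    p.update(taken)
print(f"correct predictions: {hits}/{len(history)}")   # correct predictions: 5/6
```

A real BHT indexes many such counters by (part of) the branch address; the Branch Target Buffer supplies the predicted target so fetch can redirect without waiting for decode.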
17
Post-RISC Architecture
18
Multimedia/SIMD extensions Characteristics of multimedia applications Narrow data types Typical width of data in memory 8-16 bits Typical width of data during computation 16-32 bits Fixed-point arithmetic often replaces floating-point arithmetic Fine grain (data parallelism) High predictability of branches High instruction locality in small loops or kernels Memory requirements High bandwidth requirements but can tolerate high latency High spatial locality (predictable pattern) but low temporal locality
19
Multimedia/SIMD extensions Subword parallelism A technique already present in the early vector supercomputers TI ASC and CDC Star-100
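Subword parallelism means treating one wide register as several independent narrow lanes. The effect can be emulated with ordinary integer arithmetic ("SWAR"): the sketch below, with illustrative names, adds four 8-bit lanes packed into one 32-bit word in a single addition, masking so that carries never cross lane boundaries:

```python
def packed_add_u8(x, y):
    """Lane-wise 8-bit addition with wrap-around on 32-bit packed values."""
    # Add the low 7 bits of each lane; the masked-off top bits cannot
    # generate a carry into the neighbouring lane.
    low = (x & 0x7F7F7F7F) + (y & 0x7F7F7F7F)
    # Fold the top bit of each lane back in with XOR (addition mod 2).
    return low ^ ((x ^ y) & 0x80808080)

a = int.from_bytes(bytes([1, 2, 3, 250]), "big")
b = int.from_bytes(bytes([1, 1, 1, 10]), "big")
print(packed_add_u8(a, b).to_bytes(4, "big"))   # b'\x02\x03\x04\x04' - the 250+10 lane wraps
```

Multimedia extensions such as MMX/SSE or AltiVec provide this directly in hardware, with dedicated packed-add instructions (including saturating variants) over much wider registers.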
20
Multimedia/SIMD extensions Good performance at a relatively low cost No significant modifications in the organization of existing processors
21
Symmetric Multiprocessing, Superthreading, Hyperthreading Single-threaded CPU Single-threaded SMP system
22
Symmetric Multiprocessing, Superthreading, Hyperthreading Superthreaded CPU vs. Hyperthreaded CPU
23
Implementation of Hyperthreading Replicated Register renaming logic Instruction Pointer ITLB Return stack predictor Various other architectural registers Partitioned Re-order buffers (ROBs) Load/Store buffers Various queues, like the scheduling queues, uop queue, etc. Shared Caches: trace cache, L1, L2, L3 Microarchitectural registers Execution Units
24
Approaches to 64-bit processing IA-64 (VLIW/EPIC) AMD x86-64 (Intel 64)
25
AMD x86-64 Extended registers Increased number of registers Switching modes Legacy x86 32-bit mode (32-bit OS) Long x86 64-bit mode (64-bit OS) 64-bit mode (x86-64 applications) Compatibility mode (x86 applications) No performance penalty for running in legacy or compatibility mode Removal of some outdated x86 features
26
AMD x86-64
28
Very Large Instruction Word (VLIW) / Explicitly Parallel Instruction Processing (EPIC)
29
Current performance limiters Branches Memory latency Implicit parallel instruction processing (the hardware must rediscover instruction-level parallelism at run time)
30
Very Large Instruction Word (VLIW) / Explicitly Parallel Instruction Processing (EPIC)
34
Predication mechanism
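Predication removes a branch by executing both sides and letting a predicate select which result is committed. IA-64 does this per instruction with predicate registers; the idea can be emulated as a branchless bitmask select, sketched here in Python with illustrative names on 32-bit values:

```python
def predicated_select(pred, a, b):
    """Return a if pred else b, with no data-dependent branch.
    Emulates predication: both operands are 'computed', the predicate
    decides which one survives."""
    mask = -int(pred) & 0xFFFFFFFF            # all ones if pred is true, else zero
    return (a & mask) | (b & ~mask & 0xFFFFFFFF)

print(predicated_select(True, 10, 20))    # 10
print(predicated_select(False, 10, 20))   # 20
```

The win: a hard-to-predict branch becomes straight-line code, so a misprediction flush is impossible; the cost is that both sides always consume execution resources.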
35
Very Large Instruction Word (VLIW) / Explicitly Parallel Instruction Processing (EPIC) Speculation mechanism
36
Multi-core processing Taking advantage of explicit high-level parallelism (process / thread level parallelism) Time Instruction-Level Parallelism (ILP) Time Thread-Level Parallelism (TLP)
37
Intel Core microarchitecture Combination and improvement of solutions employed in earlier Intel architectures, mainly Pentium M (successor of the P6/Pentium Pro microarchitecture) and Pentium 4 Designed for multi-core operation and lowered power consumption No simplification of the core structure in favor of multiple cores on a single die
38
Intel Core microarchitecture Features summary Wide Dynamic Execution 14-stage core pipeline 4 decoders to decode up to 5 instructions per cycle 3 clusters of arithmetic logical units macro-fusion and micro-fusion to improve front-end throughput peak dispatching rate of up to 6 micro-ops per cycle peak retirement rate of up to 4 micro-ops per cycle advanced branch prediction algorithms stack pointer tracker to improve efficiency of procedure entries and exits Advanced Smart Cache 2nd level cache up to 4 MB with 16-way associativity 256 bit internal data path from L2 to L1 data caches
39
Intel Core microarchitecture Features summary Smart Memory Access hardware pre-fetchers to reduce effective latency of 2nd level cache misses hardware pre-fetchers to reduce effective latency of 1st level data cache misses "memory disambiguation" to improve efficiency of speculative instruction execution Advanced Digital Media Boost single-cycle throughput for most 128-bit SIMD instructions up to eight single-precision floating-point operations per cycle 3 issue ports available for dispatching SIMD instructions to execution
40
Intel Core microarchitecture Processor’s pipeline & execution units
41
Intel Core microarchitecture Chosen new features Instruction fusion Macro-fusion – fusion of certain pairs of x86 instructions (a compare or test followed by a conditional jump) into a single micro-op in the predecode phase Micro-op fusion (first introduced with the Pentium M) – fusion/pairing of the micro-ops generated when translating certain x86 instructions that decode into two micro-ops (load-and-op, and stores (store-address and store-data))
42
Intel Core microarchitecture Chosen new features Memory disambiguation Data-stream-oriented speculative execution as a way of dealing with false memory aliasing The memory disambiguator predicts which loads do not depend on any earlier in-flight store When the disambiguator predicts that a load has no such dependency, the load takes its data from the L1 data cache early The prediction is verified later; if an actual conflict is detected, the load and all subsequent instructions are re-executed
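The predict-then-verify flow above can be sketched as a small simulation. Everything here (the store-queue representation, the `execute_load` helper, the replay-by-forwarding shortcut) is a simplified stand-in for illustration, not Intel's actual design:

```python
def execute_load(addr, cache, store_queue, predict_independent):
    """Return (value, mispredicted) for a load at `addr`.
    store_queue: older in-flight stores as (address, value) pairs, oldest first."""
    if predict_independent:
        value = cache.get(addr, 0)                  # speculative early load from L1
        conflict = any(s_addr == addr for s_addr, _ in store_queue)
        if conflict:                                # misprediction: replay the load,
            value = dict(store_queue)[addr]         # taking the forwarded store data
        return value, conflict
    # Conservative path: wait and forward from the youngest matching store.
    for s_addr, s_val in reversed(store_queue):
        if s_addr == addr:
            return s_val, False
    return cache.get(addr, 0), False

cache = {0x100: 7}
stores = [(0x200, 99)]                              # an older, unresolved store
print(execute_load(0x100, cache, stores, True))     # (7, False): speculation paid off
print(execute_load(0x200, cache, stores, True))     # (99, True): conflict, load replayed
```

The bet pays off because true load-store aliasing is rare; in hardware a misprediction also squashes every instruction that consumed the stale value.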
43
Intel Nehalem microarchitecture Feature summary Enhanced processor core improved branch prediction and lower recovery cost after misprediction enhancements in loop streaming to improve front-end performance and reduce power consumption deeper buffering in the out-of-order engine to sustain higher levels of instruction-level parallelism enhanced execution units with accelerated processing of CRC, string/text and data-shuffling operations Hyper-Threading technology (SMT) support for two hardware threads (logical processors) per core
44
Intel Nehalem microarchitecture Feature summary Smarter Memory Access integrated (on-chip) memory controller supporting low-latency access to local system memory and scalable overall memory bandwidth (previously the memory controller sat on a separate chip, shared by all processors in dual- or quad-socket systems) new cache hierarchy organization with a shared, inclusive L3 to reduce snoop traffic two-level TLBs and increased TLB sizes faster unaligned memory access
45
Intel Nehalem microarchitecture Feature summary Dedicated power management integrated micro-controller with embedded firmware that manages power consumption embedded real-time sensors for temperature, current, and power integrated power gates to turn individual cores' power on/off
46
Intel Nehalem microarchitecture High level chip overview
47
Intel Nehalem microarchitecture Processor’s pipeline
48
Intel Nehalem microarchitecture In-Order Front End
49
Intel Nehalem microarchitecture Out-of-Order Execution Engine
50
Intel Nehalem microarchitecture Cache memory hierarchy
51
Intel Nehalem microarchitecture On-chip memory hierarchy
52
Intel Nehalem microarchitecture Nehalem 8-way cc-NUMA platform