Published by Gilbert Richardson. Modified over 8 years ago.
1
Modern general-purpose processors
2
Post-RISC architecture Instruction & arithmetic pipelining Superscalar architecture Data flow analysis Branch prediction Cache memory Multimedia extensions
3
General structure Front end Fetching Decoding Dispatching for execution Back end / Execution engine Execution
4
Pipelining & superscalar execution
5
Deep & Narrow (Intel Pentium 4) Extremely long pipelines Lower number of execution units Wide & Shallow (Motorola 74xx, a.k.a. PowerPC G4) Relatively shallow pipelines Higher number of execution units
6
Wide & Shallow vs. Deep & Narrow Motorola 74xx’s approach vs. Intel Pentium 4’s approach
7
Post-RISC Architecture Motorola 74xx aka PowerPC G4 Intel Pentium 4
8
Data flow analysis Instruction parallelism vs machine parallelism Instruction-issue policies In-Order Issue with In-Order Completion In-Order Issue with Out-of-Order Completion Out-of-Order Issue with Out-of-Order Completion
9
Data flow analysis Example: i1 requires 2 cycles to execute; i3 and i4 are executed by the same execution unit (U3); i5 uses the result of i4; i5 and i6 are executed by the same execution unit (U2). Hypothetical processor: parallel fetching and decoding of 2 instructions (Decode), three execution units (Execute), parallel storing of 2 results (Writeback)
10
Data flow analysis In-Order Issue with In-Order Completion
11
Data flow analysis In-Order Issue with Out-of-Order Completion
12
Data flow analysis In-Order Issue with Out-of-Order Completion Problem of Output Dependency (Write-Write Dependency) Example:
R3 := R3 op R5 (i1)
R4 := R3 + 1 (i2)
R3 := R5 + 1 (i3)
R7 := R3 op R4 (i4)
i1 and i2, i3 and i4 – True Data Dependency (Flow Dependency, Write-Read Dependency)
i1 and i3 – Output Dependency (Write-Write Dependency)
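The dependency classes in the example above can be derived mechanically from each instruction's read and write sets. A minimal Python sketch (the instruction encoding and the `classify` helper are illustrative, not part of any real issue logic):

```python
# Classify data dependencies between two instructions, each given as a
# (writes, reads) pair of register-name sets.
def classify(first, second):
    """Return the dependency types of `second` on the earlier `first`."""
    w1, r1 = first
    w2, r2 = second
    deps = []
    if w1 & r2:
        deps.append("RAW (true/flow dependency)")
    if w1 & w2:
        deps.append("WAW (output dependency)")
    if r1 & w2:
        deps.append("WAR (antidependency)")
    return deps

# Slide example: R3 := R3 op R5 (i1); R4 := R3 + 1 (i2);
#                R3 := R5 + 1 (i3);  R7 := R3 op R4 (i4)
i1 = ({"R3"}, {"R3", "R5"})
i2 = ({"R4"}, {"R3"})
i3 = ({"R3"}, {"R5"})
i4 = ({"R7"}, {"R3", "R4"})

print("i1->i2:", classify(i1, i2))   # RAW: i2 reads the R3 written by i1
print("i1->i3:", classify(i1, i3))   # WAW: both write R3 (plus a WAR, since i1 also reads R3)
print("i2->i3:", classify(i2, i3))   # WAR: i3 overwrites the R3 that i2 reads
```

In-order completion hides the WAW hazard automatically; once completion goes out of order, the hardware must track it explicitly.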
13
Data flow analysis Out-of-Order Issue with Out-of-Order Completion
14
Data flow analysis Out-of-Order Issue with Out-of-Order Completion Problem of Antidependency (Read-Write Dependency) Example:
R3 := R3 op R5 (i1)
R4 := R3 + 1 (i2)
R3 := R5 + 1 (i3)
R7 := R3 op R4 (i4)
i1 and i2, i3 and i4 – True Data Dependency (Flow Dependency, Write-Read Dependency)
i2 and i3 – Antidependency (Read-Write Dependency)
15
Data flow analysis Duplication of resources, register renaming Problem of Antidependency (Read-Write Dependency) Example:
R3b := R3a op R5a (i1)
R4a := R3b + 1 (i2)
R3c := R5a + 1 (i3)
R7b := R3c op R4a (i4)
i1 and i2, i3 and i4 – True Data Dependency (Flow Dependency, Write-Read Dependency)
i1 and i3 – Output Dependency (Write-Write Dependency)
i2 and i3 – Antidependency (Read-Write Dependency)
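Register renaming can be sketched in a few lines: every architectural write is given a fresh physical register, and reads are redirected to the most recent mapping, so only the true (RAW) dependencies survive. A minimal Python sketch (the instruction format and physical-register names `P0`, `P1`, … are illustrative):

```python
from itertools import count

def rename(program):
    """program: list of (dest, sources) over architectural registers.
    Returns the same instructions over freshly allocated physical registers."""
    fresh = count()
    mapping = {}                      # architectural -> current physical register

    def phys(r):
        if r not in mapping:          # first read of an unseen register
            mapping[r] = f"P{next(fresh)}"
        return mapping[r]

    renamed = []
    for dest, sources in program:
        srcs = tuple(phys(s) for s in sources)   # read the current mappings first
        mapping[dest] = f"P{next(fresh)}"        # then allocate a new physical dest
        renamed.append((mapping[dest], srcs))
    return renamed

# Slide example: R3 := R3 op R5; R4 := R3 + 1; R3 := R5 + 1; R7 := R3 op R4
prog = [("R3", ("R3", "R5")), ("R4", ("R3",)), ("R3", ("R5",)), ("R7", ("R3", "R4"))]
for dest, srcs in rename(prog):
    print(dest, "<-", srcs)
```

After renaming, the two writes to R3 land in different physical registers (mirroring the slide's R3b/R3c subscripts), so the WAW and WAR hazards disappear and i3 can issue out of order.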
16
Branch prediction Various techniques Predict Never Taken Predict Always Taken Predict by Opcode Taken/Not Taken Switch Branch History Table & Branch Target Buffer Additional enhancements Advanced branch prediction algorithms Loop predictor Indirect branch predictor
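The "Taken/Not Taken Switch" above is classically implemented as a 2-bit saturating counter, the building block of a Branch History Table. A minimal Python sketch (the class name and interface are illustrative, not taken from any real design):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken.
    The hysteresis means one atypical outcome does not flip the prediction."""

    def __init__(self):
        self.state = 2                      # start weakly taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 3)   # saturate at strongly taken
        else:
            self.state = max(self.state - 1, 0)   # saturate at strongly not-taken

p = TwoBitPredictor()
history = [True, True, False, True, True, True]   # e.g. a loop branch with one exit-like miss
hits = 0
for taken in history:
    hits += (p.predict() == taken)
    p.update(taken)
print(f"correct predictions: {hits}/{len(history)}")   # correct predictions: 5/6
```

A real BHT indexes many such counters by (part of) the branch address; the Branch Target Buffer supplies the predicted target so fetch can redirect without waiting for decode.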
17
Post-RISC Architecture
18
Multimedia/SIMD extensions Characteristics of multimedia applications Narrow data types Typical width of data in memory 8-16 bits Typical width of data during computation 16-32 bits Fixed-point arithmetic often replaces floating-point arithmetic Fine grain (data parallelism) High predictability of branches High instruction locality in small loops or kernels Memory requirements High bandwidth requirements but can tolerate high latency High spatial locality (predictable pattern) but low temporal locality
19
Multimedia/SIMD extensions Subword parallelism A technique already present in the early vector supercomputers TI ASC and CDC Star-100
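Subword parallelism means treating one wide register as several independent narrow lanes. The effect can be emulated with ordinary integer arithmetic ("SWAR"): the sketch below, with illustrative names, adds four 8-bit lanes packed into one 32-bit word in a single addition, masking so that carries never cross lane boundaries:

```python
def packed_add_u8(x, y):
    """Lane-wise 8-bit addition with wrap-around on 32-bit packed values."""
    # Add the low 7 bits of each lane; the masked-off top bits cannot
    # generate a carry into the neighbouring lane.
    low = (x & 0x7F7F7F7F) + (y & 0x7F7F7F7F)
    # Fold the top bit of each lane back in with XOR (addition mod 2).
    return low ^ ((x ^ y) & 0x80808080)

a = int.from_bytes(bytes([1, 2, 3, 250]), "big")
b = int.from_bytes(bytes([1, 1, 1, 10]), "big")
print(packed_add_u8(a, b).to_bytes(4, "big"))   # b'\x02\x03\x04\x04' - the 250+10 lane wraps
```

Multimedia extensions such as MMX/SSE or AltiVec provide this directly in hardware, with dedicated packed-add instructions (including saturating variants) over much wider registers.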
20
Multimedia/SIMD extensions Good performance at a relatively low cost No significant modifications in the organization of existing processors
21
Symmetric Multiprocessing, Superthreading, Hyperthreading Single-threaded CPU Single-threaded SMP system
22
Symmetric Multiprocessing, Superthreading, Hyperthreading Superthreaded CPU vs. Hyperthreaded CPU
23
Implementation of Hyperthreading Replicated Register renaming logic Instruction Pointer ITLB Return stack predictor Various other architectural registers Partitioned Re-order buffers (ROBs) Load/Store buffers Various queues, like the scheduling queues, uop queue, etc. Shared Caches: trace cache, L1, L2, L3 Microarchitectural registers Execution Units
24
Approaches to 64-bit processing IA-64 (VLIW/EPIC) AMD x86-64 (Intel 64)
25
AMD x86-64 Extended registers Increased number of registers Switching modes Legacy x86 32-bit mode (32-bit OS) Long x86 64-bit mode (64-bit OS) 64-bit mode (x86-64 applications) Compatibility mode (x86 applications) No performance penalty for running in legacy or compatibility mode Removal of some outdated x86 features
26
AMD x86-64
28
Very Large Instruction Word (VLIW) / Explicitly Parallel Instruction Processing (EPIC)
29
Current performance limiters Branches Memory latency Implicit parallel instruction processing (the hardware must rediscover instruction-level parallelism at run time)
30
Very Large Instruction Word (VLIW) / Explicitly Parallel Instruction Processing (EPIC)
34
Predication mechanism
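Predication removes a branch by executing both sides and letting a predicate select which result is committed. IA-64 does this per instruction with predicate registers; the idea can be emulated as a branchless bitmask select, sketched here in Python with illustrative names on 32-bit values:

```python
def predicated_select(pred, a, b):
    """Return a if pred else b, with no data-dependent branch.
    Emulates predication: both operands are 'computed', the predicate
    decides which one survives."""
    mask = -int(pred) & 0xFFFFFFFF            # all ones if pred is true, else zero
    return (a & mask) | (b & ~mask & 0xFFFFFFFF)

print(predicated_select(True, 10, 20))    # 10
print(predicated_select(False, 10, 20))   # 20
```

The win: a hard-to-predict branch becomes straight-line code, so a misprediction flush is impossible; the cost is that both sides always consume execution resources.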
35
Very Large Instruction Word (VLIW) / Explicitly Parallel Instruction Processing (EPIC) Speculation mechanism
36
Multi-core processing Taking advantage of explicit high-level parallelism (process / thread level parallelism) Time Instruction-Level Parallelism (ILP) Time Thread-Level Parallelism (TLP)
37
Intel Core microarchitecture Combination and improvement of solutions employed in earlier Intel architectures, mainly Pentium M (successor of the P6/Pentium Pro microarchitecture) and Pentium 4 Designed for multi-core operation and lowered power consumption No simplification of the core structure in favor of multiple cores on a single die
38
Intel Core microarchitecture Features summary Wide Dynamic Execution 14-stage core pipeline 4 decoders to decode up to 5 instructions per cycle 3 clusters of arithmetic logical units macro-fusion and micro-fusion to improve front-end throughput peak dispatching rate of up to 6 micro-ops per cycle peak retirement rate of up to 4 micro-ops per cycle advanced branch prediction algorithms stack pointer tracker to improve efficiency of procedure entries and exits Advanced Smart Cache 2nd level cache up to 4 MB with 16-way associativity 256 bit internal data path from L2 to L1 data caches
39
Intel Core microarchitecture Features summary Smart Memory Access hardware pre-fetchers to reduce effective latency of 2nd level cache misses hardware pre-fetchers to reduce effective latency of 1st level data cache misses "memory disambiguation" to improve efficiency of speculative instruction execution Advanced Digital Media Boost single-cycle throughput for most 128-bit SIMD instructions up to eight single-precision floating-point operations per cycle 3 issue ports available for dispatching SIMD instructions to execution
40
Intel Core microarchitecture Processor’s pipeline & execution units
41
Intel Core microarchitecture Chosen new features Instruction fusion Macro-fusion – fusion of certain pairs of x86 instructions (a compare or test followed by a conditional jump) into a single micro-op in the predecode phase Micro-op fusion (first introduced with the Pentium M) – fusion/pairing of the micro-ops generated when translating certain x86 instructions that decode into two micro-ops (load-and-op, and stores (store-address and store-data))
42
Intel Core microarchitecture Chosen new features Memory disambiguation Data-stream-oriented speculative execution as a way of dealing with false memory aliasing The memory disambiguator predicts which loads do not depend on any earlier in-flight store When the disambiguator predicts that a load has no such dependency, the load takes its data from the L1 data cache early The prediction is verified later; if an actual conflict is detected, the load and all subsequent instructions are re-executed
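The predict-then-verify flow above can be sketched as a small simulation. Everything here (the store-queue representation, the `execute_load` helper, the replay-by-forwarding shortcut) is a simplified stand-in for illustration, not Intel's actual design:

```python
def execute_load(addr, cache, store_queue, predict_independent):
    """Return (value, mispredicted) for a load at `addr`.
    store_queue: older in-flight stores as (address, value) pairs, oldest first."""
    if predict_independent:
        value = cache.get(addr, 0)                  # speculative early load from L1
        conflict = any(s_addr == addr for s_addr, _ in store_queue)
        if conflict:                                # misprediction: replay the load,
            value = dict(store_queue)[addr]         # taking the forwarded store data
        return value, conflict
    # Conservative path: wait and forward from the youngest matching store.
    for s_addr, s_val in reversed(store_queue):
        if s_addr == addr:
            return s_val, False
    return cache.get(addr, 0), False

cache = {0x100: 7}
stores = [(0x200, 99)]                              # an older, unresolved store
print(execute_load(0x100, cache, stores, True))     # (7, False): speculation paid off
print(execute_load(0x200, cache, stores, True))     # (99, True): conflict, load replayed
```

The bet pays off because true load-store aliasing is rare; in hardware a misprediction also squashes every instruction that consumed the stale value.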
43
Intel Nehalem microarchitecture Feature summary Enhanced processor core improved branch prediction and lower recovery cost after misprediction enhancements in loop streaming to improve front-end performance and reduce power consumption deeper buffering in the out-of-order engine to sustain higher levels of instruction-level parallelism enhanced execution units with accelerated processing of CRC, string/text and data-shuffling operations Hyper-Threading technology (SMT) support for two hardware threads (logical processors) per core
44
Intel Nehalem microarchitecture Feature summary Smarter Memory Access integrated (on-chip) memory controller supporting low-latency access to local system memory and scalable overall memory bandwidth (previously the memory controller sat on a separate chip, shared by all processors in dual- or quad-socket systems) new cache hierarchy organization with a shared, inclusive L3 to reduce snoop traffic two-level TLBs and increased TLB sizes faster unaligned memory access
45
Intel Nehalem microarchitecture Feature summary Dedicated power management integrated micro-controller with embedded firmware that manages power consumption embedded real-time sensors for temperature, current, and power integrated power gates to turn individual cores' power on/off
46
Intel Nehalem microarchitecture High level chip overview
47
Intel Nehalem microarchitecture Processor’s pipeline
48
Intel Nehalem microarchitecture In-Order Front End
49
Intel Nehalem microarchitecture Out-of-Order Execution Engine
50
Intel Nehalem microarchitecture Cache memory hierarchy
51
Intel Nehalem microarchitecture On-chip memory hierarchy
52
Intel Nehalem microarchitecture Nehalem 8-way cc-NUMA platform