1 Modern general-purpose processors

2 Post-RISC architecture Instruction & arithmetic pipelining Superscalar architecture Data flow analysis Branch prediction Cache memory Multimedia extensions

3 General structure Front end Fetching Decoding Dispatching for execution Back end / Execution engine Execution

4 Pipelining & superscalar execution

5 Deep & Narrow (Intel Pentium 4) Extremely long pipelines Lower number of execution units Wide & Shallow (Motorola 74xx, previously G4) Relatively shallow pipelines Higher number of execution units

6 Wide & Shallow vs. Deep & Narrow Motorola 74xx's approach vs. Intel Pentium 4's approach

7 Post-RISC Architecture Motorola 74xx aka PowerPC G4 Intel Pentium 4

8 Data flow analysis Instruction parallelism vs machine parallelism Instruction-issue policies In-Order Issue with In-Order Completion In-Order Issue with Out-of-Order Completion Out-of-Order Issue with Out-of-Order Completion

9 Data flow analysis Example: i1 requires 2 cycles to execute; i3 and i4 are executed by the same execution unit (U3); i5 uses the result of i4; i5 and i6 are executed by the same execution unit (U2). Hypothetical processor: parallel fetching and decoding of 2 instructions (Decode), three execution units (Execute), parallel storing of 2 results (Writeback)

10 Data flow analysis In-Order Issue with In-Order Completion
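The in-order issue / in-order completion policy can be sketched with a toy scheduler run on slide 9's example. This is a simplified model, not the slide's exact diagram: it issues at most one instruction per cycle, and the unit assignments and 1-cycle latencies for i2–i6 are assumptions.

```python
# Toy model of in-order issue with in-order completion.
# Each instruction is (unit, latency, deps), where deps are indices of
# earlier instructions whose results this instruction reads.
def in_order_schedule(instrs):
    unit_free = {}   # execution unit -> first cycle it is free again
    finish = []      # per-instruction writeback cycle
    prev_issue = -1
    for unit, latency, deps in instrs:
        # in-order issue: cannot start before the previous instruction,
        # before operands are ready, or while the unit is busy
        start = max([prev_issue + 1] +
                    [finish[d] for d in deps] +
                    [unit_free.get(unit, 0)])
        end = start + latency
        if finish:                      # in-order completion: a result may
            end = max(end, finish[-1])  # not retire before earlier ones
        unit_free[unit] = start + latency
        finish.append(end)
        prev_issue = start
    return finish

# Slide 9's example (latencies other than i1's 2 cycles are assumed):
# i3/i4 share U3, i5 reads i4's result, i5/i6 share U2.
instrs = [("U1", 2, []),       # i1: 2 cycles
          ("U1", 1, []),       # i2
          ("U3", 1, []),       # i3
          ("U3", 1, []),       # i4
          ("U2", 1, [3]),      # i5 depends on i4
          ("U2", 1, [])]       # i6
print(in_order_schedule(instrs))   # -> [2, 3, 4, 5, 6, 7]
```

Under this model every instruction retires strictly in program order, which is exactly the constraint that out-of-order completion (slide 11) relaxes.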

11 Data flow analysis In-Order Issue with Out-of-Order Completion

12 Data flow analysis In-Order Issue with Out-of-Order Completion Problem of Output Dependency (Write-Write Dependency) Example: R3 := R3 op R5 (i1) R4 := R3 + 1 (i2) R3 := R5 + 1 (i3) R7 := R3 op R4 (i4) i1 and i2, i3 and i4 – True Data Dependency (Flow Dependency, Write-Read Dependency) i1 and i3 – Output Dependency (Write-Write Dependency)

13 Data flow analysis Out-of-Order Issue with Out-of-Order Completion

14 Data flow analysis Out-of-Order Issue with Out-of-Order Completion Problem of Antidependency (Read-Write Dependency) Example: R3 := R3 op R5 (i1) R4 := R3 + 1 (i2) R3 := R5 + 1 (i3) R7 := R3 op R4 (i4) i1 and i2, i3 and i4 – True Data Dependency (Flow Dependency, Write-Read Dependency) i2 and i3 – Antidependency (Read-Write Dependency)
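The dependency kinds named in slides 12 and 14 can be detected mechanically from each instruction's destination and source registers; a sketch (RAW = read-after-write, i.e. true/flow; WAW = write-after-write, i.e. output; WAR = write-after-read, i.e. anti):

```python
# Classify the dependencies of a later instruction j on an earlier
# instruction i. Each instruction is (dest, set_of_sources).
def classify(i, j):
    kinds = []
    di, si = i
    dj, sj = j
    if di in sj:
        kinds.append("RAW (true/flow)")   # j reads what i wrote
    if di == dj:
        kinds.append("WAW (output)")      # both write the same register
    if dj in si:
        kinds.append("WAR (anti)")        # j overwrites what i read
    return kinds

# The slides' example:
# i1: R3 := R3 op R5    i2: R4 := R3 + 1
# i3: R3 := R5 + 1      i4: R7 := R3 op R4
i1 = ("R3", {"R3", "R5"})
i2 = ("R4", {"R3"})
i3 = ("R3", {"R5"})
i4 = ("R7", {"R3", "R4"})
print(classify(i1, i2))   # ['RAW (true/flow)']
print(classify(i1, i3))   # ['WAW (output)', 'WAR (anti)']
print(classify(i2, i3))   # ['WAR (anti)']
```

Note that i1/i3 carry a WAR dependency in addition to the WAW the slide highlights (i3 overwrites the R3 that i1 reads); only the RAW dependencies constrain the actual data flow.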

15 Data flow analysis Duplication of resources: register renaming Problem of Antidependency (Read-Write Dependency) Example: R3b := R3a op R5a (i1) R4a := R3b + 1 (i2) R3c := R5a + 1 (i3) R7b := R3c op R4a (i4) i1 and i2, i3 and i4 – True Data Dependency (Flow Dependency, Write-Read Dependency) i1 and i3 – Output Dependency (Write-Write Dependency), removed by renaming i2 and i3 – Antidependency (Read-Write Dependency), removed by renaming
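The renaming above (the a/b/c subscripts) can be sketched as a simple renamer that allocates a fresh physical register for every write, so only true (RAW) dependencies remain:

```python
# Minimal register-renaming sketch (simplified: unbounded physical
# register file, no freeing). Architectural registers R0..R7 start
# mapped to physical registers P0..P7.
def rename(instrs, n_arch_regs=8):
    mapping = {f"R{i}": f"P{i}" for i in range(n_arch_regs)}
    next_phys = n_arch_regs
    renamed = []
    for dest, srcs in instrs:
        new_srcs = [mapping[s] for s in srcs]   # read current mappings
        mapping[dest] = f"P{next_phys}"         # fresh register per write
        next_phys += 1
        renamed.append((mapping[dest], new_srcs))
    return renamed

# i1..i4 from the slides
prog = [("R3", ["R3", "R5"]),
        ("R4", ["R3"]),
        ("R3", ["R5"]),
        ("R7", ["R3", "R4"])]
for dest, srcs in rename(prog):
    print(dest, ":=", " op ".join(srcs))
```

After renaming, i1 and i3 write different physical registers (no WAW) and i3's write no longer clobbers the register i2 reads (no WAR), which is what allows out-of-order issue to proceed safely.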

16 Branch prediction Various techniques Predict Never Taken Predict Always Taken Predict by Opcode Taken/Not Taken Switch Branch History Table & Branch Target Buffer Additional enhancements Advanced branch prediction algorithms Loop predictor Indirect branch predictor
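The Taken/Not Taken switch is commonly realized as a 2-bit saturating counter; a sketch (a real Branch History Table holds one such counter per branch-address slot, which is omitted here):

```python
# 2-bit saturating counter: two mispredictions in a row are needed to
# flip the prediction, so a loop branch's single exit misprediction
# does not disturb the next iteration of the loop.
class TwoBitPredictor:
    def __init__(self):
        self.state = 0   # 0,1 = predict not taken; 2,3 = predict taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop-like branch pattern: taken six times, not taken once (loop
# exit), then taken again.
p = TwoBitPredictor()
outcomes = [True] * 6 + [False] + [True] * 3
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "of", len(outcomes), "predicted correctly")   # 7 of 10
```

After warm-up the counter absorbs the single loop-exit misprediction without flipping, which a 1-bit scheme would not.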

17 Post-RISC Architecture

18 Multimedia/SIMD extensions Characteristics of multimedia applications Narrow data types Typical width of data in memory: 8-16 bits Typical width of data during computation: 16-32 bits Fixed-point arithmetic often replaces floating-point arithmetic Fine-grain data parallelism High predictability of branches High instruction locality in small loops or kernels Memory requirements High bandwidth requirements but can tolerate high latency High spatial locality (predictable access pattern) but low temporal locality
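The fixed-point substitution mentioned above can be illustrated with a minimal Q15 sketch (the Q15 format choice is an assumption for illustration; it is a common 16-bit fixed-point layout with 1 sign bit and 15 fraction bits):

```python
# Q15 fixed-point: real value x is stored as round(x * 2^15).
def to_q15(x):
    return int(round(x * 32768))

def from_q15(q):
    return q / 32768

def q15_mul(a, b):
    # the raw product has 30 fraction bits; shift back down to 15
    return (a * b) >> 15

a, b = to_q15(0.5), to_q15(0.25)
print(from_q15(q15_mul(a, b)))   # -> 0.125
```

Multiplication becomes one integer multiply plus one shift, which is why DSP-style multimedia kernels used it heavily before fast SIMD floating point was common.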

19 Multimedia/SIMD extensions Subword parallelism A technique present in the early vector supercomputers TI ASC and CDC Star-100
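Subword parallelism can even be imitated in software with plain integer arithmetic (the "SWAR" trick); a sketch, assuming a 32-bit word holding four 8-bit lanes:

```python
# Add four packed 8-bit lanes with ordinary integer ops, preventing
# carries from crossing lane boundaries: add the low 7 bits of each
# lane normally, then fix up each lane's top bit with XOR.
def add_4x8(a, b):
    low = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)
    return low ^ ((a ^ b) & 0x80808080)

def pack(lanes):
    x = 0
    for v in reversed(lanes):
        x = (x << 8) | (v & 0xFF)
    return x

def unpack(x):
    return [(x >> (8 * i)) & 0xFF for i in range(4)]

a = pack([10, 200, 30, 250])
b = pack([20, 100, 40, 10])
print(unpack(add_4x8(a, b)))   # per-lane wrap-around: [30, 44, 70, 4]
```

Hardware SIMD extensions (MMX, SSE, AltiVec) provide exactly this lane-isolated behavior, including saturating variants, without the masking overhead.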

20 Multimedia/SIMD extensions Good performance at a relatively low cost No significant modifications in the organization of existing processors

21 Symmetrical Multiprocessing, Superthreading, Hyperthreading Single-threaded CPU Single-threaded SMP system

22 Symmetrical Multiprocessing, Superthreading, Hyperthreading Superthreaded CPU vs. Hyperthreaded CPU

23 Implementation of Hyperthreading Replicated Register renaming logic Instruction Pointer ITLB Return stack predictor Various other architectural registers Partitioned Re-order buffers (ROBs) Load/Store buffers Various queues, like the scheduling queues, uop queue, etc. Shared Caches: trace cache, L1, L2, L3 Microarchitectural registers Execution Units

24 Approaches to 64-bit processing IA-64 (VLIW/EPIC) AMD x86-64 (Intel 64)

25 AMD x86-64 Extended registers Increased number of registers Switching modes Legacy x86 32-bit mode (32-bit OS) Long x86 64-bit mode (64-bit OS) 64-bit mode (x86-64 applications) Compatibility mode (x86 applications) No performance penalty for running in legacy or compatibility mode Removal of some outdated x86 features

26 AMD x86-64

27

28 Very Large Instruction Word (VLIW) / Explicitly Parallel Instruction Processing (EPIC)

29 Current performance limiters Branches Memory latency Implicit Parallel Instruction Processing

30 Very Large Instruction Word (VLIW) / Explicitly Parallel Instruction Processing (EPIC)

31

32

33

34 Predication mechanism

35 Very Large Instruction Word (VLIW) / Explicitly Parallel Instruction Processing (EPIC) Speculation mechanism

36 Multi-core processing Taking advantage of explicit high-level parallelism (process / thread level parallelism) Time Instruction-Level Parallelism (ILP) Time Thread-Level Parallelism (TLP)

37 Intel Core microarchitecture Combination and improvement of solutions employed in earlier Intel architectures mainly Pentium M (successor of the P6/Pentium Pro microarchitecture) Pentium 4 Design for multicore and lowered power consumption No simplification of the core structure in favor of multiple cores on a single die

38 Intel Core microarchitecture Features summary Wide Dynamic Execution 14-stage core pipeline 4 decoders to decode up to 5 instructions per cycle 3 clusters of arithmetic logic units macro-fusion and micro-fusion to improve front-end throughput peak dispatch rate of up to 6 micro-ops per cycle peak retirement rate of up to 4 micro-ops per cycle advanced branch prediction algorithms stack pointer tracker to improve efficiency of procedure entries and exits Advanced Smart Cache 2nd-level cache up to 4 MB with 16-way associativity 256-bit internal data path from L2 to L1 data caches

39 Intel Core microarchitecture Features summary Smart Memory Access hardware pre-fetchers to reduce effective latency of 2nd-level cache misses hardware pre-fetchers to reduce effective latency of 1st-level data cache misses "memory disambiguation" to improve efficiency of speculative instruction execution Advanced Digital Media Boost single-cycle inter-completion latency of most 128-bit SIMD instructions up to eight single-precision floating-point operations per cycle 3 issue ports available for dispatching SIMD instructions for execution

40 Intel Core microarchitecture Processor’s pipeline & execution units

41 Intel Core microarchitecture Chosen new features Instruction fusion Macro-fusion – fusion of certain pairs of x86 instructions (a compare or test followed by a conditional jump) into a single micro-op in the predecode phase Micro-op fusion (first introduced with Pentium M) – fusion/pairing of the micro-ops generated when translating certain two-micro-op x86 instructions (load-and-op instructions, and stores, which split into store-address and store-data)

42 Intel Core microarchitecture Chosen new features Memory disambiguation Data-stream-oriented speculative execution as a way of dealing with false memory aliasing The memory disambiguator predicts which loads do not depend on any earlier store When it predicts that a load has no such dependency, the load takes its data speculatively from the L1 data cache The prediction is verified later; if an actual conflict is detected, the load and all subsequent instructions are re-executed
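A toy model of that predict-then-verify scheme (the names and structure here are illustrative, not Intel's implementation):

```python
# A load predicted independent of pending older stores executes early
# from the cache; the prediction is then checked against the store
# queue, and a real address conflict forces a replay with the
# forwarded store value.
def run_load(addr, cache, pending_stores, predict_independent=True):
    if predict_independent:
        value = cache.get(addr, 0)    # speculative early load
        conflict = any(s_addr == addr for s_addr, _ in pending_stores)
        if conflict:
            # misprediction: take the youngest matching store's value
            value = [v for a, v in pending_stores if a == addr][-1]
            return value, "replayed"
        return value, "speculated ok"
    # conservative alternative: drain all older stores first
    for a, v in pending_stores:
        cache[a] = v
    return cache.get(addr, 0), "waited"

cache = {0x100: 7}
stores = [(0x200, 99)]                  # older store, different address
print(run_load(0x100, cache, stores))   # -> (7, 'speculated ok')
print(run_load(0x200, cache, stores))   # -> (99, 'replayed')
```

The win comes from the common case: most loads really are independent of nearby stores, so they stop waiting for every older store address to resolve.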

43 Intel Nehalem microarchitecture Feature summary Enhanced processor core improved branch prediction and reduced recovery cost after misprediction enhancements in loop streaming to improve front-end performance and reduce power consumption deeper buffering in the out-of-order engine to sustain higher levels of instruction-level parallelism enhanced execution units with accelerated processing of CRC, string/text and data-shuffling operations Hyper-Threading technology (SMT): support for two hardware threads (logical processors) per core

44 Intel Nehalem microarchitecture Feature summary Smarter Memory Access integrated (on-chip) memory controller supporting low-latency access to local system memory and scalable overall memory bandwidth (previously the memory controller resided on a separate chip, shared by all processors in dual- or quad-socket systems) new cache hierarchy organization with a shared, inclusive L3 to reduce snoop traffic two-level TLBs and increased TLB sizes faster unaligned memory access

45 Intel Nehalem microarchitecture Feature summary Dedicated power management integrated micro-controller with embedded firmware managing power consumption embedded real-time sensors for temperature, current, and power integrated power gates to turn per-core power on or off

46 Intel Nehalem microarchitecture High level chip overview

47 Intel Nehalem microarchitecture Processor’s pipeline

48 Intel Nehalem microarchitecture In-Order Front End

49 Intel Nehalem microarchitecture Out-of-Order Execution Engine

50 Intel Nehalem microarchitecture Cache memory hierarchy

51 Intel Nehalem microarchitecture On-chip memory hierarchy

52 Intel Nehalem microarchitecture Nehalem 8-way cc-NUMA platform

