Modern general-purpose processors: post-RISC architecture, instruction & arithmetic pipelining, superscalar architecture, data flow analysis, branch prediction.


Modern general-purpose processors

- Post-RISC architecture
- Instruction & arithmetic pipelining
- Superscalar architecture
- Data flow analysis
- Branch prediction
- Cache memory
- Multimedia extensions

General structure
- Front end: fetching, decoding, dispatching for execution
- Back end / execution engine: execution

Pipelining & superscalar execution

Deep & narrow (Intel Pentium 4)
- Extremely long pipelines
- Smaller number of execution units
Wide & shallow (Motorola 74xx, previously known as G4)
- Relatively shallow pipelines
- Larger number of execution units

Wide & shallow vs. deep & narrow: Motorola 74xx's approach vs. Intel Pentium 4's approach

Post-RISC architecture: Motorola 74xx (aka PowerPC G4) and Intel Pentium 4

Data flow analysis
Instruction parallelism vs. machine parallelism
Instruction-issue policies:
- In-order issue with in-order completion
- In-order issue with out-of-order completion
- Out-of-order issue with out-of-order completion

Data flow analysis
Example assumptions:
- i1 requires 2 cycles to execute
- i3 and i4 are executed by the same execution unit (U3)
- i5 uses the result of i4
- i5 and i6 are executed by the same execution unit (U2)
Hypothetical processor:
- parallel fetching and decoding of 2 instructions (Decode)
- three execution units (Execute)
- parallel storing of 2 results (Writeback)

Data flow analysis In-Order Issue with In-Order Completion

Data flow analysis In-Order Issue with Out-of-Order Completion

Data flow analysis
In-order issue with out-of-order completion: the problem of output dependency (write-write dependency)
Example:
  R3 := R3 op R5   (i1)
  R4 := R3 + 1     (i2)
  R3 := R5 + 1     (i3)
  R7 := R3 op R4   (i4)
- i1 and i2; i3 and i4 – true data dependency (flow dependency, write-read dependency)
- i1 and i3 – output dependency (write-write dependency)

Data flow analysis Out-of-Order Issue with Out-of-Order Completion

Data flow analysis
Out-of-order issue with out-of-order completion: the problem of antidependency (read-write dependency)
Example:
  R3 := R3 op R5   (i1)
  R4 := R3 + 1     (i2)
  R3 := R5 + 1     (i3)
  R7 := R3 op R4   (i4)
- i1 and i2; i3 and i4 – true data dependency (flow dependency, write-read dependency)
- i2 and i3 – antidependency (read-write dependency)
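The three dependency classes in these examples can be identified mechanically from each instruction's destination and source registers. A minimal Python sketch (the tuple-based instruction encoding is an illustrative simplification, not any real ISA format):

```python
# Classify data dependencies between an earlier and a later instruction.
# Each instruction is modeled as (destination_register, set_of_source_registers).

def classify(earlier, later):
    """Return the set of dependency types from `earlier` to `later`."""
    e_dst, e_src = earlier
    l_dst, l_src = later
    deps = set()
    if e_dst in l_src:
        deps.add("RAW")   # true data dependency (write-read)
    if e_dst == l_dst:
        deps.add("WAW")   # output dependency (write-write)
    if l_dst in e_src:
        deps.add("WAR")   # antidependency (read-write)
    return deps

# The four-instruction example from the slides:
i1 = ("R3", {"R3", "R5"})   # R3 := R3 op R5
i2 = ("R4", {"R3"})         # R4 := R3 + 1
i3 = ("R3", {"R5"})         # R3 := R5 + 1
i4 = ("R7", {"R3", "R4"})   # R7 := R3 op R4

print(classify(i1, i2))  # {'RAW'} -- true dependency
print(classify(i1, i3))  # contains 'WAW' (i1 also reads R3, so a WAR appears too)
print(classify(i2, i3))  # {'WAR'} -- antidependency
```

The out-of-order scheduler must honor RAW edges, while WAW and WAR edges only constrain it because register names are reused, which is what register renaming removes.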

Data flow analysis
Duplication of resources: register renaming removes output dependencies and antidependencies
Example (letter suffixes denote distinct physical instances of each architectural register):
  R3b := R3a op R5a   (i1)
  R4a := R3b + 1      (i2)
  R3c := R5a + 1      (i3)
  R7b := R3c op R4a   (i4)
- i1 and i2; i3 and i4 – true data dependency (flow dependency, write-read dependency)
- i1 and i3 – output dependency (write-write dependency)
- i2 and i3 – antidependency (read-write dependency)
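The renaming scheme above can be sketched with a register alias table (RAT): every write is assigned a fresh physical register, so WAW and WAR hazards disappear while RAW links are preserved. A hedged Python sketch (physical-register naming and table sizes are illustrative assumptions):

```python
# Register renaming via a register alias table (RAT).
# Each architectural write gets a fresh physical register; reads use
# the mapping that was current when the instruction was renamed.

def rename(instructions, num_arch_regs=8):
    """instructions: list of (dest, [sources]) with names like 'R3'.
    Returns the same program rewritten onto physical registers 'P0', 'P1', ..."""
    rat = {f"R{i}": f"P{i}" for i in range(num_arch_regs)}
    next_phys = num_arch_regs
    out = []
    for dst, srcs in instructions:
        phys_srcs = [rat[s] for s in srcs]   # read current mappings first
        rat[dst] = f"P{next_phys}"           # fresh register for the write
        next_phys += 1
        out.append((rat[dst], phys_srcs))
    return out

prog = [("R3", ["R3", "R5"]),  # i1: R3 := R3 op R5
        ("R4", ["R3"]),        # i2: R4 := R3 + 1
        ("R3", ["R5"]),        # i3: R3 := R5 + 1
        ("R7", ["R3", "R4"])]  # i4: R7 := R3 op R4

for dst, srcs in rename(prog):
    print(dst, ":=", ", ".join(srcs))
# i1 and i3 now write different physical registers (no WAW), and i3's
# write no longer conflicts with i2's read (no WAR); i2 still reads
# i1's result, so the true RAW dependency survives.
```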

Branch prediction
Various techniques:
- Predict never taken
- Predict always taken
- Predict by opcode
- Taken/not-taken switch
- Branch history table & branch target buffer
Additional enhancements:
- Advanced branch prediction algorithms
- Loop predictor
- Indirect branch predictor
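The "taken/not-taken switch" is commonly implemented as a 2-bit saturating counter per branch history table entry. A minimal sketch (the table size and PC-indexing scheme are illustrative assumptions, not any particular processor's design):

```python
# 2-bit saturating-counter branch predictor over a small branch history table.
# Counter states: 0,1 = predict not taken; 2,3 = predict taken. Two wrong
# outcomes in a row are needed to flip a saturated prediction.

class TwoBitPredictor:
    def __init__(self, table_size=1024):
        self.table = [1] * table_size   # start weakly not-taken
        self.size = table_size

    def predict(self, pc):
        return self.table[pc % self.size] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = pc % self.size
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop branch taken 9 times and then falling through: once the counter
# saturates, only the first and last iterations mispredict.
p = TwoBitPredictor()
hits = 0
for taken in [True] * 9 + [False]:
    if p.predict(0x400) == taken:
        hits += 1
    p.update(0x400, taken)
print(hits, "/ 10 correct")  # → 8 / 10 correct
```

This is why a dedicated loop predictor (listed above) helps: it can learn the trip count and also predict the final not-taken iteration correctly.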

Post-RISC Architecture

Multimedia/SIMD extensions
Characteristics of multimedia applications:
- Narrow data types
  - typical width of data in memory: 8-16 bits
  - typical width of data during computation: bits
- Fixed-point arithmetic often replaces floating-point arithmetic
- Fine-grain (data) parallelism
- High predictability of branches
- High instruction locality in small loops or kernels
- Memory requirements: high bandwidth, but can tolerate high latency
- High spatial locality (predictable pattern) but low temporal locality

Multimedia/SIMD extensions
Subword parallelism: a technique already present in early vector supercomputers such as the TI ASC and the CDC Star-100

Multimedia/SIMD extensions Good performance at a relatively low cost No significant modifications in the organization of existing processors
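Subword parallelism is what makes these extensions cheap: one wide ALU operates on several narrow lanes at once, as long as carries are kept from crossing lane boundaries. A Python sketch of a packed 8-bit add on four lanes of a 32-bit word (the SWAR masking trick shown here is a classic illustration, not the code of any specific extension):

```python
# SWAR (SIMD-within-a-register): add four 8-bit lanes packed into a
# 32-bit integer without letting carries cross lane boundaries.

LO7 = 0x7F7F7F7F   # low 7 bits of every byte
MSB = 0x80808080   # top bit of every byte

def paddb(a, b):
    """Per-lane 8-bit add with wraparound, done as one word-wide operation."""
    low = (a & LO7) + (b & LO7)          # add low 7 bits; no inter-lane carry
    return (low ^ ((a ^ b) & MSB)) & 0xFFFFFFFF  # XOR in the lanes' top bits

def pack(lanes):
    """Pack four 8-bit values; lane 0 is the least significant byte."""
    w = 0
    for i, v in enumerate(lanes):
        w |= (v & 0xFF) << (8 * i)
    return w

def unpack(w):
    return [(w >> (8 * i)) & 0xFF for i in range(4)]

print(unpack(paddb(pack([10, 20, 250, 40]), pack([1, 2, 10, 4]))))
# → [11, 22, 4, 44]; lane 2 wraps around: (250 + 10) mod 256 = 4
```

Hardware SIMD units do the same thing with dedicated lane boundaries in the adders, which is why the slide's point holds: the cost over an existing wide datapath is small.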

Symmetric multiprocessing, superthreading, hyperthreading: single-threaded CPU vs. single-threaded SMP system

Symmetric multiprocessing, superthreading, hyperthreading: superthreaded CPU vs. hyperthreaded CPU

Implementation of hyperthreading
Replicated:
- register renaming logic
- instruction pointer
- ITLB
- return stack predictor
- various other architectural registers
Partitioned:
- re-order buffers (ROBs)
- load/store buffers
- various queues: scheduling queues, uop queue, etc.
Shared:
- caches: trace cache, L1, L2, L3
- microarchitectural registers
- execution units

Approaches to 64-bit processing: IA-64 (VLIW/EPIC) and AMD x86-64 (Intel 64)

AMD x86-64
- Extended registers
- Increased number of registers
- Switching modes:
  - Legacy x86 32-bit mode (32-bit OS)
  - Long x86 64-bit mode (64-bit OS):
    - 64-bit mode (x86-64 applications)
    - Compatibility mode (x86 applications)
- No performance penalty for running in legacy or compatibility mode
- Removal of some outdated x86 features

AMD x86-64

Very Long Instruction Word (VLIW) / Explicitly Parallel Instruction Computing (EPIC)

Current performance limiters:
- Branches
- Memory latency
- Implicit parallel instruction processing

Very Long Instruction Word (VLIW) / Explicitly Parallel Instruction Computing (EPIC)

Predication mechanism
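Predication, central to EPIC/IA-64, converts short branches into straight-line code: both arms execute, and complementary predicate registers decide which writes commit. A hedged Python sketch of this if-conversion (the `cmp_gt`/`pmov` helpers mimic the IA-64 compare/predicated-move idiom but are hypothetical names, not real instructions):

```python
# If-conversion sketch: replace a branch with predicated execution.
# Both arms run unconditionally; the predicate decides which write commits.

regs = {"r1": 7, "r2": 3, "r3": 0}   # toy architectural register file

def cmp_gt(a, b):
    """Set a pair of complementary predicates, like IA-64's cmp.gt p1, p2 = a, b."""
    p = regs[a] > regs[b]
    return p, not p

def pmov(pred, dst, src):
    """Predicated move: the write commits only when its predicate is true."""
    if pred:
        regs[dst] = regs[src]

# if (r1 > r2) r3 = r1; else r3 = r2;   -- with no branch at all:
p1, p2 = cmp_gt("r1", "r2")
pmov(p1, "r3", "r1")   # always issued, commits iff p1
pmov(p2, "r3", "r2")   # always issued, commits iff p2
print(regs["r3"])      # → 7 (the maximum of r1 and r2)
```

The payoff is that a hard-to-predict branch disappears entirely, at the cost of issuing both arms; this addresses the "branches" entry in the performance-limiter list above.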

Very Large Instruction Word (VLIW) / Explicitly Parallel Instruction Processing (EPIC) Speculation mechanism

Multi-core processing
Taking advantage of explicit high-level parallelism (process-/thread-level parallelism)
Instruction-level parallelism (ILP) vs. thread-level parallelism (TLP)

Intel Core microarchitecture
Combination and improvement of solutions employed in earlier Intel architectures, mainly:
- Pentium M (successor of the P6/Pentium Pro microarchitecture)
- Pentium 4
Designed for multicore operation and lowered power consumption
No simplification of the core structure in favor of multiple cores on a single die

Intel Core microarchitecture
Features summary
Wide dynamic execution:
- 14-stage core pipeline
- 4 decoders, decoding up to 5 instructions per cycle
- 3 clusters of arithmetic logic units
- macro-fusion and micro-fusion to improve front-end throughput
- peak dispatch rate of up to 6 micro-ops per cycle
- peak retirement rate of up to 4 micro-ops per cycle
- advanced branch prediction algorithms
- stack pointer tracker to improve the efficiency of procedure entries and exits
Advanced smart cache:
- 2nd-level cache of up to 4 MB with 16-way associativity
- 256-bit internal data path from the L2 to the L1 data caches

Intel Core microarchitecture
Features summary
Smart memory access:
- hardware prefetchers to reduce the effective latency of 2nd-level cache misses
- hardware prefetchers to reduce the effective latency of 1st-level data cache misses
- "memory disambiguation" to improve the efficiency of speculative instruction execution
Advanced digital media boost:
- single-cycle inter-completion latency of most 128-bit SIMD instructions
- up to eight single-precision floating-point operations per cycle
- 3 issue ports available for dispatching SIMD instructions to execution

Intel Core microarchitecture Processor’s pipeline & execution units

Intel Core microarchitecture
Chosen new features: instruction fusion
- Macro-fusion: fusion of certain pairs of x86 instructions (compare and test) into a single micro-op in the predecode phase
- Micro-op fusion (first introduced with the Pentium M): fusion/pairing of the micro-ops generated when translating certain x86 instructions that decode into two micro-ops (load-and-op, and stores (store-address and store-data))

Intel Core microarchitecture
Chosen new features: memory disambiguation
- Data-stream-oriented speculative execution as a way of dealing with false memory aliasing
- The memory disambiguator predicts which loads do not depend on any previous stores
- When the disambiguator predicts that a load has no such dependency, the load takes its data from the L1 data cache
- Eventually the prediction is verified; if an actual conflict is detected, the load and all succeeding instructions are re-executed
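The predict/verify/replay loop described above can be sketched in a few lines; everything here (the per-load-PC table, the optimistic default, the training policy) is an illustrative assumption, not Intel's actual design:

```python
# Memory-disambiguation sketch: predict whether a load aliases any older
# in-flight store; speculate past the stores when predicted independent,
# and replay the load (and its successors) if the check finds a conflict.

class Disambiguator:
    def __init__(self):
        self.no_alias = {}   # load PC -> predicted independent of older stores?

    def predict_independent(self, load_pc):
        return self.no_alias.get(load_pc, True)   # optimistic by default

    def train(self, load_pc, conflicted):
        # A detected conflict makes this load conservative from then on.
        self.no_alias[load_pc] = not conflicted

def execute_load(d, load_pc, load_addr, pending_store_addrs):
    speculated = d.predict_independent(load_pc)
    conflict = load_addr in pending_store_addrs   # the later verification step
    d.train(load_pc, conflict)
    if speculated and conflict:
        return "replay"       # mis-speculation: re-execute load + successors
    return "speculated" if speculated else "waited"

d = Disambiguator()
print(execute_load(d, 0x10, 0x2000, {0x3000}))  # → speculated (no alias)
print(execute_load(d, 0x20, 0x2000, {0x2000}))  # → replay (real conflict)
print(execute_load(d, 0x20, 0x2000, {0x2000}))  # → waited (now conservative)
```

The design trade-off is the one the slide implies: most loads really are independent of nearby stores, so optimistic speculation wins on average despite the occasional costly replay.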

Intel Nehalem microarchitecture
Feature summary
Enhanced processor core:
- improved branch prediction and lower recovery cost after a misprediction
- enhancements in loop streaming to improve front-end performance and reduce power consumption
- deeper buffering in the out-of-order engine to sustain higher levels of instruction-level parallelism
- enhanced execution units with accelerated processing of CRC, string/text, and data-shuffling operations
Hyper-threading technology (SMT):
- support for two hardware threads (logical processors) per core

Intel Nehalem microarchitecture
Feature summary
Smarter memory access:
- integrated (on-chip) memory controller supporting low-latency access to local system memory and overall scalable memory bandwidth (previously the memory controller was hosted on a separate chip, shared by all processors in dual- or quad-socket systems)
- new cache hierarchy organization with a shared, inclusive L3 to reduce snoop traffic
- two-level TLBs and increased TLB sizes
- faster unaligned memory access

Intel Nehalem microarchitecture
Feature summary
Dedicated power management:
- integrated micro-controller with embedded firmware that manages power consumption
- embedded real-time sensors for temperature, current, and power
- integrated power gates to turn per-core power on and off

Intel Nehalem microarchitecture High level chip overview

Intel Nehalem microarchitecture Processor’s pipeline

Intel Nehalem microarchitecture In-Order Front End

Intel Nehalem microarchitecture Out-of-Order Execution Engine

Intel Nehalem microarchitecture Cache memory hierarchy

Intel Nehalem microarchitecture On-chip memory hierarchy

Intel Nehalem microarchitecture Nehalem 8-way cc-NUMA platform