ECE8833 Polymorphous and Many-Core Computer Architecture, Prof. Hsien-Hsin S. Lee, School of Electrical and Computer Engineering. Lecture 1: Early ILP Processors and Performance Bound Model

Decoupled Access/Execute Computer Architectures, James E. Smith, ACM TOCS, 1984 (an earlier version was published in ISCA 1982).

Background of DAE (circa 1982): written at a time when vector machines were dominant. Example sequence: LV v1, mem[a1]; MULV v3, v2, v1; ADDV v5, v4, v3. [Figure: timeline comparing execution of the three instructions with and without vector chaining (Cray-1), using 64-bit vector registers.]

Background of DAE (circa 1982), continued: the same LV v1, mem[a1]; MULV v3, v2, v1; ADDV v5, v4, v3 sequence viewed as a dataflow pipeline. [Figure: data streams from memory into v1, through MUL (with v2) into v3, and through ADD (with v4) into v5.] What about a modern SIMD ISA?
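For reference, the element-wise computation that this three-instruction vector sequence performs can be written as a plain scalar loop; a minimal C sketch (the function name and the 64-element vector length are illustrative assumptions, the latter following the Cray-1 convention):

#include <stddef.h>

#define VLEN 64  /* Cray-style vector registers hold 64 elements */

/* v5 = v4 + v2 * mem[a1], element by element -- what the LV / MULV /
   ADDV sequence computes.  With chaining, the three vector instructions
   overlap: each element flows from the load into the multiplier and then
   into the adder without waiting for the previous instruction to finish
   the whole vector. */
void chained_kernel(const double *a1, const double *v2,
                    const double *v4, double *v5)
{
    double v1[VLEN], v3[VLEN];
    for (size_t i = 0; i < VLEN; i++) {
        v1[i] = a1[i];           /* LV   v1, mem[a1] */
        v3[i] = v2[i] * v1[i];   /* MULV v3, v2, v1  */
        v5[i] = v4[i] + v3[i];   /* ADDV v5, v4, v3  */
    }
}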

Today's state of the art? Intel AVX, Intel Larrabee NI (Larrabee New Instructions).
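As a rough answer to the slide's question, the same multiply-add chain maps directly onto a modern SIMD ISA. A hedged sketch using AVX/FMA intrinsics (the function name, the choice of double precision, and the assumption that n is a multiple of 4 are mine; building it requires a compiler and CPU with AVX and FMA support, e.g. gcc -mavx -mfma):

#include <immintrin.h>
#include <stddef.h>

/* v5 = v4 + v2 * a1, four doubles at a time, with a fused multiply-add --
   roughly the modern counterpart of the chained LV / MULV / ADDV sequence. */
void simd_kernel(const double *a1, const double *v2,
                 const double *v4, double *v5, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {            /* assumes n % 4 == 0 */
        __m256d x1 = _mm256_loadu_pd(&a1[i]);      /* vector loads       */
        __m256d x2 = _mm256_loadu_pd(&v2[i]);
        __m256d x4 = _mm256_loadu_pd(&v4[i]);
        __m256d r  = _mm256_fmadd_pd(x2, x1, x4);  /* x2*x1 + x4         */
        _mm256_storeu_pd(&v5[i], r);               /* vector store       */
    }
}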

DAE, circa 1982: fine-grained parallelism, vector vs. superscalar. What about scalar performance? Remember Flynn's bottleneck? (Page 290.)

Flynn's Bottleneck: ILP ≈ 1.86 for programs on the IBM 7090; essentially, Flynn argued that one cannot execute more than about one instruction per cycle when ILP is exploited only within basic blocks. Riseman & Foster ['72] studied breaking control dependences on a perfect machine model, with benchmarks including numerical programs, an assembler, and a compiler. [Figure/table: average ILP as a function of the number of bypassed jumps (0, 1, 2, 8, 32, 128, and infinitely many), plus a control-flow graph of basic blocks BB0-BB4.]

DAE, circa 1982/1984. Issues in the CDC 6600 & IBM 360/91: overlapping instructions via OoO execution leads to complex control and a slower clock, offsetting the benefit. These complex issue methods were abandoned by their manufacturers: less determinism, problems in hardware debugging, errors that may not be reproducible. The complexity can instead be shifted to system software.

Decoupled Access/Execute Architecture: an architecture with two instruction streams to break Flynn's bottleneck, one for the Access processor and one for the eXecute processor (hey, this was the 1980s). Separate register files (A0, A1, ..., An-1 and X0, X1, ..., Xm-1), which can be totally incompatible. Synchronization issue?

DAE

Data Movement: paired data-in and data-out queues; the XLQ and XSQ are specified as registers at the ISA level.

Register-to-Register Synchronization: register copies between Xi and Aj (in either direction).

Branch Synch-up: one processor runs ahead and resolves the branch; the other executes an unconditional jump (the BFQ instruction). Branch outcomes queued in the XBQ can be used to reduce I-fetch in the X-processor.

DAE Code Example
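The code example on this slide is a figure and is not transcribed. As a stand-in, here is a minimal, purely illustrative C sketch of the idea (all names are mine, and a real DAE machine runs the two streams concurrently on separate processors rather than as back-to-back loops): an access stream issues the loads and pushes data into an XLQ-like FIFO, while an execute stream pops its operands from that queue, computes, and pushes results into an XSQ-like FIFO that is later drained to memory.

#include <stdio.h>

#define N    8
#define QCAP 16

/* A trivial FIFO standing in for the architectural queues (XLQ/XSQ). */
typedef struct { double buf[QCAP]; int head, tail; } queue_t;
static void   q_push(queue_t *q, double v) { q->buf[q->tail++ % QCAP] = v; }
static double q_pop(queue_t *q)            { return q->buf[q->head++ % QCAP]; }

int main(void)
{
    double b[N] = {1, 2, 3, 4, 5, 6, 7, 8}, c[N];
    queue_t xlq = {{0}, 0, 0}, xsq = {{0}, 0, 0};

    /* Access stream: address arithmetic and loads; loaded data is pushed
       into the XLQ.  On a DAE machine this stream can "slip" ahead.      */
    for (int i = 0; i < N; i++)
        q_push(&xlq, b[i]);

    /* Execute stream: pops operands from the XLQ, computes c[i] = 2*b[i] + 1,
       and pushes the results into the XSQ; it never computes addresses.      */
    for (int i = 0; i < N; i++)
        q_push(&xsq, 2.0 * q_pop(&xlq) + 1.0);

    /* Access stream again: drains the XSQ to memory (the stores). */
    for (int i = 0; i < N; i++)
        c[i] = q_pop(&xsq);

    for (int i = 0; i < N; i++)
        printf("c[%d] = %g\n", i, c[i]);
    return 0;
}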

Modern Issue Considerations: despite being an '82/'84 paper, it already considers issues such as precise exceptions (next slide).

Precise Exceptions: the simple approach is to force instructions to complete in order; in DAE, this is applied to each of the two streams separately. The slide's example of imprecise-exception issues shows that care is required when coding the Access and Execute programs.

Requirements for Precise Exceptions

Why (and How) Does It Work? Average speedup = 1.58 on the Livermore Fortran Kernels (LFK), and execution on the two processors is somewhat balanced. Why? It works nicely on LFK because the X-processor's computation is not that fast (6-cycle FP add, 7-cycle FP multiply), while the A-processor takes care of memory (11-cycle loads) and branch resolution.

Disadvantages of the DAE Architecture: 1. writing two separate programs (in what high-level language? who should do it?); 2. some duplication in hardware (instruction memory/cache, instruction fetch unit, decoder).

Interleaving Instruction Streams: use a bit to tag each instruction's stream (A or X); no split branch instructions. (1) X7 serves as the XLQ or XSQ; (2) once loaded, it is used only once; (3) it must be stored after the X-processor writes to it.

Summary of the DAE Architecture: 2-wide issue per cycle; allows a constrained form of OoO execution in which data accesses can be done well in advance (i.e., "slip" ahead), enabling a certain level of data prefetching. This was novel in 1982!

The ZS-1 Central Processor, James E. Smith et al., ASPLOS-II, 1987.

Astronautics ZS-1 Central Processor: a realization of DAE (by the same author). The instruction stream is decoupled into fixed-point/memory operations and floating-point operations, which communicate via architectural queues. The machine is extensively pipelined: 22.5 MFLOPS, 45 MIPS.

ZS-1 Central Processor: 31 A (and X) registers + 1 queue entry = 5-bit encoded operands. [Block diagram: the queues communicate with memory; one instruction buffer holds 24 instructions, the other holds 4.]

ZS-1 Central Processor: an instruction cannot be issued unless its dependences are resolved; a load may bypass independent stores; load-load and store-store order is maintained.

Can a Load Bypass a Load? Why not? Core 1 executes (1) Load R1, (A) and then (2) Load R2, (A); Core 2 executes (3) Store (A), R3. Initially (A) = 100 and R3 = 25. What's wrong with the execution order (2)(3)(1)?
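To make the problem concrete, the snippet below replays the (2)(3)(1) ordering sequentially (a sketch of one interleaving, not real concurrent code): the later load in program order, (2), returns the old value 100, while the earlier load, (1), returns the new value 25, so Core 1 observes its two same-address loads out of program order with respect to the store.

#include <stdio.h>

int main(void)
{
    int A = 100, R3 = 25;   /* initial memory location (A) and Core 2's R3 */
    int R1, R2;

    /* Execution order (2)(3)(1) instead of program order (1)(2)(3): */
    R2 = A;                 /* (2) Core 1: Load R2, (A)  -> 100        */
    A  = R3;                /* (3) Core 2: Store (A), R3 -> (A) = 25   */
    R1 = A;                 /* (1) Core 1: Load R1, (A)  -> 25         */

    /* R1 = 25, R2 = 100: the older load saw the newer value, an outcome
       no program-order execution of Core 1's two loads could produce.  */
    printf("R1=%d R2=%d\n", R1, R2);
    return 0;
}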

ZS-1: Processing of Two Iterations. [Pipeline diagram; stage legend: S = splitter, B = instruction buffer read, D = decode, I = issue, E = execute.]

IBM RS/6000 and POWER: evolved from IBM ACS and the 801; the foundation of the POWER architecture (Performance Optimization With Enhanced RISC). The early POWER1 system used 10 discrete chips; single-chip implementations appeared in the RSC and in a later POWER2 version called the P2SC.

POWER2 Processor Node: 8 discrete chips on an MCM; 66.7 MHz, 6-issue (2 issue slots reserved for branch/compare). 2 FXUs handle memory, integer, and logical operations, 2 per cycle. The FPU's 3 dual-pipe execution units can perform 2 DP fused multiply-adds, 2 FP loads, and 2 FP stores per cycle. [Block diagram: Instruction Cache Unit with a 32KB I-cache, dispatch, and dual branch processors; Fixed-Point Unit (FXU) with an instruction buffer and execution units with and without multiply/divide; Floating-Point Unit (FPU) with an instruction buffer, arithmetic, store, and load execution units, and sync; Data Cache Unit (DCU) on 4 separate chips (32KB each); Storage Control Unit; Memory Unit (64MB - 512MB); optional secondary cache (1 or 2MB).]

MACS Performance Bound Model: a model for analyzing achievable performance (mostly FP) in scientific applications. [Figure: a hierarchy of bounds from the M bound, to the MA bound, to the MAC bound, to the MACS bound, down to the physically measured actual run time; the successive differences are Gap A, Gap C, Gap S, and Gap P.]

MACS Performance Bound Model, the gaps (what keeps you from attaining peak performance):
Gap A: excessive loads/stores (more than the essential ones, e.g., for a[i] = b[i]) and loop bookkeeping.
Gap C (reason we may want to have 432?): hardware restrictions (architectural registers), redundant instructions, and load/store overhead in function calls.
Gap S: a weak scheduling algorithm and resource conflicts that prevent a tighter schedule; solution: modulo scheduling to compact the code.
Gap P: cache misses, inter-core communication, and system effects (e.g., context switches); solutions: prefetching, loop blocking, loop fusion, loop interchange, etc. (a loop-blocking sketch follows below).
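As one concrete example of the Gap-P mitigations listed above, loop blocking (tiling) restructures a loop nest so that the data it touches stays resident in cache; a generic C sketch (not taken from the slides; the matrix-transpose kernel and the tile size are illustrative choices):

#define TILE 64  /* tile size; tuned to the cache size in practice */

/* Naive transpose: the write stream strides through b with poor locality. */
void transpose_naive(int n, const double *a, double *b)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            b[j * n + i] = a[i * n + j];
}

/* Blocked transpose: works on TILE x TILE sub-blocks so that both the
   source and the destination tiles fit in cache at the same time.      */
void transpose_blocked(int n, const double *a, double *b)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int i = ii; i < ii + TILE && i < n; i++)
                for (int j = jj; j < jj + TILE && j < n; j++)
                    b[j * n + i] = a[i * n + j];
}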

POWER2 M Bound (ideal, ideal): peak = 1 FMA to each of the 2 FPU pipelines per cycle = 0.25 CPF (cycles per FLOP), since 2 fused multiply-adds deliver 4 FLOPs per cycle. [Figure: the FPU portion of the block diagram: dispatch, instruction buffer, and the arithmetic, store, and load execution units.]

POWER2 MA Bound (ideal compiler and everything else): 1. take the visible workload of the high-level application; 2. calculate the essential operations that must be performed. The bound is the time needed for all FP operations, counting only the essential, minimum FP operations required to complete the computation; a weighting factor of 4 for divide and square root is a common choice to reflect their cost relative to other FP operations.
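Spelled out, a plausible form of this bound in my own notation (a hedged reconstruction from the ingredients the slide lists, not the slide's exact formula), assuming the M-bound peak of 4 FLOPs per cycle and essential operation counts N_add, N_mul, N_div, N_sqrt:

t_{MA} \approx \frac{N_{add} + N_{mul} + 4\,(N_{div} + N_{sqrt})}{4} \ \text{cycles},
\qquad
\mathrm{CPF}_{MA} = \frac{t_{MA}}{N_{add} + N_{mul} + N_{div} + N_{sqrt}}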

POWER2 MA Bound (ideal compiler and everything else), machine assumptions: 2 FP pipelines; at most 4 dispatches to the FPU and FXU per cycle; other fixed-point work considered irrelevant; a simplified memory model; non-pipelined FP ops.

POWER2 MAC Bound: computed like the MA bound, but using the actual, compiler-generated instruction count.

POWER2 MACS Bound: computed like the MAC bound, but the numerator comes from the actual compiler-scheduled code.

IBM SP2 Performance Bound: a later extension of the model that adds an inter-processor communication bound.