Multiscalar Processors
Presented by Matthew Misler
Gurindar S. Sohi, Scott E. Breach, T. N. Vijaykumar
University of Wisconsin-Madison, ISCA '95
Scalar Processors
[Figure: instruction queue feeding a single execution unit, one instruction at a time]
  addu $20, $20, 16
  ld $23, SYMVAL-16($20)
  move $17, $21
  beq $17, $0, SKIPINNER
  ld $8, LELE($17)
SuperScalar Processors
[Figure: the same instruction queue feeding multiple execution units in parallel]
Fetch-Execute Paradigm
- Has been around for about 60 years
- Superscalar processors execute instructions out of order
  - Sometimes the re-ordering is done in hardware
  - Sometimes in software
  - Sometimes both
- Only a partial ordering of instructions must be respected
Control Flow Graphs
- Segments are split on control dependencies (conditional branches); a data-structure sketch follows below
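A minimal sketch, in C, of how tasks could be represented as pieces of the CFG. All type and field names here are my assumptions for illustration, not the paper's.

```c
/* Hypothetical sketch: a task is a connected piece of the CFG, entered
 * at one basic block and exited to one of several successor tasks. */
#include <stdint.h>

#define MAX_SUCC 4

typedef struct BasicBlock {
    uint32_t start_pc;                 /* address of first instruction */
    uint32_t end_pc;                   /* address of last instruction  */
    struct BasicBlock *succ[2];        /* fall-through / branch target */
} BasicBlock;

typedef struct Task {
    BasicBlock *entry;                 /* first basic block of the task   */
    uint32_t    succ_pc[MAX_SUCC];     /* possible next-task entry points */
    int         num_succ;
} Task;
```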
Sequential "Walk"
- Walk through the CFG with enough parallelism
- Use speculative execution and branch prediction to raise the level of parallelism
- Sequential semantics must be preserved
- Can still execute out of order, but must commit in order
Multiscalars and Tasks
- CFG is broken down into tasks
- Multiscalar steps through the CFG at the task level
- No inspection of the instructions within a task
- Each task is assigned to one processing unit
- Multiple tasks can execute in parallel
Multiscalar Microarchitecture
- Sequencer
- Queue of processing units
  - Unidirectional ring
  - Each has an instruction cache, processing element, and register file
- Interconnect
- Data banks
  - Each has an address resolution buffer and a data cache
(a struct-level sketch of these components follows below)
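To make the component list concrete, here is a struct-level sketch in C. All sizes and field names are my assumptions; the paper does not prescribe them.

```c
/* Hypothetical sketch of the major multiscalar components. */
#include <stdint.h>

#define NUM_UNITS 4
#define NUM_BANKS 4

typedef struct ProcessingUnit {
    uint32_t icache_tags[256];   /* per-unit instruction cache (simplified) */
    uint64_t regfile[32];        /* per-unit register file                  */
    int      busy;               /* currently executing an assigned task?   */
} ProcessingUnit;

typedef struct DataBank {
    uint32_t arb_addrs[64];      /* address resolution buffer entries */
    uint8_t  dcache[4096];       /* data cache storage                */
} DataBank;

typedef struct Multiscalar {
    ProcessingUnit unit[NUM_UNITS];  /* connected in a unidirectional ring */
    DataBank       bank[NUM_BANKS];  /* reached through the interconnect   */
    int head, tail;                  /* oldest / newest task in the queue  */
} Multiscalar;
```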
Multiscalar Microarchitecture
[Figure: block diagram of the multiscalar hardware organization]
Outline: Multiscalar Microarchitecture, Tasks, Multiscalars In-Depth, Distribution of Cycles, Comparison to Other Paradigms, Performance, Conclusion
Tasks
- Sequencer distributes a task to a processing unit
- The unit fetches and executes the task until completion
- The instruction window is bounded by
  - The first instruction in the earliest executing task
  - The last instruction in the latest executing task
- So? Instruction windows can be huge
Tasks Example
- CFG nodes: A B C D E
- The dynamic walk is split across the units:
  - Head unit: A B C B B C D
  - Middle unit: A B B C D
  - Tail unit: A B C B C D E
Tasks
- Hold true to sequential semantics inside each task
- Enforce sequential order on tasks overall
  - The circular queue takes care of this part (a commit sketch follows below)
- In the previous example:
  - Head of the queue does A B C B B C D
  - Middle unit does A B B C D
  - Tail of the queue does A B C B C D E
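A minimal sketch, assuming a simple done-flag per unit, of how a circular queue enforces overall sequential order: only the task at the head may commit. The names are mine, not the paper's.

```c
/* Hypothetical circular queue of processing units; the head task
 * commits first, preserving sequential semantics across tasks. */
typedef struct Queue { int head, tail, count, size; } Queue;

static int advance(const Queue *q, int i) { return (i + 1) % q->size; }

/* Called each cycle: retire finished tasks strictly from the head.
 * A later task that finished early must wait for all earlier tasks. */
void try_commit(Queue *q, const int *task_done /* indexed by unit */)
{
    while (q->count > 0 && task_done[q->head]) {
        q->head = advance(q, q->head);   /* commit head, free its unit */
        q->count--;
    }
}
```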
Tasks
- Registers
  - Create mask: the registers a task may produce for a future task
  - Values are forwarded down the ring
  - Accum mask: union of the create masks of the active tasks
  - (a bitmask sketch follows below)
- Memory
  - If it is a known producer-consumer pair, then synchronize on the loads and stores
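The register masks are naturally bit vectors over the architectural registers. A minimal sketch in C (my formulation, consistent with the description above):

```c
/* Hypothetical sketch: create mask = registers a task may produce;
 * accum mask = union of the create masks of the active tasks. */
#include <stdint.h>

typedef uint32_t RegMask;               /* bit r set => register $r */

RegMask accum_mask(const RegMask *create, int num_active)
{
    RegMask accum = 0;
    for (int i = 0; i < num_active; i++)
        accum |= create[i];             /* union of the create masks */
    return accum;
}

/* A task must wait for register r only if some earlier active task
 * may still produce it. */
int must_wait(RegMask accum, int r) { return (accum >> r) & 1; }
```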
Tasks
- Memory (cont'd)
- Unknown producer-consumer relationship:
  - Conservative approach: wait; this degenerates to sequential operation
  - Aggressive approach: speculate; requires dynamic checking, squashing, and recovery
- (both policies are sketched below)
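A sketch of the two load policies, assuming hypothetical helpers (`store_may_be_pending`, `arb_record_load`, etc.) that stand in for the hardware mechanisms:

```c
#include <stdint.h>

/* assumed helpers standing in for hardware mechanisms */
extern int      store_may_be_pending(uint32_t addr);
extern void     stall_one_cycle(void);
extern uint32_t mem_read(uint32_t addr);
extern void     arb_record_load(uint32_t addr, int task_id);

/* Conservative: wait until no earlier task can still store to addr.
 * Never mis-speculates, but serializes the tasks. */
uint32_t load_conservative(uint32_t addr)
{
    while (store_may_be_pending(addr))
        stall_one_cycle();
    return mem_read(addr);
}

/* Aggressive: load immediately and record the access, so that a later
 * conflicting store from an earlier task triggers a squash and recovery. */
uint32_t load_aggressive(uint32_t addr, int task_id)
{
    arb_record_load(addr, task_id);
    return mem_read(addr);
}
```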
Outline: Multiscalar Basics, Tasks, Multiscalars In-Depth, Distribution of Cycles, Comparison to Other Paradigms, Performance, Conclusion
Multiscalar Programs
- Code for the tasks
- Structure of the CFG and tasks
- Communication between tasks
- Small changes to the existing ISA
  - Add specification of tasks
  - No major overhaul
Control Flow Graph Structure
- Successors: recorded in a task descriptor
- Producing and consuming values
  - Forward register values on the last update
  - Compiler can mark instructions: operate and forward
- Stopping conditions
  - Special condition, evaluate conditions, complete
- All of these can be viewed as tag bits
(a possible descriptor encoding follows below)
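One possible encoding of a task descriptor as a C struct. The layout and widths are my assumptions; the paper only lists what information a descriptor carries.

```c
/* Hypothetical task descriptor: successors, create mask, and
 * stop-condition tag bits, mirroring the items listed above. */
#include <stdint.h>

typedef struct TaskDescriptor {
    uint32_t first_pc;      /* address of the task's first instruction */
    uint32_t succ_pc[4];    /* possible successor tasks                */
    uint32_t create_mask;   /* registers the task may produce          */
    uint8_t  stop_bits;     /* special / evaluate / complete tag bits  */
} TaskDescriptor;
```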
Multiscalar Hardware
- Walks through the CFG
  - Assigns tasks to processing units
  - Executes tasks in a 'sequential' order
- Sequencer fetches the task descriptors
  - Using the address of the first instruction
  - Specifying the create masks
  - Constructing the accum mask
  - Using the task descriptor to predict the successor
(a sequencer-loop sketch follows below)
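Putting those steps together, a minimal sketch of one sequencer step, reusing the `TaskDescriptor` from the previous sketch; `fetch_descriptor`, `predict_successor`, and `assign_to_unit` are assumed helpers, not the paper's interface.

```c
#include <stdint.h>

/* TaskDescriptor as in the previous sketch */
typedef struct TaskDescriptor {
    uint32_t first_pc;
    uint32_t succ_pc[4];
    uint32_t create_mask;
    uint8_t  stop_bits;
} TaskDescriptor;

/* assumed helpers */
extern TaskDescriptor *fetch_descriptor(uint32_t pc);
extern int  predict_successor(const TaskDescriptor *td); /* index into succ_pc */
extern void assign_to_unit(int unit, const TaskDescriptor *td);

/* One step of the sequencer's walk over the CFG. */
void sequencer_step(uint32_t *next_pc, uint32_t *accum_mask, int unit)
{
    TaskDescriptor *td = fetch_descriptor(*next_pc);  /* by first-instruction address */
    *accum_mask |= td->create_mask;     /* fold the task's create mask in */
    assign_to_unit(unit, td);           /* the unit fetches and executes  */
    *next_pc = td->succ_pc[predict_successor(td)];    /* predict successor */
}
```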
Multiscalar Hardware
- Data banks
  - Updates to the cache are not speculative
- Use of the Address Resolution Buffer
  - Detects violations of dependencies
  - Initiates corrective actions (a check is sketched below)
  - If it runs out of space, squash tasks
    - Except at the head of the queue; it does not use the ARB
  - Can stall rather than squash
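A minimal sketch of the dependence check the ARB performs on a store. The entry format and `squash_from` are my assumptions for illustration:

```c
/* Hypothetical ARB check: a store from an earlier task squashes any
 * later task that already (speculatively) loaded the same address. */
#include <stdint.h>

#define ARB_ENTRIES 64

typedef struct ArbEntry {
    uint32_t addr;
    int      loaded_by_task;   /* task that performed the load */
    int      valid;
} ArbEntry;

extern void squash_from(int task_id);  /* squash this task and all later ones */

void arb_store_check(ArbEntry *arb, uint32_t addr, int store_task)
{
    for (int i = 0; i < ARB_ENTRIES; i++) {
        if (arb[i].valid && arb[i].addr == addr &&
            arb[i].loaded_by_task > store_task) {
            /* a later task read this address too early:
             * dependence violation, so squash and recover */
            squash_from(arb[i].loaded_by_task);
        }
    }
}
```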
Multiscalar Hardware
- Remember the earlier architectural picture?
Multiscalar Hardware
- It is not the only possible architecture
  - Possible design with shared functional units
  - Possible design with the ARB and data cache on the same side as the processing units
- Scaling the interconnect is non-trivial
- (Glossed over; see the last paragraph of page 5 of the paper)
Outline: Multiscalar Basics, Tasks, Multiscalars In-Depth, Distribution of Cycles, Comparison to Other Paradigms, Performance, Conclusion
Distribution of Cycles
- Wasted cycles:
  - Non-useful computation: squashed
  - No computation: waiting
  - Remains idle: no assigned task
Distribution of Cycles
- Non-useful computation cycles
  - Determine useless computation early
    - Validate the prediction early: check whether the next task was predicted correctly
    - E.g., test for loop exit at the start of the loop
  - Tasks violating sequentiality are squashed
    - To avoid this, try to synchronize memory communication as register communication is synchronized
    - Could delay the load for a number of cycles
    - Can use signal-wait synchronization (sketched below)
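A minimal sketch of signal-wait synchronization between a known producer store and a consumer load in a later task; the flag and helper are my illustrative stand-ins for the hardware mechanism:

```c
#include <stdint.h>

extern void stall_one_cycle(void);   /* assumed helper */

static volatile int      ready;      /* the 'signal' flag      */
static volatile uint32_t shared_val; /* the communicated value */

void producer_store(uint32_t v)      /* runs in the earlier task */
{
    shared_val = v;
    ready = 1;                       /* signal: safe to consume  */
}

uint32_t consumer_load(void)         /* runs in the later task */
{
    while (!ready)
        stall_one_cycle();           /* wait instead of mis-speculating */
    return shared_val;
}
```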
Distribution of Cycles
- No computation cycles (contrast with having no assigned task)
  - Dependencies within the same task
  - Dependencies between tasks (earlier/later)
  - Load balancing
Outline: Multiscalar Basics, Tasks, Multiscalars In-Depth, Distribution of Cycles, Comparison to Other Paradigms, Performance, Conclusion
Comparison to Other Paradigms
- Branch prediction
  - The sequencer only needs to predict branches between tasks
- Wide instruction window
  - Conventionally, every instruction in the window must be checked for readiness to issue
  - In multiscalar, relatively few instructions are up for inspection at a time
Comparison to Other Paradigms
- Issue logic
  - Superscalar processors need issue logic that grows as n²
  - Multiscalar logic is distributed; each processing unit issues instructions independently
- Loads and stores
  - Normally need sequence numbers for managing the load/store buffers
  - In multiscalar, the units issue loads and stores independently
Comparison to Other Paradigms
- Superscalar processors must discover the CFG as they decode branches
  - Multiscalar only requires the compiler to split the code into tasks
- Multiprocessors require all dependences to be known or conservatively provided for
  - Only code the compiler can prove independent can be executed in parallel
Outline: Multiscalar Basics, Tasks, Multiscalars In-Depth, Distribution of Cycles, Comparison to Other Paradigms, Performance, Conclusion
Performance
- Simulated 5-stage pipeline
- [Table: functional unit latencies]
Performance
- Memory
  - Non-blocking loads and stores
  - 10-cycle latency for the first 4 words, 1 cycle for each additional 4 words
- Instruction cache: 1 cycle for 4 words; 10+3 cycles on a miss
- Data cache: 1 word per cycle to the multiscalar; 10+3 cycles plus bus contention on a miss
- 1024-entry cache of task descriptors
Performance
- [Figure: code growth of the multiscalar binaries; +12.2% instruction count on average]
Performance – In-Order
[Figure: speedups with in-order processing units]
Performance – Out-of-Order
[Figure: speedups with out-of-order processing units]
Performance – Summary
- Most of the benchmarks achieve speedup
  - E.g., an average speedup of 1.924 for a 4-unit multiscalar with 1-way in-order units
- Worst case: 0.86 speedup (a slowdown)
  - Many squashes from mispredictions and memory-order violations in gcc and xlisp
  - Leads to almost sequential execution
- Keep in mind the 12.2% increase in instruction count
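A quick sanity check on these numbers (my arithmetic, not the paper's): because the multiscalar binaries execute about 12.2% more instructions, a 1.924x wall-clock speedup implies an even larger gain in raw instruction throughput:

```latex
\text{speedup} \;=\; \frac{T_{\text{base}}}{T_{\text{multiscalar}}} \;=\; 1.924,
\qquad
\frac{\text{IPC}_{\text{multiscalar}}}{\text{IPC}_{\text{base}}}
  \;=\; 1.924 \times 1.122 \;\approx\; 2.16
```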
Outline: Multiscalar Basics, Tasks, Multiscalars In-Depth, Distribution of Cycles, Comparison to Other Paradigms, Performance, Conclusion
Conclusion
- Divide the CFG into tasks
- Walk the CFG in task-size steps
- Assign tasks to processing units
- Shows performance gains