Design of Digital Circuits Lecture 19a: VLIW

Slides:

Advertisements

Similar presentations

Asanovic/Devadas Spring VLIW/EPIC: Statically Scheduled ILP Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology.

Advertisements

Superscalar and VLIW Architectures Miodrag Bolic CEG3151.

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.

POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:

CS252 Graduate Computer Architecture Spring 2014 Lecture 9: VLIW Architectures Krste Asanovic

1 Lecture 18: VLIW and EPIC Static superscalar, VLIW, EPIC and Itanium Processor (First introduce fast and high- bandwidth L1 cache design)

Pipelining 5. Two Approaches for Multiple Issue Superscalar –Issue a variable number of instructions per clock –Instructions are scheduled either statically.

Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania ECE Computer Organization Lecture 19 - Pipelined.

Instruction Level Parallelism (ILP) Colin Stevens.

1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.

ELEC Fall 05 1 Very- Long Instruction Word (VLIW) Computer Architecture Fan Wang Department of Electrical and Computer Engineering Auburn.

Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania Computer Organization Pipelined Processor Design 3.

Microprocessors Introduction to ia64 Architecture Jan 31st, 2002 General Principles.

RISC. Rational Behind RISC Few of the complex instructions were used –data movement – 45% –ALU ops – 25% –branching – 30% Cheaper memory VLSI technology.

Basics and Architectures

Multiple Issue Processors: Superscalar and VLIW

10/27: Lecture Topics Survey results Current Architectural Trends Operating Systems Intro –What is an OS? –Issues in operating systems.

Spring 2003CSE P5481 VLIW Processors VLIW (“very long instruction word”) processors instructions are scheduled by the compiler a fixed number of operations.

Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.

Ted Pedersen – CS 3011 – Chapter 10 1 A brief history of computer architectures CISC – complex instruction set computing –Intel x86, VAX –Evolved from.

1  1998 Morgan Kaufmann Publishers Chapter Six. 2  1998 Morgan Kaufmann Publishers Pipelining Improve perfomance by increasing instruction throughput.

EKT303/4 Superscalar vs Super-pipelined.

15-740/ Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011.

Computer Architecture: SIMD and GPUs (Part III) (and briefly VLIW, DAE, Systolic Arrays) Prof. Onur Mutlu Carnegie Mellon University.

15-740/ Computer Architecture Lecture 2: ISA, Tradeoffs, Performance Prof. Onur Mutlu Carnegie Mellon University.

Computer Architecture Lecture 15: GPUs, VLIW, DAE

Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.

Computer Architecture: VLIW, DAE, Systolic Arrays Prof. Onur Mutlu Carnegie Mellon University.

Design of Digital Circuits Lecture 19: Approaches to Concurrency

CS 352H: Computer Systems Architecture

Design of Digital Circuits Lecture 24: Systolic Arrays and Beyond

Computer Architecture Lecture 17: GPUs, VLIW, Systolic Arrays

Advanced Architectures

15-740/ Computer Architecture Lecture 3: Performance

15-740/ Computer Architecture Lecture 21: Superscalar Processing

15-740/ Computer Architecture Lecture 4: ISA Tradeoffs

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue

Computer Architecture Lecture 4: More ISA Tradeoffs

Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 7: Pipelining

Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014

Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 3/20/2013

Vishwani D. Agrawal James J. Danaher Professor

Superscalar Processors & VLIW Processors

Levels of Parallelism within a Single Processor

CSCE 432/832 High Performance Processor Architectures Scalar to Superscalar Adopted from Lecture notes based in part on slides created by Mikko H. Lipasti,

15-740/ Computer Architecture Lecture 5: Precise Exceptions

15-740/ Computer Architecture Lecture 26: Predication and DAE

Lecture 23: Static Scheduling for High ILP

CHAPTER 8: CPU and Memory Design, Enhancement, and Implementation

Hyesoon Kim Onur Mutlu Jared Stark* Yale N. Patt

CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue

Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011

Introduction SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.

Vishwani D. Agrawal James J. Danaher Professor

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution

Additional ILP topic #5: VLIW Also: ISA topics Prof. Eric Rotenberg

Prof. Onur Mutlu Carnegie Mellon University

Superscalar and VLIW Architectures

Levels of Parallelism within a Single Processor

CSC3050 – Computer Architecture

Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 3/20/2013

Prof. Onur Mutlu ETH Zurich Spring April 2018

Predication ECE 721 Prof. Rotenberg.

Prof. Onur Mutlu ETH Zurich Spring May 2018

Prof. Onur Mutlu Carnegie Mellon University

Presentation transcript:

Design of Digital Circuits Lecture 19a: VLIW Prof. Onur Mutlu ETH Zurich Spring 2019 3 May 2019

Approaches to (Instruction-Level) Concurrency Pipelining Out-of-order execution Dataflow (at the ISA level) Superscalar Execution VLIW Fine-Grained Multithreading Systolic Arrays Decoupled Access Execute SIMD Processing (Vector and array processors, GPUs)

VLIW

VLIW Concept Superscalar VLIW (Very Long Instruction Word) Hardware fetches multiple instructions and checks dependencies between them VLIW (Very Long Instruction Word) Software (compiler) packs independent instructions in a larger “instruction bundle” to be fetched and executed concurrently Hardware fetches and executes the instructions in the bundle concurrently No need for hardware dependency checking between concurrently-fetched instructions in the VLIW model

VLIW Concept Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983. ELI: Enormously longword instructions (512 bits)

VLIW (Very Long Instruction Word) A very long instruction word consists of multiple independent instructions packed together by the compiler Packed instructions can be logically unrelated (contrast with SIMD/vector processors, which we will see soon) Idea: Compiler finds independent instructions and statically schedules (i.e. packs/bundles) them into a single VLIW instruction Traditional Characteristics Multiple functional units All instructions in a bundle are executed in lock step Instructions in a bundle statically aligned to be directly fed into the functional units

VLIW Performance Example (2-wide bundles) lw $t0, 40($s0) add $t1, $s1, $s2 sub $t2, $s1, $s3 and $t3, $s3, $s4 or $t4, $s1, $s5 sw $s5, 80($s0) Ideal IPC = 2 Actual IPC = 2 (6 instructions issued in 3 cycles)

VLIW Lock-Step Execution Lock-step (all or none) execution: If any operation in a VLIW instruction stalls, all instructions stall In a truly VLIW machine, the compiler handles all dependency-related stalls, hardware does not perform dependency checking What about variable latency operations?

VLIW Philosophy Philosophy similar to RISC (simple instructions and hardware) Except multiple instructions in parallel RISC (John Cocke, 1970s, IBM 801 minicomputer) Compiler does the hard work to translate high-level language code to simple instructions (John Cocke: control signals) And, to reorder simple instructions for high performance Hardware does little translation/decoding  very simple VLIW (Josh Fisher, ISCA 1983) Compiler does the hard work to find instruction level parallelism Hardware stays as simple and streamlined as possible Executes each instruction in a bundle in lock step Simple  higher frequency, easier to design

Commercial VLIW Machines Multiflow TRACE, Josh Fisher (7-wide, 28-wide) Cydrome Cydra 5, Bob Rau Transmeta Crusoe: x86 binary-translated into internal VLIW TI C6000, Trimedia, STMicro (DSP & embedded processors) Most successful commercially Intel IA-64 Not fully VLIW, but based on VLIW principles EPIC (Explicitly Parallel Instruction Computing) Instruction bundles can have dependent instructions A few bits in the instruction format specify explicitly which instructions in the bundle are dependent on which other ones

VLIW Tradeoffs Advantages Disadvantages + No need for dynamic scheduling hardware  simple hardware + No need for dependency checking within a VLIW instruction  simple hardware for multiple instruction issue + no renaming + No need for instruction alignment/distribution after fetch to different functional units  simple hardware Disadvantages -- Compiler needs to find N independent operations per cycle -- If it cannot, inserts NOPs in a VLIW instruction -- Parallelism loss AND code size increase -- Recompilation required when execution width (N), instruction latencies, functional units change (Unlike superscalar processing) -- Lockstep execution causes independent operations to stall -- No instruction can progress until the longest-latency instruction completes

VLIW Summary VLIW simplifies hardware, but requires complex compiler techniques Solely-compiler approach of VLIW has several downsides that reduce performance -- Too many NOPs (not enough parallelism discovered) -- Static schedule intimately tied to microarchitecture -- Code optimized for one generation performs poorly for next -- No tolerance for variable or long-latency operations (lock step) ++ Most compiler optimizations developed for VLIW employed in optimizing compilers (for superscalar compilation) Enable code optimizations ++ VLIW successful when parallelism is easier to find by the compiler (traditionally embedded markets, DSPs)

An Example Work: Superblock Hwu et al., The superblock: An effective technique for VLIW and superscalar compilation. The Journal of Supercomputing, 1993. Lecture Video on Static Instruction Scheduling https://www.youtube.com/watch?v=isBEVkIjgGA

Another Example Work: IMPACT Chang et al., IMPACT: an architectural framework for multiple-instruction-issue processors. ISCA 1991.

Design of Digital Circuits Lecture 19a: VLIW Prof. Onur Mutlu ETH Zurich Spring 2019 3 May 2019