Superscalar and VLIW Architectures (Miodrag Bolic, CEG3151)


Outline
Types of architectures
Superscalar
Differences between CISC, RISC and VLIW
VLIW

Parallel processing [2]
Processing instructions in parallel requires three major tasks:
1. checking dependencies between instructions to determine which instructions can be grouped together for parallel execution;
2. assigning instructions to the functional units on the hardware;
3. determining when instructions are initiated.
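To make task 1 concrete, here is a minimal C sketch (the instruction encoding is hypothetical, not taken from the slides or references) of detecting a read-after-write dependency between two decoded instructions; instructions with no such dependency are candidates for grouping:

#include <stdbool.h>

/* Hypothetical decoded instruction: one destination and two source registers. */
typedef struct {
    int dest;   /* destination register number */
    int src1;   /* first source register number */
    int src2;   /* second source register number */
} Instr;

/* Returns true if 'later' reads a register that 'earlier' writes (a true,
   read-after-write dependency), in which case the two instructions cannot
   be grouped together for parallel execution. */
bool raw_dependent(Instr earlier, Instr later)
{
    return later.src1 == earlier.dest || later.src2 == earlier.dest;
}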

Major categories [2]
From Mark Smotherman, "Understanding EPIC Architectures and Implementations"
VLIW – Very Long Instruction Word
EPIC – Explicitly Parallel Instruction Computing

Major categories [2]
(figure from Mark Smotherman, "Understanding EPIC Architectures and Implementations")

Superscalar Processors [1]
Superscalar processors are designed to exploit more instruction-level parallelism in user programs. Only independent instructions can be executed in parallel without causing a wait state. The amount of instruction-level parallelism varies widely depending on the type of code being executed.
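As a small illustration (ordinary C, not tied to any particular processor), the first two statements below are independent and could be issued in the same cycle, while the third has a true dependency on both and must wait for their results:

/* A superscalar processor can issue the two independent statements together;
   the final addition reads their results and therefore executes later. */
int scaled_sum(int x, int y)
{
    int a = x * 2;   /* independent of b */
    int b = y * 3;   /* independent of a */
    return a + b;    /* depends on both a and b */
}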

Pipelining in Superscalar Processors [1]
In order to fully utilise a superscalar processor of degree m, m instructions must be executable in parallel. This situation may not be true in all clock cycles. In that case, some of the pipelines may be stalling in a wait state. In a superscalar processor, the simple operation latency should require only one cycle, as in the base scalar processor.

Superscalar Execution

Superscalar Implementation
Simultaneously fetch multiple instructions
Logic to determine true dependencies involving register values
Mechanisms to communicate these values
Mechanisms to initiate multiple instructions in parallel
Resources for parallel execution of multiple instructions
Mechanisms for committing process state in correct order (see the sketch below)
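As a loose sketch of the last mechanism, the following C fragment (a hypothetical, greatly simplified reorder buffer, not modelled on any specific processor) commits completed instructions strictly in program order even though they may have finished out of order:

#include <stdbool.h>
#include <stdio.h>

#define ROB_SIZE 8

/* One reorder-buffer entry per in-flight instruction. */
typedef struct {
    bool done;    /* execution has finished */
    int  result;  /* value to make architecturally visible at commit */
} RobEntry;

static RobEntry rob[ROB_SIZE];
static int head = 0;          /* index of the oldest un-committed instruction */

/* Commit from the head of the buffer while the oldest instruction is done;
   stopping at the first unfinished entry preserves program order. */
void commit_in_order(void)
{
    while (rob[head].done) {
        printf("commit result %d\n", rob[head].result);
        rob[head].done = false;
        head = (head + 1) % ROB_SIZE;
    }
}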

Some Architectures
PowerPC 604
– six independent execution units: branch execution unit, load/store unit, 3 integer units, floating-point unit
– in-order issue
– register renaming
PowerPC 620
– provides, in addition to the 604's features, out-of-order issue
Pentium
– three independent execution units: 2 integer units, floating-point unit
– in-order issue

The VLIW Architecture [4]
A typical VLIW (very long instruction word) machine has instruction words hundreds of bits in length. Multiple functional units are used concurrently in a VLIW processor. All functional units share the use of a common large register file.
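A minimal C sketch of what such an instruction word might look like (the slot mix below is hypothetical; real machines differ in the number and kind of slots):

/* Hypothetical VLIW instruction word: one operation slot per functional unit.
   All slots are fetched and issued together in a single cycle, and every
   functional unit reads and writes the same shared register file. */
typedef enum { OP_NOP, OP_ADD, OP_MUL, OP_LOAD, OP_STORE, OP_BRANCH } Op;

typedef struct {
    Op  op;                   /* operation for this unit (OP_NOP if the slot is unused) */
    int dest, src1, src2;     /* register numbers in the shared register file */
} Slot;

typedef struct {
    Slot alu;                 /* integer ALU slot */
    Slot mul;                 /* multiplier slot  */
    Slot mem;                 /* load/store slot  */
    Slot branch;              /* branch slot      */
} VliwWord;                   /* the four operations issue in the same cycle */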

Comparison: CISC, RISC, VLIW [4]

Advantages of VLIW
Compiler prepares fixed packets of multiple operations that give the full "plan of execution" (illustrated below)
– dependencies are determined by compiler and used to schedule according to function unit latencies
– function units are assigned by compiler and correspond to the position within the instruction packet ("slotting")
– compiler produces fully-scheduled, hazard-free code => hardware doesn't have to "rediscover" dependencies or schedule
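For illustration, the C sketch below prints a hand-made "plan of execution" for a toy three-slot VLIW machine (hypothetical slots and single-cycle latencies, not real C6x timing): each row is one instruction word, each column one functional-unit slot, and the assignment is fixed entirely at compile time.

#include <stdio.h>

/* Toy compile-time schedule: one row per VLIW word (one cycle), one column
   per functional-unit slot.  The hardware would simply issue a row per cycle. */
static const char *schedule[][3] = {
    /* load/store slot     multiply slot        ALU slot          */
    { "LDH *A5++,A0",      "NOP",               "NOP"             },  /* cycle 1 */
    { "LDH *A6++,A1",      "NOP",               "NOP"             },  /* cycle 2 */
    { "NOP",               "MPY A0,A1,A3",      "SUB A2,1,A2"     },  /* cycle 3 */
    { "NOP",               "NOP",               "ADD A3,A4,A4"    },  /* cycle 4 */
};

int main(void)
{
    for (unsigned c = 0; c < sizeof schedule / sizeof schedule[0]; c++)
        printf("cycle %u: %-15s | %-15s | %s\n",
               c + 1, schedule[c][0], schedule[c][1], schedule[c][2]);
    return 0;
}

The mostly-NOP rows also show, in miniature, the code-density problem discussed on the next slide.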

Disadvantages of VLIW
Compatibility across implementations is a major problem
– VLIW code won't run properly on an implementation with a different number of function units or different latencies
– unscheduled events (e.g., a cache miss) stall the entire processor
Code density is another problem
– low slot utilization (mostly nops)
– reduce nops by compression ("flexible VLIW", "variable-length VLIW")

Example: Vector Dot Product
A vector dot product is common in filtering
Store a(n) and x(n) into arrays of N elements
C6x peak performance: 8 RISC instructions/cycle
– Peak RISC instructions per sample: 300,000 for speech; 54,421 for audio; and 290 for luminance NTSC video (e.g., at 2,400 MIPS, a 300 MHz C6x issuing 8 instructions/cycle, an 8 kHz speech sample allows 2.4 × 10^9 / 8,000 = 300,000 instructions)
– Generally requires hand coding for peak performance
First dot product example will not be optimized

Example: Vector Dot Product
Prologue
– Initialize pointers: A5 for a(n), A6 for x(n), and A7 for Y
– Move the number of times to loop (N) into A2
– Set accumulator (A4) to zero
Inner loop
– Put a(n) into A0 and x(n) into A1
– Multiply a(n) and x(n)
– Accumulate multiplication result into A4
– Decrement loop counter (A2)
– Continue inner loop if counter is not zero
Epilogue
– Store the result into Y
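For reference, a plain C version of the same computation (a sketch assuming N = 40 and 16-bit samples, matching the MVK of 40 and the half-word LDH loads in the assembly on the next slide):

#include <stdio.h>

#define N 40   /* loop count, loaded into A2 by the MVK in the assembly version */

/* Y = sum over n of a(n) * x(n); the local y plays the role of accumulator A4. */
int dot_product(const short *a, const short *x)
{
    int y = 0;
    for (int n = 0; n < N; n++)
        y += a[n] * x[n];
    return y;
}

int main(void)
{
    short a[N], x[N];
    for (int n = 0; n < N; n++) { a[n] = 1; x[n] = (short)n; }
    printf("Y = %d\n", dot_product(a, x));   /* 0 + 1 + ... + 39 = 780 */
    return 0;
}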

Example: Vector Dot Product (using the A data path only; coefficients a(n), data x(n))

        ; clear A4 and initialize pointers A5, A6, and A7
        MVK  .S1  40,A2       ; A2 = 40 (loop counter)
loop    LDH  .D1  *A5++,A0    ; A0 = a(n)
        LDH  .D1  *A6++,A1    ; A1 = x(n)
        MPY  .M1  A0,A1,A3    ; A3 = a(n) * x(n)
        ADD  .L1  A3,A4,A4    ; Y = Y + A3
        SUB  .L1  A2,1,A2     ; decrement loop counter
  [A2]  B    .S1  loop        ; if A2 != 0, then branch
        STH  .D1  A4,*A7      ; *A7 = Y

References
1. K. Hwang, Advanced Computer Architectures: Parallelism, Scalability, Programmability.
2. M. Smotherman, "Understanding EPIC Architectures and Implementations".
3. Lecture notes of Mark Smotherman.
4. An Introduction To Very-Long Instruction Word (VLIW) Computer Architecture, Philips Semiconductors.
5. Lecture 6 and Lecture 7 by Paul Pop.
6. Texas Instruments, Tutorial on TMS320C6000 VelociTI Advanced VLIW Architecture.
7. Morgan Kaufmann, companion web site for Computer Organization and Design.