Independence Instruction Set Architectures


Independence Instruction Set Architectures

Conventional ISA:
- Instructions execute in order
- There is no way of stating that instruction A is independent of instruction B
- Independence must be detected at runtime → cost: time, power, complexity

Idea: change the execution model at the ISA level to allow the specification of independence.
- VLIW. Goals: flexible enough, and a good match for the technology
- Vectors and SIMD: only for sets of the same operation

ECE 1773 – Fall 2006 © A. Moshovos (U. of Toronto). Some material by Wen-Mei Hwu (UIUC) and S. Mahlke (Michigan).

VLIW: Very Long Instruction Word

#1 defining attribute: the instruction format
- One very long instruction word holds several operations, e.g. the slots ALU1 | ALU2 | MEM1 | control
- The four operations in a word are independent, so some parallelism can be expressed this way

Extending the ability to specify parallelism, taking the technology into consideration (recall delay slots), leads to the
#2 defining attribute: NUAL, Non-Unit Assumed Latency
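As a minimal C sketch, the four-slot word from the figure can be written as a struct. The slot names (ALU1, ALU2, MEM1, control) follow the slide; the opcode set and field layout here are hypothetical, not any real machine's encoding.

```c
#include <assert.h>

/* Hypothetical opcode set for illustration only. */
typedef enum { OP_NOP, OP_ADD, OP_SUB, OP_LOAD, OP_STORE, OP_BRANCH } opcode_t;

typedef struct {
    opcode_t op;
    int dst, src1, src2;      /* register numbers */
} slot_t;

typedef struct {              /* one Very Long Instruction Word */
    slot_t alu1;              /* first integer ALU operation  */
    slot_t alu2;              /* second integer ALU operation */
    slot_t mem1;              /* load/store operation         */
    slot_t ctrl;              /* branch/control operation     */
} vliw_word_t;

/* Slots with no useful work carry explicit NOPs: the compiler,
   not the hardware, decides what runs in which slot. */
vliw_word_t make_word(slot_t alu1, slot_t alu2, slot_t mem1, slot_t ctrl) {
    vliw_word_t w = { alu1, alu2, mem1, ctrl };
    return w;
}
```

The point of the sketch is that each slot is bound to a specific functional unit by position alone, so no slot-to-unit routing logic is needed.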

NUAL vs. UAL

Unit Assumed Latency (UAL): the semantics of the program are that each instruction is completed before the next one is issued. This is the conventional sequential model.

Non-Unit Assumed Latency (NUAL): at least one operation has an assumed latency L greater than 1. The semantics of the program are correctly understood if exactly the next L-1 instructions are understood to have issued before this operation completes. In short, result observation is delayed by L cycles.
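The visibility rule can be acted out in a toy C model. This assumes a hypothetical load with assumed latency L = 2: the instruction in the very next slot still reads the old value of the load's destination register, and only a later instruction sees the loaded value.

```c
typedef struct { int r2, r3; } result_t;

/* Toy NUAL illustration (L = 2 load, hypothetical machine):
     cycle 0: LD  r1, mem0      issues; result is in flight
     cycle 1: ADD r2, r1, 1     reads the OLD r1 (stale by design)
     cycle 2: load completes; ADD r3, r1, 1 reads the loaded value */
result_t nual_demo(int mem0) {
    int r1 = 0;                 /* old architectural value of r1 */
    int in_flight = mem0;       /* cycle 0: load issues          */
    int r2 = r1 + 1;            /* cycle 1: sees stale r1        */
    r1 = in_flight;             /* cycle 2: r1 becomes visible   */
    int r3 = r1 + 1;            /* cycle 2: sees loaded value    */
    result_t r = { r2, r3 };
    return r;
}
```

Under UAL the first ADD would have had to see the loaded value; under NUAL the stale read is architecturally correct, which is exactly what lets the compiler fill the latency slot with useful work.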

#2 Defining Attribute: NUAL

Assumed latencies for all operations. [Figure: a sequence of ALU1 | ALU2 | MEM1 | control words; each result becomes architecturally visible a fixed number of words after it issues.]

- In effect, glorified delay slots
- Additional opportunities for specifying parallelism

#3 Defining Attribute: Resource Assignment

The VLIW also implies an allocation of resources: the ALU1 | ALU2 | MEM1 | control instruction format maps directly onto a datapath with two ALUs, a cache port, and a control-flow unit.

VLIW: Definition

- Multiple independent functional units
- An instruction consists of multiple independent operations, each aligned to a functional unit
- Latencies are fixed and architecturally visible
- The compiler packs operations into a VLIW and schedules all hardware resources
- The entire VLIW issues as a single unit

Result: ILP with simple hardware, compact and fast hardware control, and a fast clock. At least, this is the goal.

VLIW Example [Figure: an I-fetch & issue unit feeding two FUs and two memory ports through a multi-ported register file.]

VLIW Example

Instruction format: ALU1 | ALU2 | MEM1 | control. Program order and execution order proceed word by word.

- The operations in a VLIW are independent
- Latencies are fixed in the architecture specification
- The hardware does not check anything; software has to schedule so that everything works

Compilers are King

VLIW philosophy: "dumb" hardware, "intelligent" compiler.

Key technologies:
- Predicated execution
- Trace scheduling
- If-conversion
- Software pipelining

Predicated Execution

Instructions are predicated: if (cond) then perform the instruction. In practice, the result is calculated unconditionally, and then: if (cond) destination = result. This converts control-flow dependences into data dependences.

Example:
  if (a == 0) b = 1; else b = 2;
becomes
  true:  pred = (a == 0)
  pred:  b = 1
  !pred: b = 2
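The if-converted form of the slide's example can be mimicked in portable C with a branch-free select. This is only a sketch: real predicated ISAs attach the predicate to each instruction, whereas here the compiler is merely encouraged to emit a conditional move.

```c
/* Both sides are computed; the predicate selects the result.
   Mirrors:  pred = (a == 0);  pred: b = 1;  !pred: b = 2 */
int predicated_b(int a) {
    int pred   = (a == 0);
    int b_then = 1;                  /* executed under  pred */
    int b_else = 2;                  /* executed under !pred */
    return pred ? b_then : b_else;   /* cmov-style select    */
}
```

Note that both arms always execute, which is precisely the trade-off the next slide asks about: predication wins only when the wasted work costs less than a mispredicted branch.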

Predicated Execution: Trade-offs

- Is predicated execution always a win?
- Is predication meaningful for VLIW only?

Trace Scheduling

Goal: create a large contiguous piece of code and schedule it to the max, exploiting parallelism.

"Fact" of life: basic blocks are small, and scheduling across basic blocks is difficult.

But: while many control-flow paths exist, there are few "hot" ones.

Trace Scheduling

Static control speculation: assume a specific path, schedule accordingly, and introduce check and repair code where necessary.

First used to compact microcode: Fisher, J., "Trace scheduling: A technique for global microcode compaction," IEEE Transactions on Computers C-30, 7 (July 1981), 478-490.

Trace Scheduling: Example

Assume A→C is the common path: schedule A and C together as a single trace, with B branching to repair code. This expands the scope and flexibility of code motion. [Figure: blocks A, B, C before and after trace formation.]

Trace Scheduling: Example #2 [Figure: blocks bA through bE. The scheduled trace runs bA, bB, bC, bD, bE with checks after the speculated blocks; if a check fails, control enters repair copies of bC and bD; otherwise all is OK.]

Trace Scheduling Example

Source code:
  test = a[i] + 20;
  if (test > 0) then sum = sum + 10
  else sum = sum + c[i]
  c[x] = c[y] + 10

Straight-line code for the assumed test > 0 path (c[x] = c[y] + 10 is hoisted; assume the branch has delay):
  c[x] = c[y] + 10
  test = a[i] + 20
  sum = sum + 10
  if (test <= 0) then goto repair
  ...

repair:
  sum = sum - 10
  sum = sum + c[i]
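The speculate-then-repair pattern from this slide can be written out in C and checked against the original control flow; the array contents used below are made up purely for the check.

```c
/* Reference: the original control flow from the slide. */
int sum_ref(const int *a, const int *c, int i, int sum) {
    int test = a[i] + 20;
    if (test > 0) sum = sum + 10;
    else          sum = sum + c[i];
    return sum;
}

/* Trace-scheduled version: assume test > 0, speculatively do
   sum += 10, then check the assumption and repair if wrong. */
int sum_traced(const int *a, const int *c, int i, int sum) {
    int test = a[i] + 20;
    sum = sum + 10;            /* speculative: common-path work  */
    if (test <= 0) {           /* check                          */
        sum = sum - 10;        /* repair: undo the speculation   */
        sum = sum + c[i];      /* do the off-trace work          */
    }
    return sum;
}
```

Both versions compute the same result on every path; the traced version simply moves the common-path work above the branch so the scheduler can pack it with neighboring operations.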

SIMD: Single Instruction, Multiple Data

SIMD: Motivation (contd.)

Recall: part of architecture is understanding application needs. Many applications look like:

  for i = 0 to infinity
      a(i) = b(i) + c

The same operation is applied over many tuples of data, mostly independently across iterations.

ECE1773. Portions from Hill, Wood, Sohi and Smith (Wisconsin), Culler (Berkeley), Kozyrakis (Stanford). © Moshovos

Some things are naturally parallel

Sequential Execution Model / SISD

  int a[N]; // N is large
  for (i = 0; i < N; i++)
      a[i] = a[i] * fade;

A single flow of control / thread executes one instruction at a time; optimizations are possible at the machine level.

Data-Parallel Execution Model / SIMD

  int a[N]; // N is large
  for all elements do in parallel
      a[i] = a[i] * fade;

This has been tried before: ILLIAC IV, UIUC, 1966.
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4038028&tag=1
http://ed-thelen.org/comp-hist/vs-illiac-iv.html

SIMD Processing [Figure: SCALAR, one operation: add r3, r1, r2 computes r3 = r1 + r2. SIMD, N operations: the same add is applied element-wise across N lanes by a single instruction.]

[Figure: pipelined execution timing over cycles C1 through C10: fetch, decode, rf, then several exec/wb cycles per instruction, for two and then three overlapped instructions.]

SIMD Architecture

Replicate the datapath, not the control: all PEs work in tandem while the CU orchestrates operations. [Figure: one control unit driving an array of PEs, each with its own ALU and memory.]

Multimedia Extensions: SIMD in modern CPUs

MMX: Basics

Multimedia applications are becoming popular. Are current ISAs a good match for them? Can we do better?

Methodology: consider a number of "typical" applications and weigh the cost vs. performance vs. utility tradeoffs.

Net result: Intel's MMX. It can also be viewed as an attempt to maintain market share: if people are going to use these kinds of applications, we had better support them.

Multimedia Applications

Most multimedia apps have lots of parallelism:

  for I = here to infinity
      out[I] = in_a[I] * in_b[I]

At runtime:
  out[0] = in_a[0] * in_b[0]
  out[1] = in_a[1] * in_b[1]
  out[2] = in_a[2] * in_b[2]
  out[3] = in_a[3] * in_b[3]
  ...

They also work on short integers: in_a[i] is 0 to 255, for example (color), or 0 to 64K (16-bit audio).

Observations

- 32-bit registers are wasted: only part of each register is used, and we know it
- ALUs are underutilized, and we know it
- Instruction specification is inefficient: even though we know that a lot of the same operations will be performed, we still have to specify each of them individually (instruction bandwidth, discovering parallelism)
- Memory ports: we could read four elements of an array with one 32-bit load (same for stores), but the hardware will have a hard time discovering this (coalescing and dependences)

MMX Contd.

We can do better than a traditional ISA with new data types and new instructions:
- Pack data into 64-bit words: bytes, "words" (16 bits), "double words" (32 bits)
- Operate on packed data like short vectors: SIMD

The idea was first used in the Livermore S-1 (more than 20 years earlier).
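The packed-byte idea can be imitated even without MMX's 64-bit registers, using plain 32-bit SWAR ("SIMD within a register") arithmetic. This is a sketch, not Intel's implementation: four byte lanes share one word, and the inter-lane carries are masked off so each lane wraps independently, like MMX's PADDB.

```c
#include <stdint.h>

/* Add four packed unsigned bytes lane-by-lane in one 32-bit word.
   Add the low 7 bits of each lane first (so no carry can cross a
   lane boundary), then restore each lane's top bit with XOR. */
uint32_t paddb32(uint32_t x, uint32_t y) {
    uint32_t low7 = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu);
    return low7 ^ ((x ^ y) & 0x80808080u);
}
```

One 32-bit add plus a little masking does four byte additions, which is exactly the register-waste and ALU-underutilization argument from the previous slide turned into a win.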

MMX: Example

Up to 8 operations (in 64 bits) go in parallel, so the potential improvement is 8x; in practice it is less, but still good. Besides, it is another reason to think your machine is obsolete.

Data Types [Figure: the MMX packed data types.]

Vector Processors

Scalar processors operate on single numbers (scalars): add r3, r1, r2 is one operation. Vector processors operate on vectors of numbers, i.e. linear sequences of numbers: vadd.vv v3, v1, v2 performs N operations, one per element, up to the vector length.

From Christos Kozyrakis, Stanford.

[Figure: vector pipeline timing over cycles C1 through C10: fetch, decode, rf, then exec/wb for successive elements.]

Example of Simple Vector Processor

What's in a Vector Processor?

- A scalar processor (e.g. a MIPS processor): a scalar register file (32 registers) and scalar functional units (arithmetic, load/store, etc.)
- A vector register file (a 2D register array): each register is an array of elements, e.g. 32 registers with 32 64-bit elements per register. MVL = maximum vector length = max # of elements per register
- A set of vector functional units: integer, FP, load/store, etc. Sometimes vector and scalar units are combined (sharing ALUs)
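The components listed above can be sketched as C data structures. The register counts and MVL follow the slide's example figures; the type and function names are made up for illustration.

```c
#include <stdint.h>

#define NVREGS 32     /* vector registers, per the slide's example    */
#define MVL    32     /* maximum vector length (elements per register)*/

typedef struct { int64_t e[MVL]; } vreg_t;   /* one vector register */

typedef struct {
    int64_t scalar[32];       /* scalar register file            */
    vreg_t  vector[NVREGS];   /* the 2D vector register array    */
    int     vl;               /* current vector length, <= MVL   */
} regfile_t;

/* One vector functional-unit operation: vadd.vv vd, vs1, vs2.
   A single instruction expresses vl independent element adds. */
void vadd_vv(vreg_t *vd, const vreg_t *vs1, const vreg_t *vs2, int vl) {
    for (int i = 0; i < vl; i++)
        vd->e[i] = vs1->e[i] + vs2->e[i];
}
```

The loop inside vadd_vv is exactly the work that the hardware can spread across lanes or pipeline element by element; the ISA only promises the element-wise result.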

Vector Code Example

Y[0:63] = Y[0:63] + a * X[0:63]

  LD      R0, a        ; load the scalar a
  VLD     V1, 0(Rx)    ; V1 = X[]
  VLD     V2, 0(Ry)    ; V2 = Y[]
  VMUL.SV V3, R0, V1   ; V3 = a * X[]
  VADD.VV V4, V2, V3   ; V4 = Y[] + V3
  VST     V4, 0(Ry)    ; store the result in Y[]

ECE1773. Portions from Hill, Wood, Sohi and Smith (Wisconsin), Culler (Berkeley), Kozyrakis (Stanford). © Moshovos
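For reference, this is the scalar C loop the six-instruction vector sequence encodes; the function name and the use of long elements are illustrative choices, not part of the slide.

```c
#define N 64

/* Y[0:63] = Y[0:63] + a * X[0:63]: one loop iteration per element,
   versus a single VMUL.SV / VADD.VV pair in the vector code. */
void saxpy64(long a, const long x[N], long y[N]) {
    for (int i = 0; i < N; i++)
        y[i] = y[i] + a * x[i];
}
```

The contrast is the whole argument for vectors: 64 multiply-adds take roughly 6 instructions of fetch/decode bandwidth instead of several hundred.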