ESE532: System-on-a-Chip Architecture

Slides:

Advertisements

Similar presentations

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Advanced Computer Architecture COE 501.

Advertisements

ENGS 116 Lecture 101 ILP: Software Approaches Vincent H. Berk October 12 th Reading for today: , 4.1 Reading for Friday: 4.2 – 4.6 Homework #2:

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.

Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.

1 Advanced Computer Architecture Limits to ILP Lecture 3.

Chapter 8. Pipelining. Instruction Hazards Overview Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline.

Instruction Level Parallelism (ILP) Colin Stevens.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 3: January 26, 2009 Scheduled Operator Sharing.

RISC. Rational Behind RISC Few of the complex instructions were used –data movement – 45% –ALU ops – 25% –branching – 30% Cheaper memory VLSI technology.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 5: January 24, 2007 ALUs, Virtualization…

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 9: February 24, 2014 Operator Sharing, Virtualization, Programmable Architectures.

Spring 2003CSE P5481 VLIW Processors VLIW (“very long instruction word”) processors instructions are scheduled by the compiler a fixed number of operations.

Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day5:

Next Generation ISA Itanium / IA-64. Operating Environments IA-32 Protected Mode/Real Mode/Virtual Mode - if supported by the OS IA-64 Instruction Set.

Csci 136 Computer Architecture II – Superscalar and Dynamic Pipelining Xiuzhen Cheng

ESE532: System-on-a-Chip Architecture

ESE532: System-on-a-Chip Architecture

Advanced Architectures

ESE532: System-on-a-Chip Architecture

ESE532: System-on-a-Chip Architecture

ESE532: System-on-a-Chip Architecture

William Stallings Computer Organization and Architecture 8th Edition

ESE532: System-on-a-Chip Architecture

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue

ESE534 Computer Organization

ESE532: System-on-a-Chip Architecture

Performance of Single-cycle Design

Embedded Systems Design

ESE532: System-on-a-Chip Architecture

ESE532: System-on-a-Chip Architecture

ESE532: System-on-a-Chip Architecture

ESE532: System-on-a-Chip Architecture

Instruction-Level Parallelism

ESE532: System-on-a-Chip Architecture

CDA 3101 Spring 2016 Introduction to Computer Organization

Instruction Scheduling for Instruction-Level Parallelism

ESE534: Computer Organization

ESE534: Computer Organization

ESE534: Computer Organization

Lecture: Static ILP Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)

Computer Architecture

ESE534: Computer Organization

ESE534: Computer Organization

Coe818 Advanced Computer Architecture

Lecture: Static ILP Topics: loop unrolling, software pipelines (Sections C.5, 3.2) HW3 posted, due in a week.

ESE532: System-on-a-Chip Architecture

Advanced Computer Architecture

ESE534: Computer Organization

Mattan Erez The University of Texas at Austin

Adapted from the slides of Prof

ESE534: Computer Organization

Instruction Level Parallelism (ILP)

CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue

ESE532: System-on-a-Chip Architecture

ESE 150 – Digital Audio Basics

Mattan Erez The University of Texas at Austin

Superscalar and VLIW Architectures

ESE534: Computer Organization

ESE535: Electronic Design Automation

ESE534: Computer Organization

CSC3050 – Computer Architecture

Dynamic Hardware Prediction

Lecture 5: Pipeline Wrap-up, Static ILP

Instruction Level Parallelism

CS184b: Computer Architecture (Abstractions and Optimizations)

ESE532: System-on-a-Chip Architecture

ESE532: System-on-a-Chip Architecture

ESE532: System-on-a-Chip Architecture

Presentation transcript:

ESE532: System-on-a-Chip Architecture Day 15: March 15, 2017 VLIW (Very Long Instruction Word Processors) Penn ESE532 Spring 2017 -- DeHon

Today VLIW (Very Large Instruction Word) Demand Basic Model Costs Tuning Penn ESE532 Spring 2017 -- DeHon

Message VLIW as a Model for Instruction-Level Parallelism (ILP) Customizing Datapaths Area-Time Tradeoffs Penn ESE532 Spring 2017 -- DeHon

Preclass 1 Cycles per multiply-accumulate Spatial Pipeline Processor Penn ESE532 Spring 2017 -- DeHon

Preclass 1 How different? Penn ESE532 Spring 2017 -- DeHon

Computing Forms Processor – does one thing at a time Spatial Pipeline – can do many things, but always the same Vector – can do the same things on many pieces of data Penn ESE532 Spring 2017 -- DeHon

In Between What if… Want to Do many things at a time (ILP) But not the same (DLP) Penn ESE532 Spring 2017 -- DeHon

In between What if… Want to Want to use resources concurrently Do many things at a time (ILP) But not the same (DLP) Want to use resources concurrently Penn ESE532 Spring 2017 -- DeHon

In between What if… Want to Want to use resources concurrently Do many things at a time (ILP) But not the same (DLP) Want to use resources concurrently Accelerate specific task But not go to spatial pipeline extreme Penn ESE532 Spring 2017 -- DeHon

Supply Independent Instructions Provide instruction per ALU Instructions more expensive than Vector But more flexible Penn ESE532 Spring 2017 -- DeHon

Control Heterogeneous Units Control each unit simultaneously and independently More expensive memory/interconnect than processor But more parallelism Penn ESE532 Spring 2017 -- DeHon

VLIW The “instruction” …becomes long Hence: The bits controlling the datapath …becomes long Hence: Very Long Instruction Word (VLIW) Penn ESE532 Spring 2017 -- DeHon

VLIW X X + Very Long Instruction Word Set of operators Parameterize number, distribution (X, +, sqrt…) More operators less time, more area Fewer operators more time, less area Memories for intermediate state X X + Penn ESE535 Spring 2015 -- DeHon

VLIW + X Very Long Instruction Word Set of operators Parameterize number, distribution (X, +, sqrt…) More operators less time, more area Fewer operators more time, less area Memories for intermediate state Memory for “long” instructions + X Address Instruction Memory Penn ESE535 Spring 2015 -- DeHon

VLIW X Instruction Memory Address X + Penn ESE535 Spring 2015 -- DeHon

VLIW Very Long Instruction Word Set of operators Parameterize number, distribution (X, +, sqrt…) More operators less time, more area Fewer operators more time, less area Memories for intermediate state Memory for “long” instructions General framework for specializing to problem Wiring, memories get expensive Opportunity for further optimizations General way to tradeoff area and time Penn ESE535 Spring 2015 -- DeHon

VLIW X Instruction Memory Address X + Penn ESE535 Spring 2015 -- DeHon

VLIW w/ Multiport RF Simple, full-featured model use common Register File Penn ESE532 Spring 2017 -- DeHon

Processor Unbound Can (design to) use all operators at once Penn ESE532 Spring 2017 -- DeHon

Processor Unbound Implement Preclass 1 Penn ESE532 Spring 2017 -- DeHon

Software Pipelined Version for (i=0;i<MAX;i++) { c=c+prod; prod=la*lb; la=a[i]; lb=b[i];} Use this to compact schedule Penn ESE532 Spring 2017 -- DeHon

VLIW Operator Knobs Choose collection of operators and the numbers of each Match task Tune resources Penn ESE532 Spring 2017 -- DeHon

Preclass 2 res[i]=sqrt(x[i]*x[i]+y[i]*y[i]+z[i]*z[i]); How many operators need for each II? Datapath Area? Penn ESE532 Spring 2017 -- DeHon

Multiport RF Multiported memories are expensive Need input/output lines for each port Makes large, slow Simplified preclass model: Area(Memory(n,w,r))=n*(w+r+1)/2 Penn ESE532 Spring 2017 -- DeHon

Preclass 3 Compare total area Multiport 5, 10 5 x Multiport 2, 2 with 5x1 Xbar How does area of memories, xbar compare to datapath operators in each case? Penn ESE532 Spring 2017 -- DeHon

Split RF Cheaper At same capacity, split register file cheaper 2R+1W  2 per word 5R+10W  8 per word Penn ESE532 Spring 2017 -- DeHon

Split RF Split RF with Full (5, 5) Crossbar Cost? Penn ESE532 Spring 2017 -- DeHon

Split RF Full Crossbar What restriction/limitation might this have versus multiported RF version? Penn ESE532 Spring 2017 -- DeHon

VLIW Memory Tuning Can select how much sharing or independence in local memories Penn ESE532 Spring 2017 -- DeHon

Split RF, Limited Crossbar What limitation does the one crossbar output pose? Penn ESE532 Spring 2017 -- DeHon

VLIW Schedule Need to schedule Xbar output(s) as well as operators. (as seen Day 12) Cycle 1 2 3 4 5 6 7 8 9 X 1 X 2 + 1 + 2 / Xbar Penn ESE532 Spring 2017 -- DeHon

Pipelined Operators Often seen, will have pipelined operators E.g. 3 cycles multiply How complicate? Penn ESE532 Spring 2017 -- DeHon

Accommodating Pipeline Schedule for when data becomes available Dependencies Use of resources Cycle 1 2 3 4 5 6 7 8 9 X 1 X 2 OP + 1 AOP + 2 / Xbar Penn ESE532 Spring 2017 -- DeHon

VLIW Interconnect Tuning Can decide how rich to make the interconnect Number of outputs to support How to depopulate crossbar Use more restricted network Penn ESE532 Spring 2017 -- DeHon

Compare Compare processor and unbound-VLIW processor under preclass 3 model. 32 registers, Mux 3x1 crossbar Branch=10, LD/ST=200, ALU=10 Assume Imem, Memcomparable Penn ESE532 Spring 2017 -- DeHon

Compare Vector and VLIW under preclass 3 model LD/ST=200 (only charge 1 across vector) Charge per multiplier/adder/Multiport 16,1,2 RF (hint: 1 same as proc, 3 same – ld/st) Penn ESE532 Spring 2017 -- DeHon

Compare Vector VLIW Instructions for dot product (preclass 1) 10 instructions for 4 multiply-adds (VL=4) top: sub r1,r2,r3 // r3=MAX-i bzneq r3, end addi r4,#16,r4 addi r5,#16,r5 vld r4,v6 // a[i]…a[i+3] vld 56,v7 // b[i]…b[i+3] vmul v6,v7,v7 // a[i]*b[i] vadd v7,v8,v8 // c+=a[i]*b[i] addi r1,#4,r1 // i+=4 end: vreduce v8,r8 Penn ESE532 Spring 2017 -- DeHon

Loop Overhead Looping, incrementing pointers costs a few instructions Can be large overhead in tight loops Most/all of non-vector operations in previous example Penn ESE532 Spring 2017 -- DeHon

Loop Overhead Can handle loop overhead in ILP on VLIW Increment counters, branches as independent functional units Penn ESE532 Spring 2017 -- DeHon

VLIW Loop Overhead Can handle loop overhead in ILP on VLIW …but paying a full issue unit and instruction costs overhead Penn ESE532 Spring 2017 -- DeHon

Zero-Overhead Loops Specialize the instructions, state, branching for loops Counter rather than RF One bit to indicate if counter decrement Exit loop when decrement to 0 Penn ESE532 Spring 2017 -- DeHon

Simplification Penn ESE532 Spring 2017 -- DeHon

Zero-Overhead Loop Simplify Share port – simplify further Penn ESE532 Spring 2017 -- DeHon

Zero-Overhead Loop Example (preclass 1) repeat r3: addi r5,#4,r5; ld r4,r5 addi r4,#4,r4; ld r6,r7 add r9,r8,r8; mul r6,r7,r9 Penn ESE532 Spring 2017 -- DeHon

Zero-Overhead Loop Potentially generalize to multiple loop nests and counters Common in highly optimized DSPs, Vector units Penn ESE532 Spring 2017 -- DeHon

VLIW vs. SuperScalar Modern, high-end processors Do support ILP Issue multiple instructions per cycle …but, from a single, sequential instruction stream SuperScalar – dynamic issue and interlock on data hazards – hide # operators Must have shared, multiport RF VLIW – offline scheduled No interlocks, allow distributed RF Lower area/operator – need to recompile code Penn ESE532 Spring 2017 -- DeHon

Big Ideas: VLIW as a Model for Customize VLIW Instruction-Level Parallelism (ILP) Customizing Datapaths Area-Time Tradeoffs Customize VLIW Operator selection Memory/register file setup Inter-functional unit communication network Penn ESE532 Spring 2017 -- DeHon

Admin HW7 due Friday Individual Penn ESE532 Spring 2017 -- DeHon