ESE534: Computer Organization -- Day 22: April 16, 2014 -- Time Multiplexing (Penn ESE534, Spring 2014, DeHon)

Presentation transcript:

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 22: April 16, 2014 Time Multiplexing

Tabula, March 1, 2010 –Announced a new architecture –We would say: a w=1, c=8 architecture. Penn ESE534 Spring DeHon 2 [src:

Penn ESE534 Spring DeHon 3 Previously: –Basic pipelining –Saw how to reuse resources at maximum rate to do the same thing –Saw how to use instructions to reuse resources in time to do different things –Saw the demand for, and options to support, data retiming

Penn ESE534 Spring DeHon 4 Today Multicontext –Review why –Cost –Packing into contexts –Retiming requirements for Multicontext –Some components [concepts we saw in overview week 2-3, we can now dig deeper into details]

Penn ESE534 Spring DeHon 5 How often is reuse of the same operation applicable? In what cases can we exploit high-frequency, heavily pipelined operation? …and when can we not?

Penn ESE534 Spring DeHon 6 How often is reuse of the same operation applicable? Can we exploit the higher frequency offered?
–High throughput, feed-forward (acyclic)
–Cycles in flowgraph: abundant data-level parallelism [C-slow]; no data-level parallelism
–Low-throughput tasks: structured (e.g. datapaths) [serialize datapath]; unstructured
–Data-dependent operations: similar ops [local control -- next time]; dis-similar ops

Penn ESE534 Spring DeHon 7 Structured Datapaths: datapaths use the same pinst for all bits. Can serialize and reuse the same datapath elements in succeeding cycles; example: adder
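As a concrete illustration of serializing an adder onto a W=1 datapath (a minimal Python sketch, not from the slides): the same pinst, a full-add, is issued every cycle and only the data bits change, so one 1-bit element handles the whole word.

def full_adder(a, b, cin):                       # the single reused compute element
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def serial_add(a_bits, b_bits):
    """a_bits, b_bits: LSB-first 0/1 lists; one physical adder, one bit per cycle."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):             # one cycle per bit position, same instruction
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

bits = lambda x, w: [(x >> i) & 1 for i in range(w)]
print(serial_add(bits(6, 4), bits(3, 4)))        # ([1, 0, 0, 1], 0), i.e. 6 + 3 = 9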

Preclass 1 Recall: we looked at mismatches –width, instruction depth/task length. Sources of inefficient mapping of a W_task=4, L_task=4 task onto a W_arch=1, C=1 architecture? Penn ESE534 Spring DeHon 8

Preclass 1 How do we transform a W_task=4, L_task=4 task (path length set by the throughput requirement) to run efficiently on a W_arch=1, C=1 architecture? Impact on efficiency? Penn ESE534 Spring DeHon 9
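One hedged, back-of-envelope way to read this (my accounting, not the official preclass answer): treat the task as a 4-bit datapath operator that only needs a result every 4 cycles, and compare a naive spatial mapping against a bit-serial one.

W_task, L_task = 4, 4    # task word width; cycles available per result
W_arch, C = 1, 1         # architecture: 1-bit-wide blocks, single context

# naive spatial mapping: W_task one-bit blocks fire once, then idle L_task-1 cycles
naive_eff = W_task / (W_task * L_task)          # 4 bit-ops / 16 block-cycles = 0.25
# bit-serial mapping: one block streams all W_task bits over the L_task available cycles
serial_eff = W_task / (1 * L_task)              # 4 bit-ops / 4 block-cycles  = 1.0
print(naive_eff, serial_eff)                    # 0.25 1.0 -> serialization recovers efficiency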

Penn ESE534 Spring DeHon 10 Throughput Yield: FPGA Model -- if the throughput requirement is reduced for wide-word operations, serialization allows us to reuse the active area for the same computation

Penn ESE534 Spring DeHon 11 Throughput Yield Same graph, rotated to show backside.

Penn ESE534 Spring DeHon 12 How often is reuse of the same operation applicable? Can we exploit the higher frequency offered?
–High throughput, feed-forward (acyclic)
–Cycles in flowgraph: abundant data-level parallelism [C-slow]; no data-level parallelism
–Low-throughput tasks: structured (e.g. datapaths) [serialize datapath]; unstructured
–Data-dependent operations: similar ops [local control -- next time]; dis-similar ops

Penn ESE534 Spring DeHon 13 Remaining Cases Benefit from multicontext as well as high clock rate i.e. –cycles, no parallelism –data dependent, dissimilar operations –low throughput, irregular (can’t afford swap?)

Penn ESE534 Spring DeHon 14 Single Context/Fully Spatial: when we have –cycles and no data parallelism –low-throughput, unstructured tasks –dis-similar, data-dependent tasks: active resources sit idle most of the time –Waste of resources. Cannot reuse resources to perform a different function, only the same one

Penn ESE534 Spring DeHon 15 Resource Reuse: To use resources in these cases, we must direct them to do different things. Must be able to tell resources how to behave → separate instructions (pinsts) for each behavior

Preclass 2 How schedule onto 3 contexts? Penn ESE534 Spring DeHon 16

Preclass 2 How schedule onto 4 contexts? Penn ESE534 Spring DeHon 17

Preclass 2 How schedule onto 6 contexts? Penn ESE534 Spring DeHon 18

Penn ESE534 Spring DeHon 19 Example: Dis-similar Operations

Penn ESE534 Spring DeHon 20 Multicontext Organization/Area: A_ctxt ≈ 20KF² (dense encoding), A_base ≈ 200KF², so A_ctxt : A_base = 1:10

Preclass 3 Area: –Single context? –3 contexts? –4 contexts? –6 contexts? Penn ESE534 Spring DeHon 21

Penn ESE534 Spring DeHon 22 Multicontext Tradeoff Curves: Assume ideal packing: N_active = N_total/L. Reminder: robust point at c·A_ctxt = A_base
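A small sketch of these curves under the area model two slides back (A_ctxt ≈ 20KF², A_base ≈ 200KF²); the design size N_total and path length L below are made-up illustration values, and packing is assumed ideal.

A_ctxt, A_base = 20e3, 200e3          # F² per context memory and per active block (previous slide)
N_total, L = 1024, 64                 # hypothetical design: total LUTs and critical path length

for c in [1, 2, 4, 8, 10, 16, 32, 64]:
    n_active = max(N_total // c, N_total // L)    # ideal packing, floored at N_total/L
    area = n_active * (A_base + c * A_ctxt)
    print(f"c={c:2d}  N_active={n_active:4d}  area={area/1e6:6.1f} MF²")
# gains diminish past roughly c*A_ctxt = A_base (c ≈ 10), the "robust point"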

Penn ESE534 Spring DeHon 23 In Practice Limitations from: Scheduling Retiming

Scheduling Penn ESE534 Spring DeHon 24

Penn ESE534 Spring DeHon 25 Scheduling Limitations: N_A (active) = size of the largest stage. Precedence: can evaluate a LUT only after its predecessors have been evaluated → cannot always completely equalize stage requirements

Penn ESE534 Spring DeHon 26 Scheduling: Precedence limits packing freedom. The freedom we do have shows up as slack in the network

Penn ESE534 Spring DeHon 27 Scheduling: Computing Slack (sketched in code below)
–ASAP (As Soon As Possible) schedule: propagate depth forward from the primary inputs; depth = 1 + max input depth
–ALAP (As Late As Possible) schedule: propagate level backward from the outputs; level = 1 + max output consumption level
–Slack: slack = L + 1 - (depth + level) [PI depth = 0, PO level = 0]
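A small Python sketch of this computation on a made-up netlist (a dict mapping each LUT to its predecessor LUTs); LUTs fed only by primary inputs get depth 1 and LUTs that drive only primary outputs get level 1, matching the PI depth = 0, PO level = 0 convention above.

def slack(preds):
    """ASAP/ALAP slack for a netlist given as {LUT: [predecessor LUTs]}."""
    succs = {n: [] for n in preds}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)
    depth, level = {}, {}
    def d(n):                                  # ASAP: depth = 1 + max input depth
        if n not in depth:
            depth[n] = 1 + max((d(p) for p in preds[n]), default=0)
        return depth[n]
    def l(n):                                  # ALAP: level = 1 + max output consumption level
        if n not in level:
            level[n] = 1 + max((l(s) for s in succs[n]), default=0)
        return level[n]
    L = max(d(n) for n in preds)               # critical path length in LUTs
    return {n: L + 1 - (d(n) + l(n)) for n in preds}

# hypothetical netlist: A->C->D->E is the critical path; B feeds E directly
print(slack({'A': [], 'B': [], 'C': ['A'], 'D': ['C'], 'E': ['B', 'D']}))
# {'A': 0, 'B': 2, 'C': 0, 'D': 0, 'E': 0}  -- only B has freedom (2 slots of slack)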

Penn ESE534 Spring DeHon 28 Work Slack Example

Preclass 4 With precedence constraints, and unlimited hardware, how many contexts? Penn ESE534 Spring DeHon 29

Preclass 5 Without precedence, how many compute blocks needed to evaluate in 4 contexts? Penn ESE534 Spring DeHon 30

Preclass 6 Where can we schedule? –J –D Penn ESE534 Spring DeHon 31

Preclass 6 Where can schedule D if J in 3? Where can schedule D if J in 2? Penn ESE534 Spring DeHon 32

Preclass 6 Where can we schedule J if D is in 1? Where can we schedule J if D is in 2? Where do we schedule the operations? On which physical blocks? Penn ESE534 Spring DeHon 33

Penn ESE534 Spring DeHon 34 Reminder (Preclass 1)

Penn ESE534 Spring DeHon 35 Sequentialization: Adding time slots –more sequential (more latency) –adds slack, allows better balance. L=4 → N_A=2 (4 contexts)
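To see how extra contexts add slack and improve balance, here is a greedy list-scheduling sketch on a made-up 7-node DAG (not the preclass graph): with 3 contexts every node is critical and N_A = 3; a 4th context lets the stages be balanced down to N_A = 2.

def schedule(preds, C):
    """Greedily assign each node a context in [ASAP, ALAP], respecting precedence;
    returns N_A, the population of the fullest context."""
    succs = {n: [] for n in preds}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)
    depth, level = {}, {}
    def d(n):                                  # ASAP depth (1 + max predecessor depth)
        if n not in depth:
            depth[n] = 1 + max((d(p) for p in preds[n]), default=0)
        return depth[n]
    def l(n):                                  # distance to outputs (1 + max successor level)
        if n not in level:
            level[n] = 1 + max((l(s) for s in succs[n]), default=0)
        return level[n]
    slot, count = {}, [0] * (C + 1)
    for n in sorted(preds, key=d):             # process in ASAP (topological) order
        earliest = 1 + max((slot[p] for p in preds[n]), default=0)
        latest = C - l(n) + 1                  # ALAP slot given C contexts
        c = min(range(earliest, latest + 1), key=lambda s: count[s])
        slot[n], count[c] = c, count[c] + 1
    return max(count)

g = {'A': [], 'B': [], 'C': [], 'D': ['A'], 'E': ['B'], 'F': ['C'], 'G': ['D', 'E', 'F']}
print(schedule(g, C=3), schedule(g, C=4))      # 3 2 -> one extra context reduces N_A from 3 to 2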

Retiming Penn ESE534 Spring DeHon 36

Penn ESE534 Spring DeHon 37 Multicontext Data Retiming How do we accommodate intermediate data?

Penn ESE534 Spring DeHon 38 Signal Retiming: Single context, non-pipelined –hold the value on the LUT output (wire) from production through consumption –Wastes wire and switches by occupying them for the entire critical path delay L, not just for the 1/L-th of the cycle it takes to cross the wire segment –How does this show up in multicontext?

Penn ESE534 Spring DeHon 39 Signal Retiming Multicontext equivalent –need LUT to hold value for each intermediate context

Penn ESE534 Spring DeHon 40 ASCII → Hex Example. Single context: 18.5Mλ²

Penn ESE534 Spring DeHon 41 ASCII → Hex Example. Three contexts: 12.5Mλ²

Penn ESE534 Spring DeHon 42 ASCII → Hex Example: all retiming on wires (active outputs) –saturation based on inputs to the largest stage. Ideal = perfect scheduling spread + no retime overhead

Penn ESE534 Spring DeHon 43 Alternate Retiming Recall from last time (Day 21) –Net buffer smaller than LUT –Output retiming may have to route multiple times –Input buffer chain only need LUT every depth cycles

Penn ESE534 Spring DeHon 44 Input Buffer Retiming: can only take K unique inputs per cycle. Configuration depth differs from context to context –Cannot schedule the LUTs in slots 2 and 3 on the same physical block, since together they require 6 inputs.

Penn ESE534 Spring DeHon 45 ASCII → Hex Example (Reminder): all retiming on wires (active outputs) –saturation based on inputs to the largest stage. Ideal = perfect scheduling spread + no retime overhead

Penn ESE534 Spring DeHon 46 ASCII → Hex Example (input depth=4, c=6): 5.5Mλ² (compare 18.5Mλ²), a 3.4× reduction

Penn ESE534 Spring DeHon 47 General throughput mapping: if we only want to achieve a limited throughput, with a target of producing a new result every t cycles (see the sketch after this list):
1. Spatially pipeline every t stages (cycle = t)
2. Retime to minimize register requirements
3. Multicontext evaluation within a spatial stage; try to minimize resource usage
4. Map for depth (i) and contexts (c)
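A pseudocode skeleton of this recipe, purely for illustration; the helper functions are placeholder stubs standing in for real pipelining, retiming, and scheduling passes, not any tool's actual API.

def pipeline_into_stages(netlist, period):      # placeholder: cut the netlist into spatial stages
    return [netlist]                            # (a real pass would cut every `period` levels)

def retime_min_registers(stage):                # placeholder: retime to minimize registers
    return stage

def multicontext_schedule(stage, contexts, input_depth):   # placeholder: pack one stage into contexts
    return {"stage": stage, "contexts": contexts, "input_depth": input_depth}

def throughput_map(netlist, t, c, i):
    """1. spatially pipeline every t stages; 2. retime for minimum registers;
    3./4. multicontext-schedule each spatial stage for contexts c and depth i."""
    stages = [retime_min_registers(s) for s in pipeline_into_stages(netlist, period=t)]
    return [multicontext_schedule(s, contexts=c, input_depth=i) for s in stages]

print(throughput_map(netlist={"A": []}, t=4, c=4, i=4))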

Penn ESE534 Spring DeHon 48 Benchmark Set 23 MCNC circuits –area mapped with SIS and Chortle

Penn ESE534 Spring DeHon 49 Multicontext vs. Throughput

Penn ESE534 Spring DeHon 50 Multicontext vs. Throughput

Penn ESE534 Spring DeHon 51 General Theme: ideal benefit (e.g. Active = N/C); logical constraints (precedence); resource limits (sometimes the bottleneck); net benefit; resource balance

Beyond Area Penn ESE534 Spring DeHon 52

Only an Area win? If area were free, would we always want a fully spatial design? Penn ESE534 Spring DeHon 53

Communication Latency: communication latency across the chip can limit designs. A serial design is smaller → less latency. Penn ESE534 Spring DeHon 54

Optimal Delay for Graph App. Penn ESE534 Spring DeHon 55

Optimal Delay Phenomena Penn ESE534 Spring DeHon 56

What Minimizes Energy? (HW5) Penn ESE534 Spring DeHon 57

Multicontext Energy [DeHon, FPGA 2014]: multicontext energy normalized to FPGA, for a design of N=2^29 gates; comparison points include a processor (W=64, I=128) and an FPGA

Components Penn ESE534 Spring DeHon 59

Penn ESE534 Spring DeHon 60 DPGA (1995) [Tau et al., FPD 1995]

Xilinx Time-Multiplexed FPGA: In the mid 1990s Xilinx considered a multicontext FPGA –Based on XC4K (pre-Virtex) devices –Prototype layout at F=500nm –Required more physical interconnect than the XC4K –Concerned about power (10W at 40MHz) Penn ESE534 Spring DeHon 61 [Trimberger, FCCM 1997]

Penn ESE534 Spring DeHon 62 Xilinx Time-Multiplexed FPGA: two unnecessary expenses:
–Used output registers with separate outputs
–Based on the XC4K design, it did not densely encode the interconnect configuration: compare 8 bits to configure an input C-Box connection versus log2(8) = 3 bits to control a mux select; approximately 200b pinsts vs. 64b pinsts
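As a quick illustration of the encoding arithmetic (not tied to the actual XC4K bitstream format): selecting one of 8 C-Box input connections needs only ceil(log2(8)) configuration bits when densely encoded.

import math
choices = 8                                    # candidate input connections at one C-Box mux
encoded_bits = math.ceil(math.log2(choices))   # dense (binary) encoding of the mux select
print(encoded_bits, "bits encoded vs.", choices, "bits one-per-connection")   # 3 vs. 8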

Tabula: 8 contexts, 1.6 GHz, 40nm –64b pinsts. Our model with input retime: 1Mλ² base, 80Kλ² per 64b pinst of instruction memory per context, 40Kλ² per level of input-retime depth. 1Mλ² + 8×0.12Mλ² ≈ 2Mλ² → 4× the LUTs (ideal). Recall ASCII→Hex gave 3.4×, and similar for the throughput map. They claim 2.8× LUTs. Penn ESE534 Spring DeHon 63 [MPR/Tabula 3/29/2009]
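To check the arithmetic of this area model, here is a short Python reconstruction (units in Mλ², using the per-context costs quoted above; the final figure is the ideal LUT capacity per unit area relative to a single-context block, as I read the slide's 8×0.12 term).

base      = 1.00   # Mλ² for the base active block
pinst_mem = 0.08   # Mλ² per context: 80Kλ² for a 64b pinst of instruction memory
in_retime = 0.04   # Mλ² per context: 40Kλ² for one level of input-retime depth

contexts = 8
total = base + contexts * (pinst_mem + in_retime)          # 1 + 8*0.12 = 1.96 ≈ 2 Mλ²
print(round(total, 2), round(contexts * base / total, 1))  # 1.96, ~4.1× LUTs per area (ideal)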

Penn ESE534 Spring DeHon 64 Big Ideas [MSB Ideas] Several cases cannot profitably reuse the same logic at the device cycle rate: –cycles, no data parallelism –low throughput, unstructured –dis-similar, data-dependent computations. These cases benefit from more than one instruction/operation per active element. A_ctxt << A_active makes this interesting –save area by sharing the active element among instructions

Penn ESE534 Spring DeHon 65 Big Ideas [MSB-1 Ideas] Energy benefit for large p. Economical retiming becomes important here to achieve active-LUT reduction –one output register per LUT leads to early saturation. With c=4--8, I=4--6, automatically mapped designs are roughly 1/3 the single-context size. Most FPGAs typically run in a realm where multicontext is smaller –How many for intrinsic reasons? –How many for lack of register/CAD support?

Penn ESE534 Spring DeHon 66 Admin HW9 today Final Exercise Reading for Monday on Web